Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence

Mohamed Elhadi*, Amjad Al-Tobi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

This paper reports on experiments investigating the use of Part of Speech (POS) tagging combined with an improved Longest Common Subsequence (LCS) algorithm to analyze and calculate the similarity between texts. Each document was represented by the syntactical structure of its text. An improved LCS algorithm was applied to this representation to compare and rank the documents according to the similarity of their representative strings. The approach was applied to the detection of duplicate documents within a corpus and to the filtering of search engine results. The results obtained were encouraging.
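The general idea in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a sequence of POS tags by some tagger, and it uses the classic dynamic-programming LCS rather than the paper's improved variant, whose details are not given here. The function names, the normalization by the longer sequence, and the sample tag sequences are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (classic DP, rolling rows)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer one so the rows stay short
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer length (an assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Score every document against the query, most similar first."""
    scored = [(name, similarity(query, tags)) for name, tags in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical POS-tag sequences (Penn Treebank style), one per document.
query_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
corpus_tags = {
    "doc_a": ["DT", "NN", "VBZ", "DT", "JJ", "NN"],  # same structure
    "doc_b": ["DT", "JJ", "NN", "VBZ", "DT", "NN"],  # reordered adjective
    "doc_c": ["PRP", "VBD", "RB"],                   # unrelated structure
}

for name, score in rank_by_similarity(query_tags, corpus_tags):
    print(f"{name}: {score:.2f}")
```

Documents whose score exceeds some chosen threshold could then be flagged as likely duplicates. The threshold itself would have to be tuned on real data and is not something this record specifies.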

Original language: English
Pages (from-to): 138-147
Number of pages: 10
Journal: International Journal of Information Processing and Management
Volume: 1
Issue number: 1
DOIs:
Publication status: Published - 2010
Externally published: Yes

ASJC Scopus subject areas

  • General Computer Science
  • Information Systems and Management
