Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence

Mohamed Elhadi; Amjad Al-Tobi

doi:10.4156/ijipm.vol1.issue1.16

Mohamed Elhadi^*, Amjad Al-Tobi

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for the documents. An improved LCS algorithm was applied to such a representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied in the detection of duplicate documents within a corpus, and in the filtering of search engine results. Obtained results were encouraging.
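The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a sequence of POS tags, and it uses the textbook dynamic-programming LCS, not the improved variant from the paper, whose details are not given in this record. The tag sequences, document names, function names, and the normalisation by the longer sequence are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    # Only the previous row is kept, so memory is O(len(b)).
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (an assumed normalisation)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_by_similarity(
    query: Sequence[str], corpus: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [(doc_id, lcs_similarity(query, tags)) for doc_id, tags in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written POS tag sequences (Penn Treebank style labels).
query_tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
corpus_tags = {
    "doc_a": ["DT", "NN", "VBZ", "DT", "JJ", "NN"],  # same structure as the query
    "doc_b": ["DT", "JJ", "NN", "VBZ", "DT", "NN"],  # modifier moved
    "doc_c": ["PRP", "VBD", "RB"],                   # no tags in common
}

ranking = rank_by_similarity(query_tags, corpus_tags)
```

In a duplicate-detection setting, documents whose score exceeds some chosen threshold would be flagged as likely duplicates; the threshold itself would have to be tuned on real data.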

Original language	English
Pages (from-to)	138-147
Number of pages	10
Journal	International Journal of Information Processing and Management
Volume	1
Issue number	1
DOI	https://doi.org/10.4156/ijipm.vol1.issue1.16
Publication status	Published - 2010
Externally published	Yes

ASJC Scopus subject areas

General Computer Science (ASJC 1700)
Information Systems and Management (ASJC 1802)

Access to Document

10.4156/ijipm.vol1.issue1.16

Cite this

Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence. / Elhadi, Mohamed; Al-Tobi, Amjad.
In: International Journal of Information Processing and Management, Vol. 1, No. 1, 2010, p. 138-147.

@article{31d0b33cf3d44030916ee49afbbf9db4,

title = "Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence",

abstract = "This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for the documents. An improved LCS algorithm was applied to such a representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied in the detection of duplicate documents within a corpus, and in the filtering of search engine results. Obtained results were encouraging.",

keywords = "Duplication filtering, Longest common subsequence, POS, Syntactical structure",

author = "Mohamed Elhadi and Amjad Al-Tobi",

year = "2010",

doi = "10.4156/ijipm.vol1.issue1.16",

language = "English",

volume = "1",

pages = "138--147",

journal = "International Journal of Information Processing and Management",

issn = "2093-4009",

publisher = "Advanced Institute of Convergence Information Technology Research Center",

number = "1",

}

TY - JOUR

T1 - Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence

AU - Elhadi, Mohamed

AU - Al-Tobi, Amjad

PY - 2010

Y1 - 2010

N2 - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for the documents. An improved LCS algorithm was applied to such a representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied in the detection of duplicate documents within a corpus, and in the filtering of search engine results. Obtained results were encouraging.

AB - This paper reports on experiments performed to investigate the use of a combined Part of Speech (POS) and an improved Longest Common Subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used as a representation for the documents. An improved LCS algorithm was applied to such a representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied in the detection of duplicate documents within a corpus, and in the filtering of search engine results. Obtained results were encouraging.

KW - Duplication filtering

KW - Longest common subsequence

KW - POS

KW - Syntactical structure

UR - http://www.scopus.com/inward/record.url?scp=80053957867&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053957867&partnerID=8YFLogxK

U2 - 10.4156/ijipm.vol1.issue1.16

DO - 10.4156/ijipm.vol1.issue1.16

M3 - Article

AN - SCOPUS:80053957867

SN - 2093-4009

VL - 1

SP - 138

EP - 147

JO - International Journal of Information Processing and Management

JF - International Journal of Information Processing and Management

IS - 1

ER -
