TY - JOUR
T1 - Detection of duplication in documents and webpages based documents syntactical structures through an improved longest common subsequence
AU - Elhadi, Mohamed
AU - Al-Tobi, Amjad
PY - 2010
Y1 - 2010
N2 - This paper reports on experiments performed to investigate the use of a combination of Part of Speech (POS) tagging and an improved Longest Common Subsequence (LCS) algorithm in the analysis and calculation of similarity between texts. The texts' syntactical structures were used as a representation for the documents. An improved LCS algorithm was applied to this representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied to the detection of duplicate documents within a corpus and to the filtering of search engine results. The results obtained were encouraging.
KW - Duplication filtering
KW - Longest common subsequence
KW - POS
KW - Syntactical structure
UR - http://www.scopus.com/inward/record.url?scp=80053957867&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053957867&partnerID=8YFLogxK
U2 - 10.4156/ijipm.vol1.issue1.16
DO - 10.4156/ijipm.vol1.issue1.16
M3 - Article
AN - SCOPUS:80053957867
SN - 2093-4009
VL - 1
SP - 138
EP - 147
JO - International Journal of Information Processing and Management
JF - International Journal of Information Processing and Management
IS - 1
ER -