TY - GEN
T1 - Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures
AU - Elhadi, Mohamed
AU - Al-Tobi, Amjad
PY - 2009
Y1 - 2009
AB - This paper reports on experiments investigating the use of Part of Speech (POS) tags combined with an improved Longest Common Subsequence (LCS) algorithm to analyze and calculate the similarity between texts. The syntactical structure of each text was used as the document's representation. The improved LCS algorithm was applied to this representation to compare and rank documents according to the similarity of their representative strings. The approach was applied to detecting duplicate documents within a corpus and to filtering search engine results. Results obtained were encouraging.
KW - Part-of-speech
KW - Duplication filtering
KW - Longest common subsequence
KW - Syntactical structure
UR - http://www.scopus.com/inward/record.url?scp=77749301855&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77749301855&partnerID=8YFLogxK
U2 - 10.1109/ICCIT.2009.235
DO - 10.1109/ICCIT.2009.235
M3 - Conference contribution
AN - SCOPUS:77749301855
SN - 9780769538969
T3 - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
SP - 679
EP - 684
BT - ICCIT 2009 - 4th International Conference on Computer Sciences and Convergence Information Technology
T2 - 4th International Conference on Computer Sciences and Convergence Information Technology, ICCIT 2009
Y2 - 24 November 2009 through 26 November 2009
ER -