TY - GEN
T1 - CentralMatch
T2 - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
AU - Park, Heejin
AU - Lee, Sang Chul
AU - Lee, Soon Haeng
AU - Kim, Sang Wook
PY - 2010
Y1 - 2010
AB - Documents are called near-duplicates if they are almost identical, differing only slightly. Because near-duplicates are a major concern for Web search engines, they must be identified and filtered effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts occur in the two documents. In the blog environment, however, most near-duplicates differ only at the beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% differ in the middle. Thus, blog-duplicates share a long matching sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, that identifies blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. CentralMatch also identifies blog-duplicates more accurately than MinHashing. In our experiments, when the precision of both MinHashing and CentralMatch is fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
KW - Blog posts
KW - Duplicate identification
KW - Indexing
KW - String matching
KW - Web search engines
UR - http://www.scopus.com/inward/record.url?scp=78649896831&partnerID=8YFLogxK
U2 - 10.1109/WI-IAT.2010.98
DO - 10.1109/WI-IAT.2010.98
M3 - Conference contribution
AN - SCOPUS:78649896831
SN - 9780769541914
T3 - Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
SP - 112
EP - 119
BT - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
Y2 - 31 August 2010 through 3 September 2010
ER -