CentralMatch: A fast and accurate method to identify blog-duplicates

Heejin Park, Sang Chul Lee, Soon Haeng Lee, Sang Wook Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Documents are called near-duplicates if they are almost identical, differing only slightly. Since near-duplicates are a major concern for Web search engines, it is necessary to identify and filter them effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the two documents differ. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% differ in the middle. Thus, blog-duplicates share a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, to identify blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precisions of MinHashing and CentralMatch are fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
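The observation that blog-duplicates share a long matched sequence in their central parts can be illustrated with a small sketch. This is not the CentralMatch algorithm, whose details the abstract does not give; the function names, the whitespace tokenization, and the `fraction` parameter are all assumptions made for illustration. The sketch takes the middle portion of the shorter document and checks whether it occurs as a contiguous run in the longer one.

```python
def central_window(tokens, fraction=0.5):
    """Return the middle `fraction` of a token list (hypothetical helper)."""
    n = len(tokens)
    width = max(1, int(n * fraction))
    start = (n - width) // 2
    return tokens[start:start + width]


def contains_run(haystack, needle):
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    m = len(needle)
    return any(haystack[i:i + m] == needle
               for i in range(len(haystack) - m + 1))


def shares_central_run(doc_a, doc_b, fraction=0.5):
    """Illustrative check only: does the central part of the shorter
    document appear unchanged in the longer one?"""
    a, b = doc_a.split(), doc_b.split()
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return False
    return contains_run(longer, central_window(shorter, fraction))


# Synthetic documents (made up for this example, not data from the paper).
original = " ".join(f"w{i}" for i in range(20))
edge_copy = "reposted from a friend " + original + " thanks for reading"
middle_copy = original.replace("w10", "changed")

print(shares_central_run(original, edge_copy))    # differs only at the ends
print(shares_central_run(original, middle_copy))  # differs in the middle
```

With these synthetic inputs, the copy that differs only at its beginning and end passes the check, while the copy edited in the middle does not. A practical method would additionally need an index to avoid scanning every document in the database, which this sketch leaves out.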

Original language: English
Title of host publication: 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
Pages: 112-119
Number of pages: 8
State: Published - 2010
Event: 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010 - Toronto, ON, Canada
Duration: 31 Aug 2010 → 3 Sep 2010

Publication series

Name: Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
Volume: 1

Conference

Conference: 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
Country/Territory: Canada
City: Toronto, ON
Period: 31/08/10 → 3/09/10

Keywords

  • Blog posts
  • Duplicate identification
  • Indexing
  • String matching
  • Web search engines
