Volume 7 - Issue 5
A real environment oriented parallel duplicates removal approach for large scale Chinese WebPages
Abstract
Existing duplicate removal approaches for Chinese webpages are limited by page scale and algorithm efficiency. To address this, we propose an efficient distributed parallel duplicate elimination approach based on a Linux cluster. The approach overcomes the memory limitations and heavy computation caused by the huge data scale. It also examines the time-series problems that arise in distributed parallel computing and gives an effective solution. Experimental results on a dataset of 10 million webpages show that the proposed approach handles duplicates in massive webpage collections effectively and accurately.
Paper Details
PaperID: 79957663073
Author's Name: Guo, H., Chen, Q., Xin, C., Wang, X., Bi, Y.
Volume: Volume 7
Issues: Issue 5
Keywords: Distributed parallel computing, Duplicates elimination, Linux cluster, Speedup, Time-series
Year: 2011
Month: May
Pages: 1420 - 1427