Learning-based web data cleansing for information retrieval – Journal of Computational Information Systems

Volume 2 - Issue 4

Abstract

With the rapid growth of web information, query-independent selection of high-quality web pages that cover valuable information has become increasingly important in web Information Retrieval (IR) research. Based on query-independent feature analysis, a data cleansing algorithm is proposed that selects an important type of high-quality page (key resources) on the web. A study of the cleansed page set shows that it contains only 44.3% of the pages in the whole collection, yet accounts for more than 98% of the hyperlinks and covers about 90% of the key information. Experiments on TREC 2003 data show that the cleansed collection, at less than half the size of the whole collection, achieves an 8% improvement in retrieval performance.

Paper Details

PaperID: 33748136924

Authors: Liu, Y., Zhang, M., Wang, C., Ma, S.

Volume: Volume 2

Issues: Issue 4

Keywords: Data cleansing, Query-independent features, Web information retrieval

Year: 2005

Month: December

Pages: 709 - 716