Learning-based web data cleansing for information retrieval
With the rapid growth of web information, selecting high-quality web pages that cover valuable information in a query-independent way has become increasingly important in web Information Retrieval (IR) research. Based on query-independent feature analysis, a data cleansing algorithm is proposed that selects an important type of high-quality page (key resources) on the web. A study of the cleansed page set shows that it contains only 44.3% of the pages in the whole collection, yet involves more than 98% of the hyperlinks and covers about 90% of the key information. Experiments on TREC 2003 data show that the cleansed collection, at less than half the size of the whole collection, improves retrieval performance by 8%.
Authors: Liu, Y., Zhang, M., Wang, C., Ma, S.
Volume: 2
Issue: 4
Keywords: Data cleansing, Query-independent features, Web information retrieval
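
The abstract describes cleansing as keeping only pages that look like key resources according to query-independent features. The following is a minimal sketch of that idea, not the authors' algorithm: the feature set (in-link count, out-link count, URL depth), the `Page` type, and the hand-set thresholds are all hypothetical, and in a learning-based approach the decision rule would be learned from labeled pages rather than fixed by hand.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    """A crawled page with a few query-independent features (hypothetical set)."""
    url: str
    in_links: int   # hyperlinks pointing to this page
    out_links: int  # hyperlinks leaving this page


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL ('/' has depth 0)."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def looks_like_key_resource(
    page: Page,
    min_in_links: int = 5,    # assumed threshold, not from the paper
    min_out_links: int = 10,  # assumed threshold, not from the paper
    max_url_depth: int = 2,   # assumed threshold, not from the paper
) -> bool:
    """Hand-written stand-in for a learned, query-independent page classifier."""
    return (
        page.in_links >= min_in_links
        and page.out_links >= min_out_links
        and url_depth(page.url) <= max_url_depth
    )


def cleanse(pages: list[Page]) -> list[Page]:
    """Keep only the pages that pass the query-independent filter."""
    return [p for p in pages if looks_like_key_resource(p)]


if __name__ == "__main__":
    sample = [
        Page("http://example.org/", in_links=120, out_links=40),
        Page("http://example.org/a/b/c/d/page.html", in_links=1, out_links=2),
        Page("http://example.org/topics/", in_links=8, out_links=25),
    ]
    kept = cleanse(sample)
    print(f"kept {len(kept)} of {len(sample)} pages")
    for p in kept:
        print(p.url)
```

Because the filter uses no query terms, it can be run once over the collection before indexing, which is what lets a smaller cleansed collection serve every query.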