Optimize document identifier assignment for inverted index compression
Document identifier assignment is a technique for compressing inverted file indexes by reducing the d-gap values of posting lists. Existing studies have approached it with either traveling salesman problem (TSP) or clustering methods. However, the problem has lacked a proper formulation, and the existing approaches have no theoretical guarantee of being good approximations. In this paper, we first formulate the document identifier assignment problem as an optimization problem and then propose a new method to solve it approximately. Our method first clusters the documents by URL information and then rearranges the documents and clusters using a benefit function, which is derived by directly minimizing posting space. The TSP method can be considered a simple special case of our method. Experiments show that our method achieves a good trade-off between efficiency and effectiveness.
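To illustrate why the identifier order matters, here is a minimal sketch of how d-gaps are computed and how reassigning identifiers can shrink the posting space. It is not the paper's method or its benefit function. The Elias gamma bit count is an assumed stand-in for the posting space cost, and the names `d_gaps`, `gamma_bits`, `index_cost`, and the toy collection are hypothetical.

```python
def d_gaps(postings):
    """Convert a sorted posting list of document IDs into d-gaps."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(gap):
    """Bits needed to code a positive integer with Elias gamma (assumed cost model)."""
    return 2 * (gap.bit_length() - 1) + 1


def index_cost(docs, assignment):
    """Total bits for all posting lists under a given ID assignment.

    docs maps a document name to its set of terms;
    assignment maps a document name to a 1-based document ID.
    """
    postings = {}
    for name, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(assignment[name])
    return sum(
        gamma_bits(gap)
        for ids in postings.values()
        for gap in d_gaps(sorted(ids))
    )


# Toy collection: A and C share terms, as do B and D.
docs = {
    "A": {"x", "y"},
    "B": {"z", "w"},
    "C": {"x", "y"},
    "D": {"z", "w"},
}

# Similar documents far apart versus similar documents adjacent.
interleaved = {"A": 1, "B": 2, "C": 3, "D": 4}
clustered = {"A": 1, "C": 2, "B": 3, "D": 4}

print(index_cost(docs, interleaved))  # 20
print(index_cost(docs, clustered))    # 12
```

Under this assumed cost model, giving consecutive identifiers to documents that share terms yields smaller gaps and fewer bits, which is the effect that clustering and reordering aim for.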
Authors: Chen, C., He, J., Shan, D., Yan, H.
Volume: Volume 6
Issues: Issue 2
Keywords: Cluster, Document identifier, Inverted index compression, Optimization