Multi-view LDA for semantics-based document representation
Latent Dirichlet Allocation (LDA) models each document and word as a mixture of topics, but it does not incorporate any external semantic information. In this paper, we represent documents in two feature spaces, consisting of words and Wikipedia categories respectively, and propose a new method called Multi-View LDA (M-LDA) that combines LDA with explicit human-defined concepts from Wikipedia. M-LDA improves the document topic model by exploiting both feature spaces and the mapping between them. Experimental results on classification and clustering tasks show that M-LDA outperforms traditional LDA.
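The two-view representation described above can be illustrated with a minimal sketch. It is not the M-LDA model itself, only one plausible way to derive a word view and a category view from the same document. The mapping `WORD_TO_CATEGORIES`, the helper `build_views`, and the sample tokens are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical word-to-category mapping, for illustration only.
# A real system would derive it from Wikipedia; these entries are made up.
WORD_TO_CATEGORIES = {
    "neuron": ["Neuroscience"],
    "network": ["Computer networking", "Graph theory"],
    "gradient": ["Calculus"],
}


def build_views(tokens, word_to_categories):
    """Return (word_view, category_view) count vectors for one document."""
    # View 1: plain bag of words.
    word_view = Counter(tokens)
    # View 2: bag of categories, obtained by projecting each word
    # through the mapping and carrying its count along.
    category_view = Counter()
    for word, count in word_view.items():
        for category in word_to_categories.get(word, []):
            category_view[category] += count
    return word_view, category_view


doc = "neuron network gradient network".split()
word_view, category_view = build_views(doc, WORD_TO_CATEGORIES)
```

In this sketch, words without a mapping entry simply contribute nothing to the category view. How M-LDA actually uses the mapping during inference is not specified in the abstract.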
Authors: Yun, J., Jing, L., Huang, H., Yu, J.
Volume: 7
Issue: 14
Keywords: Latent Dirichlet allocation, Semantics, Topic model, Wikipedia category