Volume 12 - Issue 2
Uyghur text clustering based on semantic word set
Abstract
TF-IDF vector space representations are high-dimensional and sparse, and they ignore the semantic relations between words. To address these problems, this study proposes a method that uses semantic word sets as features, reducing dimensionality and increasing information density. Latent semantic analysis is used to obtain the semantic relations between words, and a semantic dictionary is established by ESD. The word sets then serve as the features that represent each text, and TCSD is formed by combining this representation with a clustering algorithm to cluster the corpus. Experimental results show a precision of 94.29% and a recall of 94.28%, indicating that TCSD performs better than algorithms that use individual words as features.
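The idea of replacing word features with word-set features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary contents, the set labels, the toy English tokens, and the function name `word_set_vector` are all hypothetical, and the dictionary is written by hand here, whereas the paper derives its word groupings from latent semantic analysis. The sketch shows only how mapping words to word sets shrinks the feature space.

```python
from collections import Counter

# Hypothetical semantic dictionary: each word maps to the label of its word set.
# In the paper these groupings come from latent semantic analysis; the entries
# below are hand-written toy values used only for illustration.
SEMANTIC_DICT = {
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
    "apple": "fruit", "pear": "fruit", "grape": "fruit",
}


def word_set_vector(tokens, semantic_dict, set_ids):
    """Count how often each word set occurs in a tokenized document.

    Tokens missing from the dictionary are skipped.
    """
    counts = Counter(semantic_dict[t] for t in tokens if t in semantic_dict)
    return [counts[s] for s in set_ids]


# Toy tokenized corpus (assumed input; real Uyghur text would need tokenization).
docs = [
    ["car", "truck", "apple"],
    ["bus", "bus", "car"],
    ["pear", "grape", "apple"],
]

vocabulary = sorted(SEMANTIC_DICT)             # word-level feature space
set_ids = sorted(set(SEMANTIC_DICT.values()))  # word-set feature space
vectors = [word_set_vector(d, SEMANTIC_DICT, set_ids) for d in docs]

print(len(vocabulary), "word features reduced to", len(set_ids), "word-set features")
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Each document is now a short, dense vector over word sets instead of a long, sparse vector over individual words. Vectors of this kind could then be passed to any standard clustering algorithm; the abstract does not say which one TCSD uses.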
Paper Details
PaperID: 84875767775
Authors: Tian, S., Zhai, X., Yu, L., Guo, H.
Volume: Volume 12
Issue: Issue 2
Keywords: Latent semantic analysis, Reducing dimensions, Semantic dictionary, Uyghur
Year: 2016
Month: April
Pages: 781-790