A New Method of Keywords Extraction for Chinese Short-text Classification

Hu Yongjun1, Jiang Jiaxin2, Chang Huiyou3

1. Business School, Sun Yat-Sen University, Guangzhou 510275, China; 2. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, China; 3. School of Software, Sun Yat-Sen University, Guangzhou 510006, China

Abstract Short texts differ from traditional documents in their shortness and sparseness. Feature extension can ease the high-sparsity problem of the vector space model, but it inevitably introduces noise. To address this problem, this paper proposes a high-frequency-word expansion method based on LDA (Latent Dirichlet Allocation). The method extracts high-frequency words from each category to form the feature space, uses LDA to derive latent topics from the corpus, and then expands each short text with topic words. Extensive experiments on Chinese short messages and news titles show that the proposed method achieves higher classification performance than conventional classification methods.
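The pipeline described in the abstract (per-category high-frequency feature extraction, followed by topic-word expansion of each short text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LDA step is replaced by a hand-written `topic_words` mapping that stands in for trained topic-word associations, and the toy corpus, function names, and `top_n` parameter are all hypothetical.

```python
from collections import Counter

def high_freq_features(docs_by_category, top_n=3):
    """Collect the top-n most frequent words of each category as the feature space."""
    features = set()
    for docs in docs_by_category.values():
        counts = Counter(w for d in docs for w in d.split())
        features.update(w for w, _ in counts.most_common(top_n))
    return features

def expand_short_text(text, topic_words, features):
    """Append topic words associated with the text's tokens, restricted to the feature space."""
    tokens = text.split()
    expansion = []
    for w in tokens:
        for t in topic_words.get(w, []):
            if t in features and t not in tokens:
                expansion.append(t)
    return tokens + expansion

# Toy corpus (hypothetical): two categories of already-tokenized short texts.
docs_by_category = {
    "sports": ["match score goal", "team match win", "goal score team"],
    "finance": ["stock market price", "market price fund", "stock fund price"],
}
features = high_freq_features(docs_by_category)

# Stand-in for LDA output: words mapped to their co-topical words.
# In practice these associations come from a topic model trained on the corpus.
topic_words = {"match": ["team", "goal"], "stock": ["market", "price"]}

print(expand_short_text("match tonight", topic_words, features))
```

Restricting the expansion to the per-category high-frequency feature space is what keeps the topic-word expansion from flooding the short text with noisy terms, which is the abstract's stated motivation for combining the two steps.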
|
Received: 05 April 2013
Published: 24 July 2013
|