New Technology of Library and Information Service  2013, Vol. Issue (6): 42-48    DOI: 10.11925/infotech.1003-3513.2013.06.07
A New Method of Keywords Extraction for Chinese Short-text Classification
Hu Yongjun (1), Jiang Jiaxin (2), Chang Huiyou (3)
1. Business School, Sun Yat-Sen University, Guangzhou 510275, China;
2. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, China;
3. School of Software, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  Short texts differ from traditional documents in that they are brief and sparse. Feature extension can ease the high sparsity of the vector space model, but it inevitably introduces noise. To address this problem, this paper proposes a high-frequency word expansion method based on Latent Dirichlet Allocation (LDA). The method extracts high-frequency words from each category to form the feature space, uses LDA to derive latent topics from the corpus, and then extends each short text with the topic words. Extensive experiments on Chinese short messages and news titles show that the proposed method achieves higher classification performance than conventional classification methods.
Key words: Short-text classification; High-frequency words; LDA; Feature expansion
Received: 05 April 2013      Published: 24 July 2013
CLC Number: TP391
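
The pipeline outlined in the abstract (per-category high-frequency words as the feature space, LDA topics learned from the corpus, topic words appended to each short text) might be sketched as follows. This is a minimal illustration, not the authors' implementation. Several parts are assumptions: the toy English token lists stand in for word-segmented Chinese text, the collapsed Gibbs sampler is a generic textbook one, and the expansion rule (append the top words of a document's dominant topic, restricted to the feature space) as well as all function names and parameter values are hypothetical.

```python
import random
from collections import Counter, defaultdict


def high_frequency_words(docs, labels, top_n):
    """Union of the top_n most frequent words of each category."""
    per_category = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_category[label].update(tokens)
    features = set()
    for counts in per_category.values():
        features.update(word for word, _ in counts.most_common(top_n))
    return features


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hyperparameters are assumptions).

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for tokens in docs for w in tokens})
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]

    def bump(d, w, z, delta):
        topic_word[z][w] += delta
        topic_total[z] += delta
        doc_topic[d][z] += delta

    # Random initial topic assignment for every token.
    assignments = []
    for d, tokens in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in tokens]
        for w, z in zip(tokens, zs):
            bump(d, w, z, +1)
        assignments.append(zs)

    for _ in range(n_iter):
        for d, tokens in enumerate(docs):
            for i, w in enumerate(tokens):
                # Remove the token, resample its topic, add it back.
                bump(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                bump(d, w, z, +1)
    return topic_word, doc_topic


def expand(tokens, topic_counts, topic_word, features, n_extra=2):
    """Append up to n_extra feature words from the document's dominant topic.

    Hypothetical expansion rule, chosen only for illustration.
    """
    best = max(range(len(topic_counts)), key=topic_counts.__getitem__)
    ranked = [
        word
        for word, count in topic_word[best].most_common()
        if count > 0 and word in features and word not in tokens
    ]
    return list(tokens) + ranked[:n_extra]


# Toy corpus: already tokenized, two categories.
docs = [
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "bank"],
    ["bank", "loan", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

features = high_frequency_words(docs, labels, top_n=2)
topic_word, doc_topic = fit_lda(docs, n_topics=2)
expanded = [
    expand(tokens, doc_topic[d], topic_word, features)
    for d, tokens in enumerate(docs)
]
```

The expanded token lists would then be vectorized over the high-frequency feature space and passed to any standard classifier. Restricting the added words to that feature space is one plausible way to limit the noise that feature extension introduces.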

Cite this article:

Hu Yongjun, Jiang Jiaxin, Chang Huiyou. A New Method of Keywords Extraction for Chinese Short-text Classification. New Technology of Library and Information Service, 2013, (6): 42-48.
