New Technology of Library and Information Service, 2013, No. 6: 42-48. DOI: 10.11925/infotech.1003-3513.2013.06.07
A New Method of Keywords Extraction for Chinese Short-text Classification
Hu Yongjun1, Jiang Jiaxin2, Chang Huiyou3
1. Business School, Sun Yat-Sen University, Guangzhou 510275, China;
2. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, China;
3. School of Software, Sun Yat-Sen University, Guangzhou 510006, China
Abstract: Short texts differ from traditional documents in their shortness and sparseness. Feature extension can ease the problem of high sparsity in the vector space model, but it inevitably introduces noise. To resolve this problem, this paper proposes a high-frequency word expansion method based on LDA: it extracts high-frequency words from each category as the feature space, represents each short text as a TF-IDF vector, uses LDA to derive latent topics from the corpus, and expands each text with the high-frequency words of every latent topic whose probability exceeds a threshold, thereby reducing the effects of noise and sparsity. Extensive experiments on Chinese short messages and news titles show that the proposed method achieves higher classification performance than conventional classification methods.
Key words: Short-text classification; High frequency words; LDA; Feature expansion
Received: 2013-04-05
CLC number: TP391
Funding: This work is an outcome of the National 863 Program project "Multi-source Information Sensing Technology and Product Development for the Whole Supply Chain of Agricultural Products" (Grant No. 2012AA101701-03).
Corresponding author: Hu Yongjun, E-mail: hyjsdu96@126.com
Cite this article:
Hu Yongjun, Jiang Jiaxin, Chang Huiyou. A New Method of Keywords Extraction for Chinese Short-text Classification[J]. New Technology of Library and Information Service, 2013(6): 42-48. DOI: 10.11925/infotech.1003-3513.2013.06.07.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.06.07