现代图书情报技术 (New Technology of Library and Information Service), 2013, (6): 42-48     https://doi.org/10.11925/infotech.1003-3513.2013.06.07
知识组织与知识管理 (Knowledge Organization and Knowledge Management)
基于LDA高频词扩展的中文短文本分类
胡勇军1, 江嘉欣2, 常会友3
1. 中山大学管理学院 广州 510275;
2. 中山大学信息科学与技术学院 广州 510006;
3. 中山大学软件学院 广州 510006
A New Method of Keywords Extraction for Chinese Short-text Classification
Hu Yongjun1, Jiang Jiaxin2, Chang Huiyou3
1. Business School, Sun Yat-Sen University, Guangzhou 510275, China;
2. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou 510006, China;
3. School of Software, Sun Yat-Sen University, Guangzhou 510006, China
摘要: To address the sparse features and heavy noise of short texts, this paper proposes a method based on LDA high-frequency word expansion. The high-frequency words of each category are extracted as the feature space of a vector space model, each short text is represented as a vector with TF-IDF weighting, LDA is then used to obtain the latent topic features of each text, and the high-frequency words corresponding to any latent topic whose probability exceeds a given threshold are expanded into the text, thereby reducing the impact of noise and sparseness in short texts. Experiments show that this method achieves higher classification performance than conventional classification methods.
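The first two steps of this pipeline can be sketched as follows. This is a minimal illustration only: it assumes the short texts are already word-segmented into whitespace-separated tokens, uses scikit-learn, and takes a top_n cut-off that is a placeholder rather than the paper's reported setting.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def build_feature_space(docs, labels, top_n=500):
    """Union of the top_n most frequent words of every category."""
    vocab = set()
    for label in set(labels):
        counter = Counter()
        for doc, lab in zip(docs, labels):
            if lab == label:
                counter.update(doc.split())   # docs are pre-segmented, space-separated
        vocab.update(word for word, _ in counter.most_common(top_n))
    return sorted(vocab)

def to_tfidf(docs, vocabulary):
    """TF-IDF vectors restricted to the high-frequency word feature space."""
    vectorizer = TfidfVectorizer(vocabulary=vocabulary, tokenizer=str.split,
                                 token_pattern=None, lowercase=False)
    return vectorizer.fit_transform(docs), vectorizer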
关键词 (Keywords): 短文本分类 (Short-text classification); 高频词 (High-frequency words); LDA; 特征扩展 (Feature expansion)
Abstract: Short texts differ from traditional documents in their shortness and sparseness. Feature extension can ease the high sparsity of the vector space model, but it inevitably introduces noise. To address this problem, this paper proposes a high-frequency word expansion method based on LDA: high-frequency words are extracted from each category to form the feature space, LDA is used to derive latent topics from the corpus, and the topic words are expanded into the short texts. Extensive experiments on Chinese short messages and news titles show that the proposed method obtains higher classification performance than conventional classification methods.
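The expansion step itself can be sketched in the same spirit: train LDA on the segmented corpus, take each topic's top words as its high-frequency words, and append them to every short text whose inferred probability for that topic exceeds a threshold. scikit-learn's LatentDirichletAllocation and the parameter values below are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def expand_with_topics(docs, n_topics=50, threshold=0.2, words_per_topic=10):
    # Bag-of-words counts over the pre-segmented corpus
    vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)

    vocab = np.asarray(vectorizer.get_feature_names_out())
    # Top words of each topic, used here as that topic's expansion words
    topic_words = [vocab[np.argsort(row)[::-1][:words_per_topic]] for row in lda.components_]

    doc_topic = lda.transform(counts)         # per-document topic probabilities (rows sum to 1)
    expanded = []
    for doc, dist in zip(docs, doc_topic):
        extra = [w for k, p in enumerate(dist) if p > threshold for w in topic_words[k]]
        expanded.append(doc + " " + " ".join(extra) if extra else doc)
    return expanded

The expanded texts can then be handed to a conventional classifier trained on the TF-IDF representation sketched above.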
Key words: Short-text classification; High-frequency words; LDA; Feature expansion
收稿日期 (Received): 2013-04-05      出版日期 (Published): 2013-07-24
中图分类号 (CLC Number): TP391
基金资助 (Funding): This work is one of the outcomes of the National 863 Program project "Multi-source Information Sensing Technology and Product Development for the Whole Supply Chain of Agricultural Products" (Grant No. 2012AA101701-03).
通讯作者 (Corresponding author): 胡勇军 (Hu Yongjun)     E-mail: hyjsdu96@126.com
引用本文 (Cite this article):
胡勇军, 江嘉欣, 常会友. 基于LDA高频词扩展的中文短文本分类[J]. 现代图书情报技术, 2013, (6): 42-48.
Hu Yongjun, Jiang Jiaxin, Chang Huiyou. A New Method of Keywords Extraction for Chinese Short-text Classification. New Technology of Library and Information Service, 2013, (6): 42-48.
链接本文 (Link to this article):
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.06.07      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V/I6/42
[1] 陈杰,马静,李晓峰. 融合预训练模型文本特征的短文本分类方法*[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] 李跃艳,王昊,邓三鸿,王伟. 近十年信息检索领域的研究热点与演化趋势研究——基于SIGIR会议论文的分析[J]. 数据分析与知识发现, 2021, 5(4): 13-24.
[3] 伊惠芳,刘细文. 一种专利技术主题分析的IPC语境增强Context-LDA模型研究[J]. 数据分析与知识发现, 2021, 5(4): 25-36.
[4] 王伟, 高宁, 徐玉婷, 王洪伟. 基于LDA的众筹项目在线评论主题动态演化分析*[J]. 数据分析与知识发现, 2021, 5(10): 103-123.
[5] 唐晓波,高和璇. 基于关键词词向量特征扩展的健康问句分类研究 *[J]. 数据分析与知识发现, 2020, 4(7): 66-75.
[6] 蔡永明,刘璐,王科唯. 网络虚拟学习社区重要用户与核心主题联合分析*[J]. 数据分析与知识发现, 2020, 4(6): 69-79.
[7] 叶光辉,曾杰妍,胡婧岚,毕崇武. 城市画像视角下的社会公众情感演化研究*[J]. 数据分析与知识发现, 2020, 4(4): 15-26.
[8] 潘有能,倪秀丽. 基于Labeled-LDA模型的在线医疗专家推荐研究*[J]. 数据分析与知识发现, 2020, 4(4): 34-43.
[9] 刘玉文,王凯. 面向地域的网络话题识别方法*[J]. 数据分析与知识发现, 2020, 4(2/3): 173-181.
[10] 黄微,赵江元,闫璐. 网络热点事件话题漂移指数构建与实证研究*[J]. 数据分析与知识发现, 2020, 4(11): 92-101.
[11] 叶光辉,徐彤,毕崇武,李心悦. 基于多维度特征与LDA模型的城市旅游画像演化分析*[J]. 数据分析与知识发现, 2020, 4(11): 121-130.
[12] 王晰巍,张柳,黄博,韦雅楠. 基于LDA的微博用户主题图谱构建及实证研究*——以“埃航空难”为例[J]. 数据分析与知识发现, 2020, 4(10): 47-57.
[13] 余本功,曹雨蒙,陈杨楠,杨颖. 基于nLD-SVM-RF的短文本分类研究*[J]. 数据分析与知识发现, 2020, 4(1): 111-120.
[14] 邵云飞,刘东苏. 基于类别特征扩展的短文本分类方法研究 *[J]. 数据分析与知识发现, 2019, 3(9): 60-67.
[15] 陈果,许天祥. 基于主动学习的科技论文句子功能识别研究 *[J]. 数据分析与知识发现, 2019, 3(8): 53-61.