Please wait a minute...
Advanced Search
现代图书情报技术  2015, Vol. 31 Issue (5): 42-49    DOI: 10.11925/infotech.1003-3513.2015.05.06
  研究论文 本期目录 | 过刊浏览 | 高级检索 |
一种基于加权LDA模型和多粒度的文本特征选择方法
李湘东1,2, 巴志超1, 黄莉3
1 武汉大学信息管理学院 武汉 430072;
2 武汉大学信息资源研究中心 武汉 430072;
3 武汉大学图书馆 武汉 430072
Allocation and Multi-granularity
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
全文: PDF(559 KB)   HTML  
输出: BibTeX | EndNote (RIS)      
摘要 

[目的]为改善图书和期刊书目信息的分类性能, 结合书目文本的体例结构特点, 提出一种基于加权LDA模型和多粒度的文本特征选择方法。[方法]在点互信息(PMI)模型的基础上, 结合词性、位置等要素修正特征词的权重并扩展至LDA的生成模型中, 以抽取表意性较强的粗粒度特征; 结合TF-IDF计算模型采用一定策略获取细粒度特征, 基于多粒度特征作为核心特征词集表征书目文本; 采用KNN、SVM等算法实现书目文本的分类。[结果]在自建图书、期刊材料上进行分类实验, 与LDA方法以及传统特征选择方法相比, 该方法分类准确率分别平均提高3.60%和4.79%。[局限]实验材料的数量以及丰富度有待进一步扩展; 需探索更多的加权策略模型进行实验, 以提高书目文本的分类效果。[结论]实验结果表明, 该方法是有效的、可行的, 能够提高特征选择后的特征词集对文本的表示能力, 从而提高文本分类的准确率。

服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
黄莉
李湘东
巴志超
关键词 书目信息加权LDA模型多粒度特征文本分类特征选择    
Abstract

[Objective] To improve the classification performances of bibliographic information such as books, academic journals, combining with the structure characteristics of bibliography texts, this paper proposes a new feature selection method based on weighted Latent Dirichlet Allocation (wLDA) and multi-granularity. [Methods] On the basis of Pointwise Mutual Information (PMI) model, the method improves the feature weights from the elements of location and part of speech, and extends the process of feature generated by LDA model to get more expressive words. This paper adopts a certain strategy to obtain fine-granularity combined with TF-IDF model and uses multi-granularity features as the core feature sets to represent bibliographic texts. Realize bibliographic texts classification by applying KNN and SVM algorithms. [Results] Compared with the LDA model and traditional feature selection methods, the classification performances on the classifiers of the self-built corpuses for books and journals increase by an average of 3.60% and 4.79%. [Limitations] The experimental materials need to be expanded and more weighted strategies need to be explored to improve the classification performances. [Conclusions] Experimental results show that the method is effective and feasible, and can increase the expressive ability for the feature sets after feature selection, so as to improve the classification effect of text classification.

Key wordsBibliographic information    Weighted Latent Dirichlet Allocation    Multi-granularity feature    Text classification    Feature selection
收稿日期: 2014-10-31     
:  TP391  
通讯作者: 黄莉,ORCID:0000-0002-3547-3831,E-mail:709934404@qq.com。     E-mail: 709934404@qq.com
作者简介: 作者贡献声明: 李湘东:提出研究思路,设计研究方案;巴志超:采集、清洗、分析数据,完成实验,撰写论文;黄莉:探讨、分析研究思路及方案的可行性。
引用本文:   
李湘东, 巴志超, 黄莉. 一种基于加权LDA模型和多粒度的文本特征选择方法[J]. 现代图书情报技术, 2015, 31(5): 42-49.
Li Xiangdong, Ba Zhichao, Huang Li. Allocation and Multi-granularity. New Technology of Library and Information Service, DOI:10.11925/infotech.1003-3513.2015.05.06.
链接本文:  
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.05.06

[1] Han J, Kamber M, Pei J. 数据挖掘: 概念与技术[M]. 第三版. 范明, 孟小峰译. 北京: 机械工业出版社, 2012: 211-220. (Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques [M]. The 3rd Edition. Translated by Fan Ming, Meng Xiaofeng. Beijing: China Machine Press, 2012: 211-220.)
[2] Yang Y, Liu X. A Re-examination of Text Categorization Methods [C]. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1999: 42-49.
[3] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation [J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[4] 李锋刚, 梁钰, Gao X, 等. 基于LDA-WSVM模型的文本分类研究[J]. 计算机应用研究, 2015, 32(1): 21-25. (Li Fenggang, Liang Yu, Gao X, et al. Research on Text Categorization Based on LDA-WSVM Model [J]. Application Research of Computers, 2015, 32(1): 21-25.)
[5] 胡勇军, 江嘉欣, 常会友. 基于LDA高频词扩展的中文短文本分类[J]. 现代图书情报技术, 2013(6): 42-48. (Hu Yongjun, Jiang Jiaxin, Chang Huiyou. A New Method of Keywords Extraction for Chinese Short-text Classification [J]. New Technology of Library and Information Service, 2013(6): 42-48.)
[6] 黄小亮, 郁抒思, 关佶红. 基于LDA主题模型的软件缺陷分派方法[J]. 计算机工程, 2011, 37(21): 46-48. (Huang Xiaoliang, Yu Shusi, Guan Jihong. Software Bug Triage Method Based on LDA Topic Model [J]. Computer Engineering, 2011, 37(21): 46-48.)
[7] Chen M, Jin X, Shen D. Short Text Classification Improved by Learning Multi-granularity Topics [C]. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. AAAI Press, 2011: 1776-1781.
[8] Ni X, Sun J T, Hu J, et al. Cross Lingual Text Classification by Mining Multilingual Topics from Wikipedia [C]. In: Proceedings of the 4th ACM International Conference on Web Search and Data Mining. ACM, 2011: 375-384.
[9] Bao Y, Collier N, Datta A. A Partially Supervised Cross-collection Topic Model for Cross-domain Text Classification [C]. In: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management. ACM, 2013: 239-248.
[10] Elhadad M, Gabay D, Netzer Y. Automatic Evaluation of Search Ontologies in the Entertainment Domain Using Text Classification[A]. // Applied Semantic Technologies: Using Semantics in Intelligent Information Processing [M]. Taylor & Francis, 2011: 351-367.
[11] Wilson A T, Chew P A. Term Weighting Schemes for Latent Dirichlet Allocation [C]. In: Proceedings of the 2010 Annual Conference of the North American Chapter of the Association of Computational Linguistics, Los Angeles, California, USA. 2010: 465-473.
[12] Ramage D, Heymann P, Manning C D, et al. Clustering the Tagged Web [C]. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. 2009.
[13] 李湘东, 巴志超, 黄莉. 基于加权隐含狄利克雷分配模型的新闻话题挖掘方法[J]. 计算机应用, 2014, 34(5): 1354-1359. (Li Xiangdong, Ba Zhichao, Huang Li. News Topic Mining Method Based on Weighted Latent Dirichlet Allocation Model [J]. Journal of Computer Applications, 2014, 34(5): 1354-1359.)
[14] Si X, Liu Z, Li P, et al. Content-based and Graph-based Tag Suggestion [C]. In: Proceedings of the ECML/PKDD Discovery Challenge. 2009: 243-261.
[15] Iwata T, Yamada T, Ueda N. Modeling Social Annotation Data with Content Relevance Using a Topic Model [C]. In: Proceedings of Annual Conference on Neural Information Processing Systems. 2009: 835-843.
[16] Golder S A, Huberman B A. Usage Patterns of Collaborative Tagging Systems [J]. Journal of Information Science, 2006, 32(2): 198-208.
[17] 张小平, 周学忠, 黄厚宽, 等. 一种改进的LDA主题模型[J], 北京交通大学学报, 2010, 34(2): 111-114. (Zhang Xiaoping, Zhou Xuezhong, Huang Houkuan, et al. An Improved LDA Topic Model [J]. Journal of Beijing Jiaotong University, 2010, 34(2): 111-114.)
[18] 范小丽, 刘晓霞. 文本分类中互信息特征选择方法的研究[J], 计算机工程与应用, 2010, 46(34): 123-125. (Fan Xiaoli, Liu Xiaoxia. Study on Mutual Information-based Feature Selection in Text Categorization [J]. Computer Engineering and Applications, 2010, 46(34): 123-125.)
[19] Zhu H D, Zhao X H, Zhong Y. Feature Selection Method Combined Optimized Document Frequency with Improved RBF Network [C]. In: Processing of the 5th International Conference on Advanced Data Mining and Applications, Beijing, China. 2009: 796-803.
[20] Chew P A, Bader B W, Helmreich S, et al. An Information-theoretic, Vector-Space-Model Approach to Cross-language Information Retrieval [J]. Journal of Natural Language Engineering, 2011, 17(1): 37-70.
[21] 侯汉清, 章成志, 郑红. Web概念挖掘中标引源加权方案初探 [J]. 情报学报, 2005, 24(1): 87-92. (Hou Hanqing, Zhang Chengzhi, Zheng Hong. Research on the Weighting of Indexing Sources for Web Concept Mining [J]. Journal of the China Society for Scientific and Technical Information, 2005, 24(1): 87-92.)

[1] 周成,魏红芹. 专利价值评估与分类研究*——基于自组织映射支持向量机[J]. 数据分析与知识发现, 2019, 3(5): 117-124.
[2] 梁家铭,赵洁,Jianlong Zhou,董振宁. 用户隐式行为挖掘在抗信誉共谋中的应用研究*[J]. 数据分析与知识发现, 2019, 3(5): 125-138.
[3] 余本功,陈杨楠,杨颖. 基于nBD-SVM模型的投诉短文本分类*[J]. 数据分析与知识发现, 2019, 3(5): 77-85.
[4] 温廷新,李洋子,孙静霜. 基于多因素特征选择与AFOA/K-means的新闻热点发现方法*[J]. 数据分析与知识发现, 2019, 3(4): 97-106.
[5] 谭章禄,王兆刚,胡翰. 一种基于χ2统计的特征分类选择方法研究*[J]. 数据分析与知识发现, 2019, 3(2): 72-78.
[6] 张紫玄,王昊,朱立平,邓三鸿. 中国海关HS编码风险的识别研究*[J]. 数据分析与知识发现, 2019, 3(1): 72-84.
[7] 李心蕾,王昊,刘小敏,邓三鸿. 面向微博短文本分类的文本向量化方法比较研究*[J]. 数据分析与知识发现, 2018, 2(8): 41-50.
[8] 李琳,李辉. 一种基于概念向量空间的文本相似度计算方法[J]. 数据分析与知识发现, 2018, 2(5): 48-58.
[9] 温廷新,李洋子,孙静霜. 基于改进的果蝇优化算法的文本特征选择优化模型[J]. 数据分析与知识发现, 2018, 2(5): 59-69.
[10] 刘浏,王东波. 基于论文自动分类的社科类学科跨学科性研究*[J]. 数据分析与知识发现, 2018, 2(3): 30-38.
[11] 冯国明,张晓冬,刘素辉. 基于CapsNet的中文文本分类研究*[J]. 数据分析与知识发现, 2018, 2(12): 68-76.
[12] 操玮,李灿,贺婷婷,朱卫东. 基于集成学习的中国P2P网络借贷信用风险预警模型的对比研究*[J]. 数据分析与知识发现, 2018, 2(10): 65-76.
[13] 李志鹏,李卫忠. 基于可拓小生境量子粒子群算法的特征选择*[J]. 数据分析与知识发现, 2017, 1(7): 82-89.
[14] 张越,王东波,朱丹浩. 面向食品安全突发事件汉语分词的特征选择及模型优化研究*[J]. 数据分析与知识发现, 2017, 1(2): 64-72.
[15] 李湘东,阮涛,刘康. 基于维基百科的多种类型文献自动分类研究*[J]. 数据分析与知识发现, 2017, 1(10): 43-52.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn