New Technology of Library and Information Service (现代图书情报技术)  2015, Vol. 31 Issue (5): 42-49     https://doi.org/10.11925/infotech.1003-3513.2015.05.06
Research Paper
A Text Feature Selection Method Based on a Weighted LDA Model and Multi-granularity
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China

Abstract

[Objective] To improve the classification of bibliographic information for books and academic journals, this paper proposes a feature selection method based on weighted Latent Dirichlet Allocation (wLDA) and multi-granularity features that takes the structural conventions of bibliographic texts into account. [Methods] Starting from the Pointwise Mutual Information (PMI) model, the method adjusts feature-word weights using factors such as part of speech and position, and incorporates these weights into the LDA generative model to extract coarse-grained features with strong expressive power. Fine-grained features are obtained with a TF-IDF-based strategy, and the combined multi-granularity features serve as the core feature set representing each bibliographic text. The texts are then classified with algorithms such as KNN and SVM. [Results] In classification experiments on self-built book and journal corpora, the method improves classification accuracy by an average of 3.60% over the LDA method and 4.79% over traditional feature selection methods. [Limitations] The size and variety of the experimental corpora need to be expanded, and more weighting strategies should be explored to further improve the classification of bibliographic texts. [Conclusions] The experimental results indicate that the method is effective and feasible: it improves how well the selected feature set represents a text and thereby raises classification accuracy.
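The two scoring ingredients named in the abstract, PMI and TF-IDF, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, and the `POSITION_BOOST` and `POS_BOOST` multipliers are all hypothetical, and the way the paper actually combines part of speech and position with PMI inside the LDA generative model is not specified on this page.

```python
import math
from collections import Counter

# Hypothetical multipliers; the paper's actual position/POS weights are not given here.
POSITION_BOOST = {"title": 1.5, "abstract": 1.0}
POS_BOOST = {"noun": 1.2, "verb": 1.0}


def pmi_scores(docs, labels):
    """PMI(term, label) = log(P(term, label) / (P(term) * P(label))),
    estimated from document-level co-occurrence counts."""
    n = len(docs)
    label_count = Counter(labels)
    term_count = Counter()
    joint_count = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_count[term] += 1
            joint_count[(term, label)] += 1
    return {
        (term, label): math.log(c * n / (term_count[term] * label_count[label]))
        for (term, label), c in joint_count.items()
    }


def adjusted_weight(pmi_value, field, pos_tag):
    """Scale a PMI score by the (assumed) position and part-of-speech multipliers."""
    return pmi_value * POSITION_BOOST.get(field, 1.0) * POS_BOOST.get(pos_tag, 1.0)


def tf_idf(docs):
    """Per-document TF-IDF with relative term frequency and idf = log(N / df)."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    result = []
    for tokens in docs:
        tf = Counter(tokens)
        result.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return result


# Toy corpus: two "cs" records and two "lib" records (invented for illustration).
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "library"],
    ["library", "catalog", "book"],
    ["library", "book", "index"],
]
labels = ["cs", "cs", "lib", "lib"]

pmi = pmi_scores(docs, labels)
tfidf = tf_idf(docs)
boosted = adjusted_weight(pmi[("topic", "cs")], "title", "noun")
```

In this sketch, a term that appears only in one class gets a positive PMI for that class, while a term spread across classes scores lower or negative; TF-IDF, by contrast, is computed per document and needs no labels, which is why it suits a fine-grained, document-level feature set.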

Keywords: Bibliographic information; Weighted Latent Dirichlet Allocation; Multi-granularity feature; Text classification; Feature selection
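Received: 2014-10-31      Published: 2015-06-11
CLC number: TP391
Corresponding author: Huang Li, ORCID: 0000-0002-3547-3831, E-mail: 709934404@qq.com
Author contributions: Li Xiangdong proposed the research idea and designed the research plan; Ba Zhichao collected, cleaned, and analyzed the data, carried out the experiments, and wrote the paper; Huang Li discussed and analyzed the feasibility of the research idea and plan.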
Cite this article:
Li Xiangdong, Ba Zhichao, Huang Li. A Text Feature Selection Method Based on a Weighted LDA Model and Multi-granularity [J]. New Technology of Library and Information Service, 2015, 31(5): 42-49.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.05.06      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I5/42

