New Technology of Library and Information Service  2015, Vol. 31 Issue (5): 42-49    DOI: 10.11925/infotech.1003-3513.2015.05.06
A New Feature Selection Method Based on Weighted Latent Dirichlet Allocation and Multi-granularity
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To improve the classification performance for bibliographic information such as books and academic journals, this paper exploits the structural characteristics of bibliographic texts and proposes a new feature selection method based on weighted Latent Dirichlet Allocation (wLDA) and multi-granularity features. [Methods] Building on the Pointwise Mutual Information (PMI) model, the method adjusts feature weights according to word location and part of speech, and extends the feature generation process of the LDA model to obtain more expressive words. Fine-granularity features are then selected with the TF-IDF model, and the resulting multi-granularity feature set serves as the core representation of bibliographic texts, which are classified with the KNN and SVM algorithms. [Results] Compared with the standard LDA model and traditional feature selection methods, classification performance on self-built corpora of books and journals increases by an average of 3.60% and 4.79%, respectively. [Limitations] The experimental corpora need to be expanded, and additional weighting strategies should be explored to further improve classification performance. [Conclusions] Experimental results show that the method is effective and feasible: it increases the expressive power of the selected feature set and thereby improves text classification.
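To make the [Methods] pipeline concrete, the Python sketch below strings the steps together with scikit-learn: terms are up-weighted by location, LDA topic words supply the coarse-granularity features, high TF-IDF terms supply the fine-granularity features, and the combined feature set feeds KNN and SVM classifiers. It is a minimal illustration under assumptions of our own (a toy corpus and a simple title-repetition weight standing in for the paper's PMI-, location- and part-of-speech-based weighting), not the authors' implementation.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Hypothetical bibliographic records: (title, abstract text, class label).
records = [
    ("latent dirichlet allocation topic model", "topic models improve document classification", 0),
    ("support vector machines for text", "svm and knn classifiers on sparse features", 0),
    ("library catalog management", "bibliographic records in academic libraries", 1),
    ("journal subscription policies", "serials management for university libraries", 1),
]
titles = [t for t, _, _ in records]
bodies = [b for _, b, _ in records]
labels = [c for _, _, c in records]

# Location weighting (assumed): repeating title tokens makes them count more in
# the term-document matrix; the paper instead adjusts PMI-based weights by
# location and part of speech before topic inference.
TITLE_WEIGHT = 2
weighted_docs = [" ".join([t] * TITLE_WEIGHT) + " " + b for t, b in zip(titles, bodies)]

# Coarse-granularity features: the strongest words of each LDA topic.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(weighted_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)
vocab = count_vec.get_feature_names_out()
topic_words = set()
for topic in lda.components_:
    topic_words.update(vocab[topic.argsort()[-5:]])

# Fine-granularity features: terms with the highest TF-IDF weight in any document.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(weighted_docs)
tfidf_vocab = tfidf_vec.get_feature_names_out()
max_tfidf = np.asarray(X_tfidf.max(axis=0).todense()).ravel()
fine_words = set(tfidf_vocab[max_tfidf.argsort()[-10:]])

# Multi-granularity feature set = topic words plus high TF-IDF terms.
features = sorted(topic_words | fine_words)
X = TfidfVectorizer(vocabulary=features).fit_transform(weighted_docs)

# Classify with KNN and SVM, the two classifiers used in the paper's experiments.
for clf in (KNeighborsClassifier(n_neighbors=1), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, "training accuracy:", clf.score(X, labels))

The title-repetition constant TITLE_WEIGHT is only a stand-in: in the paper the weights of candidate features are adjusted through the PMI model according to location and part of speech, which would replace that step here.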

Key words: Bibliographic information; Weighted Latent Dirichlet Allocation; Multi-granularity feature; Text classification; Feature selection
Received: 31 October 2014      Published: 11 June 2015
CLC Number: TP391

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. A New Feature Selection Method Based on Weighted Latent Dirichlet Allocation and Multi-granularity. New Technology of Library and Information Service, 2015, 31(5): 42-49.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.05.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I5/42

Related articles:

[1] Chen Jie, Ma Jing, Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models [J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.
[2] Zhou Zeyu, Wang Hao, Zhao Zibo, Li Yueyan, Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information [J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 31-41.
[3] Yu Bengong, Zhu Xiaojie, Zhang Ziwei. A Capsule Network Model for Text Classification with Multi-level Feature Extraction [J]. Data Analysis and Knowledge Discovery, 2021, 5(6): 93-102.
[4] Liang Jiaming, Zhao Jie, Zheng Peng, Huang Liushen, Ye Minqi, Dong Zhenning. Framework for Computing Trust in Online Short-Rent Platform Using Feature Selection of Images and Texts [J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 129-140.
[5] Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification with Feature Fusion [J]. Data Analysis and Knowledge Discovery, 2021, 5(10): 1-14.
[6] Wang Sidi, Hu Guangwei, Yang Siyu, Shi Yun. Automatic Transferring Government Website E-Mails Based on Text Classification [J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 51-59.
[7] Xu Yuemei, Liu Yunwen, Cai Lianqiao. Predicting Retweets of Government Microblogs with Deep-combined Features [J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 18-28.
[8] Xu Tongtong, Sun Huazhi, Ma Chunmei, Jiang Lifen, Liu Yichen. Classification Model for Few-shot Texts Based on Bi-directional Long-term Attention Features [J]. Data Analysis and Knowledge Discovery, 2020, 4(10): 113-123.
[9] Yu Bengong, Cao Yumeng, Chen Yangnan, Yang Ying. Classification of Short Texts Based on nLD-SVM-RF Model [J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 111-120.
[10] Nie Weimin, Chen Yongzhou, Ma Jing. A Text Vector Representation Model Merging Multi-Granularity Information [J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 45-52.
[11] Shao Yunfei, Liu Dongsu. Classifying Short-texts with Class Feature Extension [J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 60-67.
[12] Qin Heran, Liu Liu, Li Bin, Wang Dongbo. Automatic Classification of Ancient Classics with Entity Features [J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 68-76.
[13] Chen Guo, Xu Tianxiang. Sentence Function Recognition Based on Active Learning [J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 53-61.
[14] Zhou Cheng, Wei Hongqin. Evaluating and Classifying Patent Values Based on Self-Organizing Maps and Support Vector Machine [J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 117-124.
[15] Liang Jiaming, Zhao Jie, Zhou Jianlong, Dong Zhenning. Detecting Collusive Fraudulent Online Transaction with Implicit User Behaviors [J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 125-138.