New Technology of Library and Information Service  2015, Vol. 31 Issue (5): 42-49    DOI: 10.11925/infotech.1003-3513.2015.05.06
Allocation and Multi-granularity
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To improve classification performance for bibliographic information such as books and academic journals, this paper proposes a feature selection method based on weighted Latent Dirichlet Allocation (wLDA) and multi-granularity features that takes the structural characteristics of bibliographic texts into account. [Methods] Building on the Pointwise Mutual Information (PMI) model, the method adjusts feature weights according to word location and part of speech, and extends the feature generation process of the LDA model to obtain more expressive words. Fine-granularity features are then obtained by a strategy that incorporates the TF-IDF model, and the resulting multi-granularity features serve as the core feature set for representing bibliographic texts. The texts are classified with the KNN and SVM algorithms. [Results] Compared with the LDA model and traditional feature selection methods, classification performance on the self-built corpora of books and journals increases by an average of 3.60% and 4.79%. [Limitations] The experimental corpora need to be expanded, and more weighting strategies need to be explored to further improve classification performance. [Conclusions] The experimental results show that the method is effective and feasible. It makes the selected feature set more expressive and thereby improves text classification.
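
The PMI step mentioned under [Methods] can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so the code below is only the textbook document-level PMI between a term and a category, PMI(t, c) = log(P(t, c) / (P(t) P(c))). The function name `pmi_scores` and the toy corpus are hypothetical, and the paper's location and part-of-speech weighting is not reproduced here.

```python
import math
from collections import Counter


def pmi_scores(docs):
    """Return {(term, label): PMI} from document-level co-occurrence counts.

    docs is a list of (label, tokens) pairs. This is the standard PMI
    definition, not the weighted variant proposed in the paper.
    """
    n_docs = len(docs)
    label_count = Counter()
    term_count = Counter()
    joint_count = Counter()
    for label, tokens in docs:
        label_count[label] += 1
        # Count each term at most once per document.
        for term in set(tokens):
            term_count[term] += 1
            joint_count[(term, label)] += 1

    scores = {}
    for (term, label), n_tc in joint_count.items():
        p_tc = n_tc / n_docs
        p_t = term_count[term] / n_docs
        p_c = label_count[label] / n_docs
        scores[(term, label)] = math.log(p_tc / (p_t * p_c))
    return scores


# Hypothetical toy corpus: two categories, two records each.
docs = [
    ("cs", ["topic", "model", "text"]),
    ("cs", ["text", "classification"]),
    ("history", ["dynasty", "text"]),
    ("history", ["dynasty", "archive"]),
]
scores = pmi_scores(docs)
for (term, label), value in sorted(scores.items()):
    print(f"{term:15s} {label:8s} {value:+.3f}")
```

Terms that occur only in one category (such as "dynasty" here) receive a positive score for that category, while a term spread across categories (such as "text") scores close to zero or below. A feature selector in this spirit would keep the highest-scoring terms per category.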

Key words: Bibliographic information; Weighted Latent Dirichlet Allocation; Multi-granularity feature; Text classification; Feature selection
Received: 31 October 2014      Published: 11 June 2015
CLC Number: TP391

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. Allocation and Multi-granularity. New Technology of Library and Information Service, 2015, 31(5): 42-49.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.05.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I5/42

