New Technology of Library and Information Service  2014, Vol. 30 Issue (7): 24-33    DOI: 10.11925/infotech.1003-3513.2014.07.04
An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution
Li Xiangdong1,2, He Haihong1, Cao Huan1, Huang Li3
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
[Objective] To improve the hierarchical text categorization of digital resources by correcting the skewed class distribution of training sets. [Methods] This paper proposes a new method, B-LDA, which improves text categorization by integrating granule partitions with LDA. The method first divides rare classes according to granular partition criteria, thereby transforming the granularity space of the training set. It then models the important texts with probabilistic topic models and generates new texts from the global semantic information those models represent, until the distribution across categories becomes more balanced. [Results] As the number of characters varies, the F1-value on training sets with different levels of imbalance improves by between 2.7% and 9.9%. [Limitations] Because of the limited corpus scale, the training sets constructed for the experiments cover only some imbalance conditions. In addition, the degree of overlap between the two randomly selected categories affects the classification performance of the new method. [Conclusions] The new method achieves better performance on imbalanced data sets composed of text from book bibliographies, journal titles and Web pages.
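
The rebalancing idea summarized under Methods can be illustrated with a small Python sketch. It is not the B-LDA algorithm itself: the granule partition step is omitted, and the LDA topic model is replaced here by a single unigram word distribution estimated from the rare class, which is an assumption made only to keep the example short and dependency-free. The function names `word_distribution` and `balance_by_generation` and the toy documents are hypothetical.

```python
import random
from collections import Counter


def word_distribution(docs):
    """Estimate a unigram distribution over the tokens of `docs`.

    Assumption: this one-distribution model is a simplified stand-in
    for the probabilistic topic model (LDA) named in the abstract.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    words = sorted(counts)  # sorted for a reproducible sampling order
    weights = [counts[w] / total for w in words]
    return words, weights


def balance_by_generation(majority, minority, seed=0):
    """Add synthetic minority documents until both classes are the same size."""
    if not minority:
        raise ValueError("minority class must contain at least one document")
    rng = random.Random(seed)
    words, weights = word_distribution(minority)
    # New documents get the (rounded) average length of the real rare-class documents.
    avg_len = max(1, round(sum(len(d) for d in minority) / len(minority)))
    balanced = list(minority)
    while len(balanced) < len(majority):
        balanced.append(rng.choices(words, weights=weights, k=avg_len))
    return balanced


# Toy, made-up data: 6 documents in the frequent class, 2 in the rare class.
majority = [["library", "catalog", "book"]] * 6
minority = [["topic", "model", "text"], ["text", "category", "model"]]

balanced_minority = balance_by_generation(majority, minority)
print(len(majority), len(balanced_minority))
```

In this sketch the generated documents only reuse the rare class's word frequencies. A topic-model-based generator, as described in the abstract, would instead sample words through per-document topic mixtures, so that the generated texts reflect the global semantic structure of the class.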

Key words: Skewed distribution; Granule partitions; Probabilistic topic models; Text categorization; Digital resources
Received: 09 March 2014      Published: 20 October 2014
Li Xiangdong, He Haihong, Cao Huan, Huang Li. An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution. New Technology of Library and Information Service, 2014, 30(7): 24-33.

