New Technology of Library and Information Service  2014, Vol. 30 Issue (7): 24-33    DOI: 10.11925/infotech.1003-3513.2014.07.04
An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution
Li Xiangdong1,2, He Haihong1, Cao Huan1, Huang Li3
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To improve hierarchical text categorization of digital resources by correcting the skewed class distribution of training sets. [Methods] This paper proposes a new method, B-LDA, that integrates granule partitions with LDA. The method first divides rare classes according to granular partition criteria, thereby transforming the granularity space of the training set. It then models the important texts with probabilistic topic models and generates new texts from the global semantic information those models represent, until the distribution across categories becomes more balanced. [Results] As the number of characters varies, the F1 value on training sets with different levels of imbalance improves by between 2.7% and 9.9%. [Limitations] Because of the limited corpus scale, the training sets constructed for the experiments cover only some imbalance conditions. In addition, the degree of overlap between the two randomly selected categories affects the classification performance of the new method. [Conclusions] The new method achieves better performance on imbalanced data sets composed of text from book bibliographies, journal titles, and Web pages.
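
The generation step described in the abstract (producing new rare-class texts from a topic model until the classes are more balanced) can be illustrated with a minimal Python sketch. This is not the authors' B-LDA implementation: the granule-partition step is omitted because the abstract does not specify its criteria, the topic distributions `doc_topic` and `topic_word` are hand-written toy values rather than parameters fitted by LDA, and the function names `sample_document` and `oversample_rare_class` are hypothetical.

```python
import random


def sample_document(doc_topic, topic_word, length, rng):
    """Draw one synthetic document using an LDA-style generative process."""
    topics = list(doc_topic)
    topic_weights = [doc_topic[t] for t in topics]
    words = []
    for _ in range(length):
        # Choose a topic for this word position, then a word from that topic.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        vocab = list(topic_word[topic])
        word_weights = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


def oversample_rare_class(docs_by_class, rare_label, doc_topic, topic_word,
                          length=8, seed=0):
    """Add synthetic rare-class documents until it matches the largest class."""
    rng = random.Random(seed)
    target = max(len(docs) for docs in docs_by_class.values())
    # Copy the lists so the caller's training set is left unchanged.
    balanced = {label: list(docs) for label, docs in docs_by_class.items()}
    while len(balanced[rare_label]) < target:
        balanced[rare_label].append(
            sample_document(doc_topic, topic_word, length, rng))
    return balanced


# Toy parameters (assumed, not estimated from any corpus).
doc_topic = {"t0": 0.7, "t1": 0.3}
topic_word = {
    "t0": {"library": 0.5, "catalog": 0.3, "archive": 0.2},
    "t1": {"index": 0.6, "record": 0.4},
}
training_set = {
    "common": [["news", "report"], ["sports", "match"], ["market", "price"]],
    "rare": [["library", "catalog"]],
}
balanced = oversample_rare_class(training_set, "rare", doc_topic, topic_word)
print({label: len(docs) for label, docs in balanced.items()})
```

In practice the two distributions would come from a topic model trained on the rare class's texts, and the stopping condition could target a chosen imbalance ratio rather than full parity with the largest class.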

Key words: Skewed distribution; Granule partitions; Probabilistic topic models; Text categorization; Digital resources
Received: 09 March 2014      Published: 20 October 2014
:  TP391  

Cite this article:

Li Xiangdong, He Haihong, Cao Huan, Huang Li. An Algorithm of Digital Resources Text Categorization for Training Sets Skewed Distribution. New Technology of Library and Information Service, 2014, 30(7): 24-33.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.07.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I7/24

