New Technology of Library and Information Service  2014, Vol. 30 Issue (9): 66-73    DOI: 10.11925/infotech.1003-3513.2014.09.09
A Text Classification Algorithm Based on the Average Category Similarity
Tan Xueqing, Zhou Tong, Luo Lin
School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] To improve on the classification performance and speed of the KNN algorithm. [Methods] This paper proposes a classification algorithm based on average category similarity: the category of a test text is determined by computing, for each category, the mean similarity between the test text and all training texts in that category. [Results] Experiments on the Fudan, balanced Sogou and unbalanced Sogou public corpora show that, compared with the KNN classification algorithm, the proposed method increases Macro_F1 on the three corpora by 3.5%, 3.2% and 3.3% respectively, and its classification time is 1/22, 1/6 and 1/5 that of KNN respectively. [Limitations] Because of the time cost of the KNN algorithm, the number of texts in the experimental data is small. [Conclusions] Compared with KNN, the proposed method is a practical algorithm for large-scale text classification.
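The method described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the term weighting, similarity measure or feature selection used, so the sketch assumes raw term-frequency vectors, cosine similarity and whitespace tokenization, and it assumes the test text is assigned to the category with the highest average similarity. The names `cosine`, `classify` and `training_set` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(test_tokens, training_set):
    """Return (predicted label, average similarity per category).

    training_set is a list of (tokens, label) pairs. Raw term counts
    stand in for whatever weighting the paper uses (assumption).
    """
    test_vec = Counter(test_tokens)
    sims = defaultdict(list)
    for tokens, label in training_set:
        sims[label].append(cosine(test_vec, Counter(tokens)))
    # Mean similarity between the test text and ALL texts of each category.
    averages = {label: sum(v) / len(v) for label, v in sims.items()}
    # Assumed decision rule: pick the category with the highest mean.
    return max(averages, key=averages.get), averages


# Toy data, invented for illustration only.
training_set = [
    ("stock market shares fall".split(), "finance"),
    ("bank raises interest rates".split(), "finance"),
    ("team wins football match".split(), "sports"),
    ("player scores in final match".split(), "sports"),
]

label, averages = classify("football match tonight".split(), training_set)
print(label, averages)
```

Unlike KNN, this rule needs no sorting of neighbours and no choice of k: each test text requires one pass over the training set and one mean per category.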

Key words: Average category similarity; Vector Space Model (VSM); KNN; Text categorization; Feature selection
Received: 10 March 2014      Published: 20 October 2014
CLC Number: TP391

Cite this article:

Tan Xueqing, Zhou Tong, Luo Lin. A Text Classification Algorithm Based on the Average Category Similarity. New Technology of Library and Information Service, 2014, 30(9): 66-73.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.09.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I9/66

