Chinese Stopwords for Text Clustering: A Comparative Study
Guan Qin, Deng Sanhong, Wang Hao
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Lab of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This paper compares the effects of commonly used Chinese stopword lists on text processing, with the aim of improving how stopword lists are built and used. [Methods] We obtained stopword lists from the Baidu search engine, Harbin Institute of Technology, and the Machine Learning Laboratory of Sichuan University. First, we preprocessed the texts using each stopword list together with Chinese word segmentation, the TF-IDF feature evaluation function, and the vector space model (VSM). Second, we clustered the texts with the K-means algorithm and calculated the precision (P), recall (R), and F1 values. [Results] The stopword lists affected the text processing tasks differently. The length of the list and the content structure of the texts directly influenced the clustering results, and two-character stopwords were the most influential factor. [Limitations] The types and number of texts were limited. More research is needed on how different types of stopwords affect the results. [Conclusions] The stopword list has a significant impact on text clustering, so it is important to build or choose an appropriate Chinese stopword list. However, adding ever more stopwords does not always improve the clustering results.
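The pipeline described in the abstract (stopword removal, TF-IDF weighting in a vector space model, K-means clustering, then P, R, and F1) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the toy corpus is already segmented into tokens (a real run would call a Chinese word segmenter first), the two-entry stopword set is hypothetical, K-means uses cosine similarity with fixed initial centroids, and P, R, and F1 follow one common best-match scheme weighted by class size. All function names are illustrative.

```python
import math
from collections import Counter

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in stopwords]

def tfidf_vectors(docs):
    """Turn token lists into L2-normalised sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both have unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def kmeans(vectors, init_indices, iterations=10):
    """Cosine-based K-means; initial centroids are the documents at init_indices."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for k in range(len(centroids)):
            total = {}
            for v, label in zip(vectors, labels):
                if label == k:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:  # keep the old centroid if the cluster is empty
                centroids[k] = {t: w / norm for t, w in total.items()}
    return labels

def cluster_prf(true_labels, cluster_labels):
    """Class-size-weighted P, R, F1, matching each class to its best-F1 cluster."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    p_sum = r_sum = f_sum = 0.0
    for c, n_c in class_sizes.items():
        best = (0.0, 0.0, 0.0)  # (F1, P, R) of the best-matching cluster
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck:
                p, r = n_ck / n_k, n_ck / n_c
                best = max(best, (2 * p * r / (p + r), p, r))
        weight = n_c / n
        f_sum += weight * best[0]
        p_sum += weight * best[1]
        r_sum += weight * best[2]
    return p_sum, r_sum, f_sum

# Toy, pre-segmented corpus and a hypothetical two-entry stopword list.
DOCS = [
    ["足球", "比赛", "的", "球队", "了"],
    ["球队", "赢得", "了", "比赛"],
    ["股票", "市场", "的", "价格"],
    ["市场", "价格", "上涨", "了"],
]
TRUE_LABELS = ["sports", "sports", "finance", "finance"]
STOPWORDS = {"的", "了"}

filtered = [remove_stopwords(doc, STOPWORDS) for doc in DOCS]
labels = kmeans(tfidf_vectors(filtered), init_indices=[0, 2])
print(labels, cluster_prf(TRUE_LABELS, labels))
```

Swapping `STOPWORDS` for a different list and rerunning the last three lines gives one way to compare lists on the same corpus, which is the kind of comparison the study reports.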
Guan Qin, Deng Sanhong, Wang Hao. Chinese Stopwords for Text Clustering: A Comparative Study[J]. Data Analysis and Knowledge Discovery, 2017, 1(3): 72-80.