Chinese Stopwords for Text Clustering: A Comparative Study
Guan Qin, Deng Sanhong, Wang Hao
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Lab of Data Engineering and Knowledge Service, Nanjing 210023, China

Abstract [Objective] This paper compares and analyzes the impact of stopwords on text data processing, aiming to improve the construction and use of stopword lists. [Methods] We obtained stopword lists from the Baidu search engine, Harbin Institute of Technology, and the Machine Learning Laboratory of Sichuan University. First, we processed the texts using the stopword lists, Chinese word segmentation, the TF-IDF feature evaluation function, and the vector space model (VSM). Second, we clustered the texts with the K-means algorithm and calculated the precision (P), recall (R), and F1 values. [Results] Different stopword lists affected the text processing tasks differently. The length of the list and the content structure of the texts directly influenced the clustering results. Most importantly, two-character stopwords were the biggest factor. [Limitations] The types and quantity of texts were limited. More research is needed to analyze texts with different types of stopwords. [Conclusions] The stopword list has a significant impact on text clustering, so it is extremely important to build or choose an appropriate Chinese stopword list. However, excessively increasing the number of stopwords might not always improve the clustering results.
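The abstract names the building blocks of the experiment without giving details, so the following is only a minimal sketch of how such a pipeline might fit together, not the authors' implementation. It assumes the texts are already segmented into words, uses a tiny hypothetical stopword list and two made-up documents, applies a smoothed IDF variant (an assumption, chosen so that terms shared by all documents keep a non-zero weight), and scores a clustering with pairwise precision, recall, and F1, which is one common convention that the abstract does not confirm. The K-means step itself is omitted; the cluster labels in the evaluation example are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in stopwords]


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector (VSM).

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that a term
    occurring in every document still gets a non-zero weight.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_prf1(true_labels, cluster_labels):
    """Pairwise P, R, F1: a document pair counts as a true positive when
    it shares both a cluster and a true class (assumed convention)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical pre-segmented documents on two different topics.
sports_doc = ["我们", "的", "球队", "赢得", "了", "比赛"]
finance_doc = ["我们", "的", "股票", "上涨", "了"]
stopwords = {"我们", "的", "了"}  # illustrative list, not one from the study

raw = tfidf_vectors([sports_doc, finance_doc])
filtered = tfidf_vectors([
    remove_stopwords(sports_doc, stopwords),
    remove_stopwords(finance_doc, stopwords),
])
sim_raw = cosine(raw[0], raw[1])
sim_filtered = cosine(filtered[0], filtered[1])
print("similarity with stopwords:", sim_raw)
print("similarity without stopwords:", sim_filtered)

# Invented labels: four documents, two true classes, one misplaced document.
p, r, f1 = pairwise_prf1(["sports", "sports", "finance", "finance"],
                         [0, 0, 0, 1])
print("P, R, F1:", p, r, f1)
```

In this toy case the two documents look similar only because of shared function words, and the similarity disappears once those words are filtered out. This illustrates how the choice of stopword list can change what a distance-based algorithm such as K-means sees.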
Received: 05 December 2016
Published: 25 March 2006