New Technology of Library and Information Service  2015, Vol. 31 Issue (1): 82-88    DOI: 10.11925/infotech.1003-3513.2015.01.12
Research and Implementation of Textual Clustering in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] To implement textual clustering and classification in a distributed environment using open-source tools. [Methods] Based on the convergence of words across massive text collections, this paper classifies texts by word clustering. The process comprises three steps: preprocessing the text with an open-source tokenizer, performing cluster analysis with Mahout, and classifying each test text by computing the similarity between the text and the word clusters. [Results] Textual clustering based on word clustering in a distributed environment effectively removes the bottleneck of word clustering over massive text collections. The word-clustering results were ideal when the training set contained more than 100 texts and the iterative convergence threshold was 0.01. [Limitations] The data are limited to the news domain; word clustering in other domains still needs further testing, optimization, and adjustment. [Conclusions] This study describes the build process and key steps of textual clustering and classification in a distributed environment to give readers an in-depth understanding.
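
The classification step described in the abstract (scoring a test text against each word cluster by similarity) might look roughly like the following minimal sketch. It is not the paper's implementation: the cosine measure, the binary cluster vectors, the function names `cosine_similarity` and `classify`, and the toy clusters are all assumptions made for illustration, and the tokenizer and Mahout clustering stages are taken as already done.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text or empty cluster matches nothing
    return dot / (norm_a * norm_b)


def classify(tokens, word_clusters):
    """Return (best_label, scores) for an already tokenized text.

    word_clusters maps a label to the set of words in that cluster.
    Each cluster is treated as a binary vector (assumption).
    """
    text_vector = Counter(tokens)  # term-frequency vector of the text
    scores = {}
    for label, words in word_clusters.items():
        cluster_vector = {word: 1.0 for word in words}
        scores[label] = cosine_similarity(text_vector, cluster_vector)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical word clusters, standing in for the clustering output.
word_clusters = {
    "sports": {"match", "team", "goal", "coach"},
    "finance": {"stock", "market", "bank", "shares"},
}

# Hypothetical tokenized test text.
tokens = ["team", "goal", "team", "market"]

label, scores = classify(tokens, word_clusters)
print(label, scores)
```

In this toy input the text shares more weighted terms with the "sports" cluster than with "finance", so it is assigned to "sports". A real pipeline would likely weight terms (for example with TF-IDF) rather than use raw counts.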

Key words: Distributed environment; Clustering; Textual clustering; Hadoop; Mahout
Received: 14 July 2014      Published: 12 February 2015
CLC Number: TP393

Cite this article:

Zhao Huaming. Research and Implementation of Textual Clustering in Distributed Environment. New Technology of Library and Information Service, 2015, 31(1): 82-88.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.01.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I1/82

