New Technology of Library and Information Service  2015, Vol. 31 Issue (1): 82-88    DOI: 10.11925/infotech.1003-3513.2015.01.12
Research and Implementation of Textual Clustering in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
[Objective] To implement textual clustering and classification in a distributed environment using open-source tools. [Methods] Exploiting the tendency of words to converge into clusters across large volumes of text, this paper classifies texts on the basis of word clustering. The process comprises text preprocessing with an open-source tokenizer, cluster analysis with Mahout, and classification of each test text by computing the similarity between the text and the word clusters. [Results] Textual clustering based on word clustering in a distributed environment effectively resolves the bottleneck of clustering words across massive text collections. In testing, word clustering performed best when the training set contained more than 100 texts and the iterative convergence threshold was 0.01. [Limitations] The data are limited to the news domain; word clustering in other domains requires further testing, optimization, and adjustment. [Conclusions] This study describes the build process and key steps of textual clustering and classification in a distributed environment to give readers an in-depth understanding.

Key words: Distributed environment; Clustering; Textual clustering; Hadoop; Mahout
Received: 14 July 2014      Published: 12 February 2015
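
The classification step named in the abstract, computing the similarity between a test text and each word cluster, can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not state which similarity measure is used, so cosine similarity is assumed here, and the function names, cluster labels, and word weights are all hypothetical. In the paper the word clusters come from Mahout and the tokens from an open-source tokenizer; here both are supplied by hand.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word -> weight mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, word_clusters):
    """Assign a tokenized text to the most similar word cluster.

    tokens:        list of words (assumed to be already tokenized).
    word_clusters: hypothetical mapping of label -> {word: weight},
                   standing in for the output of the clustering step.
    Returns the best label and the score for every cluster.
    """
    term_freq = Counter(tokens)
    scores = {label: cosine(term_freq, weights)
              for label, weights in word_clusters.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hand-made clusters for illustration only.
word_clusters = {
    "sports": {"match": 1.0, "team": 1.0, "goal": 1.0},
    "finance": {"stock": 1.0, "market": 1.0, "bank": 1.0},
}
tokens = "the team scored a goal in the match".split()
label, scores = classify(tokens, word_clusters)
```

With these hand-made clusters the example text is assigned to "sports", since it shares three words with that cluster and none with "finance". A distributed version would compute the same scores in parallel over many texts.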
