%A Zhao Huaming
%T Research and Implementation of Textual Clustering in Distributed Environment
%0 Journal Article
%D 2015
%J Data Analysis and Knowledge Discovery
%R 10.11925/infotech.1003-3513.2015.01.12
%P 82-88
%V 31
%N 1
%U {https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/article_4004.shtml}
%8 2015-01-25
%X

[Objective] To implement text clustering and classification in a distributed environment using open-source tools. [Methods] Exploiting the tendency of words to converge into clusters across large volumes of text, this paper classifies texts on the basis of word clustering: texts are preprocessed with an open-source tokenizer, cluster analysis is performed with Mahout, and each test text is classified by computing its similarity to the word clusters. [Results] Text clustering based on word clustering in a distributed environment effectively resolves the bottleneck of word clustering on massive text collections. The word-clustering test results are ideal when the training set contains more than 100 texts and the iterative convergence threshold is 0.01. [Limitations] The data are limited to the news domain; word clustering in other domains still requires further testing, optimization and adjustment. [Conclusions] This study describes the build process and key steps of text clustering and classification in a distributed environment to give readers an in-depth understanding.
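
The classification step named in the Methods (assigning a test text to the word cluster it most resembles) can be illustrated with a minimal single-machine sketch. The abstract does not state which similarity measure the paper uses, so cosine similarity over raw term counts is an assumption here, and the function names, cluster labels and example words are hypothetical. The distributed Mahout clustering stage itself is not reproduced.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, word_clusters):
    """Return the best-matching cluster label and all similarity scores.

    tokens: the tokenized test text (output of any tokenizer).
    word_clusters: mapping of label -> set of words in that cluster
                   (assumed to come from an earlier clustering run).
    """
    text_vec = Counter(tokens)
    scores = {
        label: cosine(text_vec, {w: 1.0 for w in words})
        for label, words in word_clusters.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical word clusters, for illustration only.
word_clusters = {
    "sports": {"match", "team", "goal", "coach"},
    "finance": {"stock", "market", "bank", "rate"},
}

tokens = ["team", "coach", "goal", "market"]
label, scores = classify(tokens, word_clusters)
print(label, scores)
```

In this sketch every word in a cluster carries equal weight; a weighting such as TF-IDF could be substituted without changing the overall shape of the step.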