%0 Journal Article
%A Zhao Huaming
%T Research and Implementation of Textual Clustering in Distributed Environment
%D 2015
%R 10.11925/infotech.1003-3513.2015.01.12
%J Data Analysis and Knowledge Discovery
%P 82-88
%V 31
%N 1
%X <p><strong>[Objective]</strong> To implement textual clustering and classification in a distributed environment using open-source tools. <strong>[Methods]</strong> Based on the way words aggregate in large volumes of text, this paper classifies texts by word clustering. The process comprises text preprocessing with an open-source tokenizer, cluster analysis with Mahout, and classification of test texts by computing the similarity between each text and the word clusters. <strong>[Results]</strong> Textual clustering based on word clustering in a distributed environment effectively resolves the bottleneck of word clustering on massive text collections. The word-clustering results are ideal when the training set contains more than 100 texts and the iterative convergence threshold is 0.01. <strong>[Limitations]</strong> The data are limited to the news domain; word clustering in other domains requires further testing, optimization, and adjustment. <strong>[Conclusions]</strong> This study describes the build process and key steps of textual clustering and classification in a distributed environment to help readers understand them in depth.</p>
%U https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.01.12