New Technology of Library and Information Service  2015, Vol. 31 Issue (1): 82-88     https://doi.org/10.11925/infotech.1003-3513.2015.01.12
Application Paper
Research and Implementation of Textual Clustering in Distributed Environment
Zhao Huaming
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] To build a platform for text clustering and classification in a distributed environment using open-source tools. [Methods] Building on the convergence of words in massive text collections, this paper uses word clustering to guide text clustering and classification. The process comprises preprocessing the training texts with an open-source tokenizer and related tools, running cluster analysis on the resulting word set with the Mahout data mining platform, and classifying each test text by computing its similarity to the word clusters. [Results] Word-clustering-based text clustering and classification in a distributed environment effectively relieves the word-clustering bottleneck for massive texts. In testing, word clustering gave good results once the training set reached 100 texts with an iterative convergence threshold of 0.01. [Limitations] The test data are limited in scale and restricted to news; word clustering in other domains needs further testing, optimization and adjustment. [Conclusions] The paper describes in detail the development environment architecture and key steps of the word-clustering-based algorithm, which should help researchers better understand the use of the relevant open-source tools and the deployment of distributed parallel environments.

Key words: Distributed environment; Clustering; Textual clustering; Hadoop; Mahout
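
The pipeline outlined in the abstract (cluster the words of a tokenized training set, then assign a test text to the most similar word cluster) can be illustrated with a minimal single-machine sketch. It is plain Python with no Hadoop or Mahout calls. The abstract does not name the clustering or similarity algorithm, so k-means and cosine similarity are assumptions here, as are the toy corpus, the farthest-first seeding, and every function name. Only the convergence threshold of 0.01 comes from the abstract.

```python
import math
from collections import Counter

CONVERGENCE_THRESHOLD = 0.01  # iterative convergence threshold quoted in the abstract


def word_vectors(docs):
    """Represent each word by its L2-normalised per-document counts."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = {}
    for word in vocab:
        counts = [doc.count(word) for doc in docs]
        norm = math.sqrt(sum(c * c for c in counts))
        vectors[word] = [c / norm for c in counts]
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_words(vectors, k, threshold=CONVERGENCE_THRESHOLD, max_iter=100):
    """Plain k-means over word vectors; stops once no centroid moves more than threshold."""
    words = sorted(vectors)
    # Deterministic farthest-first seeding (an illustrative choice, not from the paper).
    centroids = [vectors[words[0]]]
    while len(centroids) < k:
        farthest = max(words, key=lambda w: min(distance(vectors[w], c) for c in centroids))
        centroids.append(vectors[farthest])
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for w in words:
            nearest = min(range(k), key=lambda i: distance(vectors[w], centroids[i]))
            clusters[nearest].append(w)
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                dims = zip(*(vectors[w] for w in members))
                updated.append([sum(d) / len(members) for d in dims])
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        shift = max(distance(o, n) for o, n in zip(centroids, updated))
        centroids = updated
        if shift < threshold:
            break
    return clusters


def classify(tokens, clusters):
    """Cosine similarity between a text's term counts and each cluster's indicator vector."""
    vocab = {w for members in clusters for w in members}
    counts = Counter(t for t in tokens if t in vocab)
    text_norm = math.sqrt(sum(c * c for c in counts.values()))
    scores = []
    for members in clusters:
        overlap = sum(counts[w] for w in members)
        denom = text_norm * math.sqrt(len(members))
        scores.append(overlap / denom if denom else 0.0)
    best = max(range(len(clusters)), key=lambda i: scores[i])
    return best, scores


# Hypothetical, already-tokenized training texts (tokenization is assumed to happen upstream).
docs = [
    ["stock", "market", "bank", "stock"],
    ["bank", "market", "fund"],
    ["match", "team", "goal"],
    ["team", "goal", "coach", "match"],
]
clusters = kmeans_words(word_vectors(docs), k=2)
label, scores = classify(["team", "coach", "goal", "bank"], clusters)
print(clusters)
print(label, scores)
```

On this toy corpus the words separate into a finance-like cluster and a sports-like cluster, and the test text is assigned to the sports-like one. In a distributed setting the clustering step would be delegated to the platform, while the final similarity step could stay this simple.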
Received: 2014-07-14      Published: 2015-02-12
CLC Number: TP393
Corresponding author: Zhao Huaming, ORCID: 0000-0002-8829-9208, E-mail: zhaohm@mail.las.ac.cn
Cite this article:
Zhao Huaming. Research and Implementation of Textual Clustering in Distributed Environment. New Technology of Library and Information Service, 2015, 31(1): 82-88.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.01.12      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I1/82


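Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn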