New Technology of Library and Information Service  2008, Vol. 24 Issue (12): 73-79    DOI: 10.11925/infotech.1003-3513.2008.12.13
Algorithm and Experiment Research of Textual Document Clustering Based on Improved K-means
Cen Yonghua 1,2, Wang Xiaorong, Ji Yonghui 1
1 (Department of Information Management, Nanjing University, Nanjing 210093, China)
2 (Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China)

After a concise introduction to the connotation, functions, and general process of textual document clustering, this paper expounds the basic mechanism of an improved K-means clustering method that selects its initial centroids by the minimum-maximum principle, designs the algorithm, and implements a clustering system. Several experiments on 300 academic articles and their related characteristic words demonstrate the good performance of the proposed algorithm.
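
The abstract does not spell out the initialization procedure, so the following is only a minimal sketch of one common reading of the minimum-maximum (max-min, or farthest-first) principle: each new initial centroid is the point whose distance to its nearest already-chosen centroid is largest. Seeding with the first document, using Euclidean distance, and the names `max_min_centroids`, `kmeans`, and `docs` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centroids(points, k):
    """Pick k initial centroids by the max-min (farthest-first) rule.

    Assumption: the first centroid is simply points[0]; the paper
    may choose its seed differently.
    """
    centroids = [points[0]]
    # min_dist[i] = distance from points[i] to its nearest chosen centroid
    min_dist = [distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        # Take the point whose nearest centroid is farthest away.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        centroids.append(points[idx])
        min_dist = [min(d, distance(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centroids


def kmeans(points, k, max_iter=100):
    """Standard K-means iterations started from max-min centroids."""
    centroids = [list(c) for c in max_min_centroids(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Toy term-weight vectors standing in for documents (hypothetical data).
docs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
labels, centroids = kmeans(docs, 3)
print(labels)
```

Spreading the initial centroids apart in this way is meant to reduce the sensitivity of K-means to a poor random start. A real document clustering system would typically build the vectors from weighted characteristic words and might prefer cosine similarity over Euclidean distance.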

Key words: Textual document clustering; K-means
Received: 18 August 2008      Published: 25 December 2008


Corresponding Author: Cen Yonghua
About authors: Cen Yonghua, Wang Xiaorong, Ji Yonghui

Cite this article:

Cen Yonghua, Wang Xiaorong, Ji Yonghui. Algorithm and Experiment Research of Textual Document Clustering Based on Improved K-means. New Technology of Library and Information Service, 2008, 24(12): 73-79.


