This paper presents a new algorithm for text feature extraction and dimension reduction based on Max Term Contribution (MTC). The main idea is to compute the contribution of each term in the high-dimensional document base and to use a search algorithm to extract the terms with the largest contributions, from which a low-dimensional document base is constructed. A modified K-means clustering method based on Simulated Annealing (SA) is then used to cluster the low-dimensional document data obtained by MTC. Experiments show that the new method improves clustering precision.
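The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the contribution measure, the search algorithm, or how K-means is modified, so everything below is an assumption made for illustration: the score is the commonly used term-contribution form (the sum of pairwise weight products of a term across documents), the "search" is a plain top-k selection, and the clustering step is simulated annealing over cluster assignments that minimizes the K-means objective. All function names and parameters (`term_contribution`, `reduce_dims`, `sa_cluster`, `t0`, `cooling`, `steps`) are hypothetical and are not taken from the paper.

```python
import math
import random


def term_contribution(docs):
    """Score each term; docs is a list of equal-length term-weight rows.

    Assumed score: sum over document pairs i != j of w_i * w_j,
    computed as (sum w)^2 - sum w^2.
    """
    scores = []
    for t in range(len(docs[0])):
        col = [d[t] for d in docs]
        total = sum(col)
        scores.append(total * total - sum(w * w for w in col))
    return scores


def reduce_dims(docs, k):
    """Keep the k highest-scoring terms (a simple top-k selection)."""
    scores = term_contribution(docs)
    ranked = sorted(range(len(scores)), key=lambda t: -scores[t])
    keep = sorted(ranked[:k])
    return [[d[t] for t in keep] for d in docs], keep


def cost(points, labels, n_clusters):
    """K-means objective: squared distance of each point to its centroid."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for p, c in zip(points, labels):
        counts[c] += 1
        for j, v in enumerate(p):
            sums[c][j] += v
    total = 0.0
    for p, c in zip(points, labels):
        total += sum((v - s / counts[c]) ** 2 for v, s in zip(p, sums[c]))
    return total


def sa_cluster(points, n_clusters, t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Simulated annealing over assignments (an assumed variant)."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_clusters) for _ in points]
    current = cost(points, labels, n_clusters)
    best, best_labels = current, labels[:]
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(points))
        old = labels[i]
        labels[i] = rng.randrange(n_clusters)  # propose moving one point
        delta = cost(points, labels, n_clusters) - current
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current += delta
            if current < best:
                best, best_labels = current, labels[:]
        else:
            labels[i] = old  # reject the move
        temp = max(temp * cooling, 1e-9)  # geometric cooling with a floor
    return best_labels, best


# Toy document base: 4 documents, 4 term weights each (illustrative data).
docs = [
    [3.0, 0.0, 0.1, 0.0],
    [2.0, 0.0, 0.0, 0.2],
    [0.0, 3.0, 0.1, 0.0],
    [0.0, 2.0, 0.0, 0.0],
]
reduced, keep = reduce_dims(docs, k=2)
labels, best = sa_cluster(reduced, n_clusters=2)
```

In this toy setup, terms that carry weight in only one document, or very little weight overall, score near zero and are dropped before clustering, which is the intuition behind reducing the dimension first. Recomputing the full objective on every proposal keeps the sketch short; a practical implementation would update centroids incrementally.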
Lu Guoli, Wang Xiaohua, Wang Rongbo. Text Clustering Research on the Max Term Contribution Dimension Reduction and Simulated Annealing Algorithm [J]. New Technology of Library and Information Service, 2008, 24(12): 43-47.