New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 68-75    DOI: 10.11925/infotech.1003-3513.2011.07-08.12
N-gram Based on Cluster Label Extracting Algorithm for English Paper
Wu Suhui¹, Cheng Ying¹, Zheng Yanning², Pan Yuntao²
1. Department of Information Management, Nanjing University, Nanjing 210093, China;
2. Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract  This paper proposes a novel N-gram-based algorithm for extracting cluster labels from English papers. Before clustering, the algorithm uses N-grams to build a list of field phrases by prior learning on a large-scale corpus; it then clusters the English papers with the K-means algorithm. Finally, the highest-scoring N-gram terms in each cluster are extracted as its label. In the score calculation, a term that appears in the field phrase list receives double weight. Experimental results show that the quality of the cluster labels is improved. In addition, an improved TFIDF calculation method is developed, and a new R@N method for evaluating cluster labels is proposed.
Key words: Cluster label; N-gram; Paper clustering
Received: 21 June 2011      Published: 09 October 2011
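
The labeling step described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the documents are already grouped into clusters (the K-means step is omitted), the field phrase list is supplied by hand rather than learned from a large corpus, a single top-scoring term is returned per cluster, and a plain TF × IDF over clusters stands in for the improved TFIDF, whose formula the abstract does not give. The names `ngrams`, `label_clusters`, and `n_max` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Return all word n-grams (n = 1..n_max) of a lowercased text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def label_clusters(clusters, field_phrases, n_max=3):
    """Pick the highest-scoring n-gram of each cluster as its label.

    clusters: list of clusters, each a list of document strings
              (assumed to come from an earlier clustering step).
    field_phrases: set of phrases assumed to be learned beforehand.
    """
    # Frequency of every n-gram inside each cluster.
    tf = [Counter(g for doc in docs for g in ngrams(doc, n_max))
          for docs in clusters]
    # Number of clusters in which each n-gram occurs.
    cf = Counter(g for counts in tf for g in counts)
    k = len(clusters)

    labels = []
    for counts in tf:
        scores = {}
        for gram, freq in counts.items():
            # Stand-in weighting (assumption): TF x smoothed IDF over clusters.
            score = freq * math.log(1 + k / cf[gram])
            if gram in field_phrases:
                score *= 2  # double weight for terms in the field phrase list
            scores[gram] = score
        # Ties are broken deterministically: longer string, then string order.
        best = max(scores, key=lambda g: (scores[g], len(g), g)) if scores else None
        labels.append(best)
    return labels


# Toy data, invented for illustration only.
clusters = [
    ["suffix tree clustering of web documents",
     "web documents and suffix tree labels"],
    ["support vector machine for text classification",
     "text classification with support vector machine"],
]
field_phrases = {"suffix tree", "support vector machine"}
labels = label_clusters(clusters, field_phrases)
# labels == ["suffix tree", "support vector machine"]
```

In this toy input, "suffix tree" and "web documents" occur equally often in the first cluster; the double weight for field phrases is what makes "suffix tree" the label. With an empty phrase list the tie-break would select "web documents" instead.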



Cite this article:

Wu Suhui, Cheng Ying, Zheng Yanning, Pan Yuntao. N-gram Based on Cluster Label Extracting Algorithm for English Paper. New Technology of Library and Information Service, 2011, 27(7/8): 68-75.

