New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 68-75    DOI: 10.11925/infotech.1003-3513.2011.07-08.12
N-gram Based on Cluster Label Extracting Algorithm for English Paper
Wu Suhui1, Cheng Ying1, Zheng Yanning2, Pan Yuntao2
1. Department of Information Management, Nanjing University, Nanjing 210093, China;
2. Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract  This paper proposes a novel N-gram-based algorithm for extracting cluster labels from English papers. Before clustering, the algorithm uses N-grams to build a list of field phrases by prior learning on a large-scale corpus; it then clusters the English papers with the K-means algorithm. Finally, the highest-scoring N-gram terms in each cluster are extracted as its label. In the score calculation, a term that appears in the field phrase list is given double weight. Experimental results show that the quality of the cluster labels is improved. In addition, an improved TFIDF calculation method is developed, and a new R@N method for evaluating cluster labels is proposed.
Key words: Cluster label; N-gram; Paper clustering
Received: 21 June 2011      Published: 09 October 2011
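
The labeling step described in the abstract can be illustrated with a minimal sketch. It assumes the documents have already been clustered (for example by K-means) and that a field phrase list is available. The function names `ngrams` and `label_cluster`, the whitespace tokenization, and the smoothed TF-IDF-style score are illustrative assumptions; the abstract does not specify the paper's improved TFIDF formula, so a plain variant is used here. Only the double weighting of field phrases is taken from the abstract.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Yield every word n-gram of length 1..n_max as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def label_cluster(cluster_docs, all_docs, field_phrases, n_max=3, k=1):
    """Return the k highest-scoring n-grams of one cluster (hypothetical helper)."""
    # Document frequency of each n-gram over the whole collection.
    df = Counter()
    for doc in all_docs:
        df.update(set(ngrams(doc.lower().split(), n_max)))

    # Term frequency of each n-gram inside the cluster.
    tf = Counter()
    for doc in cluster_docs:
        tf.update(ngrams(doc.lower().split(), n_max))

    n_docs = len(all_docs)
    scores = {}
    for term, freq in tf.items():
        # Plain smoothed TF-IDF; an assumption, not the paper's improved formula.
        score = freq * math.log(1 + n_docs / (1 + df[term]))
        if term in field_phrases:
            score *= 2  # double weight for terms in the field phrase list
        scores[term] = score

    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, invented for illustration only.
docs = [
    "support vector machine training",
    "support vector machine kernel",
    "gene expression analysis",
    "gene expression data",
]
print(label_cluster(docs[:2], docs, {"support vector machine"}))
# → ['support vector machine']
```

In this toy example several n-grams tie on the plain score, and the double weight is what lifts the field phrase above its component words.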



Wu Suhui, Cheng Ying, Zheng Yanning, Pan Yuntao. N-gram Based on Cluster Label Extracting Algorithm for English Paper. New Technology of Library and Information Service, 2011, 27(7/8): 68-75.

