Wu Suhui, Cheng Ying, Zheng Yanning, Pan Yuntao
This paper proposes a novel N-gram-based algorithm for extracting cluster labels from English papers. Before clustering, the algorithm uses N-grams to build a list of field phrases through prior learning on a large-scale corpus. It then clusters the English papers with the K-means algorithm and extracts the highest-scoring N-gram terms in each cluster as that cluster's label. In the score calculation, a term that appears in the field phrase list receives double weight. Experimental results show that this improves the quality of the cluster labels. In addition, an improved TFIDF calculation method is developed, and a new method, R@N, is proposed for evaluating cluster labels.
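The label-scoring step described above can be illustrated with a minimal sketch. It assumes the papers have already been clustered (for example by K-means) and that the field phrase list is available as a set. The function names `ngrams` and `cluster_labels`, the parameters `n_max` and `top_k`, and the plain TF-IDF weighting over clusters are illustrative assumptions; the improved TFIDF calculation of the paper is not specified in this abstract and is not reproduced here.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=3):
    """Return all word n-grams (n = 1..n_max) of a lowercased text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def cluster_labels(clusters, field_phrases, n_max=3, top_k=1):
    """Pick the top_k highest-scoring n-grams of each cluster as its label.

    clusters: list of clusters, each a list of document strings
              (assumed to come from a prior clustering step).
    field_phrases: set of lowercased phrases learned beforehand.
    """
    # Term frequency per cluster; each cluster is treated as one pseudo-document.
    tf = [Counter(g for doc in docs for g in ngrams(doc, n_max))
          for docs in clusters]
    # Number of clusters in which each term occurs.
    df = Counter(term for counts in tf for term in counts)

    labels = []
    for counts in tf:
        scores = {}
        for term, freq in counts.items():
            # Plain TF-IDF over clusters (an assumption, not the paper's variant).
            score = freq * math.log(1 + len(clusters) / df[term])
            if term in field_phrases:
                score *= 2  # double weight for terms in the field phrase list
            scores[term] = score
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels.append(ranked[:top_k])
    return labels


clusters = [
    ["support vector machine training", "support vector machine kernel"],
    ["gene expression analysis", "gene expression data"],
]
field_phrases = {"support vector machine", "gene expression"}
print(cluster_labels(clusters, field_phrases))
```

In this toy input, several N-grams tie on raw frequency within each cluster, so the doubled weight is what lifts the field phrases to the top; without the phrase list, the tie is broken alphabetically and a less informative single word wins.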