Abstract: This paper proposes a feature selection method based on complex networks. A weighted complex network is built from the text to represent the semantic relations between words and the structure of the text. The weighted degree, weighted clustering coefficient, and betweenness of each network node are combined into a synthetic node characteristic, and the keywords that best reflect the theme of the text are selected according to it. A Chinese text feature selection algorithm based on this complex network is proposed and verified. Experimental results show that the proposed method improves text classification performance.
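The pipeline the abstract outlines (build a weighted word network, compute three node measures, combine them, keep the top-scoring words) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's own formulation: the function names, the co-occurrence `window`, the use of the Barrat weighted clustering coefficient, betweenness computed over unweighted shortest paths, and the equal-weight sum of max-normalized scores. The input is assumed to be a list of already segmented words.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(words, window=2):
    """Co-occurrence network; edge weight counts co-occurrences within `window` (assumed)."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            if v != w:
                graph[w][v] += 1
                graph[v][w] += 1
    return {u: dict(nbrs) for u, nbrs in graph.items()}


def weighted_degree(graph, node):
    """Sum of the weights of all edges incident to `node`."""
    return sum(graph[node].values())


def weighted_clustering(graph, node):
    """Barrat-style weighted clustering coefficient (one possible definition)."""
    nbrs = list(graph[node])
    k, s = len(nbrs), weighted_degree(graph, node)
    if k < 2:
        return 0.0
    total = sum(
        (graph[node][a] + graph[node][b]) / 2
        for a, b in combinations(nbrs, 2)
        if b in graph[a]
    )
    # The formula sums over ordered neighbor pairs, hence the factor 2.
    return 2 * total / (s * (k - 1))


def betweenness(graph):
    """Brandes' algorithm on unweighted shortest paths (an assumption)."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Undirected graph: every pair of endpoints was counted twice.
    return {v: b / 2 for v, b in bc.items()}


def _normalize(scores):
    """Scale scores by their maximum so the three measures are comparable."""
    top = max(scores.values(), default=0) or 1
    return {k: v / top for k, v in scores.items()}


def select_features(words, top_n=3, weights=(1 / 3, 1 / 3, 1 / 3), window=2):
    """Rank words by a weighted sum of the three normalized measures (assumed)."""
    graph = build_network(words, window)
    wd = _normalize({v: weighted_degree(graph, v) for v in graph})
    wc = _normalize({v: weighted_clustering(graph, v) for v in graph})
    bc = _normalize(betweenness(graph))
    a, b, c = weights
    score = {v: a * wd[v] + b * wc[v] + c * bc[v] for v in graph}
    return sorted(score, key=score.get, reverse=True)[:top_n]


if __name__ == "__main__":
    tokens = ["network", "text", "feature", "network", "word", "text", "network"]
    print(select_features(tokens, top_n=2))
```

In this sketch the three weights are free parameters; how the measures are actually weighted against one another is a design choice the abstract does not specify.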
赵辉, 刘怀亮, 范云杰. 复杂网络理论在中文文本特征选择中的应用研究[J]. 现代图书情报技术, 2012, (9): 23-28.
Zhao Hui, Liu Huailiang, Fan Yunjie. Study on the Application of Complex Network Theory in Chinese Text Feature Selection. New Technology of Library and Information Service, 2012, (9): 23-28.