Study on the Application of Complex Network Theory in Chinese Text Feature Selection

doi:10.11925/infotech.1003-3513.2012.09.05

New Technology of Library and Information Service (现代图书情报技术)

2012, (9): 23-28 https://doi.org/10.11925/infotech.1003-3513.2012.09.05

Knowledge Organization and Knowledge Management


Zhao Hui (赵辉), Liu Huailiang (刘怀亮), Fan Yunjie (范云杰)

Economy and Management College, Xidian University, Xi’an 710071, China


Abstract: This paper proposes a feature selection method based on complex networks. A weighted complex network is built for each text to represent the semantic relations between words and the structural information of the text. Node characteristics are computed by jointly considering each node's weighted degree, weighted clustering coefficient, and betweenness, and the resulting synthetic node characteristics are used to extract the keywords that reflect the topic of the text as its feature words. A Chinese text feature selection algorithm based on complex networks is presented and verified experimentally. The results show that this method improves text classification performance compared with traditional feature selection methods.

Key words: Complex network, Semantic relevance relation, Synthetic characteristics of nodes, Feature selection
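The pipeline outlined in the abstract (build a weighted word network, compute three node measures, combine them, keep the top-ranked words) can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not give the edge-weighting scheme, the clustering formula, or how the measures are combined. The sketch therefore assumes co-occurrence counts in a sliding window as edge weights, a Barrat-style weighted clustering coefficient, hop-count (unweighted) betweenness computed with Brandes' algorithm, and an equal-weight sum of max-normalized measures. All function names and parameters are hypothetical, and the input is assumed to be an already segmented list of words.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(tokens, window=2):
    """Undirected word graph; edge weight = co-occurrence count within
    `window` following tokens (an assumed weighting, not the paper's)."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return {u: dict(nbrs) for u, nbrs in graph.items()}


def weighted_degree(graph):
    """Node strength: sum of the weights of incident edges."""
    return {u: sum(nbrs.values()) for u, nbrs in graph.items()}


def weighted_clustering(graph):
    """Barrat-style weighted clustering coefficient (assumed formula)."""
    result = {}
    for node, nbrs in graph.items():
        k, s = len(nbrs), sum(nbrs.values())
        if k < 2:
            result[node] = 0.0
            continue
        total = sum(nbrs[j] + nbrs[h]
                    for j, h in combinations(nbrs, 2) if h in graph[j])
        result[node] = total / (s * (k - 1))
    return result


def betweenness(graph):
    """Brandes' algorithm on hop counts (edge weights ignored here)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0.0)
        sigma[s] = 1.0
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair of endpoints was counted twice.
    return {v: b / 2.0 for v, b in score.items()}


def normalize(values):
    """Scale a measure into [0, 1] by dividing by its maximum."""
    top = max(values.values(), default=0.0)
    return {k: (v / top if top else 0.0) for k, v in values.items()}


def select_features(tokens, k=5, window=2, coeffs=(1 / 3, 1 / 3, 1 / 3)):
    """Rank words by an assumed equal-weight sum of the three measures."""
    graph = build_network(tokens, window)
    measures = [normalize(f(graph))
                for f in (weighted_degree, weighted_clustering, betweenness)]
    score = {w: sum(c * m[w] for c, m in zip(coeffs, measures)) for w in graph}
    return sorted(score, key=lambda w: (-score[w], w))[:k]


tokens = ["network", "text", "network", "feature",
          "network", "word", "network", "topic"]
print(select_features(tokens, k=1, window=1))  # ['network']
```

In this toy input every other word is adjacent only to "network", so it has the largest weighted degree and lies on every shortest path between the other words. The coefficients in `coeffs` are placeholders; a real system would tune them and would likely replace raw co-occurrence counts with a semantic relatedness measure, as the abstract suggests.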

Received: 2012-07-25 Published: 2012-12-25

TP391.1

Cite this article:

Zhao Hui, Liu Huailiang, Fan Yunjie. Study on the Application of Complex Network Theory in Chinese Text Feature Selection[J]. New Technology of Library and Information Service, 2012, (9): 23-28.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.09.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2012/V/I9/23

