New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 38-44    DOI: 10.11925/infotech.1003-3513.2014.11.06
An Algorithm of Chinese Text Representation Based on Complex Network
Yang Zhimo, Liu Huailiang, Zhao Hui
School of Economics & Management, Xidian University, Xi'an 710126, China
Abstract  

[Objective] To address the lack of semantic information in text representations based on the Vector Space Model, this paper proposes a Chinese text representation algorithm based on complex networks. [Methods] Word relevance is calculated from the concept pages, link structure, and category system extracted from Wikipedia. The feature words of a text are then represented as nodes, the semantic relevance relations between words as edges, and the word relevance values as the edge weights of a weighted complex network. [Results] Experiments show that the proposed text representation method improves text similarity calculation and text categorization performance. [Limitations] The rules for selecting the co-occurrence window and span are borrowed from existing research. [Conclusions] The proposed method better preserves the structural information of a text and the correlation information between words. In addition, computing word relevance from Wikipedia makes the semantic information represented by the text network more accurate.
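
The network construction described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_text_network`, the `threshold` parameter, and the toy relevance table are assumptions made for illustration, and the toy table stands in for the Wikipedia-based word relevance measure, whose formula the abstract does not give.

```python
from itertools import combinations


def build_text_network(feature_words, relevance, threshold=0.0):
    """Build a weighted undirected graph as {word: {neighbor: weight}}.

    Nodes are feature words; an edge joins two words whose relevance
    exceeds `threshold` (a hypothetical cutoff), weighted by that relevance.
    """
    words = list(dict.fromkeys(feature_words))  # drop duplicates, keep order
    graph = {word: {} for word in words}
    for a, b in combinations(words, 2):
        weight = relevance(a, b)
        if weight > threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


# Hypothetical relevance scores; a real system would derive them from
# Wikipedia concept pages, link structure, and the category system.
TOY_RELEVANCE = {
    frozenset(("network", "graph")): 0.8,
    frozenset(("network", "text")): 0.3,
}


def toy_relevance(a, b):
    """Look up a symmetric relevance score, defaulting to 0.0."""
    return TOY_RELEVANCE.get(frozenset((a, b)), 0.0)


network = build_text_network(["network", "graph", "text"], toy_relevance)
```

Here `network` maps each feature word to its weighted neighbors, so word pairs with no relevance above the cutoff simply share no edge.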

Key words: Text representation; Complex network; Wikipedia; Word relevance; Text similarity
Received: 06 April 2014      Published: 18 December 2014
:  G350  

Cite this article:

Yang Zhimo, Liu Huailiang, Zhao Hui. An Algorithm of Chinese Text Representation Based on Complex Network. New Technology of Library and Information Service, 2014, 30(11): 38-44.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.11.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I11/38

