New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 31-37     https://doi.org/10.11925/infotech.1003-3513.2014.11.05
Knowledge Organization and Knowledge Management
Research of Text Feature Extraction on Dependency Parsing Network
Tang Xiaobo, Xiao Lu
Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract

[Objective] To improve the accuracy of network-based text feature extraction, this paper uses dependency parsing to build a more accurate text network. [Methods] Semantic associations between feature words are determined from the dependency parsing results, and the direction of each association is determined by the dependency direction between the feature words. An improved PageRank algorithm then computes the importance of each node, which serves as the indicator for feature extraction. [Results] Experimental results show that, compared with co-word networks, feature extraction based on the dependency parsing network improves text clustering performance to some extent. [Limitations] Different dependency types are not distinguished when dependency relations are used to determine the direction of association between feature words. [Conclusions] The proposed text feature extraction method based on the dependency parsing network is effective.
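The pipeline outlined in the Methods statement can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how PageRank was improved, so standard PageRank with uniform redistribution of dangling-node mass is used here instead. The edge direction (dependent word pointing to its head word), the toy word pairs, and the function names `build_dependency_graph`, `pagerank`, and `top_features` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_dependency_graph(dependencies):
    """Build a directed graph from (dependent, head) word pairs.

    Assumption: each edge points from the dependent word to its head word.
    """
    graph = defaultdict(set)
    nodes = set()
    for dependent, head in dependencies:
        if dependent == head:
            continue  # ignore self-loops
        graph[dependent].add(head)
        nodes.update((dependent, head))
    return nodes, graph


def pagerank(nodes, graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration (not the paper's improved variant)."""
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = graph.get(src)
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_features(dependencies, k=3):
    """Return the k highest-ranked words as candidate text features."""
    nodes, graph = build_dependency_graph(dependencies)
    rank = pagerank(nodes, graph)
    # Sort by descending rank, breaking ties alphabetically.
    return sorted(rank, key=lambda w: (-rank[w], w))[:k]


# Toy (dependent, head) pairs; a real pipeline would take these from a parser.
pairs = [
    ("text", "network"),
    ("dependency", "network"),
    ("feature", "extraction"),
    ("network", "extraction"),
]
print(top_features(pairs, k=2))
```

In this toy graph, words that many other words depend on accumulate rank, which is the intuition behind using node importance as the feature selection indicator. Weighting edges by dependency type, which the abstract lists as a limitation, is not attempted here.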

Key words: Feature extraction; Dependency parsing; Complex network
Received: 2014-05-23      Published: 2014-12-18
CLC number: TP391.1
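Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Grant No. 71273194).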

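Corresponding author: Xiao Lu     E-mail: ahjk_xiaolu@163.com
Author contributions: Tang Xiaobo proposed the research topic and research plan, designed the experiments, and analyzed the results; Xiao Lu implemented the research plan, collected the experimental data, drafted the paper, and revised the final version.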
Cite this article:
Tang Xiaobo, Xiao Lu. Research of Text Feature Extraction on Dependency Parsing Network[J]. New Technology of Library and Information Service, 2014, 30(11): 31-37.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.11.05      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I11/31

