New Technology of Library and Information Service  2016, Vol. 32 Issue (3): 25-32     https://doi.org/10.11925/infotech.1003-3513.2016.03.04
Research Paper
Generating Hierarchical Paths of Chinese Text from Wikipedia*
Xia Tian
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
Abstract

[Objective] To generate hierarchical semantic paths for free text using the Wikipedia knowledge base. [Methods] A hierarchical tree-like graph is first built from the Chinese Wikipedia export data. Free text is then represented as a vector of article concepts through explicit semantic analysis and mapped onto the graph through article-category associations, which yields a set of seed category nodes. Hierarchical paths are finally generated by diffusing information outward from the seed nodes and by selecting and optimizing paths top-down. [Results] On the test set, the average relevance of the first generated hierarchical path reached 54.10%, and the top 20 paths were, on the whole, ordered by descending relevance. [Limitations] The effect of the number of concepts retained in the explicit concept vector on the quality of the generated paths was not analyzed. [Conclusions] The hierarchical paths generated from the Wikipedia knowledge base reflect the main semantic information of the given text.

Key words: Semantic path    Explicit semantic analysis    Hierarchical classification    Wikipedia
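The pipeline outlined in the abstract (concept vector, seed category nodes, diffusion, top-down path selection) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy articles, term weights, and category tree, the `decay` factor, and the mean-score ranking are hypothetical, and the abstract does not specify the paper's actual weighting, diffusion, or path optimization.

```python
from collections import defaultdict

# Hypothetical toy data; none of it comes from the paper.
ARTICLE_TERMS = {  # article concept -> term weights (stand-in for a TF-IDF index)
    "Neural network": {"neuron": 0.9, "learning": 0.6},
    "Machine learning": {"learning": 0.9, "model": 0.7},
    "Football": {"goal": 0.8, "team": 0.7},
}
ARTICLE_CATEGORIES = {  # article -> categories it belongs to
    "Neural network": ["Artificial intelligence"],
    "Machine learning": ["Artificial intelligence"],
    "Football": ["Team sports"],
}
CATEGORY_PARENT = {  # child category -> parent category (a tree)
    "Artificial intelligence": "Computer science",
    "Computer science": "Science",
    "Team sports": "Sports",
    "Science": "Root",
    "Sports": "Root",
}

def concept_vector(tokens):
    """ESA-style vector: score each article by the summed weights of the text's terms."""
    scores = {}
    for article, weights in ARTICLE_TERMS.items():
        total = sum(weights.get(token, 0.0) for token in tokens)
        if total > 0:
            scores[article] = total
    return scores

def seed_categories(concepts):
    """Map article scores onto categories through article-category links."""
    seeds = defaultdict(float)
    for article, score in concepts.items():
        for category in ARTICLE_CATEGORIES.get(article, []):
            seeds[category] += score
    return dict(seeds)

def diffuse(seeds, decay=0.5):
    """Spread each seed's score to its ancestors, attenuated by `decay` per level."""
    scores = defaultdict(float)
    for category, score in seeds.items():
        node, current = category, score
        while node is not None:
            scores[node] += current
            node = CATEGORY_PARENT.get(node)
            current *= decay
    return dict(scores)

def ranked_paths(scores, root="Root"):
    """Walk top-down through scored nodes and rank paths by mean node score."""
    children = defaultdict(list)
    for child, parent in CATEGORY_PARENT.items():
        if child in scores:
            children[parent].append(child)
    paths = []

    def walk(node, path):
        kids = children.get(node, [])
        if not kids:
            if len(path) > 1:  # skip the bare root
                paths.append(path)
            return
        for kid in kids:
            walk(kid, path + [kid])

    walk(root, [root])
    # The root is excluded from the mean because every path shares it.
    return sorted(paths,
                  key=lambda p: sum(scores[n] for n in p[1:]) / (len(p) - 1),
                  reverse=True)

def hierarchical_paths(tokens):
    return ranked_paths(diffuse(seed_categories(concept_vector(tokens))))
```

Calling `hierarchical_paths(["learning", "goal"])` on this toy data returns two root-to-leaf paths, with the "Artificial intelligence" branch ranked ahead of the "Team sports" branch because its seed score is higher.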
Received: 2015-11-16      Published: 2016-04-12
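Funding: *This work is one of the research outcomes of the Beijing Higher Education Young Elite Teacher Project "Research on Microblog Community Mining Based on Link and Topic Analysis" (No. YETP0215) and the Major Program of the National Social Science Foundation of China "Research on the Integration and Service Mechanism of National Digital Archival Resources" (No. 13&ZD184).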
Cite this article:
Xia Tian. Generating Hierarchical Paths of Chinese Text from Wikipedia[J]. New Technology of Library and Information Service, 2016, 32(3): 25-32.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.03.04      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I3/25
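Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing  Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn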