New Technology of Library and Information Service  2016, Vol. 32 Issue (3): 25-32    DOI: 10.11925/infotech.1003-3513.2016.03.04
Original Article
Generating Hierarchical Paths of Chinese Text from Wikipedia
Xia Tian
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
Abstract  

[Objective] To generate hierarchical semantic paths for Chinese texts from Wikipedia. [Methods] We first built a vector of Wikipedia article concepts for each Chinese text through explicit semantic analysis. We then mapped the vector to the category nodes of a hierarchical, tree-like category graph. Finally, we generated the hierarchical paths through seed-node information diffusion and top-down path selection, together with optimization techniques. [Results] On the test dataset, the average relevance degree of the first generated hierarchical path was 54.10%, and the top 20 paths were sorted by relevance in descending order. [Limitations] We did not analyze how the number of explicit concepts used in the vector affects the quality of the generated paths. [Conclusions] The hierarchical paths generated from Wikipedia can reflect the main semantic meaning of the given texts.
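The three steps named in the abstract (concept vector, mapping to category nodes, diffusion plus top-down selection) can be illustrated with a minimal Python sketch. This is not the author's implementation. The toy index, the category tree, the `decay` parameter, and every function name below are hypothetical, the category graph is simplified to a strict tree, input text is assumed to be already segmented into tokens, and the optimization step mentioned in the abstract is omitted.

```python
from collections import defaultdict

# Hypothetical toy data standing in for a Wikipedia-derived index.
# Concept (article title) -> {term: weight}, e.g. TF-IDF weights.
CONCEPT_TERMS = {
    "Machine learning": {"learning": 0.9, "model": 0.6, "data": 0.4},
    "Database": {"data": 0.8, "query": 0.7},
    "Football": {"ball": 0.9, "goal": 0.7},
}
# Article -> categories it is filed under.
ARTICLE_CATEGORIES = {
    "Machine learning": ["Artificial intelligence"],
    "Database": ["Data management"],
    "Football": ["Ball games"],
}
# Child category -> parent category (assumed to be a strict tree).
PARENT = {
    "Artificial intelligence": "Computer science",
    "Data management": "Computer science",
    "Computer science": "Science",
    "Ball games": "Sports",
    "Science": "Root",
    "Sports": "Root",
}


def concept_vector(tokens):
    """ESA-style vector: sum each concept's term weights over the tokens."""
    vec = defaultdict(float)
    for concept, terms in CONCEPT_TERMS.items():
        for tok in tokens:
            vec[concept] += terms.get(tok, 0.0)
    return {c: w for c, w in vec.items() if w > 0}


def seed_categories(vec):
    """Map concept weights onto the categories that contain each concept."""
    seeds = defaultdict(float)
    for concept, weight in vec.items():
        for cat in ARTICLE_CATEGORIES[concept]:
            seeds[cat] += weight
    return dict(seeds)


def diffuse(seeds, decay=0.5):
    """Spread each seed's weight to its ancestors, shrinking by `decay` per level."""
    scores = defaultdict(float)
    for cat, weight in seeds.items():
        node = cat
        while node is not None:
            scores[node] += weight
            node = PARENT.get(node)
            weight *= decay
    return dict(scores)


def top_down_path(scores, root="Root"):
    """Walk from the root, always following the highest-scoring child."""
    children = defaultdict(list)
    for child, parent in PARENT.items():
        children[parent].append(child)
    path, node = [root], root
    while True:
        candidates = [c for c in children[node] if scores.get(c, 0.0) > 0]
        if not candidates:
            return path
        node = max(candidates, key=lambda c: scores[c])
        path.append(node)


def hierarchical_path(tokens):
    """Chain the three steps for one tokenized text."""
    return top_down_path(diffuse(seed_categories(concept_vector(tokens))))


if __name__ == "__main__":
    print(" > ".join(hierarchical_path(["learning", "model", "data"])))
```

Producing several ranked paths, as the reported top-20 results imply, would need a beam or k-best variant of the greedy walk; the sketch returns only a single path.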

Key words: Semantic path; Explicit semantic analysis; Hierarchical classification; Wikipedia
Received: 16 November 2015      Published: 12 April 2016

Cite this article:

Xia Tian. Generating Hierarchical Paths of Chinese Text from Wikipedia. New Technology of Library and Information Service, 2016, 32(3): 25-32.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.03.04     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I3/25
