New Technology of Library and Information Service (现代图书情报技术)  2014, Vol. 30 Issue (5): 41-49    DOI: 10.11925/infotech.1003-3513.2014.05.06
Knowledge Organization and Knowledge Management
Research of Ontology Concept Extraction Based on Chinese UGC Sources*
Tang Xiaobo (唐晓波), Hu Hua (胡华)
School of Information Management, Wuhan University, Wuhan 430072, China
Abstract

[Objective] To extract ontology concepts from Chinese user-generated content (UGC) information sources. [Methods] Targeting the characteristics of UGC sources, this paper proposes a hybrid concept extraction method that uses linguistic rules to extract fine-grained words and combine them into candidate concepts, then applies statistical filtering to select the final concepts. A concept extraction model for UGC sources is built and verified with a prototype system. [Results] In the UGC concept extraction experiments, the method outperformed the four other concept extraction methods used for comparison, reaching an accuracy rate (precision) of 68.42% and a recall rate of 85.35%. [Limitations] The test set comes from UGC sources of relatively high information quality, part of it was filtered manually, and the corpus is small. [Conclusions] The proposed concept extraction method and techniques are of some value for ontology concept extraction from UGC sources.

Key words: Concept extraction; Part-of-speech rules; Head word; Mutual information; Information entropy
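The abstract names the ingredients of the method (part-of-speech rules to combine fine-grained words, then mutual information and information entropy as statistical filters) but not the actual rules, formulas, or thresholds. The following is only a minimal sketch of how such a pipeline could look, not the authors' implementation. It assumes the text has already been segmented and POS-tagged into `(word, tag)` tuples; the tag patterns in `POS_PATTERNS`, the thresholds `min_pmi` and `min_entropy`, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical POS-tag patterns for two-word candidates (illustrative only,
# not the rules used in the paper).
POS_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n")}


def entropy(counts):
    """Shannon entropy (bits) of a Counter of neighbouring words."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def collect(sentences):
    """Count words, POS-matched adjacent pairs, and each pair's neighbours."""
    unigrams, pairs = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        words = [w for w, _ in sent]
        unigrams.update(words)
        for i in range(len(sent) - 1):
            if (sent[i][1], sent[i + 1][1]) not in POS_PATTERNS:
                continue
            pair = (words[i], words[i + 1])
            pairs[pair] += 1
            # Sentence-boundary markers stand in for missing neighbours.
            left[pair][words[i - 1] if i > 0 else "<s>"] += 1
            right[pair][words[i + 2] if i + 2 < len(words) else "</s>"] += 1
    return unigrams, pairs, left, right


def pmi(pair, unigrams, pairs):
    """Pointwise mutual information of a pair, using the word count as N."""
    n = sum(unigrams.values())
    w1, w2 = pair
    return math.log2(pairs[pair] * n / (unigrams[w1] * unigrams[w2]))


def extract_concepts(sentences, min_pmi=1.0, min_entropy=1.0):
    """Keep candidates that are cohesive (PMI) and free-standing (entropy)."""
    unigrams, pairs, left, right = collect(sentences)
    kept = []
    for pair in pairs:
        boundary = min(entropy(left[pair]), entropy(right[pair]))
        if pmi(pair, unigrams, pairs) >= min_pmi and boundary >= min_entropy:
            kept.append("".join(pair))
    return kept


# Toy corpus, assumed to be pre-segmented and POS-tagged.
demo = [
    [("本体", "n"), ("概念", "n"), ("抽取", "v")],
    [("研究", "v"), ("本体", "n"), ("概念", "n"), ("方法", "n")],
    [("的", "u"), ("本体", "n"), ("概念", "n"), ("很", "d")],
]
print(extract_concepts(demo))  # ['本体概念']
```

In this toy corpus all three candidate pairs have the same mutual information, so the entropy filter does the work: only 本体概念 appears with varied left and right neighbours, while 研究本体 and 概念方法 each occur in a single context and are discarded.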
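Received: 2013-11-11
CLC number: TP391
Funding:

*This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Integrated Retrieval and Semantic Analysis Methods for Social Media" (Grant No. 71273194).

Corresponding author: Hu Hua, E-mail: henryhu@whu.edu.cn
Author information: Tang Xiaobo: proposed the research idea and designed the research plan; Hu Hua: conducted the experiments; collected, cleaned, and analyzed the data; drafted the paper; revised the final version.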
Cite this article:
Tang Xiaobo, Hu Hua. Research of Ontology Concept Extraction Based on Chinese UGC Sources[J]. New Technology of Library and Information Service, 2014, 30(5): 41-49. DOI: 10.11925/infotech.1003-3513.2014.05.06.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.05.06


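Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn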