A Method of Keywords Annotation Based on Linked Triples

doi:10.11925/infotech.1003-3513.2015.09.05

New Technology of Library and Information Service (现代图书情报技术)

2015, Vol. 31

Issue (9): 31-37 https://doi.org/10.11925/infotech.1003-3513.2015.09.05

Research Paper

Xu Deshan¹, Li Hui², Zhang Yunliang¹

1 Institute of Scientific & Technical Information of China, Beijing 100038, China;
2 Beijing Institute of Science and Technology Information, Beijing 100048, China

Abstract

[Objective] To build an automatic indexing system for Chinese scientific and technical literature on top of an ontology management and service platform, using triple acquisition and natural language processing (NLP). [Methods] Ontology knowledge bases and vocabulary resources are integrated into the annotation module through Web Services interfaces. Dictionary matching and the combination of segmented words are used to identify domain terms and out-of-vocabulary words in the literature, respectively, and these are linked to triples in the ontology knowledge bases to form a network of domain concept relations. [Results] In a corpus test, the system indexed documents and linked terms at 86 articles per second, with a recall of 65% and a precision of 69%. [Limitations] The dictionaries are not indexed after loading, so matching takes too much time. Noise such as spaces and line breaks affects Chinese word segmentation and part-of-speech (POS) tagging. [Conclusions] Improving the data cleaning process and the keyword selection algorithm can further raise indexing efficiency and support deep text mining.
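
The dictionary-matching and triple-linking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the term dictionary, the triples, the predicate names, and the function names `match_terms`, `index_triples`, and `link_terms` are all hypothetical, and forward maximum matching is assumed here as one common way to do dictionary matching on Chinese text. The sketch keeps the dictionary in a hash set and indexes the triples by term, which is one simple way to avoid the linear scans that the stated limitation points to.

```python
from collections import defaultdict

# Hypothetical domain dictionary and triples; illustrative only, not from the paper.
DOMAIN_TERMS = {"本体", "知识库", "自然语言处理", "自动标引"}
TRIPLES = [
    ("自动标引", "uses", "自然语言处理"),
    ("知识库", "contains", "本体"),
    ("本体", "broader", "知识组织系统"),
]


def match_terms(text, terms):
    """Forward maximum matching: at each position take the longest dictionary term."""
    max_len = max((len(t) for t in terms), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in terms:  # set membership is a hash lookup
                found.append(candidate)
                i += size
                break
        else:
            i += 1  # no term starts here; advance one character
    return found


def index_triples(triples):
    """Index triples by subject and by object so a term can be linked directly."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((s, p, o))
        if o != s:
            index[o].append((s, p, o))
    return index


def link_terms(text, terms, triples):
    """Return {matched term: [triples mentioning it]} for one document."""
    index = index_triples(triples)
    unique_terms = dict.fromkeys(match_terms(text, terms))  # keeps first-seen order
    return {t: index.get(t, []) for t in unique_terms}


doc = "利用自然语言处理技术实现自动标引, 并与知识库建立链接"
links = link_terms(doc, DOMAIN_TERMS, TRIPLES)
for term, related in links.items():
    print(term, related)
```

Out-of-vocabulary words, which the abstract says are obtained by combining segmented words, are outside the scope of this sketch; it covers only terms already present in the dictionary.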

Received: 2015-01-26; Published: 2016-04-06

CLC number: TP391.1

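Funding:

This work is one of the research outcomes of the Institute of Scientific and Technical Information of China key project "Construction and Application of a Structured Knowledge Service Platform" (No. ZD2015-2) and the National Natural Science Foundation of China project "Research on Key Issues in the Rapid Construction of Knowledge Organization Systems for Specific Intelligence Analysis Applications" (No. 71203208).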

Corresponding author: Zhang Yunliang, ORCID: 0000-0003-4987-1539, E-mail: zhangyl@istic.ac.cn

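Author contributions: Xu Deshan proposed the research idea, designed the research plan, wrote the service interfaces and the annotation program, and drafted and revised the final version of the paper; Li Hui collected, cleaned, and annotated the experimental data and analyzed the experimental results; Zhang Yunliang organized the content of the domain vocabulary system and converted the dictionaries to ontology format.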

Cite this article:

Xu Deshan, Li Hui, Zhang Yunliang. A Method of Keywords Annotation Based on Linked Triples [J]. New Technology of Library and Information Service, 2015, 31(9): 31-37.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.09.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I9/31
