New Technology of Library and Information Service  2016, Vol. 32 Issue (11): 82-93     https://doi.org/10.11925/infotech.1003-3513.2016.11.10
Application Paper
Using Intelligent System to Extract Search Terms for Sci-Tech Novelty Retrieval*
Wang Peixia1,2, Yu Hai1,2, Chen Li1,2, Wang Yongji1
1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] This paper aims to identify search terms more effectively in sci-tech novelty retrieval, where manual selection is subjective, labor-intensive, non-standardized, and time-consuming. [Context] To make search-term extraction automated, intelligent, and standardized, we use the relevant corpus retrieved in real time during the novelty retrieval process as the source of domain knowledge, and discuss how the composition of that corpus relates to keyword extraction performance. [Methods] We propose an incremental, iterative method that combines keyword extraction with domain feature expansion to extract search terms for sci-tech novelty retrieval projects. [Results] Compared with the search terms used in real novelty retrieval cases, the 10 search terms extracted after two iterations of the method reached a recall of 80%. [Conclusions] Iteratively extracting search terms from the dynamic corpus of documents retrieved during novelty retrieval helps identify most search terms quickly and accurately, improving the efficiency and effectiveness of retrieval.

Key words: Sci-Tech novelty retrieval; Search terms; Keyword extraction; Online crawler
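The incremental, iterative extraction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (frequency in the project text plus document frequency in the retrieved corpus), the `search` stand-in for a literature database query, the small English stopword list, and the toy document pool are all assumptions made for illustration. A real system would use Chinese word segmentation and live database queries.

```python
import re
from collections import Counter

# Assumed, minimal stopword list; a real system would use a proper one.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase word tokenizer; a stand-in for real word segmentation."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def rank(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def search(pool, terms):
    """Hypothetical database query: documents in pool containing any of the terms."""
    wanted = set(terms)
    return [doc for doc in pool if wanted & set(tokenize(doc))]

def extract_terms(project_text, pool, k=10, iterations=2, seed_size=3):
    """Iteratively extract search terms, expanding with features from retrieved documents."""
    project_tf = Counter(tokenize(project_text))
    # Seed terms come from the project description alone.
    terms = rank(project_tf, seed_size)
    for _ in range(iterations):
        # The documents retrieved with the current terms form the dynamic corpus.
        corpus = search(pool, terms)
        # Assumed domain feature: number of retrieved documents containing each term.
        domain_df = Counter()
        for doc in corpus:
            domain_df.update(set(tokenize(doc)))
        # Candidates are project terms plus terms that appear only in the corpus.
        candidates = set(project_tf) | set(domain_df)
        scores = {t: project_tf[t] + domain_df[t] for t in candidates}
        terms = rank(scores, k)
    return terms

def recall(extracted, reference):
    """Fraction of the reference search terms that were extracted."""
    reference = set(reference)
    return len(reference & set(extracted)) / len(reference)

# Toy data, invented for this example.
project = "Covert channel detection in cloud computing and covert channel analysis"
pool = [
    "Timing covert channel in virtual machines",
    "Cloud virtual machines and covert channel capacity",
    "Deep learning for image recognition",
]
terms = extract_terms(project, pool, k=5, iterations=2)
print(terms)
print(recall(terms, ["covert", "channel", "virtual", "timing"]))
```

In this sketch a term such as "virtual", which never occurs in the project text, can enter the result because it is frequent in the retrieved documents. That is the role the abstract assigns to domain feature expansion. The `recall` function mirrors the evaluation in the abstract, where extracted terms are compared against the terms used in real novelty retrieval cases.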
Received: 2016-07-28      Published: 2016-12-20
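Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Covert Channel Mechanisms in Cloud Computing Environments" (No. 61170072), the National Natural Science Foundation of China Young Scientists Fund project "Research on Covert Channel Mechanisms in Mobile Intelligent Terminals" (No. 61303057), and the Chinese Academy of Sciences and State Administration of Foreign Experts Affairs international cooperation project for innovation teams, "Theory and Construction Methods of Safety-Critical Software".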
Cite this article:
Wang Peixia, Yu Hai, Chen Li, Wang Yongji. Using Intelligent System to Extract Search Terms for Sci-Tech Novelty Retrieval. New Technology of Library and Information Service, 2016, 32(11): 82-93.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.11.10      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I11/82