Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 29-36    DOI: 10.11925/infotech.2096-3467.2019.0069
System Analysis and Design for Methodological Entities Extraction in Full Text of Academic Literature *
Hao Xu1, Xuefang Zhu2, Chengzhi Zhang3, Chuan Jiang4
1 School of Economics & Management, Nanjing Institute of Technology, Nanjing 211167, China
2 School of Information Management, Nanjing University, Nanjing 210023, China
3 School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China
4 College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract

[Objective] This paper extracts methodological entities from the full text of academic literature and identifies their indexing features and usage contexts within the full text. [Methods] Feature sentences containing methodological knowledge, and the methodological entities in them, were extracted using dictionaries, rules, and manual annotation. The core functional modules for methodological entity extraction were implemented with Microsoft Visual Studio 2012 and SQL Server 2012. [Results] The precision of feature sentence extraction was 76%, and the recall was greater than 42%. Each feature sentence contained about 1.42 methodological entities on average. The formal indexing ratio was below 27% for methodological entities and below 35% for feature sentences, and the formal indexing rate of discipline-specific tools was low. [Limitations] Both the precision and the recall of feature sentence extraction are fairly low. A manual annotation interface is provided as an aid, but the annotation workload is heavy, and named entity recognition does not yet draw on semantic features of methodological knowledge such as relations between sentences. [Conclusions] The academic value of discipline-specific methodological knowledge is overlooked. The proposed method for extracting methodological feature sentences and entities is applicable across disciplines and can support further study of methodology-driven paths of interdisciplinary knowledge diffusion.

Key words: Full Text of Academic Literature    Methodological Entities    Entity Extraction System    Entity Use Feature
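Received: 2019-01-15
CLC number: G250
Funding: *This work is one of the research outcomes of the National Social Science Fund of China Major Project "Research on the Disciplinary Construction of Information Science and the Future Development Path of Information Work" (17ZDA291); the Nanjing Institute of Technology Scientific Research Start-up Fund for Introduced Talents project "Research on the Laws and Measurement of Methodology-Driven Interdisciplinary Knowledge Diffusion" (YKJ201725); and the Nanjing Institute of Technology University-Level Basic Research Special Project "Research on the Disciplinary Diffusion Laws of Research-Method Knowledge Based on Full Text" (JCYJ201826).
Corresponding author: Xuefang Zhu     E-mail: xfzhu@nju.edu.cn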
Cite this article:
Hao Xu, Xuefang Zhu, Chengzhi Zhang, Chuan Jiang. System Analysis and Design for Methodological Entities Extraction in Full Text of Academic Literature[J]. Data Analysis and Knowledge Discovery, 2019, 3(10): 29-36. DOI: 10.11925/infotech.2096-3467.2019.0069.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0069
Fig. 1  Functional structure of the methodology extraction system
No.   Pattern
1     use<>software
2     perform use<>
3     be perform use<>
4     analysis be perform use<>
5     analyze use<>
6     analysis be perform with<>
7     <>statistical software
8     <> software
9     quantify use<>
10    be calculate use<>
Table 1  High-frequency feature patterns for identifying methodological knowledge
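The patterns in Table 1 can be read as token templates, and the following minimal Python sketch shows one way to apply them. It is not the authors' implementation (the abstract reports Visual Studio 2012 and SQL Server 2012). It assumes that `<>` marks the slot where the entity appears, that the input sentence has already been lemmatized and lowercased, and that the slot holds a single token; the helper names `compile_pattern` and `match_patterns` are hypothetical.

```python
import re

# Patterns copied from Table 1. Reading "<>" as the slot where the
# methodological entity appears is an assumption of this sketch.
PATTERNS = [
    "use<>software",
    "perform use<>",
    "be perform use<>",
    "analysis be perform use<>",
    "analyze use<>",
    "analysis be perform with<>",
    "<>statistical software",
    "<> software",
    "quantify use<>",
    "be calculate use<>",
]


def compile_pattern(pattern):
    """Turn one Table 1 pattern into a regex over space-separated lemmas."""
    left, right = pattern.split("<>")
    parts = []
    if left.strip():
        parts.append(r"\s+".join(re.escape(tok) for tok in left.split()))
    parts.append(r"(?P<entity>\S+)")  # the slot: exactly one token in this sketch
    if right.strip():
        parts.append(r"\s+".join(re.escape(tok) for tok in right.split()))
    # The lookarounds keep the match aligned to whole tokens.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


COMPILED = [(p, compile_pattern(p)) for p in PATTERNS]


def match_patterns(lemmas):
    """Return (pattern, candidate entity) for every pattern found in the sentence."""
    hits = []
    for pattern, regex in COMPILED:
        found = regex.search(lemmas)
        if found:
            hits.append((pattern, found.group("entity")))
    return hits


print(match_patterns("statistical analysis be perform use spss"))
```

Overlapping patterns can return several candidates for one sentence, and a one-token slot will truncate multi-word names such as GraphPad Prism, so the output would still need filtering, for example against the dictionary and manual annotation steps the abstract describes.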
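Fig. 2  Workflow of the methodological knowledge extraction module
Fig. 3  Manual annotation interface of the methodology extraction system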
File       Feature sentences / precision (%)   Entities   Indexed entities / ratio (%)   Indexed feature sentences   Entity indexing as share of feature sentence indexing (%)
S_0.txt    575/76.36                           812        213/26.23                      269                         79.18
S_1.txt    602/76.30                           829        209/25.21                      257                         81.32
S_2.txt    572/75.66                           816        206/25.25                      261                         78.92
S_3.txt    595/75.32                           843        215/25.50                      266                         80.83
S_4.txt    556/74.73                           794        196/24.69                      241                         81.33
S_5.txt    626/77.28                           892        219/24.55                      268                         81.72
S_6.txt    610/76.44                           883        221/25.03                      278                         74.16
S_7.txt    595/76.38                           869        214/24.63                      276                         77.54
S_8.txt    600/76.43                           800        194/24.25                      249                         78.22
S_9.txt    618/76.67                           916        223/24.34                      299                         74.58
Table 2  Basic statistics of feature sentences in the sample data
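As a worked check of how the percentage columns in Table 2 appear to be derived, the short Python sketch below recomputes two of them for S_0.txt. The column relationships are inferred from the numbers, not stated in the source: the entity indexing ratio is taken as indexed entities divided by entities, and the last column as indexed entities divided by indexed feature sentences. This reading reproduces the S_0.txt row, although the last-column values printed for S_6.txt and S_8.txt differ from what it gives. The helper name `pct` is hypothetical.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts copied from the S_0.txt row of Table 2.
entities = 812
indexed_entities = 213
indexed_feature_sentences = 269

# Assumed derivation of the two percentage columns.
entity_indexing_ratio = pct(indexed_entities, entities)
entity_share_of_sentence_indexing = pct(indexed_entities, indexed_feature_sentences)

print(entity_indexing_ratio)              # 26.23
print(entity_share_of_sentence_indexing)  # 79.18
```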
No.   Entity               Mentions   Formal citations / citation rate (%)   Valid formal citations / validity rate (%)
1     SPSS                 376        7/1.86                                 1/14.29
2     ImageJ               269        38/14.13                               29/76.32
3     GraphPad Prism       247        0/0.00                                 0/0.00
4     ANOVA                209        5/2.39                                 2/40.00
5     R                    178        70/39.33                               15/21.43
6     Student's t-test     147        3/2.04                                 2/66.67
7     SAS                  142        9/6.34                                 2/22.22
8     Stata                113        14/12.39                               2/14.29
9     MATLAB               105        25/23.81                               18/72.00
10    FlowJo               91         4/4.40                                 4/100.00
11    BLAST                79         24/30.38                               24/100.00
12    Primer               73         15/20.55                               10/66.67
13    GraphPad software    56         0/0.00                                 0/0.00
14    EXCEL                56         25/44.64                               1/4.00
15    MEGA                 55         28/50.91                               27/96.43
Table 3  Mentions and citations of high-frequency methodological entities
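The two rates in Table 3 appear to be nested ratios, which the Python sketch below recomputes for a few rows. The derivation is an inference from the numbers: citation rate is taken as formal citations divided by mentions, and validity rate as valid formal citations divided by formal citations, with 0.00 reported when there are no formal citations. The function name `citation_rates` is hypothetical.

```python
def citation_rates(mentions, formal, valid):
    """Return (citation rate %, validity rate %), each rounded to two decimals."""
    citation_rate = round(100 * formal / mentions, 2) if mentions else 0.0
    # With no formal citations the validity rate is undefined; Table 3 prints 0.00.
    validity_rate = round(100 * valid / formal, 2) if formal else 0.0
    return citation_rate, validity_rate


# (mentions, formal citations, valid formal citations) copied from Table 3.
ROWS = {
    "SPSS": (376, 7, 1),
    "GraphPad Prism": (247, 0, 0),
    "BLAST": (79, 24, 24),
}

for name, counts in ROWS.items():
    print(name, citation_rates(*counts))
```

Read this way, a tool such as SPSS is mentioned often but formally cited in under 2% of mentions, which is the kind of gap the abstract summarizes as a low formal indexing rate.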