Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (1): 21-28     https://doi.org/10.11925/infotech.2096-3467.2017.1091
Research Paper
Identifying Actual Value of Numerical Indicator from Scientific Paper
Guo Shaoqing1,2, Le Xiaoqiu1
1 (National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
2 (University of Chinese Academy of Sciences, Beijing 100049, China)
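Abstract

[Objective] Scientific papers describe the magnitude of a numerical indicator in many different forms. This paper aims to accurately identify the actual value of a numerical indicator from sentences written in these different forms. [Methods] We analyze the shortest syntactic-tree path between the indicator entity and the numeric entity in a numerical indicator sentence, use distant supervision to learn the syntactic and descriptive features of such sentences, and identify numerical indicator sentences among domain candidate sentences. Using a small amount of semantically annotated data, we learn templates for four types of value relations ("greater than", "less than", "equal to", and "multiple"), use the templates to identify the value-relation type of each numerical indicator sentence, and compute the actual value of the indicator according to the conversion rule associated with each value-relation template. [Results] In experiments in the climate change and astronomy domains, the F-values reached 82.35% and 77.55%, respectively, which is above the average level of comparable studies. [Limitations] The study takes the single sentence as its unit of data and does not consider indicator values that span multiple sentences. [Conclusions] The method can effectively identify the actual values of numerical indicators in single sentences. It does not require a large manually annotated corpus, and when it is transferred to other domains without additional processing, system performance does not degrade noticeably, so it has some practical value.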

Key words: Numerical Indicator    Actual Value    Template Discovery    Distant Supervision
Abstract

[Objective] This paper aims to identify the actual values of numerical indicators in scientific literature. [Methods] First, we analyzed the shortest syntactic-tree path between the indicator entity and the numeric entity. Then, we used distant supervision to learn the syntactic and descriptive characteristics of numerical indicator sentences. Third, we created templates for four types of value relations: “more than”, “less than”, “equal”, and “times”. Finally, we computed the actual values of the indicators. [Results] We evaluated the proposed method in the fields of climate change and astronomy. The F-values were 82.35% and 77.55%, respectively, which are above the average of related studies. [Limitations] We did not investigate indicator values that span multiple sentences. [Conclusions] The proposed method can effectively identify the actual values of numerical indicators.

Key words: Numerical Indicator    Actual Value    Template Recognition    Distant Supervision
Received: 2017-11-05      Published: 2018-02-05
CLC number: G250.76
Cite this article:
Guo Shaoqing, Le Xiaoqiu. Identifying Actual Value of Numerical Indicator from Scientific Paper[J]. Data Analysis and Knowledge Discovery, 2018, 2(1): 21-28.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.1091      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I1/21
  Examples of indicator value sentences
Value relation type  Example
Equal to             …annual precipitation measured in this study is 734mm…
Greater than         …temperature has risen by about 5 ℃ above yesterday…
Less than            …CO2 concentration is 5% lower than the PM10 concentration…
Multiple             …capacity of this bottle is 2/3 of the other one…
  Value relation types and examples
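  Overall workflow
  Shortest syntactic-tree path of an example sentence
  Distant supervision learning workflow
  POS tag sequence of the example sentence after truncation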
JJR (POS tag)  RBR (POS tag)  as…as (phrase)  of NN (word + POS tag)
Above          Over           Below           Under
Twice          Thrice         Half            More
Before         Behind         Ahead           ……
  Part of the "comparative word" dictionary
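As a rough illustration of how a comparative-word dictionary like the one above could drive a first guess at the value-relation type, here is a minimal sketch. It is a plain keyword lookup, not the template-learning procedure the paper describes. The grouping of cue words into relation types, the extra cues "higher" and "lower", the fraction pattern, and the function name `guess_relation` are all assumptions made for this example.

```python
import re

# Hypothetical grouping of cue words by relation type (an assumption for
# illustration; the source lists the words but not this grouping).
CUES = {
    "multiple": {"twice", "thrice", "half"},
    "greater": {"above", "over", "more", "higher"},
    "less": {"below", "under", "lower"},
}

# "2/3 of ..." or "40% of ..." is treated as a multiple/fraction cue.
FRACTION_OF = re.compile(r"\d+\s*/\s*\d+\s+of\b|\d+(?:\.\d+)?\s*%\s+of\b")


def guess_relation(sentence: str) -> str:
    """Return 'multiple', 'greater', 'less', or 'equal' for one sentence."""
    text = sentence.lower()
    if FRACTION_OF.search(text):
        return "multiple"
    tokens = set(re.findall(r"[a-z]+", text))
    # Check multiples first so "more than twice" is not read as "greater".
    for relation in ("multiple", "greater", "less"):
        if tokens & CUES[relation]:
            return relation
    # No comparative cue found: fall back to a plain "equal" statement.
    return "equal"


if __name__ == "__main__":
    for s in (
        "annual precipitation measured in this study is 734mm",
        "temperature has risen by about 5 ℃ above yesterday",
        "CO2 concentration is 5% lower than the PM10 concentration",
        "capacity of this bottle is 2/3 of the other one",
    ):
        print(guess_relation(s), "<-", s)
```

A lookup like this ignores syntax entirely, which is presumably why the paper pairs the dictionary with POS patterns and learned templates.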
Type                 Value relation                        Conversion
Greater than         %, times, and other multiple units    Baseline entity × ( 1 + value unit )
                     Other units                           ( Baseline entity + value ) unit
Less than            %, times, and other multiple units    Baseline entity × ( 1 - value unit )
                     Other units                           ( Baseline entity - value ) unit
Multiple / fraction  All units                             Baseline entity × value [%]
Equal to             All units                             Value unit
  Conversion rules
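The conversion rules in the table above can be sketched as a small function. This is an illustrative reading of the table, not the authors' implementation: the names `actual_value` and `as_factor`, the relation labels, and the choice to read "5 %" as the factor 0.05 and "2 times" as the factor 2 are assumptions.

```python
from typing import Optional

# Units that scale the baseline instead of being added to it (from the table).
MULTIPLE_UNITS = {"%", "times"}


def as_factor(value: float, unit: str) -> float:
    """Assumed reading: '5 %' becomes 0.05, any other unit keeps the value."""
    return value / 100.0 if unit == "%" else value


def actual_value(relation: str, value: float, unit: str = "",
                 baseline: Optional[float] = None) -> float:
    """Apply the conversion rule for one value-relation type."""
    if relation == "equal":
        return value                                   # Value unit
    if baseline is None:
        raise ValueError("a baseline value is required for this relation")
    if relation == "multiple":
        return baseline * as_factor(value, unit)       # Baseline × value [%]
    if relation not in ("greater", "less"):
        raise ValueError(f"unknown relation: {relation}")
    sign = 1 if relation == "greater" else -1
    if unit in MULTIPLE_UNITS:
        # Baseline × (1 ± value unit)
        return baseline * (1 + sign * as_factor(value, unit))
    return baseline + sign * value                     # (Baseline ± value) unit


if __name__ == "__main__":
    # "5% lower than the PM10 concentration", with a made-up baseline of 200
    print(actual_value("less", 5, "%", baseline=200.0))
    # "about 5 ℃ above yesterday", with a made-up baseline of 20 ℃
    print(actual_value("greater", 5, "℃", baseline=20.0))
```

The baseline values in the usage lines are invented for the demonstration. In practice the baseline entity's value would have to be resolved from the text first.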
Indicator             Unit      Indicator      Unit
Mass median diameter  mm        Survival rate  %
Vehicle speed         km·h⁻¹    Total weight   kg
Scattering angle      °         ……
  Part of the "indicator/unit" combinations
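One plausible way to use indicator/unit pairs like these as seeds for distant supervision is to label a candidate sentence as positive when it mentions a known indicator together with a number in the matching unit. The sketch below shows that idea only. The source does not spell out its labelling rule, so the function name `weak_label`, the matching logic, and the restriction to three of the listed pairs are assumptions.

```python
import re
from typing import Optional, Tuple

# Seed pairs taken from the table above (only the plain-ASCII units are used).
KNOWN_PAIRS = {
    "mass median diameter": "mm",
    "survival rate": "%",
    "total weight": "kg",
}


def weak_label(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (indicator, unit) if the sentence looks like a positive example."""
    text = sentence.lower()
    for indicator, unit in KNOWN_PAIRS.items():
        if indicator not in text:
            continue
        # A number followed by the expected unit, e.g. "85%" or "12 kg".
        pattern = r"\d+(?:\.\d+)?\s*" + re.escape(unit) + r"(?![a-z])"
        if re.search(pattern, text):
            return indicator, unit
    return None


if __name__ == "__main__":
    print(weak_label("The survival rate was 85% after two weeks."))
    print(weak_label("Mass median diameter was discussed."))
```

Sentences labelled this way are noisy, which is the usual trade-off of distant supervision: no manual annotation, at the cost of some false positives.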
Template                 Frequency  Support    Template                 Frequency  Support
NN|NP|PP[of]|NP|CD       1591       65.61%     NN|NP|VP[be]|PP|NP|CD    766        53.75%
NN|NP|PP[between]|NP|CD  228        9.41%      NN|NP|VP|PP[from]|NP|CD  510        35.79%
  Part of the syntactic-tree path templates
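To make the template notation concrete, the following sketch turns a path of parse-tree nodes into a template string and checks it against the templates listed above. The node representation, the function names, and the `min_support` threshold of 0.3 are hypothetical. Only the four template strings and their frequency and support figures come from the table.

```python
from typing import Optional, Sequence, Tuple

# (frequency, support) for the templates listed in the table above.
TEMPLATES = {
    "NN|NP|PP[of]|NP|CD": (1591, 0.6561),
    "NN|NP|PP[between]|NP|CD": (228, 0.0941),
    "NN|NP|VP[be]|PP|NP|CD": (766, 0.5375),
    "NN|NP|VP|PP[from]|NP|CD": (510, 0.3579),
}

# One node on the path: (constituent label, optional lexical head).
Node = Tuple[str, Optional[str]]


def path_to_template(path: Sequence[Node]) -> str:
    """Join node labels with '|', writing a lexical head as LABEL[head]."""
    return "|".join(f"{label}[{head}]" if head else label
                    for label, head in path)


def matches_known_template(path: Sequence[Node],
                           min_support: float = 0.3) -> bool:
    """True if the path equals a listed template with enough support."""
    entry = TEMPLATES.get(path_to_template(path))
    return entry is not None and entry[1] >= min_support


if __name__ == "__main__":
    # Path from an indicator noun (NN) to a number (CD) through "of".
    path = [("NN", None), ("NP", None), ("PP", "of"), ("NP", None), ("CD", None)]
    print(path_to_template(path), matches_known_template(path))
```

Extracting the path itself requires a constituency parser. That step is left out here so the example stays self-contained.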
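Pipeline                                    Precision  Recall   F-value
(1) Original identification pipeline        78.15%     81.21%   79.11%
(2) Clause checking added to (1)            85.31%     75.62%   80.18%
(3) Common templates added to (1) and (2)   84.01%     80.76%   82.35%
  Analysis of optimization results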
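Assuming the F-value reported here is the balanced F1 score (the harmonic mean of precision and recall), it can be computed as in the sketch below. The function name `f_value` is chosen for this example, and the usage line takes the precision and recall of pipeline (3) from the table above.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Pipeline (3): precision 84.01%, recall 80.76%
    print(f"{f_value(0.8401, 0.8076):.2%}")  # prints 82.35%
```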