[Objective] This paper aims to identify the actual values of numerical indicators in scientific literature. [Methods] First, we analyzed the shortest path tree between the indicator and the numeric entities. Second, we used distant supervision to learn the syntactic and descriptive characteristics of sentences containing numerical indicators. Third, we created four types of relation templates: “more than”, “less than”, “equal” and “times”. Finally, we obtained the actual values of these indicators. [Results] We evaluated the proposed method in the fields of climate change and astronomy. The F-values were 82.35% and 77.55%, respectively, which were above the average of related studies. [Limitations] We did not investigate indicator values expressed across multiple sentences. [Conclusions] The proposed method can effectively identify the actual values of numerical indicators.
Guo Shaoqing, Le Xiaoqiu. Identifying Actual Value of Numerical Indicator from Scientific Paper[J]. Data Analysis and Knowledge Discovery, 2018, 2(1): 21-28.
Relation type            Example
Equal relation           …annual precipitation measured in this study is 734mm…
Greater-than relation    …temperature has risen by about 5 ℃ above yesterday…
Less-than relation       …CO2 concentration is 5% lower than the PM10 concentration…
Multiple relation        …capacity of this bottle is 2/3 of the other one…
Value relation types and examples
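Overall workflow
Minimal syntax tree of the example sentence
Distant supervision learning workflow
POS tag sequence of the example sentence after truncation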
JJR (POS tag)    RBR (POS tag)    as…as (phrase)    of NN (POS tag + phrase)
Above            Over             Below             Under
Twice            Thrice           Half              More
Before           Behind           Ahead             ……
Part of the “comparative word” dictionary
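A dictionary like this could be used to flag comparative cues in a POS-tagged sentence. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `find_comparative_cues` is hypothetical, the input is assumed to be a list of (word, Penn Treebank tag) pairs, the tags in the example are assigned by hand, and the phrase patterns "as…as" and "of NN" are left out for brevity.

```python
# Hypothetical sketch: flag comparative cues using the dictionary entries above.
# Only the listed words and the JJR / RBR tags are covered; the phrase patterns
# "as...as" and "of NN" are deliberately omitted.

COMPARATIVE_TAGS = {"JJR", "RBR"}  # comparative adjective / comparative adverb
COMPARATIVE_WORDS = {
    "above", "over", "below", "under", "twice", "thrice",
    "half", "more", "before", "behind", "ahead",
}


def find_comparative_cues(tagged_tokens):
    """Return (index, lowercased word) for every token that is a comparative cue."""
    cues = []
    for index, (word, tag) in enumerate(tagged_tokens):
        lowered = word.lower()
        if tag in COMPARATIVE_TAGS or lowered in COMPARATIVE_WORDS:
            cues.append((index, lowered))
    return cues


# Example sentence from the relation table, with hand-assigned (assumed) tags.
sentence = [
    ("CO2", "NN"), ("concentration", "NN"), ("is", "VBZ"), ("5", "CD"),
    ("%", "NN"), ("lower", "JJR"), ("than", "IN"), ("the", "DT"),
    ("PM10", "NN"), ("concentration", "NN"),
]
print(find_comparative_cues(sentence))  # [(5, 'lower')]
```

A sentence with no cue returns an empty list, which a pipeline could treat as a candidate for the "equal" relation; that routing is an assumption of this sketch.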
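Type                      Value relation (unit)                        Conversion
Greater-than type         Multiplicative units such as % and times     Baseline entity × ( 1 + value unit )
                          Other units                                  ( Baseline entity + value ) unit
Less-than type            Multiplicative units such as % and times     Baseline entity × ( 1 - value unit )
                          Other units                                  ( Baseline entity - value ) unit
Multiple/fraction type    All units                                    Baseline entity × value [%]
Equal type                All units                                    Value unit
Conversion relations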
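The conversion rules in the table above can be illustrated with a short sketch. It is written under assumptions and is not the authors' code: the function name `actual_value`, the relation labels, the scale factors for "%" and "times", and the baseline numbers in the examples are all hypothetical.

```python
# Hypothetical sketch of the conversion rules: resolve an indicator's value
# from the relation type, the stated value and unit, and a baseline entity.

MULTIPLICATIVE_UNITS = {"%": 0.01, "times": 1.0}  # assumed scale factors


def actual_value(relation, value, unit=None, baseline=None):
    """Return the resolved number.

    relation: "equal", "greater", "less" or "multiple" (labels assumed here).
    baseline: numeric value of the baseline entity; required unless "equal".
    """
    if relation == "equal":
        return float(value)                      # Value unit
    if relation not in ("greater", "less", "multiple"):
        raise ValueError("unknown relation: %r" % (relation,))
    if baseline is None:
        raise ValueError("relation %r needs a baseline entity" % (relation,))

    scale = MULTIPLICATIVE_UNITS.get(unit)       # None for ordinary units
    if relation == "multiple":
        factor = value * scale if scale is not None else value
        return baseline * factor                 # Baseline × value [%]

    sign = 1.0 if relation == "greater" else -1.0
    if scale is not None:
        return baseline * (1.0 + sign * value * scale)  # Baseline × (1 ± value unit)
    return baseline + sign * value                      # (Baseline ± value) unit


# Illustrative calls; the baselines (20, 100, 900) are made-up numbers.
print(actual_value("equal", 734, "mm"))
print(actual_value("greater", 5, "℃", baseline=20))
print(actual_value("less", 5, "%", baseline=100))
print(actual_value("multiple", 2 / 3, None, baseline=900))
```

For multiplicative units the result is expressed in the baseline entity's unit, while for other units it stays in the stated unit; the sketch returns only the number and leaves unit tracking to the caller.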
Indicator               Unit
Mass median diameter    mm
Survival rate           %
Vehicle speed           km·h⁻¹
Total weight            kg
Scattering angle        °
……
Part of the “indicator/unit” combinations
Template                   Frequency    Support
NN|NP|PP[of]|NP|CD         1591         65.61%
NN|NP|VP[be]|PP|NP|CD      766          53.75%
NN|NP|PP[between]|NP|CD    228          9.41%
NN|NP|VP|PP[from]|NP|CD    510          35.79%
…                          …            …
Part of the syntax tree path templates
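One way to apply such path templates is to compare them with the syntax tree path between an indicator noun and a number. The sketch below is an illustration under assumptions rather than the method used in the paper: it assumes that each template is a sequence of node labels separated by "|", that a bracketed word such as "[of]" constrains the head word at that node, and that a candidate path is already available as a list of (label, head word) pairs. The names `parse_template` and `matches` are hypothetical.

```python
import re

# Hypothetical sketch: match a candidate syntax tree path against a template
# such as "NN|NP|PP[of]|NP|CD". The reading of "[word]" as a head-word
# constraint is an assumption of this example.

_NODE = re.compile(r"^([A-Z$]+)(?:\[(\w+)\])?$")


def parse_template(template):
    """Split a template into a list of (label, required_word_or_None) pairs."""
    nodes = []
    for part in template.split("|"):
        found = _NODE.match(part)
        if found is None:
            raise ValueError("unrecognised template node: %r" % (part,))
        nodes.append((found.group(1), found.group(2)))
    return nodes


def matches(template, path):
    """Return True if path, a list of (label, head_word_or_None), fits the template."""
    nodes = parse_template(template)
    if len(nodes) != len(path):
        return False
    for (want_label, want_word), (label, word) in zip(nodes, path):
        if want_label != label:
            return False
        if want_word is not None and want_word != (word or "").lower():
            return False
    return True


# A hand-built path for a phrase of the form "<indicator> of <number>".
candidate = [("NN", None), ("NP", None), ("PP", "of"), ("NP", None), ("CD", None)]
print(matches("NN|NP|PP[of]|NP|CD", candidate))       # True
print(matches("NN|NP|PP[between]|NP|CD", candidate))  # False
```

Exact matching keeps the sketch simple; a real pipeline might prefer ranking templates by their support values, which this example does not attempt.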
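Process                                    Precision    Recall    F-value
(1) Original recognition process           78.15%       81.21%    79.11%
(2) Adding clause judgment to (1)          85.31%       75.62%    80.18%
(3) Adding common templates to (1)(2)      84.01%       80.76%    82.35%
Analysis of optimization results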
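For readers who want to relate the three columns, the sketch below computes an F-value from precision and recall. It assumes the balanced F1 measure (the harmonic mean of precision and recall); the source does not state which formula was used, and the function name `f_value` is hypothetical. Only row (3) is checked here, so other rows may not reproduce to the last digit.

```python
# Hypothetical sketch: balanced F-value (harmonic mean of precision and recall).
# Whether the table uses exactly this formula is an assumption.


def f_value(precision, recall):
    """Return 2PR / (P + R), or 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Row (3) of the table: precision 84.01%, recall 80.76%.
print(round(f_value(84.01, 80.76), 2))  # 82.35
```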