Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (1): 131-138    DOI: 10.11925/infotech.2096-3467.2019.0943
Research Paper
Retrieving Scientific Documents with Formula Description Structure and Word Embedding*
Xinyu Zai, Xuedong Tian
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract

[Objective] This study proposes a scientific document retrieval method that combines formula matching with text ranking, addressing the challenges posed by mathematical expressions. [Methods] First, we parsed mathematical expressions with the formula description structure (FDS) to obtain their structural information, which supports retrieval of scientific documents by mathematical expression. Meanwhile, we used a word embedding model to obtain word vectors for the query keywords and for the documents. Finally, we ranked the documents by the similarity between the two sets of word vectors. [Results] The recall and precision of the proposed method were 0.77 and 0.63, which were 24.2% and 23.5% higher than those of traditional scientific document retrieval methods. [Limitations] The method only handles query expressions in LaTeX format, which limits the expression description formats it supports. [Conclusions] The proposed model, which combines mathematical expressions with document keywords, improves the performance of scientific document retrieval.

Key words: Technical Document Retrieval; Formula Description Structure; Word Embedding
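Received: 2019-08-13
CLC Number: TP311
Funding: *This work is supported by the National Natural Science Foundation of China project "Research on Mathematical Expression Resource Acquisition and Retrieval Models" (61375075); the Natural Science Foundation of Hebei Province project "Document Ranking of Mathematical Retrieval Results with Hesitant Fuzzy Logic" (F2019201329); and the Hebei Provincial Department of Education Key Science and Technology Research Project for Higher Education Institutions "Image Retrieval of Chinese Characters in Ancient Books Based on Hesitant Fuzzy Sets" (ZD2017208).
Corresponding author: Xuedong Tian     E-mail: xuedong_tian@126.com
Cite this article:
Xinyu Zai, Xuedong Tian. Retrieving Scientific Documents with Formula Description Structure and Word Embedding[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 131-138. DOI: 10.11925/infotech.2096-3467.2019.0943.
Link to this article: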
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0943
Figure 1  The CBOW model
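As a rough illustration of what a CBOW (continuous bag-of-words) model computes, the sketch below runs one forward step: the context word vectors are averaged and the result is scored against every word in the vocabulary. The function name `cbow_forward`, the toy matrices, and the use of a full softmax are assumptions made for this example. Practical word2vec training usually replaces the full softmax with hierarchical softmax or negative sampling, and this page does not describe the authors' training setup.

```python
import math

def cbow_forward(context_ids, input_emb, output_emb):
    """Return (hidden, probs) for one CBOW prediction step."""
    dim = len(input_emb[0])
    # Projection layer: average the input vectors of the context words.
    hidden = [sum(input_emb[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Output layer: dot product of the hidden vector with every output vector.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_emb]
    # Softmax over the vocabulary (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

# Toy vocabulary of 4 words with 2-dimensional vectors (made-up numbers).
INPUT_EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
OUTPUT_EMB = [[0.2, 0.1], [0.0, 0.3], [0.9, 0.9], [-0.4, 0.1]]

# Predict the centre word from context words 0 and 1.
hidden, probs = cbow_forward([0, 1], INPUT_EMB, OUTPUT_EMB)
predicted = probs.index(max(probs))
```

After training, the rows of the input matrix serve as the word vectors that are later compared by similarity.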
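| Query expression | LaTeX | FDS structure |
| --- | --- | --- |
| $2^{q}$ | 2^{q} | ^\1 |
| $a \times b$ | a \times b | \times\0, |
| $\frac{a}{b}$ | \frac{a}{b} | \frac\0 |
| $\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}$ | \frac{-b \pm \sqrt{b^{2}-4ac}}{2a} | \frac\0,-\1,\pm\1,\sqrt\1,^\3,-\2, |
| $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} | \frac\0,\sqrt\1,^\1,-\1,\frac\1,(\2,-\2,)\2,^\3,^\3, |
Table 1  Parsing results for selected expressions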
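The FDS strings in Table 1 appear to pair each operator with a nesting level. The sketch below encodes one rule that is consistent with the rows of the table: the level is the current brace depth, a superscript operator `^` is recorded one level deeper, and operands (letters, digits, Greek-letter commands) are skipped. This rule is inferred from the table alone and is not the authors' parsing algorithm. The names `fds_levels` and `fds_string` and the operator sets are hypothetical.

```python
import re

# Operators kept in the structure string; every other token is treated as an
# operand and skipped. Both sets are assumptions chosen to cover Table 1.
OPERATOR_COMMANDS = {"\\frac", "\\sqrt", "\\times", "\\pm"}
OPERATOR_SYMBOLS = {"+", "-", "(", ")"}

# A token is a LaTeX command, a brace or script marker, or any other
# single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}^_]|[^\s\\{}^_]")

def fds_levels(latex):
    """Return a list of (operator, level) pairs for a LaTeX expression."""
    depth = 0
    pairs = []
    for tok in TOKEN.findall(latex):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        elif tok == "^":
            # Assumed rule: a superscript sits one level above its base.
            pairs.append((tok, depth + 1))
        elif tok in OPERATOR_COMMANDS or tok in OPERATOR_SYMBOLS:
            pairs.append((tok, depth))
    return pairs

def fds_string(latex):
    """Format the pairs as operator\\level, in the style of Table 1."""
    return "".join(f"{sym}\\{lvl}," for sym, lvl in fds_levels(latex))

quadratic = fds_string(r"\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}")
```

Under these assumptions the quadratic formula yields `\frac\0,-\1,\pm\1,\sqrt\1,^\3,-\2,`, which matches its row in Table 1. Real LaTeX input would need a fuller tokenizer, for example for `\left`, `\right`, and subscripts.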
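| EXPID | EXP | FileName (html) |
| --- | --- | --- |
| 57113 | $p(x, \mu, \sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Computer stereo vision |
| 127297 | $P_{G}(Z)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Gaussian noise |
| 206443 | $p(x \mid \mu, \sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Maximum entropy probability distribution |
| 232616 | $f(x \mid \mu, \sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Normal distribution |
| 79135 | $g(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Differential entropy |
Table 2  Selected retrieval results for the expression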
| Keyword | WordScore |
| --- | --- |
| folded normal distribution | 7.37 |
| folded distribution | 5.03 |
| normal distribution | 4.78 |
| random variable | 4.33 |
| differential equations | 4.02 |
Table 3  Extracted keyword phrases
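This page does not state how the WordScore values in Table 3 are computed. As one plausible illustration, the sketch below scores candidate phrases with a RAKE-style scheme: text is split into phrases at stopwords and punctuation, each word is scored as degree divided by frequency, and a phrase score is the sum of its word scores. The choice of RAKE, the small stopword list, and the sample sentence are all assumptions, so the numbers it produces are not expected to match Table 3.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"a", "an", "the", "of", "is", "are", "and", "to", "in",
             "with", "for", "by", "that"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, the word included.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

SAMPLE = ("The folded normal distribution is a distribution "
          "related to the normal distribution.")
ranked_phrases = rake_scores(SAMPLE)
```

On the sample sentence the longest phrase, "folded normal distribution", ranks first, because every word in a phrase contributes its own score to the phrase total.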
| No. | Document (html) | Similarity |
| --- | --- | --- |
| 1 | Folded normal distribution | 0.93 |
| 2 | Normal gamma distribution | 0.86 |
| 3 | Gaussian distribution | 0.80 |
| 4 | Exponential family | 0.75 |
| 5 | Stochastic simulation | 0.74 |
| 6 | Logit normal distribution | 0.73 |
| 7 | Normal distribution | 0.72 |
| 8 | Kernel (statistics) | 0.68 |
| 9 | Distributed random | 0.67 |
| 10 | Slice sampling | 0.66 |
Table 4  Top-10 document ranking results
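A ranking like the one in Table 4 can be produced by comparing a query vector with one vector per document. The sketch below is a minimal version that assumes each side is represented by the mean of its word vectors and that similarity is the cosine of the two means. Both choices are assumptions, since this page only says that documents are ranked by the similarity between query keyword vectors and document word vectors. The embedding values and helper names are made up for the example.

```python
import math

def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query_words, docs, embeddings):
    """Return (title, similarity) pairs sorted from most to least similar."""
    query_vec = mean_vector(query_words, embeddings)
    ranked = []
    for title, words in docs.items():
        doc_vec = mean_vector(words, embeddings)
        if query_vec is None or doc_vec is None:
            score = 0.0
        else:
            score = cosine(query_vec, doc_vec)
        ranked.append((title, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Made-up 2-dimensional embeddings, for illustration only.
EMB = {"folded": [0.9, 0.1], "normal": [0.8, 0.3], "distribution": [0.7, 0.4],
       "slice": [0.0, 1.0], "sampling": [0.1, 0.9]}
DOCS = {"Folded normal distribution": ["folded", "normal", "distribution"],
        "Slice sampling": ["slice", "sampling"]}

ranking = rank_documents(["folded", "normal", "distribution"], DOCS, EMB)
```

In a full system the candidate documents would first be narrowed by formula matching, and only those candidates would be ranked this way.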
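| System | Formula | Document (html) |
| --- | --- | --- |
| SearchOnMath | $p(k)=\frac{\lambda^{k}}{k!}e^{-\lambda}$ | Variance |
| SearchOnMath | $f(k;\lambda)=\Pr(X=k)=\frac{\lambda^{k}e^{-\lambda}}{k!}$ | Poisson distribution |
| SearchOnMath | $p(d)=\frac{\lambda^{d}}{d!}e^{-\lambda}$ | Long tail traffic |
| SearchOnMath | pn=i=1T1nMinie-Mi | Constellation model |
| SearchOnMath | Q(ψn)(x,p)=x2+p2n!e-x2+p2π | Quantum harmonic oscillator |
| Proposed system | $p(k)=\frac{\lambda^{k}}{k!}e^{-\lambda}$ | Variance |
| Proposed system | $f(k;\lambda)=\Pr(X=k)=\frac{\lambda^{k}e^{-\lambda}}{k!}$ | Poisson distribution |
| Proposed system | $p(N=k)=\frac{\lambda^{k}}{k!}e^{-n}$ | Poisson games |
| Proposed system | $P_{n}(t)=\frac{t^{k}}{n!}e^{-t}$ | Poisson wavelet |
| Proposed system | $\frac{\lambda^{k}}{k!}e^{-\lambda}=\frac{5^{k}}{k!}e^{-5}$ | Poisson limit theorem |
Table 5  Top-5 retrieval results of the two systems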
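| No. | Formula | Keyword |
| --- | --- | --- |
| 1 | $\frac{y}{t}$ | fractional |
| 2 | $2^{q}$ | exponential |
| 3 | $\sin\theta$ | sine function |
| 4 | $\cos x$ | cosine function |
| 5 | $\sqrt{a}$ | radical expression |
| 6 | $\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^{n}$ | limit theorem |
| 7 | $a^{2}+b^{2}=c^{2}$ | pythagorean theorem |
| 8 | $\frac{\lambda^{k}}{k!}e^{-\lambda}$ | poisson |
| 9 | $\frac{-b\pm\sqrt{b^{2}-4ac}}{2a}$ | quadratic formula |
| 10 | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | normal distribution |
Table 6  Document list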
Figure 2  Similarity comparison between the proposed method and SearchOnMath
Figure 3  Recall and precision of system retrieval
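Recall and precision, the two measures named in Figure 3, are computed per query from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard set-based definitions. The function name and the document identifiers are hypothetical, and the page does not say how relevance was judged in the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three documents relevant,
# two of which were returned.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"],
                                     ["d1", "d3", "d5"])
```

System-level figures such as those in the abstract would typically be obtained by averaging these per-query values over the test queries.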