
## Retrieving Scientific Documents with Formula Description Structure and Word Embedding

Zai Xinyu, Tian Xuedong

School of Cyber Security and Computer, Hebei University, Baoding 071002, China

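Funding: This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Mathematical Expression Resource Acquisition and Retrieval Models" (No. 61375075), the Natural Science Foundation of Hebei Province project "Document Ranking of Mathematical Retrieval Results with Hesitant Fuzzy Logic" (No. F2019201329), and the Key Science and Technology Research Project of Higher Education Institutions of Hebei Province, Hebei Provincial Department of Education, "Ancient Chinese Character Image Retrieval Based on Hesitant Fuzzy Sets" (No. ZD2017208).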

Received: 2019-08-13   Revised: 2019-11-07   Online: 2020-01-25

Abstract

[Objective] This study proposes a scientific document retrieval method that combines formula matching with text ranking. [Methods] The formula description structure (FDS) is used to parse mathematical expressions and obtain their structural information, which supports retrieval of scientific documents by mathematical expression. In parallel, a word embedding model projects the query keywords and the document words into word vectors, and the document set is ranked by the similarity between the two sets of vectors. [Results] In the experiments, the recall and precision of the method were 0.77 and 0.63, which are 24.2% and 23.5% higher, respectively, than those of traditional scientific document retrieval methods. [Limitations] The method handles only query expressions in LaTeX format, so it is limited with respect to expression description formats. [Conclusions] A retrieval model that combines mathematical expressions with document keywords improves the performance of scientific document retrieval.

Keywords: Technical Document Retrieval; Formula Description Structure; Word Embedding

Zai Xinyu. Retrieving Scientific Documents with Formula Description Structure and Word Embedding. Data Analysis and Knowledge Discovery[J], 2020, 4(1): 131-138 doi:10.11925/infotech.2096-3467.2019.0943

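## 1 Introduction

(1) Ordinary document retrieval systems cannot handle the complex two-dimensional structure of mathematical expressions well.

(2) Different types of literature describe the same object or topic with different vocabulary or features, which creates semantic differences and leads to inaccurate retrieval results [3].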

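## 3 Research Framework and Methods

### 3.1 Research Framework and Design

(1) Use the FDS parsing algorithm to retrieve the set of documents that match the query expression;

(2) Use the Word Embedding algorithm to obtain the word vectors of the query keyword set and the word vectors of the keyword set of the documents retrieved in the first part;

(3) Use the cosine distance to obtain the cosine similarity between the two groups of word vectors, and rank the documents by similarity value.

① Parse $Q$ with the FDS algorithm to obtain the structure code $H$ of the query expression $Q$;

② Using the information in $H$, retrieve from the database the expressions $E_S \in E$ that match this structure, and obtain the corresponding document set $F_S \in F$ and keyword set $K_S \in K$;

③ Use the Word Embedding model to obtain the word vectors $V_{K_S}$ and $V_P$ of $K_S$ and $P$, respectively;

④ Use the cosine distance to compute the cosine similarity between $V_{K_S}$ and $V_P$;

⑤ Rank $F_S$ by cosine similarity value and output the ranked result.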
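Steps ③ to ⑤ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the averaging of each keyword set into a single vector, and the toy 3-dimensional vectors are all assumptions. The source states only that the cosine similarity between the two sets of word vectors drives the ranking.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors (an assumed pooling choice)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_vectors, doc_keyword_vectors):
    """Return (document, similarity) pairs sorted by descending similarity.

    query_vectors: word vectors of the query keywords.
    doc_keyword_vectors: dict mapping a document name to its keyword vectors.
    """
    q = mean_vector(query_vectors)
    scored = [(name, cosine_similarity(q, mean_vector(vecs)))
              for name, vecs in doc_keyword_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = {
    "doc_a": [[1.0, 1.0, 0.0]],
    "doc_b": [[0.0, 0.0, 1.0]],
    "doc_c": [[1.0, 0.0, 0.0]],
}
ranking = rank_documents(query, docs)
```

Averaging is only one way to compare two sets of vectors; pairwise maximum or weighted schemes would fit the same description.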

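### 3.2 Building the Mathematical Expression Index

FDS is a structure for describing the format of mathematical expressions. By extracting the skeleton of an expression, it ignores the influence of operators on retrieval, which helps improve the efficiency of mathematical expression retrieval. Each symbol of a mathematical expression has four attributes in FDS, as shown in Eq. (1) [5].

$\text{CString}\ \mathit{str} + \text{int}\ \mathit{level} + \text{int}\ \mathit{operator} + \text{int}\ \mathit{flag} \quad (1)$

FDS builds the mathematical expression index from the extracted expressions and stores it in the data table Exp(Id, fileId, fdsCode, expInfo), where Id is the serial number of the expression, fileId is the number of the document that contains the expression, fdsCode is the FDS structure code of the expression, and expInfo is the expression itself. From the documents that contain the expressions, a document index table Fileinfo(fileId, filename) is built, where fileId is the document number and filename is the name of the document.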
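The two index tables can be sketched with SQLite. The table and column names come from the text; the column types, the index on `fdsCode`, the exact-match lookup, and the placeholder rows are assumptions made for illustration, and the source does not name the database engine that was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fileinfo (
    fileId   INTEGER PRIMARY KEY,   -- document number
    filename TEXT NOT NULL          -- document name
);
CREATE TABLE Exp (
    Id      INTEGER PRIMARY KEY,    -- expression serial number
    fileId  INTEGER NOT NULL REFERENCES Fileinfo(fileId),
    fdsCode TEXT NOT NULL,          -- FDS structure code
    expInfo TEXT NOT NULL           -- the expression itself
);
CREATE INDEX idx_exp_fds ON Exp(fdsCode);  -- assumed, to speed up code lookups
""")

# Placeholder rows; the codes are stand-ins, not real FDS codes.
conn.executemany("INSERT INTO Fileinfo VALUES (?, ?)", [
    (1, "example_a.html"),
    (2, "example_b.html"),
])
conn.executemany("INSERT INTO Exp VALUES (?, ?, ?, ?)", [
    (1, 1, "CODE_X", "a \\times b"),
    (2, 2, "CODE_X", "c \\times d"),
    (3, 2, "CODE_Y", "\\frac{a}{b}"),
])

def files_matching(code):
    """Names of the documents holding an expression with the given structure code."""
    rows = conn.execute(
        "SELECT DISTINCT f.filename FROM Exp e "
        "JOIN Fileinfo f ON f.fileId = e.fileId "
        "WHERE e.fdsCode = ? ORDER BY f.filename",
        (code,),
    )
    return [row[0] for row in rows]

matches = files_matching("CODE_X")
```

Keeping the file names in a separate table means a document with many expressions stores its name once, and a code lookup joins back to it.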

### 3.3 Word Embedding Training for Scientific Documents

$P=\sum_{n=1}^{N}\log p(S_n \mid C_n)$

Fig.1   CBOW Model

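$F=\sum_{n=1}^{N}\left[\log\left(1+e^{-\gamma(S_n,C_n)}\right)+\sum_{m\in M_{C_n}}\log\left(1+e^{\gamma(m,C_n)}\right)\right]$

$\gamma(S,C)=\frac{1}{|C|}\sum_{S'\in C}u_{S'}^{\mathrm{T}}v_S$

$v_C=\sum_{A\in p}d_A\odot u_{n+A}$

$WE=v_S+\frac{1}{|N|}\sum_{n\in N}x_n$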
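The formulas in this subsection read as a CBOW-style objective: a score between a word and its context, a negative-sampling loss built from that score, and a word vector augmented with averaged extra vectors. A minimal pure-Python sketch under that reading follows. The function names and toy numbers are hypothetical, the score is assumed to be the mean dot product between context vectors and the target vector, and treating $x_n$ as subword (n-gram) vectors is an assumption not confirmed by the text.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(target_vec, context_vecs):
    """gamma(S, C): mean dot product of the context vectors u with the target vector v."""
    return sum(dot(u, target_vec) for u in context_vecs) / len(context_vecs)

def negative_sampling_loss(target_vec, negative_vecs, context_vecs):
    """One term of F: log(1 + e^-gamma(S, C)) plus log(1 + e^gamma(m, C)) per negative m."""
    loss = math.log1p(math.exp(-score(target_vec, context_vecs)))
    for neg in negative_vecs:
        loss += math.log1p(math.exp(score(neg, context_vecs)))
    return loss

def word_with_subwords(word_vec, extra_vecs):
    """WE: the word vector plus the mean of the extra (assumed subword) vectors."""
    n = len(extra_vecs)
    mean = [sum(component) / n for component in zip(*extra_vecs)]
    return [w + m for w, m in zip(word_vec, mean)]

# Toy values, invented for illustration only.
context = [[2.0, 0.0], [0.0, 2.0]]
example_score = score([1.0, 0.0], context)
example_vector = word_with_subwords([1.0, 1.0], [[1.0, 0.0], [3.0, 2.0]])
```

The loss falls as the true word scores higher against its context and as sampled negative words score lower, which is the behavior the objective is meant to reward.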

## 4 Experimental Process and Result Analysis

### 4.1 System Experiment

(1) FDS-Based Scientific Document Retrieval

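Table 1  Partial Expression Parsing Results

| Expression | LaTeX | FDS code |
|---|---|---|
| $2^{q}$ | `2^{q}` | `^\1` |
| $a \times b$ | `a \times b` | `\times\0,` |
| $\frac{a}{b}$ | `\frac{a}{b}` | `\frac\0` |
| $\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}$ | `\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}` | `\frac\0,-\1,\pm\1,\sqrt\1,^\3,-\2,` |
| $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | `\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}` | `\frac\0,\sqrt\1,^\1,-\1,\frac\1,(\2,-\2,)\2,^\3,^\3,` |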
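The FDS codes in Table 1 pair each structural symbol with a level number. The toy function below is not the FDS parsing algorithm, which this section does not spell out. It is a sketch reverse-engineered from the table rows under three assumptions: the level equals the brace-nesting depth of the symbol, a superscript `^` is recorded one level deeper than its base, and only the symbols in the hypothetical `OPERATORS` set are kept. The LaTeX inputs are hand-cleaned versions of the table entries.

```python
import re

# Symbols kept in the skeleton (an assumed set; operands such as a, b, \sigma are dropped).
OPERATORS = {"\\frac", "\\sqrt", "\\times", "\\pm", "+", "-", "=", "(", ")", "^"}

# A LaTeX command, a single structural character, or a single letter or digit.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}^()+\-=]|[A-Za-z0-9]")

def skeleton(latex):
    """Return (symbol, level) pairs for the operators in a LaTeX expression."""
    depth = 0
    pairs = []
    for token in TOKEN.findall(latex):
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
        elif token == "^":
            pairs.append((token, depth + 1))  # superscript sits one level deeper
        elif token in OPERATORS:
            pairs.append((token, depth))
    return pairs

quadratic = skeleton(r"\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}")
```

Under these assumptions the quadratic formula yields `\frac` at level 0, then `-`, `\pm`, and `\sqrt` at level 1, `^` at level 3, and `-` at level 2, the same sequence as the corresponding Table 1 row.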

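Table 2  Partial Search Results of Expression

| EXPID | EXP | FileName (html) |
|---|---|---|
| 57113 | $p(x,\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Computer stereo vision |
| 127297 | $P_G(Z)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Gaussian noise |
| 206443 | $p(x \mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Maximum entropy probability distribution |
| 232616 | $f(x \mid \mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Normal distribution |
| 79135 | $g(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Differential entropy |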

(2) Word-Embedding-Based Scientific Document Ranking

Table 3  Keyword Group Crawl Results

| Keyword | Word Score |
|---|---|
| folded normal distribution | 7.37 |
| folded distribution | 5.03 |
| normal distribution | 4.78 |
| random variable | 4.33 |
| differential equations | 4.02 |

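Table 4  Document Sorting Top-10 Results

| Rank | Document | Similarity score |
|---|---|---|
| 1 | Folded normal distribution | 0.93 |
| 2 | Normal gamma distribution | 0.86 |
| 3 | Gaussian distribution | 0.80 |
| 4 | Exponential family | 0.75 |
| 5 | Stochastic simulation | 0.74 |
| 6 | Logit normal distribution | 0.73 |
| 7 | Normal distribution | 0.72 |
| 8 | Kernel (statistics) | 0.68 |
| 9 | Distributed random | 0.67 |
| 10 | Slice sampling | 0.66 |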

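### 4.2 Comparative Experiments

SearchOnMath is a retrieval system based on mathematical information, proposed by Oliveira et al. [20]. It can retrieve scientific papers, English Wikipedia documents, and similar content by formula or by keyword.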

$p(k)=\frac{\lambda^{k}}{k!}e^{-\lambda}$

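Table 5  Top-5 Search Results for Both Systems

| System | Expression | Document |
|---|---|---|
| SearchOnMath | $p(k)=\frac{\lambda^{k}}{k!}e^{-\lambda}$ | Variance |
| SearchOnMath | $f(k;\lambda)=\Pr(X=k)=\frac{\lambda^{k}e^{-\lambda}}{k!}$ | Poisson distribution |
| SearchOnMath | $p(d)=\frac{\lambda^{d}}{d!}e^{-\lambda}$ | Long tail traffic |
| SearchOnMath | $p_n=\prod_{i=1}^{T}\frac{1}{n}M_i^{n_i}e^{-M_i}$ | Constellation model |
| SearchOnMath | $Q(\psi_n)(x,p)=\frac{(x^{2}+p^{2})^{n}}{n!}\frac{e^{-(x^{2}+p^{2})}}{\pi}$ | Quantum harmonic oscillator |
| Our method | $f(k;\lambda)=\Pr(X=k)=\frac{\lambda^{k}e^{-\lambda}}{k!}$ | Poisson distribution |
| Our method | $p(N=k)=\frac{\lambda^{k}}{k!}e^{-n}$ | Poisson games |
| Our method | $P_n(t)=\frac{t^{k}}{n!}e^{-t}$ | Poisson wavelet |
| Our method | $\frac{\lambda^{k}}{k!}e^{-\lambda}=\frac{5^{k}}{k!}e^{-5}$ | Poisson limit theorem |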

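Table 6  Document List

| No. | Expression | Keyword |
|---|---|---|
| 1 | $\frac{y}{t}$ | fractional |
| 2 | $2^{q}$ | exponential |
| 3 | $\sin\theta$ | sine function |
| 4 | $\cos x$ | cosine function |
| 5 | $\sqrt{a}$ | radical expression |
| 6 | $\lim_{n\to\infty}\left(1+\frac{1}{n}\right)^{n}$ | limit theorem |
| 7 | $a^{2}+b^{2}=c^{2}$ | pythagorean theorem |
| 8 | $\frac{\lambda^{k}}{k!}e^{-\lambda}$ | poisson |
| 9 | $\frac{-b\pm\sqrt{b^{2}-4ac}}{2a}$ | quadratic formula |
| 10 | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | normal distribution |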

Fig.2   Comparison of Similarity Between Our Method and SearchOnMath

Fig.3   Retrieval Recall and Precision
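Recall and precision, the two measures reported in Fig.3, follow their standard set-based definitions. The sketch below shows only those definitions; the function name and the document identifiers are hypothetical, and nothing here reproduces the paper's evaluation data.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 for an empty set)."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Invented identifiers: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

With two hits, precision is 2/4 and recall is 2/3.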

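## Supporting Data

[1] Zai Xinyu. Dataset.rar. Experimental dataset.

[2] Zai Xinyu. SmartStoplist.txt. Stop word list.

[3] Zai Xinyu. Model.rar. Word vector model and formula parsing algorithm.

[4] Zai Xinyu. Index.rar. Index set for formula parsing and keyword extraction.

[5] Zai Xinyu. Result.rar. Experimental result set.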

## References

[1] Shahid A, Afzal M T. Section-Wise Indexing and Retrieval of Research Articles[J]. Cluster Computing, 2018, 21(1): 481-492.

[2] Yuan K, Gao L, Wang Y, et al. A Mathematical Information Retrieval System Based on RankBoost[C]// Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. ACM, 2016: 259-260.

[3] Li Xiangdong, Ruan Tao, Liu Kang. Automatic Classification of Documents from Wikipedia[J]. Data Analysis and Knowledge Discovery, 2017, 1(10): 43-52. (in Chinese)

[4] Tian X, Yang S, Li X, et al. An Indexing Method of Mathematical Expression Retrieval[C]// Proceedings of the 3rd International Conference on Computer Science and Network Technology. IEEE, 2013: 574-578.

[5] Yang S Q, Tian X D. A Maintenance Algorithm of FDS Based Mathematical Expression Index[C]// Proceedings of the 2014 International Conference on Machine Learning and Cybernetics. IEEE, 2014, 2: 888-892.

[6] Mikolov T, Grave E, Bojanowski P, et al. Advances in Pre-Training Distributed Word Representations[OL]. arXiv Preprint, arXiv:1712.09405.

[7] Hu X, Gao L, Lin X, et al. Wikimirs: A Mathematical Information Retrieval System for Wikipedia[C]// Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2013: 11-20.

[8] Wang Y, Gao L, Wang S, et al. WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document[C]// Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2015: 173-182.

[9] Pineau D C. Math-Aware Search Engines: Physics Applications and Overview[OL]. arXiv Preprint, arXiv:1609.03457.

[10] Dhar S, Roy S, Das S K. A Critical Survey of Mathematical Search Engines[C]// Proceedings of the 2nd International Conference on Computational Intelligence, Communications, and Business Analytics. Springer, Singapore, 2018: 193-207.

[11] Zhou Nan, Tian Xuedong. Analyzing and Indexing Method on LaTeX Formulae[J]. Journal of Computer Applications, 2016, 36(3): 833-836. (in Chinese)

[12] Sojka P, Líška M. Indexing and Searching Mathematics in Digital Libraries[C]// Proceedings of the 2011 International Conference on Intelligent Computer Mathematics. Springer, Berlin, Heidelberg, 2011: 228-243.

[13] Pathak A, Pakray P, Sarkar S, et al. Mathirs: Retrieval System for Scientific Documents[J]. Computación y Sistemas, 2017, 21(2): 253-265.

[14] Zanibbi R, Aizawa A, Kohlhase M, et al. [C]// Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies. 2016: 299-308.

[15] Kristianto G Y, Goran Topic, Aizawa A. MCAT Math Retrieval System for NTCIR-12 MathIR Task[C]// Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies. 2016: 323-330.

[16] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv:1301.3781.

[17] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 3111-3119.

[18] Mnih A, Kavukcuoglu K. Learning Word Embeddings Efficiently with Noise-Contrastive Estimation[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 2265-2273.

[19] Rose S, Engel D, Cramer N, et al. Automatic Keyword Extraction from Individual Documents[A]// Text Mining: Applications and Theory[M]. John Wiley & Sons, 2010: 1-20.

[20] Oliveira R M, Gonzaga F B, Barbosa V C, et al. A Distributed System for SearchOnMath Based on the Microsoft BizSpark Program[OL]. arXiv Preprint, arXiv:1711.04189.
