Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (1): 131-138    DOI: 10.11925/infotech.2096-3467.2019.0943
Retrieving Scientific Documents with Formula Description Structure and Word Embedding
Xinyu Zai, Xuedong Tian
School of Cyber Security and Computer, Hebei University, Baoding 071002, China

[Objective] This study proposes a scientific document retrieval method that combines formula matching with text ranking to address the challenges posed by mathematical expressions. [Methods] First, we parsed the mathematical expressions with the formula description structure (FDS) analysis algorithm to obtain their structural information, and retrieved scientific documents by matching expressions. Meanwhile, we obtained word vectors for the query keywords and for the documents using a word embedding model. Finally, we ranked the documents by the similarity between the two sets of word vectors. [Results] The recall and precision of the new model were 0.77 and 0.63, which were 24.2% and 23.5% higher than those of traditional scientific document retrieval methods. [Limitations] Our method only handles expressions in LaTeX format. [Conclusions] The proposed model, which combines formulas with document keywords, improves the performance of scientific document retrieval.

Key words: Technical Document Retrieval; Formula Description Structure; Word Embedding
Received: 13 August 2019      Published: 14 March 2020
ZTFLH:  TP311  
Corresponding Author: Xuedong Tian

Cite this article:

Xinyu Zai, Xuedong Tian. Retrieving Scientific Documents with Formula Description Structure and Word Embedding. Data Analysis and Knowledge Discovery, 2020, 4(1): 131-138.


CBOW Model
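| Query expression | LaTeX structure | FDS structure |
| --- | --- | --- |
| $2^{q}$ | 2^{q} | ^\1 |
| $a \times b$ | a \times b | \times\0, |
| $\frac{a}{b}$ | \frac{a}{b} | \frac\0 |
| $\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}$ | \frac{-b \pm \sqrt{b^{2}-4ac}}{2a} | \frac\0,-\1,\pm\1,\sqrt\1,^\3,-\2, |
| $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} | |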
Partial Expression Parsing Results
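The FDS strings above pair each operator with a number, but the parsing algorithm itself is not described on this page. As a rough illustration only, the following sketch tokenizes a LaTeX string and records the brace nesting depth at which each command or operator symbol occurs. The function name `latex_operators`, the set of operator symbols, and the depth convention are all assumptions made for this example; the output is not claimed to reproduce the authors' FDS encoding.

```python
import re

# A LaTeX command (e.g. \frac), a brace, or any other single non-space character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[{}]|\S")

def latex_operators(latex: str) -> list[tuple[str, int]]:
    """Return (token, brace_depth) for every command or operator symbol.

    Hypothetical helper: brace depth is used here as a simple stand-in
    for structural level and is not the FDS level definition.
    """
    depth = 0
    found = []
    for token in TOKEN_RE.findall(latex):
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
        elif token.startswith("\\") or token in "^_+-=":
            # Keep commands and a few operator symbols; skip plain operands.
            found.append((token, depth))
    return found

quadratic = latex_operators(r"\frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}")
for token, depth in quadratic:
    print(token, depth)
```

Operands such as `b` or `2` are dropped on purpose, so two expressions that differ only in variable names produce the same operator sequence, which is the kind of structural matching the abstract alludes to.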
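| EXPID | EXP | FileName (html) |
| --- | --- | --- |
| 57113 | $p(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Computer stereo vision |
| 127297 | $P_G(z) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Gaussian noise |
| 206443 | $p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Maximum entropy probability distribution |
| 232616 | $f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Normal distribution |
| 79135 | $g(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | Differential entropy |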
Partial Search Results of Expression
| Keyword | WordScore |
| --- | --- |
| folded normal distribution | 7.37 |
| folded distribution | 5.03 |
| normal distribution | 4.78 |
| random variable | 4.33 |
| differential equations | 4.02 |
Keyword Group Crawl Results
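This page does not state how the WordScore values are computed. One common way to score multi-word keywords is the RAKE scheme, in which each word is scored as degree divided by frequency and a phrase receives the sum of its word scores. The sketch below is a minimal version of that scheme, offered as an assumption about the kind of scoring involved and not as the authors' procedure; the stopword list, the function names, and the sample sentence are invented for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "the", "of", "is", "and", "in", "to", "for", "with"}

def candidate_phrases(text: str) -> list[list[str]]:
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text: str) -> dict[str, float]:
    """Score each candidate phrase as the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

sample = ("The folded normal distribution is related to the normal "
          "distribution of a random variable.")
scores = rake_scores(sample)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{phrase}\t{score:.2f}")
```

Longer phrases built from words that tend to appear inside long phrases score highest, which matches the ordering in the table, where the three-word keyword ranks first.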
| No. | Document (html) | Similarity |
| --- | --- | --- |
| 1 | Folded normal distribution | 0.93 |
| 2 | Normal gamma distribution | 0.86 |
| 3 | Gaussian distribution | 0.80 |
| 4 | Exponential family | 0.75 |
| 5 | Stochastic simulation | 0.74 |
| 6 | Logit normal distribution | 0.73 |
| 7 | Normal distribution | 0.72 |
| 8 | Kernel (statistics) | 0.68 |
| 9 | Distributed random | 0.67 |
| 10 | Slice sampling | 0.66 |
Document Sorting Top-10 Results
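The abstract states that documents are ranked by the similarity between the query keyword vectors and the document word vectors. A minimal sketch of such a ranking follows, assuming that each side is represented by the mean of its word vectors and that similarity is the cosine of the angle between the two means. Both choices are assumptions, as are the function names and the toy two-dimensional embeddings, which stand in for vectors from a trained model.

```python
import math

def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_words, docs, embeddings):
    """Return (document, similarity) pairs sorted from most to least similar."""
    q = mean_vector(query_words, embeddings)
    scored = [(name, cosine(q, mean_vector(words, embeddings)))
              for name, words in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy embeddings, invented for illustration only.
embeddings = {
    "normal": [1.0, 0.0],
    "distribution": [0.8, 0.2],
    "folded": [0.9, 0.1],
    "sampling": [0.2, 0.8],
    "slice": [0.0, 1.0],
}
docs = {
    "Folded normal distribution": ["folded", "normal", "distribution"],
    "Slice sampling": ["slice", "sampling"],
}
ranking = rank_documents(["folded", "normal", "distribution"], docs, embeddings)
for name, similarity in ranking:
    print(f"{name}\t{similarity:.2f}")
```

With real embeddings the document keywords would come from the keyword extraction step, and the sorted list would correspond to the similarity column in the table above.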
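| System | Formula | Document (html) |
| --- | --- | --- |
| SearchOnMath | $p(k) = \frac{\lambda^{k}}{k!} e^{-\lambda}$ | Variance |
| SearchOnMath | $f(k;\lambda) = \Pr(X=k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$ | Poisson distribution |
| SearchOnMath | $p(d) = \frac{\lambda^{d}}{d!} e^{-\lambda}$ | Long tail traffic |
| SearchOnMath | $p(n) = \prod_{i=1}^{T} \frac{1}{n_i!} M_i^{n_i} e^{-M_i}$ | Constellation model |
| SearchOnMath | $Q(\psi_n)(x,p) = \frac{(x^{2}+p^{2})^{n}}{n!} \frac{e^{-(x^{2}+p^{2})}}{\pi}$ | Quantum harmonic oscillator |
| Our system | $p(k) = \frac{\lambda^{k}}{k!} e^{-\lambda}$ | Variance |
| Our system | $f(k;\lambda) = \Pr(X=k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$ | Poisson distribution |
| Our system | $p(N=k) = \frac{\lambda^{k}}{k!} e^{-n}$ | Poisson games |
| Our system | $P_n(t) = \frac{t^{k}}{n!} e^{-t}$ | Poisson wavelet |
| Our system | $\frac{\lambda^{k}}{k!} e^{-\lambda} = \frac{5^{k}}{k!} e^{-5}$ | Poisson limit theorem |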
Top-5 Search Results for Both Systems
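| No. | Formula | Keyword |
| --- | --- | --- |
| 1 | $\frac{y}{t}$ | fractional |
| 2 | $2^{q}$ | exponential |
| 3 | $\sin\theta$ | sine function |
| 4 | $\cos x$ | cosine function |
| 5 | $\sqrt{a}$ | radical expression |
| 6 | $\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^{n}$ | limit theorem |
| 7 | $a^{2} + b^{2} = c^{2}$ | pythagorean theorem |
| 8 | $\frac{\lambda^{k}}{k!} e^{-\lambda}$ | poisson |
| 9 | $\frac{-b \pm \sqrt{b^{2}-4ac}}{2a}$ | quadratic formula |
| 10 | $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ | normal distribution |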
Document List
Comparison of Similarity Between Our Method and SearchOnMath
Retrieval Recall and Precision
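For reference, recall and precision as reported in the abstract are usually computed per query from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions; the function name and the document identifiers are made up for the example, and the page does not say how relevance was judged in the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three documents relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scores over a query set, such as the ten queries listed above, would typically be averaged across queries.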