[Objective] This study proposes a scientific document retrieval method that combines formula matching with text ranking to address the challenges posed by mathematical expressions. [Methods] First, we used a formula description structure analysis algorithm to parse the mathematical expressions. Then, we extracted the formula structure information and retrieved scientific documents based on the mathematical expressions. Meanwhile, we obtained word vectors for the query keywords and the document keywords with a word embedding model. Finally, we ranked the documents by the similarity between the two sets of word vectors. [Results] The recall and precision of the new model were 0.77 and 0.63, respectively, which were 24.2% and 23.5% higher than those of traditional scientific document retrieval methods. [Limitations] Our method only handles expressions in LaTeX format. [Conclusions] The proposed model, which combines formulas and document keywords, improves the performance of scientific document retrieval.
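The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that documents have already been retrieved by formula matching, that each keyword set is represented by the mean of its word vectors, and that similarity is measured with cosine similarity. The function names (`mean_vector`, `cosine`, `rank_documents`) and the toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def mean_vector(words, embeddings):
    """Average the vectors of the words found in the embedding table (assumed pooling)."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_keywords, candidates, embeddings):
    """Rank formula-matched candidate documents by keyword-vector similarity.

    candidates maps a document id to that document's keyword list.
    """
    query_vec = mean_vector(query_keywords, embeddings)
    scored = []
    for doc_id, keywords in candidates.items():
        doc_vec = mean_vector(keywords, embeddings)
        if query_vec is None or doc_vec is None:
            score = 0.0  # no usable vector on one side: treat as no similarity
        else:
            score = cosine(query_vec, doc_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional embeddings, invented for this example only.
embeddings = {
    "integral": [1.0, 0.0],
    "derivative": [0.9, 0.1],
    "matrix": [0.0, 1.0],
}
candidates = {"doc_a": ["matrix"], "doc_b": ["derivative"]}
ranking = rank_documents(["integral"], candidates, embeddings)
```

In this toy setting, `doc_b` ranks above `doc_a` because "derivative" lies closer to "integral" in the invented embedding space. A real system would load pretrained word vectors instead of the hand-written table.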
宰新宇,田学东. 基于公式描述结构和词嵌入的科技文档检索方法*[J]. 数据分析与知识发现, 2020, 4(1): 131-138.
Xinyu Zai, Xuedong Tian. Retrieving Scientific Documents with Formula Description Structure and Word Embedding. Data Analysis and Knowledge Discovery, 2020, 4(1): 131-138.