Retrieving Scientific Documents with Formula Description Structure and Word Embedding
Xinyu Zai, Xuedong Tian
School of Cyber Security and Computer, Hebei University, Baoding 071002, China

Abstract [Objective] This study proposes a scientific document retrieval method that combines formula matching with text ranking to address the challenges posed by mathematical expressions. [Methods] First, we analyzed the mathematical expressions with a formula description structure (FDS) analysis algorithm to obtain their structural information, and retrieved scientific documents based on the expressions. Meanwhile, we obtained word vectors for the query keywords and for the documents with a word embedding model. Finally, we ranked the documents by the similarity between the query and document word vectors. [Results] The recall and precision of the proposed method were 0.77 and 0.63, which were 24.2% and 23.5% higher than those of traditional scientific document retrieval methods. [Limitations] Our method only handles expressions in LaTeX format. [Conclusions] The proposed model, which combines formulas and document keywords, improves the performance of scientific document retrieval.
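
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny embedding table, the averaging of keyword vectors, the use of cosine similarity, and the names `EMBEDDINGS`, `mean_vector`, `cosine`, and `rank_documents` are all assumptions made for illustration. The candidate documents are assumed to have already been retrieved by formula matching.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# A real system would load vectors from a trained word embedding model.
EMBEDDINGS = {
    "fourier": [0.9, 0.1, 0.0],
    "transform": [0.8, 0.2, 0.1],
    "integral": [0.6, 0.4, 0.1],
    "protein": [0.0, 0.1, 0.9],
    "folding": [0.1, 0.0, 0.8],
}


def mean_vector(words):
    """Average the embeddings of the known words; return None if none are known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_keywords, candidates):
    """Sort candidate documents by similarity to the query keywords.

    candidates maps a document id to its keyword list; the documents are
    assumed to have passed the formula-matching stage already.
    """
    query_vec = mean_vector(query_keywords)
    scored = []
    for doc_id, keywords in candidates.items():
        doc_vec = mean_vector(keywords)
        if query_vec is None or doc_vec is None:
            score = 0.0
        else:
            score = cosine(query_vec, doc_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = {
    "doc_a": ["protein", "folding"],
    "doc_b": ["fourier", "integral"],
}
ranking = rank_documents(["fourier", "transform"], candidates)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this toy setup the document whose keywords lie closer to the query in embedding space is listed first. Averaging keyword vectors is only one simple way to obtain a document-level vector; the paper may aggregate them differently.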

Received: 13 August 2019
Published: 14 March 2020

Corresponding Author: Xuedong Tian
E-mail: xuedong_tian@126.com