Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 7: 118-126     https://doi.org/10.11925/infotech.2096-3467.2019.1294
Research Paper
Retrieving Mathematical Expressions Based on Hesitant Fuzzy Weight *
Xu Yicong, Tian Xuedong, Li Xinfu, Yang Fang, Shi Qingxuan
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract

[Objective] This paper proposes a method that retrieves, from a large collection of mathematical expressions, the expressions similar to a query expression and ranks the retrieved results. [Methods] First, we extract the characteristic subformulas of each mathematical expression and use hesitant fuzzy set (HFS) theory to compute a weight for each subformula. Second, we sum the weights of the subformulas that belong to the same expression to obtain the similarity score between an indexed expression and the query. Finally, we rank the retrieved results in descending order of similarity score. [Results] Evaluated in terms of retrieval time and similarity, the proposed method retrieves efficiently and returns accurate results. The ranking reaches a highest NDCG value of 0.88, which indicates that the ranking is reasonable. [Limitations] The ranking method is not fully oriented toward semantic retrieval of mathematical expressions. [Conclusions] Using hesitant fuzzy sets to compute subformula weights retrieves mathematical expressions that share the same structural features more accurately.

Keywords: Mathematical Expression Retrieval; Hesitant Fuzzy Sets (HFSs); Subformula Weight; Similarity Score
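Received: 2019-12-02      Published: 2020-07-25
Chinese Library Classification (CLC): TP393, G250
Funding: *This work is supported by the National Natural Science Foundation of China project "Research on Mathematical Expression Resource Acquisition and Retrieval Models" (61375075); the Natural Science Foundation of Hebei Province project "Document Ranking of Mathematical Retrieval Results with Hesitant Fuzzy Logic" (F2019201329); and the Hebei Provincial Department of Education Key Project of Science and Technology Research in Higher Education Institutions of Hebei Province "Ancient Chinese Character Image Retrieval Based on Hesitant Fuzzy Sets" (ZD2017208).
Corresponding author: Tian Xuedong     E-mail: xuedong_tian@126.com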
Cite this article:
Xu Yicong, Tian Xuedong, Li Xinfu, Yang Fang, Shi Qingxuan. Retrieving Mathematical Expressions Based on Hesitant Fuzzy Weight[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 118-126.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1294      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I7/118
Fig.1  Overall workflow of mathematical expression retrieval
| subf | md5-subf | {ul,un,ulevel} |
| --- | --- | --- |
| $x=\frac{-b \pm \sqrt{b^{2}-4 a c}}{2 a}$ | B91DA5EC3DE8F0E3 | {1.000,1.000,1.000} |
| $\frac{-b \pm \sqrt{b^{2}-4 a c}}{2 a}$ | 82E1ED17885C94AD | {0.941,0.857,1.000} |
| $-b \pm \sqrt{b^{2}-4 a c}$ | EE207D7D9882D3AB | {0.588,0.714,0.796} |
| $\sqrt{b^{2}-4 a c}$ | 9099FE6F46BF69A5 | {0.441,0.600,0.796} |
| $b^{2}-4 a c$ | 965B906F4467622C | {0.265,0.286,0.693} |
| $b^{2}$ | F9DD6D7A16C781A2 | {0.147,0.143,0.693} |
| $-b$ | A55F0819B2F990F6 | {0.059,0.143,0.796} |

Table 1  Description of expression information
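Each row of Table 1 pairs a characteristic subformula with three membership values {ul,un,ulevel}. As a minimal sketch of how such a triple could be collapsed into one subformula weight, the Python below takes the arithmetic mean of the three values, which is the score function commonly used for hesitant fuzzy elements. This aggregation is an assumption for illustration, because the weighting formula itself does not appear on this page. The names `TABLE1` and `hfe_score` are hypothetical.

```python
# Membership values {ul, un, ulevel} copied from Table 1, keyed by md5-subf.
TABLE1 = {
    "B91DA5EC3DE8F0E3": (1.000, 1.000, 1.000),
    "82E1ED17885C94AD": (0.941, 0.857, 1.000),
    "EE207D7D9882D3AB": (0.588, 0.714, 0.796),
    "9099FE6F46BF69A5": (0.441, 0.600, 0.796),
    "965B906F4467622C": (0.265, 0.286, 0.693),
    "F9DD6D7A16C781A2": (0.147, 0.143, 0.693),
    "A55F0819B2F990F6": (0.059, 0.143, 0.796),
}


def hfe_score(memberships):
    """Assumed aggregation: arithmetic mean of the membership values."""
    return sum(memberships) / len(memberships)


# One scalar weight per characteristic subformula.
weights = {key: hfe_score(values) for key, values in TABLE1.items()}

for key, weight in weights.items():
    print(f"{key}  {weight:.3f}")
```

Under this assumed aggregation the full expression receives weight 1 and smaller subformulas receive smaller weights, although the ordering need not be strictly monotonic in subformula size.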
Fig.2  Inverted index structure of characteristic subformulas
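The figure itself is not reproduced here, so the following Python is only a sketch of what an inverted index over characteristic subformulas might look like. It assumes that each subformula is keyed by a 16-character MD5 string, matching the md5-subf column of Table 1, and that each posting stores an expression identifier with a precomputed weight. Taking the middle 16 hexadecimal characters of the digest is an assumed convention, and the string normalization applied before hashing is unknown, so these keys are not expected to reproduce the values in Table 1. The names `subformula_key` and `build_index` and the toy weights are hypothetical.

```python
import hashlib
from collections import defaultdict


def subformula_key(latex: str) -> str:
    """Assumed key: middle 16 hex characters of the MD5 digest, upper-cased."""
    digest = hashlib.md5(latex.encode("utf-8")).hexdigest()
    return digest[8:24].upper()


def build_index(expressions):
    """Map each subformula key to a postings list of (expression id, weight)."""
    index = defaultdict(list)
    for expr_id, subformulas in expressions.items():
        for latex, weight in subformulas:
            index[subformula_key(latex)].append((expr_id, weight))
    return index


# Toy collection: expression id -> [(subformula LaTeX, weight)].
# The weights are made up for illustration.
expressions = {
    "e1": [(r"b^{2}-4 a c", 0.41), (r"b^{2}", 0.33)],
    "e2": [(r"b^{2}", 0.50)],
}
index = build_index(expressions)
```

Keying postings by a fixed-length hash keeps lookups independent of subformula length, which is one plausible reason for storing an MD5 string next to each subformula.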
Fig.3  Retrieval and matching process for mathematical expressions
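The abstract describes matching as summing the weights of the subformulas that belong to the same expression and ranking by the resulting score. A minimal sketch of that accumulation follows, assuming the index maps subformula keys to (expression id, weight) postings. Whether the score uses indexed weights, query-side weights, or a combination of both is not stated on this page, so using the indexed weight alone is an assumption. The function `rank` and the toy keys are hypothetical.

```python
from collections import defaultdict


def rank(query_keys, index):
    """Sum the weights of matched subformulas per expression, highest first."""
    scores = defaultdict(float)
    for key in set(query_keys):  # count each query subformula once
        for expr_id, weight in index.get(key, []):
            scores[expr_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index: subformula key -> [(expression id, weight)].
index = {
    "K1": [("e1", 0.9), ("e2", 0.4)],
    "K2": [("e1", 0.3)],
    "K3": [("e3", 0.7)],
}

# "K9" has no postings, so it contributes nothing to any score.
results = rank(["K1", "K2", "K9"], index)
for expr_id, score in results:
    print(expr_id, round(score, 3))
```

Expressions that share more, and more heavily weighted, subformulas with the query rise to the top, which matches the structural similarity emphasis stated in the conclusions.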
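| Environment | Configuration |
| --- | --- |
| CPU | Intel(R) Core(TM) i7-7700, 3.6 GHz |
| Memory | 8 GB |
| Operating system | Microsoft Windows 10 |
| Main development tools | Visual Studio 2017, C# |
| Mode | C/S |

Table 2  Development environment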
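| Documents | Mathematical expressions | Total index size (MB) | Build time (ms) |
| --- | --- | --- | --- |
| 1,024 | 1,908 | 0.47 | 0.501 |
| 10,240 | 115,797 | 24.92 | 15.411 |
| 20,480 | 251,018 | 55.13 | 32.603 |
| 31,741 | 391,955 | 82.96 | 79.957 |

Table 3  Index file size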
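| Mathematical expressions in index | Index file size (MB) | Index build time (ms) |
| --- | --- | --- |
| 1,000 | 0.13 | 0.070 |
| 10,000 | 3.78 | 1.641 |
| 100,000 | 48.21 | 26.540 |
| 138,539 | 74.99 | 33.909 |

Table 4  Index file size reported in reference [6]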
Fig.4  Retrieval time of different methods
Fig.5  Comparison of NDCG values at different top-k values
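For readers unfamiliar with the metric in Fig.5, the sketch below computes NDCG at a top-k cutoff in its common form, with linear gain and a log2(rank + 1) discount. Which gain variant and which relevance scale the authors used is not stated on this page, so both are assumptions, and the relevance judgments in the example are invented for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the top-k results divided by the ideal DCG@k."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance judgments for one ranked result list.
judged = [3, 2, 3, 0, 1]
score = ndcg_at_k(judged, k=5)
print(f"NDCG@5 = {score:.3f}")
```

A value of 1 means the returned order matches the ideal order of the judged results, so the reported maximum of 0.88 can be read as the ranking coming close to that ideal order.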
[1] Lin X Y, Gao L C, Hu X, et al. A Mathematics Retrieval System for Formulae in Layout Presentations[C] // Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 2014: 697-706.
[2] Mišutka J, Galamboš L. System Description: EgoMath2 as a Tool for Mathematical Searching on Wikipedia.org[C] //Proceedings of the 10th International Conference on Intelligent Computer Mathematics. 2011: 307-309.
[3] Sojka P, Líška M. Indexing and Searching Mathematics in Digital Libraries[C] // Proceedings of the 10th International Conference on Intelligent Computer Mathematics. 2011: 228-243.
[4] Hambasan R, Kohlhase M, Prodescu C C. MathWebSearch at NTCIR-11[C] //Proceedings of the 11th NTCIR Conference. 2014: 114-119.
[5] Zhou Nan, Tian Xuedong. Analyzing and Indexing Method on LaTeX Formulae[J]. Journal of Computer Applications, 2016,36(3):833-836, 842. (in Chinese)
[6] Zhou Nan. A Retrieval Model of Mathematical Expressions Based on Hierarchical Structures of Formulae[D]. Baoding: Hebei University, 2016. (in Chinese)
[7] Hu X, Gao L C, Lin X Y, et al. WikiMirs: A Mathematical Information Retrieval System for Wikipedia[C] //Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. 2013: 11-20.
[8] Wang Y H, Gao L C, Wang S M, et al. WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document[C] //Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015: 173-182.
[9] Stalnaker D, Zanibbi R. Math Expression Retrieval Using an Inverted Index over Symbol Pairs[C] //Proceedings of SPIE-IS&T Electronic Imaging. 2015,9402:940207.
[10] Xu Y X, Su W, Cheng M, et al. N-gram Index Structure Study for Semantic Based Mathematical Formula[C] // Proceedings of the 10th International Conference on Computational Intelligence and Security. 2014: 293-298.
[11] Wang Xiaolong. Research on Ontology-Based Mathematical Expression Retrieval Technologies[D]. Chongqing: Chongqing University, 2014. (in Chinese)
[12] Yang S Q, Tian X D. A Maintenance Algorithm of FDS Based Mathematical Expression Index[C] // Proceedings of the 2014 International Conference on Machine Learning and Cybernetics. 2014: 888-892.
[13] Xu Jianmin, Xu Caiyun. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas[J]. Data Analysis and Knowledge Discovery, 2018,2(10):103-109. (in Chinese)
[14] Li Xiameng, Pan Guangzhen. Message-digest Algorithm 5-IDEA Based Hybrid Encryption Algorithm[J]. Science Technology and Engineering, 2017,17(9):233-238. (in Chinese)
[15] Torra V. Hesitant Fuzzy Sets[J]. International Journal of Intelligent Systems, 2010,25(6):529-539.
[16] Torra V, Narukawa Y. On Hesitant Fuzzy Sets and Decision[C] //Proceedings of the 2009 IEEE International Conference on Fuzzy Systems. 2009: 1378-1382.
[17] Xu Z S, Xia M M. Distance and Similarity Measures for Hesitant Fuzzy Sets[J]. Information Sciences, 2011,181(11):2128-2138.
[18] Zhang Kaige. Research on the Ranking of Mathematical Retrieval Results Based on Hesitant Fuzzy Sets[D]. Baoding: Hebei University, 2017. (in Chinese)
[19] Jing Ke. Research on Math Query Language and Index in Web-based Math Search[D]. Lanzhou: Lanzhou University, 2009. (in Chinese)
[20] Xu Yuexia. N-gram Index Structure for Semantic Based Mathematical Formulas[D]. Lanzhou: Lanzhou University, 2015. (in Chinese)
[21] Jin X B, Geng G G, Xie G S, et al. Approximately Optimizing NDCG Using Pair-wise Loss[J]. Information Sciences, 2018,453:50-65. doi: 10.1016/j.ins.2018.04.033