Please wait a minute...
Advanced Search
数据分析与知识发现  2021, Vol. 5 Issue (6): 135-144     https://doi.org/10.11925/infotech.2096-3467.2020.1006
  研究论文 本期目录 | 过刊浏览 | 高级检索 |
一种面向科技文献元数据增量数据规范的多模式匹配算法*
董美1,2,常志军1,2,张润杰3()
1中国科学院文献情报中心 北京 100190
2中国科学院大学经济与管理学院图书情报与档案管理系 北京 100190
3南安普顿大学电子与计算机科学学院 南安普顿 SO17 1BJ
A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature
Dong Mei1,2,Chang Zhijun1,2,Zhang Runjie3()
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
全文: PDF (970 KB)   HTML ( 16
输出: BibTeX | EndNote (RIS)      
摘要 

【目的】 针对期刊文献元数据日增的小规模数据,设计一种基于Hash的多模式匹配算法,对其机构信息利用大规模的模式集进行规范化。【方法】 使用Hash定位模式串,减少对系统内存的占用;抽取模式串的首个单词/字结合Word跳步匹配,减少匹配次数,加大跳转幅度,从而提升多模式匹配的效率。【结果】 以CSCD机构库182万条数据作为模式集的实验中,该算法与Aho-Corasick(AC)算法对比,能够较为快速地构建模式集对应的字典;在字符集规模约为1万条时,有更优越的时间性能,尤其是英文语料下有9.39%时间性能的提升;与Wu-Manber(WM)算法相比,该算法不受最短模式串限制。【局限】 针对不同的模式集和字符集,需要对算法或数据进行调整;该算法及其拓展的无首词模式,均不适用于模式集较小、字符集较大的场景。【结论】 该算法可以应用于中文、英文、中英混合的文本,在模式集较大(106级)、字符集较小(1万左右)的情况下,有超越经典算法AC算法(0.08%-30.41%)和WM算法时间性能的表现。

服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
董美
常志军
张润杰
关键词 模式匹配数据规范化名称规范哈希算法    
Abstract

[Objective] This paper designs a multiple pattern matching algorithm to standardize the institutional information of sci-tech literature metadata. [Methods] First, we used the Hash function to locate the pattern strings and reduced the system memory usage. Then, we extracted the first words of the pattern strings, which were combined with word skipping matching. The new algorithm reduced the number of matches and increased the jump range, which improved the efficiency of multiple pattern matching. [Results] We examined our model with the CSCD’s institutional library as the pattern string set. Compared with the Aho-Corasick (AC) algorithm, our method quickly constructed the dictionary corresponding to the pattern string sets. When the data volume reached about 10 000, our model spent less time on the same tasks. For the English corpus, there was a 9.39% improvement in time performance. Compared with the Wu-Manber (WM) algorithm, our method was not restricted by the shortest pattern strings. [Limitations] The algorithm or data needs to be adjusted for different pattern strings and text strings. This algorithm and the extended headless mode are not suitable for small pattern string sets with large string sets. [Conclusions] The algorithm can be applied to Chinese, English, and Chinese-English mixed texts. The time performance of our algorithm is superior to the AC and WM algorithms in processing large pattern string set (106) and small string set (about 10,000).

Key wordsPattern Match    Data Standardization    Name Authority    Hash Algorithm
收稿日期: 2020-10-14      出版日期: 2021-06-29
ZTFLH:  TP391  
基金资助:*中国科学院文献情报能力建设项目(Y9100901)
通讯作者: 张润杰     E-mail: 2757809002@qq.com
引用本文:   
董美,常志军,张润杰. 一种面向科技文献元数据增量数据规范的多模式匹配算法*[J]. 数据分析与知识发现, 2021, 5(6): 135-144.
Dong Mei,Chang Zhijun,Zhang Runjie. A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature. Data Analysis and Knowledge Discovery, 2021, 5(6): 135-144.
链接本文:  
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1006      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I6/135
Fig.1  算法基本思路
Fig.2  字典存放键值对流程
Fig.3  T1模式匹配过程
first_word in FWHash start_pos stop_
pos
in Hash output
TRUE 0 1
2
3
4
5
6
7
8
FALSE
FALSE
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE



中山大学



中山大学岭南学院
FALSE
FALSE
FALSE
TRUE 4 5
6
7
8
FALSE
FALSE
FALSE
TRUE



岭南学院
FALSE
FALSE
FALSE
Table 1  T1的模式匹配过程
Fig.4  T2模式匹配过程
first_word in FWHash start_
pos
stop_
pos
in Hash Output
Longhna FALSE
Hospital FALSE
, FALSE
Shanghai TRUE 17 26
37
40
52
60
69
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE

Shanghai University



Shanghai University of Traditional Chinese
Medicine
University FALSE
of FALSE
Traditional FALSE
Chinese FALSE
Medicine FALSE
Table 2  T2的模式匹配过程
测试数据量/条 AC算法耗时/ms 本文算法耗时/ms 提升百分比
143 3 359.16 2 164.81 35.56%
1 430 3 364.56 2 298.49 31.69%
9 724 3 397.46 3 048.69 10.27%
10 010 3 389.46 3 071.28 9.39%
11 011 3 392.46 3 166.77 6.65%
12 012 3 395.56 3 307.59 2.59%
13 013 3 398.96 3 393.75 0.15%
14 300 3 402.66 3 391.73 0.32%
Table 3  英文语料算法时间性能对比
测试数据量/条 AC算法耗时/ms 本文算法耗时/ms 提升百分比
5 972 3 395.66 3 024.54 10.93%
9 000 3 408.46 3 245.55 4.78%
10 000 3 418.26 3 415.43 0.08%
11 945 3 436.26 4 017.76 -16.92%
Table 4  中英混合语料算法时间性能对比
Fig.5  本文算法与WM算法时间性能对比
[1] Wu X, Zhu X, Wu G Q, et al. Data Mining with Big Data[J]. IEEE Transactions on Knowledge & Data Engineering, 2013,26(1):97-107.
[2] 杨尚森. 网络管理与维护技术[M]. 北京: 电子工业出版社, 2007.
[2] (Yang Shangsen. Network Management and Maintenance Technology[M]. Beijing: Publishing House of Electronics Industry, 2007.)
[3] 蔡婷, 杨卫帅. 一种改进的字符串模式匹配算法[J]. 物联网技术, 2017,7(7):89-91,95.
[3] (Cai Ting, Yang Weishuai. An Improved String Pattern Matching Algorithm[J]. Internet of Things Technologies, 2017,7(7):89-91,95.)
[4] 刘沛骞, 冯晶晶. 一种改进的BM模式匹配算法[J]. 计算机工程, 2011,37(17):248-249.
[4] (Liu Peiqian, Feng Jingjing. Improved BM Pattern Matching Algorithm[J]. Computer Engineering, 2011,37(17):248-249.)
[5] 齐晖, 曹旻, 袁世忠. 模式匹配算法性能对比试验结果在入侵检测系统中的应用[J]. 河南科学, 2009,27(7):835-838.
[5] (Qi Hui, Cao Min, Yuan Shizhong. Testing for Contrasting Performance of Pattern Matching Algorithms and Its Application[J]. Henan Science, 2009,27(7):835-838.)
[6] Knuth D E, Morris J H, Pratt V R. Fast Pattern Matching in Strings[J]. Siam Journal on Computing, 1977,6(2):323-350.
doi: 10.1137/0206024
[7] 曹成宏, 雷迎科. 面向比特流的链路层未知帧分析技术综述[J]. 小型微型计算机系统, 2018,39(2):297-303.
[7] (Cao Chenghong, Lei Yingke. Survey of Bit-stream Oriented Link-layer Unknown Frame Analysis Techniques[J]. Journal of Chinese Computer Systems, 2018,39(2):297-303.)
[8] 屈正庚, 赵杰. 一种改进的高效多模式匹配算法[J]. 系统仿真技术, 2014,10(2):116-120, 139.
[8] (Qu Zhenggeng, Zhao Jie. An Improved and Efficient Algorithm for Multiple Patterns Matching[J]. System Simulation Technology, 2014,10(2):116-120,139.)
[9] Aho A V, Corasick M J. Efficient String Matching: An Aid to BibliograPhic Search[J]. Communications of the ACM, 1975,18(6):33-40.
[10] Wu S, Manber U. A Fast Algorithm for Multi-Pattern Searching[R]. Arizona:University of Arizona at Tuscon, 1994.
[11] 刘邦国, 陈庆春, 类先富. 一种面向PDF文本内容审查的高效多模式匹配算法[J]. 计算机应用研究, 2020,37(6):1755-1759.
[11] (Liu Bangguo, Chen Qingchun, Lei Xianfu. Efficient Multi-pattern Matching Algorithm for PDF Content Search[J]. Application Research of Computers, 2020,37(6):1755-1759.)
[12] 赵国锋, 叶飞, 姚永安, 等. 一种面向云中心网络入侵检测的多模式匹配算法[J]. 信息网络安全, 2018(1):52-57.
[12] (Zhao Guofeng, Ye Fei, Yao Yongan, et al. Design and Implementation of A Multi-pattern String Matching Algorithm in Cloud Center Network Intrusion Detection System[J]. Netinfo Security, 2018(1):52-57.)
[13] Baeza-Yates R A, Gonnet G H. A New Approach to Text Searching[J]. ACM SIGIR Forum, 1989,23(SI):168-175.
doi: 10.1145/75335.75352
[14] 朱俊. 多模式匹配算法研究[D]. 合肥:合肥工业大学, 2010.
[14] (Zhu Jun. Multiple Patterns Match Algorithm Research[D]. Hefei : Hefei University of Technology, 2010.)
[15] 巫喜红, 曾锋. AC多模式匹配算法研究[J]. 计算机工程, 2012,38(6):279-281.
[15] (Wu Xihong, Zeng Feng. Research on AC Multiple Pattern Matching Algorithm[J]. Computer Engineering, 2012,38(6):279-281.)
[16] 王培凤, 李莉. 基于Aho-Corasick算法的多模式匹配算法研究[J]. 计算机应用研究, 2011,28(4):1251-1253, 1259.
[16] (Wang Peifeng, Li Li. Research on Multi-Pattern Matching Algorithms Based on Aho-Corasick Algorithm[J]. Application Research of Computers, 2011,28(4):1251-1253, 1259.)
[17] Nieminen J, Kilpelinen P. Efficient Implementation of Aho-Corasick Pattern Matching Automata Using Unicode[J]. Software: Practice and Experience, 2010,37(6):669-690.
doi: 10.1002/(ISSN)1097-024X
[18] 李志东, 杨武, 张汝波, 等. 基于异构隐式存储的多模式匹配算法[J]. 通信学报, 2009,30(3):119-124.
[18] (Li Zhidong, Yang Wu, Zhang Rubo, et al. Multi-pattern Matching Algorithm Based on Heterogeneous Implicit Storage[J]. Journal on Communications, 2009,30(3):119-124.)
[19] 胡丽娟. 大规模多模式匹配算法的研究[D]. 西安:西安电子科技大学, 2017.
[19] (Hu Lijuan. Research on Large-Scale Multi-pattern Matching Algorithms[D]. Xi’an: Xidian University, 2017.)
[20] 陈永杰, 吾守尔·斯拉木, 于清. 一种基于Aho-Corasick算法改进的多模式匹配算法[J]. 现代电子技术, 2019,42(4):89-93.
[20] (Chen Yongjie, Wushour Silamu, Yu Qing. An Improved Multi-pattern Matching Algorithm Based on Aho-Corasick Algorithm[J]. Modern Electronics Technique, 2019,42(4):89-93.)
[21] 曹为政, 葛蒙蒙. 多模式匹配算法研究和优化[J]. 智能计算机与应用, 2018,8(2):129-133.
[21] (Cao Weizheng, Ge Mengmeng. Multi-pattern Matching Algorithm Research and Optimization[J]. Intelligent Computer and Applications, 2018,8(2):129-133.)
[22] 潘华伟. 网络防护墙中多模式匹配算法的研究[D]. 武汉:湖北工业大学, 2017.
[22] (Pan Huawei. The Research of Multi-pattern, Matching Algorithm in Network Firewall[D]. Wuhan: Hubei University of Technology, 2017.)
[23] Fredriksson K, Grabowski S. Average-Optimal String Matching[J]. Journal of Discrete Algorithms, 2009,7(4):579-594.
doi: 10.1016/j.jda.2008.09.001
[24] Coit C J, Staniford S, Mcalerney J. Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort[M]. Piscataway, NJ: IEEE Press, 2001.
[25] 董志鑫, 方滨兴. 基于AC_QS多模式匹配算法的优化研究[J]. 智能计算机与应用, 2017,7(5):100-103.
[25] (Dong Zhixin, Fang Binxing. Optimization Research of Multi Pattern Matching Algorithm Based on AC_QS[J]. Intelligent Computer and Applications, 2017,7(5):100-103.)
[26] 董志鑫. 面向千万级规模网络的多模式串匹配算法的优化研究[D]. 哈尔滨:哈尔滨工业大学, 2017.
[26] (Dong Zhixin. Multi-pattern String Matching Optimization for the Ten Million Scale Network[D]. Harbin: Harbin Institute of Technology, 2017.)
[27] 岳小伟, 詹瞻. 面向长模式串的改进型AC算法研究[J]. 网络安全技术与应用, 2018(4):40-42.
[27] (Yue Xiaowei, Zhan Zhan. Research on Improved AC Algorithm for Long Pattern Strings[J]. Network Security Technology & Application, 2018(4):40-42.)
[28] 夏念, 嵩天. 短规则有效的快速多模式匹配算法[J]. 计算机工程与应用, 2017,53(7):1-8.
[28] (Xia Nian, Song Tian. Short-rule-efficient Rapid Multi-pattern Matching Algorithm[J]. Computer Engineering and Applications, 2017,53(7):1-8.)
[29] Zhang W. An Improved Wu-Manber Multiple Patterns Matching Algorithm[C]// Proceedings of the 2016 IEEE International Conference on Electronic Information and Communication Technology. Piscataway, NJ: IEEE Press, 2016: 286-289.
[30] Saikrishna V, Rasool A, Khare N. Spam Filtering Through Multiple Pattern Bit Parallel String Matching Combining Shift AND & OR[J]. International Journal of Computer Applications, 2013,61(5):40-45.
[31] 孟旭东, 许强凯. 应用于Web服务器匹配算法的FPGA实现[J]. 计算机技术与发展, 2016,26(12):142-147,152.
[31] (Meng Xudong, Xu Qiangkai. Implementation of FPGA Applied to Web Server Matching Algorithm[J]. Computer Technology and Development, 2016,26(12):142-147,152.)
[32] 戴正华. 面向体系结构的串匹配算法优化研究[D]. 北京:中国科学院计算技术研究所, 2006.
[32] (Dai Zhenghua. Optimization of String Matching Algorithm Based on Computer Architecture[D]. Beijing: Institute of Computing Technology of Chinese Academy of Sciences, 2006.)
[33] 陈瑞, 顾乃杰, 叶鸿. 一种基于魂芯DSP的单模式位并行串匹配算法[J]. 计算机应用与软件, 2020,37(7):246-252.
[33] (Chen Rui, Gu Naijie, Ye Hong. A Single Mode Bit-Parallel String Matching Algorithm Based on HXDSP[J]. Computer Applications and Software, 2020,37(7):246-252.)
[34] Lubanovic B. Python语言及其应用[M]. 丁嘉瑞, 梁杰, 禹常隆译. 北京: 人民邮电出版社, 2016.
[34] (Lubanovic B. Introducing Python[M]. Translated by Ding Jiarui, Liang Jie, Yu Changlong. Beijing: Posts & Telecom Press, 2016.)
[35] Matthes E. Python编程:从入门到实践[M]. 袁国忠译. 北京: 人民邮电出版社, 2016.
[35] (Matthes E. Python Crash Course:A Hands-On, Project-Based Introducation to Programming[M]. Translated by Yuan Guozhong. Beijing: Posts & Telecom Press, 2016.)
[36] 中国科学引文数据库课题组. 中国科学引文数据库来源期刊的选择及其评价[J]. 中国科技期刊研究, 1997,8(4):17-21.
[36] (Chinese Science Citation Database Group. Selection and Evaluation of Source Journals of Chinese Science Citation Database[J]. Chinese Journal of Scientific and Technical Periodicals, 1997,8(4):17-21.)
[1] 张建勇,钱力,于倩倩,董智鹏,黄永文,刘建华,郭舒,王峰. 科研实体名称规范的研究与实践*[J]. 数据分析与知识发现, 2019, 3(1): 27-37.
[2] 孙海霞, 王蕾, 吴英杰, 华薇娜, 李军莲. 科技文献数据库中机构名称匹配策略研究*[J]. 数据分析与知识发现, 2018, 2(8): 88-97.
[3] 史礼婷,张骞,钟永恒,胡思思,李贞贞. 双向模式匹配在年鉴数据预处理平台中的应用[J]. 现代图书情报技术, 2016, 32(9): 88-94.
[4] 郝嘉树. 利用开放语义资源丰富个人名称规范数据——基于FOAF的方案设计[J]. 现代图书情报技术, 2016, 32(2): 75-82.
[5] 白海燕, 刘耀, 郭晓峰. 新型责任者标识系统ORCID的构建机制介绍[J]. 现代图书情报技术, 2015, 31(5): 8-14.
[6] 白海燕. ORCID在机构知识库中的整合介绍[J]. 现代图书情报技术, 2015, 31(3): 8-17.
[7] 姜春涛. 自动标注中文专利的引文信息[J]. 现代图书情报技术, 2015, 31(10): 81-87.
[8] 李树青,程国达,王维民. 基于加权XML模型的XML数据与DTD模式匹配*[J]. 现代图书情报技术, 2010, 26(1): 57-65.
[9] 贾美英,杨炳儒,郑德权,曹鸿强,杨靖,张练. 基于模式匹配的军事演习情报信息抽取*[J]. 现代图书情报技术, 2009, (9): 70-75.
[10] 陈金星,祝忠明. 责任者名称规范控制研究及进展*[J]. 现代图书情报技术, 2009, 25(12): 12-17.
[11] 张雅珊. 机读目录的模糊检索问题及改进[J]. 现代图书情报技术, 2008, 24(7): 91-95.
[12] 姚兴山. 基于Hash算法的中文分词的研究[J]. 现代图书情报技术, 2008, 24(3): 78-81.
[13] 王昊 . 基于层次模式匹配的命名实体识别模型[J]. 现代图书情报技术, 2007, 2(5): 62-68.
[14] 刘春红,李凤侠,杨慧. 清华大学图书馆名称规范数据的著录探讨[J]. 现代图书情报技术, 2005, 21(2): 67-70.
[15] 沈迪飞. 金融证券系统的数据基础建设和巨灵公司的实践[J]. 现代图书情报技术, 2002, 18(2): 79-82.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn