Please wait a minute...
Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 135-144    DOI: 10.11925/infotech.2096-3467.2020.1006
Current Issue | Archive | Adv Search |
A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature
Dong Mei1,2,Chang Zhijun1,2,Zhang Runjie3()
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
Download: PDF (970 KB)   HTML ( 16
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] This paper designs a multiple pattern matching algorithm to standardize the institutional information of sci-tech literature metadata. [Methods] First, we used the Hash function to locate the pattern strings and reduced the system memory usage. Then, we extracted the first words of the pattern strings, which were combined with word skipping matching. The new algorithm reduced the number of matches and increased the jump range, which improved the efficiency of multiple pattern matching. [Results] We examined our model with the CSCD’s institutional library as the pattern string set. Compared with the Aho-Corasick (AC) algorithm, our method quickly constructed the dictionary corresponding to the pattern string sets. When the data volume reached about 10 000, our model spent less time on the same tasks. For the English corpus, there was a 9.39% improvement in time performance. Compared with the Wu-Manber (WM) algorithm, our method was not restricted by the shortest pattern strings. [Limitations] The algorithm or data needs to be adjusted for different pattern strings and text strings. This algorithm and the extended headless mode are not suitable for small pattern string sets with large string sets. [Conclusions] The algorithm can be applied to Chinese, English, and Chinese-English mixed texts. The time performance of our algorithm is superior to the AC and WM algorithms in processing large pattern string set (106) and small string set (about 10,000).

Key wordsPattern Match      Data Standardization      Name Authority      Hash Algorithm     
Received: 14 October 2020      Published: 29 June 2021
ZTFLH:  TP391  
Fund:Literature and Information Capacity Building Special Project of Chinese Academy of Sciences(Y9100901)
Corresponding Authors: Zhang Runjie     E-mail: 2757809002@qq.com

Cite this article:

Dong Mei,Chang Zhijun,Zhang Runjie. A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature. Data Analysis and Knowledge Discovery, 2021, 5(6): 135-144.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1006     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I6/135

Algorithm Flow Chart
Dictionary Storage Process
Pattern Matching Process of T1
first_word in FWHash start_pos stop_
pos
in Hash output
TRUE 0 1
2
3
4
5
6
7
8
FALSE
FALSE
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE



中山大学



中山大学岭南学院
FALSE
FALSE
FALSE
TRUE 4 5
6
7
8
FALSE
FALSE
FALSE
TRUE



岭南学院
FALSE
FALSE
FALSE
Pattern Matching Process of T1
Pattern Matching Process of T2
first_word in FWHash start_
pos
stop_
pos
in Hash Output
Longhna FALSE
Hospital FALSE
, FALSE
Shanghai TRUE 17 26
37
40
52
60
69
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE

Shanghai University



Shanghai University of Traditional Chinese
Medicine
University FALSE
of FALSE
Traditional FALSE
Chinese FALSE
Medicine FALSE
Pattern Matching Process of T2
测试数据量/条 AC算法耗时/ms 本文算法耗时/ms 提升百分比
143 3 359.16 2 164.81 35.56%
1 430 3 364.56 2 298.49 31.69%
9 724 3 397.46 3 048.69 10.27%
10 010 3 389.46 3 071.28 9.39%
11 011 3 392.46 3 166.77 6.65%
12 012 3 395.56 3 307.59 2.59%
13 013 3 398.96 3 393.75 0.15%
14 300 3 402.66 3 391.73 0.32%
Algorithm Performance on English Corpus
测试数据量/条 AC算法耗时/ms 本文算法耗时/ms 提升百分比
5 972 3 395.66 3 024.54 10.93%
9 000 3 408.46 3 245.55 4.78%
10 000 3 418.26 3 415.43 0.08%
11 945 3 436.26 4 017.76 -16.92%
Algorithm Performance on Chinese-English Mixed Corpus
Time Cost Between the Algorithm in This Paper and the WM Algorithm
[1] Wu X, Zhu X, Wu G Q, et al. Data Mining with Big Data[J]. IEEE Transactions on Knowledge & Data Engineering, 2013,26(1):97-107.
[2] 杨尚森. 网络管理与维护技术[M]. 北京: 电子工业出版社, 2007.
[2] (Yang Shangsen. Network Management and Maintenance Technology[M]. Beijing: Publishing House of Electronics Industry, 2007.)
[3] 蔡婷, 杨卫帅. 一种改进的字符串模式匹配算法[J]. 物联网技术, 2017,7(7):89-91,95.
[3] (Cai Ting, Yang Weishuai. An Improved String Pattern Matching Algorithm[J]. Internet of Things Technologies, 2017,7(7):89-91,95.)
[4] 刘沛骞, 冯晶晶. 一种改进的BM模式匹配算法[J]. 计算机工程, 2011,37(17):248-249.
[4] (Liu Peiqian, Feng Jingjing. Improved BM Pattern Matching Algorithm[J]. Computer Engineering, 2011,37(17):248-249.)
[5] 齐晖, 曹旻, 袁世忠. 模式匹配算法性能对比试验结果在入侵检测系统中的应用[J]. 河南科学, 2009,27(7):835-838.
[5] (Qi Hui, Cao Min, Yuan Shizhong. Testing for Contrasting Performance of Pattern Matching Algorithms and Its Application[J]. Henan Science, 2009,27(7):835-838.)
[6] Knuth D E, Morris J H, Pratt V R. Fast Pattern Matching in Strings[J]. Siam Journal on Computing, 1977,6(2):323-350.
doi: 10.1137/0206024
[7] 曹成宏, 雷迎科. 面向比特流的链路层未知帧分析技术综述[J]. 小型微型计算机系统, 2018,39(2):297-303.
[7] (Cao Chenghong, Lei Yingke. Survey of Bit-stream Oriented Link-layer Unknown Frame Analysis Techniques[J]. Journal of Chinese Computer Systems, 2018,39(2):297-303.)
[8] 屈正庚, 赵杰. 一种改进的高效多模式匹配算法[J]. 系统仿真技术, 2014,10(2):116-120, 139.
[8] (Qu Zhenggeng, Zhao Jie. An Improved and Efficient Algorithm for Multiple Patterns Matching[J]. System Simulation Technology, 2014,10(2):116-120,139.)
[9] Aho A V, Corasick M J. Efficient String Matching: An Aid to BibliograPhic Search[J]. Communications of the ACM, 1975,18(6):33-40.
[10] Wu S, Manber U. A Fast Algorithm for Multi-Pattern Searching[R]. Arizona:University of Arizona at Tuscon, 1994.
[11] 刘邦国, 陈庆春, 类先富. 一种面向PDF文本内容审查的高效多模式匹配算法[J]. 计算机应用研究, 2020,37(6):1755-1759.
[11] (Liu Bangguo, Chen Qingchun, Lei Xianfu. Efficient Multi-pattern Matching Algorithm for PDF Content Search[J]. Application Research of Computers, 2020,37(6):1755-1759.)
[12] 赵国锋, 叶飞, 姚永安, 等. 一种面向云中心网络入侵检测的多模式匹配算法[J]. 信息网络安全, 2018(1):52-57.
[12] (Zhao Guofeng, Ye Fei, Yao Yongan, et al. Design and Implementation of A Multi-pattern String Matching Algorithm in Cloud Center Network Intrusion Detection System[J]. Netinfo Security, 2018(1):52-57.)
[13] Baeza-Yates R A, Gonnet G H. A New Approach to Text Searching[J]. ACM SIGIR Forum, 1989,23(SI):168-175.
doi: 10.1145/75335.75352
[14] 朱俊. 多模式匹配算法研究[D]. 合肥:合肥工业大学, 2010.
[14] (Zhu Jun. Multiple Patterns Match Algorithm Research[D]. Hefei : Hefei University of Technology, 2010.)
[15] 巫喜红, 曾锋. AC多模式匹配算法研究[J]. 计算机工程, 2012,38(6):279-281.
[15] (Wu Xihong, Zeng Feng. Research on AC Multiple Pattern Matching Algorithm[J]. Computer Engineering, 2012,38(6):279-281.)
[16] 王培凤, 李莉. 基于Aho-Corasick算法的多模式匹配算法研究[J]. 计算机应用研究, 2011,28(4):1251-1253, 1259.
[16] (Wang Peifeng, Li Li. Research on Multi-Pattern Matching Algorithms Based on Aho-Corasick Algorithm[J]. Application Research of Computers, 2011,28(4):1251-1253, 1259.)
[17] Nieminen J, Kilpelinen P. Efficient Implementation of Aho-Corasick Pattern Matching Automata Using Unicode[J]. Software: Practice and Experience, 2010,37(6):669-690.
doi: 10.1002/(ISSN)1097-024X
[18] 李志东, 杨武, 张汝波, 等. 基于异构隐式存储的多模式匹配算法[J]. 通信学报, 2009,30(3):119-124.
[18] (Li Zhidong, Yang Wu, Zhang Rubo, et al. Multi-pattern Matching Algorithm Based on Heterogeneous Implicit Storage[J]. Journal on Communications, 2009,30(3):119-124.)
[19] 胡丽娟. 大规模多模式匹配算法的研究[D]. 西安:西安电子科技大学, 2017.
[19] (Hu Lijuan. Research on Large-Scale Multi-pattern Matching Algorithms[D]. Xi’an: Xidian University, 2017.)
[20] 陈永杰, 吾守尔·斯拉木, 于清. 一种基于Aho-Corasick算法改进的多模式匹配算法[J]. 现代电子技术, 2019,42(4):89-93.
[20] (Chen Yongjie, Wushour Silamu, Yu Qing. An Improved Multi-pattern Matching Algorithm Based on Aho-Corasick Algorithm[J]. Modern Electronics Technique, 2019,42(4):89-93.)
[21] 曹为政, 葛蒙蒙. 多模式匹配算法研究和优化[J]. 智能计算机与应用, 2018,8(2):129-133.
[21] (Cao Weizheng, Ge Mengmeng. Multi-pattern Matching Algorithm Research and Optimization[J]. Intelligent Computer and Applications, 2018,8(2):129-133.)
[22] 潘华伟. 网络防护墙中多模式匹配算法的研究[D]. 武汉:湖北工业大学, 2017.
[22] (Pan Huawei. The Research of Multi-pattern, Matching Algorithm in Network Firewall[D]. Wuhan: Hubei University of Technology, 2017.)
[23] Fredriksson K, Grabowski S. Average-Optimal String Matching[J]. Journal of Discrete Algorithms, 2009,7(4):579-594.
doi: 10.1016/j.jda.2008.09.001
[24] Coit C J, Staniford S, Mcalerney J. Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort[M]. Piscataway, NJ: IEEE Press, 2001.
[25] 董志鑫, 方滨兴. 基于AC_QS多模式匹配算法的优化研究[J]. 智能计算机与应用, 2017,7(5):100-103.
[25] (Dong Zhixin, Fang Binxing. Optimization Research of Multi Pattern Matching Algorithm Based on AC_QS[J]. Intelligent Computer and Applications, 2017,7(5):100-103.)
[26] 董志鑫. 面向千万级规模网络的多模式串匹配算法的优化研究[D]. 哈尔滨:哈尔滨工业大学, 2017.
[26] (Dong Zhixin. Multi-pattern String Matching Optimization for the Ten Million Scale Network[D]. Harbin: Harbin Institute of Technology, 2017.)
[27] 岳小伟, 詹瞻. 面向长模式串的改进型AC算法研究[J]. 网络安全技术与应用, 2018(4):40-42.
[27] (Yue Xiaowei, Zhan Zhan. Research on Improved AC Algorithm for Long Pattern Strings[J]. Network Security Technology & Application, 2018(4):40-42.)
[28] 夏念, 嵩天. 短规则有效的快速多模式匹配算法[J]. 计算机工程与应用, 2017,53(7):1-8.
[28] (Xia Nian, Song Tian. Short-rule-efficient Rapid Multi-pattern Matching Algorithm[J]. Computer Engineering and Applications, 2017,53(7):1-8.)
[29] Zhang W. An Improved Wu-Manber Multiple Patterns Matching Algorithm[C]// Proceedings of the 2016 IEEE International Conference on Electronic Information and Communication Technology. Piscataway, NJ: IEEE Press, 2016: 286-289.
[30] Saikrishna V, Rasool A, Khare N. Spam Filtering Through Multiple Pattern Bit Parallel String Matching Combining Shift AND & OR[J]. International Journal of Computer Applications, 2013,61(5):40-45.
[31] 孟旭东, 许强凯. 应用于Web服务器匹配算法的FPGA实现[J]. 计算机技术与发展, 2016,26(12):142-147,152.
[31] (Meng Xudong, Xu Qiangkai. Implementation of FPGA Applied to Web Server Matching Algorithm[J]. Computer Technology and Development, 2016,26(12):142-147,152.)
[32] 戴正华. 面向体系结构的串匹配算法优化研究[D]. 北京:中国科学院计算技术研究所, 2006.
[32] (Dai Zhenghua. Optimization of String Matching Algorithm Based on Computer Architecture[D]. Beijing: Institute of Computing Technology of Chinese Academy of Sciences, 2006.)
[33] 陈瑞, 顾乃杰, 叶鸿. 一种基于魂芯DSP的单模式位并行串匹配算法[J]. 计算机应用与软件, 2020,37(7):246-252.
[33] (Chen Rui, Gu Naijie, Ye Hong. A Single Mode Bit-Parallel String Matching Algorithm Based on HXDSP[J]. Computer Applications and Software, 2020,37(7):246-252.)
[34] Lubanovic B. Python语言及其应用[M]. 丁嘉瑞, 梁杰, 禹常隆译. 北京: 人民邮电出版社, 2016.
[34] (Lubanovic B. Introducing Python[M]. Translated by Ding Jiarui, Liang Jie, Yu Changlong. Beijing: Posts & Telecom Press, 2016.)
[35] Matthes E. Python编程:从入门到实践[M]. 袁国忠译. 北京: 人民邮电出版社, 2016.
[35] (Matthes E. Python Crash Course:A Hands-On, Project-Based Introducation to Programming[M]. Translated by Yuan Guozhong. Beijing: Posts & Telecom Press, 2016.)
[36] 中国科学引文数据库课题组. 中国科学引文数据库来源期刊的选择及其评价[J]. 中国科技期刊研究, 1997,8(4):17-21.
[36] (Chinese Science Citation Database Group. Selection and Evaluation of Source Journals of Chinese Science Citation Database[J]. Chinese Journal of Scientific and Technical Periodicals, 1997,8(4):17-21.)
[1] Jianyong Zhang,Li Qian,Qianqian Yu,Zhipeng Dong,Yongwen Huang,Jianhua Liu,Shu Guo,Feng Wang. Constructing Name Authority for Research Entities[J]. 数据分析与知识发现, 2019, 3(1): 27-37.
[2] Shi Liting,Zhang Qian,Zhong Yongheng,Hu Sisi,Li Zhenzhen. Using Bidirectional Pattern Matching Model to Pre-Process Yearbook Data[J]. 现代图书情报技术, 2016, 32(9): 88-94.
[3] Hao Jiashu. Enriching Personal Name Authority with Open Semantic Resources:FOAF for Schema Design[J]. 现代图书情报技术, 2016, 32(2): 75-82.
[4] Bai Haiyan. Introduction of Integration Between ORCID and Institutional Repository[J]. 现代图书情报技术, 2015, 31(3): 8-17.
[5] Jiang Chuntao. Automatic Annotation of Bibliographical References in Chinese Patent Documents[J]. 现代图书情报技术, 2015, 31(10): 81-87.
[6] Yu Liping, Wu Yishan. Study on Influence of Data Standardization to Evaluation Results in Science and Technology[J]. 现代图书情报技术, 2011, 27(9): 66-71.
[7] Jia Meiying,Yang Bingru,Zheng Dequan,Cao Hongqiang,Yang Jing,Zhang Lian. Sham Battle Information Extraction Based on Pattern Matching[J]. 现代图书情报技术, 2009, (9): 70-75.
[8] Chen Jinxing,Zhu Zhongming. Research Progress of the Name Authority Control for the Contributor[J]. 现代图书情报技术, 2009, 25(12): 12-17.
[9] Zhang Yashan. The Fuzzy Retrieval Problem and Its Improvement for MARC[J]. 现代图书情报技术, 2008, 24(7): 91-95.
[10] Yao Xingshan. The Improvement in a Chinese Word Segmentation Based on Hash Algorism[J]. 现代图书情报技术, 2008, 24(3): 78-81.
[11] Wang Hao . Named Entity Extraction Model Based on Hierarchical Pattern Matching[J]. 现代图书情报技术, 2007, 2(5): 62-68.
[12] Liu Chunhong,Li fengxia,Yang Hui. On the Description of Name Authority Headings at the  Library of  Tsinghua University[J]. 现代图书情报技术, 2005, 21(2): 67-70.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn