Review of Technology Term Recognition Studies Based on Machine Learning
Hu Yamin1,2, Wu Xiaoyan1, Chen Fang1,2
1 Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610041, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract [Objective] This paper reviews the current state and future directions of machine learning-based technology term recognition. [Coverage] We searched the Web of Science and CNKI for “technology term* recognition” in Chinese and English, then expanded the search to include literature on the relevant algorithms. A total of 62 representative papers were selected for this review. [Methods] We summarized how machine learning is applied to technology term recognition and how the approaches differ, and then examined the field from four perspectives: the classification of algorithms, general procedures, existing problems, and downstream applications. Finally, we discussed development trends and future studies. [Results] The algorithms can be divided into single statistical machine learning algorithms, single deep learning algorithms, and hybrid algorithms. The most widely used is a hybrid method, namely the BiLSTM-CRF model. Transfer learning is an important direction for future research. [Limitations] Because deep learning is progressing rapidly and new hybrid models are constantly emerging, this paper summarizes only the popular ones. [Conclusions] Many issues remain to be addressed. Future research should strengthen work on fine-grained entity recognition, feature representation, evaluation, and open source toolkits.
Received: 22 September 2021
Published: 14 April 2022
Corresponding Author:
Chen Fang, ORCID: 0000-0001-9060-784X
E-mail: chenf@clas.ac.cn