Review of Technology Term Recognition Studies Based on Machine Learning
Hu Yamin1,2, Wu Xiaoyan1, Chen Fang1,2
1Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610041, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper reviews the current status and future directions of machine-learning-based technology term recognition. [Coverage] We searched the Web of Science and CNKI for “technology term* recognition” in Chinese and English, then expanded the search to literature on the relevant algorithms. A total of 62 representative papers were selected for review. [Methods] We summarized how machine learning is applied to technology term recognition and how the approaches differ, examining the field from four perspectives: classification of algorithms, general procedures, existing problems, and downstream applications. Finally, we discussed development trends and future research. [Results] The algorithms fall into three groups: single statistical machine learning, single deep learning, and hybrid algorithms. Hybrid methods are the most widely used, chiefly the BiLSTM-CRF model. Transfer learning is an important direction for future research. [Limitations] Because deep learning is progressing rapidly and new hybrid models keep emerging, this paper summarizes only the popular ones. [Conclusions] Many issues remain to be addressed. Future research should strengthen fine-grained entity recognition, feature representation, evaluation, and open-source toolkits.
Hu Yamin, Wu Xiaoyan, Chen Fang. Review of Technology Term Recognition Studies Based on Machine Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 7-17.