Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning
Xu Chenfei1,2, Ye Haiying2, Bao Ping1
1Institution of Chinese Agricultural Civilization, Nanjing Agricultural University, Nanjing 210095, China
2Economics and Management School, Nantong University, Nantong 226019, China
[Objective] This paper aims to automatically identify produce aliases, related persons, places of origin, and cited books in ancient local chronicles, in order to build a knowledge base of traditional produce. [Methods] First, we chose Local Chronicle of Yunnan: Produce as the base corpus, preprocessed its text, and annotated the corpus. Then, we applied four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF, and BERT) to recognize the target entities. Finally, we compared the outputs of these models. [Results] The precision (P-value) and F-value of the Bi-LSTM model were 5.54% and 3.51% higher, respectively, than those of the Bi-LSTM-CRF model. The recall (R-value) of the BERT model reached 83.36%, the highest among all models. The Bi-LSTM-CRF model performed best on cited-book entities (F-value = 89.71%), and the BERT model performed best on person entities, with an F-value of 87.90%. [Limitations] Because of the linguistic characteristics of ancient local chronicles and the domain knowledge required to identify the relevant entities, the annotations may contain errors. [Conclusions] Deep learning can effectively identify the target entities in ancient local chronicles.
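The P-, R-, and F-values in the Results are read here as precision, recall, and F-score over recognized entities. The sketch below shows one common way such scores are computed from tagged sequences. It assumes a BIO tagging scheme and uses hypothetical label names (ALIAS, LOC, BOOK) and hypothetical function names; the abstract does not state the tag set or the evaluation script the authors used, so this illustrates the metric and does not reproduce their method.

```python
from typing import List, Set, Tuple

# An entity is (type, start index, end index exclusive). The label names used
# below (ALIAS, LOC, BOOK) are hypothetical and not taken from the paper.
Entity = Tuple[str, int, int]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from one BIO tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            entities.add((etype, start, i))  # close the open span
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]  # open a new span
    return entities


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F-score with exact span matching."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_ents, p_ents = extract_entities(g), extract_entities(p)
        tp += len(g_ents & p_ents)  # same type and same boundaries
        n_gold += len(g_ents)
        n_pred += len(p_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy example: the predicted LOC span is one token too short, so it does not count.
gold = [["B-ALIAS", "I-ALIAS", "O", "B-LOC", "I-LOC", "O", "B-BOOK", "I-BOOK"]]
pred = [["B-ALIAS", "I-ALIAS", "O", "B-LOC", "O", "O", "B-BOOK", "I-BOOK"]]
p, r, f = prf(gold, pred)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")  # prints "P=66.67% R=66.67% F=66.67%"
```

Exact span matching is a strict choice: a prediction with the right type but a wrong boundary counts as both a false positive and a false negative. Per-type scores, such as those quoted for cited books and persons, could be obtained by filtering the entity sets by type before counting.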
Xu Chenfei, Ye Haiying, Bao Ping. Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 86-97.