Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (8): 86-97     https://doi.org/10.11925/infotech.2096-3467.2020.0032
Research Paper
Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning*
Xu Chenfei1,2, Ye Haiying2, Bao Ping1
1Institution of Chinese Agricultural Civilization, Nanjing Agricultural University, Nanjing 210095, China
2Economics and Management School, Nantong University, Nantong 226019, China
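Abstract (translated from the Chinese)

[Objective] To investigate the automatic recognition of four entity types in the produce records of ancient local chronicles (produce aliases, persons, places of origin, and cited books), in support of building a local-chronicle produce knowledge base. [Methods] Using the Yunnan volume of Local Chronicle: Produce (《方志物产》), an institutional special collection, as the base corpus, and following text preprocessing and corpus annotation, four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF, and BERT) were evaluated and their results compared. [Results] Compared with the Bi-LSTM model, the Bi-LSTM-CRF model raised P by 5.54 percentage points and F by 3.51 percentage points. The BERT model reached an R of 83.36%, higher than the other models. The Bi-LSTM-CRF model performed best on cited-book entities (F = 89.71%), and the BERT model performed best on person entities (F = 87.90%). [Limitations] Because of the characteristics of ancient local-chronicle text, and because identifying the relevant entities requires domain knowledge, the manual annotation may contain missed and incorrect labels, which kept the models from being fully optimized. [Conclusions] The study shows that deep learning methods are feasible and advantageous for entity recognition in ancient local-chronicle texts.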

Abstract

[Objective] This paper aims to automatically identify produce aliases, related persons, places of origin, and cited books in ancient local chronicles, in order to build a knowledge base of local-chronicle produce. [Methods] First, we chose Local Chronicle of Yunnan: Produce as the base corpus, preprocessed its text, and annotated the corpus. Then, we applied four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF and BERT) to identify the target entities. Finally, we compared the outputs of these models. [Results] The precision (P) and F-score (F) of the Bi-LSTM-CRF model were 5.54 and 3.51 percentage points higher than those of the Bi-LSTM model. The recall (R) of the BERT model reached 83.36%, the best among all models. The Bi-LSTM-CRF model yielded the best results on cited-book entities (F = 89.71%), and the BERT model performed best on person entities (F = 87.90%). [Limitations] Due to the linguistic characteristics of ancient local chronicles and the domain knowledge required for identifying the relevant entities, the manual annotation may contain missed or incorrect tags, so the models may not be fully optimized. [Conclusions] Deep learning methods are feasible and effective for identifying entities in ancient local chronicles.

Key words: Deep Learning; Local Chronicle: Produce; Named Entity Recognition; Model Construction; Digital Humanities
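Received: 2020-01-08      Published: 2020-06-05
CLC number (ZTFLH):  G255
Funding: *This work is one of the outcomes of the Major Program of the National Social Science Fund of China, "Research on the Construction and In-Depth Utilization of a Local-Chronicle Produce Knowledge Base" (18ZDA327), and the Youth Fund for Humanities and Social Sciences Research of the Ministry of Education, "An Empirical Study of Semantics-Based Knowledge Organization and Knowledge Aggregation of Local-Chronicle Produce Materials" (19YJC870027).
Corresponding author: Bao Ping     E-mail: baoping@njau.edu.cn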
Cite this article:
Xu Chenfei, Ye Haiying, Bao Ping. Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 86-97.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0032      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I8/86
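Fig.1  RNN-based entity recognition model for local-chronicle produce materials
Fig.2  Bi-LSTM-based entity recognition model for local-chronicle produce materials
Fig.3  Bi-LSTM-CRF-based entity recognition model for local-chronicle produce materials
Fig.4  BERT-based entity recognition model for local-chronicle produce materials
Fig.5  BERT-based input representation for local-chronicle produce materials
Fig.6  Sample of 10 randomly selected produce entries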
No. Token Tag
1 B-PN
2 E-PN
3 B-PC
4 I-PC
5 I-PC
6 I-PC
7 E-PC
8 O
9 O
10 O
11 O
12 O
13 B-PA
14 I-PA
15 E-PA
Table 1  Sample of the processed ancient local-chronicle produce corpus
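The tag column of Table 1 follows a begin/inside/end/outside pattern (B-, I-, E-, O). As a minimal sketch of how such a sequence maps back to entity spans, the hypothetical helper `decode_spans` below groups tags into `(entity_type, start, end)` triples with an exclusive end index. The function name and the handling of ill-formed sequences are assumptions made for illustration, and the type codes PN, PC and PA are treated as opaque labels.

```python
from typing import List, Tuple


def decode_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (entity_type, start, end) spans, end exclusive, from B-/I-/E-/O tags."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "B":
            # A new entity opens here.
            start, current = i, etype
        elif prefix == "I" and current == etype:
            # Still inside the entity opened by the last B- tag.
            continue
        elif prefix == "E" and current == etype and start is not None:
            # The entity closes here; record it.
            spans.append((etype, start, i + 1))
            start, current = None, None
        else:
            # Ill-formed sequence (e.g. I- without a preceding B-): drop it.
            start, current = None, None
    return spans


# Tag column of Table 1, rows 1-15 (positions are 0-based in the output).
table1_tags = ["B-PN", "E-PN", "B-PC", "I-PC", "I-PC", "I-PC", "E-PC",
               "O", "O", "O", "O", "O", "B-PA", "I-PA", "E-PA"]
spans = decode_spans(table1_tags)
print(spans)
```

Table 1 shows no single-character entity tag, so this sketch does not handle one; a scheme with such a tag would need an extra branch.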
Hyperparameter Value
Bi-LSTM/Bi-RNN layers 2
Hidden size 256
Learning rate 0.001
Batch size 64
Dropout rate 0.5
Gradient clipping 5
Table 2  Experimental hyperparameter settings
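The settings in Table 2 can be held in a small configuration object. The sketch below also illustrates one common reading of the gradient-clipping entry: rescaling the gradient when its global L2 norm exceeds the threshold. The page does not say whether the authors clipped by norm or by value, so `clip_by_global_norm` and the class name `RnnTaggerConfig` are illustrative assumptions, not the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RnnTaggerConfig:
    """Values copied from Table 2; the field names are hypothetical."""
    num_layers: int = 2
    hidden_size: int = 256
    learning_rate: float = 0.001
    batch_size: int = 64
    dropout: float = 0.5
    clip_gradient: float = 5.0


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale grads so their L2 norm is at most max_norm (assumed clipping mode)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


config = RnnTaggerConfig()
# A toy gradient with norm 10 is rescaled to norm 5.
clipped = clip_by_global_norm([6.0, 8.0], config.clip_gradient)
print(clipped)
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only affects the occasional large update.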
Hyperparameter Value
BERT layers 2
Hidden size 128
Learning rate 2e-5
Batch size 32
Training epochs 10
Table 3  Experimental hyperparameter settings (BERT)
Model P(%) R(%) F(%)
Bi-RNN 69.91 75.10 72.38
Bi-LSTM 76.33 76.73 76.51
Bi-LSTM-CRF 81.87 78.30 80.02
BERT 76.61 83.36 79.83
Table 4  Performance of each model on the ancient local-chronicle produce corpus
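The comparisons quoted in the abstract can be re-derived from Table 4. The short check below computes the Bi-LSTM-CRF gains over Bi-LSTM, finds the models with the best R and F, and compares each reported F with the harmonic mean of the reported P and R. The usual definition F = 2PR / (P + R) is assumed here, and the variable names are illustrative.

```python
# (P, R, F) in percent, copied from Table 4.
scores = {
    "Bi-RNN": (69.91, 75.10, 72.38),
    "Bi-LSTM": (76.33, 76.73, 76.51),
    "Bi-LSTM-CRF": (81.87, 78.30, 80.02),
    "BERT": (76.61, 83.36, 79.83),
}


def harmonic_f(p: float, r: float) -> float:
    """F-score as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Gains of Bi-LSTM-CRF over Bi-LSTM, in percentage points.
p_gain = round(scores["Bi-LSTM-CRF"][0] - scores["Bi-LSTM"][0], 2)
f_gain = round(scores["Bi-LSTM-CRF"][2] - scores["Bi-LSTM"][2], 2)

best_recall = max(scores, key=lambda m: scores[m][1])
best_f = max(scores, key=lambda m: scores[m][2])

# Gap between each reported F and the harmonic mean of reported P and R.
f_gaps = {m: abs(harmonic_f(p, r) - f) for m, (p, r, f) in scores.items()}

print(p_gain, f_gain, best_recall, best_f)
```

The reported F values sit within 0.05 points of the harmonic mean of the reported P and R. The small residual would be consistent with F having been averaged separately (for example per entity type), but the page does not state how the overall scores were aggregated.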
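Fig.7  Recognition performance of Bi-RNN vs. Bi-LSTM by entity type
Fig.8  Recognition performance of Bi-LSTM vs. Bi-LSTM-CRF by entity type
Fig.9  Recognition performance of Bi-LSTM-CRF vs. BERT by entity type
Fig.10  Search results for "优昙钵" (udumbara)
Fig.11  Produce detail page in the local-chronicle produce knowledge base
Fig.12  Linked-data visualization for the produce "优昙钵" (udumbara)
Fig.13  Spatio-temporal display for the produce "烟草" (tobacco)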
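Per-entity-type comparisons such as those in Figs. 7-9 are commonly scored by exact span matching: a predicted entity counts as correct only if its type and both boundaries agree with the annotation. The page does not describe the authors' scoring script, so the function `per_type_prf` and the toy gold and predicted spans below are hypothetical, shown only to make the metric concrete.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end), end exclusive


def per_type_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, Tuple[float, float, float]]:
    """Exact-match precision, recall and F-score for each entity type."""
    true_pos = Counter(span[0] for span in gold & pred)
    n_gold = Counter(span[0] for span in gold)
    n_pred = Counter(span[0] for span in pred)
    result = {}
    for etype in sorted(set(n_gold) | set(n_pred)):
        p = true_pos[etype] / n_pred[etype] if n_pred[etype] else 0.0
        r = true_pos[etype] / n_gold[etype] if n_gold[etype] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        result[etype] = (p, r, f)
    return result


# Toy data: the PC boundary is off by one and there is one spurious PA span.
gold = {("PN", 0, 2), ("PC", 2, 7), ("PA", 12, 15)}
pred = {("PN", 0, 2), ("PC", 3, 7), ("PA", 12, 15), ("PA", 8, 10)}
prf = per_type_prf(gold, pred)
for etype, (p, r, f) in prf.items():
    print(etype, round(p, 2), round(r, 2), round(f, 2))
```

Under exact matching, a boundary error costs both precision and recall for that type, which is one reason per-type scores can differ sharply between models even when overall scores are close.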