Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 8: 86-97     https://doi.org/10.11925/infotech.2096-3467.2020.0032
Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning*
Xu Chenfei 1,2, Ye Haiying 2, Bao Ping 1 (corresponding author)
1 Institution of Chinese Agricultural Civilization, Nanjing Agricultural University, Nanjing 210095, China
2 Economics and Management School, Nantong University, Nantong 226019, China
Abstract

[Objective] This paper explores the automatic recognition of four entity types in produce materials from ancient local chronicles (produce aliases, persons, places of origin, and cited books) to support the construction of a knowledge base of local chronicle produce. [Methods] Using the Yunnan volume of Local Chronicle: Produce, held in an institutional special collection, as the base corpus, we preprocessed and annotated the text, ran experiments with four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF, and BERT), and compared their results. [Results] Compared with the Bi-LSTM model, the Bi-LSTM-CRF model improved precision (P) by 5.54 percentage points and the F-score by 3.51 percentage points. The BERT model reached a recall (R) of 83.36%, the highest among all models. The Bi-LSTM-CRF model performed best on cited-book entities (F = 89.71%), and the BERT model performed best on person entities (F = 87.90%). [Limitations] Because of the linguistic characteristics of ancient local chronicle texts, and because identifying the relevant entities requires domain knowledge, the manual annotation may contain missed and incorrect labels, which kept the models from reaching their best performance. [Conclusions] The study demonstrates the feasibility and advantages of deep learning methods for entity recognition in ancient local chronicle texts.

Keywords: Deep Learning; Local Chronicle: Produce; Named Entity Recognition; Model Construction; Digital Humanities
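Received: 2020-01-08      Published: 2020-06-05
CLC number (ZTFLH):  G255
Funding: *This work is one of the research outcomes of the Major Program of the National Social Science Fund of China, "Construction and In-depth Utilization of a Local Chronicle Produce Knowledge Base" (18ZDA327), and of the Youth Fund for Humanities and Social Sciences Research of the Ministry of Education, "An Empirical Study of Semantics-Based Knowledge Organization and Knowledge Aggregation for Local Chronicle Produce Materials" (19YJC870027).
Corresponding author: Bao Ping     E-mail: baoping@njau.edu.cn
Cite this article:
Xu Chenfei, Ye Haiying, Bao Ping. Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 86-97.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0032      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I8/86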
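Fig.1  RNN-based entity recognition model for local chronicle produce materials
Fig.2  Bi-LSTM-based entity recognition model for local chronicle produce materials
Fig.3  Bi-LSTM-CRF-based entity recognition model for local chronicle produce materials
Fig.4  BERT-based entity recognition model for local chronicle produce materials
Fig.5  BERT-based input representation for local chronicle produce materials
Fig.6  Ten randomly selected sample produce entries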
No. Token Tag
1 B-PN
2 E-PN
3 B-PC
4 I-PC
5 I-PC
6 I-PC
7 E-PC
8 O
9 O
10 O
11 O
12 O
13 B-PA
14 I-PA
15 E-PA
Table 1  Sample of the processed ancient local chronicle produce corpus
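The tag column of Table 1 follows a begin/inside/end/outside pattern (B-, I-, E-, O). As a minimal sketch of how such a tag sequence can be turned into entity spans, the Python below decodes the fifteen tags listed in the table. The function name `decode_spans`, the 0-based inclusive indices, and the handling of a single-token `S-` prefix are assumptions made for illustration; the page does not show the authors' own decoding code, and the type labels PN, PC, and PA are treated as opaque strings because the page does not define them.

```python
from typing import List, Tuple

# (entity type, start index, end index), 0-based and inclusive.
Span = Tuple[str, int, int]


def decode_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from a B/I/E/O tag sequence.

    A span is emitted only when a B- tag is closed by an E- tag of the
    same type. An S- prefix (single-token entity) is accepted as an
    assumption; it does not appear in Table 1.
    """
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, etype = None, None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((label, i, i))
            start, etype = None, None
        elif prefix == "B":
            start, etype = i, label
        elif prefix == "I":
            # An I- tag that does not continue the open span discards it.
            if etype != label:
                start, etype = None, None
        elif prefix == "E":
            if start is not None and etype == label:
                spans.append((label, start, i))
            start, etype = None, None
    return spans


# The fifteen tags listed in Table 1, in order.
table1_tags = [
    "B-PN", "E-PN",
    "B-PC", "I-PC", "I-PC", "I-PC", "E-PC",
    "O", "O", "O", "O", "O",
    "B-PA", "I-PA", "E-PA",
]
table1_spans = decode_spans(table1_tags)
print(table1_spans)
```

With this reading, rows 1-2, 3-7, and 13-15 of Table 1 form three entities, and rows 8-12 lie outside any entity.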
Hyperparameter Value
Bi-LSTM/Bi-RNN layers 2
Hidden layer size 256
Learning rate 0.001
Batch size 64
Dropout rate 0.5
Gradient clipping 5
Table 2  Experimental hyperparameter settings
Hyperparameter Value
BERT layers 2
Hidden layer size 128
Learning rate 2e-5
Batch size 32
Training epochs 10
Table 3  Experimental hyperparameter settings (BERT)
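For readers who want to reuse the settings in Tables 2 and 3, one way to keep them together is a pair of small configuration objects. The sketch below only records the values listed in the two tables; the class and field names (`RecurrentConfig`, `BertConfig`, `clip_gradient`, and so on) are hypothetical and are not taken from the authors' code or from any particular framework.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RecurrentConfig:
    """Values from Table 2 (Bi-RNN / Bi-LSTM settings). Field names are illustrative."""
    num_layers: int = 2
    hidden_size: int = 256
    learning_rate: float = 1e-3
    batch_size: int = 64
    dropout: float = 0.5
    clip_gradient: float = 5.0


@dataclass(frozen=True)
class BertConfig:
    """Values from Table 3 (BERT settings). Field names are illustrative."""
    num_layers: int = 2
    hidden_size: int = 128
    learning_rate: float = 2e-5
    batch_size: int = 32
    train_epochs: int = 10


recurrent_cfg = RecurrentConfig()
bert_cfg = BertConfig()

# Print both configurations as plain dictionaries for logging.
print(asdict(recurrent_cfg))
print(asdict(bert_cfg))
```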
Model P(%) R(%) F(%)
Bi-RNN 69.91 75.10 72.38
Bi-LSTM 76.33 76.73 76.51
Bi-LSTM-CRF 81.87 78.30 80.02
BERT 76.61 83.36 79.83
Table 4  Experimental results of each model on the ancient local chronicle produce corpus
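The P, R, and F columns of Table 4 are precision, recall, and F-score. The sketch below shows one common way to compute them, assuming exact-match scoring over entity spans; the page does not state the authors' matching or averaging procedure, so this is an assumption. The toy gold and predicted spans are invented for illustration. The second half uses only the numbers printed in Table 4: it recomputes the Bi-LSTM-CRF gains over Bi-LSTM and compares the harmonic mean of each reported P and R with the reported F.

```python
from typing import Dict, Set, Tuple

# (entity type, start index, end index), 0-based and inclusive.
Span = Tuple[str, int, int]


def harmonic_f(p: float, r: float) -> float:
    """F-score as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def precision_recall_f(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level scores (assumed scoring rule)."""
    true_positives = len(gold & pred)
    p = true_positives / len(pred) if pred else 0.0
    r = true_positives / len(gold) if gold else 0.0
    return p, r, harmonic_f(p, r)


# Toy example (invented data): one predicted span has a wrong right boundary.
gold = {("PN", 0, 1), ("PC", 2, 6), ("PA", 12, 14)}
pred = {("PN", 0, 1), ("PC", 2, 5), ("PA", 12, 14)}
p, r, f = precision_recall_f(gold, pred)

# Reported values from Table 4: model -> (P, R, F), in percent.
TABLE4: Dict[str, Tuple[float, float, float]] = {
    "Bi-RNN": (69.91, 75.10, 72.38),
    "Bi-LSTM": (76.33, 76.73, 76.51),
    "Bi-LSTM-CRF": (81.87, 78.30, 80.02),
    "BERT": (76.61, 83.36, 79.83),
}

# Gains of Bi-LSTM-CRF over Bi-LSTM, in percentage points.
p_gain = round(TABLE4["Bi-LSTM-CRF"][0] - TABLE4["Bi-LSTM"][0], 2)
f_gain = round(TABLE4["Bi-LSTM-CRF"][2] - TABLE4["Bi-LSTM"][2], 2)

# Gap between the harmonic mean of reported P and R and the reported F.
f_gaps = {name: abs(harmonic_f(pp, rr) - ff) for name, (pp, rr, ff) in TABLE4.items()}

print(p, r, f)
print(p_gain, f_gain)
print(f_gaps)
```

The gains come out as 5.54 and 3.51 percentage points, matching the abstract. The harmonic mean of each reported P and R lands within a few hundredths of a point of the reported F rather than matching it exactly, which would be consistent with the F-values having been averaged separately (for example across runs or entity types), though the page does not say.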
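Fig.7  Recognition performance of Bi-RNN and Bi-LSTM on different entity types
Fig.8  Recognition performance of Bi-LSTM and Bi-LSTM-CRF on different entity types
Fig.9  Recognition performance of Bi-LSTM-CRF and BERT on different entity types
Fig.10  Search results for "优昙钵"
Fig.11  Produce detail page of the local chronicle produce knowledge base
Fig.12  Linked-data visualization for the produce "优昙钵"
Fig.13  Spatio-temporal display of the produce "烟草" (tobacco)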