Data Analysis and Knowledge Discovery (数据分析与知识发现)  2021, Vol. 5 Issue (9): 63-74     https://doi.org/10.11925/infotech.2096-3467.2021.0460
Research Article
Annotation Method for Extracting Entity Relationship from Ancient Chinese Works*
Wang Yifan (王一钒)1, Li Bo (李博)2, Shi Hua (史话)3, Miao Wei (苗威)1, Jiang Bin (姜斌)2
1School of Northeast Asia Studies, Shandong University, Weihai 264209, China
2School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai 264209, China
3School of Transborder Studies, Arizona State University, Tucson 85257, USA
Abstract

[Objective] Annotation standards for ancient Chinese datasets have received little study. This paper proposes an annotation method for entities and relations in ancient Chinese texts. [Methods] The method integrates logical semantics, deep learning, and historical scholarship. It rests on three principles, “relation valence annotation”, “propositional logic annotation”, and “existence of a single relation”, and it is suited to few-shot learning. [Results] Experiments used an end-to-end Word Embedding-BiGRU-CRF relation sequence labeling model on a text dataset drawn from Shiji (Records of the Grand Historian). F1 reached 42.02% on entity relation extraction and 34.07% on propositional logic extraction. [Limitations] The study did not use pre-trained models such as BERT or ALBERT and relied on the classic Word2Vec model for word embedding. The final results leave considerable room for improvement. [Conclusions] The experiments give preliminary evidence that the annotation method and the joint extraction model are feasible, and they address a research gap in entity relation extraction for ancient Chinese.

Keywords: Natural Language Processing    Relation Extraction    Sequence Tagging    Shiji
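Received: 2021-05-10      Published: 2021-10-15
Chinese Library Classification (ZTFLH): TP391
Funding: *This work is one of the research outcomes of a Special Project of the National Social Science Fund of China (17VGB005)
Corresponding author: Miao Wei     E-mail: miaowei@sdu.edu.cn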
Cite this article:
Wang Yifan, Li Bo, Shi Hua, Miao Wei, Jiang Bin. Annotation Method for Extracting Entity Relationship from Ancient Chinese Works. Data Analysis and Knowledge Discovery, 2021, 5(9): 63-74.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0460      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I9/63
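Name | Tag | Meaning
Life | LFE | Any sentence core concerned with creating, harming, or even destroying life
Social Contact | SCT | Any sentence core of a social nature
Location | LOC | Any sentence core concerning spatial location
Politics | POL | Any sentence core of a political nature
Action Towards Object | ATO | Any sentence core concerned with creating, harming, or even destroying non-living things
War | WAR | Any sentence core related to war
Table 1  Annotation specification for parent categories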
Name | Tag | Control setting (number of subcategories / total)
Life | LFE | 36/209
Social Contact | SCT | 49/85
Location | LOC | 88/171
Politics | POL | 437/789
Action Towards Object | ATO | 74/126
War | WAR | 23/36
Table 2  Control settings for parent-category samples
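Relation type | Subject | Object | Annotated position | Source text
Location/LOC | 黄帝 | 轩辕之丘 |  | 黄帝居轩辕之丘
Coordination/SJ-BL |  |  | ,而 | 黄帝居轩辕之丘,而娶于西陵之女
Coordination/SJ-BL |  |  | ,而 | 黄帝居轩辕之丘,而娶于西陵之女
Social Contact/SCT | 黄帝 | 西陵之女 | 娶于 | 黄帝居轩辕之丘,而娶于西陵之女
Attribute/SX | 西陵之女 | 嫘祖 | 是为 | 而娶于西陵之女,是为嫘祖
Table 3  Example of the annotation method for joint extraction of entities and relations in ancient Chinese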
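The rows of Table 3 can be read as labeled spans over the source text. As a minimal sketch of how such spans might be turned into per-character labels for a sequence tagger, the snippet below encodes them with BIO prefixes. The BIO prefixes, the `bio_tags` helper, and the span list are assumptions made for illustration; the page does not state the exact tag format the authors used. Because the annotated position of the LOC row is empty in the table, the character 居 is left as `O` here.

```python
# Sketch only: the BIO prefixes and this span-based encoding are assumptions,
# not the tagging scheme defined in the paper.
from typing import List, Tuple

Span = Tuple[str, str]  # (surface text, tag), e.g. ("黄帝", "SBJ")


def bio_tags(sentence: str, spans: List[Span]) -> List[str]:
    """Assign one BIO tag per character; unannotated characters get 'O'."""
    tags = ["O"] * len(sentence)
    cursor = 0  # spans are expected in left-to-right order
    for text, tag in spans:
        start = sentence.find(text, cursor)
        if start < 0:
            raise ValueError(f"span {text!r} not found after position {cursor}")
        tags[start] = f"B-{tag}"
        for i in range(start + 1, start + len(text)):
            tags[i] = f"I-{tag}"
        cursor = start + len(text)
    return tags


# Spans taken from the Subject, Object, and Annotated position columns of Table 3.
sentence = "黄帝居轩辕之丘,而娶于西陵之女"
spans = [
    ("黄帝", "SBJ"),
    ("轩辕之丘", "OBJ"),
    (",而", "SJ-BL"),
    ("娶于", "SCT"),
    ("西陵之女", "OBJ"),
]
tags = bio_tags(sentence, spans)
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Keeping the relation tag on the predicate span (娶于) and the logic tag on the connective (,而) mirrors the table's "Annotated position" column, so one tag sequence carries entities, relations, and clause logic together.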
Fig.1  Annotation method for joint extraction of entities and relations in ancient Chinese
Fig.2  The Word Embedding-BiGRU-CRF model
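The page gives only the name of the model in Fig.2, not its internals. In architectures of this kind, the CRF layer usually selects the output tag sequence by Viterbi decoding over per-character emission scores (here they would come from the BiGRU) and tag-to-tag transition scores. The sketch below shows that decoding step in plain Python, assuming a standard linear-chain CRF; the tag set, the scores, and the `viterbi` function are toy values and hypothetical names, not the authors' implementation.

```python
# Sketch only: toy scores and a reduced tag set, chosen for illustration.
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag path for additive (log-space) scores."""
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-SBJ", "I-SBJ"]
# Transition scores: discourage O -> I-SBJ, encourage B-SBJ -> I-SBJ.
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I-SBJ"] = -10.0
transitions["B-SBJ"]["I-SBJ"] = 1.0
# Emission scores for the three characters of "黄帝居" (made-up numbers).
emissions = [
    {"O": 0.5, "B-SBJ": 2.0, "I-SBJ": 0.0},
    {"O": 1.0, "B-SBJ": 0.0, "I-SBJ": 1.5},
    {"O": 2.0, "B-SBJ": 0.0, "I-SBJ": 0.0},
]
decoded = viterbi(emissions, transitions, tags)
print(decoded)
```

The transition table is what distinguishes a CRF layer from independent per-character classification: it lets the model penalize ill-formed sequences such as an `I-` tag that follows `O`.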
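Type | Annotations | Type | Annotations
Subject/SBJ | 3,283 | Object/OBJ | 2,328
Life/LFE | 209 | Attribute/SX | 157
Social Contact/SCT | 85 | Event Attribute/SJSX | 440
Location/LOC | 171 | Progression/SJ-DJ | 810
Politics/POL | 789 | Coordination/SJ-BL | 140
Action Towards Object/ATO | 126 | Causation/SJ-YG | 57
War/WAR | 36 | Contrast/SJ-ZZ | 20
Table 4  Relation data types and their distribution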
Relation type | Precision | Recall | F1
SBJ | 65.68% | 65.88% | 65.78%
OBJ | 53.29% | 43.55% | 47.93%
LFE | 90.00% | 75.00% | 81.82%
SCT | 16.67% | 7.69% | 10.53%
LOC | 29.63% | 20.00% | 23.88%
POL | 33.57% | 36.36% | 34.91%
ATO | 0.00% | 0.00% | 0.00%
WAR | 71.43% | 50.00% | 58.82%
SX | 56.41% | 53.66% | 55.00%
SJSX | 18.92% | 9.21% | 12.39%
Overall | 48.10% | 38.38% | 42.02%
Table 5  Training results of the entity relation extraction model
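The per-type F1 values in Table 5 are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below recomputes them from the rounded percentages in the table; the `f1` helper and the 0.02-point tolerance for rounding are choices made for this sketch, and treating F1 as 0 when precision and recall are both 0 is an assumed convention.

```python
# Sketch: recompute F1 from the rounded precision/recall values in Table 5.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rows = {  # tag: (precision %, recall %, reported F1 %)
    "SBJ": (65.68, 65.88, 65.78),
    "OBJ": (53.29, 43.55, 47.93),
    "LFE": (90.00, 75.00, 81.82),
    "SCT": (16.67, 7.69, 10.53),
    "LOC": (29.63, 20.00, 23.88),
    "POL": (33.57, 36.36, 34.91),
    "ATO": (0.00, 0.00, 0.00),
    "WAR": (71.43, 50.00, 58.82),
    "SX": (56.41, 53.66, 55.00),
    "SJSX": (18.92, 9.21, 12.39),
}
for tag, (p, r, reported) in rows.items():
    print(f"{tag:5s} computed={f1(p, r):6.2f} reported={reported:6.2f}")

# True when every row matches within rounding error.
consistent = all(abs(f1(p, r) - rep) < 0.02 for p, r, rep in rows.values())
```

The overall row is left out of the check on purpose. The harmonic mean of 48.10% and 38.38% is about 42.69%, not the reported 42.02%, so the overall F1 is presumably aggregated in some other way that this page does not state.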
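Relation type | Annotations | Subcategories
SCT | 85 | 52
ATO | 126 | 87
LOC | 171 | 96
POL | 789 | 485
SJSX | 440 | 396
DJ | 810 | 52
BL | 140 | 10
YG | 57 | 16
ZZ | 20 | 6
Table 6  Annotation counts and subcategory counts for propositional logic annotation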
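Propositional logic | Precision | Recall | F1
DJ | 60.38% | 57.55% | 58.93%
BL | 67.35% | 27.73% | 39.29%
YG | 28.57% | 12.50% | 17.39%
ZZ | 0.00% | 0.00% | 0.00%
Overall | 42.79% | 30.35% | 34.07%
Table 7  Training results of the propositional logic extraction model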