Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (12): 123-136     https://doi.org/10.11925/infotech.2096-3467.2021.0359
Research Paper
Extracting Drama Terms with GCN Long-distance Constrain
(Chinese title in translation: Construction and Application of a Term Extraction Model for Intangible Cultural Heritage Drama Incorporating GCN Long-distance Constraints*)
Ren Qiutong, Wang Hao, Xiong Xin, Fan Tao
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This study proposes a better-performing term extraction model for traditional drama, a category of intangible cultural heritage, and builds a term base for traditional drama. [Methods] We first analyze the linguistic characteristics of drama texts in terms of term category, semantic structure, and text length. Based on these characteristics, we take BERT-BiLSTM-CRF as the base model and add part-of-speech and domain features to the character representations produced by BERT. We then add a graph convolutional network (GCN) after the BiLSTM layer to better capture the constraint relationships between distant words in a sentence. [Results] The term extraction model that combines GCN with external features reaches an F1 value of 91.11%, which is 1.3 percentage points higher than the mainstream BERT-BiLSTM-CRF model. [Limitations] The experimental data come only from Baidu Baike and the official intangible cultural heritage website, and the model's recognition performance on free text from other sources has not been verified. Some categories of drama terms have few training samples, and the selection of experimental data and of external features in the model is not comprehensive. [Conclusions] Based on the linguistic characteristics of traditional drama, this paper proposes a drama term extraction model that combines GCN with external features, builds a traditional drama term base, and applies the model to expand that term base, laying a foundation for constructing a knowledge graph of traditional drama.

Key words: Traditional Drama    Term Recognition    Graph Convolutional Network    Long-distance Constraint
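Received: 2021-04-13      Published: 2021-06-29
Chinese Library Classification (ZTFLH):  G122
Funding: * General Program of the National Natural Science Foundation of China (72074108); Nanjing University project funded by the Fundamental Research Funds for the Central Universities (010814370113); Jiangsu Young Social Science Talents Program
Corresponding author: Wang Hao, ORCID: 0000-0002-0131-0823     E-mail: ywhaowang@nju.edu.cn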
Cite this article:
Ren Qiutong, Wang Hao, Xiong Xin, Fan Tao. Extracting Drama Terms with GCN Long-distance Constrain. Data Analysis and Knowledge Discovery, 2021, 5(12): 123-136.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0359      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I12/123
Fig.1  Ontology concepts of traditional drama
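No. | Term category | Example terms
1 | Person name (人名) | 潘康泉、曹梅卿、黄玉盛
2 | Place (地点) | 浙江、杭州、上饶
3 | Drama genre (剧种) | 新昌调腔、京剧、梆子戏
4 | Play title (剧目) | 《贵妃醉酒》、《霸王别姬》、《三国传》
5 | Musical instrument (乐器) | 鼓、锣、笛
6 | Vocal style and tune name (唱腔曲牌) | 山坡羊、跳板、流水
7 | Role type (脚色行当) | 小生、青衣、花脸
Table 1  Term information for traditional drama in China's intangible cultural heritage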
Fig.2  Text length statistics
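No. | Character | Character label | POS tag | Domain feature tag
1 |  | B-剧种 | B-noun | 0
2 |  | I-剧种 | I-noun | 13
3 |  | O | O | 0
4 |  | B-人名 | B-noun | 0
5 |  | I-人名 | I-noun | 0
6 |  | O | O | 0
7 |  | O | B-verb | 0
Table 2  Annotation example (剧种: drama genre; 人名: person name; the Character column is empty in the source)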
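
The character labels in Table 2 follow a B-/I-/O scheme: the first character of a term gets `B-<category>`, the remaining characters get `I-<category>`, and everything else gets `O`. The sketch below shows one way such labels could be generated from term spans. The function name `bio_tags`, the span format, and the placeholder sentence `"ABCDEFG"` are illustrative assumptions; the actual characters of the Table 2 example are not recoverable from this page, and the paper's own annotation tooling is not described here.

```python
from typing import List, Tuple


def bio_tags(sentence: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one B-/I-/O label per character of `sentence`.

    `spans` holds (start, end, category) triples with `end` exclusive.
    Spans are assumed not to overlap (hypothetical input format).
    """
    tags = ["O"] * len(sentence)
    for start, end, category in spans:
        tags[start] = f"B-{category}"          # first character of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # remaining characters of the term
    return tags


# Placeholder 7-character sentence (the real characters are unknown):
# a 2-character drama genre, one other character, a 2-character person
# name, then two other characters, mirroring the label column of Table 2.
example = bio_tags("ABCDEFG", [(0, 2, "剧种"), (3, 5, "人名")])
for position, tag in enumerate(example, start=1):
    print(position, tag)
```

The part-of-speech column in Table 2 uses the same B-/I- prefixing over word boundaries, so the same function could be reused with POS names as categories.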
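No. | Item | Training set | Test set
1 | Number of sentences | 21,205 | 2,410
2 | Total terms | 53,440 | 6,032
3 | Person name (人名) | 8,872 | 835
4 | Place (地点) | 11,090 | 1,204
5 | Drama genre (剧种) | 12,082 | 1,510
6 | Play title (剧目) | 9,658 | 1,277
7 | Musical instrument (乐器) | 4,214 | 425
8 | Vocal style and tune name (唱腔曲牌) | 3,082 | 433
9 | Role type (脚色行当) | 4,442 | 348
Table 3  Sample distribution of the dataset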
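
As a quick consistency check on Table 3, the per-category counts can be summed and compared with the reported totals, and each category's share of the training terms can be computed. This is only an arithmetic sketch over the numbers printed in the table; the English dictionary keys are translations introduced here for readability.

```python
# Term counts per category copied from Table 3 as (training, test).
counts = {
    "person name": (8872, 835),
    "place": (11090, 1204),
    "drama genre": (12082, 1510),
    "play title": (9658, 1277),
    "musical instrument": (4214, 425),
    "vocal style and tune name": (3082, 433),
    "role type": (4442, 348),
}

train_total = sum(train for train, _ in counts.values())
test_total = sum(test for _, test in counts.values())

# Category with the fewest training terms.
smallest = min(counts, key=lambda name: counts[name][0])

print("training terms:", train_total)
print("test terms:", test_total)
for name, (train, _) in counts.items():
    print(f"{name}: {100 * train / train_total:.1f}% of training terms")
print("smallest training category:", smallest)
```

The sums match the "Total terms" row, and the smaller categories (vocal style and tune name, musical instrument, role type) are the ones the abstract's limitation about sparse training data most plausibly refers to.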
Fig.3  Overall model structure
Fig.4  Encoding structure of the BERT output
No. | POS | Chinese name | Examples
1 | Noun | 名词 | 高腔、新昌、调腔
2 | Verb | 动词 | 是、流传、有
3 | Adjective | 形容词 | 大、早、旧
4 | Numeral | 数词 | 个、十、百
5 | Classifier | 量词 | 批、年、句
6 | Pronoun | 代词 | 你、我、他
7 | Preposition | 介词 | 于、比、被
8 | Time word | 时间词 | 公元前
9 | Noun of locality | 地方名词 | 之内、之中
Table 4  Part-of-speech tags
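No. | Domain feature character | No. | Domain feature character
0 | non-feature character | 8 |
1 |  | 9 |
2 |  | 10 |
3 |  | 11 |
4 |  | 12 |
5 |  | 13 |
6 |  | 14 |
7 |  | 15 |
Table 5  Domain feature characters (the characters for indices 1-15 are missing in the source)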
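No. | Dependent | Dependent position | Head position | Dependency relation
1 | 高腔 | 1 | 2 | SBV
2 |  | 2 | 0 | HED
3 | 中国 | 3 | 4 | ATT
4 | 戏曲 | 4 | 5 | ATT
5 | 声腔 | 5 | 6 | ATT
6 | 之一 | 6 | 2 | VOB
7 |  | 7 | 2 | WP
Table 6  Dependency parsing result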
Fig.5  Example of a dependency syntax tree
Fig.6  Out-edge adjacency matrix
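
To make the link between the dependency parse in Table 6 and a GCN input concrete, the sketch below builds an adjacency matrix from the (dependent position, head position) pairs and applies one graph-convolution step. Several points are assumptions rather than the paper's stated method: "out-edge" is read here as an arc from the head word to its dependent, the root arc (head position 0) is dropped, and the propagation rule is a simple row-normalised mean over neighbours with self-loops followed by ReLU. The names `ARCS`, `out_edge_adjacency`, and `gcn_layer` are hypothetical.

```python
from typing import List

# (dependent position, head position) pairs from Table 6.
# Head position 0 denotes the virtual root governing the HED word.
ARCS = [(1, 2), (2, 0), (3, 4), (4, 5), (5, 6), (6, 2), (7, 2)]


def out_edge_adjacency(arcs, n: int) -> List[List[int]]:
    """adj[h][d] = 1 for each arc head -> dependent (0-based indices)."""
    adj = [[0] * n for _ in range(n)]
    for dep, head in arcs:
        if head == 0:              # assumption: the root arc has no token as head
            continue
        adj[head - 1][dep - 1] = 1
    return adj


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """In-edge adjacency is the transpose of the out-edge adjacency."""
    return [list(row) for row in zip(*matrix)]


def gcn_layer(adj, feats, weight):
    """One step ReLU(D^-1 (A + I) H W) using plain Python lists."""
    n, d_in, d_out = len(adj), len(feats[0]), len(weight[0])
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j] or j == i]   # add self-loop
        agg = [sum(feats[j][k] for j in neigh) / len(neigh) for k in range(d_in)]
        out.append([max(0.0, sum(agg[k] * weight[k][c] for k in range(d_in)))
                    for c in range(d_out)])
    return out


A_out = out_edge_adjacency(ARCS, 7)
A_in = transpose(A_out)

# One-hot features and an identity weight make the aggregation easy to inspect:
# each output row is the average of the one-hot vectors of a node's neighbours.
identity = [[1 if i == j else 0 for j in range(7)] for i in range(7)]
hidden = gcn_layer(A_out, identity, identity)
for row in A_out:
    print(row)
print(hidden[1])   # representation of the HED word at position 2
```

In this toy setting the HED word at position 2 aggregates information from positions 1, 6, and 7 in a single step, even though positions 2 and 7 are far apart in the sentence. In a real model the feature matrix would be the BiLSTM output and the weight matrix would be learned.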
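Fig.7  Results of experiment group 1 (%)
Fig.8  Results of experiment group 2 (%)
Fig.9  Results of experiment group 3 (%)
Fig.10  Results of experiment group 4 (%)
Fig.11  Results of experiment group 5 (%)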
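No. | Term type | Precision | Recall | F1 score
1 | Person name (人名) | 85.87% | 89.15% | 87.48%
2 | Place (地点) | 84.33% | 93.97% | 88.89%
3 | Drama genre (剧种) | 90.86% | 78.07% | 83.98%
4 | Play title (剧目) | 97.24% | 95.48% | 96.35%
5 | Musical instrument (乐器) | 95.16% | 95.37% | 95.27%
6 | Vocal style and tune name (唱腔曲牌) | 93.83% | 95.95% | 94.88%
7 | Role type (脚色行当) | 97.63% | 98.93% | 98.27%
Table 7  Results of the BERT-pos-BiLSTM-GCN2-CRF model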
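
The F1 column of Table 7 can be recomputed from the precision and recall columns with the standard harmonic-mean formula F1 = 2PR / (P + R). The sketch below does this for each row. Because it starts from the rounded percentages printed in the table, the recomputed values can differ from the reported ones in the second decimal place. The English row names are translations introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 7.
TABLE_7 = {
    "person name": (85.87, 89.15, 87.48),
    "place": (84.33, 93.97, 88.89),
    "drama genre": (90.86, 78.07, 83.98),
    "play title": (97.24, 95.48, 96.35),
    "musical instrument": (95.16, 95.37, 95.27),
    "vocal style and tune name": (93.83, 95.95, 94.88),
    "role type": (97.63, 98.93, 98.27),
}

for name, (p, r, reported) in TABLE_7.items():
    recomputed = f1_score(p, r)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

The drama genre row stands out: its precision is high but its recall (78.07%) is the lowest in the table, which pulls its F1 below that of every other category.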
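No. | Term category | Newly added | Total | Examples
1 | Person name (人名) | 5 | 10,624 | 史若虚、鹤童
2 | Place (地点) | 20 | 2,594 | 潞城、平固
3 | Drama genre (剧种) | 13 | 175 | 泗州戏、淮戏
4 | Play title (剧目) | 2 | 9,216 | 桃园结义、崔子杀朝
5 | Musical instrument (乐器) | 3 | 490 | 小笛、低胡、鱼鼓
6 | Vocal style and tune name (唱腔曲牌) | 10 | 922 | 二慢板、三慢板
7 | Role type (脚色行当) | 1 | 125 | 小生
Table 8  Drama term base information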
Fig.12  Concepts and instances of drama terms
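Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn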