Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (12): 123-136    DOI: 10.11925/infotech.2096-3467.2021.0359
Extracting Drama Terms with GCN Long-distance Constrain
Ren Qiutong, Wang Hao, Xiong Xin, Fan Tao
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This study proposes a term extraction model for traditional drama, a category of intangible cultural heritage, and uses it to build a term database. [Methods] First, we analyzed the linguistic characteristics of drama texts in terms of term category, semantic structure, and text length. Then, we added part-of-speech and domain features to the character representations produced by the BERT-BiLSTM-CRF model. Finally, we incorporated a graph convolutional network (GCN) into the model to capture the constraint relationships between distant words. [Results] The proposed model reached an F1 score of 91.11%, which is 1.3 percentage points higher than the baseline BERT-BiLSTM-CRF model. [Limitations] The experimental data came only from Baidu Baike and the official website of Intangible Cultural Heritage; future work should include more free text from other sources, more categories of drama terms, and additional external features. [Conclusions] The proposed model and the resulting database of traditional drama terms support the construction of a knowledge graph for traditional drama.

Key words: Traditional Drama; Term Recognition; Graph Convolutional Network; Long-distance Constraint
Received: 13 April 2021      Published: 29 June 2021
ZTFLH:  G122  
Fund: National Natural Science Foundation of China (72074108); Nanjing University “Special Funds for Fundamental Scientific Research of Central Universities” Project (010814370113); Jiangsu Youth Social Science Talent Training Program
Corresponding Author: Wang Hao, ORCID: 0000-0002-0131-0823, E-mail: ywhaowang@nju.edu.cn

Cite this article:

Ren Qiutong, Wang Hao, Xiong Xin, Fan Tao. Extracting Drama Terms with GCN Long-distance Constrain. Data Analysis and Knowledge Discovery, 2021, 5(12): 123-136.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0359     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I12/123

Conceptual Map of Traditional Drama
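No. | Term category | Examples
1 | Person name (人名) | 潘康泉, 曹梅卿, 黄玉盛
2 | Place (地点) | 浙江, 杭州, 上饶
3 | Drama genre (剧种) | 新昌调腔, 京剧, 梆子戏
4 | Repertoire (剧目) | 《贵妃醉酒》, 《霸王别姬》, 《三国传》
5 | Musical instrument (乐器) | 鼓, 锣, 笛
6 | Vocal style and tune (唱腔曲牌) | 山坡羊, 跳板, 流水
7 | Role type (脚色行当) | 小生, 青衣, 花脸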
Chinese Intangible Heritage Traditional Drama Term
Text Length Statistics
No. | Character | Character label | POS tag | Domain feature tag
1 |  | B-剧种 | B-noun | 0
2 |  | I-剧种 | I-noun | 13
3 |  | O | O | 0
4 |  | B-人名 | B-noun | 0
5 |  | I-人名 | I-noun | 0
6 |  | O | O | 0
7 |  | O | B-verb | 0
Annotation Example
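The character labels in the annotation example follow a B/I/O pattern: `B-` marks the first character of a term, `I-` marks the remaining characters, and `O` marks characters outside any term. The following is a minimal sketch of that labeling step, not the authors' annotation tool. The function name `bio_tags`, the span format, and the sample sentence are assumptions made for illustration; the sentence is hypothetical and is only assembled from example terms listed in the term category table.

```python
def bio_tags(text, spans):
    """Return one B/I/O label per character of `text`.

    `spans` is a list of (start, end, category) tuples, with `end` exclusive.
    This span format is an assumption made for this sketch.
    """
    tags = ["O"] * len(text)
    for start, end, category in spans:
        tags[start] = f"B-{category}"      # first character of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"      # remaining characters of the term
    return tags


# Hypothetical sentence, not taken from the paper's corpus.
sentence = "京剧由潘康泉演"
spans = [(0, 2, "剧种"), (3, 6, "人名")]  # drama genre, person name

labels = bio_tags(sentence, spans)
for character, label in zip(sentence, labels):
    print(character, label)
```

Character-level labels of this kind avoid a dependency on Chinese word segmentation, which is one common reason for labeling at the character level in Chinese sequence tagging.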
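No. | Item | Training set | Test set
1 | Sentences | 21,205 | 2,410
2 | Total terms | 53,440 | 6,032
3 | Person name (人名) | 8,872 | 835
4 | Place (地点) | 11,090 | 1,204
5 | Drama genre (剧种) | 12,082 | 1,510
6 | Repertoire (剧目) | 9,658 | 1,277
7 | Musical instrument (乐器) | 4,214 | 425
8 | Vocal style and tune (唱腔曲牌) | 3,082 | 433
9 | Role type (脚色行当) | 4,442 | 348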
Data Set Sample Distribution
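As a quick consistency check on the sample distribution, the per-category counts can be summed and compared with the reported totals. The sketch below only restates the numbers from the table; the variable names are chosen here for illustration.

```python
# Per-category term counts copied from the data set table.
train_counts = {
    "人名": 8872, "地点": 11090, "剧种": 12082, "剧目": 9658,
    "乐器": 4214, "唱腔曲牌": 3082, "脚色行当": 4442,
}
test_counts = {
    "人名": 835, "地点": 1204, "剧种": 1510, "剧目": 1277,
    "乐器": 425, "唱腔曲牌": 433, "脚色行当": 348,
}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Share of all annotated terms that fall in the test set.
test_share = test_total / (train_total + test_total)

print("training terms:", train_total)
print("test terms:", test_total)
print(f"test share of terms: {test_share:.1%}")

# Per-category test share, to see how evenly the split treats each category.
for category, n_train in train_counts.items():
    n_test = test_counts[category]
    print(category, f"{n_test / (n_train + n_test):.1%}")
```

The category counts add up to the reported totals of 53,440 and 6,032 terms, and the test set holds roughly one tenth of all annotated terms.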
Overall Model Structure
BERT Output Coding Structure
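No. | Part of speech | Chinese name | Examples
1 | Noun | 名词 | 高腔, 新昌, 调腔
2 | Verb | 动词 | 是, 流传, 有
3 | Adjective | 形容词 | 大, 早, 旧
4 | Numeral | 数词 | 个, 十, 百
5 | Classifier | 量词 | 批, 年, 句
6 | Pronoun | 代词 | 你, 我, 他
7 | Preposition | 介词 | 于, 比, 被
8 | Time word | 时间词 | 公元前
9 | Noun of locality | 地方名词 | 之内, 之中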
Part-of-Speech Tags
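No. | Domain feature character | No. | Domain feature character
0 | Non-feature character | 8 | 
1 |  | 9 | 
2 |  | 10 | 
3 |  | 11 | 
4 |  | 12 | 
5 |  | 13 | 
6 |  | 14 | 
7 |  | 15 | 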
Domain Characters
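No. | Dependent | Dependent position | Head position | Dependency relation
1 | 高腔 | 1 | 2 | SBV
2 |  | 2 | 0 | HED
3 | 中国 | 3 | 4 | ATT
4 | 戏曲 | 4 | 5 | ATT
5 | 声腔 | 5 | 6 | ATT
6 | 之一 | 6 | 2 | VOB
7 |  | 7 | 2 | WP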
Dependency Syntax Analysis Results
Dependency Syntax Tree Example
Out-edge Adjacency Matrix
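The dependency analysis results list, for each word, its own position and the position of its head, which is enough to build an adjacency matrix for a GCN. The sketch below is one possible construction, not necessarily the one used in the paper. It assumes that an out-edge runs from the head word to its dependent (row = head, column = dependent) and that head position 0 denotes a virtual root that gets no row of its own. The function name `out_edge_adjacency` and the optional `self_loops` flag are introduced here for illustration.

```python
# (dependent position, head position, relation) rows copied from the
# dependency analysis table; positions are 1-based, 0 is the virtual root.
arcs = [
    (1, 2, "SBV"),
    (2, 0, "HED"),
    (3, 4, "ATT"),
    (4, 5, "ATT"),
    (5, 6, "ATT"),
    (6, 2, "VOB"),
    (7, 2, "WP"),
]


def out_edge_adjacency(arcs, n_tokens, self_loops=False):
    """Build an n_tokens x n_tokens 0/1 matrix with A[head][dependent] = 1.

    Assumption of this sketch: edges point from head to dependent, and the
    arc to the virtual root (head position 0) is skipped.
    """
    adj = [[0] * n_tokens for _ in range(n_tokens)]
    for dependent, head, _relation in arcs:
        if head == 0:
            continue  # the root arc has no head token inside the sentence
        adj[head - 1][dependent - 1] = 1
    if self_loops:
        for i in range(n_tokens):
            adj[i][i] = 1  # GCN variants often let each node keep its own features
    return adj


adjacency = out_edge_adjacency(arcs, n_tokens=7)
for row in adjacency:
    print(row)

# Under the same assumption, the in-edge matrix is the transpose.
in_adjacency = [list(column) for column in zip(*adjacency)]
```

With this convention, the row for position 2 (the head of the sentence) has ones at positions 1, 6, and 7. This is how a dependency-based graph lets the first and last tokens exchange information in a single GCN layer even though they are far apart in the character sequence.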
Results of the First Set of Experiments
Results of the Second Set of Experiments
Results of the Third Set of Experiments
Results of the Fourth Set of Experiments
Results of the Fifth Set of Experiments
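No. | Term category | Precision | Recall | F1 score
1 | Person name (人名) | 85.87% | 89.15% | 87.48%
2 | Place (地点) | 84.33% | 93.97% | 88.89%
3 | Drama genre (剧种) | 90.86% | 78.07% | 83.98%
4 | Repertoire (剧目) | 97.24% | 95.48% | 96.35%
5 | Musical instrument (乐器) | 95.16% | 95.37% | 95.27%
6 | Vocal style and tune (唱腔曲牌) | 93.83% | 95.95% | 94.88%
7 | Role type (脚色行当) | 97.63% | 98.93% | 98.27%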
BERT-pos-BiLSTM-GCN2-CRF Model Results
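The F1 score in each row is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall as a sanity check. The helper name `f1_score` and the data layout are chosen here for illustration; the numbers are copied from the results table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per term category, in percent.
reported = {
    "人名": (85.87, 89.15, 87.48),
    "地点": (84.33, 93.97, 88.89),
    "剧种": (90.86, 78.07, 83.98),
    "剧目": (97.24, 95.48, 96.35),
    "乐器": (95.16, 95.37, 95.27),
    "唱腔曲牌": (93.83, 95.95, 94.88),
    "脚色行当": (97.63, 98.93, 98.27),
}

differences = {}
for category, (p, r, f1_reported) in reported.items():
    f1_recomputed = f1_score(p, r)
    differences[category] = abs(f1_recomputed - f1_reported)
    print(category, f"recomputed {f1_recomputed:.2f}", f"reported {f1_reported:.2f}")

max_difference = max(differences.values())
```

The recomputed values agree with the reported ones to within 0.01 percentage points; differences of that size are what one would expect when precision and recall were themselves rounded before publication. The drama genre (剧种) row stands out because its recall (78.07%) is well below its precision (90.86%), which pulls its F1 score down to the lowest of the seven categories.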
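No. | Term category | New terms | Total | Examples
1 | Person name (人名) | 5 | 10,624 | 史若虚, 鹤童
2 | Place (地点) | 20 | 2,594 | 潞城, 平固
3 | Drama genre (剧种) | 13 | 175 | 泗州戏, 淮戏
4 | Repertoire (剧目) | 2 | 9,216 | 桃园结义, 崔子杀朝
5 | Musical instrument (乐器) | 3 | 490 | 小笛, 低胡, 鱼鼓
6 | Vocal style and tune (唱腔曲牌) | 10 | 922 | 二慢板, 三慢板
7 | Role type (脚色行当) | 1 | 125 | 小生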
New Terms in the Drama Glossary
Concepts and Examples of Drama Terms