Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 10-25    DOI: 10.11925/infotech.2096-3467.2020.1230
Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation
Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper compares the performance of entity recognition models on legal judgments, with the aim of supporting the construction of better legal knowledge bases. [Methods] First, we extracted the trial-process and court-opinion sections from criminal judgment texts to build an experimental dataset. Then, we compared the entity recognition results of the CRFs model (with manually constructed features), the IDCNN-CRFs model (with automatically generated features), and the BiLSTM-CRFs model. Both the IDCNN-CRFs and BiLSTM-CRFs models used pre-trained word vectors for their character embeddings. We also compared how well the models transfer to other types of legal judgment texts. [Results] The ALBERT-BiLSTM-CRFs model achieved the best recognition performance, with a micro-averaged F1 of 95.28%. However, the IDCNN-CRFs model needed only about 1/6 of the training time of the ALBERT-BiLSTM-CRFs model. Both models transferred well to other judgment types. [Limitations] Most of the recognized entities were general-purpose types. Future studies should cover more domain-specific entities to increase the models’ practical value. [Conclusions] The ALBERT-BiLSTM-CRFs and IDCNN-CRFs models recognize entities in legal judgments more effectively, and transfer better, than the CRFs model.

Key words: Legal Judgments; Feature Generation; CRFs; IDCNN-CRFs; ALBERT-BiLSTM-CRFs
Received: 07 December 2020      Published: 11 August 2021
ZTFLH (Chinese Library Classification): TP393
Fund: National Natural Science Foundation of China (72074108); Youth Interdisciplinary Team of Liberal Arts in Nanjing University (2020300093); Jiangsu Young Talents in Social Sciences; Tang Scholar of Nanjing University
Corresponding Author: Lin Kerou, ORCID: 0000-0003-0026-8771, E-mail: keroulin@foxmail.com

Cite this article:

Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei. Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation. Data Analysis and Knowledge Discovery, 2021, 5(7): 10-25.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1230     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I7/10

Research Framework
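No. | Entity type | Label | Examples
1 | Crime name | Crime | 交通肇事罪 (crime of causing a traffic accident)
2 | Person name | Per | 张××, 程某甲 (anonymized personal names)
3 | Place name | Loc | 临清市 (Linqing City), 永馆路 (Yongguan Road)
4 | Organization name | Org | 淮安区人民检察院 (Huai'an District People's Procuratorate), 淮安市公安局淮安分局物证鉴定室 (Physical Evidence Identification Office, Huai'an Branch, Huai'an Municipal Public Security Bureau)
5 | Date | Date | 2016年2月11日 (11 February 2016)
6 | Time | Time | 上午10时30分 (10:30 a.m.), 晚 (evening)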
Entity Type and Corresponding Examples
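No. | Entity type | Training tokens | Training entities (with duplicates) | Training entities (unique) | Test tokens | Test entities (with duplicates) | Test entities (unique)
1 | Crime | 3,449 | 742 | 58 | 400 | 84 | 17
2 | Date | 15,560 | 1,815 | 977 | 1,420 | 171 | 127
3 | Loc | 9,964 | 3,019 | 1,264 | 611 | 198 | 110
4 | Org | 15,612 | 1,888 | 994 | 1,412 | 169 | 107
5 | Per | 33,708 | 12,625 | 1,348 | 2,942 | 1,137 | 194
6 | Time | 2,825 | 831 | 216 | 205 | 67 | 30
7 | O | 307,167 | - | - | 25,028 | - | -
Number of Entities and Tokens in the Training Set and Test Set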
The Proportion of Non-Repetitive Entities in Training Set and Test Set
The Proportion of New Entities in Test Set
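The two proportions named in these captions can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the "non-repetitive" proportion is the number of distinct entities divided by the number of entity mentions, and that the "new entity" proportion is taken over distinct test-set entities; the page does not state either definition. The function names and the toy entity lists are hypothetical.

```python
from typing import Iterable


def proportion_unique(total_mentions: int, unique_entities: int) -> float:
    """Share of entity mentions that remain after de-duplication (assumed definition)."""
    return unique_entities / total_mentions


def proportion_new(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Share of distinct test-set entities that never occur in the training set."""
    seen = set(train_entities)
    distinct_test = set(test_entities)
    return len(distinct_test - seen) / len(distinct_test)


# Crime counts for the training set, taken from the dataset table:
# 742 mentions, 58 distinct entities.
crime_train_ratio = proportion_unique(742, 58)
print(f"Crime, training set: {crime_train_ratio:.2%}")

# Hypothetical toy entity lists, for illustration only.
toy_train = ["交通肇事罪", "盗窃罪", "盗窃罪"]
toy_test = ["盗窃罪", "诈骗罪"]
print(f"New entities in toy test set: {proportion_new(toy_train, toy_test):.0%}")
```

A low unique-to-mention ratio, as for Crime, suggests a small closed vocabulary that a model can largely memorize, whereas types with many distinct entities depend more on context features.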
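No. | Label | Meaning | Example
1 | B | First character of an entity | The characters “晚” and “豫”
2 | I | Any character of an entity other than the first | “上9时30分”
3 | O | Non-entity character | All other characters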
Explanation of the BIO Label System
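No. | Label | Meaning | Example
1 | B | First character of an entity | The character “晚”
2 | I | Middle characters of an entity | “上9时30”
3 | E | Last character of an entity | The character “分”
4 | S | Entity consisting of a single character | The character “豫”
5 | O | Non-entity character | All other characters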
Explanation of the BIEOS Label System
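The relationship between the two label systems can be made concrete with a small converter. The sketch below is an assumption-laden illustration rather than the authors' preprocessing: it assumes tags are written as `B-Time`, `I-Time`, and so on (the page lists only the prefixes and the entity labels separately), and the function name `bio_to_bieos` is hypothetical.

```python
from typing import List


def bio_to_bieos(tags: List[str]) -> List[str]:
    """Convert a BIO tag sequence into the BIEOS scheme.

    Assumes tags look like "B-Time" / "I-Time" / "O" (format is an assumption).
    """
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag of the same type.
        continues = next_tag == f"I-{entity_type}"
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + entity_type)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + entity_type)
    return converted


# "晚上9时30分" is a seven-character Time entity; "豫" is a one-character Loc entity.
chars = list("晚上9时30分") + ["，"] + ["豫"]
bio = ["B-Time"] + ["I-Time"] * 6 + ["O", "B-Loc"]
for ch, old, new in zip(chars, bio, bio_to_bieos(bio)):
    print(ch, old, new)
```

Under BIEOS the final character “分” becomes `E-Time` and the single character “豫” becomes `S-Loc`, matching the examples in the two tables.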
Design of CRFs Model
IDCNN-CRFs Model Structure
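No. | Parameter | Default value
1 | Batch Size | 32
2 | Epoch | 100
3 | dropout_keep | 0.5
4 | Character embedding dimension | 100
5 | Number of filters | 100
6 | Gradient Clip | 5
7 | Learning rate | 0.001
8 | Optimizer | Adam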
IDCNN-CRFs Model Default Parameters
Structure of BiLSTM-CRFs Embedded by Random Numbers and BERT-BiLSTM-CRFs
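No. | Parameter | Default value
1 | Batch Size | 64
2 | Epoch | 100
3 | LSTM layer units | 128
4 | LSTM layer return_sequences | TRUE
5 | Fully connected layer units | 64
6 | Fully connected layer activation function | tanh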
BiLSTM Model Default Parameters
ALBERT-BiLSTM-CRFs Model Structure
No. | Entity type | BIO P | BIO R | BIO F1 | BIEOS P | BIEOS R | BIEOS F1
1 | Crime | 98.80% | 97.62% | 98.20% | 98.78% | 96.43% | 97.59%
2 | Date | 97.48% | 90.64% | 93.94% | 92.55% | 87.13% | 89.76%
3 | Loc | 81.28% | 76.77% | 78.96% | 75.89% | 68.55% | 72.03%
4 | Org | 89.13% | 72.78% | 80.13% | 87.32% | 73.37% | 79.74%
5 | Per | 96.03% | 93.58% | 94.79% | 94.75% | 92.00% | 93.35%
6 | Time | 100.00% | 89.55% | 94.49% | 98.46% | 95.52% | 96.97%
Experimental Results of Different Label Systems
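For readers checking the P, R and F1 columns, the standard entity-level definitions can be sketched as below. The gold count of 84 is the number of Crime entities in the test set from the dataset table; the true-positive and prediction counts (82 and 83) are not given on this page and are assumed values chosen only because they reproduce the reported BIO scores for Crime. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, n_predicted: int, n_gold: int) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 from raw counts."""
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_gold if n_gold else 0.0
    # F1 = 2PR / (P + R), which simplifies to 2*TP / (predicted + gold).
    denominator = n_predicted + n_gold
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1


# n_gold=84 comes from the dataset table; tp=82 and n_predicted=83 are ASSUMED
# counts that happen to match the reported Crime scores under the BIO scheme.
p, r, f1 = precision_recall_f1(tp=82, n_predicted=83, n_gold=84)
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

Computing F1 from counts rather than from the rounded percentages avoids small rounding discrepancies in the last digit.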
The F1 Values of Different Feature Experiments on CRFs Model
The F1 Values of Different Batch Size Experiments on IDCNN-CRFs Model When Epoch=1
The F1 Values of Different Batch Size Experiments on IDCNN-CRFs Model When Epoch=10
No. | Entity type | Epoch = 1 | Epoch = 10 | Epoch = 20
1 | Crime | 48.10% | 100.00% | 97.62%
2 | Date | 85.96% | 93.29% | 96.49%
3 | Loc | 68.85% | 85.20% | 84.16%
4 | Org | 43.48% | 81.85% | 77.91%
5 | Per | 93.34% | 95.08% | 95.47%
6 | Time | 76.39% | 99.25% | 94.49%
The F1 Values of Different Epoch Experiments on IDCNN-CRFs Model When Batch Size=8
No. | Entity type | Dropout = 0.4 | Dropout = 0.5 | Dropout = 0.6 | Dropout = 0.7 | Dropout = 0.8
1 | Crime | 98.80% | 100.00% | 99.40% | 100.00% | 97.59%
2 | Date | 93.49% | 93.29% | 93.53% | 93.49% | 94.15%
3 | Loc | 84.97% | 85.20% | 86.15% | 87.11% | 85.27%
4 | Org | 78.79% | 81.85% | 81.93% | 81.08% | 79.17%
5 | Per | 94.08% | 95.08% | 95.02% | 95.78% | 96.27%
6 | Time | 97.71% | 99.25% | 96.97% | 94.49% | 92.19%
The F1 Values of Different Dropout Experiments on IDCNN-CRFs Model When Batch Size=8 and Epoch=10
No. | Entity type | Epoch = 1 | Epoch = 3 | Epoch = 5
1 | Crime | 81.25% | 96.89% | 97.44%
2 | Date | 86.53% | 92.04% | 94.05%
3 | Loc | 75.63% | 76.59% | 80.20%
4 | Org | 67.68% | 75.57% | 77.17%
5 | Per | 93.29% | 92.79% | 94.43%
6 | Time | 95.52% | 98.49% | 99.26%
7 | Micro Avg | 88.38% | 89.78% | 91.58%
The F1 Values of Different Epoch Experiments on BiLSTM-CRFs Model Embedded by Random Numbers
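The "Micro Avg" row pools counts over all entity types before computing F1, so frequent types such as Per dominate it. The sketch below contrasts micro- and macro-averaging under that standard definition; the page does not show the authors' evaluation code, and the counts in `toy_counts` are hypothetical values for illustration, not results from the paper.

```python
from typing import Dict, Tuple

# (true positives, predicted entities, gold entities) per entity type
Counts = Tuple[int, int, int]


def f1_from_counts(tp: int, n_predicted: int, n_gold: int) -> float:
    """F1 expressed directly in counts: 2*TP / (predicted + gold)."""
    denominator = n_predicted + n_gold
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(per_type: Dict[str, Counts]) -> float:
    """Pool the counts of all types, then compute a single F1."""
    tp = sum(c[0] for c in per_type.values())
    n_predicted = sum(c[1] for c in per_type.values())
    n_gold = sum(c[2] for c in per_type.values())
    return f1_from_counts(tp, n_predicted, n_gold)


def macro_f1(per_type: Dict[str, Counts]) -> float:
    """Compute F1 per type, then take the unweighted mean."""
    return sum(f1_from_counts(*c) for c in per_type.values()) / len(per_type)


# Hypothetical counts: one frequent, well-recognized type and one rare, hard type.
toy_counts = {"Per": (8, 10, 10), "Org": (1, 2, 4)}
print(f"micro F1 = {micro_f1(toy_counts):.4f}")
print(f"macro F1 = {macro_f1(toy_counts):.4f}")
```

In this toy case the micro average sits closer to the score of the frequent type, which is one reason a high micro-averaged F1 can coexist with noticeably weaker Loc and Org scores.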
No. | Epoch | loss | acc | val_loss | val_acc
1 | 1 | 11.0966 | 97.56% | 209.3262 | 98.33%
2 | 2 | 1.9536 | 99.46% | 198.5077 | 98.44%
3 | 3 | 1.1717 | 99.63% | 188.8806 | 98.16%
4 | 4 | 0.8223 | 99.72% | 178.6167 | 98.44%
5 | 5 | 0.5982 | 99.78% | 170.9082 | 98.19%
The Change of loss and acc During the Iteration When Epoch=5
No. | Epoch | loss | acc | val_loss | val_acc
1 | 1 | 11.6478 | 97.39% | 202.4137 | 98.93%
2 | 2 | 1.9299 | 99.42% | 192.3672 | 99.12%
3 | 3 | 1.1491 | 99.61% | 183.2427 | 99.16%
The Change of loss and acc During the Iteration When Epoch=3
No. | Entity type | Random-number embedding | BERT embedding | ALBERT embedding
1 | Crime | 96.89% | 100.00% | 100.00%
2 | Date | 92.04% | 91.01% | 93.53%
3 | Loc | 76.59% | 88.38% | 87.28%
4 | Org | 75.57% | 80.97% | 83.54%
5 | Per | 92.79% | 99.02% | 98.48%
6 | Time | 98.49% | 91.18% | 99.25%
7 | Micro Avg | 89.78% | 95.17% | 95.28%
The F1 Values of Different Word Embedding Experiments on BiLSTM-CRFs Model
The F1 Values of the Three Types of Models
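No. | Model | Result type (entity counts) | Crime | Date | Loc | Org | Per | Time
1 | CRFs | TP + FP | 1 | 8 | 16 | 6 | 30 | 1
1 | CRFs | TP | 1 | 8 | 14 | 3 | 29 | 1
2 | IDCNN-CRFs | TP + FP | 3 | 8 | 15 | 8 | 32 | 1
2 | IDCNN-CRFs | TP | 3 | 8 | 14 | 4 | 31 | 1
3 | ALBERT-BiLSTM-CRFs | TP + FP | 3 | 8 | 17 | 5 | 36 | 3
3 | ALBERT-BiLSTM-CRFs | TP | 3 | 8 | 15 | 4 | 26 | 1
- | Total | - | 7 | 8 | 19 | 5 | 32 | 1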
Comparison of Experimental Results of New Data on Different Models