Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation
Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei
School of Information Management, Nanjing University, Nanjing 210023, China; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This paper evaluates entity recognition models on legal judgments, with the aim of supporting the construction of better legal knowledge bases. [Methods] First, we extracted the court trial process and court opinion sections from criminal judgment texts to build an experimental dataset. Then, we compared the entity recognition results of the CRFs model (with manually constructed features), the IDCNN-CRFs model (with automatically generated features), and the BiLSTM-CRFs model. Both the IDCNN-CRFs and BiLSTM-CRFs models used pre-trained word vectors for their character embeddings. We also compared how well the models transfer to other types of legal judgment texts. [Results] The ALBERT-BiLSTM-CRFs model achieved the best recognition performance, with a micro-averaged F1 of 95.28%. However, training the IDCNN-CRFs model took only about 1/6 of the time required by the ALBERT-BiLSTM-CRFs model. Both models transferred well to other judgment types. [Limitations] Most of the recognized entities were general-purpose types. Future studies need more domain-specific entities to enhance the models' practical value. [Conclusions] The ALBERT-BiLSTM-CRFs and IDCNN-CRFs models recognize entities in legal judgments more effectively and transfer better than the CRFs model.
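The micro-averaged F1 reported in the results pools true positives, false positives, and false negatives across all entity types before computing precision and recall. The abstract does not state the exact evaluation protocol, so the following is only a minimal sketch under assumed conditions: BIO tagging, exact-match entity spans, and hypothetical label names (`PER`, `LOC`, `ORG`) chosen for illustration.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold_seqs, pred_seqs):
    """Entity-level micro-averaged F1 with exact span and type matching."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)   # spans predicted with the right boundaries and type
        fp += len(p - g)   # predicted spans absent from the gold annotation
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three gold entities, two of them predicted correctly.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O"]]
pred = [["B-PER", "I-PER", "O", "O"], ["B-ORG", "O"]]
print(round(micro_f1(gold, pred), 4))  # 0.8
```

Because the counts are pooled, frequent entity types dominate the score; a macro average would instead weight every type equally.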
Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei. Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation[J]. Data Analysis and Knowledge Discovery, 2021, 5(7): 10-25.