Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (6): 66-74    DOI: 10.11925/infotech.2096-3467.2018.1226
Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model
Han Huang1, Hongyu Wang2,3, Xiaoguang Wang2,3
1(School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China)
2(Center for Studies of Information Resource, Wuhan University, Wuhan 430072, China)
3(School of Information Management, Wuhan University, Wuhan 430072, China)

[Objective] This paper aims to identify legal terminology automatically in large-scale legal texts, as a step toward structuring legal big data. [Methods] We used a Conditional Random Field (CRF) model as the classifier within an Active Learning algorithm to identify legal terms. After clustering the corpus with K-means, we used stratified sampling to extract the initial sample list that starts the Active Learning algorithm. Entropy served as the criterion for sample selection. Learning and sample selection were repeated iteratively until the model's F value (the harmonic mean of precision and recall) stabilized, which yielded the legal-domain entity recognition model (AL-CRF). [Results] On Chinese judgment documents, the precision and recall of the AL-CRF model both exceeded 90%, and its F value was 4.85% higher than that of a CRF model trained with an equal labeling workload. [Limitations] K-means clustering is sensitive to noise and outliers, which may affect the model's performance. [Conclusions] Combining conditional random fields with active learning can reduce the labeling workload spent on low-quality samples while maintaining recognition quality.
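
The abstract gives no implementation details, so the following is only a minimal Python sketch of the loop's bookkeeping under stated assumptions: a stratified seed drawn from cluster assignments, entropy-based sample selection, and a check that the F value has stabilized. All function names, the `fraction`, `window`, and `tol` parameters, and the use of mean token entropy as the sentence score are illustrative assumptions, not the authors' method. The CRF itself is left out; the sketch assumes some sequence labeler supplies per-token label distributions for each unlabeled sentence.

```python
import math
import random
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(marginals):
    """Mean token entropy of a sentence (an assumed scoring rule).

    `marginals` is a list with one label distribution per token.
    """
    if not marginals:
        return 0.0
    return sum(token_entropy(p) for p in marginals) / len(marginals)


def stratified_seed(cluster_ids, fraction, rng):
    """Draw the same fraction of sentence indices from every cluster.

    `cluster_ids[i]` is the cluster label of sentence i, for example
    as produced by a K-means run over sentence vectors.
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    seed = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        k = max(1, round(len(members) * fraction))
        seed.extend(rng.sample(members, k))
    return sorted(seed)


def select_batch(pool_marginals, batch_size):
    """Return the ids of the `batch_size` most uncertain pool sentences.

    `pool_marginals` maps a sentence id to its per-token distributions.
    """
    ranked = sorted(
        pool_marginals,
        key=lambda sid: sentence_uncertainty(pool_marginals[sid]),
        reverse=True,
    )
    return ranked[:batch_size]


def f_value(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def has_stabilized(f_history, window=3, tol=0.001):
    """True once the last `window` F values span less than `tol`."""
    if len(f_history) < window:
        return False
    recent = f_history[-window:]
    return max(recent) - min(recent) < tol
```

In a full loop, one would train the labeler on the seed, score the remaining pool, send the `select_batch` result to annotators, retrain, append the new F value to the history, and stop once `has_stabilized` returns true.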

Key words: Legal Text; Named Entity Recognition; Active Learning; Conditional Random Field; Sample Selection
Received: 06 November 2018      Published: 15 August 2019

Cite this article:

Han Huang, Hongyu Wang, Xiaoguang Wang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model. Data Analysis and Knowledge Discovery, 2019, 3(6): 66-74.

