Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model
Han Huang1, Hongyu Wang2,3, Xiaoguang Wang2,3
1(School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China) 2(Center for Studies of Information Resource, Wuhan University, Wuhan 430072, China) 3(School of Information Management, Wuhan University, Wuhan 430072, China)
[Objective] This paper aims to identify legal terminology automatically in large-scale legal texts, as a step toward structuring legal big data. [Methods] We used a Conditional Random Field (CRF) model as the classifier within an Active Learning algorithm to identify legal terms. The corpus was first clustered with K-means, and stratified sampling over the clusters yielded the initial sample set used to start the Active Learning algorithm. Entropy served as the criterion for sample selection. Model training and sample selection were repeated iteratively until the F value (the harmonic mean of precision and recall) stabilized, producing the final legal-domain entity recognition model (AL-CRF). [Results] On Chinese judgment documents, the precision and recall of the AL-CRF model both exceeded 90%, and its F value was 4.85% higher than that of a CRF model trained with an equal labeling workload. [Limitations] K-means clustering is sensitive to noise and outliers, which may affect the performance of the model. [Conclusions] Combining conditional random fields with active learning can reduce the labeling workload spent on low-quality samples while maintaining recognition quality.
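The sampling steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only says that entropy guides sample selection, so scoring a sentence by its mean token entropy is an assumption, as are the function names, the `window` and `tol` stopping parameters, and the input format (per-token label distributions of the kind a trained CRF can output as marginals, and precomputed cluster assignments standing in for the K-means step).

```python
import math
import random
from collections import defaultdict


def token_entropy(dist):
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def sentence_uncertainty(marginals):
    """Mean token entropy of a sentence (assumed aggregation, not from the paper).

    `marginals` is a list with one {label: probability} dict per token.
    """
    if not marginals:
        return 0.0
    return sum(token_entropy(d) for d in marginals) / len(marginals)


def select_most_uncertain(pool_marginals, k):
    """Return the ids of the k unlabeled sentences with the highest uncertainty."""
    ranked = sorted(
        pool_marginals,
        key=lambda sid: sentence_uncertainty(pool_marginals[sid]),
        reverse=True,
    )
    return ranked[:k]


def stratified_seed(cluster_of, fraction, rng):
    """Draw the same fraction from every cluster, with at least one item each.

    `cluster_of` maps a sentence id to its cluster label (e.g. from K-means).
    """
    by_cluster = defaultdict(list)
    for sid, cluster in cluster_of.items():
        by_cluster[cluster].append(sid)
    seed = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])
        n = max(1, round(fraction * len(members)))
        seed.extend(rng.sample(members, n))
    return seed


def f_value_stabilized(history, window=3, tol=0.005):
    """True once the last `window` F values differ by less than `tol`.

    Both thresholds are hypothetical; the abstract gives no stopping rule details.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol
```

In a full loop, the seed from `stratified_seed` would be labeled and used to train the first CRF, each round would label the sentences returned by `select_most_uncertain` and retrain, and the loop would end once `f_value_stabilized` returns true on the recorded F values.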
Han Huang, Hongyu Wang, Xiaoguang Wang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(6): 66-74. DOI: 10.11925/infotech.2096-3467.2018.1226.