Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue (6): 66-74    DOI: 10.11925/infotech.2096-3467.2018.1226
Research Paper
Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model*
Han Huang1, Hongyu Wang2,3, Xiaoguang Wang2,3
1(School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China)
2(Center for Studies of Information Resource, Wuhan University, Wuhan 430072, China)
3(School of Information Management, Wuhan University, Wuhan 430072, China)
Abstract

[Objective] This paper aims to automatically identify legal terminology in large-scale legal texts and thereby advance the structuring of legal big data. [Methods] We used a Conditional Random Field (CRF) model as the classifier within an Active Learning algorithm. After clustering the corpus with K-means, we drew the initial samples used to start the Active Learning algorithm by stratified sampling. Entropy served as the criterion for sample selection. The learning and sample-selection steps were iterated until the model's F value (the harmonic mean of precision and recall) stabilized, at which point the final legal terminology recognition model (AL-CRF) was output. [Results] In named entity recognition experiments on Chinese judgment documents, the AL-CRF model, trained on a small number of high-quality samples, reached precision and recall above 90%, and its F value was 4.85% higher than that of a CRF model trained with an equal labeling workload. [Limitations] K-means clustering is sensitive to noise and outliers, which may affect the model's recognition performance. [Conclusions] Combining Conditional Random Fields with Active Learning can reduce the workload of labeling low-quality samples while maintaining recognition quality.
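
The entropy-based sample selection and the "iterate until the F value stabilizes" stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how entropy is aggregated over a sentence or how stability is judged, so the mean-over-tokens aggregation, the `window` and `tol` parameters, and all function names below are assumptions made for illustration. The per-token label distributions stand in for the marginals a trained CRF would supply, and the K-means and stratified-sampling seeding step is not covered.

```python
import math
from typing import Dict, List, Sequence

# Per-token label distribution, e.g. {"B": 0.7, "I": 0.2, "O": 0.1}.
# Assumption: in practice these would be marginals from a trained CRF.
TokenDist = Dict[str, float]


def token_entropy(dist: TokenDist) -> float:
    """Shannon entropy (in nats) of one token's label distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def sentence_entropy(marginals: Sequence[TokenDist]) -> float:
    """Mean token entropy of a sentence.

    Averaging instead of summing is an assumption, chosen so that long
    sentences are not favoured merely for their length.
    """
    if not marginals:
        return 0.0
    return sum(token_entropy(d) for d in marginals) / len(marginals)


def select_most_uncertain(pool: Dict[str, Sequence[TokenDist]], k: int) -> List[str]:
    """Return the ids of the k unlabeled sentences with the highest entropy."""
    ranked = sorted(pool, key=lambda sid: sentence_entropy(pool[sid]), reverse=True)
    return ranked[:k]


def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def has_stabilized(f_history: Sequence[float], window: int = 3, tol: float = 0.005) -> bool:
    """True once the last `window` F values differ by at most `tol`.

    Both `window` and `tol` are hypothetical defaults, not values from the paper.
    """
    if len(f_history) < window:
        return False
    recent = f_history[-window:]
    return max(recent) - min(recent) <= tol


if __name__ == "__main__":
    # Toy pool: s1 is confidently labeled, s2 is close to uniform, s3 is in between.
    pool = {
        "s1": [{"B": 0.98, "I": 0.01, "O": 0.01}, {"B": 0.01, "I": 0.98, "O": 0.01}],
        "s2": [{"B": 0.40, "I": 0.30, "O": 0.30}, {"B": 0.34, "I": 0.33, "O": 0.33}],
        "s3": [{"B": 0.70, "I": 0.20, "O": 0.10}, {"B": 0.10, "I": 0.10, "O": 0.80}],
    }
    print(select_most_uncertain(pool, k=2))
    print(has_stabilized([0.71, 0.80, 0.861, 0.863, 0.862]))
```

In a full loop, each round would send the selected sentences for manual labeling, retrain the CRF on the enlarged labeled set, append the new F value to the history, and stop once `has_stabilized` returns true.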

Key words: Legal Text; Named Entity Recognition; Active Learning; Conditional Random Field; Sample Selection
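Received: 2018-11-06
Funding: *This work is one of the research outcomes of the Major Project of Key Research Institutes of Humanities and Social Sciences of the Ministry of Education, "Semantic Representation and Organization of Big Data Resources: Oriented to the Cultural Heritage Domain" (Grant No. 16JJD870002); the General Program of the National Natural Science Foundation of China, "Detecting Emerging Disciplinary Trends Based on Large-Scale Open Scientific Knowledge Graphs" (Grant No. 71874129); and the Graduate Innovation Fund of Zhongnan University of Economics and Law, "Research on Named Entity Recognition Based on Active Learning" (Grant No. 201811409).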
Cite this article:
Han Huang, Hongyu Wang, Xiaoguang Wang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(6): 66-74. DOI: 10.11925/infotech.2096-3467.2018.1226.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1226