Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (11): 46-52     https://doi.org/10.11925/infotech.2096-3467.2017.0442
Research Paper
Automatic Recognition of Legal Language Entities Based on Conditional Random Fields*
Zhang Lin1, Qin Ce2, Ye Wenhao1
1 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2 School of Law, Nanjing Normal University, Nanjing 210023, China
Abstract

[Objective] Automatic recognition of Legal Language Entities is important groundwork for text mining of court judgements. [Methods] We collected judgement documents with a web crawler and annotated the corpus manually. We segmented the corpus with NLPIR after loading a legal-domain dictionary, then built conditional random field (CRF) feature templates that combine the internal and external features of legal language, and used them to recognize the Legal Language Entities in the corpus automatically. [Results] The CRF model that incorporates both internal and external features recognized Legal Language Entities well: its F-measure (the harmonic mean of precision and recall) exceeded 90%. [Limitations] The model is limited in how well it extends to other domains. [Conclusions] Automatic extraction of Legal Language Entities with conditional random fields is feasible.

Key words: Judgements    Conditional Random Field Model    Legal Language Entity
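Received: 2017-05-19      Published: 2017-11-27
CLC Number (ZTFLH):  G350
Funding: *This work is one of the research outcomes of the National Social Science Fund of China project "Research on Judicial Responses to the Public's Moral Needs in the Transition Period" (Grant No. 13BFX006).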
Cite this article:
Zhang Lin, Qin Ce, Ye Wenhao. Automatic Recognition of Legal Language Entities Based on Conditional Random Fields[J]. Data Analysis and Knowledge Discovery, 2017, 1(11): 46-52.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0442      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I11/46
Entity length   Count     Entity length   Count     Entity length   Count     Entity length   Count
2               39,803    7               1,210     12              93        17              25
3               23,017    8               444       13              59        18              4
4               26,555    9               309       14              41        19              19
5               6,488     10              316       15              26        20              1
6               1,671     11              22        16              25        21              4
  Length distribution of Legal Language Entities
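To see how strongly the entities concentrate at short lengths, the counts in the length distribution above can be tallied directly. The following is only an illustrative sketch: the names `length_counts` and `share_up_to` are made up for this example, and the figures are copied from the table.

```python
# Entity length -> number of entities, copied from the length distribution table.
length_counts = {
    2: 39803, 3: 23017, 4: 26555, 5: 6488, 6: 1671,
    7: 1210, 8: 444, 9: 309, 10: 316, 11: 22,
    12: 93, 13: 59, 14: 41, 15: 26, 16: 25,
    17: 25, 18: 4, 19: 19, 20: 1, 21: 4,
}

# Total number of annotated entities across all lengths.
total = sum(length_counts.values())


def share_up_to(max_length: int) -> float:
    """Fraction of all entities whose length is at most max_length."""
    covered = sum(n for length, n in length_counts.items() if length <= max_length)
    return covered / total


if __name__ == "__main__":
    print(f"total entities: {total}")
    for limit in (4, 6, 10):
        print(f"length <= {limit}: {share_up_to(limit):.2%}")
```

By this tally, entities of length 2 to 4 account for roughly 89% of all entities in the table.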
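Left boundary words         Right boundary words
Word length   Frequency     Word length   Frequency
1             17.57%        1             29.82%
2             81.52%        2             63.28%
3             0.68%         3             6.07%
4             0.22%         4             0.83%
  Word-length distribution of the left and right boundary words of entities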
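Word                      POS   Word length   Entity word?   Left boundary?   Right boundary?   Tag
作案 (commit a crime)     vi    2             Y              Y                Y                 S
(missing)                 ng    1             N              N                N                 S
具备 (possess)            v     2             N              N                N                 S
刑事 (criminal)           b     2             Y              Y                N                 B
责任 (responsibility)     n     2             Y              N                N                 M
能力 (capacity)           n     2             Y              N                Y                 E
,                         wd    1             N              N                N                 S
应予 (should be)          v     2             N              N                N                 S
严惩 (severely punished)  v     2             N              N                N                 S
  Sample of the preprocessed judgement corpus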
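The sample suggests how the tag column follows from the three Y/N columns. Below is a minimal sketch of that mapping, inferred only from the rows shown. The function name `position_tag` is hypothetical, and the authors' actual preprocessing code is not given on this page.

```python
def position_tag(is_entity: bool, is_left: bool, is_right: bool) -> str:
    """Map the Y/N columns of the sample to a B/M/E/S tag.

    Rules inferred from the sample rows only (an assumption, not the
    authors' documented procedure):
      - a non-entity word is tagged S
      - an entity word that is both left and right boundary is tagged S
      - left boundary only -> B, right boundary only -> E, neither -> M
    """
    if not is_entity:
        return "S"
    if is_left and is_right:
        return "S"
    if is_left:
        return "B"
    if is_right:
        return "E"
    return "M"


# Rows from the sample: (word, entity word?, left boundary?, right boundary?, tag in the table).
SAMPLE = [
    ("作案", True, True, True, "S"),
    ("具备", False, False, False, "S"),
    ("刑事", True, True, False, "B"),
    ("责任", True, False, False, "M"),
    ("能力", True, False, True, "E"),
    ("应予", False, False, False, "S"),
    ("严惩", False, False, False, "S"),
]

if __name__ == "__main__":
    for word, ent, left, right, expected in SAMPLE:
        print(word, position_tag(ent, left, right), "table:", expected)
```

Note that under this reading the S tag alone does not separate single-word entities from non-entity words; the "Entity word?" column carries that distinction.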
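No.   Template               Meaning
1     %x[-2, 0]              The word two positions before the current word
2     %x[-1, 0]              The word immediately before the current word
3     %x[0, 0]               The current word
4     %x[1, 0]               The word immediately after the current word
5     %x[2, 0]               The word two positions after the current word
6     %x[-2, 0]/%x[-1, 0]    Transition probability from the second preceding word to the preceding word
7     %x[-1, 0]/%x[0, 0]     Transition probability from the preceding word to the current word
8     %x[0, 0]/%x[1, 0]      Transition probability from the current word to the following word
  Description of the simple feature templates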
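The `%x[row, col]` notation resembles the template macros of CRF++-style toolkits, where `row` is an offset relative to the current token and `col` selects an input column. The page does not name the toolkit used, so the sketch below is an assumption-laden illustration of how such macros could be expanded into feature strings. The function `expand`, the regular expression `MACRO`, and the out-of-range placeholders `_B-1` / `_B+1` are choices made for this example.

```python
from __future__ import annotations

import re

# Matches one macro such as %x[-1, 0]; spaces around the numbers are allowed.
MACRO = re.compile(r"%x\[\s*(-?\d+)\s*,\s*(\d+)\s*\]")

# The eight templates listed in the table above.
TEMPLATES = [
    "%x[-2, 0]", "%x[-1, 0]", "%x[0, 0]", "%x[1, 0]", "%x[2, 0]",
    "%x[-2, 0]/%x[-1, 0]", "%x[-1, 0]/%x[0, 0]", "%x[0, 0]/%x[1, 0]",
]


def expand(template: str, rows: list[list[str]], pos: int) -> str:
    """Expand every %x[row, col] macro in template for the token at index pos.

    rows[i][c] is column c of token i (column 0 is the word itself).
    Offsets outside the sentence become placeholders; the _B-n / _B+n
    naming is an assumption for this sketch.
    """
    def lookup(match: re.Match) -> str:
        offset, col = int(match.group(1)), int(match.group(2))
        i = pos + offset
        if i < 0:
            return f"_B{i}"                      # e.g. _B-1: one step before the start
        if i >= len(rows):
            return f"_B+{i - len(rows) + 1}"     # e.g. _B+1: one step past the end
        return rows[i][col]

    return MACRO.sub(lookup, template)


if __name__ == "__main__":
    # Three tokens from the preprocessing sample: (word, part of speech).
    sentence = [["刑事", "b"], ["责任", "n"], ["能力", "n"]]
    for number, template in enumerate(TEMPLATES, start=1):
        print(number, template, "->", expand(template, sentence, pos=1))
```

For the middle token 责任, template 7 expands to the word pair 刑事/责任, which is the kind of adjacent-word feature the table describes for templates 6 to 8.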
No.    P          R          F
1      0.957209   0.974524   0.965789
2      0.934819   0.951670   0.943169
3      0.942223   0.959492   0.950779
4      0.934009   0.950114   0.941992
5      0.933376   0.948381   0.940819
6      0.938468   0.949555   0.943979
7      0.939941   0.949402   0.944647
8      0.942211   0.949419   0.945801
9      0.944823   0.950231   0.947519
10     0.945409   0.949339   0.947370
Mean   0.941249   0.953213   0.947186
  Evaluation results of the automatic recognition model on the corpus using the crime-name dictionary
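The F column is consistent with the usual balanced F-measure, the harmonic mean of precision P and recall R, F = 2PR / (P + R). A short sketch that recomputes it for the first three rows of the table above (the function name `f_measure` is chosen for this example):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for runs 1 to 3 of the table above.
ROWS = [
    (0.957209, 0.974524, 0.965789),
    (0.934819, 0.951670, 0.943169),
    (0.942223, 0.959492, 0.950779),
]

if __name__ == "__main__":
    for p, r, reported in ROWS:
        print(f"P={p:.6f} R={r:.6f} computed F={f_measure(p, r):.6f} reported F={reported:.6f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, F always lies between P and R and slightly below their arithmetic mean.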
No.    P          R          F
1      0.835947   0.883422   0.859029
2      0.885392   0.915164   0.900032
3      0.890849   0.927982   0.909037
4      0.902713   0.930428   0.916361
5      0.915151   0.934568   0.924758
6      0.921697   0.939949   0.930733
7      0.928558   0.942517   0.935485
8      0.931797   0.943780   0.937750
9      0.935462   0.945968   0.940686
10     0.937246   0.946705   0.941952
Mean   0.908481   0.931048   0.919582
  Evaluation results of the automatic recognition model on the corpus without the crime-name dictionary
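The mean rows of the two result tables can be reproduced by averaging the ten F values, which also makes the effect of the crime-name dictionary explicit. This is a small sketch using only the F columns above; the variable names are chosen for this example.

```python
from statistics import mean

# F values of the ten runs, copied from the two result tables.
f_with_dict = [
    0.965789, 0.943169, 0.950779, 0.941992, 0.940819,
    0.943979, 0.944647, 0.945801, 0.947519, 0.947370,
]
f_without_dict = [
    0.859029, 0.900032, 0.909037, 0.916361, 0.924758,
    0.930733, 0.935485, 0.937750, 0.940686, 0.941952,
]

mean_with = mean(f_with_dict)
mean_without = mean(f_without_dict)
gain = mean_with - mean_without  # difference attributable to the dictionary setting

if __name__ == "__main__":
    print(f"mean F with dictionary:    {mean_with:.6f}")
    print(f"mean F without dictionary: {mean_without:.6f}")
    print(f"difference:                {gain:.6f}")
```

The averages agree with the reported means, and the dictionary-assisted setting comes out about 0.028 higher in mean F.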