Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 57-65    DOI: 10.11925/infotech.2096-3467.2018.0213
Research Paper
The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books*
Yue Yuan1, Dongbo Wang1,2, Shuiqing Huang1,2, Bin Li3
1College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
Abstract

[Objective] In the context of digital humanities, and in order to mine knowledge from ancient classics more deeply and accurately, this study uses comparative experiments to examine how different part-of-speech tagging sets affect entity extraction from classical texts. [Methods] The training and test corpora consist of “Zuo Zhuan” and “Guo Yu”, which have been automatically tagged by machine and manually verified. Taking the Pre-Qin part-of-speech tagging set of Nanjing Normal University as the main basis, supplemented by the tagging sets of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the Ministry of Education, three new tagging sets of different sizes were formed. Conditional random fields with added feature templates were used to compare the entity extraction results of the three tagging sets on the same corpus. [Results] In comparative experiments on the Pre-Qin classics “Zuo Zhuan” and “Guo Yu”, the three models achieved entity extraction F-values of 82.53%, 83.42% and 84.07%, respectively. [Limitations] Feature selection needs further improvement, and the training results leave room for improvement. [Conclusions] The results are helpful for extracting named entities from Pre-Qin literature, and the constructed part-of-speech tagging sets are suitable for part-of-speech tagging of ancient Chinese.
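
As a rough illustration of the setup the abstract describes, the following Python sketch shows how part-of-speech tags could enter a windowed feature template of the kind fed to a conditional random field, and how an entity-level F-value could be computed from BIO labels. It is not the authors' implementation: the function names, the tags `nr`, `ns` and `v`, the `PER`/`LOC` labels, and the example sentence are all hypothetical, and the F-value is assumed to be the usual harmonic mean of precision and recall over exactly matched entities.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag inventory here is hypothetical


def window_features(sent: List[Token], i: int, width: int = 1) -> Dict[str, str]:
    """Expand a simple feature template around position i.

    Every offset in [-width, width] contributes the word and its POS tag,
    similar in spirit to unigram templates in CRF toolkits.
    """
    feats: Dict[str, str] = {}
    for off in range(-width, width + 1):
        j = i + off
        if 0 <= j < len(sent):
            word, pos = sent[j]
        else:
            word, pos = "<PAD>", "<PAD>"  # padding beyond the sentence boundary
        feats[f"w[{off}]"] = word
        feats[f"pos[{off}]"] = pos
    return feats


def bio_spans(labels: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) entity spans from BIO labels."""
    spans: Set[Tuple[int, int, str]] = set()
    start, kind = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = lab.startswith("I-") and kind == lab[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 (exact span and type match)."""
    g, p = bio_spans(gold), bio_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative data only; not taken from the paper's corpus annotation.
sent = [("晋侯", "nr"), ("伐", "v"), ("郑", "ns")]
gold = ["B-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O"]

print(window_features(sent, 1))
print(prf(gold, pred))
```

Swapping the POS column for a coarser or finer tag inventory while keeping the template and the gold entity labels fixed is one way such a comparison across tagging sets could be organized; the paper's actual templates and tag sets are not given on this page.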

Keywords: Digital Humanities; Ancient Chinese Character Information Processing; Parts of Speech Tagging; Named Entity Extraction
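Received: 2018-02-27
Funding: *This work is one of the research outcomes of the Major Program of the National Social Science Fund of China, “Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series” (Grant No. 15ZDB127), and the General Program of the National Natural Science Foundation of China, “Construction of a Syntax-Level Chinese-English Parallel Corpus and Humanities Computing Research Based on Indexes of Classics” (Grant No. 71673143).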
Cite this article:
Yue Yuan, Dongbo Wang, Shuiqing Huang, Bin Li. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65. DOI: 10.11925/infotech.2096-3467.2018.0213.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0213