Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 57-65    DOI: 10.11925/infotech.2096-3467.2018.0213
The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books
Yue Yuan1, Dongbo Wang1,2, Shuiqing Huang1,2, Bin Li3
1College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China

[Objective] In the context of digital humanities, this study examines how different part-of-speech tagging sets affect entity extraction models for classical books, so that knowledge can be mined from Pre-Qin literature more deeply and accurately. [Methods] The training and testing corpora consist of manually annotated texts of “Zuo Zhuan” and “Guo Yu”. Three tagging sets of different sizes were formed, with the Pre-Qin part-of-speech tagging set of Nanjing Normal University as the main part, supplemented by the part-of-speech tagging sets of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the Ministry of Education. Conditional random fields with feature templates were then used to compare the entity extraction results obtained on the same corpus. [Results] Comparative experiments were carried out with the three part-of-speech tagging sets of different sizes on the Pre-Qin classics “Zuo Zhuan” and “Guo Yu”. The F values of the three models were 82.53%, 83.42% and 84.07%, respectively. [Limitations] Feature selection needs further improvement, and the training results leave room for improvement. [Conclusions] The results are helpful for extracting named entities from the ancient literature of the Pre-Qin period, and the constructed part-of-speech tagging set is suitable for part-of-speech tagging of ancient Chinese.
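
The abstract refers to feature templates for a conditional random field and to F values for entity extraction. As a rough illustration only, and not the authors' implementation, the Python sketch below shows one way a window-style template over word and part-of-speech columns could be expanded into features, and how an entity-level F value could be computed from BIO label sequences. The template offsets, the padding symbol, the BIO labels, the toy sentence, and the tags `nr`, `v`, `ns` are all assumptions chosen for the example; they are not taken from the paper or from any of the tagging sets it compares.

```python
from typing import Dict, Sequence, Set, Tuple

# Hypothetical window template: (row offset, column index) pairs.
# Column 0 holds the word, column 1 holds its part-of-speech tag.
TEMPLATE = [(-1, 0), (0, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]


def expand_features(rows: Sequence[Sequence[str]], i: int) -> Dict[str, str]:
    """Expand TEMPLATE at position i; offsets outside the sentence are padded."""
    feats = {}
    for offset, col in TEMPLATE:
        j = i + offset
        feats[f"U[{offset},{col}]"] = rows[j][col] if 0 <= j < len(rows) else "_PAD_"
    return feats


def extract_entities(labels: Sequence[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from a BIO label sequence."""
    spans, start, etype = set(), None, None
    for k, lab in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        continues = lab.startswith("I-") and lab[2:] == etype
        if start is not None and not continues:
            spans.add((start, k, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = k, lab[2:]
    return spans


def prf(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F value (exact span and type match)."""
    g, p = extract_entities(gold), extract_entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# Toy sentence with placeholder tags (illustrative, not a real tagging set).
rows = [("晉侯", "nr"), ("伐", "v"), ("秦", "ns")]
features_at_0 = expand_features(rows, 0)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
precision, recall, f_value = prf(gold, pred)
```

Under this setup, swapping a coarser or finer tagging set only changes the values in column 1, so the same template yields different features for the same text, which is the kind of difference a comparison across tagging sets would measure.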

Key words: Digital Humanities; Ancient Chinese Character Information Processing; Parts of Speech Tagging; Named Entity Extraction
Received: 27 February 2018      Published: 17 April 2019

Cite this article:

Yue Yuan, Dongbo Wang, Shuiqing Huang, Bin Li. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65.

