1 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2 Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
[Objective] In the context of digital humanities, this study examines how different part-of-speech tagging sets affect entity extraction models, in order to mine knowledge from Pre-Qin literature more deeply and accurately. [Methods] The training and testing corpora consist of “Zuo Zhuan” and “Guo Yu”, annotated manually and by machine. Three tagging sets of different sizes were formed, taking the Pre-Qin part-of-speech tagging set of Nanjing Normal University as the main part, supplemented by the part-of-speech tagging sets of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the Ministry of Education. Entity extraction results on the same corpus were then compared using conditional random fields with feature templates. [Results] Comparative experiments were carried out with the three part-of-speech tagging sets of different sizes on the Pre-Qin classics “Zuo Zhuan” and “Guo Yu”. The F values of the three models were 82.53%, 83.42% and 84.07%, respectively. [Limitations] Feature selection needs further improvement, and the training results leave room for improvement. [Conclusions] The results are helpful for extracting named entities from ancient literature of the Pre-Qin period. The constructed set of part-of-speech tags is suitable for part-of-speech tagging of ancient Chinese.
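The role of feature templates in the conditional random field setup can be illustrated with a minimal Python sketch. It is not the authors' implementation: the template strings follow the widely used CRF++ `%x[row,col]` convention as an assumption, and the names `TEMPLATES`, `expand`, `features` and `f_value`, the example sentence, and its tags are all hypothetical. Column 0 holds the word and column 1 the part-of-speech tag, so switching to a different tagging set changes only column 1.

```python
import re

# Hypothetical CRF++-style unigram templates: %x[row, col] selects the value
# at a relative row offset and a column (0 = word, 1 = part-of-speech tag).
TEMPLATES = [
    "U00:%x[0,0]",           # current word
    "U01:%x[-1,0]",          # previous word
    "U02:%x[1,0]",           # next word
    "U03:%x[0,1]",           # current POS tag
    "U04:%x[-1,1]/%x[0,1]",  # previous POS tag combined with current POS tag
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, i):
    """Expand one template at position i of a sentence (a list of column tuples)."""
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        j = i + offset
        if j < 0:
            return f"_B{j}"                      # before sentence start, e.g. _B-1
        if j >= len(rows):
            return f"_B+{j - len(rows) + 1}"     # past sentence end, e.g. _B+1
        return rows[j][col]
    return MACRO.sub(lookup, template)


def features(rows, i, templates=TEMPLATES):
    """Return the feature strings generated at position i."""
    return [expand(t, rows, i) for t in templates]


def f_value(precision, recall):
    """Harmonic mean of precision and recall (the F value, F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence only; the tags are placeholders, not the paper's tag set.
sentence = [("晉侯", "n"), ("伐", "v"), ("鄭", "ns")]
for position in range(len(sentence)):
    print(features(sentence, position))
```

Running the same templates over corpora tagged with different part-of-speech sets yields different feature strings for the same text, which is the kind of difference the comparison measures through the F value.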
Yue Yuan, Dongbo Wang, Shuiqing Huang, Bin Li. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65.