Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 143-152    DOI: 10.11925/infotech.2096-3467.2019.0630
Impacts of Chinese Term Granularity on Measuring Term Discriminative Capacity
Xiong Xin1,2, Wang Hao1,2, Zhang Haichao1,2, Zhang Baolong1,2
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper explores the granularity of Chinese terms from different fields and measures their Term Discriminative Capacity (TDC). [Methods] First, we used TDC to evaluate the quality of terms from four index fields (Title, Abstract, Keywords, Keywords Plus). Then, we tested for differences in TDC across disciplines, fields and term granularity. [Results] In the control group, the order of mean TDC was Title > Abstract > Keywords Plus > Keywords. In the experimental group, the performance of Keywords Plus improved, giving Title > Keywords Plus > Abstract > Keywords. [Limitations] We only collected data from five disciplines in the humanities and social sciences. [Conclusions] Both Chinese term granularity and source field influence the Term Discriminative Capacity. Term granularity should be standardized to reduce the impact of fields.
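The abstract does not give the formula used for TDC. As a rough illustration only, the sketch below computes the classic term discrimination value from the vector space model: the average pairwise document similarity with a term removed, minus the same average with the term present. This is an assumed stand-in, not necessarily the measure used in the paper, and the function names and the toy document vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def space_density(docs):
    """Average cosine similarity over all distinct document pairs."""
    n = len(docs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)


def discrimination_value(docs, k):
    """Density with term k removed minus density with term k present.

    A positive value means the term spreads documents apart (a good
    discriminator); a negative value means it pulls them together.
    """
    without_k = [[w for idx, w in enumerate(d) if idx != k] for d in docs]
    return space_density(without_k) - space_density(docs)


# Hypothetical toy collection: rows are documents, columns are terms.
# Term 0 occurs in every document; term 1 occurs in only one.
toy_docs = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
```

In this toy collection, `discrimination_value(toy_docs, 0)` is negative because the term appears in every document, while `discrimination_value(toy_docs, 1)` is positive because the term singles out one document.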

Key words: Term Discriminative Capacity; Term Granularity; Academic Literature Retrieval System; Automatic Indexing
Received: 10 June 2019      Published: 26 April 2020
ZTFLH:  TP391  
Corresponding Authors: Hao Wang     E-mail: ywhaowang@nju.edu.cn

Cite this article:

Xiong Xin,Wang Hao,Zhang Haichao,Zhang Baolong. Impacts of Chinese Term Granularity on Measuring Term Discriminative Capacity. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 143-152.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0630     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/143

Research Framework
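No.  Discipline                                      Abbr.  Documents retrieved  Valid records  Valid percentage  Discipline type
1    Philosophy                                      PHI    8,160                3,861          47.32%            Humanities
2    History                                         HIS    7,341                3,624          49.37%            Humanities
3    Economics                                       ECO    34,255               19,149         55.90%            Social sciences
4    Sociology                                       SOC    4,622                2,268          49.07%            Social sciences
5    Library, Information and Documentation Science  LIS    10,285               6,440          62.62%            Interdisciplinary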
Documents and Effective Documents in Disciplines
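The valid percentage column in the table above is the number of valid records divided by the number of documents retrieved. A minimal sketch of that arithmetic, using the counts from the table; the dictionary layout and the function name are illustrative assumptions:

```python
# (documents retrieved, valid records) per discipline, copied from the table.
records = {
    "PHI": (8160, 3861),
    "HIS": (7341, 3624),
    "ECO": (34255, 19149),
    "SOC": (4622, 2268),
    "LIS": (10285, 6440),
}


def valid_percentage(retrieved, valid):
    """Share of retrieved documents kept as valid records, in percent."""
    return round(100 * valid / retrieved, 2)


for abbr, (retrieved, valid) in records.items():
    print(f"{abbr}: {valid_percentage(retrieved, valid):.2f}%")
```

Each computed value matches the percentage reported for that discipline.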
Field          No.  Abbr.
Title          1    TI
Abstract       2    AB
Keywords       3    KW
Keywords Plus  4    KP
Serial Numbers and Abbreviations of Fields
Group               TI     AB     KW     KP     All
Control group       2,772  8,997  3,294  7,986  18,891
Experimental group  2,772  8,997  2,693  5,188  11,173
Numbers of Terms
Group               TI    AB    KW    KP    All
Control group       1.94  2.05  4.11  4.37  3.31
Experimental group  1.94  2.05  1.95  1.95  2.06
Average Length of Terms
Percentages of Short Terms
Scatter Plot of TDC by Field (Control Group)
Line Plots of One-way ANOVA Mean of TDC and Field (Control Group)
Scatter Plot of TDC and Number by Field (Experimental Group)
Line Plots of One-way ANOVA Mean of TDC and Field (Experimental Group)
Line Plots and Term Granularity of Two-way ANOVA Mean
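The figure captions above refer to one-way ANOVA comparisons of mean TDC across fields. As a hedged illustration of the statistic behind such a comparison, the sketch below computes the one-way ANOVA F value in plain Python. The function name and the numbers in `toy_groups` are invented for the example and are not data from the paper.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of sample groups."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up scores for three hypothetical groups (e.g. three fields).
toy_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_value = one_way_anova_f(toy_groups)
```

A larger F value indicates that the group means differ more than the within-group spread would suggest. Turning F into a p-value requires the F distribution, which a statistics library would normally supply.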