Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 91-108    DOI: 10.11925/infotech.2096-3467.2019.1224
Measurement and Distribution of Index Quality in Research Topics from Academic Databases
Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper measures the quality of index terms for research topics in academic databases and explores how that quality is distributed. [Methods] First, we collected the index terms of research topics in the humanities, social sciences, and natural sciences from Web of Science (WoS) and CNKI. Second, we constructed terminology spaces based on research topics, domains, and databases. Third, we used term discriminative capacity (TDC) to evaluate the quality of the terms. Finally, we conducted ANOVA tests to explore how index term quality is distributed across databases and domains. [Results] For research topics, index term quality followed the order “Abstract” > average level > “Keyword”. The “Title” field in CNKI (“Keyword Plus” in WoS) scored lower than “Abstract”, while the “Title” field in WoS scored lower than average. [Limitations] The number of research topics in this study needs to be expanded. [Conclusions] The TDC measure is stable and reliable, and it can help improve information retrieval services and term quality.
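The abstract does not give the TDC formula, so the following is only a minimal sketch of the related classic term discrimination value (TDV) in the style of Salton et al., not the TDC measure used in this paper. It assumes documents are term-weight vectors, measures space density as the mean cosine similarity between each document and the centroid, and scores a term by how much the density rises when that term is removed. The function names and the toy matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def space_density(docs):
    """Mean cosine similarity between each document and the collection centroid."""
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(centroid, doc) for doc in docs) / n


def discrimination_values(docs):
    """Classic TDV per term: density without the term minus density with it.

    A positive value means that removing the term makes documents look more
    alike, so the term helps tell documents apart.
    """
    base = space_density(docs)
    values = []
    for k in range(len(docs[0])):
        reduced = [doc[:k] + doc[k + 1:] for doc in docs]
        values.append(space_density(reduced) - base)
    return values


# Hypothetical toy collection: rows are documents, columns are terms.
# Term 0 occurs in every document; terms 1 to 3 each occur in one document.
docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
tdv = discrimination_values(docs)
print(tdv)
```

In this toy collection the term shared by every document receives a negative value and the document-specific terms receive positive values, which matches the usual reading of positive and negative terms.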

Key words: Indexing Term; Term Discriminative Capability; ANOVA Testing; Search Fields; The Distribution Characteristics of Terms Quality
Received: 08 November 2019      Published: 07 July 2020
ZTFLH:  TP391 G35  
Corresponding Authors: Wang Hao     E-mail: ywhaowang@nju.edu.cn

Cite this article:

Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui. Measurement and Distribution of Index Quality in Research Topics from Academic Databases. Data Analysis and Knowledge Discovery, 2020, 4(6): 91-108.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1224     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I6/91

Research Framework
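Topic No. | Domain | Research topic | Retrieved documents | Valid documents | Selected documents | Number of terms
1 | A&HCI | Aristotle | 2,897 | 323 | 100 | 2,727
2 | A&HCI | Realism | 5,374 | 1,026 | 100 | 2,555
3 | A&HCI | Christianity | 4,364 | 743 | 100 | 3,247
4 | SSCI | Government failure | 4,254 | 1,871 | 100 | 3,184
5 | SSCI | Population urbanization | 3,882 | 2,782 | 100 | 3,463
6 | SSCI | Economic depression | 4,988 | 3,208 | 100 | 3,430
7 | SCI | Petrology | 6,887 | 4,542 | 100 | 4,740
8 | SCI | Rubella | 4,940 | 2,709 | 100 | 3,913
9 | SCI | Supersaturated solution | 5,072 | 2,745 | 100 | 3,377
10 | CSSCI_A | Literary criticism (文学批评) | 4,561 | 4,468 | 100 | 2,308
11 | CSSCI_A | Hegel (黑格尔) | 2,257 | 2,225 | 100 | 1,914
12 | CSSCI_A | Intangible cultural heritage (非物质文化遗产) | 2,405 | 2,334 | 100 | 1,958
13 | CSSCI_S | Inflation (通货膨胀) | 4,315 | 4,297 | 100 | 1,873
14 | CSSCI_S | Industrial agglomeration (产业集聚) | 4,324 | 4,316 | 100 | 1,732
15 | CSSCI_S | Economic crisis (经济危机) | 4,455 | 4,367 | 100 | 1,905
16 | CSCD | Particle swarm optimization algorithm (粒子群算法) | 4,553 | 4,552 | 100 | 1,889
17 | CSCD | Cell transplantation (细胞移植) | 5,912 | 5,799 | 100 | 2,021
18 | CSCD | Coordination complex (配合物) | 5,240 | 5,171 | 100 | 2,185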
Literature Search in WoS and CNKI
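Factor | WoS code | WoS meaning | CNKI code | CNKI meaning
Field | 1 | Title | 1 | Title
Field | 2 | Keyword | 2 | Keyword
Field | 3 | Keyword Plus | 3 | Abstract
Field | 4 | Abstract | |
Domain | 1 | A&HCI | 1 | CSSCI_A
Domain | 2 | SSCI | 2 | CSSCI_S
Domain | 3 | SCI | 3 | CSCD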
Symbolic Explanation
Frequency Histogram of TDV and TDC of Terms in A&HCI
Relationship Between TDV, TDC and DF in A&HCI
One-Way ANOVA Results of TDC for Each Research Topic in A&HCI
One-Way ANOVA Results of TDC for Each Research Topic in CSSCI_A
One-Way ANOVA Results of TDC for Each Research Topic in SSCI
One-Way ANOVA Results of TDC for Each Research Topic in CSSCI_S
One-Way ANOVA Results of TDC for Each Research Topic in SCI
One-Way ANOVA Results of TDC for Each Research Topic in CSCD
Distribution of Positive and Negative Terms of Research Topics and Domains in WoS
One-Way ANOVA Results of TDC for Each Domain in WoS
Distribution of Positive and Negative Terms of Research Topics and Domains in CNKI
One-Way ANOVA Results of TDC for Each Domain in CNKI
Distribution of Positive and Negative Terms of WoS and CNKI
One-Way ANOVA Results of Field Factors
One-Way ANOVA Results of Domain Factors
Two-Way ANOVA Results with Domain and Field Factors as Fixed Factors
The Relationship Between M_TDC and the Number of Terms of Horizontal and Vertical Factors in WoS
The Relationship Between M_TDC and the Number of Terms of Domains in WoS
The Relationship Between M_TDC and the Number of Terms of Topics in WoS
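Several of the captions above refer to one-way ANOVA on TDC scores. As a rough illustration of what such a test computes, the sketch below derives the F statistic by hand for groups of scores, one group per search field. The scores are made-up placeholders, not values from the paper, and the function name is hypothetical. Turning F into a p-value requires the F distribution, for which a statistics library such as SciPy (`scipy.stats.f_oneway`) would normally be used.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)                              # number of groups (factor levels)
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical TDC-like scores for three search fields (illustrative only).
scores_by_field = {
    "Title": [1.0, 2.0, 3.0],
    "Keyword": [2.0, 3.0, 4.0],
    "Abstract": [5.0, 6.0, 7.0],
}
f_stat, df_between, df_within = one_way_anova_f(list(scores_by_field.values()))
print(f_stat, df_between, df_within)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom suggests that mean scores differ across the fields.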