Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 6: 91-108     https://doi.org/10.11925/infotech.2096-3467.2019.1224
Research Article
Measurement and Distribution of Index Quality in Research Topics from Academic Databases*
Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper measures the quality of index terms for research topics in academic databases and explores how that quality is distributed. [Methods] We collected index terms for research topics in the humanities, social sciences, and natural sciences from Web of Science (WoS) and CNKI, and constructed term spaces at the topic, domain, and database levels. We used term discriminative capacity (TDC) as the indicator of term quality and applied ANOVA to explore how term quality is distributed across databases and domains. [Results] In every domain, term quality by field satisfied "Abstract" > average level > "Keyword". The "Title" field in CNKI ("Keyword Plus" in WoS) varied relative to the average level across domains but was always lower than "Abstract". The "Title" field in WoS varied relative to "Abstract" across domains but was always higher than the average level. [Limitations] The set of research topics is not rich enough. [Conclusions] The TDC measure is stable and reliable. Exploring the distribution of term quality across research topics provides direction and evidence for choosing search-field entry points and improving term quality.

Key words: Indexing Term    Term Discriminative Capacity    ANOVA Testing    Search Fields    Distribution Characteristics of Term Quality
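The abstract names TDC as the quality indicator but does not give its formula. As a rough illustration of the general idea of term discrimination, the sketch below computes the classic term discrimination value (TDV) in a vector space: the change in document-space density when one term is removed. This is an assumption-laden toy, not the paper's TDC; the function names, the centroid-based density, and the toy matrix are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def space_density(docs):
    """Average cosine similarity between each document and the centroid."""
    n_terms = len(docs[0])
    centroid = [sum(d[j] for d in docs) / len(docs) for j in range(n_terms)]
    return sum(cosine(d, centroid) for d in docs) / len(docs)


def term_discrimination_values(docs):
    """TDV_k = density with term k removed minus density with all terms.

    A positive value means the term spreads documents apart (a good
    discriminator); a negative value means it pulls them together.
    """
    base = space_density(docs)
    values = []
    for k in range(len(docs[0])):
        reduced = [[w for j, w in enumerate(d) if j != k] for d in docs]
        values.append(space_density(reduced) - base)
    return values


# Hypothetical toy data: 3 documents x 4 terms (term frequencies).
# Term 0 occurs in every document; terms 1-3 occur in one document each.
docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
tdv = term_discrimination_values(docs)
for k, value in enumerate(tdv):
    print(f"term {k}: TDV = {value:+.4f}")
```

In this toy matrix the term shared by all documents gets a negative value and the three document-specific terms get positive values, which is the behavior one would expect from any discrimination-style measure.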
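Received: 2019-11-08      Published: 2020-07-07
Chinese Library Classification (ZTFLH):  TP391 G35
Funding: *This work is an outcome of the Young Scientists Fund of the National Natural Science Foundation of China project "Measurement and Analysis of TSD and TDC for Academic Resources" (71503121) and of the Nanjing University Humanities and Social Sciences "Double First-Class" Construction "Hundred-Level" project "Measurement and Analysis of the Discriminability of Multi-Granularity Academic Objects" (JY-001).
Corresponding author: Wang Hao     E-mail: ywhaowang@nju.edu.cn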
Cite this article:
Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui. Measurement and Distribution of Index Quality in Research Topics from Academic Databases[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 91-108.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1224      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/91
Fig.1  Research framework
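| Topic No. | Domain | Research topic | Retrieved documents | Valid documents | Selected documents | Terms |
|---|---|---|---|---|---|---|
| 1 | A&HCI | Aristotle | 2,897 | 323 | 100 | 2,727 |
| 2 | A&HCI | Realism | 5,374 | 1,026 | 100 | 2,555 |
| 3 | A&HCI | Christianity | 4,364 | 743 | 100 | 3,247 |
| 4 | SSCI | Government failure | 4,254 | 1,871 | 100 | 3,184 |
| 5 | SSCI | Population urbanization | 3,882 | 2,782 | 100 | 3,463 |
| 6 | SSCI | Economic depression | 4,988 | 3,208 | 100 | 3,430 |
| 7 | SCI | Petrology | 6,887 | 4,542 | 100 | 4,740 |
| 8 | SCI | Rubella | 4,940 | 2,709 | 100 | 3,913 |
| 9 | SCI | Supersaturated solution | 5,072 | 2,745 | 100 | 3,377 |
| 10 | CSSCI_A | 文学批评 (Literary criticism) | 4,561 | 4,468 | 100 | 2,308 |
| 11 | CSSCI_A | 黑格尔 (Hegel) | 2,257 | 2,225 | 100 | 1,914 |
| 12 | CSSCI_A | 非物质文化遗产 (Intangible cultural heritage) | 2,405 | 2,334 | 100 | 1,958 |
| 13 | CSSCI_S | 通货膨胀 (Inflation) | 4,315 | 4,297 | 100 | 1,873 |
| 14 | CSSCI_S | 产业集聚 (Industrial agglomeration) | 4,324 | 4,316 | 100 | 1,732 |
| 15 | CSSCI_S | 经济危机 (Economic crisis) | 4,455 | 4,367 | 100 | 1,905 |
| 16 | CSCD | 粒子群算法 (Particle swarm optimization algorithm) | 4,553 | 4,552 | 100 | 1,889 |
| 17 | CSCD | 细胞移植 (Cell transplantation) | 5,912 | 5,799 | 100 | 2,021 |
| 18 | CSCD | 配合物 (Coordination complex) | 5,240 | 5,171 | 100 | 2,185 |

Table 1  Literature retrieval results for WoS and CNKI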
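| Factor | WoS No. | WoS meaning | CNKI No. | CNKI meaning |
|---|---|---|---|---|
| Field | 1 | Title | 1 | Title |
| Field | 2 | Keyword | 2 | Keyword |
| Field | 3 | Keyword Plus | 3 | Abstract |
| Field | 4 | Abstract | | |
| Domain | 1 | A&HCI | 1 | CSSCI_A |
| Domain | 2 | SSCI | 2 | CSSCI_S |
| Domain | 3 | SCI | 3 | CSCD |

Table 2  Symbol descriptions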
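Fig.2  Frequency histograms of term TDV and TDC in A&HCI
Fig.3  Relationship between term TDV, TDC, and DF in A&HCI
Fig.4  One-Way ANOVA results for the TDC of each research topic in A&HCI
Fig.5  One-Way ANOVA results for the TDC of each research topic in CSSCI_A
Fig.6  One-Way ANOVA results for the TDC of each research topic in SSCI
Fig.7  One-Way ANOVA results for the TDC of each research topic in CSSCI_S
Fig.8  One-Way ANOVA results for the TDC of each research topic in SCI
Fig.9  One-Way ANOVA results for the TDC of each research topic in CSCD
Fig.10  Distribution of the numbers of positive and negative terms across research topics and domains in WoS
Fig.11  One-Way ANOVA results for the TDC of each domain in WoS
Fig.12  Distribution of the numbers of positive and negative terms across research topics and domains in CNKI
Fig.13  One-Way ANOVA results for the TDC of each domain in CNKI
Fig.14  Distribution of the numbers of positive and negative terms in the WoS and CNKI term spaces
Fig.15  One-Way ANOVA results for the field factor
Fig.16  One-Way ANOVA results for the domain factor
Fig.17  Two-Way ANOVA results with domain and field as fixed factors
Fig.18  Relationship between M_TDC and the number of terms for the horizontal and vertical factors in the WoS term space
Fig.19  Relationship between the M_TDC of domains and the number of terms in the WoS term space
Fig.20  Relationship between the M_TDC of topics and the number of terms in the WoS term space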
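Several of the figures above report One-Way ANOVA results on TDC grouped by topic, domain, or field. As a minimal sketch of what such a test computes, the code below derives the F statistic by hand from between-group and within-group sums of squares. The function name and the TDC values are hypothetical and are not taken from the paper; a p-value would additionally require the F distribution with the returned degrees of freedom (for example `scipy.stats.f.sf`, if SciPy is available).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical TDC values grouped by search field (made-up numbers).
tdc_by_field = {
    "Title": [1.0, 2.0, 3.0],
    "Keyword": [2.0, 3.0, 4.0],
    "Abstract": [5.0, 6.0, 7.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(tdc_by_field.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution's critical value suggests that mean TDC differs between at least two groups; it does not by itself say which groups differ, which is why a follow-up pairwise comparison is usually needed.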