Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (4): 56-67    DOI: 10.11925/infotech.2096-3467.2022.0676
Interdisciplinary Measurement Based on Automatic Classification of Text Content
Lv Qi1, Shangguan Yanhong1, Zhang Lin2,3,4, Huang Ying2,3,4
1School of Management and Economics, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
2School of Information Management, Wuhan University, Wuhan 430072, China
3Center for Science, Technology & Education Assessment (CSTEA), Wuhan University, Wuhan 430072, China
4Department of MSI & ECOOM, KU Leuven, Leuven B-3000, Belgium
Abstract  

[Objective] This paper identifies the disciplines of individual papers from their content, to support interdisciplinary measurement based on paper-level discipline classification. [Methods] Using the Leuven-Budapest subject classification system, we applied machine learning, deep learning, and pre-trained language models to classify abstracts from 15 primary disciplines. We then used the improved SCIBERT model for interdisciplinary measurement. [Results] The improved SCIBERT model achieved the best automatic classification performance, with an average F1 score of 81.45%. Some individual categories reached a classification performance above 90%. Among the 15 primary disciplines, biomedical research had the highest interdisciplinary degree (0.38) and physics the lowest (0.08). [Limitations] This paper measures interdisciplinarity only from the perspective of text content and does not consider multi-dimensional measurement methods. [Conclusions] Pre-trained language models performed best in automatically classifying journal articles, followed by deep learning models, while machine learning models performed worst. Using content-based automatic classification for interdisciplinary measurement extends the existing research framework and supports a multi-angle, in-depth understanding of interdisciplinary research.

Key words: Interdisciplinary Research; Document Classification; Text Mining; Machine Learning; Interdisciplinary Measurement
Received: 03 July 2022      Published: 09 November 2022
ZTFLH:  G250  
Fund: National Science Foundation of China (72004169); Humanities and Social Sciences Research in Henan Universities in 2023 (2023-ZZJH-176)
Corresponding Authors: Huang Ying, ORCID: 0000-0003-0115-4581, E-mail: ying.huang@whu.edu.cn

Cite this article:

Lv Qi, Shangguan Yanhong, Zhang Lin, Huang Ying. Interdisciplinary Measurement Based on Automatic Classification of Text Content. Data Analysis and Knowledge Discovery, 2023, 7(4): 56-67.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0676     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I4/56

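Model category | Model | Characteristics
Machine learning models | SVM | Scenario: commonly used for nonlinear problems and high-dimensional data sets. Strength: can handle correlations between features. Limitation: hard to apply to large-scale samples; results on multi-class problems are not ideal.
Machine learning models | KNN | Scenario: better suited to automatic classification with large sample sizes and heavily intersecting or overlapping classes. Strength: lower training time and model complexity than models such as SVM. Limitation: heavy computation; accuracy drops when samples are imbalanced.
Machine learning models | RF | Scenario: commonly used for classification and regression problems. Strength: the introduced randomness gives the random forest model good noise resistance. Limitation: prone to overfitting on some noisy classification or regression problems.
Deep learning models | CNN | Scenario: commonly used for image processing and semantic classification. Strength: simple and easy to understand; shared convolution kernels can handle high-dimensional data. Limitation: requires parameter tuning; tends to overlook the relationship between local and global information.
Deep learning models | LSTM | Scenario: better suited to sequence modeling problems. Strength: addresses vanishing and exploding gradients when training on long sequences. Limitation: high computational cost and long running time.
Deep learning models | FastText | Scenario: commonly used for text classification in natural language processing. Strength: simple model, fast training. Limitation: captures limited semantic information.
Pre-trained language models | BERT | Scenario: directly applicable across the areas of natural language processing. Strength: bidirectional semantic representation, better results. Limitation: high computational cost; ignores correlations between characters.
Pre-trained language models | ALBERT | Scenario: builds on BERT to address the excessive parameter count of current pre-trained models. Strength: reduces model parameters through parameter sharing. Limitation: fewer parameters, but prediction time is not shortened.
Pre-trained language models | RoBERTa | Scenario: a carefully tuned refinement of BERT. Strength: larger training data and a dynamic masking mechanism, which increases the randomness of the model's input data. Limitation: relatively high training cost.
Pre-trained language models | SCIBERT | Scenario: suited to natural language processing tasks on scientific papers. Strength: performs well on natural language processing tasks in the scientific domain. Limitation: a BERT-type model trained for a specific task; not general-purpose.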
Text Auto-classification Models Involved in This Paper
Improved SCIBERT Model Architecture
Framework of Experimental Setup
Distribution of Journal Papers in 16 ECOOM Disciplines
Model | Precision/% | Recall/% | F1/%
KNN | 8.16 | 8.01 | 7.75
SVM | 10.14 | 11.03 | 9.56
RF | 12.18 | 12.20 | 11.78
LSTM | 66.62 | 66.58 | 66.45
CNN | 67.14 | 67.01 | 66.91
FastText | 70.51 | 69.95 | 69.71
BERT | 71.61 | 71.75 | 71.58
ALBERT | 75.67 | 75.85 | 75.69
RoBERTa | 77.93 | 77.49 | 77.52
SCIBERT | 81.57 | 81.53 | 81.45
Classification Effects of Different Classification Methods on Test Data Sets
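The page does not say how the three scores in this table are averaged over the 15 classes. As a minimal sketch, assuming macro-averaging (the unweighted mean of per-class scores), the metrics could be computed as follows; the function name `macro_prf` and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the classes in y_true.

    Assumption: macro-averaging. The paper's actual averaging scheme is
    not stated on this page.
    """
    labels = sorted(set(y_true))
    true_count = Counter(y_true)   # number of samples per true class
    pred_count = Counter(y_pred)   # number of samples per predicted class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        # A class that is never predicted gets precision 0.
        p = hits[label] / pred_count[label] if pred_count[label] else 0.0
        r = hits[label] / true_count[label]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy example with two discipline codes (P = Physics, C = Chemistry).
    truth = ["P", "P", "C", "C"]
    predicted = ["P", "C", "C", "C"]
    print(macro_prf(truth, predicted))
```

With macro-averaging, a small class counts as much as a large one, which matters when the disciplines are unevenly represented.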
Different Classification Effects of 10 Models in 15 First-level Disciplines
Sample size | Precision/% | Recall/% | F1/%
Training set (55,458) : Test set (6,162) | 81.74 | 81.51 | 81.57
 | 80.98 | 80.98 | 80.84
 | 82.31 | 81.95 | 81.90
Training set (49,296) : Test set (12,324) | 81.53 | 81.52 | 81.32
 | 81.41 | 81.19 | 81.21
 | 81.76 | 81.43 | 81.47
Training set (43,134) : Test set (18,486) | 81.05 | 80.51 | 80.40
 | 81.15 | 80.80 | 80.74
 | 81.36 | 80.86 | 80.92
Robustness Test of Improved SCIBERT Model
Category | First-level discipline | Recall of content-based automatic classification/% | Interdisciplinary degree
A | Agriculture & Environment | 79.85 | 0.20
B | Biosciences (General, Cellular & Subcellular Biology; Genetics) | 69.05 | 0.31
C | Chemistry | 89.79 | 0.10
E | Engineering | 77.16 | 0.23
G | Geosciences & Space Sciences | 86.39 | 0.14
H | Mathematics | 91.04 | 0.09
I | Clinical and Experimental Medicine I (General & Internal Medicine) | 89.55 | 0.10
K | Arts & Humanities | 86.36 | 0.14
L | Social Sciences II (Economic, Political & Legal Sciences) | 83.12 | 0.17
M | Clinical and Experimental Medicine II (Non-Internal Medicine Specialties) | 85.58 | 0.14
N | Neuroscience & Behavior | 80.62 | 0.19
P | Physics | 91.77 | 0.08
R | Biomedical Research | 61.99 | 0.38
Y | Social Sciences I (General, Regional & Community Issues) | 70.23 | 0.30
Z | Biology (Organismic & Supraorganismic Level) | 80.44 | 0.20
Interdisciplinarity Based on Automatic Text Content Classification
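This page does not give the formula for the interdisciplinary degree. The tabulated values are, however, consistent with one minus the per-class recall (as a fraction), rounded to two decimals. The sketch below checks that observation against the table; it is a consistency check under that assumption, not the authors' stated definition, and the names used are illustrative.

```python
# category code -> (recall in %, interdisciplinary degree), copied from the table
table = {
    "A": (79.85, 0.20), "B": (69.05, 0.31), "C": (89.79, 0.10),
    "E": (77.16, 0.23), "G": (86.39, 0.14), "H": (91.04, 0.09),
    "I": (89.55, 0.10), "K": (86.36, 0.14), "L": (83.12, 0.17),
    "M": (85.58, 0.14), "N": (80.62, 0.19), "P": (91.77, 0.08),
    "R": (61.99, 0.38), "Y": (70.23, 0.30), "Z": (80.44, 0.20),
}


def assumed_degree(recall_percent):
    """Assumed relation: share of a discipline's papers assigned elsewhere."""
    return 1 - recall_percent / 100


# A two-decimal rounding allows a deviation of at most 0.005.
mismatches = [
    code for code, (recall, degree) in table.items()
    if abs(assumed_degree(recall) - degree) >= 0.005
]

# Unweighted mean of the per-class recalls
mean_recall = sum(recall for recall, _ in table.values()) / len(table)

if __name__ == "__main__":
    print("rows inconsistent with 1 - recall:", mismatches)
    print(f"mean per-class recall: {mean_recall:.2f}%")
```

All 15 rows satisfy the assumed relation, and the unweighted mean of the per-class recalls is 81.53%, the same value reported as the SCIBERT recall in the model comparison table.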
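Discipline interdisciplinary degree (DID) | ECOOM first-level disciplines
DID≤0.1 | Physics; Mathematics; Chemistry; Clinical and Experimental Medicine I (General & Internal Medicine)
0.1<DID≤0.3 | Geosciences & Space Sciences; Arts & Humanities; Clinical and Experimental Medicine II (Non-Internal Medicine Specialties); Social Sciences II (Economic, Political & Legal Sciences); Neuroscience & Behavior; Biology (Organismic & Supraorganismic Level); Agriculture & Environment; Engineering; Social Sciences I (General, Regional & Community Issues)
0.3<DID≤1 | Biosciences (General, Cellular & Subcellular Biology; Genetics); Biomedical Research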
Hierarchical Classification of Interdisciplinary Degrees of 15 First-level Disciplines
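The three tiers in this table amount to a simple threshold rule on the interdisciplinary degree. A minimal sketch, using the thresholds 0.1 and 0.3 from the table; the function name `did_tier` is illustrative, and the sample values are taken from the per-discipline table on this page.

```python
def did_tier(did):
    """Map an interdisciplinary degree in [0, 1] to the tier label used in the table."""
    if not 0 <= did <= 1:
        raise ValueError("interdisciplinary degree must lie in [0, 1]")
    if did <= 0.1:
        return "DID≤0.1"
    if did <= 0.3:
        return "0.1<DID≤0.3"
    return "0.3<DID≤1"


if __name__ == "__main__":
    # Degrees copied from the per-discipline table
    examples = {
        "Physics": 0.08,
        "Chemistry": 0.10,
        "Engineering": 0.23,
        "Social Sciences I": 0.30,
        "Biomedical Research": 0.38,
    }
    for name, did in examples.items():
        print(f"{name}: {did_tier(did)}")
```

Both upper bounds are inclusive, so Chemistry (0.10) falls in the lowest tier and Social Sciences I (0.30) in the middle tier, as in the table.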