Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (5): 38-45     https://doi.org/10.11925/infotech.2096-3467.2020.0201
Research Paper
Automatic Data Processing Strategy of Citation Anomie Based on Feature Fusion*
Li Junlian (李军莲)1,2,3, Wu Yingjie (吴英杰)3, Deng Panpan (邓盼盼)3, Leng Fuhai (冷伏海)4
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
4 Institute of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] To normalize the different expressions of the same cited document into a single form, enabling standardized control and management of journal citation data and alleviating the data quality problems caused by citation anomie. [Methods] Taking the construction of a journal citation database as the target scenario, we analyzed the core features of journal citation data according to reference description standards. Using a decision tree and an accuracy metric, we obtained the effective feature subsets, specified the execution priority of the decision rules, and built an automatic data processing strategy based on multi-feature fusion. [Results] A sample dataset of 10,000 journal citations and a validation dataset of 10,000 journal citations were selected from the Chinese Biomedical Citation Index (CBMCI). The proposed approach normalized journal citations with an accuracy of 99.72% on the sample dataset and 98.70% on the validation dataset. [Limitations] This study covers only Chinese journal citation anomie data; citations in other languages and of other types have not yet been considered. [Conclusions] The proposed strategy can normalize large-scale journal citation data automatically and efficiently, reducing manual intervention. The feature fusion idea can also be applied when building automatic normalization strategies for other types of citations.

Key words: Citation Data; Citation Anomie; Standard Control; Feature Fusion
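Received: 2020-03-16      Published: 2020-06-15
CLC number (ZTFLH): TP391
Funding: *This work is one of the research outcomes of the project "Biomedical Science and Technology Information Support Platform" under the Chinese Academy of Medical Sciences Innovation Project for Medical and Health Science and Technology (2016-12M-2-005).
Corresponding author: Li Junlian     E-mail: junlian@imicams.ac.cn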
Cite this article:
Li Junlian, Wu Yingjie, Deng Panpan, Leng Fuhai. Automatic Data Processing Strategy of Citation Anomie Based on Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 38-45.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0201      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I5/38
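Fig.1  Workflow for the automatic processing of citation anomie data based on feature fusion
Fig.2  Decision tree of the effective feature subsets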
Decision rule   Effective feature subset                 No. of features   Pr
Rule_1          ta, firstauthor, vi, dp, pg_start        5                 0.94
Rule_2          ta, vi, ip, dp, pg_start                 5                 0.94
Rule_3          ta, firstauthor, vi, ip, dp              5                 0.94
Rule_4          ta, firstauthor, ip, dp, pg_start        5                 0.93
Rule_5          ta, firstauthor, vi, ip, pg_start        5                 0.93
Rule_6          ti_format, ta, vi, ip, dp                5                 0.91
Rule_7          ta, firstauthor, ip, dp                  4                 0.95
Rule_8          ta, ip, dp, pg_start                     4                 0.94
Rule_9          ta, vi, ip, pg_start                     4                 0.94
Rule_10         ta, firstauthor, dp, pg_start            4                 0.94
Rule_11         ta, firstauthor, vi, pg_start            4                 0.94
Rule_12         ta, firstauthor, vi, ip                  4                 0.94
Rule_13         ti_format, ta, firstauthor, dp           4                 0.91
Rule_14         ti_format, firstauthor, dp, pg_start     4                 0.90
Rule_15         ti_format, ta, dp, pg_start              4                 0.90
Rule_16         firstauthor, dp, pg_start                3                 0.96
Rule_17         firstauthor, vi, pg_start                3                 0.95
Rule_18         ta, firstauthor, pg_start                3                 0.95
Rule_19         ti_format, firstauthor, dp               3                 0.92
Rule_20         ti_format, ta, dp                        3                 0.92
Rule_21         ti_format, dp, pg_start                  3                 0.91
Rule_22         ti_format, ta, firstauthor               3                 0.91
Table 1  Decision rules for journal citation normalization
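As a rough illustration of how the decision rules in Table 1 could be applied, the sketch below treats two citation records as the same article when they agree on every feature of some rule, trying the rules in order. It is not the authors' implementation. Several things here are assumptions: the reading of the field names as MEDLINE-style tags (ta = journal title, vi = volume, ip = issue, dp = publication date, pg_start = first page), the idea that rule numbering reflects execution priority, the function names, and the sample records, which are invented for the example.

```python
from typing import Mapping, Optional, Sequence, Tuple

Rule = Tuple[str, Tuple[str, ...]]

# A few rules copied from Table 1 (rule name, feature subset).
# ASSUMPTION: list order stands in for execution priority; the page
# does not state how the priority order was actually assigned.
RULES: Sequence[Rule] = (
    ("Rule_1", ("ta", "firstauthor", "vi", "dp", "pg_start")),
    ("Rule_7", ("ta", "firstauthor", "ip", "dp")),
    ("Rule_16", ("firstauthor", "dp", "pg_start")),
    ("Rule_22", ("ti_format", "ta", "firstauthor")),
)


def agree(a: Mapping[str, str], b: Mapping[str, str], feature: str) -> bool:
    """True when both records carry the same non-empty value for a feature."""
    value_a, value_b = a.get(feature), b.get(feature)
    return bool(value_a) and value_a == value_b


def first_matching_rule(
    a: Mapping[str, str], b: Mapping[str, str], rules: Sequence[Rule] = RULES
) -> Optional[str]:
    """Return the name of the first rule whose features all agree, else None."""
    for name, features in rules:
        if all(agree(a, b, feature) for feature in features):
            return name
    return None


# Hypothetical records: `cited` is a sloppy citation of `canonical`
# (different journal abbreviation, no volume or issue).
canonical = {
    "ti_format": "examplestudyofcitationdata",
    "ta": "J Example Med",
    "firstauthor": "Zhang S",
    "vi": "12",
    "ip": "3",
    "dp": "2015",
    "pg_start": "101",
}
cited = {
    "ti_format": "examplestudyofcitationdata",
    "ta": "Journal of Example Medicine",
    "firstauthor": "Zhang S",
    "dp": "2015",
    "pg_start": "101",
}
other = {"ta": "J Example Med", "firstauthor": "Li W", "dp": "2018", "pg_start": "7"}

print(first_matching_rule(canonical, cited))  # Rule_16
print(first_matching_rule(canonical, other))  # None
```

Requiring non-empty values keeps two records with the same missing field from counting as a match, which seems a sensible default for noisy citation data but is a design choice of this sketch.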
Data                 Size (records)   Accuracy (AC)
Sample dataset       10,000           99.72%
Validation dataset   10,000           98.70%
Table 2  Citation normalization results
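For a sense of scale, the percentages in Table 2 can be converted into record counts. This assumes AC is simply the share of correctly normalized citations, which this page does not confirm, and the helper name is hypothetical.

```python
from decimal import Decimal


def implied_correct(total: int, accuracy_percent: str) -> int:
    """Records implied correct by an accuracy figure (ASSUMES AC = correct / total)."""
    return int(Decimal(accuracy_percent) * total / 100)


for name, total, ac in [("sample", 10_000, "99.72"), ("validation", 10_000, "98.70")]:
    correct = implied_correct(total, ac)
    print(f"{name}: {correct} correct, {total - correct} errors")
# sample: 9972 correct, 28 errors
# validation: 9870 correct, 130 errors
```

Using `Decimal` on the percentage string avoids binary floating-point rounding when turning the reported figures back into whole counts.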