Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (5): 38-45    DOI: 10.11925/infotech.2096-3467.2020.0201
Automatic Data Processing Strategy of Citation Anomie Based on Feature Fusion
Li Junlian1,2,3, Wu Yingjie3, Deng Panpan3, Leng Fuhai4
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
4Institute of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] To normalize different expressions of the same cited document, achieve standard control and management of journal citation data, and alleviate the data quality problems caused by citation anomie. [Methods] Taking the construction of a journal citation database as the target scenario, we analyzed the core features of journal citation data according to the reference standards. Effective feature subsets were selected using a decision tree and accuracy, the execution priority of the decision rules was specified, and an automatic data processing strategy was built on multi-feature fusion. [Results] A sample set of 10,000 journal citations and a validation set of 10,000 journal citations were selected from the Chinese Biomedical Citation Index (CBMCI) for the experiment. The proposed feature fusion approach achieved 99.72% and 98.70% accuracy in journal citation normalization on the two datasets, respectively. [Limitations] This study only examined citation anomie in Chinese journal citations and does not yet cover citations in other languages or of other document types. [Conclusions] The proposed method can automatically standardize large-scale journal citation data with high efficiency, reducing the burden of labor-intensive manual intervention. The idea of feature fusion can also be applied to automatic normalization strategies for other types of cited documents.

Key words: Citation Data; Citation Anomie; Standard Control; Feature Fusion
Received: 16 March 2020      Published: 15 June 2020
ZTFLH:  TP391  
Corresponding Authors: Li Junlian     E-mail: junlian@imicams.ac.cn

Cite this article:

Li Junlian, Wu Yingjie, Deng Panpan, Leng Fuhai. Automatic Data Processing Strategy of Citation Anomie Based on Feature Fusion. Data Analysis and Knowledge Discovery, 2020, 4(5): 38-45.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0201     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I5/38

Automatic Data Processing Strategy of Citation Anomie Based on Feature Fusion
Decision Tree of Effective Feature Subset
Decision rule | Effective feature subset | Number of features | Pr
Rule_1 ta,firstauthor,vi,dp,pg_start 5 0.94
Rule_2 ta,vi,ip,dp,pg_start 5 0.94
Rule_3 ta,firstauthor,vi,ip,dp 5 0.94
Rule_4 ta,firstauthor,ip,dp,pg_start 5 0.93
Rule_5 ta,firstauthor,vi,ip,pg_start 5 0.93
Rule_6 ti_format,ta,vi,ip,dp 5 0.91
Rule_7 ta,firstauthor,ip,dp 4 0.95
Rule_8 ta,ip,dp,pg_start 4 0.94
Rule_9 ta,vi,ip,pg_start 4 0.94
Rule_10 ta,firstauthor,dp,pg_start 4 0.94
Rule_11 ta,firstauthor,vi,pg_start 4 0.94
Rule_12 ta,firstauthor,vi,ip 4 0.94
Rule_13 ti_format,ta,firstauthor,dp 4 0.91
Rule_14 ti_format,firstauthor,dp,pg_start 4 0.90
Rule_15 ti_format,ta,dp,pg_start 4 0.90
Rule_16 firstauthor,dp,pg_start 3 0.96
Rule_17 firstauthor,vi,pg_start 3 0.95
Rule_18 ta,firstauthor,pg_start 3 0.95
Rule_19 ti_format,firstauthor,dp 3 0.92
Rule_20 ti_format,ta,dp 3 0.92
Rule_21 ti_format,dp,pg_start 3 0.91
Rule_22 ti_format,ta,firstauthor 3 0.91
Decision Rules of Journal Citation Standardization
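The rule table above can be read as a prioritized cascade: try one feature subset, and if it cannot decide, fall back to the next. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes that table order is the execution priority, that a rule fires only when every feature in its subset is present, and that a match must be unique to be accepted. The field names (`ta`, `firstauthor`, `vi`, `ip`, `dp`, `pg_start`) are taken from the table and treated as opaque keys; the `id` field, the helper names, and all record values are hypothetical.

```python
# Three rules copied from the table. Using table order as execution
# priority is an assumption of this sketch.
RULES = [
    ("Rule_1", ("ta", "firstauthor", "vi", "dp", "pg_start")),
    ("Rule_7", ("ta", "firstauthor", "ip", "dp")),
    ("Rule_16", ("firstauthor", "dp", "pg_start")),
]


def rule_key(record, features):
    """Return the tuple of feature values, or None if any value is missing."""
    values = tuple(record.get(name) for name in features)
    if any(value in (None, "") for value in values):
        return None
    return values


def normalize(citation, source_records):
    """Return (source id, rule name) for the first rule with a unique match."""
    for rule_name, features in RULES:
        key = rule_key(citation, features)
        if key is None:
            continue  # the citation lacks a feature this rule needs
        matches = [r for r in source_records if rule_key(r, features) == key]
        if len(matches) == 1:
            return matches[0]["id"], rule_name
    return None  # no rule decided; leave the citation for manual review


# Hypothetical source records and one citation that omits the `vi` field.
SOURCE = [
    {"id": "A1", "ta": "J Example Med", "firstauthor": "Zhang S",
     "vi": "90", "ip": "3", "dp": "2010", "pg_start": "12"},
    {"id": "A2", "ta": "J Example Med", "firstauthor": "Li W",
     "vi": "90", "ip": "3", "dp": "2010", "pg_start": "45"},
]
CITATION = {"ta": "J Example Med", "firstauthor": "Zhang S",
            "ip": "3", "dp": "2010", "pg_start": "12"}

print(normalize(CITATION, SOURCE))
```

Because the sample citation has no `vi` value, Rule_1 is skipped and the match is decided by Rule_7. Requiring a unique match is a conservative choice made for this sketch: ambiguous citations fall through to later rules or to manual handling.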
Dataset | Size (records) | Accuracy (AC)
Sample dataset | 10,000 | 99.72%
Validation dataset | 10,000 | 98.70%
Results of Citation Standardization