Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (5): 29-37    DOI: 10.11925/infotech.2096-3467.2023.0475
Fusion of Organization Authority Files from Multiple Sources
Fan Yunman, Chen Ying, Tang Xiaoli
Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Abstract  

[Objective] This paper aims to improve the selection and evaluation of organization authority files (OAF) and to address the problems of mapping between OAFs and of redundant relationships. [Methods] First, we reviewed existing OAFs and related studies. Then, we constructed a six-step fusion model: data collection and analysis, metadata framework fusion, organization relationship fusion, alias fusion, OAF data model construction, and verification of fusion results. Finally, we tested the model using data from Dimensions, Scopus, and Web of Science. [Results] The model's F1 value reached 0.97 or above for first-, second-, and third-level organizations, and Dimensions made the largest contribution. We constructed an OAF containing 5,128 organizations. [Limitations] The organization relationships included only parent-child relations. Cross-reference relations and the choice of standard organization names remain to be studied. The proposed model also needs to be verified with more data. [Conclusions] The new model can effectively integrate OAFs from multiple sources.

Key words: Organization Authority File Fusion; Metadata Framework Fusion; Multi-source OAF; Scientific Research Entity Authority
Received: 19 May 2023      Published: 15 March 2024
CLC Number (ZTFLH): G254
Fund: Chinese Academy of Medical Sciences Medical and Health Science and Technology Innovation Project (Major Collaborative Innovation Project) (2021-I2M-1-033)
Corresponding Author: Tang Xiaoli, ORCID: 0000-0001-6946-3482, E-mail: tang.xiaoli@imicams.ac.cn

Cite this article:

Fan Yunman, Chen Ying, Tang Xiaoli. Fusion of Organization Authority Files from Multiple Sources. Data Analysis and Knowledge Discovery, 2024, 8(5): 29-37.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0475     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I5/29

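| Authority file | Data volume | Data source | Licensing | Development status |
|---|---|---|---|---|
| VIAF | 30 million | Authority files from multiple countries | Free and open | Actively developed |
| SAP | 120,000 | Scopus database | Commercial | Actively developed |
| OEL | 16,000 | WOS database | Commercial | Actively developed |
| ISNI | 1.55 million | ISNI registration agency data | Free and open | Actively developed |
| OrgRef | 31,000 | Open data sources such as Wikipedia and ISNI | Free and open | Discontinued |
| GRID | 100,000 | Dimensions database | Free and open | Exclusive to Dimensions; ROR continues under a community model |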
Comparative Analysis of Well-Known OAF
Fusion Model of Organization Authority File
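Sources compared: Dimensions, WOS, Scopus, ROR, ISNI, Ringgold, VIAF, OrgRef, Wikidata
Comparison items:
Basic information: organization ID, organization name, organization alias, status
Address information
Internal linkage information
External linkage information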
Metadata Comparison of Multiple OAF
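| Problem type | Fusion strategy | Notes |
|---|---|---|
| Field names match and meanings match | Deduplicate and keep | Organization name, organization alias |
| Field names differ but meanings match | Adopt the name used by most sources | Address[wos];scopus:affilAddress[Dimensions] |
| Field names match, meanings match, values differ | Analyze the field values in depth | Fuwai Hospital.type[wos]=health, Fuwai Hospital.type[Dimensions]=healthcare |
| Field names differ but express the same meaning | Unify the representation, e.g. express parent-child relations uniformly as child_ids | Taking organization parent-child relations as an example: scopus-childids, wos-parent_organizationsids |
| Missing from some sources but very important | Fill in the field missing from the source | WOS has no organization ID; assigned by the project team |
| Present only in some sources, not important | Fuse directly and keep | EMAIL[Dimensions] |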
Problems and Solutions of the Fusion of Metadata Frameworks
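The renaming and deduplication strategies in the table above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `FIELD_MAP`, `harmonize`, and `merge_records` are hypothetical names, and the assignment of field names to sources is an illustrative assumption (only `childids` and `child_ids` are taken directly from the table).

```python
from typing import Any

# Hypothetical per-source mapping from native field names to unified names.
# Which source uses which address field is an assumption for illustration.
FIELD_MAP: dict[str, dict[str, str]] = {
    "scopus": {"childids": "child_ids", "affilAddress": "address"},
    "wos": {"Address": "address"},
    "dimensions": {},
}


def harmonize(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename source-specific fields to the unified names; keep other fields."""
    mapping = FIELD_MAP.get(source, {})
    return {mapping.get(key, key): value for key, value in record.items()}


def merge_records(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge harmonized records of one organization.

    List-valued fields are unioned without duplicates ("deduplicate and keep");
    scalar fields keep the first non-empty value seen.
    """
    merged: dict[str, Any] = {}
    for record in records:
        for key, value in record.items():
            if isinstance(value, list):
                bucket = merged.get(key)
                if not isinstance(bucket, list):
                    # Start a list, preserving any scalar stored earlier.
                    bucket = merged[key] = [] if bucket is None else [bucket]
                for item in value:
                    if item not in bucket:
                        bucket.append(item)
            elif value not in (None, "") and key not in merged:
                merged[key] = value
    return merged
```

Value conflicts such as `health` versus `healthcare` are deliberately left out of the sketch, since the table only states that such values need closer analysis.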
Metadata Relation of Multiple OAF
Data Model of Organization Authority File(Partial)
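| Problem type | Example | Solution strategy |
|---|---|---|
| Relationships are inconsistent across sources | Dimensions records organization relationships as "related"; Scopus records them as parent-child | Add the Dimensions nodes as nodes pending mapping; add mapping relationships (automatic algorithmic matching, manual review); add parent-child relationships (added automatically by algorithm, manual review) |
| The same organization sits at different hierarchy levels in different sources | WOS: University College London - University of London Medical School; Scopus: University College London - University of London Faculty of Medicine - University of London Medical School | Keep the deeper relationship and drop the shallower one |
| Organizations are mapped incorrectly across sources | Johns Hopkins University Center for Communication Programs; Johns Hopkins Bloomberg School of Public Health | Add the wrongly mapped IDs to a mapping-error list so that the mapping algorithm excludes them |
| A renaming causes three organizations to be linked by mapping relationships | The Max Planck Institute for Developmental Biology was renamed the Max Planck Institute for Biology | Record a change relationship in the authority file |
| Erroneous organizations that can only be found by manual investigation | UCSF Benioff Children's Hospital Oakland appears in both Scopus and WOS at different hierarchy levels; it is attached to the wrong parent in WOS | Investigate the problem organizations manually and correct them |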
Problems and Solutions of the Fusion of Relation
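The strategy "keep the deeper relationship and drop the shallower one" can be read as removing a direct parent-child edge whenever the child is also reachable from the parent through intermediate organizations. The sketch below illustrates that reading, together with a simple mapping-error list; the function names and data shapes are hypothetical and the source does not describe the actual algorithm.

```python
def drop_shallow_edges(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Remove a (parent, child) edge when the child is also reachable from the
    parent through at least one intermediate organization."""
    children: dict[str, set[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    def reachable_via_other(parent: str, target: str) -> bool:
        # Depth-first search starting from the parent's other children.
        stack = [c for c in children.get(parent, ()) if c != target]
        seen: set[str] = set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(children.get(node, ()))
        return False

    return {(p, c) for p, c in edges if not reachable_via_other(p, c)}


def filter_mappings(
    candidates: list[tuple[str, str]],
    error_list: set[frozenset[str]],
) -> list[tuple[str, str]]:
    """Drop candidate ID mappings that were previously marked as wrong."""
    return [pair for pair in candidates if frozenset(pair) not in error_list]


# Hypothetical merged edges: one source links A directly to C,
# another source places B between them.
merged_edges = {("A", "C"), ("A", "B"), ("B", "C")}
kept_edges = drop_shallow_edges(merged_edges)
```

Here `kept_edges` retains only the chain A - B - C, which mirrors the two-level versus three-level example in the table.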
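| Problem description | Example: organization (translated from the Chinese name) | Example: parent organization | Example: aliases | Solution strategy |
|---|---|---|---|---|
| Sibling nodes share the same alias | Institute of Medical Biology, CAMS | Chinese Academy of Medical Sciences & Peking Union Medical College | Inst Med Biol; IMB | Prioritize precision while also considering recall |
| Sibling nodes share the same alias | Institute of Medicinal Biotechnology, CAMS | Chinese Academy of Medical Sciences & Peking Union Medical College | INST MEDICINAL BIOTECHNOL; IMB | Prioritize precision while also considering recall |
| The same organization has duplicate aliases | London School of Economics and Political Science | University of London | London School of Economics and Political Science; London Sch Econ & Polit Sci; London School of Economics and Political Science | Deduplicate |
| The same organization has duplicate aliases | National Center for Complementary and Alternative Medicine, National Institutes of Health | National Institutes of Health | Natl Ctr Comp Alt Med (NCCAM); NCCAM; National Center for Complementary and Alternative Medicine (NCCAM); Natl Ctr Comp Alt Med (NCCAM) | Deduplicate |
| The parent contains aliases of a child organization | UCLA Medical Center | University of California, Los Angeles | Univ Calif Los Angeles Med Ctr; Ronald Reagan UCLA Med Ctr | Assign parent aliases to the parent and child aliases to the child |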
Problems and Solutions of the Fusion of Alias
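Two of the alias strategies above, deduplication within one organization and the precision-first handling of aliases shared by siblings, might look like the following sketch. The normalization rule (case- and whitespace-insensitive comparison) and the decision to remove a shared alias from every sibling are assumptions made for illustration; the source only names the strategies.

```python
from collections import Counter


def normalize(alias: str) -> str:
    """Comparison key: collapse whitespace and ignore case (an assumption)."""
    return " ".join(alias.split()).casefold()


def dedupe(aliases: list[str]) -> list[str]:
    """Keep the first occurrence of each alias, preserving order."""
    seen: set[str] = set()
    result: list[str] = []
    for alias in aliases:
        key = normalize(alias)
        if key not in seen:
            seen.add(key)
            result.append(alias.strip())
    return result


def drop_shared_sibling_aliases(siblings: dict[str, list[str]]) -> dict[str, list[str]]:
    """Remove any alias that occurs under more than one sibling organization,
    trading some recall for precision."""
    counts = Counter(
        key
        for aliases in siblings.values()
        for key in {normalize(a) for a in aliases}
    )
    return {
        org: [a for a in aliases if counts[normalize(a)] == 1]
        for org, aliases in siblings.items()
    }


lse = dedupe([
    "London School of Economics and Political Science",
    "London Sch Econ & Polit Sci",
    "London School of Economics and Political Science",
])
cams = drop_shared_sibling_aliases({
    "Institute of Medical Biology": ["Inst Med Biol", "IMB"],
    "Institute of Medicinal Biotechnology": ["INST MEDICINAL BIOTECHNOL", "IMB"],
})
```

With the table's examples, `lse` keeps two distinct aliases and `cams` no longer lists the ambiguous `IMB` under either institute.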
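| Problem description | Example: organization (translated from the Chinese name) | Example: parent organization | Example: variant forms | Solution strategy |
|---|---|---|---|---|
| A child organization contains variants of its parent | Institut Pasteur de São Paulo | Institut Pasteur | INST PASTEUR; INST PASTEUR SAO PAULO | Assign parent variants to the parent and child variants to the child |
| A parent organization contains variants of its child | University of Toronto | University of Toronto Health Network | 10EN212 TORONTO GEN HOSP | Assign parent variants to the parent and child variants to the child |
| An organization contains address variants | University of London | | 29 39 BRUNS WICK SQ | Expand abbreviations to full names |
| Contains variants of a similarly named organization that is neither parent nor child | University of Washington | | GEORGE WASHINGTON UNIV | Prioritize precision while also considering recall |
| Contains variants that are too short | Institute of Geology and Geophysics, Chinese Academy of Sciences | Chinese Academy of Sciences | INS | Prioritize precision while also considering recall |
| Variants that occur only twice, carry little information, and contain digits | University of California, San Diego | University of California | 0109 UNIV CALIF SAN DIEGO | Prioritize precision while also considering recall |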
Problems and Solutions for the WOS Variant Term
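A precision-first treatment of the short and digit-bearing variants above could start by flagging suspicious variants for review rather than accepting them automatically. The sketch below is only an illustration: the length threshold `MIN_LENGTH` and the function `flag_variant` are hypothetical, and the source does not state how such variants were actually detected.

```python
import re

MIN_LENGTH = 4  # assumed threshold; the source gives no concrete value


def flag_variant(variant: str) -> list[str]:
    """Return the reasons a variant term looks unreliable (empty list if none)."""
    reasons: list[str] = []
    text = variant.strip()
    if len(text) < MIN_LENGTH:
        reasons.append("too_short")
    if re.search(r"\d", text):
        reasons.append("contains_digits")
    return reasons


variants = ["INS", "0109 UNIV CALIF SAN DIEGO", "INST PASTEUR"]
flagged = {v: flag_variant(v) for v in variants if flag_variant(v)}
```

Flagged variants would then go to manual review or be dropped, depending on how much recall the application can afford to lose.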
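| Level | TP | FP | FN | TN | P | R | F1 |
|---|---|---|---|---|---|---|---|
| Level 1 | 123 | 2 | 4 | - | 0.984 | 0.969 | 0.976 |
| Level 2 | 2,796 | 17 | 58 | - | 0.994 | 0.980 | 0.987 |
| Level 3 | 1,198 | 2 | 28 | - | 0.998 | 0.977 | 0.988 |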
Confusion Matrix of Fusion Results
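The P, R, and F1 columns follow from the TP, FP, and FN counts under the standard definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). The short sketch below recomputes them from the counts in the table; the function name `prf1` is a hypothetical helper.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true positives, false positives,
    and false negatives (true negatives are not needed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the confusion matrix table: level -> (TP, FP, FN).
counts = {1: (123, 2, 4), 2: (2796, 17, 58), 3: (1198, 2, 28)}

for level, (tp, fp, fn) in counts.items():
    p, r, f1 = prf1(tp, fp, fn)
    print(f"Level {level}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
# Level 1: P=0.984 R=0.969 F1=0.976
# Level 2: P=0.994 R=0.980 F1=0.987
# Level 3: P=0.998 R=0.977 F1=0.988
```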
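| Comparison item | Dimensions | Scopus | WOS |
|---|---|---|---|
| Number of records | 1,852 | 2,580 | 379 |
| Intersection count | 619 | 622 | 339 |
| Fusion value | 0.33 | 0.24 | 0 |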
Fusion Rate of the Three Sources
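| Problem description | Cause | Solution strategy |
|---|---|---|
| Organizations fused incorrectly | The authority file lacks records of changes between organizations | Add change relationships for organizational transitions |
| Organizations not fused | Organizations within a single authority file were not fused | Intra-source disambiguation |
| Organizations not fused | Organizations across authority files were not fused | Inter-source disambiguation |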
Problems Discovered During Fusion Result Verification and Solution Strategies
Organization Authority File(Partial)