Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 56-65    DOI: 10.11925/infotech.2096-3467.2018.1368
Clustering Wikidata’s Organizational Entities with Latent Semantic Index
Junzhi Jia1, Zhuangzhuang Ye2
1School of Information Resource Management, Renmin University of China, Beijing 100872, China
2School of Economics and Management, Shanxi University, Taiyuan 030006, China
Abstract  

[Objective] This paper proposes a model for classifying institutions in Wikidata’s category trees, aiming to better organize these entities. [Methods] We used an unsupervised hierarchical clustering algorithm to automatically cluster institutional instances that lack proper tags. To eliminate the influence of co-occurring feature words, we introduced the relevant attributes of the organizational entities in Wikidata. Because the clustering algorithm is sensitive to data dimensionality, we used the Latent Semantic Index (LSI) to represent the texts, mapping the high-dimensional data to a low-dimensional latent semantic space through singular value decomposition. [Results] The proposed clustering method reached an accuracy of 87.3% on the experimental dataset. [Limitations] The sample dataset needs to be expanded. [Conclusions] The proposed model can effectively group the names of similar institutions and addresses the problem of clustering high-dimensional texts.

Key words: Organizational Entity Clustering; Latent Semantic Index; Hierarchical Clustering; Wikidata
Received: 04 December 2018      Published: 25 November 2019
ZTFLH:  G254  
Corresponding Authors: Junzhi Jia     E-mail: junzhij@163.com

Cite this article:

Junzhi Jia,Zhuangzhuang Ye. Clustering Wikidata’s Organizational Entities with Latent Semantic Index. Data Analysis and Knowledge Discovery, 2019, 3(10): 56-65.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1368     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I10/56

Range of instance count (i) | Number of entries
i=0 | 1,181
0<i≤10 | 1,474
10<i≤100 | 865
100<i≤1,000 | 366
1,000<i≤10,000 | 111
i>10,000 | 23
Name | Description | Abstract of Wikipedia
Academy of Fine Arts | Art society in India | The Academy of Fine Arts, in Kolkata (formerly Calcutta) is …[16]
Cluster No. | Count | Institutions
0 | 26 | Animal Aid; Animal Liberation; Animal Defenders International…
1 | 35 | International Society of Copier Artists; Lithuanian Artists' Association; Artists Union…
2 | 116 | Air Accident Investigation Bureau of Singapore; Air Accidents Investigation Institute; Airlines Electronic Engineering Committee…
3 | 76 | Medical Emergency Relief International; Medical Council of India; Center for Medical Progress…
4 | 55 | Academy of Labor and Social Relations; Academy of Labour Social Relations and Tourism; Academy of Innovation Management…
5 | 30 | Accreditation Council for Business Schools and Programs; Accreditation Service for International Colleges; Accreditation and Quality Assurance Commission…
6 | 33 | Financial Stability Board; Financial Stability Forum; Financial and Economic Committee…
7 | 53 | Women Food and Agriculture Network; Women Involved in Nurturing, Giving, Sharing; Women's Armed Services Integration Act…
8 | 62 | National Student Film Association; National Student Lobby; Students Offering Support…
9 | 14 | Accounting Professional & Ethical Standards Board; Accounting Standards Board; Educational Foundation for Women in Accounting…
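The pipeline outlined in the abstract (represent the texts in a latent semantic space through singular value decomposition, then apply hierarchical clustering) can be sketched in a few lines. What follows is a minimal illustration, not the authors' implementation. It assumes six institution names taken from the table above as the only input text, raw term counts rather than the Wikidata attributes and weighting the paper relies on, NumPy for the SVD, and SciPy's average-linkage clustering with cosine distance. The function names, the choice of k = 2 latent dimensions, and the linkage settings are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six names copied from the cluster table; a real run would use far more text.
names = [
    "Animal Aid",
    "Animal Liberation",
    "Animal Defenders International",
    "Financial Stability Board",
    "Financial Stability Forum",
    "Financial and Economic Committee",
]


def term_document_matrix(docs):
    """Build a raw-count matrix with one row per term and one column per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for term in doc:
            matrix[index[term], j] += 1.0
    return vocab, matrix


def lsi_embed(matrix, k):
    """Keep the k largest singular directions and return one k-dim row per document."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (vt[:k, :] * s[:k, None]).T  # shape: (n_docs, k)


def cluster_names(docs, k=2, n_clusters=2):
    """Embed documents with LSI, then cut an average-linkage tree into n_clusters."""
    _, matrix = term_document_matrix(docs)
    embedded = lsi_embed(matrix, k)
    tree = linkage(embedded, method="average", metric="cosine")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


labels = cluster_names(names)
for name, label in zip(names, labels):
    print(label, name)
```

On this toy input the two name families share no vocabulary, so the two leading singular directions separate them, and cutting the tree into two clusters groups the "Animal" names apart from the "Financial" names. Names such as "Animal Aid" and "Animal Liberation", which share only one word, end up pointing in the same direction in the latent space, which is the effect the dimensionality reduction is meant to provide.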
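Pair categories are defined by two conditions: belonging to the same class in manual clustering, and belonging to the same class in automatic clustering.
Label | Count
TP | 15,096
TN | 99,612
FP | 3,634
FN | 6,408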
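The four counts above can be turned into standard pairwise evaluation measures. The sketch below is illustrative only and assumes the usual pair-counting convention, since the cells of the definition table did not survive: TP is a pair placed together by both the manual and the automatic clustering, TN a pair separated by both, FP a pair joined only by the automatic clustering, and FN a pair joined only by the manual one. The function names are hypothetical. How the 87.3% figure in the abstract was computed is not stated here, so the sketch does not attempt to reproduce it.

```python
from itertools import combinations


def pair_counts(manual, automatic):
    """Count TP, TN, FP, FN over all unordered pairs of items (assumed convention)."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(manual)), 2):
        same_manual = manual[i] == manual[j]
        same_auto = automatic[i] == automatic[j]
        if same_manual and same_auto:
            tp += 1
        elif not same_manual and not same_auto:
            tn += 1
        elif same_auto:
            fp += 1  # joined automatically, separate in the manual grouping
        else:
            fn += 1  # joined manually, separated automatically
    return tp, tn, fp, fn


def pair_metrics(tp, tn, fp, fn):
    """Derive Rand index, pairwise precision and pairwise recall from the counts."""
    total = tp + tn + fp + fn
    return {
        "pairs": total,
        "rand_index": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts as reported in the table.
metrics = pair_metrics(tp=15096, tn=99612, fp=3634, fn=6408)
for key, value in metrics.items():
    print(key, value)
```

The counts sum to 124,750 pairs, which is the number of unordered pairs among 500 items and matches the total of the ten cluster sizes listed earlier. Under the assumed convention the Rand index comes out at roughly 0.92, pairwise precision at roughly 0.81, and pairwise recall at roughly 0.70.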
[1] Xian Xin, Zeng Jianxun. Research on Identification Systems of Scientific Research Entity[J]. Library and Information Service, 2015,59(12):113-119.
[2] Li Huijia, Ma Jianling, Zhang Xiuxiu, et al. The Practice and Analysis of the Construction of Chinese Institution Name Library: “Institution Name Authority of Chinese Academy of Science” as Example[J]. Library & Information, 2016(1):133-139.
[3] Yang Yihong, Li Yaping, Zhang Lili, et al. The Compilation of Multi-Echelon Thesaurus of Organization Names and Its Application in the Document Measurement and Evaluation and in the Management of Achievements in Scientific Researches[J]. Digital Library Forum, 2013(6):57-63.
[4] Hu Wanting, Yang Yan, Yin Hongfeng, et al. Organization Name Recognition Based on Word Frequency Statistics[J]. Application Research of Computers, 2013,30(7):2014-2016.
[5] Jia Junzhi, Ye Zhuangzhuang. Construction and Optimization of Organizational Category Tree Based on Wikidata[J]. Journal of the National Library of China, 2018,27(1):56-64.
[6] Liu Pengjie. Semantic Web Search Technology Based on Wikipedia[D]. Tianjin: Tianjin University of Technology, 2015.
[7] Deerwester S, Dumais S T, Furnas G W, et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990,41:391-407.
[8] Wu Qiwei. Design and Implementation of Text Clustering Based on Vector Space Model[D]. Beijing: Beijing Jiaotong University, 2014.
[9] Li Huayun, Jin Yujian. Latent Semantic Indexing Based on Level Search Scheme[J]. Library and Information Service, 2006,50(11):36-38.
[10] Liao Lvchao, Jiang Xinhua, Zou Fumin, et al. A Spectral Clustering Method for Big Trajectory Data Mining with Latent Semantic Correlation[J]. Acta Electronica Sinica, 2015,43(5):956-964.
doi: 10.3969/j.issn.0372-2112.2015.05.019
[11] Karypis G, Han E H S. Fast Supervised Dimensionality Reduction Algorithm with Applications to Document Categorization & Retrieval [C]// Proceedings of the 9th International Conference on Information and Knowledge Management, McLean, VA, USA. ACM, 2000: 12-19.
[12] Bingham E, Mannila H. Random Projection in Dimensionality Reduction: Applications to Image and Text Data [C]// Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. ACM, 2001: 245-250.
[13] Zhao Wei. Research on Probability Latent Semantic Analysis Algorithm Based on Parallel Computing[J]. Journal of Anhui Vocational & Technical College, 2014,13(3):1-3.
[14] Chen Lifei. Research on Clustering Methods for High Dimensional Data and Their Applications[D]. Xiamen: Xiamen University, 2008.
[15] Aranganayagi S, Thangavel K. Clustering Categorical Data Using Silhouette Coefficient as a Relocating Measure [C]// Proceedings of the 2007 International Conference on Computational Intelligence and Multimedia Applications, Sivakasi, Tamil Nadu, India. IEEE, 2007: 13-17.
[16] China Academy of Art. Wikipedia, The Free Encyclopedia[DB/OL]. [2018-03-28].
[17] Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora [C]// Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta. 2010: 46-50.
[18] Jiang Zilin. Hierarchical Clustering Method and Application[J]. Electronic Technology & Software Engineering, 2018(1):179-180.
[19] Alahakoon D, Halgamuge S K, Srinivasan B. Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery[J]. IEEE Transactions on Neural Networks, 2000,11(3):601-614.