Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 56-65    DOI: 10.11925/infotech.2096-3467.2018.1368
Clustering Wikidata’s Organizational Entities with Latent Semantic Index
Junzhi Jia1, Zhuangzhuang Ye2
1School of Information Resource Management, Renmin University of China, Beijing 100872, China
2School of Economics and Management, Shanxi University, Taiyuan 030006, China
[Objective] This paper proposes a model for classifying institutions in Wikidata’s category trees, aiming to organize these entities better. [Methods] We used an unsupervised hierarchical clustering algorithm to automatically cluster institutional instances that lack proper tags. To eliminate the influence of co-occurring feature words, we introduced the relevant attributes of the organizational entities in Wikidata. Because the clustering algorithm is sensitive to the dimensionality of the data, we represented the texts with the Latent Semantic Index, mapping the high-dimensional data to a low-dimensional latent semantic space through singular value decomposition. [Results] The proposed clustering method reached an accuracy of 87.3% on the experimental dataset. [Limitations] The sample dataset needs to be expanded. [Conclusions] The proposed model can effectively group the names of similar institutions and addresses the problem of clustering high-dimensional texts.
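
The method summarized in the abstract (a text representation reduced by singular value decomposition, followed by hierarchical clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy descriptions are invented, the helper names `lsi_embed` and `agglomerative` are hypothetical, and TF-IDF weighting, cosine distance, and average linkage are choices the abstract does not specify.

```python
import numpy as np

def lsi_embed(docs, k):
    """Project documents into a k-dimensional latent space (LSI-style)."""
    # Build a simple whitespace-tokenized vocabulary (assumption: no stemming).
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            tf[i, index[w]] += 1
    # TF-IDF weighting (one common choice; not stated in the source).
    df = (tf > 0).sum(axis=0)
    x = tf * (np.log(len(docs) / df) + 1.0)
    # Truncated SVD: keep the k largest singular values and their vectors.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    z = u[:, :k] * s[:k]
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.where(norms == 0, 1.0, norms)

def agglomerative(z, n_clusters):
    """Naive average-linkage agglomerative clustering on cosine distance."""
    dist = 1.0 - z @ z.T
    clusters = [[i] for i in range(len(z))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair (b > a)
    labels = np.empty(len(z), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Invented toy descriptions, standing in for entity text and attributes.
docs = [
    "animal welfare charity animal rights",
    "animal rights advocacy group",
    "accounting standards board",
    "accounting ethics standards body",
]
labels = agglomerative(lsi_embed(docs, k=2), n_clusters=2)
print(labels)
```

In this toy setting the two animal-related descriptions fall into one cluster and the two accounting-related ones into the other. The naive merge loop is cubic in the number of documents, so a library implementation would be preferable for real data.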

Key words: Organizational Entity Clustering; Latent Semantic Index; Hierarchical Clustering; Wikidata
Received: 04 December 2018      Published: 25 November 2019
Cite this article:

Junzhi Jia, Zhuangzhuang Ye. Clustering Wikidata’s Organizational Entities with Latent Semantic Index. Data Analysis and Knowledge Discovery, 2019, 3(10): 56-65.

Range of instance count (i)    Number of entries
i = 0                          1,181
0 < i ≤ 10                     1,474
10 < i ≤ 100                   865
100 < i ≤ 1,000                366
1,000 < i ≤ 10,000             111
i > 10,000                     23

Name                    Description             Abstract of Wikipedia
Academy of Fine Arts    Art society in India    The Academy of Fine Arts, in Kolkata (formerly Calcutta) is …[16]

Cluster No.    Count    Institutions
0              26       Animal Aid; Animal Liberation; Animal Defenders International…
1              35       International Society of Copier Artists; Lithuanian Artists' Association; Artists Union…
2              116      Air Accident Investigation Bureau of Singapore; Air Accidents Investigation Institute; Airlines Electronic Engineering Committee…
3              76       Medical Emergency Relief International; Medical Council of India; Center for Medical Progress…
4              55       Academy of Labor and Social Relations; Academy of Labour Social Relations and Tourism; Academy of Innovation Management…
5              30       Accreditation Council for Business Schools and Programs; Accreditation Service for International Colleges; Accreditation and Quality Assurance Commission…
6              33       Financial Stability Board; Financial Stability Forum; Financial and Economic Committee…
7              53       Women Food and Agriculture Network; Women Involved in Nurturing, Giving, Sharing; Women's Armed Services Integration Act…
8              62       National Student Film Association; National Student Lobby; Students Offering Support…
9              14       Accounting Professional & Ethical Standards Board; Accounting Standards Board; Educational Foundation for Women in Accounting…

Label    Count
TP       15,096
TN       99,612
FP       3,634
FN       6,408
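
The four counts above can be turned into standard pair-counting evaluation measures. The sketch below assumes that TP, TN, FP and FN count pairs of entities, which is suggested by their sum (124,750) being equal to the number of unordered pairs among 500 items, the total of the cluster sizes listed earlier. The source does not show which formula produced the 87.3% accuracy reported in the abstract, so these values are computed from the table and are not a reproduction of that figure.

```python
from math import comb

# Counts taken from the table above.
tp, tn, fp, fn = 15096, 99612, 3634, 6408

# Assumption: the counts are over entity pairs. 500 is the sum of the
# cluster sizes in the cluster table (26 + 35 + ... + 14).
n_entities = 500
total_pairs = tp + tn + fp + fn
pairs_match = total_pairs == comb(n_entities, 2)

# Standard pair-counting measures.
rand_index = (tp + tn) / total_pairs   # share of pairs decided correctly
precision = tp / (tp + fp)             # correct among pairs grouped together
recall = tp / (tp + fn)                # recovered among pairs that belong together
f1 = 2 * precision * recall / (precision + recall)

print(f"pairs consistent with n=500: {pairs_match}")
print(f"Rand index: {rand_index:.4f}")
print(f"precision:  {precision:.4f}")
print(f"recall:     {recall:.4f}")
print(f"F1:         {f1:.4f}")
```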
