Clustering Wikidata’s Organizational Entities with Latent Semantic Index
Junzhi Jia1, Zhuangzhuang Ye2
1School of Information Resource Management, Renmin University of China, Beijing 100872, China
2School of Economics and Management, Shanxi University, Taiyuan 030006, China
[Objective] This paper proposes a model for classifying institutions in Wikidata’s category trees, with the aim of organizing these entities better. [Methods] We used an unsupervised hierarchical clustering algorithm to automatically cluster institutional instances that lack proper tags. To reduce the influence of co-occurring feature words, we introduced the relevant attributes of the organizational entities in Wikidata. Because the clustering algorithm is sensitive to the dimensionality of the data, we used Latent Semantic Indexing to represent the texts, mapping the high-dimensional data to a latent low-dimensional semantic space through singular value decomposition. [Results] The proposed clustering method reached an accuracy of 87.3% on the experimental dataset. [Limitations] The sample dataset needs to be expanded. [Conclusions] The proposed model can effectively aggregate the names of similar institutions and addresses the clustering problems of high-dimensional texts.
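The pipeline summarized in the abstract (text representation, dimensionality reduction by singular value decomposition, then hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed TF-IDF weighting, the choice of k = 3 latent dimensions, and average-linkage merging on cosine distance are all assumptions made for illustration, and the function names `tfidf_matrix`, `lsi_embed`, and `agglomerative` are hypothetical. The six institution names are taken from the cluster examples reported in this paper and were chosen so that the three groups share no words, which makes the separation clean by construction.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a documents x terms TF-IDF matrix (assumed weighting scheme)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokenized):
        for t in doc:
            counts[i, index[t]] += 1
    df = (counts > 0).sum(axis=0)                    # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0   # smoothed IDF
    return counts * idf

def lsi_embed(X, k):
    """Project each document onto the top-k singular directions of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :k] * s[:k]                             # documents in the latent space
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms == 0, 1.0, norms)      # unit rows: dot product = cosine

def agglomerative(Z, n_clusters):
    """Bottom-up clustering: repeatedly merge the pair of clusters with the
    smallest average cosine distance (assumed linkage criterion)."""
    dist = 1.0 - Z @ Z.T
    clusters = [[i] for i in range(len(Z))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)               # merge b into a
    labels = np.empty(len(Z), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

names = [
    "Financial Stability Board",
    "Financial Stability Forum",
    "Medical Council of India",
    "Center for Medical Progress",
    "National Student Film Association",
    "National Student Lobby",
]

Z = lsi_embed(tfidf_matrix(names), k=3)   # k = 3 is an illustrative choice
labels = agglomerative(Z, n_clusters=3)
for name, label in zip(names, labels):
    print(label, name)
```

On real data the names overlap far more than in this toy set, so the number of latent dimensions and the number of clusters would need to be tuned rather than fixed in advance.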
| Cluster | Count | Examples |
| --- | --- | --- |
|  |  | International Society of Copier Artists; Lithuanian Artists' Association; Artists Union… |
| 2 | 116 | Air Accident Investigation Bureau of Singapore; Air Accidents Investigation Institute; Airlines Electronic Engineering Committee… |
| 3 | 76 | Medical Emergency Relief International; Medical Council of India; Center for Medical Progress… |
| 4 | 55 | Academy of Labor and Social Relations; Academy of Labour Social Relations and Tourism; Academy of Innovation Management… |
| 5 | 30 | Accreditation Council for Business Schools and Programs; Accreditation Service for International Colleges; Accreditation and Quality Assurance Commission… |
| 6 | 33 | Financial Stability Board; Financial Stability Forum; Financial and Economic Committee… |
| 7 | 53 | Women Food and Agriculture Network; Women Involved in Nurturing, Giving, Sharing; Women's Armed Services Integration Act… |
| 8 | 62 | National Student Film Association; National Student Lobby; Students Offering Support… |
| 9 | 14 | Accounting Professional & Ethical Standards Board; Accounting Standards Board; Educational Foundation for Women in Accounting… |