[Objective] This paper proposes a model for classifying institutions in Wikidata's category trees, with the aim of organizing these entities more effectively. [Methods] We used an unsupervised hierarchical clustering algorithm to automatically cluster institutional instances that lack proper tags. To reduce the influence of co-occurring feature words, we introduced the relevant attributes of the organizational entities in Wikidata. Because the clustering algorithm is sensitive to the dimensionality of the data, we represented the texts with Latent Semantic Indexing, which maps the high-dimensional data to a latent low-dimensional semantic space through singular value decomposition. [Results] The proposed clustering method reached an accuracy of 87.3% on the experimental dataset. [Limitations] The sample dataset needs to be expanded. [Conclusions] The proposed model can effectively aggregate the names of similar institutions and addresses the problem of clustering high-dimensional texts.
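The pipeline summarized in the abstract (a latent semantic representation obtained by singular value decomposition, followed by hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy documents, the function names, the choice of `k=2` latent dimensions, and the use of average linkage on cosine distance are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical toy "documents": institution names plus attribute words.
# These are illustrative only and are not taken from the paper's dataset.
docs = [
    "peking university education research beijing",
    "tsinghua university education research beijing",
    "fudan university education research shanghai",
    "national museum exhibition collection art",
    "palace museum exhibition collection heritage",
    "shanghai museum exhibition collection art",
]

def term_document_matrix(texts):
    """Build a raw term-count matrix with one row per document."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.split():
            X[row, index[w]] += 1.0
    return X

def lsi_embed(X, k):
    """Project documents onto the top-k singular directions (truncated SVD)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def agglomerative(Z, n_clusters):
    """Average-linkage agglomerative clustering on cosine distance (assumed choice)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    dist = 1.0 - Zn @ Zn.T
    clusters = [[i] for i in range(len(Z))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(Z), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

X = term_document_matrix(docs)          # high-dimensional bag-of-words vectors
Z = lsi_embed(X, k=2)                   # low-dimensional latent semantic vectors
labels = agglomerative(Z, n_clusters=2) # one cluster label per document
print(labels)
```

On this toy input, the university entries and the museum entries should fall into separate clusters, because each group shares most of its attribute words. A real run would need a weighting scheme such as TF-IDF and a principled way to choose the number of latent dimensions and clusters.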