Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 56-65     https://doi.org/10.11925/infotech.2096-3467.2018.1368
Research Paper
Clustering Wikidata’s Organizational Entities with Latent Semantic Index *
Junzhi Jia (1), Zhuangzhuang Ye (2)
(1) School of Information Resource Management, Renmin University of China, Beijing 100872, China
(2) School of Economics and Management, Shanxi University, Taiyuan 030006, China
Abstract

[Objective] In the category tree of Wikidata's organization classes, some classes have so many instances that their extension becomes too broad to clearly indicate or classify resources. To systematize the hierarchy of organization names, these instances need to be partitioned so that they are distributed more evenly across the levels of the category tree. [Methods] An unsupervised hierarchical clustering algorithm is used to automatically cluster organization instances that have no class labels. To reduce the influence of co-occurring feature words in organization names on the clustering algorithm, related properties of the organization entities in Wikidata are introduced as context. Because clustering algorithms are sensitive to data dimensionality, latent semantic indexing is used as the text representation model, with singular value decomposition mapping the high-dimensional data into a low-dimensional latent semantic space. [Results] The method reached a clustering accuracy of 87.3% on the experimental dataset. [Limitations] The method was validated only on a small sample dataset. [Conclusions] Providing context for organization names helps organizations of the same type cluster together, and hierarchical clustering based on the latent semantic indexing model is effective for high-dimensional text clustering.

Keywords: Organizational Entity Clustering; Latent Semantic Index; Hierarchical Clustering; Wikidata
Received: 2018-12-04      Published: 2019-11-25
CLC number (ZTFLH): G254
Funding: *This work is one of the research outcomes of the National Social Science Fund of China Key Project "Research on Semantic Description and Data Aggregation of Chinese Name Authority Files Based on Linked Data" (15ATQ004).
Corresponding author: Junzhi Jia     E-mail: junzhij@163.com
Cite this article:
Junzhi Jia, Zhuangzhuang Ye. Clustering Wikidata’s Organizational Entities with Latent Semantic Index. Data Analysis and Knowledge Discovery, 2019, 3(10): 56-65.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1368
or
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I10/56
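Table: Distribution of entries in the organization category tree by number of instances

| Number of instances (i) | Number of entries |
|---|---|
| i = 0 | 1,181 |
| 0 < i ≤ 10 | 1,474 |
| 10 < i ≤ 100 | 865 |
| 100 < i ≤ 1,000 | 366 |
| 1,000 < i ≤ 10,000 | 111 |
| i > 10,000 | 23 |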
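Figure: Data structure of an organization-class entry in Wikidata (using Organization as an example)
Figure: Text clustering workflow
Figure: Implementation workflow for clustering organization instances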
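The abstract describes clustering unlabeled organization instances with an unsupervised hierarchical clustering algorithm. The following is a minimal sketch of bottom-up (agglomerative) clustering, not the authors' implementation. Average linkage, cosine distance, and the toy vectors are all assumptions chosen for illustration, since this page does not state which linkage or distance was used.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Closeness is average linkage (mean pairwise distance), an assumption.
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical 2-D vectors standing in for documents in a latent space.
vectors = [(1.0, 0.1), (0.9, 0.3), (0.1, 1.0), (0.3, 0.8)]
clusters = agglomerative(vectors, n_clusters=2)
print(clusters)
```

This naive version recomputes every pairwise distance on each pass, so it is only suitable for small inputs; it is meant to show the merge loop, not to scale.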
Table: Example of the storage format of the organization entity dataset

| Name | Description | Wikipedia abstract |
|---|---|---|
| Academy of Fine Arts | Art society in India | The Academy of Fine Arts, in Kolkata (formerly Calcutta) is …[16] |
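The dataset stores a name, a description, and a Wikipedia abstract for each organization, and the abstract of the paper says that such related properties serve as context for the name. A minimal sketch of turning one record into a single context document might look as follows. The `OrgRecord` class, the simple concatenation, and the lowercase alphabetic tokenizer are hypothetical choices, not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class OrgRecord:
    """One row of the dataset: the three fields shown in the example table."""
    name: str
    description: str
    abstract: str


def context_document(record):
    """Join the non-empty fields into one text (assumed: plain concatenation)."""
    parts = [record.name, record.description, record.abstract]
    return " ".join(p.strip() for p in parts if p and p.strip())


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# The example row from the table; the abstract is truncated in the source.
record = OrgRecord(
    name="Academy of Fine Arts",
    description="Art society in India",
    abstract="The Academy of Fine Arts, in Kolkata (formerly Calcutta) is …",
)
document = context_document(record)
tokens = tokenize(document)
print(document)
print(tokens)
```

With the description and abstract appended, the token list gains words such as "society", "india", and "kolkata" that the bare name does not contain.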
Figure: Change in singular values as k increases
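The abstract describes latent semantic indexing as a singular value decomposition that maps high-dimensional text data into a low-dimensional latent semantic space. The sketch below illustrates that idea with NumPy on a hypothetical 4x4 term-document count matrix. The `retained_energy` helper is a common heuristic for inspecting how singular values fall off with k; it is an assumption here, not the selection rule used in the paper.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Truncated SVD: return all singular values and k-dim document vectors.

    term_doc has one row per term and one column per document.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Scale the top-k right singular vectors by their singular values;
    # row j of the result is document j in the latent space.
    doc_vectors = (s[:k, None] * vt[:k, :]).T
    return s, doc_vectors


def retained_energy(singular_values):
    """Cumulative share of squared singular values kept by the first k dimensions."""
    squared = singular_values ** 2
    return np.cumsum(squared) / squared.sum()


# Hypothetical toy counts: documents 0-1 use "art" terms, documents 2-3 use "medical" terms.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],  # term "art"
    [1.0, 2.0, 0.0, 0.0],  # term "artists"
    [0.0, 0.0, 2.0, 1.0],  # term "medical"
    [0.0, 0.0, 1.0, 2.0],  # term "health"
])

singular_values, doc_vectors = lsi_project(term_doc, k=2)
energy = retained_energy(singular_values)
print("singular values:", np.round(singular_values, 3))
print("energy retained by first k dimensions:", np.round(energy, 3))
print("document vectors (k=2):")
print(np.round(doc_vectors, 3))
```

In this toy case the two "art" documents collapse onto the same latent vector and are orthogonal to the two "medical" documents, which is the kind of grouping a downstream clustering step can exploit.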
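Figure: Line chart of the mean silhouette coefficient of text clustering for the two clustering methods at different numbers of clusters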
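The mean silhouette coefficient used in this comparison can be computed directly from its standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The sketch below is a plain-Python illustration under the assumption of Euclidean distance; the toy points and labelings are hypothetical, and the distance measure used in the paper is not stated on this page.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette score over all points, using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Singleton cluster, or only one cluster overall: score set to 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Evaluating this score for each candidate number of clusters and plotting the values gives a curve of the kind the figure caption describes.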
Table: Clustering results for organization entities (examples)

| Cluster ID | Size | Organizations |
|---|---|---|
| 0 | 26 | Animal Aid; Animal Liberation; Animal Defenders International… |
| 1 | 35 | International Society of Copier Artists; Lithuanian Artists' Association; Artists Union… |
| 2 | 116 | Air Accident Investigation Bureau of Singapore; Air Accidents Investigation Institute; Airlines Electronic Engineering Committee… |
| 3 | 76 | Medical Emergency Relief International; Medical Council of India; Center for Medical Progress… |
| 4 | 55 | Academy of Labor and Social Relations; Academy of Labour Social Relations and Tourism; Academy of Innovation Management… |
| 5 | 30 | Accreditation Council for Business Schools and Programs; Accreditation Service for International Colleges; Accreditation and Quality Assurance Commission… |
| 6 | 33 | Financial Stability Board; Financial Stability Forum; Financial and Economic Committee… |
| 7 | 53 | Women Food and Agriculture Network; Women Involved in Nurturing, Giving, Sharing; Women's Armed Services Integration Act… |
| 8 | 62 | National Student Film Association; National Student Lobby; Students Offering Support… |
| 9 | 14 | Accounting Professional & Ethical Standards Board; Accounting Standards Board; Educational Foundation for Women in Accounting… |
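Table: The four cases of automatic versus manual clustering of organization entities

| Same class in manual clustering | Same class in automatic clustering | Label | Count |
|---|---|---|---|
| Yes | Yes | TP | 15,096 |
| No | No | TN | 99,612 |
| No | Yes | FP | 3,634 |
| Yes | No | FN | 6,408 |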
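The four counts can be read as pair counts: each pair of entities is either placed together or apart by the manual clustering and by the automatic clustering. The counts sum to 124,750, which equals the number of unordered pairs among 500 entities, and 500 is also the sum of the cluster sizes in the results table. The sketch below derives such counts from two labelings and computes some standard pair-counting measures from the reported numbers. This page does not show which formula yields the reported 87.3%, so these measures are offered as an assumption and are not presented as the authors' evaluation.

```python
from itertools import combinations
from math import comb


def pair_counts(manual, automatic):
    """Count entity pairs by agreement between a manual and an automatic labeling."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(manual)), 2):
        same_manual = manual[i] == manual[j]
        same_auto = automatic[i] == automatic[j]
        if same_manual and same_auto:
            tp += 1  # together in both
        elif not same_manual and not same_auto:
            tn += 1  # apart in both
        elif same_auto:
            fp += 1  # together only in the automatic result
        else:
            fn += 1  # together only in the manual result
    return tp, tn, fp, fn


# Counts reported in the table above.
tp, tn, fp, fn = 15096, 99612, 3634, 6408
total_pairs = tp + tn + fp + fn

rand_index = (tp + tn) / total_pairs
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("total pairs:", total_pairs, "| C(500, 2):", comb(500, 2))
print(f"Rand index={rand_index:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} F1={f1:.4f}")

# Tiny hypothetical labelings to show how the counts arise.
toy_counts = pair_counts([0, 0, 1, 1], [0, 0, 1, 0])
print("toy counts (TP, TN, FP, FN):", toy_counts)
```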
References
[1] Xian Xin, Zeng Jianxun. Research on Identification Systems of Scientific Research Entity[J]. Library and Information Service, 2015, 59(12): 113-119.
[2] Li Huijia, Ma Jianling, Zhang Xiuxiu, et al. The Practice and Analysis of the Construction of Chinese Institution Name Library: “Institution Name Authority of Chinese Academy of Science” as Example[J]. Library & Information, 2016(1): 133-139.
[3] Yang Yihong, Li Yaping, Zhang Lili, et al. The Compilation of Multi-Echelon Thesaurus of Organization Names and Its Application in the Document Measurement and Evaluation and in the Management of Achievements in Scientific Researches[J]. Digital Library Forum, 2013(6): 57-63.
[4] Hu Wanting, Yang Yan, Yin Hongfeng, et al. Organization Name Recognition Based on Word Frequency Statistics[J]. Application Research of Computers, 2013, 30(7): 2014-2016.
[5] Jia Junzhi, Ye Zhuangzhuang. Construction and Optimization of Organizational Category Tree Based on Wikidata[J]. Journal of the National Library of China, 2018, 27(1): 56-64.
[6] Liu Pengjie. Semantic Web Search Technology Based on Wikipedia[D]. Tianjin: Tianjin University of Technology, 2015.
[7] Deerwester S, Dumais S T, Furnas G W, et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
[8] Wu Qiwei. Design and Implementation of Text Clustering Based on Vector Space Model[D]. Beijing: Beijing Jiaotong University, 2014.
[9] Li Huayun, Jin Yujian. Latent Semantic Indexing Based on Level Search Scheme[J]. Library and Information Service, 2006, 50(11): 36-38.
[10] Liao Lvchao, Jiang Xinhua, Zou Fumin, et al. A Spectral Clustering Method for Big Trajectory Data Mining with Latent Semantic Correlation[J]. Acta Electronica Sinica, 2015, 43(5): 956-964. doi: 10.3969/j.issn.0372-2112.2015.05.019
[11] Karypis G, Han E H S. Fast Supervised Dimensionality Reduction Algorithm with Applications to Document Categorization & Retrieval[C]// Proceedings of the 9th International Conference on Information and Knowledge Management, McLean, VA, USA. ACM, 2000: 12-19.
[12] Bingham E, Mannila H. Random Projection in Dimensionality Reduction: Applications to Image and Text Data[C]// Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. ACM, 2001: 245-250.
[13] Zhao Wei. Research on Probability Latent Semantic Analysis Algorithm Based on Parallel Computing[J]. Journal of Anhui Vocational & Technical College, 2014, 13(3): 1-3.
[14] Chen Lifei. Research on Clustering Methods for High Dimensional Data and Their Applications[D]. Xiamen: Xiamen University, 2008.
[15] Aranganayagi S, Thangavel K. Clustering Categorical Data Using Silhouette Coefficient as a Relocating Measure[C]// Proceedings of the 2007 International Conference on Computational Intelligence and Multimedia Applications, Sivakasi, Tamil Nadu, India. IEEE, 2007: 13-17.
[16] China Academy of Art. Wikipedia, The Free Encyclopedia[DB/OL]. [2018-03-28].
[17] Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora[C]// Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta. 2010: 46-50.
[18] Jiang Zilin. Hierarchical Clustering Method and Application[J]. Electronic Technology & Software Engineering, 2018(1): 179-180.
[19] Alahakoon D, Halgamuge S K, Srinivasan B. Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery[J]. IEEE Transactions on Neural Networks, 2000, 11(3): 601-614.