New Technology of Library and Information Service  2015, Vol. 31 Issue (10): 22-29     https://doi.org/10.11925/infotech.1003-3513.2015.10.04
Clustering Machine-Generated Tags with Different Quality
Zhang Chengzhi (章成志)1,2, Gu Xiaoxue (顾晓雪)1
1 School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China;
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210093, China
Abstract

[Objective] Conventional clustering of tags or words does not consider how differences in the quality of the clustered objects affect clustering results. This paper analyzes how clustering results differ across machine-generated tags of different quality and proposes suggestions for optimizing tag clustering algorithms by incorporating tag quality. [Methods] First, Chinese and English blog data are crawled from Engadget and preprocessed to obtain candidate tags; social and content features of the tags are extracted and used to compute tag weights; and two strategies for distinguishing tag quality are applied to obtain tag sets of different quality. Then, similarities are computed within each tag set, the AP (affinity propagation) algorithm is used for clustering, and the clustering results are compared and analyzed. [Results] For both Chinese and English tags, clustering results for the Top5 tags are better than those for the Top5-10 tags, and clustering results for tags with weighted social attributes are better than those for tags without such weighting. [Limitations] The methods used to distinguish tag quality are limited, and an effective method for evaluating tag quality is lacking. [Conclusions] High-quality machine-generated tags cluster better than low-quality ones. Weighting the social attributes of tags can improve the clustering of machine-generated tags, and social attributes can serve as one of the features for distinguishing tag quality.
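The rank-based quality split and the similarity step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy tag weights, the use of post co-occurrence vectors, and cosine similarity are all assumptions made for illustration, and the demo uses k=2 instead of the paper's Top5 / Top5-10 split only to keep the data small.

```python
from math import sqrt


def split_by_rank(weighted_tags, k=5):
    """Rank one post's candidate tags by weight (ties broken by name).

    Returns (top-k tags, tags ranked k+1..2k), i.e. a higher-quality
    and a lower-quality set under the rank-based strategy.
    """
    ranked = sorted(weighted_tags.items(), key=lambda kv: (-kv[1], kv[0]))
    names = [tag for tag, _ in ranked]
    return names[:k], names[k:2 * k]


def tag_vectors(posts_tags):
    """Represent each tag as a sparse {post_id: 1.0} vector (assumed representation)."""
    vectors = {}
    for post_id, tags in posts_tags.items():
        for tag in tags:
            vectors.setdefault(tag, {})[post_id] = 1.0
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities over tags in alphabetical order."""
    tags = sorted(vectors)
    return tags, [[cosine(vectors[a], vectors[b]) for b in tags] for a in tags]


# Toy candidate tags with made-up weights (hypothetical data).
posts = {
    "p1": {"android": 0.9, "phone": 0.8, "review": 0.3, "today": 0.1},
    "p2": {"android": 0.7, "tablet": 0.6, "review": 0.4, "today": 0.2},
}

high_quality = {pid: split_by_rank(tags, k=2)[0] for pid, tags in posts.items()}
low_quality = {pid: split_by_rank(tags, k=2)[1] for pid, tags in posts.items()}

tags, sim = similarity_matrix(tag_vectors(high_quality))
for tag, row in zip(tags, sim):
    print(tag, [round(value, 3) for value in row])
```

A similarity matrix of this shape is the usual input to affinity propagation. One option, assuming scikit-learn is available, would be to pass it to `sklearn.cluster.AffinityPropagation` with `affinity="precomputed"` and compare the clusters obtained from the higher-quality and lower-quality tag sets.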

Received: 2014-04-29      Published: 2016-04-06
Chinese Library Classification (CLC) number:  G250  
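Funding:

This work is one of the research outcomes of the National Social Science Fund of China project "Research on User-Based Knowledge Organization Patterns in Online Social Networks" (Grant No. 14BTQ033), the Ministry of Education Humanities and Social Sciences Planning Fund project "Research on Multilingual High-Quality Social Tag Generation and Clustering" (Grant No. 13YJA870020), and the National Social Science Fund of China Key Project "Research on the Methodology System for Public Opinion and Decision Support in the Big Data Environment" (Grant No. 14AZD084).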

Corresponding author: Zhang Chengzhi, ORCID: 0000-0001-8121-4796, E-mail: zhangcz@njust.edu.cn
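Author contributions: Zhang Chengzhi proposed the research idea, discussed the research plan, collected and analyzed data, and drafted and revised the final manuscript; Gu Xiaoxue designed the research plan, designed and carried out the experiments, and cleaned and analyzed data.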
Cite this article:
Zhang Chengzhi, Gu Xiaoxue. Clustering Machine-Generated Tags with Different Quality [J]. New Technology of Library and Information Service, 2015, 31(10): 22-29.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.10.04      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I10/22

