New Technology of Library and Information Service, 2015, Vol. 31, Issue 10: 22-29. DOI: 10.11925/infotech.1003-3513.2015.10.04
Clustering Machine-Generated Tags with Different Quality
Zhang Chengzhi 1,2, Gu Xiaoxue 1
1 School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China;
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210093, China
Abstract

[Objective] Conventional tag or word clustering does not consider how differences in the quality of the clustered items affect the clustering results. This paper analyzes how clustering results differ across machine-generated tags of different quality and offers suggestions for optimizing tag clustering algorithms by incorporating tag quality. [Methods] First, Chinese and English blog posts are crawled from Engadget and preprocessed to obtain candidate tags. Social and content features of the tags are extracted and used to compute tag weights, and two quality-discrimination strategies are applied to obtain tag sets of different quality. Then, similarities are computed for each tag set, the tags are clustered with the affinity propagation (AP) algorithm, and the clustering results are compared. [Results] For both Chinese and English tags, clustering the Top5 tags gives better results than clustering the Top5-10 tags, and clustering tags with weighted social attributes gives better results than clustering without that weighting. [Limitations] The methods used to distinguish tag quality are relatively simple, and an effective method for evaluating tag quality is lacking. [Conclusions] High-quality machine-generated tags cluster better than low-quality ones. Weighting the social attributes of tags improves the clustering of machine-generated tags, and social attributes can serve as one of the features for distinguishing tag quality.
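The abstract does not state the weighting formula, the similarity measure, or the AP settings, so the pipeline it describes can only be illustrated loosely. The following is a minimal Python sketch under stated assumptions: tags are represented by hypothetical document co-occurrence vectors, similarity is cosine similarity, "Top5-10" is read as ranks 6 to 10, and the AP preference defaults to the median off-diagonal similarity. The function names (`split_by_rank`, `cosine`, `affinity_propagation`) and the toy data are illustrative and are not taken from the paper.

```python
import math
import statistics

def split_by_rank(weights, k=5):
    """Rank tags by weight, descending; return (top k, the next k).
    Assumption: 'Top5-10' in the abstract means ranks k+1 .. 2k."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k], ranked[k:2 * k]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(d, 0.0) for d, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def affinity_propagation(S, preference=None, damping=0.5, iterations=200):
    """Return, for each item, the index of its exemplar (needs >= 2 items)."""
    n = len(S)
    S = [row[:] for row in S]  # copy so the caller's matrix is untouched
    if preference is None:
        # Assumption: use the median off-diagonal similarity as the preference.
        preference = statistics.median(
            S[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        S[i][i] = preference
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k});
        # a(k,k) = sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return list(range(n))  # no exemplar emerged: every item stands alone
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]

# Toy data (hypothetical): each tag maps to the documents it was generated for.
tags = ["apple", "iphone", "ipad", "google", "android", "nexus"]
vectors = [{"d1": 1, "d2": 1}, {"d1": 1}, {"d2": 1},
           {"d3": 1, "d4": 1}, {"d3": 1}, {"d4": 1}]
S = [[cosine(u, v) for v in vectors] for u in vectors]
labels = affinity_propagation(S)
clusters = {}
for tag, label in zip(tags, labels):
    clusters.setdefault(tags[label], []).append(tag)
print(clusters)
```

To mirror the comparison in the abstract, one would call `split_by_rank` on each post's tag weights, pool the higher-ranked and lower-ranked tags into two separate sets, and run the similarity and clustering steps on each set. How the weights combine social and content features is left open here because the abstract does not specify it.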

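Received: 2014-04-29
Chinese Library Classification: G250
Funding:

This work is one of the research outcomes of the National Social Science Fund of China project "Research on User-Based Knowledge Organization Patterns in Online Social Networks" (No. 14BTQ033), the Ministry of Education Humanities and Social Sciences Fund planning project "Research on Multilingual High-Quality Social Tag Generation and Clustering" (No. 13YJA870020), and the National Social Science Fund of China key project "Research on a Methodology System for Public Opinion and Decision Support in the Big Data Environment" (No. 14AZD084).

Corresponding author: Zhang Chengzhi, ORCID: 0000-0001-8121-4796, E-mail: zhangcz@njust.edu.cn
Author contributions: Zhang Chengzhi proposed the research idea, discussed the research plan, collected and analyzed data, drafted the paper, and revised the final version; Gu Xiaoxue designed the research plan, designed and carried out the experiments, and cleaned and analyzed data.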
Cite this article:
Zhang Chengzhi, Gu Xiaoxue. Clustering Machine-Generated Tags with Different Quality [J]. New Technology of Library and Information Service, 2015, 31(10): 22-29. DOI: 10.11925/infotech.1003-3513.2015.10.04.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.10.04

