Combined with Annotated Content and User Attributes for Tag Clustering

doi:10.11925/infotech.1003-3513.2015.10.05

New Technology of Library and Information Service

2015, Vol. 31

Issue (10): 30-39 https://doi.org/10.11925/infotech.1003-3513.2015.10.05

Special Topic



Combined with Annotated Content and User Attributes for Tag Clustering

Gu Xiaoxue¹, Zhang Chengzhi¹,²

1 School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China;
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210093, China


Abstract:

[Objective] To examine how tags' annotated content, their user attributes, and the combination of the two affect tag clustering results. [Methods] Using blog data from ScienceNet.cn, we extract tag features, build vector space models, and compute the similarities between tags. The similarities are then weighted with a linear function and with a Sigmoid function, and the tags are clustered with the AP (affinity propagation) clustering algorithm. [Results] Under the subject classification scheme, combining user attributes with annotated content improves the clustering results under both weighting methods, and Sigmoid weighting performs best. Under the system classification scheme, neither combination performs as well as annotated content alone. [Limitations] The dataset is small, the classification schemes used to evaluate the clustering are not well developed, and the AP clustering algorithm is not suited to large-scale data. [Conclusions] Combining the two features improves clustering results in some cases, and tag clustering should pay more attention to the content features of tags.
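The similarity-weighting step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the formulas, so the cosine similarity over sparse vectors, the weight `alpha`, the sigmoid centred at 0.5, and the `steepness` parameter are all assumptions chosen for illustration, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def sigmoid(x, steepness=5.0):
    """Sigmoid centred at 0.5 (assumed form): pushes high similarities up
    and low similarities down before they are combined."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))


def combine_linear(sim_content, sim_user, alpha=0.5):
    """Linear weighting: alpha on content, (1 - alpha) on user attributes."""
    return alpha * sim_content + (1.0 - alpha) * sim_user


def combine_sigmoid(sim_content, sim_user, alpha=0.5, steepness=5.0):
    """Sigmoid weighting: each similarity is transformed, then weighted."""
    return (alpha * sigmoid(sim_content, steepness)
            + (1.0 - alpha) * sigmoid(sim_user, steepness))


def combined_matrix(content_vecs, user_vecs, combine):
    """Pairwise combined similarity matrix for tags given as parallel lists."""
    n = len(content_vecs)
    return [
        [
            combine(cosine(content_vecs[i], content_vecs[j]),
                    cosine(user_vecs[i], user_vecs[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy data: three tags, each with a content-term vector and a
# user-attribute vector (for example, counts of taggers per discipline).
content = [{"cluster": 2.0, "tag": 1.0}, {"cluster": 1.0, "tag": 2.0}, {"gene": 3.0}]
users = [{"cs": 4.0}, {"cs": 3.0, "lis": 1.0}, {"bio": 5.0}]

linear_sims = combined_matrix(content, users, combine_linear)
sigmoid_sims = combined_matrix(content, users, combine_sigmoid)
```

A matrix such as `linear_sims` or `sigmoid_sims` could then serve as the precomputed similarity input to an affinity propagation implementation, for instance scikit-learn's `AffinityPropagation(affinity="precomputed")`.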

Received: 2015-04-29 Published: 2016-04-06

G250

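Funding:

This work is one of the research outcomes of the National Social Science Fund of China Major Project "Research on a Rapid-Response Intelligence System for Emergency Decision-Making in Unexpected Events" (No. 13&ZD174), the National Social Science Fund of China project "Research on User-Based Knowledge Organization Patterns in Online Social Networks" (No. 14BTQ033), and the Ministry of Education Humanities and Social Sciences Planning Fund project "Research on Multilingual High-Quality Social Tag Generation and Clustering" (No. 13YJA870020).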

Corresponding author: Zhang Chengzhi, ORCID: 0000-0001-8121-4796, E-mail: zhangcz@njust.edu.cn

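Author contributions: Gu Xiaoxue: research plan design, experiment design and implementation, data cleaning and analysis, drafting of the paper; Zhang Chengzhi: proposed the research idea, discussed the research plan, collected and analyzed data, revised the final version of the paper.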

Cite this article:

Gu Xiaoxue, Zhang Chengzhi. Combined with Annotated Content and User Attributes for Tag Clustering. New Technology of Library and Information Service, 2015, 31(10): 30-39.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.10.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I10/30

