Data Analysis and Knowledge Discovery, 2017, Vol. 1, Issue 2: 58-63     https://doi.org/10.11925/infotech.2096-3467.2017.02.08
Research Paper
Identifying Chinese Microblog Author Gender Based on Dependency*
Qi Ruihua
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
Abstract

[Objective] Web texts are short, and traditional stylometric feature sets become sparse on them. This paper examines the use of dependency relations for identifying the gender of Chinese microblog authors. [Methods] Public posts from Tencent Microblog served as the experimental corpus. Dependency features were extracted and compared in controlled experiments against feature sets from the existing literature: lexical, structural, function-word, part-of-speech tagging, and microblog-specific features. [Results] In controlled experiments with support vector machine, naive Bayes, nearest-neighbor, and decision tree classifiers, the dependency features achieved the highest precision, recall, and F-measure overall. [Limitations] The effectiveness of dependency relations for microblog author gender identification still needs to be verified on a larger corpus. [Conclusions] The proposed model avoids the sparsity of short-text feature sets and identifies author gender more effectively than the other feature sets compared.

Keywords: Dependency; Chinese Microblog; Gender Identification
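The abstract does not spell out how dependency relations become classifier inputs. One plausible encoding, sketched below, is the relative frequency of each relation label in a post. The label names in `RELATIONS`, the function `dependency_features`, and the example arcs are all hypothetical, and the input is assumed to be arcs already produced by some dependency parser.

```python
from collections import Counter

# Hypothetical closed set of dependency relation labels. A real parser
# defines its own tag set, so these names are placeholders only.
RELATIONS = ("SBV", "VOB", "ATT", "ADV", "CMP", "COO", "POB")


def dependency_features(arcs, relations=RELATIONS):
    """Turn the (head, relation, dependent) arcs of one post into a
    fixed-length vector of relative relation frequencies."""
    counts = Counter(rel for _, rel, _ in arcs)
    total = sum(counts[r] for r in relations)
    if total == 0:
        # No known relation found: return an all-zero vector.
        return [0.0] * len(relations)
    return [counts[r] / total for r in relations]


# Arcs invented for illustration; they are not taken from the paper's corpus.
arcs = [
    ("喜欢", "SBV", "我"),
    ("喜欢", "VOB", "电影"),
    ("电影", "ATT", "这部"),
    ("喜欢", "ADV", "很"),
]
vector = dependency_features(arcs)
print(dict(zip(RELATIONS, vector)))
```

Because the vector length is fixed by the label set rather than by the vocabulary, even a very short post yields a dense vector under this encoding, which is one way to read the abstract's point about avoiding sparsity.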
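Received: 2016-10-06      Published: 2017-03-27
Chinese Library Classification (CLC): TP182
Funding: *This work is one of the research outcomes of the National Social Science Fund of China general project "Opinion Mining of Overseas Readers' Online Reviews of English Translations of Chinese Classics" (Grant No. 15BYY028) and the Scientific Research Foundation for Returned Overseas Scholars of the Ministry of Education of China (Grant No. 教外司[2015]1098).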
Cite this article:
Qi Ruihua. Identifying Chinese Microblog Author Gender Based on Dependency*[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 58-63.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.02.08      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I2/58
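Figure: Example of dependency relations between sentence constituents
Table: Feature sets used in the controlled experiments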
Table: Gender identification results for Chinese microblog authors with LibSVM, NBC, IBK, and C4.5

Algorithm  Metric     Lexical  Structural  Microblog  Function words  POS tagging  Dependency
LibSVM     Precision  0.797    0.897       0.918      0.832           0.861        0.998
LibSVM     Recall     0.843    0.903       0.921      0.852           0.868        0.998
LibSVM     F-Measure  0.787    0.898       0.914      0.802           0.835        0.998
NBC        Precision  0.838    0.799       0.766      0.828           0.798        0.814
NBC        Recall     0.396    0.815       0.806      0.436           0.834        0.691
NBC        F-Measure  0.432    0.806       0.781      0.482           0.807        0.730
IBK        Precision  0.809    0.912       0.909      0.806           0.834        0.999
IBK        Recall     0.811    0.913       0.914      0.812           0.836        0.999
IBK        F-Measure  0.810    0.912       0.909      0.809           0.835        0.999
C4.5       Precision  0.824    0.928       0.918      0.899           0.851        0.997
C4.5       Recall     0.852    0.929       0.921      0.904           0.864        0.997
C4.5       F-Measure  0.818    0.928       0.915      0.893           0.855        0.997
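The results table can be read per classifier by picking the feature set with the highest F-Measure in each row. The sketch below does that with the F-Measure values copied from the table; the names `FEATURE_SETS`, `F_MEASURE`, and `best_feature_set` are illustrative and not from the paper.

```python
# F-Measure values copied from the results table
# (one row per algorithm, one column per feature set).
FEATURE_SETS = [
    "Lexical", "Structural", "Microblog",
    "Function words", "POS tagging", "Dependency",
]
F_MEASURE = {
    "LibSVM": [0.787, 0.898, 0.914, 0.802, 0.835, 0.998],
    "NBC":    [0.432, 0.806, 0.781, 0.482, 0.807, 0.730],
    "IBK":    [0.810, 0.912, 0.909, 0.809, 0.835, 0.999],
    "C4.5":   [0.818, 0.928, 0.915, 0.893, 0.855, 0.997],
}


def best_feature_set(scores, names=FEATURE_SETS):
    """Return (feature set name, score) for the highest score in a row."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return names[idx], scores[idx]


best = {algo: best_feature_set(row) for algo, row in F_MEASURE.items()}
for algo, (name, score) in best.items():
    print(f"{algo}: {name} ({score:.3f})")
```

On these numbers the dependency features lead for LibSVM, IBK, and C4.5, while for NBC the POS tagging set (0.807) is narrowly ahead of the structural set (0.806) and the dependency set scores 0.730.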
Figure: Decision tree built by the C4.5 algorithm on the dependency feature set