Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 58-63    DOI: 10.11925/infotech.2096-3467.2017.02.08
Original Article
Identifying Chinese Microblog Author Gender Based on Dependency
Ruihua Qi
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
Abstract  

[Objective] This paper proposes a new method for identifying the gender of Chinese microblog authors using dependency features. [Methods] The study collected public posts from Tencent Microblogs and extracted dependency features, which were analyzed and compared with existing vocabulary, structure, function-word, and part-of-speech tagging features. [Results] In a controlled experiment, the proposed method obtained the highest precision, recall, and F-measure. [Limitations] The new method needs to be examined on a larger corpus. [Conclusions] Among the feature sets compared, the proposed method was the most effective at identifying the gender of microblog authors.

Key words: Dependency; Chinese Microblog; Gender Identification
Received: 06 October 2016      Published: 27 March 2017
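
The abstract names two technical steps: turning each post into dependency features, and scoring a classifier by precision, recall, and F-measure. The sketch below illustrates both in a minimal form. It is not the paper's implementation: the triple layout `(head, relation, dependent)`, the relation labels, the function names, and the toy data are all assumptions made for illustration, and the F-measure is taken to be the balanced F1 score.

```python
from collections import Counter


def dependency_features(parsed_post):
    """Return the relative frequency of each dependency relation in one post.

    parsed_post is a list of (head, relation, dependent) triples. This
    layout is hypothetical; a real parser's output would need adapting.
    """
    counts = Counter(relation for _head, relation, _dependent in parsed_post)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {relation: n / total for relation, n in counts.items()}


def precision_recall_f(gold, predicted, positive):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data, invented purely for illustration.
post = [
    ("like", "subject", "I"),
    ("like", "object", "film"),
    ("film", "modifier", "this"),
    ("film", "modifier", "new"),
]
features = dependency_features(post)

gold = ["F", "F", "M", "M"]
predicted = ["F", "M", "M", "M"]
scores = precision_recall_f(gold, predicted, positive="F")
```

Relative frequencies are used here so that posts of different lengths stay comparable; raw counts or binary indicators would be equally plausible choices, and the abstract does not say which the study used.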

Cite this article:

Ruihua Qi. Identifying Chinese Microblog Author Gender Based on Dependency. Data Analysis and Knowledge Discovery, 2017, 1(2): 58-63.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.08     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/58
