Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 58-63    DOI: 10.11925/infotech.2096-3467.2017.02.08
Original Article
Identifying Chinese Microblog Author Gender Based on Dependency
Qi Ruihua
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
Abstract  

[Objective] This paper proposes a new method to identify the gender of Chinese microblog authors using dependency features. [Methods] This study collected public posts from Tencent Microblogs and extracted dependency features, which were analyzed and compared with existing lexical, structural, function-word, and part-of-speech tagging features. [Results] A controlled experiment showed that the proposed method obtained the highest values of precision, recall and F-measure. [Limitations] The new method needs to be examined on a larger corpus. [Conclusions] Of the feature sets compared, the proposed dependency features were the most effective for identifying the gender of microblog authors.
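
As a rough illustration of what "dependency features" can look like, the sketch below turns dependency-parsed posts into a vector of relative relation frequencies. The input format (token, head index, relation label), the relation labels, and the function name `dependency_features` are all assumptions made for this example; the abstract does not specify the parser output or the exact feature definition used in the paper.

```python
from collections import Counter

# Hypothetical input: each sentence is a list of (token, head_index, relation) triples.
# The relation labels used here are illustrative placeholders, not the paper's tag set.
def dependency_features(sentences, relation_vocab):
    """Return the relative frequency of each relation label, ordered as in relation_vocab."""
    counts = Counter(rel for sent in sentences for (_tok, _head, rel) in sent)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(relation_vocab)
    return [counts[rel] / total for rel in relation_vocab]

# Two toy parsed posts by one (hypothetical) author.
posts = [
    [("我", 1, "SUB"), ("喜欢", -1, "ROOT"), ("音乐", 1, "OBJ")],
    [("今天", 1, "ADV"), ("开心", -1, "ROOT")],
]
vocab = ["SUB", "OBJ", "ADV", "ROOT"]
features = dependency_features(posts, vocab)
```

A vector like `features` could then be passed to any standard classifier alongside, or instead of, the lexical and part-of-speech features mentioned above.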

Key words: Dependency; Chinese Microblog; Gender Identification
Received: 06 October 2016      Published: 27 March 2017
ZTFLH:  TP182  

Cite this article:

Qi Ruihua. Identifying Chinese Microblog Author Gender Based on Dependency. Data Analysis and Knowledge Discovery, 2017, 1(2): 58-63.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.08     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/58

Algorithm  Metric     Lexical  Structural  Microblog  Function words  POS tagging  Dependency
Lib-SVM    Precision  0.797    0.897       0.918      0.832           0.861        0.998
Lib-SVM    Recall     0.843    0.903       0.921      0.852           0.868        0.998
Lib-SVM    F-Measure  0.787    0.898       0.914      0.802           0.835        0.998
NBC        Precision  0.838    0.799       0.766      0.828           0.798        0.814
NBC        Recall     0.396    0.815       0.806      0.436           0.834        0.691
NBC        F-Measure  0.432    0.806       0.781      0.482           0.807        0.730
IBK        Precision  0.809    0.912       0.909      0.806           0.834        0.999
IBK        Recall     0.811    0.913       0.914      0.812           0.836        0.999
IBK        F-Measure  0.810    0.912       0.909      0.809           0.835        0.999
C4.5       Precision  0.824    0.928       0.918      0.899           0.851        0.997
C4.5       Recall     0.852    0.929       0.921      0.904           0.864        0.997
C4.5       F-Measure  0.818    0.928       0.915      0.893           0.855        0.997
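
Some F-Measure values in the table are not the harmonic mean of the precision and recall listed above them. For NBC with lexical features, the harmonic mean of 0.838 and 0.396 is about 0.538, while the reported F-Measure is 0.432. This pattern would be consistent with per-class scores averaged over the two gender classes, but the page does not state the averaging scheme, so that reading is an assumption. The sketch below shows one such scheme, a support-weighted average computed from a confusion matrix. The function name `weighted_prf` and the counts are made up for illustration.

```python
def weighted_prf(confusion):
    """confusion[actual][predicted] holds counts.

    Returns support-weighted (precision, recall, f_measure) over all classes.
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    p_w = r_w = f_w = 0.0
    for c in labels:
        tp = confusion[c].get(c, 0)                      # correctly predicted as c
        support = sum(confusion[c].values())             # actual members of c
        predicted = sum(confusion[a].get(c, 0) for a in labels)  # predicted as c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        p_w += weight * p
        r_w += weight * r
        f_w += weight * f
    return p_w, r_w, f_w

# Made-up counts for illustration only (not data from the paper).
confusion = {
    "female": {"female": 40, "male": 10},
    "male": {"female": 20, "male": 30},
}
precision, recall, f_measure = weighted_prf(confusion)
```

Under this scheme the averaged F-Measure is a weighted mean of per-class F values, so it need not equal the harmonic mean of the averaged precision and recall.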