Using Non-standard Text Features to Identify Authors
Guo Xu, Qi Ruihua
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
Abstract [Objective] This paper aims to identify authors using features extracted from non-standard online texts. [Methods] First, we defined a non-standard text similarity, M, based on the Jaccard coefficient. Second, we used the frequency of non-standard text in the corpus as a feature. [Results] The recognition accuracies of the two features were 85.1% and 80.2%, respectively. When the two features were added to the traditional recognition mechanism, the precision of the system increased by 5.8% and 4%, respectively. [Limitations] We did not study online texts at the syntactic and structural levels. [Conclusions] The proposed method can effectively extract non-standard text features and thereby improve the accuracy of author identification.
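The two features named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of M or of the frequency feature, so the following rests on assumptions: a token counts as "non-standard" when it is absent from a reference vocabulary, M is taken as the plain Jaccard coefficient between the sets of non-standard tokens in two texts, and the frequency feature is the share of non-standard tokens in a text. The function names, the tokenizer, and the toy vocabulary are hypothetical and are not taken from the paper.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def nonstandard_tokens(text, vocabulary):
    """Return the tokens of `text` that are absent from `vocabulary`.

    Assumption: "non-standard" means "not found in a reference word list".
    """
    return [t for t in tokenize(text) if t not in vocabulary]


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def nonstandard_frequency(text, vocabulary):
    """Share of tokens in `text` that are non-standard (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t not in vocabulary for t in tokens) / len(tokens)


# Toy data, invented for illustration only.
vocabulary = {"see", "you", "later", "that", "was", "great", "i", "will"}
post_a = "c u l8r that was gr8"
post_b = "i will c u l8r"

similarity = jaccard_similarity(
    nonstandard_tokens(post_a, vocabulary),
    nonstandard_tokens(post_b, vocabulary),
)
print(similarity)                                 # 0.75
print(nonstandard_frequency(post_b, vocabulary))  # 0.6
```

In this toy case the two posts share three of four distinct non-standard tokens, which gives a similarity of 0.75. In a full system such values would be combined with other stylistic features before classification; that step is not shown here.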
Received: 12 July 2016
Published: 20 December 2016