New Technology of Library and Information Service  2016, Vol. 32 Issue (11): 27-33    DOI: 10.11925/infotech.1003-3513.2016.11.04
Original Article
Using Non-standard Text Features to Identify Authors
Guo Xu, Qi Ruihua
School of Software, Dalian University of Foreign Languages, Dalian 116044, China
Abstract  

[Objective] This paper aims to identify the authors of online texts using features extracted from non-standard text. [Methods] We propose two features: a non-standard text similarity M, defined with the Jaccard coefficient, and the frequency of non-standard text in the corpus. [Results] The two features achieved recognition accuracies of 85.1% and 80.2%, respectively. When each was added to the traditional recognition mechanism, the precision of the system increased by 5.8% and 4%, respectively. [Limitations] We did not study online texts at the syntactic or structural level. [Conclusions] The proposed method effectively extracts non-standard text features and improves the accuracy of author identification.
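
The two features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definition of M, so the sketch assumes that a token is "non-standard" when it is missing from a reference vocabulary, that M is the Jaccard coefficient of two texts' non-standard token sets, and that the frequency feature is the share of non-standard tokens in a text. The names STANDARD_WORDS, tokenize, nonstandard_tokens, jaccard and nonstandard_frequency are hypothetical, and the tiny vocabulary is a placeholder for a full word list.

```python
import re

# Hypothetical reference vocabulary (assumption); a real system would load
# a full dictionary word list instead of this toy set.
STANDARD_WORDS = {"i", "will", "see", "you", "later", "that", "is", "so",
                  "funny", "tonight"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def nonstandard_tokens(text, vocabulary=STANDARD_WORDS):
    """Return the tokens that do not appear in the reference vocabulary."""
    return [t for t in tokenize(text) if t not in vocabulary]


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def nonstandard_frequency(text, vocabulary=STANDARD_WORDS):
    """Share of a text's tokens that are non-standard (assumed definition)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(nonstandard_tokens(text, vocabulary)) / len(tokens)


text_a = "i will c u l8r lol"
text_b = "lol that is so funny c u tonight"

# Similarity between the non-standard vocabularies of the two texts.
similarity = jaccard(nonstandard_tokens(text_a), nonstandard_tokens(text_b))
print(similarity)                      # 3 shared of 4 distinct tokens
print(nonstandard_frequency(text_b))   # 3 of 8 tokens are non-standard
```

In an author identification setting, values like these would be computed per text or per candidate author and passed to a classifier alongside conventional stylistic features; how the paper combines them is not stated in the abstract.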

Key words: Author identification; Non-standard text; Network text; Text similarity
Received: 12 July 2016      Published: 20 December 2016

Cite this article:

Guo Xu, Qi Ruihua. Using Non-standard Text Features to Identify Authors. New Technology of Library and Information Service, 2016, 32(11): 27-33.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.11.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I11/27

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn