[Objective] This paper aims to identify authors using features extracted from non-standard online texts. [Methods] First, we defined a non-standard text similarity measure, M, based on the Jaccard coefficient. Second, we used the frequency of non-standard text in the corpus as a feature. [Results] The recognition accuracies of the two features were 85.1% and 80.2%, respectively. Adding the two features to the traditional recognition mechanism increased the precision of the system by 5.8% and 4%, respectively. [Limitations] We did not study online texts at the syntactic and structural levels. [Conclusions] The proposed method can effectively extract non-standard text features and thereby improve the accuracy of author identification.
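The two features named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of M or the criterion for non-standard text, so everything below is an assumption made for illustration: non-standard tokens are taken to be tokens missing from a small reference vocabulary, the similarity is the plain Jaccard coefficient between the non-standard token sets of two texts, and the frequency is the share of tokens that are non-standard. The names `jaccard`, `nonstandard_tokens`, `nonstandard_frequency`, and the toy `vocabulary` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets (assumption)
    return len(a & b) / len(a | b)


def nonstandard_tokens(tokens, vocabulary):
    """Tokens absent from the reference vocabulary (hypothetical criterion)."""
    return [t for t in tokens if t.lower() not in vocabulary]


def nonstandard_frequency(tokens, vocabulary):
    """Share of the tokens in one text that are non-standard."""
    if not tokens:
        return 0.0
    return len(nonstandard_tokens(tokens, vocabulary)) / len(tokens)


# Toy vocabulary and texts, invented purely for illustration.
vocabulary = {"i", "will", "see", "you", "later", "thanks", "for", "the", "help"}
text_a = "i will c u l8r".split()
text_b = "thx for the help c u".split()

ns_a = nonstandard_tokens(text_a, vocabulary)        # ['c', 'u', 'l8r']
ns_b = nonstandard_tokens(text_b, vocabulary)        # ['thx', 'c', 'u']
similarity = jaccard(ns_a, ns_b)                     # 2 shared / 4 distinct = 0.5
freq_a = nonstandard_frequency(text_a, vocabulary)   # 3 of 5 tokens = 0.6
```

In a pipeline of this kind, such values would typically be appended to an existing stylistic feature vector before classification. How the paper combines them with the traditional recognition mechanism is not stated in the abstract.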
Guo Xu, Qi Ruihua. Using Non-standard Text Features to Identify Authors[J]. New Technology of Library and Information Service, 2016, 32(11): 27-33.