Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (4): 56-62    DOI: 10.11925/infotech.2096-3467.2019.0343
Identifying Authorship with Novelty Detection Method
Guo Xu, Qi Ruihua
Research Center for Language Intelligence of Dalian University of Foreign Languages, Dalian 116044, China
Abstract  

[Objective] This paper proposes a novelty detection method for authorship identification. [Methods] We built an algorithm that combines a one-class SVM or a multivariate Gaussian model with a multi-level stylistic feature model, and proposed a threshold selection method based on a tolerance t. [Results] When the number of characters in a sample was greater than 500, accuracy, recall and F1 all exceeded 0.9. Once the number of characters reached 2,000, accuracy, recall and F1 were 0.978, 0.984 and 0.979, respectively. [Limitations] The model's performance on short texts needs to be improved. [Conclusions] The proposed method can effectively address novelty detection in authorship identification for long texts.

Key words: Authorship Identification; Novelty Detection; Anomaly Detection
Received: 01 April 2019      Published: 01 June 2020
ZTFLH:  TP391  
Corresponding Authors: Guo Xu     E-mail: guoxu@dlufl.edu.cn

Cite this article:

Guo Xu, Qi Ruihua. Identifying Authorship with Novelty Detection Method. Data Analysis and Knowledge Discovery, 2020, 4(4): 56-62.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0343     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I4/56

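| Algorithm | Advantages | Disadvantages | Suitable setting |
|---|---|---|---|
| Probability distribution | Simple to implement; complete theoretical foundation. | Demanding requirements on feature dimensionality and sample size. | Moderate feature dimensionality and sample size. |
| Isolation Forest | Linear time complexity; can be computed in parallel. | No clear advantage when the sample size is small. | Anomaly detection in big-data settings. |
| One-Class SVM | Performs especially well with high-dimensional features and small samples. | Cannot be computed in parallel; computationally expensive with many samples. | High-dimensional features with small to medium sample sizes. |

Commonly Used Novelty Detection Algorithms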
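
The probability-distribution approach in the table above can be illustrated with a minimal sketch. It fits an independent Gaussian to each feature of one known author's samples and rejects a new sample whose log-density falls below a threshold. The page does not give the paper's threshold rule, so `pick_threshold` is an assumption: it treats the tolerance t as the fraction of the known author's training samples that may be rejected. All function names and the toy feature values are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_independent_gaussian(samples):
    """Estimate a per-feature mean and standard deviation from one author's samples."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    # A small floor keeps constant features from producing a zero variance.
    stds = [max(pstdev(col), 1e-6) for col in columns]
    return means, stds


def log_density(x, means, stds):
    """Log-density of x under independent per-feature Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (v - m) ** 2 / (2 * s * s)
        for v, m, s in zip(x, means, stds)
    )


def pick_threshold(train_scores, t):
    """Assumed rule (not from the paper): reject about the lowest fraction t of training scores."""
    ordered = sorted(train_scores)
    k = min(int(t * len(ordered)), len(ordered) - 1)
    return ordered[k]


def is_known_author(x, means, stds, threshold):
    """Accept x as the known author when its log-density reaches the threshold."""
    return log_density(x, means, stds) >= threshold


# Toy two-feature samples for one author (made-up values for illustration).
train = [[0.10, 5.0], [0.12, 5.2], [0.11, 4.9], [0.09, 5.1], [0.10, 5.0]]
means, stds = fit_independent_gaussian(train)
scores = [log_density(x, means, stds) for x in train]
threshold = pick_threshold(scores, t=0.0)  # t = 0 accepts every training sample
```

With this rule, a larger t raises the threshold, so fewer texts by other authors are accepted but more of the known author's own texts are rejected.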
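| Level | Feature category | Features |
|---|---|---|
| Character level | Characters | All characters, Chinese characters, digits, letters, whitespace characters, special symbols, punctuation marks, distinct punctuation marks |
| Character level | Punctuation | Period, comma, exclamation mark, etc. |
| Lexical level | Words | Total word count, maximum word length, minimum sentence length, average word length, word-length variance, number of long words, number of short words, number of four-character words, vocabulary richness |
| Lexical level | Part of speech | Verbs, adverbs, nouns, etc. |
| Lexical level | Function words | 才、不过、原来、没、一、但、便、还、怎么、倒、那、再、却、只、可、多、将、就、已、很、说道、之、儿 |
| Syntactic level | Sentences | Total number of sentences, maximum sentence length, minimum sentence length, average length, sentence-length variance, number of long sentences, number of short sentences |
| Syntactic level | Syntax tree | Adjective phrases, verb phrases, etc. |
| Syntactic level | Dependency relations | Verb modifiers, adjective modifiers, etc. |

Multidimensional Stylistic Features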
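
As a rough illustration of the character-level and function-word features in the table above, the sketch below counts a few of them using only the Python standard library. It is not the paper's feature extractor: the punctuation set and helper names are assumptions, only a subset of the listed function words is used, and function words are counted by plain substring matching instead of word segmentation.

```python
import string

# A subset of the function words listed in the feature table.
FUNCTION_WORDS = ["才", "不过", "原来", "但", "却", "就", "很"]
# Illustrative punctuation set; an assumption, not the paper's list.
PUNCTUATION = set("。，！？；：、" + string.punctuation)


def is_chinese(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def stylistic_counts(text):
    """Return simple character-level and function-word counts for one text."""
    punct = [ch for ch in text if ch in PUNCTUATION]
    counts = {
        "all_chars": len(text),
        "chinese_chars": sum(is_chinese(ch) for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        "letters": sum(ch in string.ascii_letters for ch in text),
        "whitespace": sum(ch.isspace() for ch in text),
        "punctuation": len(punct),
        "distinct_punctuation": len(set(punct)),
    }
    # Substring counts stand in for segmented word counts in this sketch.
    for word in FUNCTION_WORDS:
        counts["fw_" + word] = text.count(word)
    return counts


example = "他原来不去，但我却去了！2020"
features = stylistic_counts(example)
```

In practice these counts would usually be normalized by text length before being passed to a detector, so that samples of 100 and 2,000 characters remain comparable.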
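| Algorithm | Metric | 100 characters | 300 characters | 500 characters | 1,000 characters | 2,000 characters |
|---|---|---|---|---|---|---|
| One-Class SVM | Accuracy (AC) | 0.818 | 0.848 | 0.903 | 0.946 | 0.970 |
| One-Class SVM | F1 | 0.826 | 0.859 | 0.906 | 0.949 | 0.971 |
| One-Class SVM | Recall (RC) | 0.850 | 0.900 | 0.920 | 0.978 | 0.982 |
| Multivariate Gaussian | Accuracy (AC) | 0.798 | 0.874 | 0.908 | 0.939 | 0.978 |
| Multivariate Gaussian | F1 | 0.810 | 0.880 | 0.913 | 0.943 | 0.979 |
| Multivariate Gaussian | Recall (RC) | 0.850 | 0.900 | 0.938 | 0.984 | 0.984 |
| Feature-independent Gaussian | Accuracy (AC) | 0.764 | 0.809 | 0.856 | 0.928 | 0.939 |
| Feature-independent Gaussian | F1 | 0.789 | 0.825 | 0.865 | 0.930 | 0.943 |
| Feature-independent Gaussian | Recall (RC) | 0.858 | 0.904 | 0.904 | 0.948 | 0.948 |
| Feature-independent Gaussian window | Accuracy (AC) | 0.773 | 0.850 | 0.862 | 0.936 | 0.962 |
| Feature-independent Gaussian window | F1 | 0.800 | 0.859 | 0.869 | 0.939 | 0.964 |
| Feature-independent Gaussian window | Recall (RC) | 0.892 | 0.902 | 0.904 | 0.964 | 0.990 |
| Isolation Forest | Accuracy (AC) | 0.722 | 0.792 | 0.844 | 0.906 | 0.912 |
| Isolation Forest | F1 | 0.760 | 0.816 | 0.857 | 0.911 | 0.920 |
| Isolation Forest | Recall (RC) | 0.860 | 0.902 | 0.912 | 0.944 | 0.982 |

The Results of Experiment 1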
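
For readers who want to reproduce the three metrics reported in the table, the sketch below computes accuracy, recall and F1 from binary predictions. It assumes that the positive class is "written by the known author"; the page does not state which class the paper treats as positive, and the function name and toy labels are hypothetical.

```python
def novelty_metrics(y_true, y_pred):
    """Accuracy, recall and F1, with True meaning 'written by the known author'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Toy labels: four texts by the known author, two by other authors.
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, True, False]
metrics = novelty_metrics(y_true, y_pred)
```

Because F1 also depends on precision, a detector can show high recall with a lower F1 when it accepts many texts by other authors, which is the pattern in the 100-character column of the table.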
ROC Graph of Experiment 2
F1 Values of Experiment 3