Identifying Authorship with Novelty Detection Method*

doi:10.11925/infotech.2096-3467.2019.0343

Data Analysis and Knowledge Discovery

2020, Vol. 4

Issue (4): 56-62 https://doi.org/10.11925/infotech.2096-3467.2019.0343

Research Article




Guo Xu, Qi Ruihua

Research Center for Language Intelligence of Dalian University of Foreign Languages, Dalian 116044, China





Abstract:

[Objective] This paper proposes a novelty detection method for authorship identification. [Methods] We combine a one-class SVM or a multivariate Gaussian algorithm with a multi-level stylistic feature model, and propose a threshold selection method based on a tolerance t. [Results] When samples contained more than 500 characters, accuracy, recall, and F1 all exceeded 0.9. With samples of 2,000 characters, accuracy, recall, and F1 reached 0.978, 0.984, and 0.979, respectively. [Limitations] Performance on short texts needs improvement, and the feature model requires further optimization. [Conclusions] The proposed method effectively addresses novelty detection for long texts in authorship identification.
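The abstract does not define the tolerance t or list the stylistic features, so the following is only a minimal sketch of the general idea under stated assumptions. It fits a Gaussian density to feature vectors from one known author, picks a threshold so that a fraction `tolerance` of the training samples falls below it, and flags lower-density texts as novel. The class `DiagonalGaussianNoveltyDetector`, the helper `precision_recall_f1`, the diagonal-covariance simplification, the percentile-style tolerance rule, and the synthetic data in `demo` are all hypothetical stand-ins, not the authors' implementation.

```python
import math
import random
from statistics import fmean, pvariance


class DiagonalGaussianNoveltyDetector:
    """Scores samples by log-density under per-feature Gaussians.

    Assumption: features are treated as independent (diagonal covariance),
    a simplification of the full multivariate Gaussian named in the abstract.
    """

    def fit(self, rows, tolerance=0.05):
        # rows: equal-length feature vectors, all from ONE known author.
        columns = list(zip(*rows))
        self.means = [fmean(col) for col in columns]
        # A small floor keeps the variance positive for constant features.
        self.variances = [max(pvariance(col), 1e-9) for col in columns]
        # Hypothetical tolerance rule: let a fraction `tolerance` of the
        # training samples score below the threshold.
        scores = sorted(self.log_density(row) for row in rows)
        cut = min(int(tolerance * len(scores)), len(scores) - 1)
        self.threshold = scores[cut]
        return self

    def log_density(self, row):
        total = 0.0
        for x, mu, var in zip(row, self.means, self.variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total

    def is_novel(self, row):
        # Novel = less likely than the threshold learned from the known author.
        return self.log_density(row) < self.threshold


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with "novel" (True) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def demo(seed=0):
    # Synthetic 3-feature vectors stand in for real stylistic features.
    rng = random.Random(seed)

    def draw(mean, n):
        return [[rng.gauss(mean, 1.0) for _ in range(3)] for _ in range(n)]

    detector = DiagonalGaussianNoveltyDetector().fit(draw(0.0, 200), tolerance=0.05)
    known, unknown = draw(0.0, 100), draw(6.0, 100)
    y_true = [False] * len(known) + [True] * len(unknown)
    y_pred = [detector.is_novel(row) for row in known + unknown]
    return precision_recall_f1(y_true, y_pred)


if __name__ == "__main__":
    print(demo())
```

Raising `tolerance` in this sketch flags more texts as novel, which tends to raise recall on unknown authors at the cost of more false alarms on the known author. The figures reported in the abstract come from the paper's own features and data and cannot be reproduced with this toy setup.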

Keywords: Authorship Identification, Novelty Detection, Anomaly Detection

Received: 2019-04-01 Published: 2020-06-01

Chinese Library Classification (ZTFLH): TP391

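Funding: *This work is one of the research outcomes of the National Social Science Fund of China project "Opinion Mining of Overseas Readers' Online Reviews of English Translations of Chinese Classics" (15BYY028); the Dalian University of Foreign Languages Research and Innovation Team project "Computational Linguistics and Artificial Intelligence Innovation Team" (2016CXTD06); and the Natural Science Foundation of Liaoning Province project "Application of Neural Network Language Models in Authorship Identification" (2019-ZD-0513).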

Corresponding author: Guo Xu E-mail: guoxu@dlufl.edu.cn

Cite this article:

Guo Xu, Qi Ruihua. Identifying Authorship with Novelty Detection Method[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 56-62.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0343 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I4/56

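Table 1 Common novelty detection algorithms

Table 2 Multi-level stylistic features

Table 3 Comparison of results for Experiment 1

Fig. 1 ROC curves for Experiment 2

Fig. 2 Comparison of F1 values for Experiment 3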

