Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (6): 126-134    DOI: 10.11925/infotech.2096-3467.2021.0098
Identifying Clickbait with BERT-BiGA Model
Yin Pengbo, Pan Weimin, Zhang Haijun, Chen Degang
College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China
Abstract  

[Objective] This paper proposes an algorithm that combines BiGRU and an attention mechanism on top of the Chinese BERT model, aiming to identify clickbait in online news titles. [Methods] First, we pre-trained our model as a text encoder using the Chinese BERT. Then, we extracted text features through the fusion attention mechanism and used BiGRU to model news titles and contents. Finally, we identified clickbait based on the semantic correlation between titles and contents. [Results] This method addressed the issues of complex feature engineering and secondary error amplification in the text similarity calculation. The recognition accuracy was 81%, and a browser plug-in was developed to detect clickbait. [Limitations] The proposed model only examined news titles and contents, and did not include pageviews, likes, and comments in the calculation. [Conclusions] Our new method, whose recall is 4% higher than that of existing methods, could effectively identify clickbait in online news.

Key words: News; Clickbait Detection; Chinese BERT; BiGRU; Attention Mechanism
Received: 29 January 2021      Published: 06 July 2021
ZTFLH:  TP391  
Fund: National Natural Science Foundation of China-Xinjiang Joint Fund (U1703261)
Corresponding Authors: Pan Weimin     E-mail: panweiminss@163.com

Cite this article:

Yin Pengbo, Pan Weimin, Zhang Haijun, Chen Degang. Identifying Clickbait with BERT-BiGA Model. Data Analysis and Knowledge Discovery, 2021, 5(6): 126-134.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0098     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I6/126

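Parameter | Value
News sources | Qutoutiao (趣头条), hao123, Yidian Zixun (一点资讯), 2345 News (2345新闻)
News categories | Politics, economy, culture, sports, entertainment, education, games, relationships
Time range | 2017.01.01-2020.01.01
Crawling criterion | News items with more than 100 views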
Data Sources and Settings
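Field | Description
Source_name | Name of the site the data came from
Title | News title
Content | News content
Abstract | Summary of the news item
Unlikes | Number of people who disliked the news item
Likes | Number of people who liked the news item
views | Number of views of the news item
tags | Tag words for the news item
url | Link to the original article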
The Field Information Contained in News Data
Manual Marking Platform
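News title | News summary | Label
20-year-old man develops uremia; doctor: he often drank "it" | Xiao Wang went to the hospital because of persistent itchy skin and was found to have uremia. The doctor advises seeking medical attention promptly when the body feels unwell. | Clickbait
Six years on, Jia Nailiang climbs Xiangshan again and accidentally reveals his phone wallpaper: not Li Xiaolu, but her! | The article is mainly about Jia Nailiang and his assistant climbing Xiangshan together. When Jia Nailiang held his phone up to the camera to show the time, the wallpaper was clearly not a photo with Li Xiaolu, but the article does not say who it is. | Clickbait
The lead in "Chinese Paladin" (仙剑奇侠传) was originally meant to be him; Hu Ge only got the role because of a scheduling conflict. Netizens: close call | The production team's first choice for the male lead was not Hu Ge. They invited He Jiong to play Li Xiaoyao, but He Jiong was busy with hosting work at the time, so Hu Ge was invited instead. | Clickbait
Son refuses to go shopping with his 53-year-old mother because she looks too young and they are often mistaken for a couple | In Jakarta, Indonesia, there is a mother who is already 53 but whose face shows no trace of age. Whenever she walks down the street with her son, she is mistaken for his girlfriend, so even her own son is unwilling to go shopping with her. | Normal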
Sample Dataset Examples
Structure of BERT-BiGA Model
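Item | Content
Original text | 中文和英文的基本组成单位不同 ("The basic building blocks of Chinese and English are different")
Word segmentation | 中文 和 英文 的 基本 组成单位 不同
Character mask | [mask]文 和 英文 的 基[mask] 组成[mask]位 不同
Whole word mask | [mask][mask] 和 英文 的 [mask][mask] 组成单位 不同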
Comparison of the Word Mask and the Whole Word Mask
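The difference between the two masking schemes in the comparison above can be reproduced with a short sketch. This is only an illustration: the function names `char_mask` and `whole_word_mask` are hypothetical, the masked positions are passed in by hand so the output matches the example, and real pre-training would instead sample positions at random.

```python
# Hypothetical helpers illustrating character-level vs. whole-word masking.
# The positions to mask are supplied explicitly (an assumption made for
# reproducibility); pre-training normally samples them at random.

def char_mask(words, positions, mask="[mask]"):
    """Mask single characters; positions are (word_index, char_index) pairs."""
    chars = [list(w) for w in words]
    for word_idx, char_idx in positions:
        chars[word_idx][char_idx] = mask
    return " ".join("".join(c) for c in chars)


def whole_word_mask(words, word_indices, mask="[mask]"):
    """Mask every character of each selected word."""
    chosen = set(word_indices)
    return " ".join(
        mask * len(w) if i in chosen else w for i, w in enumerate(words)
    )


# Segmented example sentence from the comparison above.
words = ["中文", "和", "英文", "的", "基本", "组成单位", "不同"]

print(char_mask(words, [(0, 0), (4, 1), (5, 2)]))
print(whole_word_mask(words, [0, 4]))
```

With character masking a word can be left half visible (for example 基[mask]), whereas whole-word masking hides all characters of a selected word together.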
Schematic Diagram of BiGRU
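As a rough illustration of what a BiGRU layer computes, the sketch below runs a one-dimensional GRU over a sequence in both directions and pairs the two hidden states at each position. It is not the paper's implementation: the scalar weights in `FWD` and `BWD` are made-up values, the state is a single number rather than a vector, and the update convention h_t = (1 - z_t) * h_{t-1} + z_t * h_candidate is one of the two commonly used forms.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(xs, p):
    """Return the hidden state after each input, starting from h = 0."""
    h, states = 0.0, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(xs, p_fwd, p_bwd):
    """Pair forward and backward hidden states for every position."""
    fwd = run_gru(xs, p_fwd)
    bwd = run_gru(xs[::-1], p_bwd)[::-1]  # read right-to-left, then realign
    return list(zip(fwd, bwd))


# Hypothetical parameters, chosen only so the example runs.
FWD = {"wz": 0.5, "uz": 0.1, "bz": 0.0, "wr": 0.3, "ur": 0.2, "br": 0.0,
       "wh": 0.8, "uh": 0.4, "bh": 0.0}
BWD = {"wz": 0.4, "uz": 0.2, "bz": 0.1, "wr": 0.6, "ur": 0.1, "br": 0.0,
       "wh": 0.7, "uh": 0.3, "bh": 0.0}

for forward_state, backward_state in bigru([0.5, -1.0, 2.0], FWD, BWD):
    print(forward_state, backward_state)
```

The forward state at a position depends only on inputs up to that position, and the backward state only on inputs from that position onward, which is why the pair together covers the context on both sides.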
Method | Accuracy | F1 | Recall
Word2Vec-BiGA | 0.78 | 0.78 | 0.79
BERT-BiGRU | 0.76 | 0.77 | 0.78
BERT-GA | 0.79 | 0.81 | 0.82
EBERT-BiGA | 0.80 | 0.81 | 0.83
BERT-BiGA | 0.81 | 0.82 | 0.85
Ablation Test Results
Method/Model | Accuracy | F1 | Recall
SVM | 0.70 | 0.68 | 0.65
n-grams | 0.73 | 0.72 | 0.70
LSTM | 0.77 | 0.77 | 0.79
BiGRU-Att | 0.80 | 0.79 | 0.81
BERT-BiGA | 0.81 | 0.82 | 0.85
Comparison of Experimental Results of Different Models
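For readers unfamiliar with the three reported metrics, the following minimal sketch shows how accuracy, recall, and F1 are conventionally derived from binary confusion counts. It assumes clickbait is treated as the positive class, which this page does not state, and the counts used are toy values, not figures from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with clickbait assumed to be the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Recall measures the share of actual clickbait titles that the model catches, so a higher recall at similar accuracy means fewer clickbait titles slip through.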
Classification and Discrimination of the Model
Automatic Detection of Plug-In Operation