Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (6): 128-140    DOI: 10.11925/infotech.2096-3467.2021.1116
Identifying Tax Audit Cases with Multi-task Learning
Li Guofeng 1, Li Zuojuan 1, Wang Zheji 1, Wu Meng 2
1 School of Statistics, Shandong University of Finance and Economics, Jinan 250014, China
2 School of Economics, Shandong University of Finance and Economics, Jinan 250014, China
Abstract  

[Objective] This paper integrates tax-related data from multiple sources and uses machine learning methods to identify corporate tax evasion. [Methods] First, we used web scraping, text mining, and other methods to collect corporate financial data, executive information, and media coverage of the companies. Then, we used the random forest method for feature selection and established indicators for selecting audit candidates. Next, we built a discriminative model with multi-task sparse structure learning based on an improved focal loss function. Finally, we trained the model on audits of different tax types to identify the candidate companies. [Results] We evaluated the model on real-world datasets and found that it performed well across tax types. Its mean recall reached 0.8309, which was 0.1351 and 0.1033 higher than logistic regression and traditional multi-task sparse structure learning, respectively. [Limitations] The model still needs to be tested on data from companies that are not publicly listed. [Conclusions] The new model can identify enterprises that evade different types of taxes. This study provides new directions for smart tax audits by the government.
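
As a rough illustration of the loss family mentioned in the abstract, the sketch below implements the standard binary focal loss of Lin et al., FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the improved variant used in the paper, whose form is not given on this page. The function name and the default values gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Standard binary focal loss for one sample (illustrative sketch only).

    p     : predicted probability of the positive class (dishonest taxpayer)
    y     : true label, 1 for positive and 0 for negative
    gamma : focusing parameter; gamma = 0 gives weighted cross-entropy
    alpha : weight on the positive class (assumed default, not from the paper)
    """
    # Clip the probability so that log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 the expression reduces to class-weighted cross-entropy. Larger gamma values put more weight on hard, misclassified samples, which is the usual motivation for using focal loss on imbalanced data such as audit cases.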

Keywords: Multi-source Data Fusion; Smart Tax Audit; Multi-task Sparse Structure Learning; Focal Loss
Received: 30 September 2021      Published: 25 January 2022
ZTFLH:  F812  
Fund: National Social Science Fund of China (19BTJ023)
Corresponding Authors: Li Zuojuan     E-mail: lzj901231@163.com

Cite this article:

Li Guofeng, Li Zuojuan, Wang Zheji, Wu Meng. Identifying Tax Audit Cases with Multi-task Learning. Data Analysis and Knowledge Discovery, 2022, 6(6): 128-140.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1116     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I6/128

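Data source | Indicator | Description
Corporate financial statements | Financial indicators | Profitability: return on assets, gross profit margin on sales, administrative expense ratio, selling expense ratio, etc.
 | | Growth: operating revenue growth rate, total profit growth rate, return on equity, etc.
 | | Operating capacity: inventory turnover, current asset turnover, accounts payable turnover, etc.
 | | Solvency: current ratio, quick ratio, etc.
 | | Cash flow: cash to current liabilities ratio, cash content of net profit, etc.
 | | Stock indicators: price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, etc.
 | Equity indicators | Ownership type: state-owned enterprise, non-state-owned enterprise, etc.
 | | Shareholder size: number of shareholder accounts (log), average shares held per account, etc.
Executive information | Age | Mean age of the executive team members
 | Gender | Proportion of women in the executive team
 | Education | Proportion of executive team members with a bachelor's degree or above
 | Compensation incentive | Mean executive compensation (log)
 | Legal background | Proportion of executives with a law-related major among all executives
 | Financial background | Proportion of executives with a finance-related major among all executives
Media coverage | Media attention | Log of the number of news reports about the company in a given year
 | Negative media sentiment | Text mining gives a sentiment score for each news item; the indicator is the share of the company's news reports in a year whose sentiment score is negative
Tax authorities | Honest taxpayer or not | Dishonest taxpayers are labeled 1; honest taxpayers are labeled 0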
Data Sources and Description of Relevant Indicators
Calculation Process of Negative Media Sentiment Score
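
The two media indicators described in the indicator table can be sketched as follows, assuming per-article sentiment scores are already available for one company-year. The function name and the choice to raise an error for a company-year with no news are assumptions made for illustration; the page does not say how such cases are handled or which sentiment scorer is used.

```python
import math


def media_indicators(sentiment_scores):
    """Compute (media attention, negative media sentiment) for one company-year.

    sentiment_scores: list of sentiment scores, one per news report.
    Hypothetical helper; the sentiment scoring step itself is not shown.
    """
    n = len(sentiment_scores)
    if n == 0:
        # Assumption: the log of a zero count is undefined, so reject empty input.
        raise ValueError("no news reports for this company-year")
    # Media attention: log of the number of news reports in the year.
    attention = math.log(n)
    # Negative sentiment: share of reports whose sentiment score is below zero.
    negative_share = sum(1 for s in sentiment_scores if s < 0) / n
    return attention, negative_share
```

For example, four reports with scores 0.5, -0.2, -0.1, and 0.3 give a media attention of log(4) and a negative sentiment share of 0.5.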
Ranking of Feature Importance Scores for VAT Dataset Based on Random Forest
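Dimension | Indicators
Profitability | Cost of sales ratio, selling expense ratio, administrative expense ratio, total operating revenue per share, net profit, total operating cost / total operating revenue, earnings per share growth rate, return on invested capital, net non-operating income and expenses / total profit
Operating capacity | Inventory turnover, current asset turnover, fixed asset turnover, inventory turnover days, operating cycle, accounts payable turnover days, accounts receivable turnover, accounts receivable turnover days, shareholders' equity turnover
Solvency | Current ratio, asset-liability ratio, shareholders' equity / total liabilities, capital reserve per share
Growth | Return on equity, 3-year compound growth rate of operating revenue
Cash flow | Free cash flow to the firm per share, cash received from sales of goods and services
Stock market | Price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, number of shareholder accounts, average shares held per account
Executive information | Gender, age, legal background, financial background
Media coverage | Negative media sentiment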
Judgment Index System of Case Selection for Tax Inspection by Tax Type
 | Predicted positive | Predicted negative
Actual positive | TP (True Positives) | FN (False Negatives)
Actual negative | FP (False Positives) | TN (True Negatives)
Confusion Matrix
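
The evaluation metrics reported on this page can be derived from the four confusion matrix counts. The sketch below uses the usual textbook definitions; in particular it assumes G-mean is the geometric mean of recall and specificity, which the page itself does not spell out. AUC is omitted because it needs predicted scores rather than counts. The function name and the convention of returning 0.0 when a denominator is zero are illustrative assumptions.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall, F1-score and G-mean from confusion matrix counts."""

    def ratio(num, den):
        # Assumed convention: return 0.0 instead of dividing by zero.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fn + fp + tn)
    recall = ratio(tp, tp + fn)          # true positive rate
    precision = ratio(tp, tp + fp)
    specificity = ratio(tn, tn + fp)     # true negative rate
    f1 = ratio(2 * precision * recall, precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }
```

Recall matters most for audit selection because a false negative is a dishonest taxpayer that escapes inspection, while G-mean and F1-score guard against a model that raises recall simply by flagging every company.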
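Metric | Tax type | Pool-LogReg | STL-LogReg | MSSL | FL-MSSL
Accuracy | VAT | - | 0.8478 | 0.8783 | 0.8261
 | Corporate income tax | - | 0.7521 | 0.8167 | 0.8167
 | Individual income tax | - | 0.7982 | 0.8532 | 0.8349
 | Other taxes | - | 0.8034 | 0.8419 | 0.8063
 | Mean | 0.7950 | 0.8004 | 0.8475 | 0.8210
AUC | VAT | - | 0.8224 | 0.8191 | 0.8373
 | Corporate income tax | - | 0.7857 | 0.8170 | 0.8192
 | Individual income tax | - | 0.7229 | 0.7808 | 0.8166
 | Other taxes | - | 0.7491 | 0.7500 | 0.8148
 | Mean | 0.7351 | 0.7700 | 0.7917 | 0.8220
F1-Score | VAT | - | 0.7154 | 0.7407 | 0.7101
 | Corporate income tax | - | 0.6027 | 0.8070 | 0.8136
 | Individual income tax | - | 0.6071 | 0.7037 | 0.7273
 | Other taxes | - | 0.6457 | 0.6386 | 0.6936
 | Mean | 0.6049 | 0.6427 | 0.7225 | 0.7361
G-mean | VAT | - | 0.7835 | 0.8425 | 0.8373
 | Corporate income tax | - | 0.6643 | 0.8156 | 0.8192
 | Individual income tax | - | 0.7528 | 0.8431 | 0.8166
 | Other taxes | - | 0.8121 | 0.7950 | 0.8148
 | Mean | 0.7351 | 0.7532 | 0.8241 | 0.8220
Recall | VAT | - | 0.6863 | 0.7018 | 0.8596
 | Corporate income tax | - | 0.8312 | 0.8214 | 0.8571
 | Individual income tax | - | 0.6451 | 0.6896 | 0.7742
 | Other taxes | - | 0.6206 | 0.6977 | 0.8328
 | Mean | 0.6196 | 0.6958 | 0.7276 | 0.8309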
The Effectiveness of Logistic Regression and Multi-task Sparse Structure Learning
Ablation Experiments with Different Characteristic Indicators
Recall Rate of Model with Different Sampling Ratios
FL-MSSL Evaluation Indexes under Different Number of Tasks
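No. | Listed company code | Predicted probability of dishonest tax payment | Predicted tax type of dishonest payment | Actual tax type of dishonest payment | Tax credit rating of A
1 | C6034** | 13.75% | 0 | 0 |
2 | C3000** | 38.81% | 0 | 0 |
3 | C0021** | 80.00% | VAT | VAT |
4 | C0007** | 76.96% | Other taxes | Other taxes |
5 | C6039** | 3.22% | 0 | 0 |
6 | C0027** | 68.86% | VAT | VAT |
7 | C6006** | 88.35% | Individual income tax | Individual income tax |
8 | C3007** | 22.12% | 0 | 0 |
9 | C3005** | 64.74% | Corporate income tax | Corporate income tax |
10 | C3005** | 40.93% | 0 | 0 |
11 | C6006** | 65.41% | Other taxes | Other taxes |
12 | C6038** | 4.09% | 0 | 0 |
13 | C0027** | 86.33% | Individual income tax | Individual income tax |
14 | C0027** | 32.37% | 0 | 0 |
15 | C6003** | 43.43% | 0 | 0 |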
Out-of-Sample Firm Predictions Based on the FL-MSSL