Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (6): 128-140     https://doi.org/10.11925/infotech.2096-3467.2021.1116
Research Paper
Identifying Tax Audit Cases with Multi-task Learning*
Li Guofeng1, Li Zuojuan1, Wang Zheji1, Wu Meng2
1 School of Statistics, Shandong University of Finance and Economics, Jinan 250014, China
2 School of Economics, Shandong University of Finance and Economics, Jinan 250014, China
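Abstract (Chinese version)

[Objective] To integrate multi-source tax-related data and use machine learning methods to intelligently identify enterprises that violate tax law on key tax types. [Methods] Web data collection, text mining, and related techniques were used to gather and fuse multi-source tax-related data, including corporate financial indicators, executive information, and media attention. Random forest was used for feature selection to build a discriminant indicator system for tax audit case selection. Using an improved multi-task sparse structure learning method based on the focal loss function, case selection for each tax type was treated as a separate task and the tasks were trained jointly, yielding tax-type-specific discriminant models for audit case selection. [Results] Experiments on real data show that the proposed multi-task model has good generalization performance and practical applicability. Its mean recall reaches 0.8309, an improvement of 0.1351 over logistic regression and 0.1033 over traditional multi-task sparse structure learning. [Limitations] The model needs further validation on datasets beyond listed companies. [Conclusions] The model identifies enterprises that pay taxes dishonestly more precisely and can also identify the specific tax types on which they evade tax, offering a new approach to smart tax auditing by the government.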

Keywords: Multi-source Data Fusion; Smart Tax Audit; Multi-task Sparse Structure Learning; Focal Loss Function
Abstract

[Objective] This paper integrates tax-related data from multiple sources and uses machine learning to identify companies that violate tax law on key tax types. [Methods] First, we used web scraping, text mining, and related techniques to collect and fuse corporate financial indicators, executive information, and media coverage. Second, we applied random forest feature selection and established an indicator system for selecting audit candidates. Third, we built a discriminant model for each tax type with multi-task sparse structure learning based on an improved focal loss function, treating case selection for each tax type as a separate task and training the tasks jointly. [Results] We examined the model on real-world datasets and found that it generalizes well and is practical to apply. Its mean recall reached 0.8309, which was 0.1351 and 0.1033 higher than logistic regression and traditional multi-task sparse structure learning, respectively. [Limitations] The model needs to be examined with datasets beyond listed companies. [Conclusions] The new model identifies dishonest taxpayers more precisely and also indicates which tax types are involved. This study provides new directions for smart tax auditing by the government.

Key words: Multi-source Data Fusion; Smart Tax Audit; Multi-task Sparse Structure Learning; Focal Loss
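The abstract names the focal loss as the basis of the improved multi-task method. As a rough illustration only, the sketch below implements the standard binary focal loss in the form commonly attributed to Lin et al., FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the authors' improved variant, which this page does not spell out. The function names and the defaults alpha = 0.25 and gamma = 2 are illustrative assumptions, not values taken from the paper.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Plain binary cross-entropy for one sample, used here as a baseline."""
    p_pred = min(max(p_pred, eps), 1.0 - eps)  # keep log() finite
    p_t = p_pred if y_true == 1 else 1.0 - p_pred
    return -math.log(p_t)


def binary_focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one sample.

    y_true: 1 for the positive class (here: a dishonest taxpayer), else 0.
    p_pred: predicted probability of the positive class.
    alpha, gamma: illustrative defaults (assumptions, not from the paper).
    """
    p_pred = min(max(p_pred, eps), 1.0 - eps)
    p_t = p_pred if y_true == 1 else 1.0 - p_pred        # probability of the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha      # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare an easy positive (p = 0.9) with a hard positive (p = 0.1).
for p in (0.9, 0.1):
    print(p, cross_entropy(1, p), binary_focal_loss(1, p))
```

With gamma = 0 the modulating factor disappears and the loss reduces to class-weighted cross-entropy. Larger gamma values shift the training signal toward hard, misclassified samples, which is why this loss is often used on imbalanced data.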
Received: 2021-09-30      Published: 2022-01-25
CLC number (ZTFLH):  F812
Funding: *National Social Science Fund of China, General Program (19BTJ023)
Corresponding author: Li Zuojuan, ORCID: 0000-0003-2580-6118     E-mail: lzj901231@163.com
Cite this article:
Li Guofeng, Li Zuojuan, Wang Zheji, Wu Meng. Identifying Tax Audit Cases with Multi-task Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(6): 128-140.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1116      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I6/128
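| Data source | Indicator | Description |
|---|---|---|
| Corporate financial statements | Financial indicators | Profitability: return on assets, gross profit margin, administrative expense ratio, selling expense ratio, etc. |
| | | Growth: operating revenue growth rate, total profit growth rate, return on equity, etc. |
| | | Operating capacity: inventory turnover, current asset turnover, accounts payable turnover, etc. |
| | | Solvency: current ratio, quick ratio, etc. |
| | | Cash flow: cash to current liabilities ratio, cash content of net profit, etc. |
| | | Stock: price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, etc. |
| | Equity indicators | Ownership type: state-owned enterprise, non-state-owned enterprise, etc. |
| | | Shareholder scale: number of shareholder accounts (log), average shares held per account, etc. |
| Executive information | Age | Mean age of the executive team members |
| | Gender | Proportion of women in the executive team |
| | Education | Proportion of executive team members with a bachelor's degree or higher |
| | Compensation incentive | Mean executive compensation (log) |
| | Legal background | Executives with a legal background as a proportion of all executives |
| | Financial background | Executives with a financial background as a proportion of all executives |
| Media attention | Media attention level | Logarithm of the number of news reports about the company in a given year |
| | Negative media sentiment | A sentiment score is obtained for each news item by text mining; the indicator is the share of the company's news reports in the year whose sentiment score is negative |
| Tax authority | Whether the company pays taxes honestly | Dishonest taxpayers are labeled 1; honest taxpayers are labeled 0 |

Table 1  Data sources and indicator descriptions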
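The two media indicators in Table 1 are simple aggregates once each news item has a sentiment score. The sketch below shows one way to compute them. It assumes the natural logarithm (the table does not state the base), treats a year with no reports as undefined, and takes the per-item sentiment scores as given inputs. The function names and sample scores are hypothetical.

```python
import math


def media_attention(n_reports):
    """Log of the yearly news-report count (natural log is an assumption)."""
    if n_reports <= 0:
        return None  # assumption: undefined when there are no reports
    return math.log(n_reports)


def negative_sentiment_ratio(sentiment_scores):
    """Share of a company's yearly news reports with a negative sentiment score."""
    if not sentiment_scores:
        return None  # assumption: undefined when there are no reports
    negatives = sum(1 for score in sentiment_scores if score < 0)
    return negatives / len(sentiment_scores)


# Hypothetical sentiment scores for one company-year (not from the paper).
scores = [0.6, -0.2, -0.8, 0.1, 0.0]
print(media_attention(len(scores)))      # log of 5 reports
print(negative_sentiment_ratio(scores))  # 2 negative out of 5
```

A score of exactly zero counts as non-negative here, following the table's wording that only negative scores enter the numerator.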
Fig.1  Calculation workflow for the negative media sentiment measure
Fig.2  Random-forest feature importance ranking for the value-added tax dataset
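| Dimension | Indicators |
|---|---|
| Profitability | Cost of sales ratio, selling expense ratio, administrative expense ratio, total operating revenue per share, net profit, total operating cost / total operating revenue, earnings per share growth rate, return on invested capital, net non-operating income / total profit |
| Operating capacity | Inventory turnover, current asset turnover, fixed asset turnover, inventory turnover days, operating cycle, accounts payable turnover days, accounts receivable turnover, accounts receivable turnover days, shareholders' equity turnover |
| Solvency | Current ratio, debt-to-asset ratio, shareholders' equity / total liabilities, capital reserve per share |
| Growth | Return on equity, 3-year compound growth rate of operating revenue |
| Cash flow | Free cash flow to the firm per share, cash received from sales of goods and services |
| Stock market | Price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, number of shareholder accounts, average shares held per account |
| Executive information | Gender, age, legal background, financial background |
| Media attention | Negative media sentiment |

Table 2  Indicator system for tax audit case selection by tax type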
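| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP (True Positives) | FN (False Negatives) |
| Actual negative | FP (False Positives) | TN (True Negatives) |

Table 3  Confusion matrix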
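The threshold-based metrics reported on this page (accuracy, recall, F1-Score, G-mean) can all be derived from the four cells of the confusion matrix in Table 3. The sketch below uses the usual textbook definitions. G-mean is taken as the geometric mean of recall and specificity, which is the common convention for imbalanced data but an assumption here, since this page does not define it. The counts in the example are hypothetical.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, F1 and G-mean from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)             # assumed definition
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical counts: 50 dishonest and 150 honest companies.
print(confusion_metrics(tp=40, fn=10, fp=20, tn=130))
```

Recall matters most for audit case selection because a false negative is a dishonest taxpayer that never reaches an auditor, whereas a false positive costs only an unnecessary review.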
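| Metric | Tax type | Pool-LogReg | STL-LogReg | MSSL | FL-MSSL |
|---|---|---|---|---|---|
| Accuracy | Value-added tax | | 0.8478 | 0.8783 | 0.8261 |
| | Corporate income tax | | 0.7521 | 0.8167 | 0.8167 |
| | Individual income tax | | 0.7982 | 0.8532 | 0.8349 |
| | Other taxes | | 0.8034 | 0.8419 | 0.8063 |
| | Mean | 0.7950 | 0.8004 | 0.8475 | 0.8210 |
| AUC | Value-added tax | | 0.8224 | 0.8191 | 0.8373 |
| | Corporate income tax | | 0.7857 | 0.8170 | 0.8192 |
| | Individual income tax | | 0.7229 | 0.7808 | 0.8166 |
| | Other taxes | | 0.7491 | 0.7500 | 0.8148 |
| | Mean | 0.7351 | 0.7700 | 0.7917 | 0.8220 |
| F1-Score | Value-added tax | | 0.7154 | 0.7407 | 0.7101 |
| | Corporate income tax | | 0.6027 | 0.8070 | 0.8136 |
| | Individual income tax | | 0.6071 | 0.7037 | 0.7273 |
| | Other taxes | | 0.6457 | 0.6386 | 0.6936 |
| | Mean | 0.6049 | 0.6427 | 0.7225 | 0.7361 |
| G-mean | Value-added tax | | 0.7835 | 0.8425 | 0.8373 |
| | Corporate income tax | | 0.6643 | 0.8156 | 0.8192 |
| | Individual income tax | | 0.7528 | 0.8431 | 0.8166 |
| | Other taxes | | 0.8121 | 0.7950 | 0.8148 |
| | Mean | 0.7351 | 0.7532 | 0.8241 | 0.8220 |
| Recall | Value-added tax | | 0.6863 | 0.7018 | 0.8596 |
| | Corporate income tax | | 0.8312 | 0.8214 | 0.8571 |
| | Individual income tax | | 0.6451 | 0.6896 | 0.7742 |
| | Other taxes | | 0.6206 | 0.6977 | 0.8328 |
| | Mean | 0.6196 | 0.6958 | 0.7276 | 0.8309 |

Table 4  Comparison of logistic regression and multi-task sparse structure learning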
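The recall gains quoted in the abstract can be checked against the mean recall row of Table 4. The short sketch below recomputes them. The dictionary keys are the model labels from the table. Matching the abstract's "logistic regression" baseline to STL-LogReg is an inference from the arithmetic, not something the page states.

```python
# Mean recall per model, copied from the "Mean" row of the recall block in Table 4.
mean_recall = {
    "Pool-LogReg": 0.6196,
    "STL-LogReg": 0.6958,
    "MSSL": 0.7276,
    "FL-MSSL": 0.8309,
}

# Per-tax-type recall of FL-MSSL, copied from the same block of Table 4.
fl_mssl_recall_by_tax = [0.8596, 0.8571, 0.7742, 0.8328]

# Gain of FL-MSSL over each baseline, rounded to the table's four decimals.
gains = {
    name: round(mean_recall["FL-MSSL"] - value, 4)
    for name, value in mean_recall.items()
    if name != "FL-MSSL"
}

# The reported mean should equal the average of the four per-tax values.
recomputed_mean = round(sum(fl_mssl_recall_by_tax) / len(fl_mssl_recall_by_tax), 4)

print(gains)
print(recomputed_mean)
```

The gains over STL-LogReg and MSSL reproduce the 0.1351 and 0.1033 figures in the abstract, and the per-tax values average to the reported 0.8309.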
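Fig.3  Ablation experiments on different feature indicators
Fig.4  Recall of each model under different sampling ratios
Fig.5  Comparison of FL-MSSL evaluation metrics under different numbers of tasks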
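| No. | Listed company code | Predicted probability of dishonest tax payment | Predicted tax type | Actual tax type | Tax credit rating A |
|---|---|---|---|---|---|
| 1 | C6034** | 13.75% | 0 | 0 | |
| 2 | C3000** | 38.81% | 0 | 0 | |
| 3 | C0021** | 80.00% | Value-added tax | Value-added tax | |
| 4 | C0007** | 76.96% | Other taxes | Other taxes | |
| 5 | C6039** | 3.22% | 0 | 0 | |
| 6 | C0027** | 68.86% | Value-added tax | Value-added tax | |
| 7 | C6006** | 88.35% | Individual income tax | Individual income tax | |
| 8 | C3007** | 22.12% | 0 | 0 | |
| 9 | C3005** | 64.74% | Corporate income tax | Corporate income tax | |
| 10 | C3005** | 40.93% | 0 | 0 | |
| 11 | C6006** | 65.41% | Other taxes | Other taxes | |
| 12 | C6038** | 4.09% | 0 | 0 | |
| 13 | C0027** | 86.33% | Individual income tax | Individual income tax | |
| 14 | C0027** | 32.37% | 0 | 0 | |
| 15 | C6003** | 43.43% | 0 | 0 | |

Table 5  Out-of-sample company predictions based on the FL-MSSL model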