[Objective] This paper builds ensemble models to detect financial fraud among companies listed on the Growth Enterprise Market (GEM). [Methods] We constructed an anomaly detection framework for financial fraud based on data fusion. In the data layer, we fused structured data, text data, and other multi-source heterogeneous data to construct financial and non-financial features. In the information layer, we combined different sampling methods with ensemble classification models. In the knowledge layer, we incorporated current domain information to construct the model evaluation indicators. [Results] With class-imbalance processing, the model's evaluation indicators were better than those obtained without it. The optimized SMOTE+ENN+LightGBM model achieved an Fβ of 0.7738. In addition, detection using multiple types of features outperformed detection using a single class of features. [Limitations] The proposed method mainly identifies companies suspected of financial fraud; it cannot distinguish or determine the specific type of fraud. [Conclusions] Class-imbalance processing improves the model's ability to find abnormal samples, and the fusion of multi-source heterogeneous data positively affects the identification of financial fraud in listed companies.
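The Fβ score reported in the abstract combines precision and recall on the fraud class, with β setting how heavily recall is weighted. The abstract does not state the β used, so the sketch below is illustrative only: the function name `f_beta` and the default `beta=2.0` are assumptions, not the authors' settings.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute F-beta for the positive (fraud) class from confusion-matrix counts.

    tp, fp, fn: true positives, false positives, false negatives.
    beta: assumed weighting; beta > 1 favors recall, beta < 1 favors precision.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(round(f_beta(tp=8, fp=4, fn=2, beta=1.0), 4))
    print(round(f_beta(tp=8, fp=4, fn=2, beta=2.0), 4))
```

With β greater than 1, a missed fraud case (false negative) lowers the score more than a false alarm does, which is a common choice when overlooking a fraudulent company is the costlier error.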
Li Aihua, Wang Diwen, Xu Weijia, Li Zimo, Yao Sihan. Financial Fraud Detection for Growth Enterprise Market Listed Companies Based on Data Fusion[J]. Data Analysis and Knowledge Discovery, 2023, 7(5): 33-47.