Benchmarking data augmentation techniques for credit scoring data

  • Lotte VAN HERREWEGHE

Student thesis: Master in Business Engineering, Professional focus in Data Science

Abstract

Data augmentation (DA) has proved useful in training machine learning models for images and natural language processing. For tabular data, however, existing data augmentation algorithms are far fewer and less well known. Nevertheless, most available data is tabular, and existing DA techniques have already demonstrated that data augmentation can improve classifier performance here as well. The purpose of this thesis is therefore twofold. First, it creates a taxonomy of both mature and potential techniques, the latter being techniques that are currently used mainly for other types of data but have the potential to achieve good results on tabular data. Second, it tests the performance improvement offered by data augmentation on credit scoring data. To this end, five mature techniques are selected (insertion of noise, data swapping, SMOTE, CTGAN and PrivBayes), one from each category of the established taxonomy. Empirical results show that no algorithm consistently scores best. The classifier with which the DA technique is combined also has a major impact on performance. Moreover, the algorithms vary widely in complexity. The most complex algorithms turn out to require a lot of time, processing power and understanding of the algorithm. Depending on the purpose for which DA is used, the extra time and computing power may be justified. But if a company were to apply DA to its data by default in the future, for example, there are alternatives that require fewer resources. Although no clear winning strategy is found, this thesis provides useful insights into which techniques and models to combine when making predictions on credit scoring data. Furthermore, it establishes a clear taxonomy that can be consulted when an overview of existing DA techniques is needed, and it makes suggestions for research into other techniques.
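As an illustration of the two simplest techniques named in the abstract, the following is a minimal sketch of noise insertion and data swapping on a numeric feature matrix. It is not the thesis's implementation: the function names, the `scale` and `swap_fraction` parameters, the use of Gaussian noise scaled by each column's standard deviation, and the choice to swap only within a class are all assumptions made for this example.

```python
import numpy as np


def add_gaussian_noise(X, scale=0.05, rng=None):
    """Return a copy of X with zero-mean Gaussian noise added to every cell.

    The noise standard deviation is `scale` times each column's own standard
    deviation (an assumption of this sketch), so constant columns stay unchanged.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    sigma = X.std(axis=0)  # per-column standard deviation
    return X + rng.normal(0.0, 1.0, size=X.shape) * sigma * scale


def swap_within_class(X, y, swap_fraction=0.1, rng=None):
    """Return a copy of X where, per column, values are shuffled among a
    random subset of rows that share the same label.

    Swapping inside a class keeps each class's per-column value set intact
    while breaking the link between columns of a single record.
    """
    rng = np.random.default_rng(rng)
    X_new = np.array(X, dtype=float, copy=True)
    y = np.asarray(y)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_swap = int(round(swap_fraction * idx.size))
        if n_swap < 2:
            continue  # need at least two rows to exchange values
        for col in range(X_new.shape[1]):
            chosen = rng.choice(idx, size=n_swap, replace=False)
            # Right-hand side is evaluated first, so this is a true permutation.
            X_new[chosen, col] = X_new[rng.permutation(chosen), col]
    return X_new
```

In a benchmark of the kind described, augmented rows such as `add_gaussian_noise(X_train)` would typically be stacked onto the training set only, with the test set left untouched so that the measured improvement reflects real data.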
Date of Award: 1 Jun 2022
Original language: English
Awarding Institution
  • University of Namur
Supervisor: Isabelle Linden

Keywords

  • Data augmentation
  • tabular data
  • credit scoring
  • benchmarking study
