Data augmentation (DA) has proved useful in training machine learning models for image and
natural language processing tasks. For tabular data, however, data augmentation
algorithms are far fewer and less well known. Nevertheless, most available
data is tabular, and existing DA techniques have already demonstrated that data augmentation can improve the performance of classifiers here as well. The purpose
of this thesis is therefore, on the one hand, to create a taxonomy of both mature and potential
techniques, the latter being techniques that are currently used mainly for other
types of data but have the potential to achieve good results on tabular data. On the
other hand, the performance improvement offered by data augmentation is tested on
credit scoring data. To this end, five mature techniques are selected (noise insertion,
data swapping, SMOTE, CTGAN and PrivBayes), one from each category of the established taxonomy. Empirical results show that no algorithm consistently scores best.
The classifier with which the DA technique is combined also has a major impact on
performance. Moreover, the algorithms vary widely in complexity.
The most complex ones turn out to require considerable time, processing power and
understanding of the algorithm. Depending on the purpose for which DA is used,
the extra time and computing power may be justified. But if, for example, a
company were to apply DA to its data by default in the future, there are alternatives
that require fewer resources. Although no clear winning strategy is found, this thesis
provides useful insights into which techniques and models to combine when making predictions on credit scoring data. Furthermore, a clear taxonomy that can be consulted
for an overview of existing DA techniques has been created, and suggestions for
research into other techniques have been made.
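
As a rough illustration of the two simplest techniques named in the abstract, noise insertion and data swapping, the following is a minimal sketch. It is not the implementation used in the thesis: the function names `add_gaussian_noise` and `swap_within_class`, the parameters `scale` and `swap_fraction`, and the choice to swap values only among rows of the same class are all assumptions made for illustration, and the sketch handles numeric columns only.

```python
import numpy as np


def add_gaussian_noise(X, scale=0.05, rng=None):
    """Return a noisy copy of X (hypothetical helper, numeric columns only).

    Each column receives zero-mean Gaussian noise whose standard deviation
    is `scale` times that column's own standard deviation.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    col_std = X.std(axis=0)
    return X + rng.normal(0.0, 1.0, size=X.shape) * (scale * col_std)


def swap_within_class(X, y, swap_fraction=0.1, rng=None):
    """Return a copy of X with values swapped between rows of the same class.

    For every class and every column, a random subset of rows is chosen and
    their values in that column are permuted among themselves. This is an
    assumed variant of data swapping, chosen so that labels stay valid.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y)
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        n_swap = int(round(swap_fraction * rows.size))
        if n_swap < 2:
            continue  # nothing to swap with fewer than two rows
        for col in range(X.shape[1]):
            chosen = rng.choice(rows, size=n_swap, replace=False)
            X[chosen, col] = X[rng.permutation(chosen), col]
    return X


# Usage: append augmented rows to a (toy) training set.
X_train = np.arange(40, dtype=float).reshape(10, 4)
y_train = np.array([0] * 5 + [1] * 5)
X_aug = np.vstack([
    X_train,
    add_gaussian_noise(X_train, scale=0.05, rng=0),
    swap_within_class(X_train, y_train, swap_fraction=0.6, rng=0),
])
y_aug = np.concatenate([y_train, y_train, y_train])
```

Both helpers keep the class labels unchanged, so the augmented rows can simply be stacked onto the original training data before fitting a classifier. SMOTE, CTGAN and PrivBayes require considerably more machinery and are not sketched here.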
| Date of Award | 1 Jun 2022 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Isabelle Linden (Supervisor) |
- Data augmentation
- tabular data
- credit scoring
- benchmarking study
Benchmarking data augmentation techniques for credit scoring data
VAN HERREWEGHE, L. (Author). 1 Jun 2022
Student thesis: Master types › Master in Business Engineering Professional focus in Data Science