Detection of fake web-shops in the .be zone using active learning

Student thesis: Master's thesis, Master in Cybersecurity (specialised focus)


E-commerce activity in Belgium is growing, with consumers spending more money almost every year. As online commerce has grown in popularity, some malicious actors have seized the opportunity to profit by selling counterfeit goods or by not delivering goods at all. DNS Belgium actively takes countermeasures against such actors operating in the .be zone, as they threaten the trust that Internet users can place in the .be TLD. With more than 1.7 million domains in the zone, manually inspecting every domain is not feasible. Over the last few years, DNS Belgium has therefore collected a set of fake web-shops and legitimate domains from the data available to it, in order to train a classifier that helps detect fake web-shops.

This thesis starts by presenting the existing solution used by DNS Belgium: the features, the classifier and its performance. Based on the literature, we suggest new features to improve the classifier. Next, we attempt to address one of the issues of the existing classifier: it requires a significant amount of labeled data, and labeling is a time-consuming process for the annotators.

We present an active learning architecture and consider multiple query strategy algorithms that the learner may use to identify the instances for which it should request the label. In order to evaluate this architecture, we conduct two main experiments: (i) we apply active learning on the labeled dataset, providing labels only when the learner requests them and (ii) we apply active learning on the .be DNS zone to evaluate how the approach performs and how many fake web-shops we are able to uncover.
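To make the notion of a query strategy concrete, the following is a minimal sketch of three common uncertainty-based strategies (least confidence, margin, and entropy). It is an illustration only: the abstract does not say which strategies the thesis evaluates, and the function names, the example domains, and the probability values are all hypothetical.

```python
import math

# Each strategy maps a vector of predicted class probabilities to an
# uncertainty score; a higher score means the learner is less sure.

def least_confidence(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)

def margin(probs):
    """Negated gap between the two most likely classes."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)

def entropy(probs):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, strategy, batch_size):
    """Return the batch_size pool items the strategy scores as most uncertain."""
    ranked = sorted(pool_probs,
                    key=lambda name: strategy(pool_probs[name]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical predictions [P(legitimate), P(fake)] for unlabeled domains.
pool_probs = {
    "shop-a.example": [0.98, 0.02],
    "shop-b.example": [0.55, 0.45],
    "shop-c.example": [0.70, 0.30],
}

# The domain whose prediction is closest to 50/50 is sent to the annotator.
to_label = select_queries(pool_probs, least_confidence, batch_size=1)
```

For a binary problem such as fake versus legitimate, these three strategies rank instances identically; they only start to differ when there are more than two classes.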

Leveraging active learning in this setting limits the number of labeled instances required to start training a classifier. Moreover, it enables the classifier to ask for new labels, limiting the labeling cost to only those instances that are actually relevant to the model. We apply active learning to the entire .be zone and identify 152 domains that are marked as fake web-shops, while using significantly less labeled data than the existing classifier used by DNS Belgium.
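The idea of a learner that starts from a few labels and requests more only where needed can be sketched as a pool-based loop. The toy below is not the thesis's classifier or data: it assumes a single hypothetical numeric feature per domain, a nearest-centroid model, a seed set containing at least one example of each class, and an `oracle` function standing in for the human annotator.

```python
import math

def fit_centroids(labeled):
    """Mean feature value per class; assumes both classes 0 and 1 are present."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in labeled:
        sums[y] += x
        counts[y] += 1
    return {c: sums[c] / counts[c] for c in sums}

def prob_fake(centroids, x):
    """Toy probability of class 1: higher when x is nearer the class-1 centroid."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return 1.0 / (1.0 + math.exp(-(d0 - d1)))

def active_learning(pool, oracle, seed, budget):
    """Query up to `budget` labels, always picking the most uncertain instance."""
    labeled, unlabeled = list(seed), list(pool)
    for _ in range(budget):
        if not unlabeled:
            break
        centroids = fit_centroids(labeled)
        # Most uncertain = predicted probability closest to 0.5.
        query = min(unlabeled,
                    key=lambda x: abs(prob_fake(centroids, x) - 0.5))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # label revealed only on request
    return fit_centroids(labeled), labeled

# Hypothetical setup: feature values above 0.5 are "fake" (class 1).
oracle = lambda x: 1 if x > 0.5 else 0
seed = [(0.1, 0), (0.9, 1)]
pool = [0.05, 0.2, 0.48, 0.7, 0.95]

centroids, labeled = active_learning(pool, oracle, seed, budget=2)
queried = labeled[len(seed):]
```

With this setup the learner spends its two queries on the instances nearest the decision boundary and never asks about the clear-cut ones, which is the labeling-cost saving the paragraph above describes.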
Date of award: 2022
Original language: English
Awarding institution
  • Universite de Namur
Supervisors: Jean-Noel Colin (Supervisor) & Maarten Bosteels (Co-supervisor)
