Knowledge acquisition for machine learning and analysis of software quality in GitHub

  • David Fernández-Grande

Student thesis: Master types / Master in Computer science


Software measurement aims to provide a reliable and repeatable method to assess software quality. However, precisely quantifying the connection between software metrics and higher-level quality attributes (i.e., defining meaningful thresholds) has been a continuing challenge. Machine learning can be exploited to improve our understanding of software metrics and their relationship with software quality. This requires the analysis of large amounts of data. Fortunately, online social coding platforms such as GitHub make large amounts of software-related data (both source code and metadata) publicly available. This study uses machine learning on GitHub repositories to assess their quality. Specifically, this work (i) makes publicly available a dataset with the metadata of 71,942 GitHub repositories, then uses it to (ii) describe the characteristics of the use of GitHub and (iii) define criteria to select projects relevant to software quality analysis. Additionally, the study (iv) builds an extended version of the dataset with software metrics of 3,074 GitHub repositories exploitable by standard machine learning techniques, (v) examines the style of code on this platform and (vi) creates a machine learning model to analyse the quality of these repositories.
Date of Award: 30 Aug 2018
Original language: English
Awarding Institution
  • University of Namur
Supervisor: Benoit Frenay (Supervisor) & Benoit Vanderose (Co-Supervisor)


  • Machine Learning
  • Software Measurement
  • Software Metrics
  • Software Quality
  • Software Repositories Mining
  • GitHub
