GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research

Research output: Contribution to a book/catalog/report/conference proceedings › Article in conference proceedings


Conducting socio-technical software engineering research on closed-source software is difficult, as most organizations do not want to give access to their code repositories. Most experiments and publications therefore focus on open-source projects, which provide only a partial view of software development communities. Yet closing the gap between the open-source and closed-source software industries is essential to increase the validity and applicability of results stemming from socio-technical software engineering research. We contribute to this effort by sharing our work in a large company with 4,800 employees. We mined 101 repositories and produced the GDED dataset, which contains socio-technical information about 106,216 commits, 470,940 file modifications and 3,471,556 method modifications made by 164 developers over the last 13 years, using various programming languages. To do so, we used GitDelver, an open-source tool we developed on top of PyDriller, and anonymized and scrambled the data to comply with legal and corporate requirements. Our dataset can be used for various purposes and provides information about code complexity, self-admitted technical debt, and bug fixes, as well as temporal information. We also share our experience with processing sensitive data to help other organizations make datasets publicly available to the research community.
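The abstract does not describe how the data was anonymized and scrambled. As a rough illustration only, one common approach is to replace developer identities with keyed hashes so that the same person always maps to the same token without the token being reversible by outsiders. The sketch below assumes that approach; the names `SECRET_KEY`, `pseudonymize`, `anonymize_commit` and the record fields are hypothetical and are not taken from GitDelver or GDED.

```python
import hashlib
import hmac

# Hypothetical secret that would stay inside the organization and
# never be published alongside the dataset.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize(identity: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Map a developer identity to a stable token using HMAC-SHA256."""
    # Normalize so that trivial variants of one identity share a token.
    normalized = identity.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return "dev_" + digest[:length]


def anonymize_commit(record: dict) -> dict:
    """Return a copy of a commit record with identifying fields replaced."""
    cleaned = dict(record)
    cleaned["author"] = pseudonymize(record["author"])
    # Assumption: free-text messages may leak names, so only their length is kept.
    cleaned["message_length"] = len(record.get("message", ""))
    cleaned.pop("message", None)
    return cleaned
```

Using a keyed hash rather than a plain hash matters here: with only 164 developers, an unkeyed hash of a known e-mail address could be recomputed and matched by anyone.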
Original language: English
Title of host publication: 19th International Conference on Mining Software Repositories (MSR '22), May 23-24, 2022, Pittsburgh, PA, USA
Publisher: ACM Press
Number of pages: 5
ISBN (Electronic): 9781450393034
Publication status: Published - May 2022

Publication series

Name: Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
