A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions

Thomas Vanhaeren; Federico Divina; Miguel García-Torres; Francisco Gómez-Vela; Wim Vanhoof; Pedro Manuel Martínez García

doi:10.3390/genes11090985

Faculty of Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data on chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.

Original language	English
Article number	985
Pages (from-to)	1-17
Number of pages	17
Journal	Genes
Volume	11
Issue number	9
Early online date	24 Aug 2020
DOIs	https://doi.org/10.3390/genes11090985
Publication status	Published - Sept 2020

Access to document

10.3390/genes11090985

genes-11-00985-v2: Final published version, 1.47 MB, License: CC BY

Cite this

@article{cbd431a9694b4c99b2b8992142523869,

title = "A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions",

abstract = "The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data on chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95\%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.",

keywords = "Chromatin interactions, Genome architecture, Genomics, Machine-learning, Prediction",

author = "Thomas Vanhaeren and Federico Divina and Miguel Garc{\'i}a-Torres and Francisco G{\'o}mez-Vela and Wim Vanhoof and {Mart{\'i}nez Garc{\'i}a}, Pedro Manuel",

note = "Funding Information: This research was funded by grant TIN2015-64776-C3-2-R from the Spanish Government and the European Regional Development Fund. Acknowledgments: We want to thank the Centro Inform{\'a}tico Cient{\'i}fico de Andaluc{\'i}a (CICA) for providing the high performance computing cluster, in which we performed all the analyses using custom scripts. Publisher Copyright: {\textcopyright} 2020 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2020",

month = sep,

doi = "10.3390/genes11090985",

language = "English",

volume = "11",

pages = "1--17",

journal = "Genes",

issn = "2073-4425",

publisher = "MDPI AG",

number = "9",

}

TY - JOUR

T1 - A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions

AU - Vanhaeren, Thomas

AU - Divina, Federico

AU - García-Torres, Miguel

AU - Gómez-Vela, Francisco

AU - Vanhoof, Wim

AU - Martínez García, Pedro Manuel

N1 - Funding Information: This research was funded by grant TIN2015-64776-C3-2-R from the Spanish Government and the European Regional Development Fund. Acknowledgments: We want to thank the Centro Informático Científico de Andalucía (CICA) for providing the high performance computing cluster, in which we performed all the analyses using custom scripts. Publisher Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland.

PY - 2020/9

Y1 - 2020/9

AB - The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data on chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.

KW - Chromatin interactions

KW - Genome architecture

KW - Genomics

KW - Machine-learning

KW - Prediction

UR - http://www.scopus.com/inward/record.url?scp=85089972014&partnerID=8YFLogxK

U2 - 10.3390/genes11090985

DO - 10.3390/genes11090985

M3 - Article

C2 - 32847102

AN - SCOPUS:85089972014

SN - 2073-4425

VL - 11

SP - 1

EP - 17

JO - Genes

JF - Genes

IS - 9

M1 - 985

ER -
