A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions

Thomas Vanhaeren; Federico Divina; Miguel García-Torres; Francisco Gómez-Vela; Wim Vanhoof; Pedro Manuel Martínez García

doi:10.3390/genes11090985

Faculty of Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, these approaches vary both in the way they model chromatin interactions and in the machine learning strategy they rely on, making it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors carry most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as an important determinant of chromatin wiring mediated by cohesin.

Original language	English
Article number	985
Pages (from-to)	1-17
Number of pages	17
Journal	Genes
Volume	11
Issue number	9
Early online date	24 Aug 2020
DOIs	https://doi.org/10.3390/genes11090985
Publication status	Published - Sept 2020

Keywords

Chromatin interactions
Genome architecture
Genomics
Machine-learning
Prediction

Access to Document

10.3390/genes11090985

genes-11-00985-v2 (Final published version, 1.47 MB, Licence: CC BY)

Cite this

@article{cbd431a9694b4c99b2b8992142523869,

title = "A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions",

abstract = "The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data of chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. 
Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best suited algorithm for this task and highlights cell-type specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.",

keywords = "Chromatin interactions, Genome architecture, Genomics, Machine-learning, Prediction",

author = "Thomas Vanhaeren and Federico Divina and Miguel Garc{\'i}a-Torres and Francisco G{\'o}mez-Vela and Wim Vanhoof and {Mart{\'i}nez Garc{\'i}a}, {Pedro Manuel}",

note = "Funding Information: This research was funded by grant TIN2015-64776-C3-2-R from the Spanish Government and the European Regional Development Fund. Acknowledgments: We want to thank the Centro Inform{\'a}tico Cient{\'i}fico de Andaluc{\'i}a (CICA) for providing the high performance computing cluster, in which we performed all the analyses using custom scripts. Publisher Copyright: {\textcopyright} 2020 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2020",

month = sep,

doi = "10.3390/genes11090985",

language = "English",

volume = "11",

pages = "1--17",

journal = "Genes",

issn = "2073-4425",

publisher = "MDPI AG",

number = "9",

}

TY - JOUR

T1 - A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions

AU - Vanhaeren, Thomas

AU - Divina, Federico

AU - García-Torres, Miguel

AU - Gómez-Vela, Francisco

AU - Vanhoof, Wim

AU - Martínez García, Pedro Manuel

N1 - Funding Information: This research was funded by grant TIN2015-64776-C3-2-R from the Spanish Government and the European Regional Development Fund. Acknowledgments: We want to thank the Centro Informático Científico de Andalucía (CICA) for providing the high performance computing cluster, in which we performed all the analyses using custom scripts. Publisher Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland.

PY - 2020/9

Y1 - 2020/9

N2 - The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data of chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. 
Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best suited algorithm for this task and highlights cell-type specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.

KW - Chromatin interactions

KW - Genome architecture

KW - Genomics

KW - Machine-learning

KW - Prediction

UR - http://www.scopus.com/inward/record.url?scp=85089972014&partnerID=8YFLogxK

U2 - 10.3390/genes11090985

DO - 10.3390/genes11090985

M3 - Article

C2 - 32847102

AN - SCOPUS:85089972014

SN - 2073-4425

VL - 11

SP - 1

EP - 17

JO - Genes

JF - Genes

IS - 9

M1 - 985

ER -
