Supplementary Materials
Additional file 1. Supplementary methods, tables and figures.

Additional file 2. Excel (.xlsx) document of size 66.2 kB. 12859_2020_3621_MOESM2_ESM.xlsx (65K)
Additional file 3. The full genome browser example figure of the K562 cell line data. PDF of size 199 kB. 12859_2020_3621_MOESM3_ESM.pdf (194K)
Additional file 4. The full genome browser example figure of the GM12878 cell line data. PDF of size 217 kB. 12859_2020_3621_MOESM4_ESM.pdf (212K)

Data Availability Statement
The ENCODE data analysed in this study are described in the Methods section. The ENCODE accession numbers of the datasets and files analysed in this study are included as Supplementary Tables S3, S4 and S5 in Additional file 2. Supplementary Tables S3-S5 also include direct web links for downloading the files. The PREPRINT package and the code for the data preprocessing steps are available on GitHub at https://github.com/MariaOsmala/preprint. The processed data and enhancer predictions are stored as UCSC Genome Browser [44] track hubs; links to the track hubs are provided on GitHub.

Abstract
Background: The binding sites of transcription factors (TFs) and the localisation of histone modifications in the human genome can be quantified by the chromatin immunoprecipitation assay coupled with next-generation sequencing (ChIP-seq). The resulting chromatin feature data have been successfully adopted for genome-wide enhancer identification by many unsupervised and supervised machine learning methods. However, the existing methods predict different numbers and different sets of enhancers for the same cell type and do not utilise the pattern of the ChIP-seq coverage profiles efficiently.
Results: In this work, we propose a PRobabilistic Enhancer PRedictIoN Tool (PREPRINT) that assumes characteristic coverage patterns of chromatin features at enhancers and employs a statistical model to account for their variability. PREPRINT defines probabilistic distance measures to quantify the similarity of the genomic query regions and the characteristic coverage patterns. The probabilistic scores of the enhancer and non-enhancer samples are utilised to train a kernel-based classifier. The performance of the method is demonstrated on ENCODE data for two cell lines. The predicted enhancers are computationally validated based on the transcriptional regulatory protein binding sites and compared to the predictions obtained by state-of-the-art methods.

Conclusion: PREPRINT performs favourably compared with the state-of-the-art methods, especially when the methods are required to predict a larger set of enhancers. PREPRINT generalises successfully to data from a cell type not utilised for training, and often PREPRINT performs better than the previous methods. The PREPRINT enhancers are less sensitive to the choice of prediction threshold. PREPRINT identifies biologically validated enhancers not predicted by the competing methods. The enhancers predicted by PREPRINT can aid genome interpretation in functional genomics and clinical studies.

[...] FPR threshold in the K562 cell line, b the number of enhancers predicted by RFECS with the threshold of 0.25 in the K562 cell line, c the minimum number of enhancers predicted by PREPRINT with the 1% FPR threshold in the GM12878 cell line, and d the number of enhancers predicted by RFECS with the threshold of 0.25 in the GM12878 cell line.

Overall, again around half of the enhancers predicted by any of the methods were found by all methods, and this intersection set achieved the highest validation rate (85-95%).
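The kind of overlap analysis summarised here can be illustrated with a minimal sketch. It is not the PREPRINT code: it assumes that the enhancers predicted by each method have already been mapped to shared identifiers (for example, fixed genomic bins), whereas real predictions are genomic intervals that would need a coordinate-based overlap step. The function name `venn_validation_rates`, the method labels and the toy identifiers are hypothetical and chosen only for illustration.

```python
def venn_validation_rates(predictions, validated):
    """Group enhancers by the exact set of methods that predicted them.

    predictions: dict mapping a method name to a set of enhancer identifiers
                 (assumption: identifiers are comparable across methods).
    validated:   set of identifiers regarded as validated enhancers.
    Returns a dict mapping frozenset(method names) -> (count, validation rate).
    """
    all_ids = set().union(*predictions.values())
    regions = {}
    for enhancer in all_ids:
        # The Venn region is defined by which methods predicted this enhancer.
        key = frozenset(m for m, ids in predictions.items() if enhancer in ids)
        regions.setdefault(key, set()).add(enhancer)
    return {
        key: (len(ids), len(ids & validated) / len(ids))
        for key, ids in regions.items()
    }


# Toy data, invented for illustration only (not results from the study).
predictions = {
    "PREPRINT_ML": {"e1", "e2", "e3", "e4"},
    "PREPRINT_Bayes": {"e1", "e2", "e3", "e5"},
    "RFECS": {"e1", "e2", "e6", "e7"},
}
validated = {"e1", "e2", "e3", "e6"}

rates = venn_validation_rates(predictions, validated)
for methods, (count, rate) in sorted(rates.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(methods), count, rate)
```

Each key of the returned dictionary corresponds to one region of the Venn diagram: the key containing all method names is the intersection set, and the single-method keys are the unique predictions of each method.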
Furthermore, substantial numbers of enhancers were predicted by any pair of methods. Notably, the validation rate of intersecting enhancers between PREPRINT and RFECS exceeded the validation rate of intersecting enhancers between the PREPRINT ML and Bayesian approaches. Lastly, substantial numbers of enhancers were also predicted by one method only. RFECS predicted the highest number of unique enhancers, achieving a high validation rate (88-91%) when considering a smaller number of enhancer predictions (Supplementary Figure S8a and c, Additional file 1). However, the validation rates of unique RFECS predictions were low when requiring a larger set of enhancer predictions, especially in the GM12878 cell line (37%). Of the unique enhancers predicted by PREPRINT, the predictions obtained with the Bayesian approach achieved the highest validation rate (70-85%). In conclusion, PREPRINT trained on the K562 data generalised successfully to the GM12878 data, and the Bayesian approach performed sufficiently well. In some comparisons, the Bayesian approach achieved similar or even superior performance to RFECS. When the prediction threshold was relaxed, RFECS started to predict more enhancers not covered by the other methods, and these enhancers had a low validation rate. In contrast, PREPRINT predicted a low number of unique enhancers even when the prediction threshold was relaxed; this may be a desirable property of PREPRINT. However, it was challenging to compare the.