Compare with chromVAR: Part I

Since we completely re-implemented the algorithm proposed in chromVAR in Python, we may want to confirm that these two packages can produce the similar, if not exactly same, results. So here we test them on a small example data from chromVAR.

Run chromVAR on example data

We first load the packages that we need

[1]:
suppressMessages(library(chromVAR))
suppressMessages(library(Repitools))
suppressMessages(library(SummarizedExperiment))
suppressMessages(library(motifmatchr))
suppressMessages(library(Matrix))
suppressMessages(library(BiocParallel))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))
suppressMessages(library(JASPAR2020))
suppressMessages(library(TFBSTools))
set.seed(2017)

Let’s get the example count matrix

[2]:
data(example_counts, package = "chromVAR")

We can check how the peaks look like:

[3]:
df_peaks <- annoGR2DF(example_counts@rowRanges)
df_peaks$peak <- paste0(df_peaks$chr, "-", df_peaks$start, "-", df_peaks$end)
head(df_peaks)
A data.frame: 6 × 8
chrstartendwidthscoreqvalnamepeak
<fct><int><int><int><int><dbl><chr><chr>
1chr171393371443250025925.99629GM_peak_6 chr1-713933-714432
2chr1856395856894500 82 8.21494H1_peak_7 chr1-856395-856894
3chr189578089627950015615.65456GM_peak_17 chr1-895780-896279
4chr1911578912077500 82 8.21494H1_peak_13 chr1-911578-912077
5chr194857694907550014014.03065GM_peak_27 chr1-948576-949075
6chr195454195504050018918.99529GM_peak_29achr1-954541-955040

Next, we pre-process the data by adding GC bias for each peak and identify the backgrounds

[4]:
example_counts <- addGCBias(example_counts,
                            genome = BSgenome.Hsapiens.UCSC.hg19)
bg_peaks <- getBackgroundPeaks(example_counts)

write.csv(bg_peaks, "bg_peaks.csv")

Perform motif matching to get TF binding sites

[5]:
opts <- list()
opts[["collections"]] = c("CORE")
opts[["tax_group"]] = c("vertebrates")
motifs <- getMatrixSet(JASPAR2020, opts)

names(motifs) <- paste(names(motifs), name(motifs), sep = ".")

motif_ix <- matchMotifs(motifs, example_counts,
                         genome = BSgenome.Hsapiens.UCSC.hg19,
                        p.cutoff = 5e-05)

We here save the motif matching results for later use

[6]:
df_motifmatch <- as.data.frame(as.matrix(assays(motif_ix)$motifMatches) * 1)
write.csv(df_motifmatch, "chromvar_motif_match.csv")

We now can estimate the deviations with chromVAR

[7]:
dev <- computeDeviations(object = example_counts,
                         annotations = motif_ix)

dev
class: chromVARDeviations
dim: 746 50
metadata(0):
assays(2): deviations z
rownames(746): MA0004.1.Arnt MA0006.1.Ahr::Arnt ... MA0528.2.ZNF263
  MA0609.2.CREM
rowData names(3): name fractionMatches fractionBackgroundOverlap
colnames(50): singles-GM12878-140905-1 singles-GM12878-140905-2 ...
  singles-H1ESC-140820-24 singles-H1ESC-140820-25
colData names(2): Cell_Type depth

Plot variability

[10]:
variability <- computeVariability(dev)

plotVariability(variability, use_plotly = FALSE)
Warning message:
“Removed 1 rows containing missing values (`geom_point()`).”
../_images/notebooks_run_chromVAR_18_1.png
[13]:
head(variability)
A data.frame: 6 × 6
namevariabilitybootstrap_lower_boundbootstrap_upper_boundp_valuep_value_adj
<chr><dbl><dbl><dbl><dbl><dbl>
MA0004.1.ArntArnt 1.30975871.05743671.5390511.354635e-035.368102e-03
MA0006.1.Ahr::ArntAhr::Arnt 1.64080991.33103701.8973131.558287e-091.055386e-08
MA0019.1.Ddit3::CebpaDdit3::Cebpa0.92173920.75301521.0714867.633490e-018.387992e-01
MA0029.1.MecomMecom 1.11189840.91521951.2791911.241110e-012.357689e-01
MA0030.1.FOXF2FOXF2 1.24760690.98039631.4883057.572098e-032.648457e-02
MA0031.1.FOXD1FOXD1 1.41454921.10594771.6897894.012849e-052.019982e-04

Let’s save the final results so we can compare it with pychromVAR

[8]:
df_z <- as.data.frame(t(round(assays(dev)$z, 2)))
write.csv(df_z, "chromvar_z.csv")

We also save the count matrix so we can use it as input for pychromVAR

[9]:
counts <- assays(example_counts)$counts
rownames(counts) <- df_peaks$peak
counts <- as.data.frame(as.matrix(counts))
write.csv(counts, "counts.csv")