RNA-seq data analysis with DESeq2

Presented By Petr Smirnov1

Heavily Adapted From: Michael I. Love2, Simon Anders3, Wolfgang Huber4 Last modified: Jan 31, 2019.

Overview

Description

In this workshop, we will give a quick overview of the most useful functions in the DESeq2 package, and a basic RNA-seq analysis. We will cover: how to quantify transcript expression from FASTQ files using Kallisto, import quantification from Kallisto with tximport, generate plots for quality control and exploratory data analysis EDA, perform differential expression (DE) (also using apeglm), overlap with other annotation data (using AnnotationHub), and build reports (using ReportingTools and Glimma). The workshop is designed to be a lab with plenty of time for questions throughout the lab.

Pre-requisites

  • Basic knowledge of R syntax

Non-essential background reading:

Participation

Students will participate by following along an Rmarkdown document, and asking questions throughout the workshop.

R / Bioconductor packages used

  • DESeq2
  • tximport
  • apeglm
  • AnnotationHub
  • ReportingTools
  • Glimma
  • vsn

Time outline

Activity Time
Overview of packages 20m
Quantification and import 20m
EDA and DE 20m
Downstream analysis & reports 20m
Additional questions 20m

Workshop goals and objectives

Learning goals

  • Visually assess quality of RNA-seq data
  • Perform basic differential analysis of RNA-seq data

Learning objectives

  • Learn how transcript expression is quantified from FASTQ files
  • Import quantification into R
  • Perform quality control and exploratory data analysis
  • Perform differential expression
  • Build dynamic reports

Installing Packages

install.packages("BiocManager")
install(c("DESeq2", "tximport", "apeglm", "AnnotationHub", "ReportingTools", "Glimma", "vsn"))

Preparing data for DESeq2

Experimental data

The data used in this workflow comes from the airway package that summarizes an RNA-seq experiment wherein airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778.

Modeling count data

As input, the count-based statistical methods, such as DESeq2 (Love, Huber, and Anders 2014), edgeR (Robinson, McCarthy, and Smyth 2009), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and baySeq (Hardcastle and Kelly 2010), expect input data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of counts. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) have been assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g., to binding regions (with ChIP-Seq), species of bacteria (with metagenomic datasets), or peptide sequences (with quantitative mass spectrometry).

The values in the matrix should be counts of sequencing reads/fragments. This is important for the statistical models used by DESeq2 and edgeR to hold, as only counts allow assessing the measurement precision correctly. It is important to not provide counts that were pre-normalized for sequencing depth (also called library size), as the statistical model is most powerful when applied to un-normalized counts and is designed to account for library size differences internally.

Transcript abundances

In this workflow, we will show how to use transcript abundances as quantified by the Kallisto (Bray et al. 2016) software package. Kallisto and other methods, such as Sailfish (Patro, Mount, and Kingsford 2014), Salmon (Patro et al. 2017), or RSEM (Li and Dewey 2011), estimate the relative abundances of all (known, annotated) transcripts without aligning reads. Because estimating the abundance of the transcripts involves an inference step, the counts are estimated. Most methods either use a statistical framework called Estimation-Maximization or Bayesian techniques to estimate the abundances and counts. Following quantification, we will use the tximport (Soneson, Love, and Robinson 2015) package for assembling estimated count and offset matrices for use with Bioconductor differential gene expression packages.

The advantages of using the transcript abundance quantifiers in conjunction with tximport to produce gene-level count matrices and normalizing offsets, are:

  1. this approach corrects for any potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013)
  2. some of these methods are substantially faster and require less memory and less disk usage compared to alignment-based methods
  3. it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence (Robert and Watson 2015).

Note that transcript abundance quantifiers skip the generation of large files which store read alignments (SAM or BAM files), instead producing smaller files which store estimated abundances, counts and effective lengths per transcript. For more details, see the manuscript describing this approach (Soneson, Love, and Robinson 2015) and the tximport package vignette for software details.

kallisto quantification

We begin by providing kallisto with the sequence of all of the reference transcripts, which we will call the reference transcriptome. We recommend to use the GENCODE human transcripts, which can be downloaded from the GENCODE website. Kalliso needs to once create an index of these transcripts, which can be reused as long as you want to quantify using the same transcriptome. On the command line, creating the transcriptome index looks like:

kallisto index transcriptome/gencode.v29.transcripts.fa.gz -i transcriptome/gencodev29.kalliso.v45.idx

The v45 refers to the version of kallisto that was used, and is useful to put into the index name.

To quantify an individual sample, for example SRR1039508, the following command can be used:

kallisto quant -i kallisto-45/transcriptome/gencodev29.kalliso.v45.idx\
  -o output/SRR1039508 \
  -t 3 \
  fastq/SRR1039508_1.fastq\
  fastq/SRR1039508_2.fastq

In simple English, this command says to “quantify a sample using this transcriptome index, use this output directory, using 3 of the computer processors, and here are the first and second read files.” The output directory will be created if it doesn’t exist, though if earlier parts of the path do not exist, it will give an error. A single sample of human RNA-seq usually takes ~5 minutes.

Rather than writing the above command on the command line multiple times for each sample, it is possible to loop over files using a bash loop, which is what we do below:

files=`cat samples.txt`
for file in $files
do
kallisto-45/kallisto quant -i kallisto-45/transcriptome/gencodev29.kalliso.v45.idx -o output/"$file" -t 3  --plaintext fastq/"$file"_1.fastq fastq/"$file"_2.fastq
done

Importing into R with tximport

Specifying file locations

Following quantification, we can use tximport to import the data into R and perform statistical analysis using Bioconductor packages.

The identifiers used here are the SRA identifiers from the Sequence Read Archive. We need to create a named vector pointing to the quantification files. It is recommended to create a sample table mapping each file to the sample and condition of the sequence run. We precreated a table for this subset of the airways data.

mydir <- getwd()
samples <- read.csv(file.path(mydir,"samples.csv"), header=TRUE)
samples
##   SampleName          FileName   dex    cell
## 1 SRR1039508 output/SRR1039508 untrt  N61311
## 2 SRR1039509 output/SRR1039509   trt  N61311
## 3 SRR1039512 output/SRR1039512 untrt N052611
## 4 SRR1039513 output/SRR1039513   trt N052611
## 5 SRR1039516 output/SRR1039516 untrt N080611
## 6 SRR1039517 output/SRR1039517   trt N080611
## 7 SRR1039520 output/SRR1039520 untrt N061011
## 8 SRR1039521 output/SRR1039521   trt N061011
files <- file.path(mydir, "output", samples$SampleName, "abundance.tsv")
names(files) <- samples$SampleName
all(file.exists(files))
## [1] TRUE

Mapping transcripts to genes

Transcripts need to be associated with gene IDs for gene-level summarization. We therefore will construct a data.frame called tx2gene with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files. This can most easily be accomplished by downloading the GTF file at the same time that the transcriptome FASTA is downloaded, and generating tx2gene from the GTF file using Bioconductor’s TxDb infrastructure.

Generating a TxDb from a GTF file can be easily accomplished with the makeTxDbFromGFF function, but this step requires a few minutes of waiting, and a large file, so we precomputed the results:

txdb <- makeTxDbFromGFF("transcriptome/gencode.v29.annotation.gtf.gz")
saveDb(txdb, "transcriptome/gencode.v29.annotation.txdb")

Creating the tx2gene data.frame can be accomplished by calling the select function from the AnnotationDbi package on a TxDb object. The following code could be used to construct such a table:

library(AnnotationDbi) 
## Warning: package 'AnnotationDbi' was built under R version 3.5.1
## Loading required package: stats4
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 3.5.1
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 3.5.1
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.5.1
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.5.1
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
txdb <- loadDb("transcriptome/gencode.v29.annotation.txdb")
## Loading required package: GenomicFeatures
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
k <- keys(txdb, keytype="TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
## 'select()' returned 1:1 mapping between keys and columns

tximport command

Finally the following line of code imports Kallisto transcript quantifications into R, collapsing to the gene level using the information in tx2gene.

library("tximport")
## Warning: package 'tximport' was built under R version 3.5.2
library("jsonlite")
library("readr")
txi <- tximport(files, type="kallisto", tx2gene=tx2gene, ignoreAfterBar = TRUE)
## Note: importing `abundance.h5` is typically faster than `abundance.tsv`
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 
## summarizing abundance
## summarizing counts
## summarizing length

The txi object is simply a list of matrices (and one character vector):

names(txi)
## [1] "abundance"           "counts"              "length"             
## [4] "countsFromAbundance"
txi$counts[1:3,1:3]
##                    SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003.14   718.4204   471.5849   914.2380
## ENSG00000000005.5      0.0000     0.0000     0.0000
## ENSG00000000419.12   465.9998   515.0001   621.9995
txi$length[1:3,1:3]
##                    SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003.14  2084.7608  2051.0837  2078.1869
## ENSG00000000005.5    786.4290   786.4290   786.4290
## ENSG00000000419.12   937.1276   928.1181   929.8103
txi$abundance[1:3,1:3]
##                    SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003.14   26.31824   18.94351   27.77742
## ENSG00000000005.5     0.00000    0.00000    0.00000
## ENSG00000000419.12   37.97706   45.71813   42.23895
txi$countsFromAbundance
## [1] "no"

Now that we have used the tximport package to load our files into R, we can start our analysis. For the purpose of this tutorial we are using the methods implemented in the DESeq2 package, so its time to load it in, and create a DESeqDataSet to use with the functions in the package. Here, we provide the sample table with the group annotations for our cells, and for now we provide a dummy design, assuming every cell is in the same group.

library("DESeq2")
## Warning: package 'DESeq2' was built under R version 3.5.2
## Loading required package: SummarizedExperiment
## Warning: package 'SummarizedExperiment' was built under R version 3.5.1
## Loading required package: DelayedArray
## Warning: package 'DelayedArray' was built under R version 3.5.1
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## Warning: package 'BiocParallel' was built under R version 3.5.2
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~1)
## using counts and average transcript lengths from tximport

Exploratory data analysis

Simple EDA

In what follows, we will be comparing the gene expression of the dex treated vs untreated cells. To ensure we know how to interpret the directions, we want to specify that untrt is the reference level for the dex variable, using R’s relevel command:

dds$dex <- relevel(dds$dex, "untrt")
dds$dex
## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: untrt trt

We can quickly check the millions of fragments that uniquely aligned to the genes (the second argument of round tells how many decimal points to keep).

round( colSums(assay(dds)) / 1e6, 1 )
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 
##       21.4       19.6       26.4       15.9       25.5       32.2 
## SRR1039520 SRR1039521 
##       20.0       22.2

We can inspect the information about the samples, by pulling out the colData slot of the SummarizedExperiment:

colData(dds)
## DataFrame with 8 rows and 4 columns
##            SampleName          FileName      dex     cell
##              <factor>          <factor> <factor> <factor>
## SRR1039508 SRR1039508 output/SRR1039508    untrt   N61311
## SRR1039509 SRR1039509 output/SRR1039509      trt   N61311
## SRR1039512 SRR1039512 output/SRR1039512    untrt  N052611
## SRR1039513 SRR1039513 output/SRR1039513      trt  N052611
## SRR1039516 SRR1039516 output/SRR1039516    untrt  N080611
## SRR1039517 SRR1039517 output/SRR1039517      trt  N080611
## SRR1039520 SRR1039520 output/SRR1039520    untrt  N061011
## SRR1039521 SRR1039521 output/SRR1039521      trt  N061011
table(dds$cell)
## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2
table(dds$dex)
## 
## untrt   trt 
##     4     4

We will perform a minimal filtering to reduce the size of the dataset. We do not need to retain genes if they do not have a count of 5 or more for 4 or more samples as these genes will have no statistical power to detect differences, and no information to compute distances between samples.

keep <- rowSums(counts(dds) >= 5) >= 4
table(keep)
## keep
## FALSE  TRUE 
## 39820 18901
dds <- dds[keep,]

Some very basic exploratory analysis is to examine a boxplot of the counts for each sample. We will take the logarithm so that large counts do not dominate the boxplot:

boxplot(log10(counts(dds)+1))

The main function in DESeq2 involves computation of size factors which normalize for differences in sequencing depth among samples. We can also compute these size factors manually, so that the normalized counts are available for plotting:

dds <- estimateSizeFactors(dds)
## using 'avgTxLength' from assays(dds), correcting for library size
boxplot(log10(counts(dds,normalized=TRUE)+1))

Data transformation for EDA

Taking the logarithm of counts plus a pseudocount of 1 is a common transformation, but it tends to inflate the sampling variance of low counts such that it is even larger than biological variation across groups of samples. In DESeq2 we therefore provide transformations which produce log-scale data such that the systematic trends have been removed. Our recommended transformation is the variance-stabilizing transformation, or VST, and it can be called with the vst function:

vsd <- vst(dds)
class(vsd)
## [1] "DESeqTransform"
## attr(,"package")
## [1] "DESeq2"

This function does not return a DESeqDataSet, because it does not return counts, but instead continuous values (on the log2 scale). We can access the transformed data with assay:

assay(vsd)[1:3,1:3]
##                    SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003.14   9.579838   9.216321   9.672732
## ENSG00000000419.12   9.008684   9.323100   9.166692
## ENSG00000000457.13   8.619155   8.401713   8.483289

Why VST?

The reason we should usually perform a Variance Stabalizing transformation is because count data has the property that the variance of expression depends highly on the average expression. If we then apply a method like PCA, it will be biased by the higher variance in lower expressed genes, and will overestimate the importance in the differences observed in those genes.

We can visualize the effects of a variance stabilizing transform as below:

ntd <- normTransform(dds)
library("vsn")
## Warning: package 'vsn' was built under R version 3.5.1
meanSdPlot(assay(ntd))

If we plot the same relationship after applying the VST:

meanSdPlot(assay(vsd))

DESeq also provides a slighly different method to do this transformation, the regularized log rlog method. Rougly, the benefit of this method is that the values are more comparable to the standard log-transformed expression values. However, as you can see it is more strict with the lowly expressed genes.

rld <- rlog(dds)
meanSdPlot(assay(rld))

Principal components plot

The VST data is appropriate for calculating distances between samples or for performing PCA. More information about PCA and distance calculation can be found in the RNA-seq gene-level workflow. In short, PCA plots allow us to visualize the most dominant axes of variation in our data, which is useful for both quality control, and to get a sense of how large the inter-sample differences are across and within conditions. Here we see that PC1 (the primary axis of variation in the data) separates the treated and untreated samples:

plotPCA(vsd, "dex")

With some additional ggplot2 code, we can also indicate which samples belong to which cell line:

library("ggplot2")
pcaData <- plotPCA(vsd, intgroup = c( "dex", "cell"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size =3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()

Note that we do not recommend working with the transformed data for the primary differential expression analysis. Instead we will use the original counts and a generalized linear model (GLM) which takes into account the expected variance from either low or high counts. For statistical details, please refer to the DESeq2 methods paper (Love, Huber, and Anders 2014).

Choosing the proper design formula

As we saw in the plots above, while the samples are well split by treatment status, they also segregate well by cell line type. In this case, it would be beneficial to explicitly control for cell line when doing our comparison between treated and untreated, to make sure the differentially expressed genes come as a result of the treatment.

We can set the design formula to group our samples by the two variables, cell and dex. We want to control for the cell line, while testing for differences across dexamethasone treatment, so we use a design of ~ cell + dex, with the variable of interest to compare last (the last variable is the default DESeq does differential expression for). Notice that we are creating a new DESeqDataSet here, to erase any of the calculations done above on the improper design formula.

dds <- DESeqDataSet(dds, design = ~ cell + dex)

Differential expression analysis

Standard DE steps

Differential expression analysis in DESeq2 is performed by calling the following two functions:

dds <- DESeq(dds)
## using pre-existing normalization factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)

The results table res contains the results for each gene (in the same order as in the DESeqDataSet). If we want to see the top genes, we can order it like so:

head(res[order(res$pvalue),])
## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 6 columns
##                            baseMean   log2FoldChange             lfcSE
##                           <numeric>        <numeric>         <numeric>
## ENSG00000189221.9  2396.59173997229 3.38261705967537 0.132186289372024
## ENSG00000120129.5  3464.38812141347 2.96716563478624 0.120043498766579
## ENSG00000101347.9  14061.4541232065 3.72675335761991 0.158929849489948
## ENSG00000211445.11 12618.6634965049 3.75975750146415 0.164507540708853
## ENSG00000109906.13 416.145546167368 6.16679831638614 0.272463093463929
## ENSG00000196136.17  2742.0774378471 3.23609637800743 0.144817123351246
##                                stat                pvalue
##                           <numeric>             <numeric>
## ENSG00000189221.9  25.5897724018516 1.98282417784197e-144
## ENSG00000120129.5  24.7174204790198 6.94915595643267e-135
## ENSG00000101347.9  23.4490460387406 1.35186425041163e-121
## ENSG00000211445.11 22.8546210420725  1.3146679384391e-115
## ENSG00000109906.13  22.633517949111 2.02765963829645e-113
## ENSG00000196136.17 22.3460893513155 1.31794039531065e-110
##                                     padj
##                                <numeric>
## ENSG00000189221.9  3.31171294183165e-140
## ENSG00000120129.5  5.80324013921692e-131
## ENSG00000101347.9  7.52627890345833e-118
## ENSG00000211445.11 5.48939597695245e-112
## ENSG00000109906.13 6.77319425576545e-110
## ENSG00000196136.17 3.66870674707975e-107

We can plot the counts for the top gene using plotCounts:

plotCounts(dds, which.min(res$pvalue), "dex")

We can examine all the log2 fold changes (LFC) due to dexamethasone treatment over the mean of counts using plotMA:

plotMA(res, ylim=c(-5,5))

Note that there are many large LFC which are not significant (grey points) on the left side of the MA-plot above. These obtain a large LFC because of the imprecision of log counts. For more informative visualization and more accurate ranking of genes by effect size (the log fold change may sometimes be referred to as an effect size), we recommend to use DESeq2’s functionality for shrinking LFCs. Our most recent methodological development is the apeglm shrinkage estimator, which is available in DESeq2’s lfcShrink function:

library("apeglm")
## Warning: package 'apeglm' was built under R version 3.5.2
resultsNames(dds)
## [1] "Intercept"               "cell_N061011_vs_N052611"
## [3] "cell_N080611_vs_N052611" "cell_N61311_vs_N052611" 
## [5] "dex_trt_vs_untrt"
res2 <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
par(mfrow=c(1,2))
plotMA(res, ylim=c(-3,3), main="No shrinkage")
plotMA(res2, ylim=c(-3,3), main="apeglm")

Minimum effect size

If we don’t want to report as significant genes with small LFC, we can specify a minimum biologically meaningful effect size, by choosing an LFC and testing against this. We can either perform such a threshold test using the unshrunken LFCs or the LFCs provided by lfcShrink using the apeglm method:

res.lfc <- results(dds, lfcThreshold=1)
res.lfc2 <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm",
                      lfcThreshold=1)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
## computing FSOS 'false sign or small' s-values (T=1)

Note that testing against an LFC threshold is not equivalent to testing against a null hypothesis of 0 and then filtering on LFC values. We prefer the former, as discussed in Love, Huber, and Anders (2014) and Zhu, Ibrahim, and Love (2018).

The apeglm method provides s-values (Stephens 2016) when svalue=TRUE or when we supply a minimum effect size as above. These are analogous to q-values or adjusted p-values, in that the genes with s-values less than \(\alpha\) should have an aggregate rate of false sign or being smaller in absolute value than our given LFC threshold, which is bounded by \(\alpha\).

par(mfrow=c(1,2))
plotMA(res.lfc, ylim=c(-5,5), main="No shrinkage, LFC test")
plotMA(res.lfc2, ylim=c(-5,5), main="apeglm, LFC test", alpha=0.01)

AnnotationHub

Querying AnnotationHub

We will use the AnnotationHub package to attach additional information to the results table. AnnotationHub provides an easy-to-use interface to more than 40,000 annotation records. A record may be peaks from a ChIP-seq experiment from ENCODE, the sequence of the human genome, a TxDb containing information about transcripts and genes, or an OrgDb containing general information about biological identifiers for a particular organism.

library("AnnotationHub")
## Warning: package 'AnnotationHub' was built under R version 3.5.2
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
ah <- AnnotationHub()
## snapshotDate(): 2018-10-24

The following code chunk, un-evaluated here, launches a browser for navigating all the records available through AnnotationHub.

display(ah)

We can also query using keywords with the query function:

query(ah, c("OrgDb","Homo sapiens"))
## AnnotationHub with 1 record
## # snapshotDate(): 2018-10-24 
## # names(): AH66156
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Homo sapiens
## # $rdataclass: OrgDb
## # $rdatadateadded: 2018-10-22
## # $title: org.Hs.eg.db.sqlite
## # $description: NCBI gene ID based annotations about Homo sapiens
## # $taxonomyid: 9606
## # $genome: NCBI genomes
## # $sourcetype: NCBI/ensembl
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl....
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation") 
## # retrieve record with 'object[["AH66156"]]'

To pull down a particular record we use double brackets and the name of the record:

hs <- ah[["AH66156"]]
## downloading 0 resources
## loading from cache 
##     '/Users/psmirnov//.AnnotationHub/72902'
hs
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2018-Oct11
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2018-Oct10
## | GOEGSOURCEDATE: 2018-Oct11
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: 
## | GPSOURCEDATE: 2018-Oct2
## | ENSOURCEDATE: 2018-Oct05
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Thu Oct 18 05:22:10 2018
## 
## Please see: help('select') for usage information

Mapping IDs

The rownames of the results table are Ensembl IDs, and most of these are entries in OrgDb (although thousands are not).

columns(hs)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
table(gsub(rownames(res), pat="\\.[0-9]+$", rep="") %in% keys(hs, "ENSEMBL"))
## 
## FALSE  TRUE 
##  3030 15871

We can use the mapIds function to add gene symbols, using ENSEMBL as the keytype, and requesting the column SYMBOL.

res$symbol <- mapIds(hs, gsub(rownames(res), pat="\\.[0-9]+$", rep=""), column="SYMBOL", keytype="ENSEMBL")
## 'select()' returned 1:many mapping between keys and columns
head(res)
## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 7 columns
##                            baseMean     log2FoldChange              lfcSE
##                           <numeric>          <numeric>          <numeric>
## ENSG00000000003.14 747.571101016142 -0.348443133778461  0.111040248749858
## ENSG00000000419.12 522.830471469422  0.216982809251158  0.120048909196314
## ENSG00000000457.13 320.417066719833 0.0330643254442832   0.15455280873228
## ENSG00000000460.16 80.1273497844276 -0.244241350469973  0.295844084025202
## ENSG00000000971.15 5790.21487048519  0.452064961442455 0.0882081578307369
## ENSG00000001036.13 2034.57889760362 -0.220113000969607 0.0909392768402294
##                                  stat               pvalue
##                             <numeric>            <numeric>
## ENSG00000000003.14  -3.13798949211115  0.00170110975207116
## ENSG00000000419.12   1.80745340131603   0.0706916140784454
## ENSG00000000457.13  0.213935454913395    0.830597391218182
## ENSG00000000460.16 -0.825574563286407    0.409045469411911
## ENSG00000000971.15   5.12497905590462 2.97570822679652e-07
## ENSG00000001036.13  -2.42043931530621   0.0155017664600645
##                                    padj      symbol
##                               <numeric> <character>
## ENSG00000000003.14    0.011020921287468      TSPAN6
## ENSG00000000419.12    0.211971718674045        DPM1
## ENSG00000000457.13    0.928626944211367       SCYL3
## ENSG00000000460.16    0.652684468929838    C1orf112
## ENSG00000000971.15 4.88978226304023e-06         CFH
## ENSG00000001036.13   0.0670926414656638       FUCA2

Building reports

ReportingTools

There are many packages for building interactive reports from Bioconductor. Two of these are ReportingTools and Glimma, which both provide HTML reports that allow for collaborators to examine the top genes (or whatever features of interest) from a genomic analysis.

The code for compiling a ReportingTools report is:

library("ReportingTools")
## Warning: package 'ReportingTools' was built under R version 3.5.2
## Loading required package: knitr
## 
## 
tmp <- tempdir() # you would instead use a meaningful path here
rep <- HTMLReport(shortName="airway", title="Airway DGE",
                  basePath=tmp, reportDirectory="report")
publish(res, rep, dds, n=20, make.plots=TRUE, factor=dds$dex)
finish(rep)
## [1] "/var/folders/rf/gtm9grcx1yd8kr41h9r677gm0000gq/T//RtmpvwfirQ/report/airway.html"

This last line, un-evaluated would launch the report in a web browser:

browseURL(file.path(tmp,"report","airway.html"))

Glimma

Another package which can generate interactive reports is Glimma. The glMDPlot constructs an interactive MA-plot where hovering over a gene in the MA-plot on the left side will display the counts for the samples on the right hand side. Clicking will bring up the gene’s information in a tooltip and in a list at the bottom of the screen. Hovering on a sample on the right hand side will give the sample ID in a tooltip.

library("Glimma")
## Warning: package 'Glimma' was built under R version 3.5.2
status <- as.numeric(res$padj < .1)
anno <- data.frame(GeneID=rownames(res), symbol=res$symbol)
glMDPlot(res2, status=status, counts=counts(dds,normalized=TRUE),
         groups=dds$dex, transform=FALSE,
         samples=colnames(dds), anno=anno,
         path=tmp, folder="glimma", launch=FALSE)

This last line would launch the report in a web browser:

browseURL(file.path(tmp,"glimma","MD-Plot.html"))

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal RNA-Seq Quantification.” Nat. Biotechnol.

Hardcastle, Thomas, and Krystyna Kelly. 2010. “baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data.” BMC Bioinformatics 11 (1): 422+. https://doi.org/10.1186/1471-2105-11-422.

Himes, Blanca E., Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M. Whitaker, et al. 2014. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PloS One 9 (6). https://doi.org/10.1371/journal.pone.0099625.

Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2): R29+. https://doi.org/10.1186/gb-2014-15-2-r29.

Leng, N., J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. G. Smits, J. D. Haag, M. N. Gould, R. M. Stewart, and C. Kendziorski. 2013. “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29 (8): 1035–43. https://doi.org/10.1093/bioinformatics/btt087.

Li, Bo, and Colin N. Dewey. 2011. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC Bioinformatics 12: 323+. https://doi.org/10.1186/1471-2105-12-3231.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550+. https://doi.org/10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14: 417–19.

Patro, Rob, Stephen M. Mount, and Carl Kingsford. 2014. “Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.” Nature Biotechnology 32: 462–64. https://doi.org/10.1038/nbt.2862.

Robert, Christelle, and Mick Watson. 2015. “Errors in RNA-Seq quantification affect genes of relevance to human disease.” Genome Biology. https://doi.org/10.1186/s13059-015-0734-x.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2009. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.

Stephens, Matthew. 2016. “False Discovery Rates: A New Deal.” Biostatistics 18 (2). https://doi.org/10.1093/biostatistics/kxw041.

Trapnell, Cole, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. 2013. “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology. https://doi.org/10.1038/nbt.2450.

Wu, Hao, Chi Wang, and Zhijin Wu. 2013. “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data.” Biostatistics 14 (2): 232–43. https://doi.org/10.1093/biostatistics/kxs033.

Zhu, Anqi, Joseph G. Ibrahim, and Michael I. Love. 2018. “Heavy-Tailed Prior Distributions for Sequence Count Data: Removing the Noise and Preserving Large Differences.” bioRxiv. https://doi.org/10.1101/303255.


  1. University of Toronto, CA

  2. UNC-Chapel Hill, NC, US

  3. ZMBH Heidelberg, Germany

  4. EMBL Heidelberg, Germany