MOSClip Pathway Analysis with gene list

Prepare the multi-omic dataset and gene list
- Prepare the input data: the Omics object
MOSClip pathway analysis on gene list
- Graphical exploration of MOSClip pathway results

This tutorial provide an example analysis with MOSClip R package using gene lists. We will use the same data used in the other tutorial but we will import gene list from the Molecular Signatures Database. For seek of simplicity, I quickly reported the step to import and create the data also in this tutorial.

Prepare the multi-omic dataset and gene list

First of all, we load the necessary libraries.

library(org.Hs.eg.db)
library(pheatmap)
library(RColorBrewer)
library(EDASeq)
library(graphite)
library(houseOfClipUtility)
library(MOSClip)
library(msigdbr)

Data retrieving and data analysis will depend on your computational resources but generally need time and disk space. To speed up this tutorial, we pre-processed the data for you. Details on data preparation are available in Material and Method session of the paper.

Thus, download “TCGA-OV-hg38-preprocessed-externalSurv.RData” available at github.com/cavei/example-datasets/.

Move the downloaded file inside a directory called downloadTCGA in your working directory.

Now, we can load the preprocessed data.

load("downloadTCGA/TCGA-OV-hg38-preprocessed-externalSurv.RData")

The TCGA dataset came along with the survival annotations from the table by Liu et al, Cell, 2018. Among the data loaded we can find the object ‘fup’ (short of ‘followup’) that represents our survival annotations. We extract the slot for the ‘progression free survival’ (pfs) and we save it in survAnnotations.

survAnnotations <- fup$pfs

Now, we are about to intersect different datasets and annotations to select those patient samples having all the four omics profiled.

We also filter genes in expression matrix keeping only those genes with at least 100 counts in at least one patients, to avoid data sparsity.

After the selection of patients and genes, we perform expression normalization and log2 of the counts+1 transformation. This will ensure us to work with expression data approximately close to a normal distribution, the most suitable distribution for the subsequent MOSClip tests.

expression <- rnaSeq
mutation <- mutations$data

survAnnot <- na.omit(survAnnotations)
patients <- row.names(survAnnot)
patients <- intersect(patients, colnames(expression))
patients <- intersect(patients, colnames(metClustValues$MET_Cancer_Clustered))
patients <- intersect(patients, colnames(mutation))
patients <- intersect(patients, colnames(cnv))

survAnnot <- survAnnot[patients, , drop = F]

expression <- expression[, patients, drop = F]
keep = apply(expression >= 100, 1, any)
expNorm <- betweenLaneNormalization(expression[keep, , drop = F], which = "upper")
pseudoExpNorm <- log2(expNorm + 1)

methylation <- metClustValues
methylation$MET_Cancer_Clustered <- methylation$MET_Cancer_Clustered[, patients, 
    drop = F]

mutation <- mutation[, patients, drop = F]
cnv <- cnv[, patients, drop = F]

At this point, we can create or import our own list of interesting genes. We will use the Molecular Signature Database gene lists provided by msigdbr R package.

We will create two lists: hallmark of cancer and transcription factor targets. We will select three gene set examples. Nevertheless, a loop can be created to analyze all gene set as described in the general tutorial with a list of Pathways from graphite.

hallmark_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% dplyr::select(gs_name, 
    entrez_gene)

hallmark_t2g$gs_name <- gsub("HALLMARK_", "", hallmark_t2g$gs_name)
hallmark_t2g$entrez_gene <- as.character(hallmark_t2g$entrez_gene)

hallmark_gene_set_list <- tapply(hallmark_t2g$entrez_gene, hallmark_t2g$gs_name, 
    identity, simplify = F)
hallmark_gene_set_list <- lapply(hallmark_gene_set_list, identity)

angiogenesis_gl <- hallmark_gene_set_list$ANGIOGENESIS

tft_t2g <- msigdbr(species = "Homo sapiens", category = "C3")
table(tft_t2g$gs_subcat)

MIR:MIR_Legacy MIR:MIRDB TFT:GTRD TFT:TFT_Legacy 34163 372126 173572 155087

tft_t2g <- msigdbr(species = "Homo sapiens", category = "C3", subcategory = "TFT:TFT_Legacy") %>% 
    dplyr::select(gs_name, entrez_gene)
tft_t2g$entrez_gene <- as.character(tft_t2g$entrez_gene)

tft_gene_set_list <- tapply(tft_t2g$entrez_gene, tft_t2g$gs_name, identity, simplify = F)
tft_gene_set_list <- lapply(tft_gene_set_list, identity)
stat3_idx <- grep("STAT3", names(tft_gene_set_list))

STAT3_gl <- tft_gene_set_list[stat3_idx]

This way we have collected the gene set we are interested in.

Prepare the input data: the Omics object

At this step we are almost ready to run MOSClip analysis, we just need to help MOSClip to organize the data in input. We do that creating the Omics object to wrap together expression, methylation, mutation and CNV matrices.

Given the nature of the methylation data, we expect to have more than a CpG cluster associated to a gene. For this reason, MOSClip provide the possibility to include a dictionary to associate the methylation level of multiple CpG clusters to a single gene. Thus, in the methylation specific arguments you need to provide the dictionary to convert cluster names into genes.

For each matrix we need to indicate also the data reduction strategy we want to apply.

In this tutorial we chose to use PCA for expression data, cluster analysis for methylation data, vote counting for mutations and CNVs (for detail see MOSClip paper). This data transformations are easily applied calling MOSClip functions, thus here we need only to provide the name of the needed function.

multiOmics <- Omics(data = list(expr = pseudoExpNorm, met = methylation$MET_Cancer_Clustered, 
    mut = mutation, cnv = cnv), methods = c("summarizeWithPca", "summarizeInCluster", 
    "summarizeToNumberOfEvents", "summarizeToNumberOfDirectionalEvents"), specificArgs = list(pcaArgs = list(name = "exp", 
    shrink = "FALSE", method = "sparse", maxPCs = 3), clusterArgs = list(name = "met", 
    dict = methylation$eMap, max_cluster_number = 3), countEvent = list(name = "mut", 
    min_prop = 0.05), cnvAgv = list(name = "cnv", min_prop = 0.05)))

As you can see, in this object we specified the four omic data, we chose four methods for data reduction, one for each omic, and four lists of additional parameters (as described in the help of each reduction function).

MOSClip pathway analysis on gene list

In the following commands we perform the analysis of pathways. Gene list can not be used as input for module analysis because no topology imply that no module can be identify.

We test the ‘ANGIOGENESIS’ hallmark.

genesToConsider <- row.names(pseudoExpNorm)

set.seed(1234)
angiogenesis <- multiOmicsSurvivalPathwayTest(multiOmics, angiogenesis_gl, survAnnot, 
    useThisGenes = genesToConsider)
angiogenesis

Pathway overall pvalue: 0.0597438771913962 As you can see this gene set is only marginally significant.

Then we run the first STAT3 target gene list.

STAT3_01 <- multiOmicsSurvivalPathwayTest(multiOmics, STAT3_gl$STAT3_01, survAnnot, 
    useThisGenes = genesToConsider)
STAT3_01

Pathway overall pvalue: 0.692027706640381 We now run also the second STAT3 gene list.

STAT3_02 <- multiOmicsSurvivalPathwayTest(multiOmics, STAT3_gl$STAT3_02, survAnnot, 
    useThisGenes = genesToConsider)
STAT3_02

Pathway overall pvalue: 0.0280748800350366 This STAT3 target gene set is significant.

Graphical exploration of MOSClip pathway results

MOSClip has multiple functions to explore the pathway data. We provide here some examples.

Pathway summary

We can plot a report of the three test on the gene sets sorted by p-value of the Cox proportional hazard model.

plotMultiPathwayReport(list(STAT3_01 = STAT3_01, STAT3_02 = STAT3_02, ANGIOGENESIS = angiogenesis), 
    MOcolors = c(exp = "red", mut = "blue", cnv = "yellow", met = "green"))

The heatmap summarized the \(p-value\) of the Cox proportional hazard model using all the covariates. The first column is the \(p-value\) of the model, the others columns contain the \(p-value\) of the single covariates. Colorscale is used to graphically visulize the \(p-values\). The \(p-value\) heatmap helps to figure out the involvement of different omics.

Pathway Heatmap

We can explore also the most important genes of the pathway and their original pattens using an heatmap plot of the original values. The most important genes are selected as described in Martini et al.

plotPathwayHeat(STAT3_02, sortBy = c("expPC2"), paletteNames = c(exp = "red", met = "green", 
    mut = "blue", cnv = "yellow"), nrowsHeatmaps = 2, withSampleNames = F, fontsize_row = 6)

Pathway Kaplan Meier

Then we can also check differences in survival using Kaplan Meier curves dividing patients in groups with different omics patterns (e.g. patients with different methylation pattern and high and low levels of PC2 in expression).

plotPathwayKM(STAT3_02, formula = "Surv(days, status) ~ expPC2", paletteNames = "Paired")

We can take a look also to the methylation in angiogenesis.

plotPathwayHeat(angiogenesis, sortBy = c("met2k"), paletteNames = c(exp = "red", 
    met = "green", mut = "blue", cnv = "yellow"), nrowsHeatmaps = 2, withSampleNames = F, 
    fontsize_row = 6)

plotPathwayKM(angiogenesis, formula = "Surv(days, status) ~ met2k", paletteNames = "Paired")