How to format TCGA datasets

If you are here probably you have already downloaded a dataset from TCGA.

At the end of the download procedure we saved an RData file containing our favourite dataset.

Now, I am going to load it along with some R packages

library(TCGAbiolinks)
library(SummarizedExperiment)
library(maftools)
library(checkmate)
library(org.Hs.eg.db)
library(EDASeq)
library(impute)
library(doParallel)
library(MethylMix)

To process the dataset and make the analysis procedure easier to follow, I moved some functions in a separate file that we need to source.

Please download ‘functions-to-process-TCGA-data.R’ from ‘romualdi.bio.unipd.it’ and put into the ‘downloadTCGA’. To speed up the process, we used the survival annotation curated by Liu et al. Set external_annotation=FALSE id you need to process followUp infos by yourself.

source("downloadTCGA/functions-to-process-TCGA-data.R")
external_annotation=TRUE
genome = "hg38"
tumor = "OV"
project = paste0("TCGA-", tumor)
load(paste0("downloadTCGA/", project, "-hg38.RData"))

Now, we’re ready to start.

The first dataset we are going to process here is the expression data. We are going to set the gene id (a unique gene id) and store counts accordingly. We chose to work with entrezID thus we are going to translate the gene identifiers and discard all the features without a valid entrezID, creating an entrezID gene list.

geneAnnotation <- rowData(exprData)
ensembl2entrez <- mapIds(org.Hs.eg.db, keys=geneAnnotation$ensembl_gene_id,
                         column="ENTREZID", keytype="ENSEMBL")
geneAnnotation$entrezID <- ensembl2entrez
rowData(exprData) <- geneAnnotation
genes <- geneAnnotation$ensembl_gene_id[!is.na(geneAnnotation$entrezID)]

Then we will make some filters using the patients’ features. We are interested in ‘Primary Solid Tumors’.

colAnnotation <- exprData@colData
primaryTumorSel  <- colAnnotation$definition=="Primary solid Tumor"
primaryTumor <- colnames(exprData)[primaryTumorSel]

We are going to filter the expression data: genes should contains only the valid entrezID while columns should contains only primaryTumor samples.

exprData <- exprData[genes, primaryTumor, drop=F]

We have almost finshed with gene expression, except for some repeated entrezID. We are going to average the profiles of the genes present more than once in the expression matrix.

exp <- exprData@assays[[1]]
row.names(exp) <- row.names(exprData)
avg <- tapply(row.names(exprData), rowData(exprData)$entrezID, function(r){
  row = exp[r, ,drop=F]
  if (nrow(row)>1)
    row=colMeans(row, na.rm = T)
  row
})

expAvg <- do.call(rbind, avg)
row.names(expAvg) <- names(avg)

barcode = exprData@colData$patient
colnames(expAvg) <- barcode

expAvg[1:6,1:6]

Finally we have our expression matrix: entrezID x samples.

Now, we are going to prepare mutation data.

We chose to filter the data according to entrezID genes that have been measured in expression analysis. Moreover, we chose to consider only missense and nonsense mutations that have (in proof of principle) more impact on the biological functions that we want to analyze.

Mutations are filtered also according to the impact of the mutation. Some statistic are summarized.

N.B. In this analysis we chose Mutect2 pipeline. There are more of them available like Muse, varscan2, somaticsniper.

mut.statistics <- tapply(seq_along(mafMutect$Hugo_Symbol), mafMutect$IMPACT, function(idx){
  stat = as.data.frame(table(mafMutect[idx, 9]))
  names(stat)[1] <- "mutation_type"
  stat
})

considerThisImpactType <- c("HIGH", "MODERATE")
ov.mutations <- prepareMutations(mafMutect, impact=considerThisImpactType, 
                                 filterByThisEntrez=row.names(expAvg), patients=barcode)
ov.mutations$data[1:6,1:6]

We now can create the CNV file. AS we download only the primary tumors we do not need any particular transformation. For CNV, our tool need nuymeric values. So we need to transform the values.

ov.cnv <- lapply(cnv, as.numeric)
ov.cnv <- data.frame(ov.cnv, check.names = F)
row.names(ov.cnv) <- row.names(cnv)

for (i in names(ov.cnv)) {
  if (!identical(ov.cnv[[i]], as.numeric(cnv[[i]])))
    warning(paste0("sample number ", i, " is not equal after numeric transformation"))
}

ov.cnv[1:6,1:6]

Finally, we are going to process the methylation data with ‘Methyl Mix R package’.

First we are going to select the samples that are ‘primary tumors’ and ‘normals’ in the TCGA.

sampleSelection <- colData(metData)[[5]] == "Primary solid Tumor"
normalSelection <- colData(metData)[[5]] == "Solid Tissue Normal"
samples <- colnames(metData)[sampleSelection]
normals <- colnames(metData)[normalSelection]
normData <- metData[, normals]
metData <- metData[, samples]

# Remove all rows with low values and impute missing values
thr <- ncol(normData@assays[[1]])*0.4
remove <- apply(is.na(normData@assays[[1]]),1,sum) > thr
normAssay <- normData@assays[[1]][!remove, ]
normAssay <- imputeKNN(normAssay,k=5)$data

thr <- ncol(metData@assays[[1]])*0.4
remove <- apply(is.na(metData@assays[[1]]),1,sum) > thr
metAssay <- metData@assays[[1]][!remove, ]
metAssay <- imputeKNN(metAssay,k=5)$data

cl <- makeCluster(16)
registerDoParallel(cl)
metClustValues <- ClusterProbes(metAssay, normAssay)
stopCluster(cl)

The ‘metClustValues’ object contains two methylation beta-value matrices and one dictionary, which connects CpG clusters to the methylation array probes. Now we need to connect methylation clusters to genes (entrezID).

geneNames <- sapply(strsplit(row.names(metClustValues$MET_Cancer_Clustered), "---"),
                    function(x) x[1])
assList <- tapply(row.names(metClustValues$MET_Cancer_Clustered), geneNames, function(g) g)
symbol2entrez <- mapIds(org.Hs.eg.db, keys=names(assList), column="ENTREZID",
                        keytype="SYMBOL")
names(assList) <- symbol2entrez
metClustValues$eMap <- assList[!is.na(names(assList))]

To conclude, we extract the follow-up values. We are going to calculate both Overall Survival (OS) and Progression Free Survival. The definition of these two measures are made following Liu et al 2018 (https://www.sciencedirect.com/science/article/pii/S0092867418302290).

Usign the function ‘createFollowUp()’ we are going to obtain the follow-up table.

if (external_annotation) {
  load("external-resources/panCancerSurvAnnotation.RData")
  fup <-  panCancerSurvAnnotation[[tumor]]
  fup <- list(os=data.frame(row.names=fup$`TCGA Participant Barcode`,
                            status=fup$OS,
                            days=fup$`OS Time`, stringsAsFactors = F),
              pfs=data.frame(row.names=fup$`TCGA Participant Barcode`,
                             status=fup$PFI,
                             days=fup$`PFI Time`, stringsAsFactors = F))
} else {
fup  <- createFollowUp(followUp, newTumorEvent)  
}

Finally, we need to have the patients list with matched measurements in all the omics of interest.

sharePatients <- intersect(
  intersect(colnames(expAvg), colnames(metClustValues$MET_Cancer_Clustered)),
  intersect(colnames(ov.mutations$data),colnames(ov.cnv)))

rnaSeq <- expAvg[, sharePatients, drop=F]
metClustValues$MET_Cancer_Clustered <-
  metClustValues$MET_Cancer_Clustered[, sharePatients, drop=F]

ov.mutations.bck <- ov.mutations
ov.mutations$data <- ov.mutations$data[, sharePatients, drop=F]

reducedTypes <- ov.mutations$mutationsTypes[
  ov.mutations$mutationsTypes$patient %in% sharePatients, , drop=F]
ov.mutations$mutationsTypes <- droplevels(reducedTypes)

ov.mutations$originalStatistics <-  mut.statistics
mutations = ov.mutations
cnv <- ov.cnv[, sharePatients, drop=F]

if (external_annotation) {
  fileN <- paste0("downloadTCGA/", project, "-", genome, "-preprocessed-externalSurv.RData")
} else {
  fileN <- paste0("downloadTCGA/", project, "-", genome, "-preprocessed.RData")
}

save(rnaSeq, metClustValues, mutations, cnv, fup, file=fileN)