In this brief tutorial, we are going to download a dataset from TCGA. Please note that the datasets are often huge, so the download will take some time.

First, let’s have a look at the TCGA overview at https://cancergenome.nih.gov/abouttcga/overview and pick your favourite dataset (try ‘Launch Data Portal’).

I chose TCGA-OV, i.e. the High Grade Serous Ovarian Cancer (HGSOC) dataset.

Now I set some variables and load some useful libraries. Moreover, to keep the downloads organized, I suggest creating a directory and storing everything there.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(maftools)
library(checkmate)

legacy=FALSE
tumor = "OV"
project = paste0("TCGA-", tumor)
genome = "hg38"

methylation_platforms <- c("Illumina Human Methylation 27","Illumina Human Methylation 450")

# Base directory for all the downloads; create it only if it is missing.
dirname = "downloadTCGA"
if (!dir.exists(dirname)){
  dir.create(dirname)
}
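The snippets in the rest of this tutorial write the path "downloadTCGA/GDCdata" by hand. As a minimal sketch, the same path can be derived from dirname with file.path, so that renaming the base directory means changing one line only. The variable gdcDir is my own name for illustration and is not used elsewhere in this tutorial.

```r
# Build the GDC download path from the base directory.
dirname <- "downloadTCGA"
gdcDir <- file.path(dirname, "GDCdata")   # hypothetical helper variable

# recursive = TRUE also creates the parent directory if it is missing.
if (!dir.exists(gdcDir)) {
  dir.create(gdcDir, recursive = TRUE)
}
```

With this in place, directory = gdcDir could replace the hard-coded string in the download calls.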

Now, we start with the download of clinical data. We are looking for patients with complete clinical annotation.

cliQuery <- GDCquery(project = project, data.category = "Clinical", file.type = "xml")
GDCdownload(cliQuery, method="client", files.per.chunk = 10, directory = "downloadTCGA/GDCdata")
followUp      <- GDCprepare_clinic(cliQuery, clinical.info = "follow_up",
                                   directory = "downloadTCGA/GDCdata")
newTumorEvent <- GDCprepare_clinic(cliQuery, clinical.info = "new_tumor_event",
                                   directory = "downloadTCGA/GDCdata")

We downloaded two data frames: followUp and newTumorEvent.
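Since we are after patients with complete clinical annotation, a quick check for missing values is useful at this point. The sketch below works on a toy data frame in place of followUp. The column names and values are assumptions for illustration and may differ from what GDCprepare_clinic returns, so inspect colnames(followUp) before adapting it.

```r
# Toy stand-in for followUp; column names and values are illustrative only.
toyFollowUp <- data.frame(
  bcr_patient_barcode   = c("TCGA-00-0001", "TCGA-00-0002", "TCGA-00-0003"),
  vital_status          = c("Alive", "", "Dead"),
  days_to_last_followup = c(350, 120, NA),
  stringsAsFactors = FALSE
)

# Empty strings often stand for missing values; turn them into NA.
toyFollowUp$vital_status[toyFollowUp$vital_status == ""] <- NA

# Keep only the rows without any missing field.
completeRows <- toyFollowUp[complete.cases(toyFollowUp), ]
completeRows$bcr_patient_barcode
```

Whether a missing field really makes a patient unusable depends on the analysis, so in practice you may want to restrict complete.cases to the columns you need.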

Now we are going to download the expression counts.

expQuery <- GDCquery(project = project,
                     data.category = "Transcriptome Profiling",
                     data.type="Gene Expression Quantification",
                     workflow.type = "HTSeq - Counts",
                     legacy=legacy)

patients <- getResults(expQuery, cols="cases")
GDCdownload(expQuery, method = "client", files.per.chunk = 1, directory = "downloadTCGA/GDCdata")
exprData <- GDCprepare(expQuery, directory = "downloadTCGA/GDCdata")
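The patients vector holds the barcodes of the queried cases. One possible use, sketched here on invented barcodes, is to check which patients also appear in the clinical table. The first 12 characters of a TCGA barcode identify the patient. The vector clinicalIds is a hypothetical stand-in for a patient barcode column of followUp.

```r
# Invented barcodes standing in for the content of `patients`.
exprBarcodes <- c("TCGA-00-0001-01A-01R-0001-01",
                  "TCGA-00-0003-01A-01R-0001-01")

# Hypothetical patient identifiers taken from the clinical data.
clinicalIds <- c("TCGA-00-0001", "TCGA-00-0002", "TCGA-00-0003")

# Reduce the sample-level barcodes to patient-level identifiers.
exprPatients <- substr(exprBarcodes, 1, 12)

# Patients with both expression and clinical data.
shared <- intersect(exprPatients, clinicalIds)
shared
```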

Now it’s time for the mutations. We chose the MuTect2 pipeline.

mafMutect <- GDCquery_Maf(tumor, pipelines = "mutect2", directory = "downloadTCGA/GDCdata")

Finally, we are going to download the methylation data.

met <- GDCquery(project = project,
                data.category = "DNA Methylation",
                platform = methylation_platforms)

GDCdownload(met, method = "client", files.per.chunk = 1, directory="downloadTCGA/GDCdata")
metData <- GDCprepare(met, directory="downloadTCGA/GDCdata")

We are also going to add CNV data. We download the output of the GISTIC version 2 pipeline from FireBrowse.

gisticTable <- TCGAbiolinks::getGistic("OV-TP")
patients.cnv = substr(colnames(gisticTable)[-c(1:3)], 1, 12)
cnv <- gisticTable[,-c(1:3)]
colnames(cnv) <- patients.cnv
row.names(cnv) <- gisticTable$`Locus ID`
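Truncating the column names to 12 characters turns sample-level barcodes into patient-level ones, so two samples from the same patient would end up with the same column name. A small sketch on invented barcodes shows how to spot such cases before renaming the columns.

```r
# Invented sample-level barcodes; two of them belong to the same patient.
sampleBarcodes <- c("TCGA-00-0001-01A-01D-0001-01",
                    "TCGA-00-0002-01A-01D-0001-01",
                    "TCGA-00-0001-01B-01D-0001-01")

patientIds <- substr(sampleBarcodes, 1, 12)

# Patients represented by more than one sample.
duplicatedPatients <- unique(patientIds[duplicated(patientIds)])
duplicatedPatients
```

If duplicatedPatients is not empty, you need to decide which sample to keep for each patient; that choice is outside the scope of this tutorial.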

Now we save everything in a single RData file.

save(exprData, mafMutect, metData, cnv, followUp, newTumorEvent, file=paste0(dirname, '/',project, "-", genome, ".RData"))
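To read the file back in a later session, load() restores the objects under their original names. Here is a minimal sketch with toy objects and a temporary file, both stand-ins for the real data and path.

```r
# Toy objects standing in for the real downloads.
exprToy <- matrix(1:4, nrow = 2)
cnvToy  <- data.frame(a = c(0, 1))

# Temporary file used instead of the real RData path.
rdataFile <- tempfile(fileext = ".RData")
save(exprToy, cnvToy, file = rdataFile)

# Load into a separate environment to avoid overwriting existing objects.
restored <- new.env()
loadedNames <- load(rdataFile, envir = restored)
loadedNames
```

load() returns the names of the restored objects, which is a handy way to check what a file contains.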

That’s all. The download is done and the data are now ready for preprocessing.

Wanna try yourself?