In this brief tutorial, we are going to download a dataset from TCGA. Please note that the datasets are often huge, so the download will take some time.

First, browse the TCGA datasets and pick your favourite (try ‘Launch Data Portal’).

I chose TCGA-OV, i.e. the High Grade Serous Ovarian Cancer (HGSOC) dataset.

Now I set some variables and load some useful libraries. To keep the downloads organized, I also suggest creating a directory and storing everything there.


library(TCGAbiolinks)

tumor = "OV"
project = paste0("TCGA-", tumor)
genome = "hg38"

methylation_platforms <- c("Illumina Human Methylation 27","Illumina Human Methylation 450")

# Create the download directory if it does not exist yet
dirname = "downloadTCGA"
if (!file.exists(dirname)){
  dir.create(dirname)
}

Now, we start with the download of clinical data. We are looking for patients with complete clinical annotation.

cliQuery <- GDCquery(project = project, data.category = "Clinical", file.type = "xml")
GDCdownload(cliQuery, method="client", files.per.chunk = 10, directory = "downloadTCGA/GDCdata")
followUp      <- GDCprepare_clinic(cliQuery, clinical.info = "follow_up",
                                   directory = "downloadTCGA/GDCdata")
newTumorEvent <- GDCprepare_clinic(cliQuery, clinical.info = "new_tumor_event",
                                   directory = "downloadTCGA/GDCdata")

We downloaded two data frames: followUp and newTumorEvent.
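
Since we are after patients with complete clinical annotation, a quick filter on the follow-up table can help. The sketch below runs on a small made-up data frame rather than on the real followUp object; the column names (bcr_patient_barcode, vital_status, days_to_death, days_to_last_followup) are assumptions about what the parsed XML contains, so check colnames(followUp) before adapting it.

```r
# Toy stand-in for followUp; column names are assumed, values are invented.
toyFollowUp <- data.frame(
  bcr_patient_barcode   = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  vital_status          = c("Dead", "Alive", NA),
  days_to_death         = c(420, NA, NA),
  days_to_last_followup = c(NA, 900, NA),
  stringsAsFactors = FALSE
)

# A record counts as complete when the vital status is known and
# at least one of the two time fields is available.
hasStatus <- !is.na(toyFollowUp$vital_status)
hasTime   <- !is.na(toyFollowUp$days_to_death) |
             !is.na(toyFollowUp$days_to_last_followup)
completeFollowUp <- toyFollowUp[hasStatus & hasTime, ]

# Barcodes of the patients that pass the filter.
completePatients <- completeFollowUp$bcr_patient_barcode
```

What "complete" means depends on the analysis; here it is only a hypothetical rule suited to survival data.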

Now we are going to download the expression counts.

expQuery <- GDCquery(project = project,
                     data.category = "Transcriptome Profiling",
                     data.type="Gene Expression Quantification",
                     workflow.type = "HTSeq - Counts")

patients <- getResults(expQuery, cols="cases")
GDCdownload(expQuery, method = "client", files.per.chunk = 1, directory = "downloadTCGA/GDCdata")
exprData <- GDCprepare(expQuery, directory = "downloadTCGA/GDCdata")
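
The patients vector holds full TCGA barcodes, which are longer than the 12-character patient identifier. As a rough illustration of how those barcodes can be split, here is a sketch on invented barcodes (the toy values and the variable names toyCases, patientIds, sampleType and repeated are not part of the original code).

```r
# Invented barcodes in the TCGA layout (project-site-patient-sample-...).
toyCases <- c("TCGA-AA-0001-01A-01R-0000-13",
              "TCGA-AA-0002-01A-01R-0000-13",
              "TCGA-AA-0002-02A-01R-0000-13")

# The first 12 characters identify the patient.
patientIds <- substr(toyCases, 1, 12)

# Characters 14-15 encode the sample type (e.g. "01" is a primary solid tumor).
sampleType <- substr(toyCases, 14, 15)

# Patients that contribute more than one sample.
repeated <- unique(patientIds[duplicated(patientIds)])
```

Spotting repeated patients early is useful because the other data types are usually matched by patient, not by sample.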

Now it’s time for the mutations. We chose the MuTect2 pipeline.

mafMutect <- GDCquery_Maf(tumor, pipelines = "mutect2", directory = "downloadTCGA/GDCdata")
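
To get a first feeling for a MAF table, one can count mutations per patient and patients per gene. The following is a minimal sketch on a made-up table; it assumes the standard MAF columns Hugo_Symbol and Tumor_Sample_Barcode are present in mafMutect, and the toy rows are invented.

```r
# Toy stand-in for mafMutect with two standard MAF columns; rows are invented.
toyMaf <- data.frame(
  Hugo_Symbol = c("TP53", "BRCA1", "TP53"),
  Tumor_Sample_Barcode = c("TCGA-AA-0001-01A-01D-0000-09",
                           "TCGA-AA-0001-01A-01D-0000-09",
                           "TCGA-AA-0002-01A-01D-0000-09"),
  stringsAsFactors = FALSE
)

# Reduce the sample barcode to the 12-character patient identifier.
toyMaf$patient <- substr(toyMaf$Tumor_Sample_Barcode, 1, 12)

# Number of mutation records per patient.
mutationsPerPatient <- table(toyMaf$patient)

# Number of distinct patients carrying a mutation in each gene.
patientsPerGene <- tapply(toyMaf$patient, toyMaf$Hugo_Symbol,
                          function(x) length(unique(x)))
```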

Finally, we are going to download the methylation data.

met <- GDCquery(project = project,
                data.category = "DNA Methylation",
                platform = methylation_platforms)

GDCdownload(met, method = "client", files.per.chunk = 1, directory="downloadTCGA/GDCdata")
metData <- GDCprepare(met, directory="downloadTCGA/GDCdata")

We are also going to add CNV data. We download the output of the GISTIC2 pipeline from FireBrowse.

gisticTable <- TCGAbiolinks::getGistic("OV-TP")
patients.cnv = substr(colnames(gisticTable)[-c(1:3)], 1, 12)
cnv <- gisticTable[,-c(1:3)]
colnames(cnv) <- patients.cnv
row.names(cnv) <- gisticTable$`Locus ID`

Now we save everything in a single RData file.

save(exprData, mafMutect, metData, cnv, followUp, newTumorEvent, file=paste0(dirname, '/',project, "-", genome, ".RData"))
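
When the file is needed again, load() restores the objects under their original names. A small sketch of the round trip, using toy objects and a temporary file (toyExpr, toyCnv, toyFile and restored are hypothetical names, not part of the tutorial code):

```r
# Toy objects standing in for the real downloads.
toyExpr <- matrix(1:4, nrow = 2, dimnames = list(c("g1", "g2"), c("p1", "p2")))
toyCnv  <- data.frame(p1 = c(0, 1), p2 = c(-1, 2), row.names = c("g1", "g2"))

# Write to a temporary file so nothing in the working directory is touched.
toyFile <- file.path(tempdir(), "toy-download.RData")
save(toyExpr, toyCnv, file = toyFile)

# Load into a fresh environment to avoid overwriting objects in the session;
# load() returns the names of the restored objects.
restored <- new.env()
loadedNames <- load(toyFile, envir = restored)
```

Loading into a separate environment is optional, but it makes it easy to see exactly what a file contains before mixing it with the current session.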

That’s all. The download is done and the data are now ready for preprocessing.

Want to try it yourself?