In this brief tutorial, we are going to download a dataset from TCGA. Please note that the datasets are often huge, so the download will take some time.
First, browse the TCGA datasets at https://cancergenome.nih.gov/abouttcga/overview and pick your favourite (try ‘Launch Data Portal’).
I chose TCGA-OV, i.e. the High Grade Serous Ovarian Cancer (HGSOC) dataset.
Now I set some variables and load some useful libraries. Moreover, to keep the downloads organized, I suggest creating a directory and storing everything there.
library(TCGAbiolinks)
library(SummarizedExperiment)
library(maftools)
library(checkmate)
legacy=FALSE
tumor = "OV"
project = paste0("TCGA-", tumor)
genome = "hg38"
methylation_platforms <- c("Illumina Human Methylation 27","Illumina Human Methylation 450")
dirname = "downloadTCGA"
if (!file.exists(dirname)) {
  dir.create(dirname)
}
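The same setup can be sketched with `file.path()`, so that the download directory is defined once instead of being retyped in every call. The names `gdc_dir` and `out_file` are my own additions for illustration; they are not used elsewhere in this tutorial.

```r
# Build every path from one base directory.
# gdc_dir and out_file are illustrative names, not part of the original script.
dirname <- "downloadTCGA"
project <- "TCGA-OV"
genome  <- "hg38"

gdc_dir  <- file.path(dirname, "GDCdata")
out_file <- file.path(dirname, paste0(project, "-", genome, ".RData"))

# recursive = TRUE also creates the parent directory;
# showWarnings = FALSE keeps reruns quiet when the directory already exists.
dir.create(gdc_dir, recursive = TRUE, showWarnings = FALSE)

print(gdc_dir)
print(out_file)
```

With this in place, `gdc_dir` could be passed as the `directory` argument of the download calls and `out_file` as the `file` argument of the final `save()`.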
Now, we start with the download of clinical data. We are looking for patients with complete clinical annotation.
cliQuery <- GDCquery(project = project, data.category = "Clinical", file.type = "xml")
GDCdownload(cliQuery, method="client", files.per.chunk = 10, directory = "downloadTCGA/GDCdata")
followUp <- GDCprepare_clinic(cliQuery, clinical.info = "follow_up",
                              directory = "downloadTCGA/GDCdata")
newTumorEvent <- GDCprepare_clinic(cliQuery, clinical.info = "new_tumor_event",
                                   directory = "downloadTCGA/GDCdata")
We downloaded two ‘data.frames’: followUp and newTumorEvent.
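To make "complete clinical annotation" concrete, here is a minimal sketch on a toy table. The column names (`bcr_patient_barcode`, `vital_status`, `days_to_last_followup`, `days_to_death`) and the barcodes are assumptions for illustration; check `colnames(followUp)` before relying on them, and note that the real table may encode missing values as empty strings rather than `NA`.

```r
# Toy stand-in for followUp; column names and values are assumptions.
followUp_toy <- data.frame(
  bcr_patient_barcode   = c("TCGA-04-1331", "TCGA-04-1332", "TCGA-04-1336"),
  vital_status          = c("Dead", "Alive", NA),
  days_to_last_followup = c(NA, 1200, 300),
  days_to_death         = c(1336, NA, NA),
  stringsAsFactors      = FALSE
)

# Keep a patient when the vital status is known and
# at least one of the two time fields is filled in.
has_status <- !is.na(followUp_toy$vital_status)
has_time   <- !is.na(followUp_toy$days_to_death) |
              !is.na(followUp_toy$days_to_last_followup)

complete <- followUp_toy[has_status & has_time, ]
print(complete$bcr_patient_barcode)
```

The definition of "complete" used here is only one possible choice; a survival analysis may need stricter or different criteria.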
Now we are going to download the expression counts.
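# Note: the workflow names offered by GDC depend on the data release; newer
# releases may list "STAR - Counts" instead of "HTSeq - Counts".
expQuery <- GDCquery(project = project,
                     data.category = "Transcriptome Profiling",
                     data.type = "Gene Expression Quantification",
                     workflow.type = "HTSeq - Counts",
                     legacy = legacy)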
patients <- getResults(expQuery, cols="cases")
GDCdownload(expQuery, method = "client", files.per.chunk = 1, directory = "downloadTCGA/GDCdata")
exprData <- GDCprepare(expQuery, directory = "downloadTCGA/GDCdata")
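The `patients` vector holds full TCGA barcodes, one per sample. As a small sketch of how such barcodes can be taken apart, the first 12 characters identify the patient and characters 14 to 15 hold the sample type code (codes 01 to 09 are tumor samples, 11 is solid tissue normal). The barcodes below are made up for illustration.

```r
# Made-up barcodes in the TCGA format, for illustration only.
barcodes <- c("TCGA-04-1331-01A-01R-1569-13",
              "TCGA-04-1332-01A-01R-1564-13",
              "TCGA-04-1332-11A-01R-1564-13")

# Characters 1-12: patient identifier; characters 14-15: sample type code.
patient_id  <- substr(barcodes, 1, 12)
sample_type <- substr(barcodes, 14, 15)

# Sample type codes below 10 denote tumor samples.
is_tumor <- as.integer(sample_type) < 10
tumor_patients <- unique(patient_id[is_tumor])

print(data.frame(patient_id, sample_type, is_tumor))
print(tumor_patients)
```

The 12-character patient identifier is what allows the expression samples to be matched to the clinical tables and, later, to the CNV columns.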
Now it’s time for the mutations. We chose the mutect2 pipeline.
mafMutect <- GDCquery_Maf(tumor, pipelines = "mutect2", directory = "downloadTCGA/GDCdata")
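A quick sanity check on a MAF table is to count mutations per patient. The sketch below uses a toy table with three standard MAF columns (`Hugo_Symbol`, `Tumor_Sample_Barcode`, `Variant_Classification`); the rows are invented, and I am assuming `mafMutect` exposes the same column names.

```r
# Toy MAF-like table; all rows are invented for illustration.
maf_toy <- data.frame(
  Hugo_Symbol = c("TP53", "BRCA1", "TP53", "NF1"),
  Tumor_Sample_Barcode = c("TCGA-04-1331-01A-01W-0486-08",
                           "TCGA-04-1331-01A-01W-0486-08",
                           "TCGA-04-1332-01A-01W-0488-09",
                           "TCGA-04-1336-01A-01W-0488-09"),
  Variant_Classification = c("Missense_Mutation", "Frame_Shift_Del",
                             "Nonsense_Mutation", "Silent"),
  stringsAsFactors = FALSE
)

# Reduce the sample barcode to the 12-character patient identifier.
maf_toy$patient <- substr(maf_toy$Tumor_Sample_Barcode, 1, 12)

# Number of mutation records per patient.
mut_per_patient <- table(maf_toy$patient)
print(mut_per_patient)

# Patients carrying at least one TP53 mutation.
tp53_patients <- unique(maf_toy$patient[maf_toy$Hugo_Symbol == "TP53"])
print(tp53_patients)
```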
Finally, we are going to download the methylation data.
met <- GDCquery(project = project,
                data.category = "DNA Methylation",
                platform = methylation_platforms)
GDCdownload(met, method = "client", files.per.chunk = 1, directory="downloadTCGA/GDCdata")
metData <- GDCprepare(met, directory="downloadTCGA/GDCdata")
We are also going to add CNV data. We download the output of the GISTIC version 2 pipeline from FireBrowse.
gisticTable <- TCGAbiolinks::getGistic("OV-TP")
patients.cnv = substr(colnames(gisticTable)[-c(1:3)], 1, 12)
cnv <- gisticTable[,-c(1:3)]
colnames(cnv) <- patients.cnv
row.names(cnv) <- gisticTable$`Locus ID`
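The reshaping above can be tried on a toy table first. This sketch assumes the first three columns are gene annotation (I use `Gene Symbol`, `Locus ID` and `Cytoband`; only `Locus ID` is confirmed by the code above) and that every remaining column is one sample. The barcodes and values are invented. It also adds a check that truncating the barcodes does not produce duplicate patient names, which would happen if a patient had more than one sample.

```r
# Toy GISTIC-like table; sample barcodes and values are invented.
gistic_toy <- data.frame(
  `Gene Symbol` = c("TP53", "MYC"),
  `Locus ID`    = c(7157, 4609),
  Cytoband      = c("17p13.1", "8q24.21"),
  `TCGA-04-1331-01A-01D-0428-01` = c(-1, 2),
  `TCGA-04-1332-01A-01D-0428-01` = c(0, 1),
  check.names = FALSE
)

# Drop the three annotation columns and shorten barcodes to patient level.
patients_cnv <- substr(colnames(gistic_toy)[-(1:3)], 1, 12)

# Stop early if two samples collapse onto the same patient.
stopifnot(anyDuplicated(patients_cnv) == 0)

cnv_toy <- as.matrix(gistic_toy[, -(1:3)])
dimnames(cnv_toy) <- list(as.character(gistic_toy$`Locus ID`), patients_cnv)
print(cnv_toy)
```

Converting to a matrix is a design choice of this sketch; the tutorial code keeps `cnv` as a data frame.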
Now we are going to save everything in a single RData file.
save(exprData, mafMutect, metData, cnv, followUp, newTumorEvent, file=paste0(dirname, '/',project, "-", genome, ".RData"))
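To check that a saved file contains what you expect, it can be loaded into a separate environment so that nothing in the workspace is overwritten. The sketch below uses small toy objects and a temporary file (`exprToy`, `cnvToy` and `toy_file` are hypothetical names) rather than the real download.

```r
# Toy objects standing in for the real data; names are hypothetical.
exprToy <- matrix(1:4, nrow = 2)
cnvToy  <- matrix(0, nrow = 2, ncol = 2)

toy_file <- file.path(tempdir(), "TCGA-OV-hg38-toy.RData")
save(exprToy, cnvToy, file = toy_file)

# load() returns the names of the restored objects;
# envir = restored keeps them out of the global workspace.
restored <- new.env()
loaded_names <- load(toy_file, envir = restored)
print(loaded_names)

# Confirm the round trip preserved the data.
stopifnot(identical(restored$exprToy, exprToy))
stopifnot(identical(restored$cnvToy, cnvToy))
```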
That’s all. The download is done and the data are now ready for preprocessing.
Wanna try yourself?