This is the data processing pipeline for the tuberculosis R/Bioconductor package; it should be run inside the tuberculosis.pipeline Docker container. The Docker container is available through Docker Hub, and the tuberculosis.pipeline GitHub repository has a branch, docker, which contains the related Dockerfile.
Be sure you are in the right place before proceeding; the tuberculosis.pipeline GitHub repository is intended only for developers and extremely curious users of the tuberculosis R/Bioconductor package. The functions in the tuberculosis.pipeline R package are not documented and receive only a high-level overview here. A number of steps in the pipeline upload/download files to/from Google Drive or read/write files from/to Google Sheets; anyone wishing to run the pipeline would need to connect their own Google Drive and Google Sheets accounts. When run inside the tuberculosis.pipeline Docker container, the pipeline should be completely reproducible; the NOTES.md file details all changes beyond standard operation of the pipeline that are necessary to reproduce the data from the tuberculosis R/Bioconductor package exactly. “Sample Number” in the NOTES.md file refers to the “Sample Number” column of the series-metadata Google Sheets file.
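How the pipeline itself authenticates is not documented here, but as a hedged sketch, connecting your own accounts with the googledrive and googlesheets4 packages would look something like the following; the email address is a placeholder.
# Hedged sketch: authenticate your own accounts before running any of the
# Drive/Sheets steps; the email address below is a placeholder.
googledrive::drive_auth(email = "you@example.com")
googlesheets4::gs4_auth(token = googledrive::drive_token())  # reuse the Drive token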
The series-metadata Google Sheets file provides information about all GEO Series that are or will be made available through the tuberculosis R/Bioconductor package. The sheet can be updated with a single function call as follows.
tuberculosis.pipeline:::update_series_metadata()
This function call queries GEO using the following syntax and obtains metadata about each Series. If no new Series have been published since the series-metadata Google Sheets file was last updated, the function will display a message saying so.
“Tuberculosis”[MeSH Terms] AND “Homo sapiens”[Organism] AND “gse”[Filter] AND (“Expression profiling by array”[Filter] OR “Expression profiling by high throughput sequencing”[Filter])
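The internals of update_series_metadata() are not shown here, but a query with this syntax can also be issued manually from R with the rentrez package; the following is a minimal sketch, and the retmax value is arbitrary.
# Hypothetical manual query against the GEO DataSets (gds) Entrez database;
# not necessarily how update_series_metadata() queries GEO internally.
query <- paste(
  '"Tuberculosis"[MeSH Terms]',
  '"Homo sapiens"[Organism]',
  '"gse"[Filter]',
  paste0(
    '("Expression profiling by array"[Filter] OR ',
    '"Expression profiling by high throughput sequencing"[Filter])'
  ),
  sep = " AND "
)
result <- rentrez::entrez_search(db = "gds", term = query, retmax = 1000)
result$ids  # Entrez UIDs of matching GEO Series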
Note that the “Sample Number” reported for a Series in the series-metadata Google Sheets file may be slightly higher than what is available through the tuberculosis R/Bioconductor package because of filtering that occurs during data processing.
The gene names used by the pipeline are HGNC-approved GRCh38 gene names from the genenames.org REST API. To update the gene names, the sysdata.R file can be sourced as follows; however, this is somewhat discouraged because it necessitates updating the microarray annotation and reprocessing both microarray and sequencing data.
source("data-raw/sysdata.R")
When gene names have been updated, microarray annotation must also be updated to reflect the changes. To update probe to gene mappings, there is an R file for each GEO Platform in the data-raw directory, all of which can be sourced as follows.
source("data-raw/data-GPL10558.R")
source("data-raw/data-GPL13497.R")
source("data-raw/data-GPL1352.R")
source("data-raw/data-GPL15207.R")
source("data-raw/data-GPL17077.R")
source("data-raw/data-GPL1708.R")
source("data-raw/data-GPL17586.R")
source("data-raw/data-GPL17692.R")
source("data-raw/data-GPL21185.R")
source("data-raw/data-GPL23126.R")
source("data-raw/data-GPL4133.R")
source("data-raw/data-GPL5175.R")
source("data-raw/data-GPL570.R")
source("data-raw/data-GPL6102.R")
source("data-raw/data-GPL6244.R")
source("data-raw/data-GPL6254.R")
source("data-raw/data-GPL6480.R")
source("data-raw/data-GPL6848.R")
source("data-raw/data-GPL6884.R")
source("data-raw/data-GPL6947.R")
source("data-raw/data-GPL96.R")
source("data-raw/data-HG-U133_Plus_2.R")
source("data-raw/data-HG-U133A.R")
source("data-raw/data-PrimeView.R")
source("data-raw/data-U133_X3P.R")
The process of updating probe to gene mappings is computationally intensive because of the number of regex matches that must be made. To speed up the process where multiple cores are available, the following command can be issued to take advantage of parallel processing before sourcing the microarray annotation files above.
future::plan(strategy = future::multisession())
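By default, multisession uses all available cores; a fixed number of workers can be requested instead via plan()'s workers argument, e.g. on a shared machine.
# Request a fixed number of parallel workers instead of all available cores.
future::plan(future::multisession, workers = 4)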
When gene names and probe to gene mappings have been updated, both microarray and sequencing data should be reprocessed. If the genome build has changed, the write_sh function should be revised to reflect the change.
Each GEO Series with microarray data that is or will be made available through the tuberculosis R/Bioconductor package has an R file that was used in data processing. The source of the data-raw/GSE102459.R file is shown below for reference.
tuberculosis.pipeline:::get_sample_metadata("GSE102459") |>
tuberculosis.pipeline:::tidyup_sample_metadata() |>
tuberculosis.pipeline:::review_sample_metadata() |>
purrr::imap(tuberculosis.pipeline:::upload_sample_metadata) |>
purrr::imap(tuberculosis.pipeline:::get_microarray_data) |>
purrr::imap(tuberculosis.pipeline:::upload_microarray_data)
First, the get_sample_metadata function is used to retrieve sample metadata from the GEO Series Matrix File(s). Then, the tidyup_sample_metadata function is used to clean the sample metadata and filter out samples based on exclusion criteria. After that, the review_sample_metadata function checks that sample metadata from at least one sample remains for further processing. If so, the upload_sample_metadata function is used to upload the sample metadata to Google Drive; the function simply returns the same sample metadata that was uploaded for further use. The get_microarray_data function then uses the sample metadata to download raw per sample microarray data from GEO (e.g. CEL files) and constructs a gene expression matrix that is finally uploaded to Google Drive as a CSV file by the upload_microarray_data function.
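As an aside, the repeated purrr::imap() calls indicate that the metadata moves through the pipeline as a named list (presumably keyed by accession); imap() passes each element together with its name, as this minimal illustration shows.
# purrr::imap() forwards both the element and its name to the function.
purrr::imap(list(a = 1, b = 2), function(x, nm) paste(nm, "=", x))
#> $a
#> [1] "a = 1"
#>
#> $b
#> [1] "b = 2"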
Likewise, each GEO Series with sequencing data that is or will be made available through the tuberculosis R/Bioconductor package has an R file that was used in data processing. The source of the data-raw/GSE101705.R file is shown below for reference.
tuberculosis.pipeline:::get_sample_metadata("GSE101705") |>
tuberculosis.pipeline:::tidyup_sample_metadata() |>
tuberculosis.pipeline:::review_sample_metadata() |>
purrr::map(tuberculosis.pipeline:::translate_geo_rownames) |>
purrr::imap(tuberculosis.pipeline:::upload_sample_metadata) |>
purrr::imap(tuberculosis.pipeline:::write_sh)
Most of the functions are identical to those used to process microarray data described above, but two other functions are used here. The translate_geo_rownames function is used to translate GEO Sample accession numbers to SRA Run accession numbers because they are needed to download raw per sample sequencing data (i.e. fastq files). The write_sh function writes a shell script to exec/GSE101705/GSE101705.sh, which uses sratoolkit and nextflow to process fastq files into gene expression measurements. On SGE or SGE-like clusters, the script can be submitted with the qsub command.
qsub GSE101705.sh
The shell script creates an array job with one task per sample and uses the nf-core/rnaseq pipeline inside a Singularity container to process data in a reproducible manner. When the job has finished, check that all samples were processed successfully with grep -L "Succeeded : 13" GSE101705/*.log; no output indicates success. If there are samples that were not processed correctly, read the log file and repeat the qsub command if necessary; only the unsuccessful samples will be reprocessed.
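For those who prefer to run the check from R, the following sketch mirrors the grep -L command above (listing log files that do not contain the success line); the expected "Succeeded : 13" count is specific to this Series and pipeline run.
# R equivalent of grep -L: list log files missing the success line.
logs <- list.files("GSE101705", pattern = "\\.log$", full.names = TRUE)
done <- vapply(logs, function(f) any(grepl("Succeeded : 13", readLines(f))), logical(1))
logs[!done]  # character(0) means every sample succeeded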
The upload_sequencing_data function is then used to construct a gene expression matrix that is uploaded to Google Drive as a CSV file; the GSE101705 argument in the following example is the path to the directory where the shell script was run.
tuberculosis.pipeline:::upload_sequencing_data("GSE101705")
The upload_sequencing_data function is also computationally intensive because of the number of regex matches that must be made. To speed up the process where multiple cores are available, the following command can be issued to take advantage of parallel processing before the function call above.
future::plan(strategy = future::multisession())
Finally, while some of the syntax of the shell script is specific to the Boston University Shared Computing Cluster, where sequencing data for the tuberculosis R/Bioconductor package were prepared, only minor modifications would be required for submission on a different cluster.
To prepare expression data to be made available through the tuberculosis R/Bioconductor package, a few additional steps are required. Microarray and sequencing data that have been uploaded to Google Drive as CSV files must be downloaded and saved as R matrix objects as follows.
tuberculosis.pipeline:::dowload_expression_data() |>
tuberculosis.pipeline:::write_expression_matrix() |>
tuberculosis.pipeline:::write_metadata_csv_file()
The dowload_expression_data function is used to download expression data from Google Drive and will create a directory of CSV files at ~/expression-data. The directory is used by the write_expression_matrix function, which reads in the CSV files and writes R matrix objects to the ~/abs-upload-data directory. Then, the write_metadata_csv_file function is used to write a CSV file describing the R matrix objects for ExperimentHub to the ~/eh-csv-metadata directory; it will be used by a Bioconductor core team member to insert records into the ExperimentHub database. The R matrix objects should be uploaded to Azure Blob Storage, as follows, before the records are inserted into the ExperimentHub database.
tuberculosis.pipeline:::upload_expression_data("~/abs-upload-data/2021-09-15/")
The command above requires credentials from Bioconductor; the date would also be different depending on when the pipeline was run.
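Assuming the dated subdirectory simply reflects the run date, the path for the current run could be built programmatically; this is a convenience sketch only.
# Hypothetical: build the dated upload path from today's date, assuming
# write_expression_matrix names the subdirectory after the run date.
tuberculosis.pipeline:::upload_expression_data(
  file.path("~/abs-upload-data", Sys.Date())
)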