tuberculosis.pipeline

This is the data processing pipeline for the tuberculosis R/Bioconductor package; it should be run inside the tuberculosis.pipeline Docker container. The Docker container is available through Docker Hub, and the tuberculosis.pipeline GitHub repository has a branch, docker, which contains the related Dockerfile.

Developer Notes

Be sure you are in the right place before proceeding; the tuberculosis.pipeline GitHub repository is intended only for developers and extremely curious users of the tuberculosis R/Bioconductor package. The functions in the tuberculosis.pipeline R package are not documented and receive only a high-level overview here. A number of steps in the pipeline upload/download files to/from Google Drive or read/write files from/to Google Sheets; anyone wishing to run the pipeline would need to connect their own Google Drive and Google Sheets accounts. When run inside the tuberculosis.pipeline Docker container, the pipeline should be completely reproducible; the NOTES.md file details all changes beyond standard operation of the pipeline that are necessary to reproduce data from the tuberculosis R/Bioconductor package exactly. “Sample Number” in the NOTES.md file refers to the “Sample Number” column of the series-metadata Google Sheets file.
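
The pipeline presumably connects to these services through the googledrive and googlesheets4 packages (an assumption; the package internals are not documented). A minimal sketch of authenticating both packages against a single account, with a placeholder email, might look as follows.

# authenticate googledrive interactively; the email below is a placeholder
googledrive::drive_auth(email = "you@example.com")

# reuse the googledrive token so both packages share one login
googlesheets4::gs4_auth(token = googledrive::drive_token())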

Series Metadata

The series-metadata Google Sheets file provides information about all GEO Series that are or will be made available through the tuberculosis R/Bioconductor package. The sheet can be updated with a single function call as follows.

tuberculosis.pipeline:::update_series_metadata()

This function call will query GEO using the following syntax and obtain metadata about each Series. If no new Series have been published since the series-metadata Google Sheets file was last updated, the function will say so in a message.

“Tuberculosis”[MeSH Terms] AND “Homo sapiens”[Organism] AND “gse”[Filter] AND (“Expression profiling by array”[Filter] OR “Expression profiling by high throughput sequencing”[Filter])
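
For illustration only, the same query could be issued directly against the GEO DataSets (gds) Entrez database with the rentrez package; whether the pipeline uses rentrez internally is an assumption here.

# the query syntax shown above, assembled as a single string
query <- paste(
    '"Tuberculosis"[MeSH Terms]',
    '"Homo sapiens"[Organism]',
    '"gse"[Filter]',
    '("Expression profiling by array"[Filter] OR "Expression profiling by high throughput sequencing"[Filter])',
    sep = " AND "
)

# search Entrez and return the UIDs of matching GEO Series
result <- rentrez::entrez_search(db = "gds", term = query, retmax = 1000)
result$ids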

Note that the “Sample Number” reported for a Series in the series-metadata Google Sheets file may be slightly higher than what is available through the tuberculosis R/Bioconductor package because of filtering that occurs during data processing.

Gene Annotation

The gene names used by the pipeline are HGNC-approved GRCh38 gene names from the genenames.org REST API. To update the gene names, the sysdata.R file can be sourced as follows; however, this is somewhat discouraged because it necessitates updating the microarray annotation and reprocessing both microarray and sequencing data.

source("data-raw/sysdata.R")

When gene names have been updated, microarray annotation must also be updated to reflect the changes. To update probe-to-gene mappings, there is an R file for each GEO Platform in the data-raw directory, all of which can be sourced as follows.

source("data-raw/data-GPL10558.R")
source("data-raw/data-GPL13497.R")
source("data-raw/data-GPL1352.R")
source("data-raw/data-GPL15207.R")
source("data-raw/data-GPL17077.R")
source("data-raw/data-GPL1708.R")
source("data-raw/data-GPL17586.R")
source("data-raw/data-GPL17692.R")
source("data-raw/data-GPL21185.R")
source("data-raw/data-GPL23126.R")
source("data-raw/data-GPL4133.R")
source("data-raw/data-GPL5175.R")
source("data-raw/data-GPL570.R")
source("data-raw/data-GPL6102.R")
source("data-raw/data-GPL6244.R")
source("data-raw/data-GPL6254.R")
source("data-raw/data-GPL6480.R")
source("data-raw/data-GPL6848.R")
source("data-raw/data-GPL6884.R")
source("data-raw/data-GPL6947.R")
source("data-raw/data-GPL96.R")
source("data-raw/data-HG-U133_Plus_2.R")
source("data-raw/data-HG-U133A.R")
source("data-raw/data-PrimeView.R")
source("data-raw/data-U133_X3P.R")

The process of updating probe-to-gene mappings is computationally intensive because of the number of regex matches that must be made. To speed up the process where multiple cores are available, the following command can be issued before sourcing the microarray annotation files above to take advantage of parallel processing.

future::plan(strategy = future::multisession())
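
By default multisession uses all available cores; to cap the number of background R sessions, the documented future idiom below can be used instead (the worker count is illustrative).

# limit parallel processing to four background R sessions
future::plan(future::multisession, workers = 4)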

When gene names and probe-to-gene mappings have been updated, both microarray and sequencing data should be reprocessed. If the genome build has changed, the write_sh function should be revised to reflect the change.

Microarray Data

Each GEO Series with microarray data that is or will be made available through the tuberculosis R/Bioconductor package has an R file that was used in data processing. The source of the data-raw/GSE102459.R file is shown below for reference.

tuberculosis.pipeline:::get_sample_metadata("GSE102459") |>
    tuberculosis.pipeline:::tidyup_sample_metadata() |>
    tuberculosis.pipeline:::review_sample_metadata() |>
    purrr::imap(tuberculosis.pipeline:::upload_sample_metadata) |>
    purrr::imap(tuberculosis.pipeline:::get_microarray_data) |>
    purrr::imap(tuberculosis.pipeline:::upload_microarray_data)

First, the get_sample_metadata function is used to retrieve sample metadata from the GEO Series Matrix File(s). Then, the tidyup_sample_metadata function is used to clean the sample metadata and filter out samples based on exclusion criteria. After that, the review_sample_metadata function checks that sample metadata from at least one sample remains for further processing. If so, the upload_sample_metadata function is used to upload the sample metadata to Google Drive; the function simply returns the same sample metadata that was uploaded for further use. The get_microarray_data function then uses the sample metadata to download raw per-sample microarray data from GEO (e.g. CEL files) and constructs a gene expression matrix that is finally uploaded to Google Drive as a CSV file by the upload_microarray_data function.
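
The internals of get_microarray_data are not documented here. Purely as an illustration of the kind of work involved, and assuming the oligo package rather than the pipeline's actual tooling, RMA-normalizing a directory of CEL files might look like the following.

# read all CEL files in the working directory into a single object
raw <- oligo::read.celfiles(oligo::list.celfiles())

# background-correct, normalize, and summarize probes with RMA
eset <- oligo::rma(raw)

# extract the probe-by-sample expression matrix
expression_matrix <- Biobase::exprs(eset)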

Sequencing Data

Each GEO Series with sequencing data that is or will be made available through the tuberculosis R/Bioconductor package has an R file that was used in data processing. The source of the data-raw/GSE101705.R file is shown below for reference.

tuberculosis.pipeline:::get_sample_metadata("GSE101705") |>
    tuberculosis.pipeline:::tidyup_sample_metadata() |>
    tuberculosis.pipeline:::review_sample_metadata() |>
    purrr::map(tuberculosis.pipeline:::translate_geo_rownames) |>
    purrr::imap(tuberculosis.pipeline:::upload_sample_metadata) |>
    purrr::imap(tuberculosis.pipeline:::write_sh)

Most of the functions are identical to those used to process microarray data described above, but two additional functions are used here. The translate_geo_rownames function translates GEO Sample accession numbers to SRA Run accession numbers, which are needed to download raw per-sample sequencing data (i.e. fastq files). The write_sh function writes a shell script to exec/GSE101705/GSE101705.sh, which uses sratoolkit and nextflow to process fastq files into gene expression measurements. On SGE or SGE-like clusters the script can be submitted with the qsub command.

qsub GSE101705.sh

The shell script creates an array job with one task per sample and uses the nf-core/rnaseq pipeline inside a Singularity container to process data in a reproducible manner. When the job has finished, check that all samples were processed successfully with grep -L "Succeeded : 13" GSE101705/*.log; no output indicates success. If any samples were not processed correctly, read the corresponding log files and repeat the qsub command if necessary; only the unsuccessful samples will be reprocessed. Then the upload_sequencing_data function is used to construct a gene expression matrix that is uploaded to Google Drive as a CSV file; the GSE101705 argument in the following example is the path to the directory where the shell script was run.

tuberculosis.pipeline:::upload_sequencing_data("GSE101705")

The upload_sequencing_data function is also computationally intensive because of the number of regex matches that must be made. To speed up the process where multiple cores are available, the following command can be issued before the function call above to take advantage of parallel processing.

future::plan(strategy = future::multisession())

Finally, while some of the syntax of the shell script is specific to the Boston University Shared Computing Cluster, where sequencing data for the tuberculosis R/Bioconductor package were prepared, only minor modifications would be required for submission on a different cluster.

Expression Data

To prepare expression data to be made available through the tuberculosis R/Bioconductor package, a few additional steps are required. Microarray and sequencing data that have been uploaded to Google Drive as CSV files must be downloaded and saved as R matrix objects as follows.

tuberculosis.pipeline:::dowload_expression_data() |>
    tuberculosis.pipeline:::write_expression_matrix() |>
    tuberculosis.pipeline:::write_metadata_csv_file()

The dowload_expression_data function is used to download expression data from Google Drive and will create a directory of CSV files at ~/expression-data. The directory is used by the write_expression_matrix function, which reads in the CSV files and writes R matrix objects to the ~/abs-upload-data directory. Then, the write_metadata_csv_file function is used to write a CSV file describing the R matrix objects for ExperimentHub to the ~/eh-csv-metadata directory; it will be used by a Bioconductor core team member to insert records into the ExperimentHub database. The R matrix objects should be uploaded to Azure Blob Storage, as follows, before the records are inserted into the ExperimentHub database.

tuberculosis.pipeline:::upload_expression_data("~/abs-upload-data/2021-09-15/")

The command above requires credentials from Bioconductor; the date in the path will differ depending on when the pipeline was run.