Make genomic FASTA input optional #1490

Merged
merged 26 commits into from
Jan 22, 2025
Changes from 1 commit
a69d1d2
Update salmon indexing module
pinin4fjords Jan 21, 2025
c4a416d
Make fasta optional for gtf filtering
pinin4fjords Jan 21, 2025
0723238
Allow no fasta during param checks
pinin4fjords Jan 21, 2025
7c73f77
Rework prepare_genome for optional fasta
pinin4fjords Jan 21, 2025
53a1638
Fix bbsplit param usage for optional fasta
pinin4fjords Jan 21, 2025
4bb0af1
Add test for no fasta
pinin4fjords Jan 21, 2025
64a4547
lint fix
pinin4fjords Jan 21, 2025
d6ef689
Add snap for new test
pinin4fjords Jan 21, 2025
cbd5201
Restore output comments
pinin4fjords Jan 21, 2025
47b292c
Restore input comments
pinin4fjords Jan 21, 2025
b5e676b
Restore file comment
pinin4fjords Jan 21, 2025
b622f53
Restore existence checks
pinin4fjords Jan 21, 2025
0fdf742
Remove some unecessary changes
pinin4fjords Jan 21, 2025
f139bbe
Update changelog
pinin4fjords Jan 21, 2025
a9684ea
Remove duplicate section
pinin4fjords Jan 21, 2025
35ec56c
Fix for tweaked filtered GTF name
pinin4fjords Jan 22, 2025
0d4ef8f
Fix for tweaked filtered GTF name
pinin4fjords Jan 22, 2025
ae062b9
Update docs
pinin4fjords Jan 22, 2025
efb8e07
Temporarily disable 'latest-everything' testing due to incompatibilit…
pinin4fjords Jan 22, 2025
4e2dce2
Merge branch 'optional_fasta' of https://github.com/nf-core/rnaseq in…
pinin4fjords Jan 22, 2025
445ca7d
Apply suggestions from code review
pinin4fjords Jan 22, 2025
c45fbe5
Apply suggestions from code review
pinin4fjords Jan 22, 2025
bd585b0
Fix file names in snap
pinin4fjords Jan 22, 2025
871644d
Merge branch 'optional_fasta' of https://github.com/nf-core/rnaseq in…
pinin4fjords Jan 22, 2025
f07b1b1
Update usage.md
pinin4fjords Jan 22, 2025
21eb5ad
prettier
pinin4fjords Jan 22, 2025
Apply suggestions from code review
pinin4fjords authored Jan 22, 2025

commit c45fbe5b7cf7e49839d9e035ebb91f20d4fcfb98
6 changes: 3 additions & 3 deletions docs/usage.md
@@ -128,7 +128,7 @@ The `--aligner hisat2` option is not currently supported using ARM architecture

By default, the pipeline uses [STAR](https://github.com/alexdobin/STAR) (i.e. `--aligner star_salmon`) to map the raw FastQ reads to the reference genome, project the alignments onto the transcriptome and to perform the downstream BAM-level quantification with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html). STAR is fast but requires a lot of memory to run, typically around 38GB for the Human GRCh37 reference genome. Since the [RSEM](https://github.com/deweylab/RSEM) (i.e. `--aligner star_rsem`) workflow in the pipeline also uses STAR you should use the [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) aligner (i.e. `--aligner hisat2`) if you have memory limitations.

-You also have the option to pseudoalign and quantify your data directly with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) or [Kallisto](https://pachterlab.github.io/kallisto/) by specifying `salmon` or `kallisto` to the `--pseudo_aligner` parameter. The selected pseudoaligner will then be run in addition to the standard alignment workflow defined by `--aligner`, mainly because it allows you to obtain QC metrics with respect to the genomic alignments. However, you can provide the `--skip_alignment` parameter if you would like to run Salmon or Kallisto in isolation. By default, the pipeline will use the genome fasta and gtf file to generate the transcripts fasta file, and then to build the Salmon index. You can override these parameters using the `--transcript_fasta` and `--salmon_index` parameters, respectively. By default, even `--skip_alignment set` Salmon will still use the genomic FASTA file, providing the sequences as 'decoys' (see [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode)), and this is the recommended mode of operation in this situation. However, if you do not supply a FASTA file, Salmon will run without those decoys, using only transcript sequences in the index.
+You also have the option to pseudoalign and quantify your data directly with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) or [Kallisto](https://pachterlab.github.io/kallisto/) by specifying `salmon` or `kallisto` to the `--pseudo_aligner` parameter. The selected pseudoaligner will then be run in addition to the standard alignment workflow defined by `--aligner`, mainly because it allows you to obtain QC metrics with respect to the genomic alignments. However, you can provide the `--skip_alignment` parameter if you would like to run Salmon or Kallisto in isolation. By default, the pipeline will use the genome fasta and gtf file to generate the transcripts fasta file, and then to build the Salmon index. You can override these parameters using the `--transcript_fasta` and `--salmon_index` parameters, respectively. By default, when specifying `--pseudo_aligner salmon` without an index, even with `--skip_alignment set` Salmon will still use the genomic FASTA file when building an index, providing the sequences as 'decoys' (see [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode)), and this is the recommended mode of operation in this situation. However, if you do not supply a FASTA file, Salmon will run without those decoys, using only transcript sequences in the index.

The library preparation protocol (library type) used by Salmon quantification is inferred by the pipeline based on the information provided in the samplesheet, however, you can override it using the `--salmon_quant_libtype` parameter. You can find the available options in the [Salmon documentation](https://salmon.readthedocs.io/en/latest/library_type.html). Similarly, strandedness is taken from the sample sheet or calculated automatically, and passed to Kallisto on a per-library basis, but you can apply a global override by setting the Kallisto strandedness parameters in `--extra_kallisto_quant_args` like `--extra_kallisto_quant_args '--fr-stranded'` see the [Kallisto documentation](https://pachterlab.github.io/kallisto/manual).

@@ -227,7 +227,7 @@ Notes:
- If `--gene_bed` is not provided then it will be generated from the GTF file.
- If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
- When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).
-- If the `--skip_alignment` option is used along with `--transcript_fasta`, the pipeline can technically run without providing the genomic FASTA (`--fasta`). However, this approach is **not recommended**, as any dynamically generated Salmon index will lack decoys. To ensure optimal indexing with decoys, it is **highly recommended** to include the genomic FASTA (`--fasta`) whenever possible, unless a pre-existing decoy-aware Salmon index is supplied. For more details on the benefits of decoy-aware indexing, refer to the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode).
+- If the `--skip_alignment` option is used along with `--transcript_fasta`, the pipeline can technically run without providing the genomic FASTA (`--fasta`). However, this approach is **not recommended** with `--pseudo_aligner salmon`, as any dynamically generated Salmon index will lack decoys. To ensure optimal indexing with decoys, it is **highly recommended** to include the genomic FASTA (`--fasta`) with Salmon, unless a pre-existing decoy-aware Salmon index is supplied. For more details on the benefits of decoy-aware indexing, refer to the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode).

#### Reference genome

@@ -346,7 +346,7 @@ nextflow run \
-profile docker
```

-This is not usually recommended unless you also supply a previously generated decoy-aware Salmon transcriptome.
+This is not usually recommended with Salmon unless you also supply a previously generated decoy-aware Salmon transcriptome.

> **NB:** Loading iGenomes configuration remains the default for reasons of consistency with other workflows, but should be disabled when not using iGenomes, applying the recommended usage above.
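
For readers who want to see the two modes touched by this change side by side, here is a minimal sketch that only prints (it does not run) a pseudoalignment-only command, with and without `--fasta`. The helper name `build_cmd` and all file names (`samplesheet.csv`, `annotation.gtf`, `transcripts.fa`, `genome.fa`, `results`) are hypothetical placeholders, and the exact set of parameters a given release requires is an assumption; check the pipeline's parameter documentation before relying on it.

```shell
#!/bin/sh
# Sketch only: assembles and prints a command line; nothing is executed.
# All file names are hypothetical placeholders.

build_cmd() {
    # $1: path to a genomic FASTA, or an empty string to omit --fasta
    fasta="$1"
    cmd="nextflow run nf-core/rnaseq -profile docker"
    cmd="$cmd --input samplesheet.csv --outdir results"
    cmd="$cmd --gtf annotation.gtf --transcript_fasta transcripts.fa"
    cmd="$cmd --skip_alignment --pseudo_aligner salmon"
    if [ -n "$fasta" ]; then
        # With a genomic FASTA, a dynamically built Salmon index
        # can include the genome sequences as decoys.
        cmd="$cmd --fasta $fasta"
    fi
    printf '%s\n' "$cmd"
}

# No genomic FASTA: the Salmon index is built from transcripts only.
build_cmd ""

# Genomic FASTA supplied: the mode the docs recommend with Salmon.
build_cmd genome.fa
```

Printing the command rather than executing it keeps the sketch safe to paste into a terminal; drop the wrapper and run the printed line once the paths point at real files.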