-
Notifications
You must be signed in to change notification settings - Fork 749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow cat_cat to retain double file extensions for .gz files #4230
Conversation
- Adds logic to capture file extensions that preceed `.gz`, e.g. `.fasta.gz` and use this double extension as the suffix for the output file. - Updates tests accordingly.
Instead of a conditional check using a regex capture group and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`, this update searches for a regex match and directly adds it as a suffix if there is a match, and uses the existing lastIndexOf() method to capture a single file extension otherwise. Co-authored-by: Matthias Hörtenhuber <[email protected]>
@nf-core-bot fix linting |
I'll need to take a closer look at this later, but the tests that appeared to work before are now failing in the CI pipeline. Did anything big change recently? |
@pmoris . I've fixed the tests. There were two issues:
I also resolved the conflicts with Feel free to merge it. |
This means that .fastq.gz will remain .fastq.gz and can then match the condition to bypass the SAMTOOLS_FASTA process. There is a pull-request for this, nf-core/modules#4230
Hi @muffato , thanks for looking into this! I had just picked up this PR myself (as you might have guessed based on my question in the nf-core slack channel ;) ). I ended up starting a new branch based on the latest version of |
…#4230) * Retain double file extensions for .gz files - Adds logic to capture file extensions that preceed `.gz`, e.g. `.fasta.gz` and use this double extension as the suffix for the output file. - Updates tests accordingly. * Re-add cat/cat entry to pytest_modules.yml * Simplify code to extract double .gz file extensions Instead of a conditional check using a regex capture group and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`, this update searches for a regex match and directly adds it as a suffix if there is a match, and uses the existing lastIndexOf() method to capture a single file extension otherwise. Co-authored-by: Matthias Hörtenhuber <[email protected]> * [automated] Fix linting with Prettier * Moved the function outside of the process block * bugfix: missing closing bracket * Reordered the lines to make the diff shorter * Changed the tests to regular snapshots so that we can see and check the output file name --------- Co-authored-by: Matthias Hörtenhuber <[email protected]> Co-authored-by: nf-core-bot <[email protected]> Co-authored-by: Matthieu Muffato <[email protected]>
…#4230) * Retain double file extensions for .gz files - Adds logic to capture file extensions that preceed `.gz`, e.g. `.fasta.gz` and use this double extension as the suffix for the output file. - Updates tests accordingly. * Re-add cat/cat entry to pytest_modules.yml * Simplify code to extract double .gz file extensions Instead of a conditional check using a regex capture group and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`, this update searches for a regex match and directly adds it as a suffix if there is a match, and uses the existing lastIndexOf() method to capture a single file extension otherwise. Co-authored-by: Matthias Hörtenhuber <[email protected]> * [automated] Fix linting with Prettier * Moved the function outside of the process block * bugfix: missing closing bracket * Reordered the lines to make the diff shorter * Changed the tests to regular snapshots so that we can see and check the output file name --------- Co-authored-by: Matthias Hörtenhuber <[email protected]> Co-authored-by: nf-core-bot <[email protected]> Co-authored-by: Matthieu Muffato <[email protected]>
…#4230) * Retain double file extensions for .gz files - Adds logic to capture file extensions that preceed `.gz`, e.g. `.fasta.gz` and use this double extension as the suffix for the output file. - Updates tests accordingly. * Re-add cat/cat entry to pytest_modules.yml * Simplify code to extract double .gz file extensions Instead of a conditional check using a regex capture group and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`, this update searches for a regex match and directly adds it as a suffix if there is a match, and uses the existing lastIndexOf() method to capture a single file extension otherwise. Co-authored-by: Matthias Hörtenhuber <[email protected]> * [automated] Fix linting with Prettier * Moved the function outside of the process block * bugfix: missing closing bracket * Reordered the lines to make the diff shorter * Changed the tests to regular snapshots so that we can see and check the output file name --------- Co-authored-by: Matthias Hörtenhuber <[email protected]> Co-authored-by: nf-core-bot <[email protected]> Co-authored-by: Matthieu Muffato <[email protected]>
This PR attempts to
.gz
, which can sometimes result in problems if downstream tools expect files to have double extensions like.fastq.gz
.cat_cat
entry inpytest_modules.yml
that happened in this PR.Handling compressed file extensions
For 1, unfortunately, I didn't manage to find an elegant solution that applies to all cases, but I opted for the current one which I found to be reasonably pragmatic:
This logic seems to work fine for files like
file.fasta.gz
, while leaving regular.gz
files (without any preceding extension) alone.Where it fails are files that contain dots in their name and where the length of the penultimate stretch between dots happens to be 1-5 characters long. E.g., if someone were to use a file like
Homo_sapiens.GRCh38.dna.chromosome.MT.gz
,.MT
would wrongly be considered a file extension. Note that the actual ensembl file is namedHomo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
, where.fa.gz
would be captured correctly.Input needed on how/if to make this behaviour optional
It might be the case that users don't want this new behaviour to be the default, so it could be made optional instead. I can think of two options to do this: use a channel input or rely on
ext.args
.I fear the former would mean that all current implementations of the cat_cat module would need to add an empty
[]
channel input (since optional inputs are not yet implemented in nextflow, are they?). The latter option feels a bit off though as well, becauseext.args
seems to usually be reserved for existing arguments specific to the underlying tool. Looking at the docs, this doesn't fit any of the definitions provided there. The third option is just making this new behaviour standard (as it is written now), without an option to modify it.What do you think is best?
Possible alternatives/improvements that might be worth considering
.fastq.gz
,.fa.gz
,.fasta.gz
, etc. But where do we draw the line and ensure it is generic enough?lastIndexOf
method that I used.Testing problems
file.gz
andfile.fasta.gz
outputs.TMPDIR=~ PROFILE=singularity pytest --tag cat/cat --symlink --keep-workflow-wd
, and for conda too, but for docker I receive an error (AssertionError:
versions.ymlnot found in the output directory
). I have an alias pointing between docker and podman since I'm running Fedora, that might be causing the problem?nf-core modules test cat/cat
also fails due to file access permissions on my machine.nf-test test modules/nf-core/cat/cat/tests/main.nf.test --profile
for the different profiles though and those seemed to work without me needing to alter any of the files (except for docker again).EDIT: the conda tests started failing after I rebased on upstream, likely due to commit 516189e.
PR checklist
Closes #3353
versions.yml
file. (=> is this a reminder to create these inside processes or should I provide a specific file?)label
PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware