Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow cat_cat to retain double file extensions for .gz files #4230

Merged
merged 10 commits into from
Jan 16, 2024
Merged

Allow cat_cat to retain double file extensions for .gz files #4230

merged 10 commits into from
Jan 16, 2024

Conversation

pmoris
Copy link
Contributor

@pmoris pmoris commented Oct 25, 2023

This PR attempts to

  1. address the issue raised here. Briefly, the CAT_CAT module discards any extension(s) that occur before .gz, which can sometimes result in problems if downstream tools expect files to have double extensions like .fastq.gz.
  2. fix the (accidental?) removal of the cat_cat entry in pytest_modules.yml that happened in this PR.

Handling compressed file extensions

For 1, unfortunately, I didn't manage to find an elegant solution that applies to all cases, but I opted for the current one which I found to be reasonably pragmatic:

Check if the desired output file ends in .gz and if there is a stretch of 1-5 letters or digits preceded by a dot, which itself is preceded by at least one more character.
If so, extract the two last file extensions, e.g. .fasta.gz.
If not, extract only the regular final extension, e.g. .txt.
If the output name is provided via task.ext.prefix, nothing is changed.

This logic seems to work fine for files like file.fasta.gz, while leaving regular .gz files (without any preceding extension) alone.

Where it fails are files that contain dots in their name and where the length of the penultimate stretch between dots happens to be 1-5 characters long. E.g., if someone were to use a file like Homo_sapiens.GRCh38.dna.chromosome.MT.gz, .MT would wrongly be considered a file extension. Note that the actual ensembl file is named Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz, where .fa.gz would be captured correctly.

Input needed on how/if to make this behaviour optional

It might be the case that users don't want this new behaviour to be the default, so it could be made optional instead. I can think of two options to do this: use a channel input or rely on ext.args.

I fear the former would mean that all current implementations of the cat_cat module would need to add an empty [] channel input (since optional inputs are not yet implemented in nextflow, are they?). The latter option feels a bit off though as well, because ext.args seems to usually be reserved for existing arguments specific to the underlying tool. Looking at the docs, this doesn't fit any of the definitions provided there. The third option is just making this new behaviour standard (as it is written now), without an option to modify it.

What do you think is best?

Possible alternatives/improvements that might be worth considering

  • To avoid the problem with file names with dots, we could instead create conditions for most of the common expected file types: .fastq.gz, .fa.gz, .fasta.gz, etc. But where do we draw the line and ensure it is generic enough?
  • Tidy up the file extension capture logic, by re-using the regex capture group, rather than the nested lastIndexOf method that I used.

Testing problems

  • I modified an existing test to handle this new behaviour. I'm confused by how the stubs are supposed to work, so not sure if all conditions for those are still met.
  • Depending on the choice of making this optional, a new test should be added to differentiate between file.gz and file.fasta.gz outputs.
  • Tests work fine when I run TMPDIR=~ PROFILE=singularity pytest --tag cat/cat --symlink --keep-workflow-wd, and for conda too, but for docker I receive an error (AssertionError: versions.yml not found in the output directory). I have an alias pointing between docker and podman since I'm running Fedora, that might be causing the problem?
  • nf-core modules test cat/cat also fails due to file access permissions on my machine.
  • I'm not quite sure if I've understood the differences between the tests in the main modules directory and in the testing directory. I ran nf-test test modules/nf-core/cat/cat/tests/main.nf.test --profile for the different profiles though and those seemed to work without me needing to alter any of the files (except for docker again).

EDIT: the conda tests started failing after I rebased on upstream, likely due to commit 516189e.

PR checklist

Closes #3353

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file. (=> is this a reminder to create these inside processes or should I provide a specific file?)
  • [x ] Follow the naming conventions.
  • Follow the parameters requirements.
  • [x ] Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware

@pmoris pmoris added bug Something isn't working enhancement New feature or request good first issue Good for newcomers update module Ready for Review labels Oct 25, 2023
@pmoris pmoris requested a review from maxulysse October 25, 2023 11:19
modules/nf-core/cat/cat/main.nf Outdated Show resolved Hide resolved
pmoris and others added 3 commits October 25, 2023 15:48
- Adds logic to capture file extensions that
preceed `.gz`, e.g. `.fasta.gz` and use this
double extension as the suffix for the output
file.
- Updates tests accordingly.
Instead of a conditional check using a regex capture group
and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`,
this update searches for a regex match and directly adds it as a suffix
if there is a match, and uses the existing lastIndexOf() method
to capture a single file extension otherwise.

Co-authored-by: Matthias Hörtenhuber <[email protected]>
@FriederikeHanssen
Copy link
Contributor

@nf-core-bot fix linting

@pmoris
Copy link
Contributor Author

pmoris commented Nov 9, 2023

I'll need to take a closer look at this later, but the tests that appeared to work before are now failing in the CI pipeline.

Did anything big change recently?

@muffato
Copy link
Member

muffato commented Jan 12, 2024

@pmoris . I've fixed the tests. There were two issues:

  • the closing bracket of the substring was missing
  • it seems functions can't be defined inside process blocks, so I moved it outside

I also resolved the conflicts with master and updated the scripts because this module has moved to the nf-test framework in the meantime.
I changed the nf-test tests to do regular MD5 checks inside of file content so that we can check the filename as well. And we can indeed see that the output file is now named test.gff3.gz.

Feel free to merge it.

@muffato muffato self-requested a review January 12, 2024 14:19
muffato added a commit to sanger-tol/blobtoolkit that referenced this pull request Jan 12, 2024
This means that .fastq.gz will remain .fastq.gz and can then match the
condition to bypass the SAMTOOLS_FASTA process.

There is a pull-request for this, nf-core/modules#4230
@pmoris
Copy link
Contributor Author

pmoris commented Jan 16, 2024

Hi @muffato , thanks for looking into this! I had just picked up this PR myself (as you might have guessed based on my question in the nf-core slack channel ;) ).

I ended up starting a new branch based on the latest version of master, because resolving the conflicts in this one proved to be a bit of a headache, due to all the changes in the testing framework. Your changes look similar to what I ended up with though, so that's great. Cheers!

@pmoris pmoris added this pull request to the merge queue Jan 16, 2024
Merged via the queue into nf-core:master with commit 81f27e7 Jan 16, 2024
10 checks passed
@pmoris pmoris deleted the cat_cat branch January 16, 2024 09:33
lrauschning pushed a commit to lrauschning/modules that referenced this pull request Jan 17, 2024
…#4230)

* Retain double file extensions for .gz files

- Adds logic to capture file extensions that
preceed `.gz`, e.g. `.fasta.gz` and use this
double extension as the suffix for the output
file.
- Updates tests accordingly.

* Re-add cat/cat entry to pytest_modules.yml

* Simplify code to extract double .gz file extensions

Instead of a conditional check using a regex capture group
and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`,
this update searches for a regex match and directly adds it as a suffix
if there is a match, and uses the existing lastIndexOf() method
to capture a single file extension otherwise.

Co-authored-by: Matthias Hörtenhuber <[email protected]>

* [automated] Fix linting with Prettier

* Moved the function outside of the process block

* bugfix: missing closing bracket

* Reordered the lines to make the diff shorter

* Changed the tests to regular snapshots so that we can see and check the output file name

---------

Co-authored-by: Matthias Hörtenhuber <[email protected]>
Co-authored-by: nf-core-bot <[email protected]>
Co-authored-by: Matthieu Muffato <[email protected]>
lrauschning pushed a commit to lrauschning/modules that referenced this pull request Jan 17, 2024
…#4230)

* Retain double file extensions for .gz files

- Adds logic to capture file extensions that
preceed `.gz`, e.g. `.fasta.gz` and use this
double extension as the suffix for the output
file.
- Updates tests accordingly.

* Re-add cat/cat entry to pytest_modules.yml

* Simplify code to extract double .gz file extensions

Instead of a conditional check using a regex capture group
and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`,
this update searches for a regex match and directly adds it as a suffix
if there is a match, and uses the existing lastIndexOf() method
to capture a single file extension otherwise.

Co-authored-by: Matthias Hörtenhuber <[email protected]>

* [automated] Fix linting with Prettier

* Moved the function outside of the process block

* bugfix: missing closing bracket

* Reordered the lines to make the diff shorter

* Changed the tests to regular snapshots so that we can see and check the output file name

---------

Co-authored-by: Matthias Hörtenhuber <[email protected]>
Co-authored-by: nf-core-bot <[email protected]>
Co-authored-by: Matthieu Muffato <[email protected]>
jennylsmith pushed a commit to RSC-RP/modules that referenced this pull request Mar 20, 2024
…#4230)

* Retain double file extensions for .gz files

- Adds logic to capture file extensions that
preceed `.gz`, e.g. `.fasta.gz` and use this
double extension as the suffix for the output
file.
- Updates tests accordingly.

* Re-add cat/cat entry to pytest_modules.yml

* Simplify code to extract double .gz file extensions

Instead of a conditional check using a regex capture group
and a nested lastIndexOf() extraction for double extensions `.xxxxx.gz`,
this update searches for a regex match and directly adds it as a suffix
if there is a match, and uses the existing lastIndexOf() method
to capture a single file extension otherwise.

Co-authored-by: Matthias Hörtenhuber <[email protected]>

* [automated] Fix linting with Prettier

* Moved the function outside of the process block

* bugfix: missing closing bracket

* Reordered the lines to make the diff shorter

* Changed the tests to regular snapshots so that we can see and check the output file name

---------

Co-authored-by: Matthias Hörtenhuber <[email protected]>
Co-authored-by: nf-core-bot <[email protected]>
Co-authored-by: Matthieu Muffato <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request good first issue Good for newcomers Ready for Review update module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CAT_CAT does not deal well with compressed file extensions
5 participants