Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter recruited fix #29

Merged
merged 4 commits into from
Feb 13, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 0 additions & 3 deletions .github/workflows/awsfulltest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -45,9 +45,6 @@ jobs:

- name: Launch workflow via Seqera Platform
uses: seqeralabs/action-tower-launch@v2
# TODO nf-core: You can customise AWS full pipeline tests as required
# Add full size test data (but still relatively small datasets for few samples)
# on the `test_full.config` test runs with only one set of parameters
with:
workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }}
access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
Expand Down
2 changes: 1 addition & 1 deletion .nf-core.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,4 @@ template:
skip_features:
- igenomes
- fastqc
version: 1.0.0
version: 1.1.0dev
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,12 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - [yyyy/mm/dd]

### `Fixed`

- [#29](https://github.com/nf-core/proteinfamilies/pull/29) - Fix `hmmalign` empty input crash error, by preventing the `FILTER_RECRUITED` module from creating an empty output .fasta.gz file, when there are no remaining sequences after filtering the `hmmsearch` results [#28](https://github.com/nf-core/proteinfamilies/issues/28)

## v1.0.0 - [2025/02/05]

Initial release of nf-core/proteinfamilies, created with the [nf-core](https://nf-co.re/) template.
Expand Down
4 changes: 2 additions & 2 deletions assets/multiqc_config.yml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
report_comment: >
This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/releases/tag/1.0.0"
This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/tree/dev"
target="_blank">nf-core/proteinfamilies</a> analysis pipeline. For information about
how to interpret these results, please see the <a href="https://nf-co.re/proteinfamilies/1.0.0/docs/output"
how to interpret these results, please see the <a href="https://nf-co.re/proteinfamilies/dev/docs/output"
target="_blank">documentation</a>.
report_section_order:
"nf-core-proteinfamilies-methods-description":
Expand Down
55 changes: 29 additions & 26 deletions bin/filter_recruited.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,32 +103,35 @@ def validate_and_parse_hit_name(hit):


def extract_fasta_subset(filtered_sequences, fasta, out_fasta):
open_func = gzip.open if fasta.endswith(".gz") else open
with open_func(fasta, "rt") as in_fasta:
fasta_dict = {record.id: str(record.seq) for record in SeqIO.parse(in_fasta, "fasta")}

with gzip.open(out_fasta, "wt") as out_file:
for filtered_sequence in filtered_sequences:
try:
sequence_name, env_from, env_to = validate_and_parse_hit_name(filtered_sequence)

# Get the original sequence
original_record = fasta_dict[sequence_name]

# Extract the specific range (adjust indices for 0-based indexing)
extracted_seq = original_record[env_from-1:env_to]

# Determine the new sequence ID
if len(extracted_seq) == len(original_record):
new_id = sequence_name # Omit range if full-length
else:
new_id = f"{sequence_name}/{env_from}-{env_to}"

out_file.write(f">{new_id}\n{extracted_seq}\n")
except KeyError:
print(f"Sequence {sequence_name} not found in the input FASTA.", file=sys.stderr)
except ValueError as e:
print(e, file=sys.stderr)
if filtered_sequences:
open_func = gzip.open if fasta.endswith(".gz") else open
with open_func(fasta, "rt") as in_fasta:
fasta_dict = {record.id: str(record.seq) for record in SeqIO.parse(in_fasta, "fasta")}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider pyfastx at some point... it's pretty fast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already in the TODOs ;)


with gzip.open(out_fasta, "wt") as out_file:
for filtered_sequence in filtered_sequences:
try:
sequence_name, env_from, env_to = validate_and_parse_hit_name(filtered_sequence)

# Get the original sequence
original_record = fasta_dict[sequence_name]

# Extract the specific range (adjust indices for 0-based indexing)
extracted_seq = original_record[env_from-1:env_to]

# Determine the new sequence ID
if len(extracted_seq) == len(original_record):
new_id = sequence_name # Omit range if full-length
else:
new_id = f"{sequence_name}/{env_from}-{env_to}"

out_file.write(f">{new_id}\n{extracted_seq}\n")
except KeyError:
print(f"Sequence {sequence_name} not found in the input FASTA.", file=sys.stderr)
except ValueError as e:
print(e, file=sys.stderr)
else:
print("No filtered sequences remained to write out. Skipping out_fasta file creation.")


def filter_recruited(domtbl, fasta, length_threshold, out_fasta):
Expand Down
2 changes: 1 addition & 1 deletion modules/local/filter_recruited/main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ process FILTER_RECRUITED {
val(length_threshold)

output:
tuple val(meta), path("*.fasta.gz"), emit: fasta
tuple val(meta), path("*.fasta.gz"), emit: fasta, optional: true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the inner workings of this pipeline but this can cause some samples to be taken out of the flow, which can cause some join issues later on. Have you checked that this doesn't break anything?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, tested until the end of the pipeline and nothing breaks. That's what we would like to do with such edge cases, discard them. In a future update, we will also report them. This could go hand to hand with a general quality_check subworkflow that we are aiming to create and append to the pipeline.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allright sounds good :)

path "versions.yml" , emit: versions

when:
Expand Down
2 changes: 1 addition & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -271,7 +271,7 @@ manifest {
mainScript = 'main.nf'
defaultBranch = 'master'
nextflowVersion = '!>=24.04.2'
version = '1.0.0'
version = '1.1.0dev'
doi = ''
}

Expand Down
20 changes: 10 additions & 10 deletions ro-crate-metadata.json

Large diffs are not rendered by default.