Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No way to force the sequence type when loading a FASTA file #477

Closed
gezmi opened this issue Jun 16, 2023 · 4 comments
Closed

No way to force the sequence type when loading a FASTA file #477

gezmi opened this issue Jun 16, 2023 · 4 comments

Comments

@gezmi
Copy link

gezmi commented Jun 16, 2023

Hi,

I am trying to create sequence logos from fasta files, and there I some that I cannot process but for no obvious reasons.

My code is:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq

file = fasta.FastaFile()
file.read('1MFG.fa')

seqs = seq.io.fasta.get_alignment(file)
seq.SequenceProfile.from_alignment(seqs)

For the 1MFG.fa file, this gives
ValueError: There is no common alphabet that extends all alphabets
1MFG.txt
1JWG.txt
(I needed to convert them to txt just for uploading to github, they were named 1MFG.fa and 1JWG.fa, respectively.)

But for the 1JWG.fa file, it works. Can you help me debugging this? Both only contains the 20 common amino acids.

Thank you for you help!

@padix-key
Copy link
Member

Hi, thanks for reporting. The problem is that the sequences are very short, so that the sequence type determination of the convenience function get_alignment() identifies some of the sequences as nucleotide sequences. While Biotite allows Alignment objects of mixed sequence types (reasonable for example for protein sequence to structural alphabet alignment), this does not work for SequenceProfile creation. The solution would be to avoid using the convenience function and be explicit about the sequence type:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.sequence.align as align


def read_alignment(fasta_file):
    seq_strings = list(fasta_file.values())
    sequences = [
        # Explicit creation of ProteinSequence
        seq.ProteinSequence(seq_str.replace("-",""))
        for seq_str in seq_strings
    ]
    trace = align.Alignment.trace_from_strings(seq_strings)
    return align.Alignment(sequences, trace)

file = fasta.FastaFile()
file.read('1MFG.fa')

alignment = read_alignment(file)
print(alignment)
seq.SequenceProfile.from_alignment(alignment)

Although it is quite nonintuitive that get_alignment() may return an alignment with mixed sequence types, finding the right sequence type over a large range of sequences would probably generally decrease the performance. So in my opinion we should just mention in the documentation, that this scenario may happen.

@gezmi
Copy link
Author

gezmi commented Jun 16, 2023

Would it be possible to explicitly set the type of sequence when reading the fastest file? So that everything is forced to be protein/nucleotide etc? Now it works, but that may also help others in the future.

Thank you!

@padix-key
Copy link
Member

Yes, I think that would also be a reasonable solution. I think an optional seq_type parameter for fasta.get_sequence(), fasta.get_sequences() and fasta.get_alignment() should be sufficient. I will keep this issue open, until this is implemented.

@padix-key padix-key changed the title There is no common alphabet that extends all alphabets No way to force the sequence type when loading a FASTA file Jun 18, 2023
@padix-key
Copy link
Member

Thanks to @t0mdavid-m the new seq_type parameter is now implemented, hence I will close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants