No way to force the sequence type when loading a FASTA file #477

gezmi · 2023-06-16T13:09:17Z

Hi,

I am trying to create sequence logos from FASTA files, and there are some that I cannot process, for no obvious reason.

My code is:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq

file = fasta.FastaFile()
file.read('1MFG.fa')

seqs = seq.io.fasta.get_alignment(file)
seq.SequenceProfile.from_alignment(seqs)

For the 1MFG.fa file, this gives
ValueError: There is no common alphabet that extends all alphabets
1MFG.txt
1JWG.txt
(I needed to convert them to txt just for uploading to github, they were named 1MFG.fa and 1JWG.fa, respectively.)

But for the 1JWG.fa file, it works. Can you help me debug this? Both files contain only the 20 common amino acids.

Thank you for your help!

padix-key · 2023-06-16T20:24:36Z

Hi, thanks for reporting. The problem is that the sequences are very short, so that the sequence type determination of the convenience function get_alignment() identifies some of the sequences as nucleotide sequences. While Biotite allows Alignment objects of mixed sequence types (reasonable for example for protein sequence to structural alphabet alignment), this does not work for SequenceProfile creation. The solution would be to avoid using the convenience function and be explicit about the sequence type:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.sequence.align as align


def read_alignment(fasta_file):
    seq_strings = list(fasta_file.values())
    sequences = [
        # Explicit creation of ProteinSequence
        seq.ProteinSequence(seq_str.replace("-",""))
        for seq_str in seq_strings
    ]
    trace = align.Alignment.trace_from_strings(seq_strings)
    return align.Alignment(sequences, trace)

file = fasta.FastaFile()
file.read('1MFG.fa')

alignment = read_alignment(file)
print(alignment)
seq.SequenceProfile.from_alignment(alignment)

Although it is quite unintuitive that get_alignment() may return an alignment with mixed sequence types, determining the correct sequence type across a large set of sequences would probably decrease performance. So in my opinion we should just mention in the documentation that this scenario may happen.
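
To illustrate how a short protein sequence can be mistaken for a nucleotide sequence, here is a simplified pure-Python sketch of a "try nucleotide first" guess. It is not Biotite's actual implementation: the helper name guess_type, the two letter sets and the example peptides are assumptions made up for illustration.

```python
# IUPAC nucleotide codes, including the ambiguity codes.
# Assumption: the guess accepts any of these letters as a nucleotide symbol.
NUCLEOTIDE_LETTERS = set("ACGTURYSWKMBDHVN")
# One-letter codes of the 20 standard amino acids.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def guess_type(seq_str):
    """Hypothetical helper: try 'nucleotide' before 'protein'."""
    letters = set(seq_str.upper().replace("-", ""))
    if letters <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    raise ValueError("Unknown sequence type")


# Every letter of this peptide is also a valid IUPAC nucleotide code
print(guess_type("GSVMTDKHGA"))
# 'L' is not a nucleotide code, so the guess falls through to protein
print(guess_type("GSVMTDKHGAL"))
```

The shorter a peptide is, the more likely it is to consist only of letters that are also nucleotide codes, which is how one alignment can end up with mixed sequence types.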

gezmi · 2023-06-16T20:27:11Z

Would it be possible to explicitly set the sequence type when reading the FASTA file, so that everything is forced to be protein, nucleotide, etc.? It works now, but that may also help others in the future.

Thank you!

padix-key · 2023-06-18T09:59:22Z

Yes, I think that would also be a reasonable solution. An optional seq_type parameter for fasta.get_sequence(), fasta.get_sequences() and fasta.get_alignment() should be sufficient. I will keep this issue open until this is implemented.

padix-key · 2023-06-19T19:36:40Z

Thanks to @t0mdavid-m the new seq_type parameter is now implemented, hence I will close this issue.

padix-key changed the title ~~There is no common alphabet that extends all alphabets~~ No way to force the sequence type when loading a FASTA file Jun 18, 2023

padix-key added enhancement good first issue labels Jun 18, 2023

t0mdavid-m mentioned this issue Jun 18, 2023

Add manual seq type #478

Merged

padix-key closed this as completed Jun 19, 2023
