Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should one search for ORFs? #299

Open
jakobnissen opened this issue Oct 23, 2023 · 5 comments
Open

How should one search for ORFs? #299

jakobnissen opened this issue Oct 23, 2023 · 5 comments

Comments

@jakobnissen
Copy link
Member

CC @camilogarciabotero, ref. #297

How would a user go about findings ORFs with BioSequences? To do this, they need to be able to do several fundamental operations that BioSequences currently doesn't support:

  • BioRegex doesn't support most find* functions, like findnext and findall.
  • There is no way to search for an amino acid in a nucseq, given a translation table.
  • There is no way to search only in-frame.

It's possible this needs to be added in Kmers.jl, since Kmers contain the RNACodon type, the CodonSet type, and reverse translation. Kmers.jl is currently in an incomplete state, but I'm slowly working rewriting it.

So, my current proposal is:

  • Implement the find* functions for BioRegex.
  • In Kmers, allow searching for a CodonSet in a nucseq.

Maybe we should include a convenience function in Kmers that is something like this:

function find_next(s::BioSequence, codons::CodonSet, from::Index)
    v = view(s, from:lastindex(s))
    for (step, kmer) in enumerate(SpacedKmers{RNAAlphabet{2}, 3, 3}(v))
        kmer codons && return 3 * (step-1) + from
    end
    nothing
end
@kescobo
Copy link
Member

kescobo commented Oct 23, 2023

We should definitely support find*, and the idea of in-frame searching would indeed be cool.

Maybe we should include a convenience function in Kmers

findnext is alread in base, is there a reason we woudn't extend that?

@jakobnissen
Copy link
Member Author

Sorry, I wasn't being clear. We should definitely extend the find* methods for bioregex, regardless of this issue.
To search for stop codons or amino acids in a nucleotide sequence, I think we need the Codon, CodonSet reverse translation, and SpacedKmers functionality from Kmers.jl, so we might want to implement the whole "search for codon thing" there.
I don't super love the distinction - really, codons and reverse translation has little to do with kmers and should maybe belong in this package, but implementation wise it's just soooo much nicer to have codons be implemented as 3-mers.

@kescobo
Copy link
Member

kescobo commented Oct 23, 2023

I mean, codons seem like 3-mers to me 🤷 - actually, they might be the O.G. Kmer 😆

I suppose my questions would be

  1. Is Kmers.jl going to be sufficiently low-level that it makes sense for this package to depend on it?
  2. Is there a way to do a naive implementation of search w/o Kmers, in a way that can be made more efficient / ergonomic long-term once Kmers.jl is done?

@camilogarciabotero
Copy link
Member

camilogarciabotero commented Oct 23, 2023

Thanks for opening the issue out of the PR, I did have my doubts about the implementation and where to put it. Finding that there were other predicates in BioSequences looked to me like it was a suitable place, but it indeed is very idiosyncratic for broad use.

This hasprematurestop method came up first (now I'm using it somewhere else...) from the ORF library ( GeneFinder.jl) I've been working on. And I did have these exact issues you mentioned when trying to implement a findorf method:

BioRegex doesn't support most find* functions, like findnext and findall.
There is no way to search for an amino acid in a nucseq, given a translation table.
There is no way to search only in-frame.

The implementation I ended up with was a little cumbersome: It first creates an iterator of a BioRegex that match a complex ORF:

function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool = false) where N
    regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
    finder(x) = findfirst(regorf, sequence, first(x) + 1) # + 3
    itr = takewhile(!isnothing, iterated(finder, findfirst(regorf, sequence)))
    return itr
end

And then traverse the sequence looking for the iterator match in both strands of the sequence and stores the ORFs that match the BioRegex:

function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, min_len::Int64 = 6) where N
    orfs = Vector{ORF}()
    reversedseq = reverse_complement(sequence)
    seqlen = length(sequence)

    frames = Dict(0 => 3, 1 => 1, 2 => 2)

    for strand in ('+', '-')
        seq = strand == '-' ? reversedseq : sequence

        @inbounds for location in locationiterator(seq; alternative_start)
            if length(location) >= min_len
                frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
                push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
            end
        end
    end
    return sort(orfs)
end

It takes care of the frame issue by simply calculating the module of the start position of each match (location.start % 3). It is now working fine for most ORF cases. But it allocates tones of memory. The idea to have find_next and the CodonSet as well as a SpacedKmer would definitely improve the way in which this implementation could go.

Another alternative I was thinking about (but I lack the expertise) is to create a better Regex and create a finite automaton that parses it and then stores the matches (I actually don't know if it is viable). But the problem is that the BioRegex uses non-deterministic expression (*?) that operates lazily to capture even the smallest ORF possible (M*).

@JokingHero
Copy link

Please take a look here: https://github.com/Roleren/ORFik/blob/master/src/findOrfsFasta.cpp

Its well tested implementation and very fast at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants