fix indel inference in ancestral reconstructions #131

Closed
metasoarous opened this issue Jan 27, 2017 · 11 comments

Comments

@metasoarous
Member

DNAML assumes that any insertions in the tip alignment started at the root. Since the naive is technically not the root but a tip, what gets inferred from our lineage alignments (where we manually put the naive at the root) is that an insertion happened shortly after the naive sequence (the root) and then disappeared in most of the other tips (all but the tip with the actual insertion). This threw @lauranoges for a loop, and is likely to do the same with others as well.

I think we can fix this pretty simply: if we look at each gap in the naive sequence, we can see which tip sequences don't share that gap (and there must be some, or the gap would have been filtered out by this point), and place gaps in all internal node sequences except those non-tip sequences descending from the MRCA of the nodes with the insertion. This would be easy enough to code up and would generally solve the problem. Thoughts @matsen?
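A rough sketch of what that post-processing could look like, under some loud assumptions: the tree is a plain `{parent: [children]}` dict, sequences are aligned strings keyed by node name, and the MRCA itself is allowed to keep the insertion. The names `mask_insertions`, `children`, and `seqs` are made up for illustration and are not part of the pipeline.

```python
def parent_map(children):
    """Invert a {node: [child, ...]} mapping into {child: parent}."""
    return {c: p for p, kids in children.items() for c in kids}

def path_to_root(node, parents):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def mrca(nodes, parents):
    """Most recent common ancestor of `nodes` (a single node is its own MRCA)."""
    paths = [path_to_root(n, parents) for n in nodes]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking up from the first node, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in shared)

def descendants(node, children):
    """`node` plus everything below it."""
    out, stack = set(), [node]
    while stack:
        cur = stack.pop()
        out.add(cur)
        stack.extend(children.get(cur, []))
    return out

def mask_insertions(children, seqs, naive, gap="-"):
    """Gap out naive-gap columns in internal nodes outside the insertion clade."""
    parents = parent_map(children)
    internal = set(children)
    tips = [n for n in seqs if n not in internal and n != naive]
    fixed = {n: list(s) for n, s in seqs.items()}
    for col, char in enumerate(seqs[naive]):
        if char != gap:
            continue
        carriers = [t for t in tips if seqs[t][col] != gap]
        if not carriers:
            continue
        # Assumption: the MRCA and its descendants keep their inferred bases.
        keep = descendants(mrca(carriers, parents), children)
        for node in internal - keep:
            fixed[node][col] = gap
    return {n: "".join(s) for n, s in fixed.items()}

# Toy example: only tip t1 carries the insertion at column 2.
children = {"naive": ["n1"], "n1": ["n2", "t3"], "n2": ["t1", "t2"]}
seqs = {"naive": "AC-GT", "n1": "ACTGT", "n2": "ACTGT",
        "t1": "ACTGT", "t2": "AC-GT", "t3": "AC-GT"}
fixed = mask_insertions(children, seqs, "naive")
```

In the toy example the reconstructed internal nodes `n1` and `n2` lose the spurious `T`, while `t1` keeps it. Columns are handled one at a time, so a multi-column insertion is not treated as a single event here.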

@matsen
Contributor

matsen commented Jan 28, 2017

Not totally agreed. I agree that it's an issue, but the problem is hard.

I think that the easiest way forward to get a result is to use gap coding, which might work well. Here is a paper which appears to do something related but more sophisticated.
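For concreteness, here is a minimal sketch of one simple flavor of gap coding: every distinct gap run (start, end) in the alignment becomes a binary presence/absence character that can be analyzed alongside the nucleotides. This is a simplified illustration, not the method from the linked paper; the function names are hypothetical, and it ignores the usual refinement of scoring a sequence as unknown when a longer gap spans a shorter one.

```python
import re

def gap_intervals(seq, gap="-"):
    """Set of half-open (start, end) intervals covering each run of gaps."""
    return {m.span() for m in re.finditer(re.escape(gap) + "+", seq)}

def gap_code(alignment, gap="-"):
    """Return (characters, matrix): one 0/1 character per distinct gap run."""
    per_seq = {name: gap_intervals(s, gap) for name, s in alignment.items()}
    characters = sorted(set().union(*per_seq.values()))
    matrix = {
        name: "".join("1" if iv in ivs else "0" for iv in characters)
        for name, ivs in per_seq.items()
    }
    return characters, matrix

# Toy alignment: the naive and t2 share a 3-column gap, t3 has a shorter one.
alignment = {"naive": "AC---GT", "t1": "ACTTAGT",
             "t2": "AC---GT", "t3": "ACT--GT"}
characters, matrix = gap_code(alignment)
```

Here `characters` comes out as `[(2, 5), (3, 5)]`, and the naive and `t2` are coded `10`, `t1` is `00`, and `t3` is `01`. The resulting binary columns could then be reconstructed on the tree like any other discrete character.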

@matsen matsen self-assigned this Feb 14, 2017
@matsen
Contributor

matsen commented Mar 26, 2017

I think we should move to PRANK.

  • PRANK has smart ways of dealing with fast and slow rates, and treats indels in a phylogenetic fashion (https://paperpile.com/shared/dSsCir)
  • PRANK can codon align (should be better than protein align and backtrans align)
  • PRANK can do ancestral sequence reconstruction where indels are treated "properly" (heuristically, but a lot better than treating gaps as missing data)

@metasoarous sorry, but I'm going to hand this off to you to try this out!

@matsen matsen assigned metasoarous and unassigned matsen Mar 26, 2017
@metasoarous
Member Author

Sorry!? Hah! Begone demon Phylip!

I'm on it.

@matsen
Contributor

matsen commented Mar 31, 2017

Er, sorry, but PRANK isn't going to infer the trees for you...

Now that I'm thinking about it, @cswarth had some funny experiences with the PRANK ancestral sequence reconstruction. Is that right, Chris?

@cswarth
Contributor

cswarth commented Mar 31, 2017

We used PRANK to infer ancestral sequences and trees for PREAST:

https://github.com/matsengrp/PREAST/blob/master/bin/infer.sh#L157

I don't recall the specifics of how it came up with a tree.

@metasoarous
Member Author

Thanks @cswarth

@matsen Looks like it can infer and spit out its own guide trees via the -showtree option, but perhaps they're not really to be trusted. Then again, in our situation, maybe they're just as trustworthy as anything else we're looking at.

In any case, if we supply our own trees, we can use PRANK just for the ancestral reconstruction and for cleaning up the alignment (obviously we'd already need an alignment to produce the input tree), and this would free us up to choose something saner for the tree construction, yes?

@metasoarous
Member Author

@matsen So how should we do this? If we want to use the ancestral sequences, we need to already have the final tree, so that the ancestral seqs correspond to that topology. But then what do we use for the alignment going into that tree? Do you think it's fine to just subset the big MUSCLE alignment and feed that into DNAML? Or is it worth taking those sequences and aligning them with a preliminary run of PRANK first, to get a better final tree? (And then do a second round of PRANK afterwards to get the ancestral sequences?)

@matsen
Contributor

matsen commented Apr 25, 2017

We could do either strategy. IIUC the alignment problem isn't especially hard, right? The challenge here is to get ancestral sequences on the tree in the presence of indels.

@metasoarous
Member Author

Ugh... well, here's some sour apples. PRANK complains with the likes of "Problem with the guidetee: brackets (79,79) and commas (210) don't match)" when we pass in a parsimony tree, presumably because we have multifurcations :-/ I guess I can translate the multifurcations into a series of bifurcations, and then infer back? I guess we'll see how our ML/parsimony battle pans out. Maybe we won't need to worry about this. For now I'll just try to get things working with ML.
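In case it's useful later, a sketch of the multifurcation-to-bifurcation translation, assuming a tiny hand-rolled `Node` class rather than any particular tree library (all names here are hypothetical). Each node with more than two children is folded into a ladder of binary nodes joined by zero-length branches, which should make the added nodes easy to collapse back afterwards. Whether PRANK is happy with zero-length branches is untested here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

def resolve_polytomies(node):
    """Recursively turn every node with >2 children into nested binary nodes."""
    for child in node.children:
        resolve_polytomies(child)
    while len(node.children) > 2:
        # Fold the last two children under a new unnamed zero-length node.
        right = node.children.pop()
        left = node.children.pop()
        node.children.append(Node(length=0.0, children=[left, right]))
    return node

def to_newick(node):
    """Serialize a subtree to Newick, without the trailing semicolon."""
    label = node.name
    if node.children:
        label = "(" + ",".join(to_newick(c) for c in node.children) + ")" + label
    if node.length is not None:
        label += ":" + format(node.length, "g")
    return label

# Toy example: a root with four children becomes a strictly binary ladder.
tree = Node(children=[Node(n, 0.1) for n in "abcd"])
newick = to_newick(resolve_polytomies(tree)) + ";"
```

For the toy tree this gives `(a:0.1,(b:0.1,(c:0.1,d:0.1):0):0);`. The order in which children get grouped is arbitrary, so the inferred sequences at the added zero-length nodes should not be read as real ancestors.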

@matsen
Contributor

matsen commented May 3, 2017

👍 for continuing with ML.

@metasoarous metasoarous removed this from the MB release milestone Jul 26, 2017
@metasoarous
Member Author

metasoarous commented Jul 26, 2017

As discussed elsewhere, PRANK's ancestral state reconstruction appears to be a joke (harhar...). As @krdav and I thoroughly demonstrated to ourselves, the internal node sequences do not match up with the tree. @krdav has opened an issue for this here: ariloytynoja/prank-msa#16.

For now, I'm going to put this issue on ice, in case they fix things or we find another way around the problem. I will, however, take it off the MB release milestone.

PS Thanks again for all your work on this @krdav!
