Ancestral sequence reconstruction #7
I sent this via email, but I'm not sure if it went through. Will reply directly here from now on. Also, I believe @wsdewitt has a way of reconstructing ancestral states through … The methods for this that I am familiar with are … Though I think …
Yeah, phylip can print the unobserved ancestral seqs, using either the max parsimony (dnapars) or max likelihood (dnaml) program. I ran your fasta through dnaml. The attached *outtree.txt contains the newick tree, and the associated *outfile.txt contains a bunch of output including an ascii tree with internal nodes numerically labelled, and a section beginning "Probable sequences at interior nodes" that shows the sequences of all nodes in a (somewhat annoyingly phylipesque) wrapped format.
Just to point out some alternate means of doing ASR that are almost certainly faster than …
RAxML will work. However, I didn't like the single-pass ancestral reconstruction it does. Its algorithm is to walk up the tree computing partial likelihoods and then take the state with the maximum partial likelihood for each node. When BEAST samples an ancestral state reconstruction, it does a forward/backward pass to compute partials and then samples from the root downwards. In TreeTime, Richard does a similar forward/backward to select the most likely root state, then updates partials and proceeds down the tree.
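To make the difference concrete, here's a minimal two-state sketch of getting a true marginal at an internal node by combining the upward partials from its subtree with a downward message through the root. The three-leaf tree, branch lengths, and symmetric binary model are all made up for illustration; this is not any of these programs' actual code.

```python
import math

def pmat(t):
    # Symmetric two-state (CFN) transition probabilities for branch length t
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return [[same, 1.0 - same], [1.0 - same, same]]

# Toy tree: root -> leaf A, root -> internal node N, N -> leaves B and C
obs = {"A": 0, "B": 1, "C": 1}                  # observed states at the leaves
bl = {"A": 0.1, "N": 0.3, "B": 0.2, "C": 0.2}   # branch lengths
prior = [0.5, 0.5]                              # stationary root distribution

def leaf_msg(name):
    # Likelihood vector a leaf sends to its parent: P(obs | parent state)
    P = pmat(bl[name])
    return [P[s][obs[name]] for s in (0, 1)]

def marginal_at_N():
    # Upward pass: partials at N from its own subtree (B and C) ...
    mB, mC = leaf_msg("B"), leaf_msg("C")
    up = [mB[s] * mC[s] for s in (0, 1)]
    # ... combined with a downward message through the root, which is what
    # makes the result depend on leaf A as well (the piece a single
    # bottom-up pass ignores).
    mA = leaf_msg("A")
    PN = pmat(bl["N"])
    down = [sum(prior[r] * mA[r] * PN[r][s] for r in (0, 1)) for s in (0, 1)]
    joint = [up[s] * down[s] for s in (0, 1)]
    z = sum(joint)
    return [j / z for j in joint]
```

With these numbers the marginal at N comes out heavily in favor of state 1; a bottom-up-only scheme would never have consulted leaf A at all.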
Thanks a bunch for the thoughts, @trvrb. Like you say in that issue, the bottom-up means of doing ASR does not compute the true marginal probabilities of the bases at each site. So, I guess bye-bye RAxML. I'm not completely sold on progressively selecting the ML state either, in this case with potentially long branch lengths. Given that we're trimming the tree now, I think we should be fine computationally using dnaml. I'm implicitly assuming that it calculates things based on marginal probabilities, but I'll just ask Joe. Another alternative would be to use bppancestor.
From Joe F himself:
Yes, Dnaml does ancestral sequence reconstruction (when you turn on the proper option) by pruning likelihood inwards toward that node from all directions (upward and downward) so as to have the result depend on the states at all tips, properly. I am startled to hear that RAxML does not do this, but perhaps RAxML had the partial results from the top-down pass already lying around in the nodes so it can take both into account. In that case, when using the same nucleotide substitution model, both programs should always get the same result, aside from rounding error. |
I just used the default: one category of site, constant rate. This is something we could play with. dnaml lets us specify the number of categories, and the model for rate variation: invariant, gamma, mixed gamma and invariant, general HMM (with patch size for autocorrelation of rates). |
Read the manual some and o_O. Things have progressed a little since the last dnaml update. Let's just use 4 gamma categories with alpha = 0.5. |
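For the record, the discrete-gamma scheme can be sketched numerically: draw from a mean-one gamma with alpha = 0.5, cut the draws into four equal-probability bins, and use each bin's mean as that category's rate. dnaml does its own discretization internally; the Monte Carlo below is only to show roughly what rates those settings imply.

```python
import random

random.seed(1)
alpha = 0.5                       # shape; scale 1/alpha makes the mean rate 1
draws = sorted(random.gammavariate(alpha, 1.0 / alpha)
               for _ in range(200000))

k = 4                             # number of rate categories
n = len(draws) // k
# Mean rate within each equal-probability bin
rates = [sum(draws[i * n:(i + 1) * n]) / n for i in range(k)]
```

With alpha = 0.5 the categories are very spread out (the slowest rate is near zero, the fastest around 3), which is the point of modeling rate variation here.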
@matsen Update: I've taken a look at the indels for seed 043 (Vh, late LN4 timepoint), and it looks like the indel sequences that were pulled out have largely the same V and J, while the CDR3s (i.e. the NTI and D genes) are very different. This case study makes me think that the indels are indeed not worth pursuing, since they are so different from the naive and seed sequences.
Recently, with @lauradoepker's qa013 data, there arose a question about branch length normalization, because @psathyrella's local branching metrics script was saying dnaml trees had max branch depth greater than 1, even though we thought dnaml was normalizing branch lengths by evolutionary distance. @afmagee's and @psathyrella's theory for why these trees were so long was basically:
This prompted a comparison of dnaml against raxml as a control, to see if raxml would optimize and arrive at a different answer for the transition/transversion ratio, which we were setting statically to 2 in dnaml. Andy suggested we compare the total lengths of the dnaml and raxml trees; they look very different:
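For total-length comparisons like this, summing branch lengths straight out of a newick string takes only a few lines of regex (a quick hack, assuming taxon labels never contain `:<number>` tokens):

```python
import re

def total_branch_length(newick):
    # Sum every ":<number>" branch-length token in the newick string
    return sum(float(x) for x in
               re.findall(r":(\d+\.?\d*(?:[eE][-+]?\d+)?)", newick))

total_branch_length("((A:0.1,B:0.2):0.3,C:0.4);")  # -> 1.0
```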
I'm posting this here because @psathyrella and I are not comfortable making an executive decision on tree building programs or parameters, and this seems like where the discussion left off re: choosing a method of ASR. cc @matsen @metasoarous
Great, nice summary. So it seems like we know that dnaml is giving us pathologically long trees in some significant fraction of cases. I think what we're not sure of is whether that means we should fiddle with dnaml parameters, use a different program, or not worry about it. Laura didn't seem terribly fussed about the issue when we told her, presumably since she's not using the tree lengths at this point. Of course even if that's the case, that doesn't mean six months from now she won't start using them, having at that point forgotten that they're terrible. The issue is also a problem for my lb metric stuff, since it effectively changes the tau value by a factor of 2 to 4. But I'm not really worried about it just now; in practice it just means my previous assumption that I should use the cft tree, since it's more accurate, isn't correct.
We also might worry about the topology itself. It's possible that the dnaml trees are just longer (by a factor of 2 or more in many cases). But if the branch lengths can be that far off, it seems worth worrying about the topology in general.
Thanks for putting this information together, @eharkins. Dnaml may give us wonky branch lengths, but unfortunately we have yet to find software that does as good a job at ancestral sequence reconstruction. And this is not for lack of trying (#130, #131, #149, #170). To summarize, we tested about half a dozen different programs, and they more or less all sucked for one reason or another. In some cases the results they output were clearly mangled. In others, they just didn't stack up to the simulation-based benchmarks that @krdav put together. So while I'd love nothing more than to use a different piece of software, we're likely best off sticking with dnaml for the time being. If we can get more reasonable branch lengths from dnaml by adjusting the model, I'm all for that.
How difficult is it going to be to add steps to the pipeline? There would need to be a step before running dnaml that estimates these parameters, and getting better parameter values is not trivial without a program that does tree inference. One could do some counting-based approximations to get in the ballpark for the transition-transversion ratio, but in all my exploration of hacky cheap ways to estimate ASRV parameters from the sequences without trees, I have never been particularly successful. If it is possible to insert something like RAxML into the pipeline, then decent parameter estimates are possible. In this case, dnaml does not need to do as much work to optimize the tree if the RAxML tree is fed in as a starting tree, or indeed any work if the RAxML tree is fed in and tree optimization is turned off, leaving dnaml only to evaluate ancestral states. |
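A counting-based ballpark estimate of the transition/transversion ratio along those lines could be as simple as the sketch below. It naively compares each sequence to the inferred naive sequence and ignores multiple hits at a site, so it only gets you in the ballpark:

```python
PURINES = {"A", "G"}

def ts_tv_ratio(naive, seqs):
    # Crude transition/transversion ratio from pairwise mismatch counts
    ts = tv = 0
    for seq in seqs:
        for a, b in zip(naive, seq):
            if a == b or a not in "ACGT" or b not in "ACGT":
                continue  # skip matches and ambiguous bases
            if (a in PURINES) == (b in PURINES):
                ts += 1   # A<->G or C<->T
            else:
                tv += 1
    return ts / tv if tv else float("inf")
```

Feeding dnaml a ratio in the right neighborhood, rather than a hard-coded 2, would at least catch samples where the default is badly off.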
Adding steps to the pipeline is not difficult. We could easily add a step to run RAxML for parameter estimates and/or starter/guide trees. I think if we were to change what we're doing here too much, it would be worth running @krdav's tests again (can you please point us to code for these, Kristian?). It might also be worth looking back through the software we tested in the past and seeing whether any of the more egregious bugs have been fixed in the programs we thought most promising but that ended up having issues in their output, like Prank.
While it's certainly not doing phylogenetics, the partis parameters have an awful lot of information about the mutation in the sample, and I'd find it hard to believe that we couldn't get a good enough estimate to at least avoid the kind of pathology we're seeing. And they'd already be sitting there, so it'd just be a matter of a few lines of python to read in what you need. |
Here are the comparisons between original dnaml runs and raxml:
Here are the comparisons between raxml-optimized parameters (t-ratio, alpha) dnaml runs and raxml:
In summary, the RF distance doesn't seem much improved, but the tree lengths are much closer to one another after optimizing parameters.
I'm going to close this issue in favor of #170 , which has a more entertaining title. |
David and I just created trees based on Duncan's analysis of the sequences I obtained recently. It would be really great if there were a way to reconstruct the sequences at internal nodes of the trees, i.e. ancestral sequences that we didn't directly observe. Any thoughts? 🤔