Time points #10
I think that this would be a good next target after cleaning up what we have already.
@metasoarous is going to be gearing up to work on this... @psathyrella, when you have a chance could you run a subject with sequences from all timepoints merged and then let Chris know where that is?
@psathyrella Also, we'll need some way to map back from sequences to timepoints, so I'm keen to hear your thoughts on the best way of doing that.
yeah, it's trivial to run it with things lumped together, it's just a matter of deciding how to label the sequences. I think it'd be safest and most sensible to make the time point information transparent to partis and preprocess the input fastas to prepend time point info to the sequence IDs. That'd both make it obvious where each sequence comes from, and guarantee unique ids even if there are duplicate ids across time points.
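A minimal sketch of the prefixing idea described above (the actual preprocessing lives in datascripts; the timepoint labels, ids, and "-" separator here are all hypothetical):

```python
# Sketch: prepend each sequence's timepoint to its fasta header so ids stay
# globally unique after merging. Labels, ids, and separator are made up.

def prefix_ids(fasta_lines, timepoint):
    """Yield fasta lines with the timepoint prepended to each header id."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield ">%s-%s" % (timepoint, line[1:])
        else:
            yield line

# both timepoints contain a sequence called "seq1", but the merged ids differ
merged = []
for tp, lines in [("d15", [">seq1", "ACGT"]), ("d240", [">seq1", "ACGG"])]:
    merged.extend(prefix_ids(lines, tp))
```

This also makes the timepoint readable by eye straight off the sequence id, which comes up again later in the thread.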
In general, my preference would be to work with a separate csv/json/whatever mapping sequence ids to timepoints. The problem with prepending the timepoint is that we already have issues with the phylip output of dnaml/dnapars and seed sequence ids longer than 10 characters (oh, the joys), and prepending timepoints will likely exacerbate the issue. However, you make a good point about uniqueness across timepoints. Do we know for sure that there is the potential for such overlap? Assuming there is, I think we can go ahead with your idea, and we'll figure out how to maintain sanity on our end.
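For contrast, a sketch of the separate-mapping alternative floated here, with hypothetical column names and ids (the real translation csvs that appear later in the thread use their own schema):

```python
# Sketch: leave sequence ids untouched and carry a separate id -> timepoint
# table alongside the fasta. Column names and ids are hypothetical.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["unique_id", "timepoint"])
writer.writerows([("seq1", "d15"), ("seq2", "d15"), ("seq3", "d240")])

buf.seek(0)
timepoint_of = {row["unique_id"]: row["timepoint"] for row in csv.DictReader(buf)}
```

The trade-off discussed above: this keeps ids short for phylip, but gives up uniqueness guarantees when the same id occurs in two timepoints.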
dear future duncan:
@psathyrella I was just looking at #138 and #72 and it looks like maybe @lauranoges is looking back at the original sequences via the ids? Is this the case @lauranoges (sorry; I know you're vacationing...)? Can you comment on whether or not you would need to do this if we tackle #72? The reason I'm asking is that to solve some problems around how we deal with multiple timepoints, it might be helpful to generate new sequence names (the scheme @psathyrella mentions in his last comment of this thread), but if we do this it would potentially stymie your ability to look back at the original sequences via the names on the tree, because they would have changed. Thoughts?
She definitely uses the IDs, but I think it's only within releases, i.e. if we changed it everywhere (maybe including in the non-timepoints-mashed-together stuff) I think she wouldn't mind.
Laura says: I can't access the github from here for some reason, but let me explain what I can: The reason I want the ID and the link to the original sequence data is to 1) verify the actual MiSeq sequence in case this is an interesting case (this is not critical) and 2) look at the raw sequence so I can see the base pairs recorded before and after the VDJ sequence (very important). This second goal allows us to glean information about which constant region was used for this antibody (i.e. IgG1 vs IgG3, etc). Sorry for the delayed response. #Patagonia
ok, merged time points as discussed are here. edit: yeah, yeah, I know I need to do the next step, too, but patting yourself on the back is totally legit. |
@psathyrella I take it those files have the input sequence pre-VDJ-trimming, correct? If so then it seems we're fine to carry forward till we take care of #72.

Please don't hate me for having second thoughts on this, but I'm beginning to wonder if the timepoint prefixing wouldn't be better after all. Right now I'm working on a multiple datasets feature (#123) that will make it possible for us to load up multiple data builds into the cftweb interface at once, so folks can compare results. Changing the names as we're doing could lead to different id names between these builds, making comparisons more difficult. The long names resulting from timepoint prefixing could possibly mean a little extra work on our end making sure Phylip doesn't break, but perhaps it's worth it? As you pointed out @psathyrella, it would be convenient if one could determine the timepoint looking at a sequence id by eye. And eventually, we're going to swap out dnaml/dnapars for something more custom/tailored anyway, so it's perhaps a bit silly to cater too heavily to the asinine restrictions these tools present.

Care to weigh in on this @wsdewitt?
yes, pre-trimming. It just fiddles with the original input fasta files (i.e. from vlad). For future reference, this is what does it: https://github.com/psathyrella/datascripts/blob/55898f2e959bf46aebb8092eb7c5b1a87a028d12/merge-timepoints.py
@psathyrella Where are we on running with all the timepoints merged? |
Who is this Vlad anyway? I just took a look, and I see that there's been some processing of
I ran some stuff, but then I got distracted by some other things. Will start the rest of the jobs now. what's there is this:
e.g.
I think I forgot to say: the merged time points data sets show up as lines in https://github.com/psathyrella/datascripts/blob/master/meta/kate-qrs-2016-09-09/meta.csv so, in the final output files, they show up just like any other data set.
Erm... So I'm still not seeing the merged data in the final outputs. I do see the directories along the lines of In any case, using a merged timepoint seems like a nice approach to generalize things there.
uh, isn't the merged data in the final output what you said you wanted when we chatted a few weeks ago? No new stuff has finished, since slurm seems to be misbehaving, but the stuff listed above is still there.
Yes; what I'm saying is that the partition file you pointed to above appears to be unmerged.
Realized maybe the confusion we're having is what we mean by "final output"? In my mind it's the
yeah, definitely, it's... no, that's correct, none of the merged ones have finished. v9 is a complete rerun, including unmerged. There were a number of things that needed fixing, although the only one that springs to mind was the indel/functionality problem.
Also, what was the error when you tried to run datascripts? This whole process would certainly be more efficient if you didn't have to wait for me to figure out which ones you actually wanted.
@psathyrella What would you think about modifying
ok, added it. For future reference, you can also get the timepoint using datascripts.heads from the first (dset) column in the translation csvs
@psathyrella Awesome! Thanks! Except... it seems there's an issue in the merge. For all three seeds, the seed is showing up in one timepoint, and all of the other sequences in a second timepoint. It's not just the timepoint column either; the dset
well, the dataset and timepoint columns are just translations of each other, so the problem shouldn't be there. if I take the top of this file and grep in the translation file for the resulting uids (replacing hard returns with
I get a roughly even mix of time points:
but note that the .fa is randomly sorted (except the seeds are at the top), so that partis can take the first e.g. 100k sequences and they'll be a mix of time points, whereas the translation file is sorted by time point because, uh, I didn't bother to randomize it, and that should be ok? The seeds are listed as only one time point because otherwise they'd be in the translation file twice, since by definition they're "in" both time points, since they're from a separate experiment on the human from both time points.
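The "roughly even mix" check described above can be sketched as a tally over the translation table; the ids and timepoint labels below are made up for illustration:

```python
# Sketch: tally timepoints for the uids at the top of the shuffled merged .fa
# using the id -> timepoint translation table. All names are hypothetical.
from collections import Counter

translation = {"a1": "d15", "a2": "d240", "a3": "d240", "a4": "d15"}
top_uids = ["a2", "a1", "a4", "a3"]  # first few ids from the shuffled fasta

mix = Counter(translation[uid] for uid in top_uids)
# a roughly even count per timepoint is what the shuffle should produce
```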
OK; I can confirm this (ps: csvuniq is here, and you should use it cause it's awesome; oh and
So you're right, it looks like all the data upstream of partis is accounted for. So let's take a look at the output:
That 1 sequence at timepoint... So it appears partis or datascripts is doing something weird somewhere. It's matching up the seed exclusively with sequences from the other timepoint, but I'm at a complete loss as to why.
I just ran another sanity check on the upstream-of-partis (data merge) side. The question is: Are the sequence names corresponding to 240dpi somehow just off?
Everything looks ok :-/ Gotta be somewhere in the partis or datascripts processing, yeah?
well... it could just be that it doesn't find anything clonal to the seed in one of the time points. But I will investigate to see!
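The upstream sanity check described here, reduced to a sketch: every merged record, keyed by its prefixed id, should match the record in the original per-timepoint fasta. All names and sequences below are hypothetical, and the "-" separator is an assumption:

```python
# Sketch: verify each merged record against the original per-timepoint fasta.
# Timepoints, ids, sequences, and separator are all made up.
original = {"d15": {"seq1": "ACGT"}, "d240": {"seq1": "ACGG"}}
merged = {"d15-seq1": "ACGT", "d240-seq1": "ACGG"}

def check_merge(original, merged):
    """Return True iff every merged record matches its source record."""
    for merged_id, seq in merged.items():
        tp, orig_id = merged_id.split("-", 1)
        if original[tp][orig_id] != seq:
            return False
    return True

ok = check_merge(original, merged)
```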
I'll keep poking, I don't see anything funky so far. Another possibility is that the translation files were changed after you ran that? In any case a bunch of v10 is done, so you'd probably do better to work from that. Here's some finished v10 stuff: ./datascripts/run.py seed-partition --study kate-qrs --extra-str=v10 --dsets 1k:4k:qb850-k --check
Also, why was it we didn't put a string for the data set in the new sequence ids? Was it just that some downstream software can't handle long strings? It totally blows, when debugging, to not be able to easily see which data set a sequence is from or what sequence it is. Can I switch to using the data set shorthand (usually two characters, defined in the meta csv) plus the original sequence id? This would typically look like
Yeah... there are issues with name clashing for names longer than 10 chars. I think the seq names max out at 8 chars. I suppose if we prefix with the dataset shorthand (2-char) without the separator, we should be ok (e.g

If the seed only matches one of the timepoints, wouldn't we expect it to be of the same timepoint? I should also clarify that it wasn't just this one seed, but all three that I looked at that had this issue, and in each case, the seed was in a different timepoint. Seems fishy to me...

I'll try running on v10. Maybe this will just work itself out, or we'll see something different for some of the other seeds.
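The arithmetic behind the naming constraint above, as a sketch: phylip-format tools truncate or clash on ids past 10 characters, so a 2-char shorthand with no separator plus a typical 8-char sequence id lands exactly at the limit. The shorthand and id below are hypothetical:

```python
# Sketch: enforce the ~10-char phylip id limit when building
# shorthand-prefixed ids. Shorthand and sequence id are made up.
PHYLIP_MAX_ID_LEN = 10

def shorthand_id(shorthand, seq_id, max_len=PHYLIP_MAX_ID_LEN):
    """Join a dataset shorthand and a sequence id, refusing over-long results."""
    new_id = shorthand + seq_id
    if len(new_id) > max_len:
        raise ValueError("id %r is %d chars, over the %d-char limit"
                         % (new_id, len(new_id), max_len))
    return new_id

sid = shorthand_id("1k", "abcd1234")  # 2 + 8 = 10: just fits
```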
Woohoo! Just finished running on what there is of v10 and it looks like the problem (whatever it was) has been resolved! I've got this running now on http://stoat:5556. I'm going to leave this branch separate till we finish getting the rest of the data built with multiple timepoints. Closing for now though since the code is where it should be. |
for the sake of posterity...
uh, no, because the seeds aren't from either time point -- they're from a separate experiment on the same human, and thus are best viewed as "from" both time points in the sense that we look for them in both time points. Maybe it'd be clearer if I put n/a for the seeds' time points in the csv.
This is an important point: many of these data sets have multiple time points, and we should do something smart with that information.
The first step would be to