Time points #10

Closed
matsen opened this issue Nov 1, 2016 · 34 comments

@matsen (Contributor) commented Nov 1, 2016

This is an important point: many of these data sets have multiple time points, and we should do something smart with that information.

The first step would be to

  • take the union of all of the clusters across the various timepoints that contain a given sequence, while preserving and transmitting the timepoint information for downstream steps (a rough sketch of this merge is below)
  • build trees on the unioned clusters
  • color nodes by timepoint in the clustered tree
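
For concreteness, here's a minimal sketch of that first merge step, assuming clusters are available per timepoint as sets of sequence ids; the function and variable names (`union_clusters_for_seed`, `clusters_by_timepoint`, `timepoint_of`) are illustrative, not taken from the actual pipeline:

```python
# Hypothetical sketch: union every cluster (from any timepoint) containing a given
# seed sequence, keeping a seq-id -> timepoint map for coloring tree nodes later.
def union_clusters_for_seed(seed_id, clusters_by_timepoint):
    """clusters_by_timepoint: {timepoint: [set_of_seq_ids, ...], ...}"""
    unioned = set()
    timepoint_of = {}  # seq id -> timepoint it was observed in
    for timepoint, clusters in clusters_by_timepoint.items():
        for cluster in clusters:
            if seed_id in cluster:
                unioned |= set(cluster)
                for sid in cluster:
                    timepoint_of.setdefault(sid, timepoint)
    return unioned, timepoint_of
```

Tree building would then run on the unioned cluster, with `timepoint_of` supplying the node colors.
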
@matsen (Contributor, Author) commented Nov 29, 2016

I think that this would be a good next target after cleaning up what we have already.

@matsen (Contributor, Author) commented Feb 14, 2017

@metasoarous is going to be gearing up to work on this... @psathyrella, when you have a chance, could you run a subject with sequences from all timepoints merged and then let Chris know where that is?

metasoarous self-assigned this Feb 14, 2017
@metasoarous (Member) commented:

@psathyrella Also, we'll need some way to map back from sequences to timepoints, so I'm keen to hear your thoughts on the best way of doing that.

@psathyrella (Contributor) commented:

yeah, it's trivial to run it with things lumped together; it's just a matter of deciding how to label the sequences. I think it'd be safest and most sensible to make the time point information transparent to partis and preprocess the input fastas to prepend time point info to the sequence IDs. That would both make it obvious where each sequence comes from and guarantee unique ids even if there are duplicate ids across time points.
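
For illustration, a minimal sketch of that preprocessing, assuming plain FASTA input; the file names and the `<timepoint>-<old id>` format are made up, not an agreed convention:

```python
# Hypothetical preprocessing: prepend the timepoint to every sequence id in a fasta.
def prefix_timepoint(in_fasta, out_fasta, timepoint):
    with open(in_fasta) as fin, open(out_fasta, 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                fout.write('>%s-%s\n' % (timepoint, line[1:].strip()))
            else:
                fout.write(line)

# e.g. (illustrative file names): prefix_timepoint('LN1.fa', 'LN1-240dpi.fa', '240dpi')
```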

@metasoarous (Member) commented:

In general, my preference would be to work with a separate csv/json/whatever mapping sequence ids to timepoints. The problem with appending the timepoint is that we already have issues with the phylip output of dnaml/dnapars and seed sequence ids longer than 10 characters (oh, the joys), and appending timepoints will likely exacerbate the issue.

However, you make a good point about uniqueness across timepoints. Do we know for sure that there is the potential for such overlap? Assuming there is, I think we can go ahead with your idea, and we'll figure out how to maintain sanity on our end.

@psathyrella (Contributor) commented:

dear future duncan:

  • merge together time points into a new fasta where each sequence gets a new id of the form 0000000000, where each character can go from 0 --> z. Write an accompanying csv with columns for new uid, old uid, and time point. (A rough sketch of this is below.)
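
A rough sketch of what that merge could look like (the actual implementation ended up being datascripts/merge-timepoints.py, linked later in this thread); the padding width, column order, and function names here are illustrative only:

```python
# Illustrative sketch only -- the real script is datascripts/merge-timepoints.py.
# New ids are zero-padded base-36 strings ("0 --> z"), and a translation csv maps
# new uid -> old uid and time point (columns mirror the translation csvs shown
# later in this thread).
import csv
import string

ALPHABET = string.digits + string.ascii_lowercase  # 0-9 then a-z

def encode_uid(n, width=10):
    chars = []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return ''.join(reversed(chars)).rjust(width, '0')

def merge_timepoints(fastas, out_fasta, out_csv):
    """fastas: list of (dset, timepoint, fasta_path) tuples."""
    counter = 0
    with open(out_fasta, 'w') as fout, open(out_csv, 'w') as cout:
        writer = csv.writer(cout)
        writer.writerow(['dset', 'new', 'old', 'timepoint'])
        for dset, timepoint, path in fastas:
            with open(path) as fin:
                for line in fin:
                    if line.startswith('>'):
                        new_uid = encode_uid(counter)
                        counter += 1
                        writer.writerow([dset, new_uid, line[1:].strip(), timepoint])
                        fout.write('>%s\n' % new_uid)
                    else:
                        fout.write(line)
```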

@metasoarous (Member) commented:

@psathyrella I was just looking at #138 and #72, and it looks like maybe @lauranoges is looking back at the original sequences via the ids? Is this the case, @lauranoges (sorry; I know you're vacationing...)? Can you comment on whether or not you would need to do this if we tackle #72? The reason I'm asking is that, to solve some problems around how we deal with multiple timepoints, it might be helpful to generate new sequence names (the scheme @psathyrella mentions in his last comment on this thread), but if we do this it would potentially stymie your ability to look back at the original sequences via the names on the tree, because they would have changed. Thoughts?

@psathyrella (Contributor) commented:

She definitely uses the IDs, but I think it's only within releases, i.e. if we changed it everywhere (maybe including in the non-timepoints-mashed-together stuff) I think she wouldn't mind.

@psathyrella (Contributor) commented Feb 16, 2017

Laura says:

I can't access GitHub from here for some reason, but let me explain what I can:

The reason I want the ID and the link to the original sequence data is to 1) verify the actual MiSeq sequence in case this is an interesting case (this is not critical) and 2) look at the raw sequence so I can see the base pairs recorded before and after the VDJ sequence (very important). This second goal allows us to glean information about which constant region was used for this antibody (e.g. IgG1 vs IgG3, etc.).

Sorry for the delayed response. #Patagonia

@psathyrella (Contributor) commented Feb 18, 2017

ok, merged time points as discussed are here:

/fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs

edit: yeah, yeah, I know I need to do the next step, too, but patting yourself on the back is totally legit.

@metasoarous (Member) commented:

@psathyrella I take it those files have the input sequences pre-VDJ-trimming, correct? If so, then it seems we're fine to carry forward till we take care of #72.

Please don't hate me for having second thoughts on this, but I'm beginning to wonder if the timepoint prefixing wouldn't be better after all. Right now I'm working on a multiple-datasets feature (#123) that will make it possible for us to load up multiple data builds into the cftweb interface at once, so folks can compare results. Changing the names as we're doing could lead to different id names between these builds, making comparisons more difficult. The long names resulting from timepoint prefixing could mean a little extra work on our end making sure Phylip doesn't break, but perhaps it's worth it? As you pointed out, @psathyrella, it would be convenient if one could determine the timepoint from a sequence id by eye. And eventually we're going to swap out dnaml/dnapars for something more custom/tailored anyway, so it's perhaps a bit silly to cater too heavily to the asinine restrictions these tools present. Care to weigh in on this @wsdewitt?

@psathyrella (Contributor) commented:

yes, pre-trimming. It just fiddles with the original input fasta files (i.e. from Vlad).

For future reference, this is what does it: https://github.com/psathyrella/datascripts/blob/55898f2e959bf46aebb8092eb7c5b1a87a028d12/merge-timepoints.py

@metasoarous (Member) commented:

@psathyrella Where are we on running with all the timepoints merged?

@metasoarous (Member) commented:

Who is this Vlad anyway?

I just took a look, and I see that there's been some processing of v9, with a message in v9.txt about adding merged time points, but all the seed results appear to have the old deepseq ids in them. Am I just looking in the wrong place?

@psathyrella (Contributor) commented:

I ran some stuff, but then I got distracted by some other things. Will start the rest of the jobs now.

what's there is this:

./datascripts/run.py seed-partition --study kate-qrs --extra-str=v9 --dsets 1g --check

e.g.

/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v9/seeds/QB850.402-Vh/Hs-LN1-5RACE-IgG/partition.csv

@psathyrella (Contributor) commented:

I think I forgot to say: the merged time point data sets show up as lines with merged as the time point, e.g. the bottom ones here:

https://github.com/psathyrella/datascripts/blob/master/meta/kate-qrs-2016-09-09/meta.csv

So, in the final output files, they show up just like any other data set.

@metasoarous (Member) commented Mar 3, 2017

Erm... so I'm still not seeing the merged data in the final outputs. I do see directories along the lines of /fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v9/QA255-h-IgG, but nothing that looks like it's been processed out of them in seeds. Am I missing something? I tried to run it myself, but my datascripts fu is not up to snuff.

In any case, using a merged timepoint seems like a nice approach to generalize things there.

@psathyrella (Contributor) commented:

uh, isn't the merged data in the final output what you said you wanted when we chatted a few weeks ago?

No new stuff has finished, since slurm seems to be misbehaving, but the stuff listed above is still there.

@metasoarous (Member) commented:

Yes; what I'm saying is that the partition file you pointed to above appears to be unmerged.

@metasoarous (Member) commented:

I realized maybe the confusion we're having is over what we mean by "final output"? In my mind it's the partition.csv file and the run-viterbi-best-plus-*.csv files. I don't see any of these in the v9 directory that correspond to the merged data set (judging by the sequence names: merged ones would be either timepoint prefixed or alphanumerically encoded like 0000xy74jf). I see that there has been some processing done under the merged dataset names (again, such as v9/QA255-h-IgG, as mentioned above), but I'm not sure how you've been running things to get the partition and annotation csv files I've been ingesting. Are we misunderstanding each other somewhere in all this?

@psathyrella (Contributor) commented:

yeah, definitely, it's Hs-LN1-5RACE-IgG, not one of the merged ones.

no, that's correct, none of the merged ones have finished. v9 is a complete rerun, including unmerged. There were a number of things that needed fixing, although the only one that springs to mind was the indel/functionality problem.

@psathyrella (Contributor) commented:

Also, what was the error when you tried to run datascripts? This whole process would certainly be more efficient if you didn't have to wait for me to figure out which ones you actually wanted.

@metasoarous (Member) commented:

@psathyrella What would you think about modifying merge-timepoints.py so that it includes a timepoint column? That would make my life a little easier...

@psathyrella (Contributor) commented:

ok, added it.

For future reference, you can also get the timepoint using datascripts.heads, starting from the first (dset) column in the translation csvs:

# assuming heads has been imported, e.g.: from datascripts import heads
metafo = heads.read_metadata(study)
timepoint = metafo[dataset]['timepoint']  # dataset is the value from the dset column

@metasoarous (Member) commented:

@psathyrella Awesome! Thanks!

Except... it seems there's an issue in the merge. For all three seeds, the seed is showing up in one timepoint, and all of the other sequences in a second timepoint. It's not just the timepoint column either; the dset LN* tags seem to confirm that the problem is upstream of just grabbing the timepoint values. If I get a chance tonight, I'll take a look at this, but if I don't find anything would you mind taking a look?

@psathyrella (Contributor) commented:

well, the dataset and timepoint columns are just translations of each other, so the problem shouldn't be there.

if I take the top of this file
head -n100 /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG.fa |grep '>' |sed 's/>//' >/tmp/out

and grep in the translation file for the resulting uids (after replacing hard returns with \| and removing the seeds from /tmp/out with an editor here...):

grep "`cat /tmp/out`" /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG-translations.csv

I get a roughly even mix of time points:

Hs-LN1-5RACE-IgG,000000361w,147813-1,240dpi
Hs-LN1-5RACE-IgG,00000036ko,148489-1,240dpi
Hs-LN1-5RACE-IgG,0000004nld,217202-1,240dpi
Hs-LN1-5RACE-IgG,00000058bl,244066-1,240dpi
Hs-LN1-5RACE-IgG,0000005l2k,260589-1,240dpi
Hs-LN1-5RACE-IgG,0000006hkv,302720-1,240dpi
Hs-LN1-5RACE-IgG,0000007p00,358993-1,240dpi
Hs-LN1-5RACE-IgG,0000007wqf,369016-1,240dpi
Hs-LN1-5RACE-IgG,00000099ss,432605-1,240dpi
Hs-LN1-5RACE-IgG,000000a46y,471995-1,240dpi
Hs-LN1-5RACE-IgG,000000a6rd,475322-1,240dpi
Hs-LN1-5RACE-IgG,000000avlf,507508-1,240dpi
Hs-LN1-5RACE-IgG,000000dejw,625389-1,240dpi
Hs-LN4-5RACE-IgG,000000dynn,3592-17,1586dpi
Hs-LN4-5RACE-IgG,000000ec1a,20931-2,1586dpi
Hs-LN4-5RACE-IgG,000000fajn,65656-1,1586dpi
Hs-LN4-5RACE-IgG,000000fjfo,77177-1,1586dpi
Hs-LN4-5RACE-IgG,000000hgbt,166462-1,1586dpi
Hs-LN4-5RACE-IgG,000000hx48,188221-1,1586dpi
Hs-LN4-5RACE-IgG,000000iqfd,226206-1,1586dpi
Hs-LN4-5RACE-IgG,000000mam9,392342-1,1586dpi
Hs-LN4-5RACE-IgG,000000nnuk,456145-1,1586dpi
Hs-LN4-5RACE-IgG,000000q2d1,568266-1,1586dpi
Hs-LN4-5RACE-IgG,000000q5wr,572864-1,1586dpi
Hs-LN4-5RACE-IgG,000000qbz4,580725-1,1586dpi
Hs-LN4-5RACE-IgG,000000r4vs,618189-1,1586dpi
Hs-LN4-5RACE-IgG,000000r8qd,623178-1,1586dpi
Hs-LN4-5RACE-IgG,000000rep2,630907-1,1586dpi
Hs-LN4-5RACE-IgG,000000ruam,651123-1,1586dpi
Hs-LN4-5RACE-IgG,000000sdz4,676629-1,1586dpi
Hs-LN4-5RACE-IgG,000000skh7,685056-1,1586dpi
Hs-LN4-5RACE-IgG,000000sz8j,704184-1,1586dpi
Hs-LN4-5RACE-IgG,000000teg1,723894-1,1586dpi
Hs-LN4-5RACE-IgG,000000uiwk,776329-1,1586dpi
Hs-LN4-5RACE-IgG,000000v2sj,802104-1,1586dpi

but note that the .fa is randomly sorted (except that the seeds are at the top), so that partis can take e.g. the first 100k sequences and they'll be a mix of time points, whereas the translation file is sorted by time point because, uh, I didn't bother to randomize it; that should be ok? The seeds are listed under only one time point because otherwise they'd be in the translation file twice: by definition they're "in" both time points, since they come from a separate experiment on the same human and we look for them in both time points.
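
To make the ordering point concrete, here's a minimal sketch of shuffling a merged fasta while keeping the seeds at the top; `seed_ids` and the file names are placeholders, not part of the actual scripts:

```python
# Illustrative only: write seeds first, then everything else in random order,
# so that any "first N sequences" subset partis reads is a mix of time points.
import random

def read_fasta(path):
    records, name, seq = [], None, []
    with open(path) as fin:
        for line in fin:
            if line.startswith('>'):
                if name is not None:
                    records.append((name, ''.join(seq)))
                name, seq = line[1:].strip(), []
            else:
                seq.append(line.strip())
    if name is not None:
        records.append((name, ''.join(seq)))
    return records

def shuffle_with_seeds_first(in_fasta, out_fasta, seed_ids):
    records = read_fasta(in_fasta)
    seeds = [r for r in records if r[0] in seed_ids]
    others = [r for r in records if r[0] not in seed_ids]
    random.shuffle(others)
    with open(out_fasta, 'w') as fout:
        for name, seq in seeds + others:
            fout.write('>%s\n%s\n' % (name, seq))
```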

@metasoarous (Member) commented Mar 9, 2017

OK; I can confirm this (ps: csvuniq is here, and you should use it because it's awesome; oh, and alias seqinfo="seqmagick info"):

% csvuniq -zc dset,timepoint /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG-translations.csv | csvlook
|-------------------+-----------+---------|
|  dset             | timepoint | count   |
|-------------------+-----------+---------|
|  Hs-LN1-5RACE-IgG | 240dpi    | 647867  |
|  Hs-LN4-5RACE-IgG | 1586dpi   | 826327  |
|-------------------+-----------+---------|

% seqinfo /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG.fa
name                                                                                         alignment    min_len   max_len   avg_len  num_seqs
/fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG.fa FALSE            250       530    483.45   1474194

So you're right; it looks like all the data upstream of partis is accounted for. Let's take a look at the output:

% csvcut -c unique_ids /fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v9/seeds/QB850.091-Vh/QB850-h-IgG/run-viterbi-best-plus-2.csv | tail -n 1 | sed 's/:/\n/g' > 091-seqids

% csvgrep -c new -f 091-seqids /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG-translations.csv | csvuniq -zc dset,timepoint | csvlook
|-------------------+-----------+--------|
|  dset             | timepoint | count  |
|-------------------+-----------+--------|
|  Hs-LN1-5RACE-IgG | 240dpi    | 1      |
|  Hs-LN4-5RACE-IgG | 1586dpi   | 290    |
|-------------------+-----------+--------|

That 1 sequence at timepoint 240dpi is the seed, and the other seqs are... well... other seqs. I see the same thing with the two other seeds we ran this on.

So it appears partis or datascripts is doing something weird somewhere. It's matching up the seed exclusively with sequences from the other timepoint, but I'm at a complete loss as to why.

@metasoarous (Member) commented:

I just ran another sanity check on the upstream-of-partis (data merge) side. The question is: Are the sequence names corresponding to 240dpi somehow just off?

% csvgrep -c timepoint -m 240dpi /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG-translations.csv | csvcut -c new > 240dpi-seqids

% head 240dpi-seqids
new
QB850.402-Vh
QB850.144-Vh
QB850.022-Vh
QB850.043-Vh
QB850.048-Vh
QB850.049-Vh
QB850.424-Vh
QB850.001-Vh
QB850.417-Vh

% wc -l 240dpi-seqids
647868 240dpi-seqids

% seqmagick convert --include-from-file 240dpi-seqids /fh/fast/matsen_e/data/kate-qrs-2016-09-09/vlad-processed-data-with-seed-seqs/QB850-h-IgG.fa 240dpi-seqs.fa

% seqinfo 240dpi-seqs.fa
name           alignment    min_len   max_len   avg_len  num_seqs
240dpi-seqs.fa FALSE            250       530    481.00    647867

Everything looks ok :-/ Gotta be somewhere in the partis or datascripts processing, yeah?

@psathyrella (Contributor) commented:

well... it could just be that it doesn't find anything clonal to the seed in one of the time points. But I will investigate to see!

@psathyrella (Contributor) commented:

I'll keep poking; I don't see anything funky so far. Another possibility is that the translation files were changed after you ran that? In any case, a bunch of v10 is done, so you'd probably do better to work from that. Here's some finished v10 stuff:

./datascripts/run.py seed-partition --study kate-qrs --extra-str=v10 --dsets 1k:4k:qb850-k --check

QB850 240dpi  IgK  QB850.402-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.402-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.144-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.144-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.022-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.022-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.043-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.043-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.048-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.048-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.049-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.049-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.424-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.424-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.021-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.021-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.017-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.017-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.405-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.405-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.430-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.430-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 240dpi  IgK  QB850.091-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.091-Vk/Hs-LN1-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.402-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.402-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.144-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.144-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.022-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.022-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.043-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.043-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.048-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.048-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.049-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.049-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.424-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.424-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.021-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.021-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.017-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.017-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.405-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.405-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.430-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.430-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 1586dpi IgK  QB850.091-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.091-Vk/Hs-LN4-5RACE-IgK/partition.csv)
QB850 merged  IgK  QB850.402-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.402-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.144-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.144-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.022-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.022-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.043-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.043-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.048-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.048-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.049-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.049-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.424-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.424-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.021-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.021-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.017-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.017-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.405-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.405-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.430-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.430-Vk/QB850-k-IgK/partition.csv)
QB850 merged  IgK  QB850.091-Vk                       output exists, skipping (/fh/fast/matsen_e/processed-data/partis/kate-qrs-2016-09-09/v10/seeds/QB850.091-Vk/QB850-k-IgK/partition.csv)

@psathyrella (Contributor) commented:

Also, why was it we didn't put a string for the data set in the new sequence ids? Was it just that some downstream software can't handle long strings? It totally blows to debug this without being able to easily see which data set a sequence is from or which sequence it is.

Can I switch to using the data set shorthand (usually two characters, defined in the meta csv) plus the original sequence id? This would typically look like 1g-763444-1.

@metasoarous (Member) commented:

Yeah... there are issues with name clashing for names longer than 10 chars; I think the seq names currently max out at 8 chars. I suppose if we prefix with the dataset shorthand (2-char) without the separator, we should be ok (e.g. 1g763444-1), as long as we don't get any longer sequence ids in the future. @wsdewitt - we do have things error out if there's an overlap between ids in the dnaml/dnapars step, i.e. if we end up getting clashes, yes? Assuming that's the case, we can go forward with this and deal with issues if they come up (maybe thawing out #149).
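
(A hypothetical sanity check for this scheme, in case it's useful: prefix each original id with the 2-char shorthand and see whether any names would collide after truncation to phylip's 10-character limit. The function name and input structure here are illustrative, not existing code:)

```python
# Illustrative check: would shorthand-prefixed ids clash once truncated to 10 chars?
from collections import Counter

def phylip_clashes(ids_by_shorthand, limit=10):
    """ids_by_shorthand: {'1g': ['763444-1', ...], ...} -> list of clashing truncated names."""
    counts = Counter((shorthand + old_id)[:limit]
                     for shorthand, ids in ids_by_shorthand.items()
                     for old_id in ids)
    return [name for name, n in counts.items() if n > 1]
```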

If the seed only matches one of the timepoints, wouldn't we expect it to be from the same timepoint? I should also clarify that it wasn't just this one seed: all three that I looked at had this issue, and in each case the seed was in a different timepoint from the rest of its cluster. Seems fishy to me...

I'll try running on v10. Maybe this will just work itself out, or we'll see something different for some of the other seeds.

@metasoarous (Member) commented:

Woohoo! Just finished running on what there is of v10 and it looks like the problem (whatever it was) has been resolved! I've got this running now on http://stoat:5556.

I'm going to leave this branch separate till we finish getting the rest of the data built with multiple timepoints. Closing for now though since the code is where it should be.

@psathyrella (Contributor) commented:

for the sake of posterity...

If the seed only matches one of the timepoints, wouldn't we expect it to be of the same timepoint?

uh, no, because the seeds aren't from either time point -- they're from a separate experiment on the same human, and thus are best viewed as "from" both time points in the sense that we look for them in both time points. Maybe it'd be clearer if I put n/a for the seeds' time points in the csv.
