-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a flag to preserve the call.name field #660
Comments
Thanks @mbookman for your suggestions. I agree that offering a way to allow users to skip sample name hashing is an important use case. @arostamianfar also asked for this feature. Your suggestion offers a lot of flexibility to users for running vcf_to_bq, however, we are avoiding variabilities in columns is because we don't want extra complexity in bq_to_vcf; we are expected to be able to correctly export variants back to VCF files for all those different combinations of possible columns. How about this solution: |
Thanks @samanvp . That proposal sounds reasonable, but
I think there should be different column names for |
Hi @mbookman and @arostamianfar , We talked with @samanvp , and after some consideration, we thought that maybe the best solution would be to introduce a new flag (call it --duplicate_raw_sample_name or something along those lines) that would rather than modifying data in sample ID column, would instead introduce one more column in the sample record, containing raw name. This approach would slightly increase storage costs, as call name/ID data would be duplicated in two separate columns. However, benefits would include:
What do you think? |
Thanks, @tneymanov! I think that's a good suggestion. Perhaps even |
Yes, most probably call.name, to keep consistency with v1 schema. |
Hi @tneymanov. I think this is a fine solution. For large tables (especially ones that don't drop hom-ref calls as discussed in #661), it could mean a fair amount of extra storage that might not be needed. However if you are cost sensitive for storage costs, you can reshape the table after load and remove the sample_id. I'd suggest one of:
However, perhaps to turn this around a bit, have other users requested that the |
+1 to keeping |
Thanks for the suggestions @mbookman ! Regarding the sample ID there were two reasons for its implementation:
Regarding defaulting the behaviour to always set call.name in addition to sample_ids, it would increase the storage costs for the default run of the tool and would be a larger design change than just adding a flag. |
This changed was pushed to master in PR #677 and will be included in the next release. |
If not using the sample optimized tables, but only using the variant optimized tables, it would be nice for
vcf_to_bq
to produce a table with thename
field in thecall
record instead of thesample_id
.One can translate the
sample_id
to thename
, and the table with thesample_id
is going to be smaller than one with thename
(typically), but the SQL complexity and overhead of doing the translation does not seem worthwhile for all users.Adding a flag, such as
--preserve-call-name
(boolean) would be one option. That said, it might be nice to allow users to specify what fields they want in the call record. For example:The tricky case would be around the
sample_id
andname
and making the most common cases straight-forward. It would be nice not to have to specify a full list of fields just to pick up the callname
instead of thesample_id
, so I propose the following:All fields, replace
sample_id
withname
:All fields, replace
name
withsample_id
(the current default behavior):All fields, include both
name
andsample_id
:I'm not really sure when someone would use the last option, so I'm not opposed to it being unavailable.
You could make the options SQL-like with the SELECT * EXCEPT syntax, but that feels a bit like overkill.
The text was updated successfully, but these errors were encountered: