Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a flag to preserve the call.name field #660

Closed
mbookman opened this issue Sep 9, 2020 · 9 comments
Closed

Add a flag to preserve the call.name field #660

mbookman opened this issue Sep 9, 2020 · 9 comments

Comments

@mbookman
Copy link
Contributor

mbookman commented Sep 9, 2020

If not using the sample optimized tables, but only using the variant optimized tables, it would be nice for vcf_to_bq to produce a table with the name field in the call record instead of the sample_id.

One can translate the sample_id to the name, and the table with the sample_id is going to be smaller than one with the name (typically), but the SQL complexity and overhead of doing the translation does not seem worthwhile for all users.

Adding a flag, such as --preserve-call-name (boolean) would be one option. That said, it might be nice to allow users to specify what fields they want in the call record. For example:

--call-fields="name,genotype,DP,AD"

The tricky case would be around the sample_id and name and making the most common cases straight-forward. It would be nice not to have to specify a full list of fields just to pick up the call name instead of the sample_id, so I propose the following:

  • All fields, replace sample_id with name:

      --call-fields="name,*"
    
  • All fields, replace name with sample_id (the current default behavior):

      --call-fields="sample_id,*"
    
  • All fields, include both name and sample_id:

      --call-fields="sample_id,name,*"
    

I'm not really sure when someone would use the last option, so I'm not opposed to it being unavailable.

You could make the options SQL-like with the SELECT * EXCEPT syntax, but that feels a bit like overkill.

@samanvp
Copy link
Contributor

samanvp commented Sep 24, 2020

Thanks @mbookman for your suggestions.

I agree that offering a way to allow users to skip sample name hashing is an important use case. @arostamianfar also asked for this feature.

Your suggestion offers a lot of flexibility to users for running vcf_to_bq, however, we are avoiding variabilities in columns is because we don't want extra complexity in bq_to_vcf; we are expected to be able to correctly export variants back to VCF files for all those different combinations of possible columns.

How about this solution:
--sample_name_encoding in addition to the current two allowed values (WITHOUT_FILE_PATH and WITH_FILE_PATH) we allowed a third value, say None, which will not hash the sample names. In that case the type of call.sample_id column will be STRING. My guess is this will allow for a simpler bq_to_vcf implementation. @tneymanov do you agree with this statement?

@mbookman
Copy link
Contributor Author

Thanks @samanvp . That proposal sounds reasonable, but

  • The name of the column would be sample_id, rather than name
  • Creating the sample-optimized table would then be confusing as it requires encoded (numeric) sample IDs

I think there should be different column names for sample_id vs. name.

@tneymanov
Copy link
Collaborator

Hi @mbookman and @arostamianfar ,

We talked with @samanvp , and after some consideration, we thought that maybe the best solution would be to introduce a new flag (call it --duplicate_raw_sample_name or something along those lines) that would rather than modifying data in sample ID column, would instead introduce one more column in the sample record, containing raw name.

This approach would slightly increase storage costs, as call name/ID data would be duplicated in two separate columns.

However, benefits would include:

  • we would not need to manage vague field sample ID that could be both raw str or int hash, which is a plus both during coding and product clarity.
  • BQ to VCF flow will not need modification as we won't require users to indicate how the table was generated (as we still have sample_id and sample_info_table to get call names)
  • Philosophically, we're still pushing the usage of the sample IDs, and introduce this flag mostly for backwards compatibility until users fully migrate to v2 schema - we still believe this should not be the default behaviour of the tool.
  • Given that this solution is easier and shouldn't interfere with regular flows, maybe we can manage to finish and release it by Oct 7th (for @arostamianfar usecase).

What do you think?

@arostamianfar
Copy link
Contributor

Thanks, @tneymanov! I think that's a good suggestion. Perhaps even --store_raw_sample_name? And to clarify, would this be stored under call.name (similar to the old schema)?

@tneymanov
Copy link
Collaborator

Yes, most probably call.name, to keep consistency with v1 schema.

@mbookman
Copy link
Contributor Author

Hi @tneymanov.

I think this is a fine solution. For large tables (especially ones that don't drop hom-ref calls as discussed in #661), it could mean a fair amount of extra storage that might not be needed. However if you are cost sensitive for storage costs, you can reshape the table after load and remove the sample_id.

I'd suggest one of:

  • --preserve-call-name (or --preserve-sample-name)
  • --include-call-name (or --include-sample-name)

However, perhaps to turn this around a bit, have other users requested that the call.name be dropped? In the spirit of simplicity, perhaps the default behavior is to include both the name and sample_id and instead add a flag like --omit-call-name.

@arostamianfar
Copy link
Contributor

+1 to keeping call.name by default!

@tneymanov
Copy link
Collaborator

Thanks for the suggestions @mbookman !

Regarding the sample ID there were two reasons for its implementation:

  1. One of our customers asked us to implement a solution to differentiate between identical sample names that could be inputed from two different files during the VCF-to-BQ flow.
  2. When creating sample query optimized tables, we would need to partition on call name, which we could not do on string fields so they needed to be converted to integers first.

Regarding defaulting the behaviour to always set call.name in addition to sample_ids, it would increase the storage costs for the default run of the tool and would be a larger design change than just adding a flag.

@tneymanov
Copy link
Collaborator

This changed was pushed to master in PR #677 and will be included in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants