Add a flag to preserve the call.name field #660

mbookman · 2020-09-09T00:23:11Z

If not using the sample optimized tables, but only using the variant optimized tables, it would be nice for vcf_to_bq to produce a table with the name field in the call record instead of the sample_id.

One can translate the sample_id to the name, and the table with the sample_id is going to be smaller than one with the name (typically), but the SQL complexity and overhead of doing the translation does not seem worthwhile for all users.

Adding a flag, such as --preserve-call-name (boolean) would be one option. That said, it might be nice to allow users to specify what fields they want in the call record. For example:

--call-fields="name,genotype,DP,AD"

The tricky case would be around the sample_id and name and making the most common cases straight-forward. It would be nice not to have to specify a full list of fields just to pick up the call name instead of the sample_id, so I propose the following:

All fields, replace sample_id with name:
```
  --call-fields="name,*"
```
All fields, replace name with sample_id (the current default behavior):
```
  --call-fields="sample_id,*"
```
All fields, include both name and sample_id:
```
  --call-fields="sample_id,name,*"
```

I'm not really sure when someone would use the last option, so I'm not opposed to it being unavailable.

You could make the options SQL-like with the SELECT * EXCEPT syntax, but that feels a bit like overkill.

The text was updated successfully, but these errors were encountered:

samanvp · 2020-09-24T18:48:25Z

Thanks @mbookman for your suggestions.

I agree that offering a way to allow users to skip sample name hashing is an important use case. @arostamianfar also asked for this feature.

Your suggestion offers a lot of flexibility to users for running vcf_to_bq, however, we are avoiding variabilities in columns is because we don't want extra complexity in bq_to_vcf; we are expected to be able to correctly export variants back to VCF files for all those different combinations of possible columns.

How about this solution:
--sample_name_encoding in addition to the current two allowed values (WITHOUT_FILE_PATH and WITH_FILE_PATH) we allowed a third value, say None, which will not hash the sample names. In that case the type of call.sample_id column will be STRING. My guess is this will allow for a simpler bq_to_vcf implementation. @tneymanov do you agree with this statement?

mbookman · 2020-09-24T20:26:51Z

Thanks @samanvp . That proposal sounds reasonable, but

The name of the column would be sample_id, rather than name
Creating the sample-optimized table would then be confusing as it requires encoded (numeric) sample IDs

I think there should be different column names for sample_id vs. name.

tneymanov · 2020-09-24T21:20:32Z

Hi @mbookman and @arostamianfar ,

We talked with @samanvp , and after some consideration, we thought that maybe the best solution would be to introduce a new flag (call it --duplicate_raw_sample_name or something along those lines) that would rather than modifying data in sample ID column, would instead introduce one more column in the sample record, containing raw name.

This approach would slightly increase storage costs, as call name/ID data would be duplicated in two separate columns.

However, benefits would include:

we would not need to manage vague field sample ID that could be both raw str or int hash, which is a plus both during coding and product clarity.
BQ to VCF flow will not need modification as we won't require users to indicate how the table was generated (as we still have sample_id and sample_info_table to get call names)
Philosophically, we're still pushing the usage of the sample IDs, and introduce this flag mostly for backwards compatibility until users fully migrate to v2 schema - we still believe this should not be the default behaviour of the tool.
Given that this solution is easier and shouldn't interfere with regular flows, maybe we can manage to finish and release it by Oct 7th (for @arostamianfar usecase).

What do you think?

arostamianfar · 2020-09-24T21:55:41Z

Thanks, @tneymanov! I think that's a good suggestion. Perhaps even --store_raw_sample_name? And to clarify, would this be stored under call.name (similar to the old schema)?

tneymanov · 2020-09-24T22:00:56Z

Yes, most probably call.name, to keep consistency with v1 schema.

mbookman · 2020-09-24T22:10:11Z

Hi @tneymanov.

I think this is a fine solution. For large tables (especially ones that don't drop hom-ref calls as discussed in #661), it could mean a fair amount of extra storage that might not be needed. However if you are cost sensitive for storage costs, you can reshape the table after load and remove the sample_id.

I'd suggest one of:

--preserve-call-name (or --preserve-sample-name)
--include-call-name (or --include-sample-name)

However, perhaps to turn this around a bit, have other users requested that the call.name be dropped? In the spirit of simplicity, perhaps the default behavior is to include both the name and sample_id and instead add a flag like --omit-call-name.

arostamianfar · 2020-09-24T23:43:56Z

+1 to keeping call.name by default!

tneymanov · 2020-09-29T18:56:13Z

Thanks for the suggestions @mbookman !

Regarding the sample ID there were two reasons for its implementation:

One of our customers asked us to implement a solution to differentiate between identical sample names that could be inputed from two different files during the VCF-to-BQ flow.
When creating sample query optimized tables, we would need to partition on call name, which we could not do on string fields so they needed to be converted to integers first.

Regarding defaulting the behaviour to always set call.name in addition to sample_ids, it would increase the storage costs for the default run of the tool and would be a larger design change than just adding a flag.

tneymanov · 2020-10-01T19:05:49Z

This changed was pushed to master in PR #677 and will be included in the next release.

tneymanov closed this as completed Oct 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a flag to preserve the call.name field #660

Add a flag to preserve the call.name field #660

mbookman commented Sep 9, 2020

samanvp commented Sep 24, 2020

mbookman commented Sep 24, 2020

tneymanov commented Sep 24, 2020

arostamianfar commented Sep 24, 2020

tneymanov commented Sep 24, 2020

mbookman commented Sep 24, 2020

arostamianfar commented Sep 24, 2020

tneymanov commented Sep 29, 2020

tneymanov commented Oct 1, 2020

Add a flag to preserve the call.name field #660

Add a flag to preserve the call.name field #660

Comments

mbookman commented Sep 9, 2020

samanvp commented Sep 24, 2020

mbookman commented Sep 24, 2020

tneymanov commented Sep 24, 2020

arostamianfar commented Sep 24, 2020

tneymanov commented Sep 24, 2020

mbookman commented Sep 24, 2020

arostamianfar commented Sep 24, 2020

tneymanov commented Sep 29, 2020

tneymanov commented Oct 1, 2020