-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove slash from all data variable names #81
Comments
Hi Eric, no objection, sounds like several things will fit with xarray more
naturally if we avoid slash so happy to go with your suggestion.
…On Sat, 1 Aug 2020, 15:22 Eric Czech, ***@***.***> wrote:
We're currently naming variables like call/genotype, variant/id, sample/id,
etc., but I think we should switch to call_genotype, variant_id, sample_id,
etc.
The disadvantages of using the slashes are:
- Xarray stores these as separate Zarr groups which means you can't
load an sgkit dataset with single command. You have to instead do something
like this: ds = xr.merge([xr.open_zarr(path, group=g) for g in
['call', 'variant', 'sample']). There is no clear advantage to have
the variables split up on disk by this grouping. If they were instead
grouped by something more meaningful like contig, the partitioning would
make more sense but creating directories based on similar variables does
not.
- Assigning variables requires a kwargs splat rather than using the
simpler ds.assign(call_genotype=...) syntax, e.g. ds.assign(**{'call/genotype':
...})
- I've found that for some datasets, you can't pass custom Zarr
encodings to Xarry when variables have '/' in the name -- the bug has been
hard to reproduce on a small dataset so I'm not sure why yet.
- You cannot autocomplete variable names on a dataset instance
The only disadvantage I can see to not using the '/' is that it offers a
convenient delimiter for extracting the group name for a set of variables
like "variant" or "call". I don't think that's difficult to live without
and using underscore case is more common in other pydata projects anyhow.
@alimanfoo <https://github.com/alimanfoo> or @tomwhite
<https://github.com/tomwhite> do you have any objections to this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/81>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFLYQR2SL5NMCUEVY37SY3R6QQKTANCNFSM4PRZBAQQ>
.
|
@eric-czech +1 btw, have you thought about storing those strings as constants, so that we don't have to retype them? Example here: https://github.com/pystatgen/sgkit/blob/rav/api_dataset/sgkit/api.py#L51-L69 (doesn't have to be like that obviously). |
Yes and I don't want to use them: https://github.com/pystatgen/sgkit/issues/17 |
No objections from me. |
I'm going to push back slightly here, as using slashes is quite a natural way of grouping things, and communicates neatly to users that everything starting with Your points about mapping to Python variables are well taken though, which is why I'm only pushing back very, very gently! |
So, to me this comes down to the following question: do we ever imagine being able to do something like I realise that this doesn't necessarily map to how things work with xarray in practise, but we should at least keep the door open to this sort of structure in the future, if we think it might be useful. |
I think most (but certainly not all) of that structure is present though in the dimensions. Xarray doesn't have it unfortunately but kind of like multi-indexes in pandas or the
I think the structure is still accessible though I suppose it remains to be seen if some structure within a given set of dimensions will also be necessary. That could potentially be organized as variable groups in |
Sounds good @eric-czech - I just wanted to raise a flag here. |
File upstream? |
@hammer I would if I could see a way to generalize it beyond what's easy with a comprehension. I imagine they wouldn't want to add it without a stronger case and I'm coming up empty for one at the moment. |
Changed in https://github.com/pystatgen/sgkit/pull/83. |
We're currently naming variables like
call/genotype
,variant/id
,sample/id
, etc., but I think we should switch tocall_genotype
,variant_id
,sample_id
, etc.The disadvantages of using the slashes are:
ds = xr.merge([xr.open_zarr(path, group=g) for g in ['call', 'variant', 'sample'])
. There is no clear advantage to having the variables split up on disk by this grouping. If they were instead grouped by something more meaningful like contig, the partitioning would make more sense but creating directories based on similar variables does not.ds.assign(call_genotype=...)
syntax, e.g.ds.assign(**{'call/genotype': ...})
The only disadvantage I can see to not using the '/' is that it offers a convenient delimiter for extracting the group name for a set of variables like "variant" or "call". I don't think that's difficult to live without and using underscore case is more common in other pydata projects anyhow.
@alimanfoo or @tomwhite do you have any objections to this?
The text was updated successfully, but these errors were encountered: