encode_cf_datetime() casts dask arrays to NumPy arrays #3834

Closed
andersy005 opened this issue Mar 5, 2020 · 3 comments
Comments

@andersy005
Member

Currently, when xarray.coding.times.encode_cf_datetime() is called, it always casts the input to a NumPy array. This is not what I would expect when the input is a dask array. I am wondering if we could make this operation lazy when the input is a dask array?

The cast happens in this line of encode_cf_datetime():

dates = np.asarray(dates)

In [46]: import numpy as np

In [47]: import xarray as xr

In [48]: import pandas as pd

In [49]: times = pd.date_range("2000-01-01", "2001-01-01", periods=11)

In [50]: time_bounds = np.vstack((times[:-1], times[1:])).T

In [51]: arr = xr.DataArray(time_bounds).chunk()

In [52]: arr
Out[52]:
<xarray.DataArray (dim_0: 10, dim_1: 2)>
dask.array<xarray-<this-array>, shape=(10, 2), dtype=datetime64[ns], chunksize=(10, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0, dim_1

In [53]: xr.coding.times.encode_cf_datetime(arr)
Out[53]:
(array([[     0,  52704],
        [ 52704, 105408],
        [105408, 158112],
        [158112, 210816],
        [210816, 263520],
        [263520, 316224],
        [316224, 368928],
        [368928, 421632],
        [421632, 474336],
        [474336, 527040]]),
 'minutes since 2000-01-01 00:00:00',
 'proleptic_gregorian')

Cc @jhamman

@jhamman
Member

jhamman commented Mar 5, 2020

Thanks @andersy005 for the clear example. Looking at encode_cf_datetime(), it seems like we didn't intend to make this function dask friendly. Perhaps @spencerkclark or @shoyer have thoughts on the prospects of making this work. My guess is that this will be pretty tricky, mostly because in some cases, we infer the units/dtype/etc on the fly.
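To illustrate why on-the-fly inference is awkward for lazy arrays, here is a rough sketch of what inferring units involves. The helper `infer_units_sketch` is hypothetical and deliberately simplified; it is not xarray's implementation. It assumes `datetime64` input and only tries a few fixed units. The point is that both the reference date and the unit depend on every value in the array, so a dask array would have to be computed first.

```python
import numpy as np


def infer_units_sketch(dates):
    """Hypothetical, simplified unit inference (not xarray's actual code)."""
    # Every value is needed: sort the flattened data to find the earliest date.
    flat = np.sort(np.asarray(dates).astype("datetime64[ns]").ravel())
    reference = flat[0]
    # Offsets from the reference date, as integer nanoseconds.
    deltas = (flat - reference).astype("int64")
    candidates = [
        ("days", 86_400 * 10**9),
        ("hours", 3_600 * 10**9),
        ("minutes", 60 * 10**9),
        ("seconds", 10**9),
    ]
    # Pick the coarsest unit that represents all offsets exactly.
    for name, ns_per_unit in candidates:
        if np.all(deltas % ns_per_unit == 0):
            return name, reference
    return "nanoseconds", reference


# Same spacing as the example in this issue: 11 points across the year 2000.
dates = np.datetime64("2000-01-01") + np.arange(11) * np.timedelta64(52704, "m")
unit, reference = infer_units_sketch(dates)
print(unit, reference)
```

Both steps (finding the earliest date and checking divisibility of all offsets) are global reductions over the data, which is where laziness is lost.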

@shoyer
Member

shoyer commented Mar 6, 2020

Right, it casts to NumPy because it needs to look at the data to figure out units and calendar.

That said, if you pre-supply units and calendar there's no reason why it couldn't work with dask arrays. That would probably be a worthwhile refactor.
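As a minimal sketch of why pre-supplied units remove the obstacle: once the reference date and unit are fixed, encoding `datetime64` values is purely elementwise arithmetic. The helper `encode_with_known_units` below is hypothetical, not part of xarray, and assumes a standard (proleptic Gregorian) calendar with `datetime64` input. It is shown on a NumPy array; because each output value depends only on the matching input value, the same function could in principle be applied chunk by chunk to a dask array without computing it up front.

```python
import numpy as np


def encode_with_known_units(dates, reference, unit="m"):
    """Hypothetical helper: encode datetime64 values as counts of `unit`
    since `reference`. No inspection of the data is needed."""
    ref = np.datetime64(reference, "ns")
    delta = dates - ref  # elementwise timedelta64[ns]
    return delta / np.timedelta64(1, unit)  # elementwise float64 counts


# Three consecutive days, stored with nanosecond precision.
times = np.arange("2000-01-01", "2000-01-04", dtype="datetime64[D]").astype(
    "datetime64[ns]"
)
encoded = encode_with_known_units(times, "2000-01-01")  # minutes since 2000-01-01
print(encoded)
```

The sketch returns floats for simplicity; choosing an integer dtype safely would again require knowing that every value is exactly representable in the requested units.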

@dcherian
Contributor

Closed by #8575
