
xr.DataSet.from_dataframe / xr.DataArray.from_series does not preserve DateTimeIndex with timezone #3291

fjanoos opened this issue Sep 7, 2019 · 4 comments

fjanoos commented Sep 7, 2019

Problem Description

When using Dataset.from_dataframe (or DataArray.from_series) to convert a pandas DataFrame whose DatetimeIndex has a timezone, xarray converts the datetimes into nanosecond integers rather than keeping a datetime index type.

MCVE Code Sample

print( df.index ) 
DatetimeIndex(['2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               '2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               ...
               '2019-08-20 16:00:00-05:00', '2019-08-20 16:00:00-05:00'],
              dtype='datetime64[ns, EST]', name='time', length=12713014, freq=None)
ds = xr.Dataset.from_dataframe( df.head( 1000 ) )
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array([946933200000000000, 947019600000000000, 947106000000000000,
       947192400000000000, 947278800000000000, 947538000000000000,
       947624400000000000, ...], dtype=object)
Coordinates:
  * time     (time) object 946933200000000000 ... 947624400000000000

Expected Output

After removing the tz localization from the DataFrame's DatetimeIndex, the conversion to a Dataset preserves the time index (without converting it to nanoseconds):

df.index = df.index.tz_convert('UTC').tz_localize(None)
ds = xr.Dataset.from_dataframe( df.head(1000) )
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array(['2000-01-03T21:00:00.000000000', '2000-01-04T21:00:00.000000000',
       '2000-01-05T21:00:00.000000000', '2000-01-06T21:00:00.000000000',
       '2000-01-07T21:00:00.000000000', '2000-01-10T21:00:00.000000000',
       '2000-01-11T21:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-03T21:00:00 ... 2000-01-11T21:00:00
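For reference, here is a self-contained sketch of the failure and the workaround. The frame, column name, and timestamps are illustrative, not taken from the original report:

```python
import pandas as pd
import xarray as xr

# Illustrative frame with a timezone-aware DatetimeIndex (fixed-offset EST)
idx = pd.date_range("2000-01-03 16:00", periods=3, freq="D", tz="EST")
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=idx.rename("time"))

# Direct conversion does not yield a datetime64[ns] coordinate; the
# tz-aware index falls back to object dtype (or integers in old pandas)
ds_bad = xr.Dataset.from_dataframe(df)
print(ds_bad["time"].dtype)

# Workaround: normalize to UTC and drop the tz before converting
df.index = df.index.tz_convert("UTC").tz_localize(None)
ds_ok = xr.Dataset.from_dataframe(df)
print(ds_ok["time"].dtype)  # datetime64[ns]
```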

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.12.3+81.g41fecd86
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.8.0
pip: 19.0.3
conda: 4.7.11
pytest: 4.3.1
IPython: 7.4.0
sphinx: 1.8.5

shoyer (Member) commented Sep 15, 2019

You should be getting a warning about this if you use the latest version of pandas. In the future, this behavior will change to return an object dtype array full of pandas Timestamp objects. Unfortunately, NumPy doesn't have a built-in timezone-aware datetime dtype, so this is about the best we can do.
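To illustrate the point (a sketch, not taken from the issue): in recent pandas, converting a tz-aware index to a NumPy array falls back to an object array of Timestamp objects, precisely because NumPy's datetime64 cannot carry a timezone:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-03", periods=2, tz="US/Eastern")

# NumPy's datetime64 has no timezone concept, so recent pandas
# returns an object array of pd.Timestamp values instead
arr = np.asarray(idx)
print(arr.dtype)  # object
```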

scottyhq (Contributor) commented Apr 21, 2021

Just wanted to rekindle discussion here and ping @dcherian and @benbovy. The current workaround for a pandas DatetimeIndex with timezone info (dtype='datetime64[ns, EST]') is to drop the timezone, or to use to_index(), operate in pandas, and then reassign the time coordinate: see #1036 and #3163.

If I'm following https://github.com/pydata/xarray/blob/master/design_notes/flexible_indexes_notes.md, this is another potential example of improved user-friendliness: we could have timezone-aware indexes and therefore call pandas methods like pandas.core.indexes.datetimes.DatetimeIndex.tz_convert() directly as a DataArray method?

This would definitely be great for remote sensing data that is usually stored with UTC timestamps, but often analysis requires converting to local time.
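A sketch of that round-trip workaround; the dataset, timestamps, and timezones here are made up for illustration:

```python
import pandas as pd
import xarray as xr

# Toy dataset with naive UTC timestamps, as remote sensing data is often stored
ds = xr.Dataset(
    {"x": ("time", [1, 2, 3])},
    coords={"time": pd.date_range("2021-04-21", periods=3, freq="h")},
)

# Drop to pandas, localize as UTC, convert to local time, then strip
# the tz again so the coordinate stays plain datetime64[ns]
local = (
    ds.indexes["time"]
    .tz_localize("UTC")
    .tz_convert("US/Pacific")
    .tz_localize(None)
)
ds = ds.assign_coords(time=local)
print(ds["time"].values[0])  # shifted by the UTC offset (UTC-7 in April)
```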

dcherian (Contributor) commented

I am confused about the following point after reading the indexing refactor design notes on removing IndexVariable.

If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

shoyer (Member) commented Apr 21, 2021

If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

No, unfortunately it is not possible to use a pandas.Index directly inside Variable.data, because pandas.Index is not compatible with the NumPy array API -- in particular, it is stuck with 1D data. Instead, we will need to wrap the array in some adapter class to make it compatible. Ideally this wrapper would be a fully N-dimensional wrapper for pandas.Series objects, but for a first pass it would probably be fine to raise an error if indexing would create a higher-dimensional array.

The bigger issue is that code elsewhere in Xarray probably needs updates to avoid assuming that all dtype objects are numpy.dtype instances.
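As a rough illustration of such an adapter (purely hypothetical, not xarray's actual implementation), one could wrap a 1D pandas.Index to present an array-like interface. Note that its dtype may be a pandas ExtensionDtype rather than a numpy.dtype, which is exactly the assumption that would need auditing:

```python
import numpy as np
import pandas as pd

class PandasIndexAdapter:
    """Hypothetical sketch: wrap a 1D pandas.Index so it quacks like an array."""

    def __init__(self, index: pd.Index):
        self._index = index

    @property
    def dtype(self):
        # May be a pandas ExtensionDtype (e.g. datetime64[ns, tz]),
        # not a numpy.dtype instance
        return self._index.dtype

    @property
    def shape(self):
        return (len(self._index),)

    def __getitem__(self, key):
        result = self._index[key]
        if isinstance(result, pd.Index):
            return PandasIndexAdapter(result)
        return result  # scalar, e.g. a tz-aware pd.Timestamp

    def __array__(self, dtype=None):
        return np.asarray(self._index, dtype=dtype)

idx = pd.date_range("2000-01-03", periods=3, tz="US/Eastern")
wrapped = PandasIndexAdapter(idx)
print(wrapped.shape, wrapped.dtype)
```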

BaptisteVandecrux added a commit to GEUS-Glaciology-and-Climate/pypromice that referenced this issue Dec 18, 2023
While trying to prescribe an open-ended adjustment, I noticed that it currently causes an error.

When we read the start and end times of adjustments, they receive time zone info due to their ISO format. The AWS xarray dataset does not have time zone info (because of an [xarray limitation](pydata/xarray#3291)), so the time zone info is removed from the adjustments' time bounds (l.183-184).

What was missing is that when the start or end date of an adjustment is blank (meaning open-start or open-ended bounds), we use a timestamp (then time-zone-naive) from the AWS dataset, which then causes an error later on when trying to remove the time zone info from these same time-zone-naive bounds.