
xr.DataSet.from_dataframe / xr.DataArray.from_series does not preserve DateTimeIndex with timezone #3291

fjanoos opened this issue Sep 7, 2019 · 4 comments

fjanoos commented Sep 7, 2019

Problem Description

When using Dataset.from_dataframe (or DataArray.from_series) to convert a pandas DataFrame whose DatetimeIndex has a timezone, xarray converts the datetimes into nanosecond integers rather than keeping a datetime index type.

MCVE Code Sample

print( df.index ) 
DatetimeIndex(['2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               '2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               ...
               '2019-08-20 16:00:00-05:00', '2019-08-20 16:00:00-05:00'],
              dtype='datetime64[ns, EST]', name='time', length=12713014, freq=None)
ds = xr.Dataset.from_dataframe( df.head( 1000 ) )
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array([946933200000000000, 947019600000000000, 947106000000000000,
       947192400000000000, 947278800000000000, 947538000000000000,
       947624400000000000, ...], dtype=object)
Coordinates:
  * time     (time) object 946933200000000000 ... 947624400000000000

Expected Output

After removing the tz localization from the DataFrame's DatetimeIndex, the conversion to a Dataset preserves the time index (without converting it to nanoseconds):

df.index = df.index.tz_convert('UTC').tz_localize(None)
ds = xr.Dataset.from_dataframe( df.head(1000) )
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array(['2000-01-03T21:00:00.000000000', '2000-01-04T21:00:00.000000000',
       '2000-01-05T21:00:00.000000000', '2000-01-06T21:00:00.000000000',
       '2000-01-07T21:00:00.000000000', '2000-01-10T21:00:00.000000000',
       '2000-01-11T21:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-03T21:00:00 ... 2000-01-11T21:00:00
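For reference, here is a self-contained sketch of the failure and the workaround. The frame, column name, and timestamps are illustrative, not taken from the original report:

```python
import pandas as pd
import xarray as xr

# Illustrative frame with a timezone-aware DatetimeIndex (fixed-offset EST)
idx = pd.date_range("2000-01-03 16:00", periods=3, freq="D", tz="EST")
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=idx.rename("time"))

# Direct conversion does not yield a datetime64[ns] coordinate; the
# tz-aware index falls back to object dtype (or integers in old pandas)
ds_bad = xr.Dataset.from_dataframe(df)
print(ds_bad["time"].dtype)

# Workaround: normalize to UTC and drop the tz before converting
df.index = df.index.tz_convert("UTC").tz_localize(None)
ds_ok = xr.Dataset.from_dataframe(df)
print(ds_ok["time"].dtype)  # datetime64[ns]
```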

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.12.3+81.g41fecd86
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.8.0
pip: 19.0.3
conda: 4.7.11
pytest: 4.3.1
IPython: 7.4.0
sphinx: 1.8.5

shoyer (Member) commented Sep 15, 2019

You should be getting a warning about this if you use the latest version of pandas. In the future, this behavior will change to return an object dtype array full of pandas Timestamp objects. Unfortunately, NumPy doesn't have a built-in timezone-aware datetime dtype, so this is about the best we can do.
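To illustrate the point (a sketch, not taken from the issue): in recent pandas, converting a tz-aware index to a NumPy array falls back to an object array of Timestamp objects, precisely because NumPy's datetime64 cannot carry a timezone:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-03", periods=2, tz="US/Eastern")

# NumPy's datetime64 has no timezone concept, so recent pandas
# returns an object array of pd.Timestamp values instead
arr = np.asarray(idx)
print(arr.dtype)  # object
```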

scottyhq (Contributor) commented Apr 21, 2021

Just wanted to rekindle discussion here and ping @dcherian and @benbovy. The current workaround for a pandas DatetimeIndex with timezone info (dtype='datetime64[ns, EST]') is to drop the timezone, or to use to_index(), operate in pandas, and then reassign the time coordinate: see #1036 and #3163.

If I'm following https://github.com/pydata/xarray/blob/master/design_notes/flexible_indexes_notes.md, this is another potential example of improved user-friendliness: we could have timezone-aware indexes and therefore call pandas methods like pandas.core.indexes.datetimes.DatetimeIndex.tz_convert() directly as a DataArray method?

This would definitely be great for remote sensing data that is usually stored with UTC timestamps, but often analysis requires converting to local time.
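A sketch of that round-trip workaround; the dataset, timestamps, and timezones here are made up for illustration:

```python
import pandas as pd
import xarray as xr

# Toy dataset with naive UTC timestamps, as remote sensing data is often stored
ds = xr.Dataset(
    {"x": ("time", [1, 2, 3])},
    coords={"time": pd.date_range("2021-04-21", periods=3, freq="h")},
)

# Drop to pandas, localize as UTC, convert to local time, then strip
# the tz again so the coordinate stays plain datetime64[ns]
local = (
    ds.indexes["time"]
    .tz_localize("UTC")
    .tz_convert("US/Pacific")
    .tz_localize(None)
)
ds = ds.assign_coords(time=local)
print(ds["time"].values[0])  # shifted by the UTC offset (UTC-7 in April)
```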

dcherian (Contributor) commented

I am confused about the following point after reading the indexing refactor design notes on removing IndexVariable.

If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

shoyer (Member) commented Apr 21, 2021

If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

No, unfortunately it is not possible to use a pandas.Index directly inside Variable.data, because pandas.Index is not compatible with the NumPy array API -- in particular, it is stuck with 1D data. Instead, we will need to wrap the array in some adapter class to make it compatible. Ideally this wrapper would be a fully N-dimensional wrapper for pandas.Series objects, but for a first pass it would probably be fine to raise an error if indexing would create a higher-dimensional array.

The bigger issue is that code elsewhere in Xarray probably needs updates to avoid assuming that all dtype objects are numpy.dtype instances.
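As a rough illustration of such an adapter (purely hypothetical, not xarray's actual implementation), one could wrap a 1D pandas.Index to present an array-like interface. Note that its dtype may be a pandas ExtensionDtype rather than a numpy.dtype, which is exactly the assumption that would need auditing:

```python
import numpy as np
import pandas as pd

class PandasIndexAdapter:
    """Hypothetical sketch: wrap a 1D pandas.Index so it quacks like an array."""

    def __init__(self, index: pd.Index):
        self._index = index

    @property
    def dtype(self):
        # May be a pandas ExtensionDtype (e.g. datetime64[ns, tz]),
        # not a numpy.dtype instance
        return self._index.dtype

    @property
    def shape(self):
        return (len(self._index),)

    def __getitem__(self, key):
        result = self._index[key]
        if isinstance(result, pd.Index):
            return PandasIndexAdapter(result)
        return result  # scalar, e.g. a tz-aware pd.Timestamp

    def __array__(self, dtype=None):
        return np.asarray(self._index, dtype=dtype)

idx = pd.date_range("2000-01-03", periods=3, tz="US/Eastern")
wrapped = PandasIndexAdapter(idx)
print(wrapped.shape, wrapped.dtype)
```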

BaptisteVandecrux added a commit to GEUS-Glaciology-and-Climate/pypromice that referenced this issue Dec 18, 2023
While trying to prescribe an open-ended adjustment, I noticed that it currently causes an error.

When we read the start and end times of adjustments, they receive time zone info due to their ISO format. The AWS xarray dataset does not have time zone info (because of an [xarray limitation](pydata/xarray#3291)), so the time zone info is removed from the adjustments' time bounds (l.183-184).

What was missing is that when the start or end date of an adjustment is blank (meaning open-start or open-ended bounds), we use a timestamp (then time-zone-naive) from the AWS dataset, which then causes an error later on when trying to remove the time zone info from these same time-zone-naive bounds.