Support HDF5 reading/writing (with optional dependencies) #3520
Comments
If there is a backend for this in Rust, I think we could work it out in arrow2. It is a quite important format imo. |
That's a good idea too! |
HDF5 has a very big specification: https://docs.hdfgroup.org/hdf5/develop/_f_m_t3.html Rust bindings to libhdf5 can be found at: https://github.com/aldanor/hdf5-rust |
I think we should explore both. Rust backed under a feature flag, and python as optional dependency. I can imagine that it increases binary size quite a bit. |
I'm really excited about Polars. But almost all of my data is in large HDF5 files (actually, NetCDF). I can convert to Parquet files. But loading directly from HDF5 into Polars would be ideal 🙂 Are there any plans to add support to Polars to read HDF5 (ideally lazily)? |
There is this rust crate for reading https://github.com/georust/netcdf But as far as I understand NetCDF contains mostly multidimensional data instead of 1D arrays like the arrow format, so I am not sure how useful it would be in general to even consider support for this. |
Good point! For my work, I use NetCDF for n-dimensional arrays, and 2d arrays. But, if I'm reading this right, this comment suggests that tensor types might come to Polars at some point. It'd be absolutely amazing to be able to load n-dim NetCDF files directly into an n-dimensional in-memory array in Polars 🙂 |
(hdf5-rust author here)
These are not just bindings to libhdf5 (that's the Re: NetCDF, while it uses HDF5 as a storage medium, it's not the same thing, it's more like an opinionated meta-format on top of that that is very popular in some communities (e.g. geo). I think we could make HDF5 work with polars, but would be nice to have something like a spec, or at least a wishlist with some examples – i.e. what do we want?
Problem is, polars/arrow/parquet etc are column based, whereas hdf5 is array-based. If you have a polars frame with columns
There's even more ambiguity when reading existing data: if you have a structured HDF5 dataset with fields "a" and "b", you may want
One way to go would be to check what pandas does and do the same thing, so you can dump a dataframe from pandas and read it back from polars. Perhaps that's the easiest way to get started. |
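The column-versus-array ambiguity described in this comment can be sketched with h5py. This is only an illustration under stated assumptions: the file name, the group name `per_column`, the dataset name `compound`, and the two-column frame are all hypothetical, and neither layout is claimed to be what pandas or Polars writes.

```python
import h5py
import numpy as np

# Hypothetical two-column frame, used only for illustration.
columns = {
    "a": np.array([1, 2, 3], dtype=np.int64),
    "b": np.array([0.5, 1.5, 2.5], dtype=np.float64),
}

# The "core" driver with backing_store=False keeps the file purely in memory.
with h5py.File("layouts.h5", "w", driver="core", backing_store=False) as f:
    # Layout 1: one 1-D dataset per column (column-oriented, close to Arrow).
    group = f.create_group("per_column")
    for name, values in columns.items():
        group.create_dataset(name, data=values)

    # Layout 2: a single compound (structured) dataset, one record per row.
    record_dtype = np.dtype([("a", np.int64), ("b", np.float64)])
    records = np.empty(3, dtype=record_dtype)
    for name, values in columns.items():
        records[name] = values
    f.create_dataset("compound", data=records)

    # Read both layouts back; the values agree, only the on-disk shape differs.
    per_column_names = sorted(f["per_column"].keys())
    compound_fields = f["compound"].dtype.names
    a_from_columns = f["per_column/a"][:]
    a_from_compound = f["compound"][:]["a"]
```

A reader has to decide which of these layouts maps to a frame, which is why a spec or wishlist is useful before writing any code.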
I believe that most of the hdf5 files that we are expected to be able to read are created by pandas. So, yes, I think we should start with supporting what they do. |
I agree, copying Pandas' behaviour sounds like a great place to start! |
Pandas PyTables basic file format overview: |
I have an application that uses h5py to read h5 data into pandas, and I've started to convert this app from Pandas to Polars: the data is relatively big and I'm looking for more performance. I use h5py to read the h5 datasets into numpy structured arrays (heterogeneous types), and the numpy structured arrays transfer very easily into pandas dataframes. But getting that same data into a Polars dataframe is proving to be a problem. Basically, the structure of the numpy structured array is lost and I end up with a single-column dataframe with an object dtype. I suspect there are many users who get their h5 data via h5py, and for these users, just supporting fast/easy construction of polars dataframes from numpy structured arrays would be ideal. |
Copying Pandas behavior for creating dataframes from numpy structured arrays would be great!!! |
It seems np structured array is only a helper here, which the pandas dataframe constructor knows how to handle. So the 'clean' way would be to access the hdf5 directly. But I must admit I have no idea if that is the easier option. |
It may be worth considering leveraging Vaex for this. You can read/write to hdf5 files natively, and they map directly to/from numpy arrays
|
Done... structured array support (both initialising frames and exporting from them) will be available in the upcoming |
Folks, Tall towers like
have many users in geographic info -- but towers get shaky as you add more and more stories
and Moving-away-hdf5 (2016) -- excellent. Fwiw, my use case: 100 or so 2 GB hdf5 files (24*30, 824, 848) |
On the topic of the performance of Zarr (and xarray), we've started a concerted effort to significantly speed up Zarr, possibly by re-writing a bunch of it from scratch in Rust (partly inspired by Polars!): zarr-developers/zarr-python#1479 I agree 100% with your general point: a lot of "scientific Python" code is built on 30-year-old foundations. And those foundations are starting to look a little shaky! (Because high-performance code today looks quite different to high-performance code from 30 years ago.) So I do agree that there's an argument for thinking hard about re-building some of these things from the ground up. |
Seems that hdf5 files can be horribly complicated From the top floor of this tower you won't even SEE the ancient crud way down below, |
Interesting. Are you working with weather data? I knocked together a quick project which uses ECMWF's eccodes library and exposes the data as arrow: https://github.com/hugopendlebury/gribtoarrow I would be interested in doing something similar with HDF5. Do you have any sample files? |
Hi Hugo, fwiw, a trivial test case is On weather data, https://opendata.dwd.de/climate_environment/REA/COSMO_REA6/converted/hourly/2D/* A big unknown in runtimes is cache performance, SSD L1 L2 ...; cheers |
In one of my projects, we use a python stack with pandas doing all the DataFrame stuff. We're currently depending on an SQL database but want to migrate to pytables. It would be amazing if polars offered an easy way to load/store data from/to pytables HDF5 files! |
Looks like pytables may be the way to go to support this on the Python side. That would probably be a good first step. We can look into Rust support later. |
Yes, we can start with pytables. For Rust support I first want to extend the plugin system with readers. |
Hi, can I have a go at implementing this with pytables? |
@galbwe Definitely! You can take inspiration from |
Python dev and newbie Rust dev here. I would like to try to implement that HDF5 crate and create a data load function. |
I'm going to work on adding this client crate into polars in my fork, when I'm ready I'll submit a PR. |
@timothyhutz, I'd welcome a 1-page spec and a little testbench first.
Added: do you have a short list of max 5 use cases for people to vote on: |
When can one expect the HDF5 implementation to be done? |
This does not have priority for us. A community contribution is welcome. |
Actually, I am working on Polars IO plugins and this might be the first candidate. |
Ok, I did some investigation and reading pandas style hdf5 files is a no go. Pandas style is just a glorified pickle. In the following example
import pandas as pd
df = pd.DataFrame({
    "a": ["foo", "bar", None, "h"],
    "b": [None, True, False, True],
    "c": [1.0, None, 3.0, 4.0],
})
df.to_hdf('data.h5', key='df', mode="w")
Only column And going through python to pickle and infer types is a no-go. It is super slow, ambiguous and can execute random code (because of pickle). |
For me, I'd mostly want to use Polars to read HDF5 (and NetCDF) files which were not created by Pandas. But maybe I'm an outlier? |
Unfortunately, I am not so advanced...
Cannot talk for the entire community, but I would also use it to read HDF5 files generated by other sources than Pandas. |
Can you give me an example of what is in those files then? Numerical arrays? I'd like some examples/spec to know how to implement it. Maybe share some files. |
For me, I mostly work with weather predictions and satellite data, so I mostly work with NetCDF (which uses HDF5 under the hood, IIUC). NOAA's amazing Open Data Dissemination programme (NODD) has released 59 petabytes of data on cloud object storage. Some of that data is NetCDF. For example, the GFS warm start data is netcdf. (The warm start dataset is the second dataset listed on that AWS page) |
Interesting, since I also consume NOAA data, but in GRIB format, which I download using Amazon. Is there any advantage to NetCDF? From what I've seen it's less flexible than GRIB but works better with tools like xarray (which are a bit restrictive and, because they try to create a cube of the data, can fabricate results, since they create a Cartesian product of the dimensions). |
I tried to open one of those files in pytables, but it cannot open it.
I need a bit more hand-holding. I don't know anything about those formats, so I need a good cherry-picked example and the type of data you want out of it. |
No worries! I can try to help! I think those Here are some files which definitely should be NetCDF: https://noaa-gfs-warmstart-pds.s3.amazonaws.com/index.html#gdas.20210102/00/RESTART/ |
On the question of what data I'd want from these files... TBH, I probably haven't tried to open a NetCDF file in Pandas for years. I pretty much exclusively use To be specific: I often work with "numerical weather predictions" (NWPs) which are dense, gridded weather forecasts produced by physical models of the weather, run on huge supercomputers. NWPs are dense n-dim arrays, where n can be as high as 7. The 7 dimensions are:
Which is all to say that, if I'm honest, I'm afraid I probably wouldn't see myself using Polars to process NWPs (because I need to handle up to 7 dimensions, and I need to process the spatial coords). Or is the "grand plan" to build functionality into Polars that could enable something like "a Rust |
We can reduce friction by figuring out how to load data most efficiently to polars memory.
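One way to picture loading gridded data into a columnar layout is to flatten the n-dimensional array into equal-length 1-D columns, one per dimension plus one for the values. The sketch below is an assumption-laden illustration, not a proposed Polars API: the helper name `grid_to_columns`, the dimension names, and the sample grid are all hypothetical.

```python
import numpy as np

# Hypothetical 3-D forecast grid: (time, lat, lon). Names are illustrative only.
coords = {
    "time": np.array([0, 1]),
    "lat": np.array([50.0, 51.0, 52.0]),
    "lon": np.array([-1.0, 0.0]),
}
shape = tuple(len(v) for v in coords.values())
temperature = np.arange(np.prod(shape), dtype=np.float64).reshape(shape)


def grid_to_columns(values, coords):
    """Flatten an n-dim array into 1-D columns: one per dimension plus the values."""
    # indexing="ij" keeps the axis order of `values`, so ravel() orders match.
    grids = np.meshgrid(*coords.values(), indexing="ij")
    columns = {name: grid.ravel() for name, grid in zip(coords, grids)}
    columns["value"] = values.ravel()
    return columns


columns = grid_to_columns(temperature, coords)
```

The resulting dict of 1-D arrays has the shape a dataframe constructor expects. The trade-off is that every coordinate is repeated for each cell, so the long form uses more memory than the dense grid.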