Performance issue reading HDF5Compound data #408
Comments
I seem to recall that @mbauman once suggested an interface along these lines, where you pass the desired Julia type to the read call. We didn't implement it at the time.
Thanks for the information! What exactly should that interface look like?
But the bottom line is simple: presumably most of the time for HDF5Compound is spent "decoding the type." The main point is that it's not straightforward to convert from a type as described by HDF5's disk format to a Julia type: what if the Julia type is defined in some package? How is the HDF5 module supposed to know what to do? But life gets easier if you pass the desired Julia type as an input. In that case the read operation essentially becomes just two lines,

```julia
buf = Vector{T}(length(obj))
h5d_read(obj.id, memtype_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf)
```

and it should be rather faster 😄, in addition to the nice feature that the data are already arranged in the proper format for you to work with. You might also want to provide a function that validates that the Julia type matches the layout of the HDF5 type (similar to the logic of the current read operation), but that could be a separate function call and would not have to slow down every single read.
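A rough sketch of what such a check could look like: the helper name `matches_layout` is made up, and it assumes HDF5.jl exposes the usual low-level wrappers (`h5t_get_nmembers`, `h5t_get_member_name`, `h5t_get_member_offset`) alongside the ones already used in this thread:

```julia
# Illustrative only, not part of HDF5.jl: check that the Julia type T has
# the same size, field names and field offsets as the dataset's compound
# type in its native (in-memory) layout.
function matches_layout(dset::HDF5.HDF5Dataset, ::Type{T}) where T
    filetype = HDF5.datatype(dset)
    memtype_id = HDF5.h5t_get_native_type(filetype.id)
    try
        sizeof(T) == HDF5.h5t_get_size(memtype_id) || return false
        n = HDF5.h5t_get_nmembers(memtype_id)
        n == length(fieldnames(T)) || return false
        for i in 1:n
            # HDF5 member indices are zero-based
            HDF5.h5t_get_member_name(memtype_id, i - 1) == String(fieldname(T, i)) || return false
            HDF5.h5t_get_member_offset(memtype_id, i - 1) == fieldoffset(T, i) || return false
        end
        return true
    finally
        HDF5.h5t_close(memtype_id)
    end
end
```

Called once up front, it keeps the type introspection out of the per-read hot path.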
Thanks for your input! I am currently playing around with the skeleton you sketched and trying to figure out how …
You have a good memory, Tim! I didn't even remember that. Since my participation here has been fairly limited, it was pretty easy to pull up the related issue by limiting the search to ones where I commented: #169
It is definitely a tougher job than I thought. I'll update as soon as I've got something ;) Thanks for cross-referencing!
Any progress on this, or interest in having help?
A further comment. The suggestion appears to be passing in a target type to copy into. In my use case, the underlying HDF5 format may, from time to time or from one data source to another, be extended with new fields. I'd much prefer generating a compatible immutable on the fly to match the underlying data. Aside from users possibly wanting to provide their own types, are there problems with this? In order to have the generated type work well with the user's code and dispatch, I was thinking you could pass in an abstract type to inherit from.
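A rough sketch of that idea: generate an immutable from field names and types discovered at runtime (e.g. recovered from an HDF5 compound datatype) and make it a subtype of a caller-supplied abstract type. The helper name `make_compound_type` and the example fields are made up:

```julia
# Illustrative only: build an immutable type on the fly and subtype a
# user-supplied abstract type so dispatch keeps working.
abstract type AbstractRecord end

function make_compound_type(name::Symbol, fnames::Vector{Symbol},
                            ftypes::Vector{DataType}, super::Type)
    fields = [:($fn::$ft) for (fn, ft) in zip(fnames, ftypes)]
    @eval begin
        struct $name <: $super
            $(fields...)
        end
        $name   # return the freshly defined type
    end
end

# Usage: dispatch on the abstract supertype works as usual.
Hit = make_compound_type(:Hit, [:time, :charge], [Float64, Float32], AbstractRecord)
process(hits::Vector{<:AbstractRecord}) = length(hits)
```

Note that redefining the same name with different fields would error, so a generated type is only valid for one layout per session.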
I have no progress regarding the HDF5Compound approach, but I can of course write down my experiences and how I ended up solving this issue. Sorry in advance for being "off topic"; I went with a different approach which totally outperforms Pandas, but it may help one or the other... Reading the HDF5Compound datasets is nice and easy, but in our experiment we store up to 10000 events in a file, with something like 5000 hits per event. Something like this (for 10000 events with 5000 hits each):
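The dataset names here are just illustrative, not our actual layout; the idea is one flat array per hit field plus a small index table:

```
/hits/_index    # one (start, n_hits) pair per event, shape (10000, 2)
/hits/time      # flat arrays of the individual hit fields,
/hits/charge    # each of total length ≈ 10000 × 5000
/hits/channel
```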
To get the hits for, e.g., event #23, I read the corresponding slice of each hit array using the index table (see the sketch below). This approach was something like 50x faster than reading the hits from a dataset (B-tree access) and parsing them.
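A minimal sketch of that lookup, under the layout assumed above (the file name, the dataset names and the 1-based start/length convention are all assumptions):

```julia
using HDF5

# Hedged sketch: per-event lookup via an index table plus flat hit arrays.
# "events.h5", "hits/_index" and "hits/time" are placeholder names.
h5open("events.h5", "r") do f
    start, n = f["hits/_index"][23, :]        # (start, n_hits) for event 23
    times = f["hits/time"][start:start+n-1]   # one contiguous hyperslab read
end
```

Each read is then a plain contiguous slice of a primitive-typed array, with no compound decoding involved.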
I hit upon this issue and wrote some code that implements @timholy's suggestion. It is indeed faster by an order of magnitude:

```julia
function read_fast(dset::HDF5.HDF5Dataset, T::DataType)
    filetype = HDF5.datatype(dset)  # packed layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id)  # padded layout in memory
    @assert sizeof(T) == HDF5.h5t_get_size(memtype_id) "Type sizes don't match!"
    out = Vector{T}(length(dset))
    HDF5.h5d_read(dset.id, memtype_id, HDF5.H5S_ALL, HDF5.H5S_ALL, HDF5.H5P_DEFAULT, out)
    HDF5.h5t_close(memtype_id)
    out
end

struct Element
    field1::Float32
    field2::Int64
    field3::UInt8
end

data = read_fast(h5["mydataset"], Element)
```
I just arrived here again 😆 Again, thanks @timholy for the suggestion and @damiendr for the implementation. I changed the line … Here is my old implementation:

and yours:

Note that both include opening the HDF5 file, which is, however, a negligible overhead in this case.
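For reference, a comparison like the one described could be timed roughly like this with BenchmarkTools (the file and dataset names are placeholders, and `read_fast`/`Element` refer to the snippet above):

```julia
using BenchmarkTools, HDF5

# Hedged sketch: time the default compound read vs. the typed read_fast
# from the comment above, each including the file open.
@btime h5open("sample.h5", "r") do f
    read(f["mydataset"])                  # default HDF5Compound read
end

@btime h5open("sample.h5", "r") do f
    read_fast(f["mydataset"], Element)    # typed read
end
```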
Dear all,
I noticed that the readout of HDF5Compound datasets is very slow compared to Python (with pandas, which uses PyTables as the backend): about a factor of 10 on a sample dataset of 200k x 15 rows, comparing the pandas and Julia readout times.
I was a bit surprised, since I switched to Julia to boost the analysis performance (which is indeed much faster than in Python/NumPy), but in the end the whole thing takes about the same time due to the big impact of the slow readout.
Am I doing something wrong? Is there a way to speed up the read-in?
Should I prepare a sample file?
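For reference, the slow path being compared here is presumably just the default compound read; a minimal reproduction might look like this (the file and dataset names are placeholders):

```julia
using HDF5

# Hedged sketch of the naive readout: the default read of a compound
# dataset, which decodes the HDF5 compound type element by element.
# "sample.h5" and "mydataset" are placeholder names.
h5open("sample.h5", "r") do f
    data = read(f["mydataset"])
end
```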