
Performance issue reading HDF5Compound data #408

Closed · tamasgal opened this issue May 19, 2017 · 11 comments · Fixed by #592

Comments

@tamasgal
Contributor

Dear all,

I noticed that the readout of HDF5Compound datasets is very slow compared to Python (with pandas, which uses PyTables as its backend), roughly a factor of 10 slower.

The sample dataset is a 200k x 15 array and here are the readout times using pandas and Julia:

In [1]: import pandas as pd
In [2]: %time data = pd.read_hdf("run1.h5", "angles/99/hits")
CPU times: user 224 ms, sys: 17.8 ms, total: 242 ms
Wall time: 244 ms

v0.5.0> using HDF5
v0.5.0> using BenchmarkTools
v0.5.0> f = h5open("run1.h5", "r")
HDF5 data file: run1.h5
v0.5.0> @benchmark data = read(f["angles/99/hits"])
BenchmarkTools.Trial:
  memory estimate:  291.37 MiB
  allocs estimate:  8815153
  --------------
  minimum time:     2.897 s (5.34% GC)
  median time:      2.918 s (6.31% GC)
  mean time:        2.918 s (6.31% GC)
  maximum time:     2.939 s (7.27% GC)
  --------------
  samples:          2
  evals/sample:     1

I was a bit surprised, since I switched to Julia to boost the analysis performance (the analysis itself is much faster than in Python/NumPy), but in the end the whole thing takes about the same time because the slow readout dominates.

Am I doing something wrong? Is there a way to speed up the read-in?

Should I prepare a sample file?

@timholy
Member

timholy commented May 19, 2017

I seem to recall that @mbauman once suggested the interface read(io, "var"=>T) where T is a Julia immutable type with the same memory layout as the compound type on disk. I suspect that might solve all your performance problems.

We didn't implement it at the time because Pair (which is what => gives you) was introduced in what was then "bleeding edge" julia-0.4pre and not available in stable julia versions; once 0.4 came out, this never got revisited. It would be a great contribution to implement this.

@tamasgal
Contributor Author

Thanks for the information! What exactly should "var" represent? I am not sure how this Pair readout should work, but maybe you can point me to the discussion so I can try to implement it somehow ;)

@timholy
Member

timholy commented May 19, 2017

"var" was meant to be the name of the dataset, e.g., "angles/99/hits". But of course there are other options, like read(dset=>T). To be honest, it's been so long since I worked closely with this package that I don't remember many details about our previous thoughts. Nor was I successful in searching for the original discussion. Sorry.

But the bottom line is simple: presumably most of the time for HDF5Compound is spent "decoding the type." The main point is that it's not straightforward to convert from a type as described by HDF5's on-disk format to a Julia type: if the Julia type is defined in some package, how is the HDF5 module supposed to know what to do?

But life gets easier if you pass the desired Julia type as an input. In that case the read operation essentially becomes just two lines,

buf = Vector{T}(length(obj))
h5d_read(obj.id, memtype_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf)

and it should be rather faster 😄. As a nice bonus, the data are already arranged in the proper format for you to work with.

You might also want to provide a function that validates that the Julia type matches the layout of the HDF5 type (similar to the logic of the current read operation), but that could be a separate function call and should not have to slow things down for every single read.
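As a rough illustration (this is not code from the package, just a sketch assuming HDF5.jl's low-level h5t_* wrappers), such a check could compare the field count, names, offsets and total size of the target type against the dataset's native compound layout:

function layout_matches(dset::HDF5.HDF5Dataset, ::Type{T}) where {T}
    filetype = HDF5.datatype(dset)                      # layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id)  # padded layout in memory
    try
        HDF5.h5t_get_size(memtype_id) == sizeof(T) || return false
        n = Int(HDF5.h5t_get_nmembers(memtype_id))
        n == length(fieldnames(T)) || return false
        for i in 1:n
            # HDF5 member indices are 0-based, Julia field indices are 1-based
            HDF5.h5t_get_member_name(memtype_id, i - 1) == String(fieldnames(T)[i]) || return false
            HDF5.h5t_get_member_offset(memtype_id, i - 1) == fieldoffset(T, i) || return false
        end
        return true
    finally
        HDF5.h5t_close(memtype_id)
    end
end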

@tamasgal
Contributor Author

Thanks for your input! I am currently playing around with the skeleton you sketched and trying to figure out what memtype_id should "look like" and how it is treated. I thought it would be some kind of memory footprint... It seems I need to dig into the HDF5 docs though ;) But I now more or less understand how it should work.
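For context, a rough sketch of the concept (not from the package): memtype_id is an HDF5 datatype identifier describing how one element is laid out in memory. Assuming HDF5.jl's low-level h5t_create/h5t_insert wrappers and its hdf5_type_id mapping, it could even be built by hand from a Julia struct:

function build_memtype(::Type{T}) where {T}
    # create an empty compound type spanning sizeof(T) bytes in memory
    memtype_id = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(T))
    for i in 1:length(fieldnames(T))
        # register each field with its Julia name, byte offset and HDF5 type
        HDF5.h5t_insert(memtype_id, String(fieldnames(T)[i]),
                        fieldoffset(T, i), HDF5.hdf5_type_id(fieldtype(T, i)))
    end
    memtype_id   # pass this to h5d_read; close it with HDF5.h5t_close afterwards
end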

@mbauman
Member

mbauman commented May 22, 2017

You have a good memory, Tim! I didn't even remember that. Since my participation here has been fairly limited, it was pretty easy to pull up the related issue by limiting the search to ones where I commented: #169

@tamasgal
Contributor Author

It is definitely a tougher job than I thought. I'll update as soon as I've got something ;)

Thanks for the cross-reference!

@ancapdev

Any progress on this, or interest in having help?

@ancapdev

ancapdev commented Jul 12, 2017

A further comment.

The suggestion appears to be passing in a target type to copy into. In my use case, the underlying HDF5 format may be extended with new fields from time to time, or differ from data source to data source.

I'd much prefer generating a compatible immutable on the fly to match the underlying data. Aside from users possibly wanting to provide their own types, are there any problems with this?

To make the generated type work well with the user's code and dispatch, I was thinking you could pass in an abstract type to inherit from.
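A minimal sketch of that idea (purely illustrative, not HDF5.jl code): generate a concrete struct subtyping a user-supplied abstract type from a list of field names and Julia types, which in practice would be derived from the compound datatype via the h5t_get_member_* functions:

abstract type AbstractHit end   # hypothetical user-supplied supertype

function generate_element_type(name::Symbol, supertype::Symbol, fields)
    # build one `fieldname::FieldType` expression per field
    fielddefs = [:($fname::$ftype) for (fname, ftype) in fields]
    @eval begin
        struct $name <: $supertype
            $(fielddefs...)
        end
        $name   # the block evaluates to the freshly created type
    end
end

# hypothetical usage; the field list would come from the HDF5 compound type
HitType = generate_element_type(:Hit408, :AbstractHit, [:time => Float64, :dom_id => Int32])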

@tamasgal
Contributor Author

I have no progress on the HDF5Compound approach, but I can of course write down my experience and how I ended up solving this issue. Sorry in advance for being a bit off topic: I took a different approach which vastly outperforms pandas, but it may help one or the other...

Reading HDF5Compound datasets is nice and easy, but in our experiment we store up to 10000 events per file, with something like 5000 hits per event.
My initial HDF5 structure used a separate HDF5Compound dataset per event (as mentioned in the first post), but it turned out that having thousands of datasets in an HDF5 file is a bad idea: to read one of them, the internal index tree (B-tree) has to be traversed every time, which in our case was one of the bottlenecks.
After a lot of trial and error, I ended up with one 1D array per hit attribute, all stored under a single group, plus an HDF5Compound index dataset which stores the index of the first hit and the number of hits per event.

Something like this (for 10000 events with 5000 hits each):

/hits/_indices   (10000, 2)
/hits/time   (50000000)
/hits/dom_id   (50000000)
/hits/pmt_id   (50000000)
/hits/tot   (50000000)
/hits/triggered   (50000000)
...

To get the hits for e.g. event #23, I look up /hits/_indices (which I keep in memory) at index #23, which gives me the first hit index (let's call it idx) and the number of hits (n). I can then easily grab the corresponding hit attributes by reading /hits/time[idx:idx+n] and so on, and finally loop over 1:n to create the Hit instances from the entries at each index (see the sketch below).

This approach was something like 50× faster than reading the hits from per-event datasets (with their B-tree access) and converting HDF5Compound instances into the final Hit instances.
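A sketch of that lookup (illustrative only; the Hit fields are assumptions, indices is the /hits/_indices content already read into memory, and 0-based on-disk indices are assumed):

struct Hit              # hypothetical hit attributes
    time::Float64
    dom_id::Int32
    pmt_id::Int32
    tot::UInt8
    triggered::Bool
end

function read_event_hits(f, event_id::Integer, indices)
    # indices is assumed to be an N×2 matrix of (first hit index, number of hits)
    idx, n = indices[event_id, 1], indices[event_id, 2]
    r = (idx + 1):(idx + n)          # shift 0-based on-disk indices to 1-based
    time      = f["hits/time"][r]
    dom_id    = f["hits/dom_id"][r]
    pmt_id    = f["hits/pmt_id"][r]
    tot       = f["hits/tot"][r]
    triggered = f["hits/triggered"][r]
    [Hit(time[i], dom_id[i], pmt_id[i], tot[i], triggered[i]) for i in 1:n]
end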

@damiendr

I hit upon this issue and wrote some code that implements @timholy's suggestion. It is indeed faster by an order of magnitude:

function read_fast(dset::HDF5.HDF5Dataset, T::DataType)
    filetype = HDF5.datatype(dset) # packed layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id) # padded layout in memory
    @assert sizeof(T) == HDF5.h5t_get_size(memtype_id) "Type sizes don't match!"
    out = Vector{T}(length(dset))
    HDF5.h5d_read(dset.id, memtype_id, HDF5.H5S_ALL, HDF5.H5S_ALL, HDF5.H5P_DEFAULT, out)
    HDF5.h5t_close(memtype_id)
    out
end

struct Element
    field1::Float32
    field2::Int64
    field3::UInt8
end

data = read_fast(h5["mydataset"], Element)

@tamasgal
Contributor Author

I just arrived here again 😆

Again, thanks @timholy for the suggestion and @damiendr for the implementation.

I changed the line out = Vector{T}(length(dset)) to out = Vector{T}(undef, length(dset)), so it's now working on Julia 0.7+ and it's really fast!
One of the main reasons is the one-shot allocation of the output array.
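For completeness, here is a sketch of the combined function with that one change applied (using the HDF5.jl API names from this thread; newer releases have since renamed some of these internals):

using HDF5

function read_fast(dset::HDF5.HDF5Dataset, T::DataType)
    filetype = HDF5.datatype(dset)                      # packed layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id)  # padded layout in memory
    @assert sizeof(T) == HDF5.h5t_get_size(memtype_id) "Type sizes don't match!"
    out = Vector{T}(undef, length(dset))                # one-shot allocation
    HDF5.h5d_read(dset.id, memtype_id, HDF5.H5S_ALL, HDF5.H5S_ALL, HDF5.H5P_DEFAULT, out)
    HDF5.h5t_close(memtype_id)
    out
end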

Here is my old implementation:

> @btime KM3NeT.read_tracks(fpath)
617.613 ms (4431147 allocations: 236.08 MiB)

and yours

> @btime read_tracks_fast(fpath)
39.539 ms (32 allocations: 9.03 MiB)

Note that both timings include opening the HDF5 file, which is a negligible overhead in this case.
