
Performance issue reading HDF5Compound data #408

Closed · tamasgal opened this issue May 19, 2017 · 11 comments · Fixed by #592

Comments

@tamasgal
Contributor

Dear all,

I noticed that the readout of HDF5Compound datasets is very slow compared to Python (with pandas, which uses PyTables as its backend), roughly a factor of 10 slower.

The sample dataset is a 200k x 15 array and here are the readout times using pandas and Julia:

In [1]: import pandas as pd
In [2]: %time data = pd.read_hdf("run1.h5", "angles/99/hits")
CPU times: user 224 ms, sys: 17.8 ms, total: 242 ms
Wall time: 244 ms

v0.5.0> using HDF5
v0.5.0> using BenchmarkTools
v0.5.0> f = h5open("run1.h5", "r")
HDF5 data file: run1.h5
v0.5.0> @benchmark data = read(f["angles/99/hits"])
BenchmarkTools.Trial:
  memory estimate:  291.37 MiB
  allocs estimate:  8815153
  --------------
  minimum time:     2.897 s (5.34% GC)
  median time:      2.918 s (6.31% GC)
  mean time:        2.918 s (6.31% GC)
  maximum time:     2.939 s (7.27% GC)
  --------------
  samples:          2
  evals/sample:     1

I was a bit surprised, since I switched to Julia to boost the analysis performance (the analysis itself is much faster than in Python/NumPy), but in the end the whole thing takes about the same time because the slow readout dominates.

Am I doing something wrong? Is there a way to speed up the read-in?

Should I prepare a sample file?

@timholy
Member

timholy commented May 19, 2017

I seem to recall that @mbauman once suggested the interface read(io, "var"=>T) where T is a Julia immutable type with the same memory layout as the compound type on disk. I suspect that might solve all your performance problems.

We didn't implement it at the time because Pair (which is what => gives you) was introduced in what was then "bleeding edge" julia-0.4pre and not available in stable julia versions; once 0.4 came out, this never got revisited. It would be a great contribution to implement this.

@tamasgal
Contributor Author

Thanks for the information! What exactly should "var" represent? I am not sure how this Pair readout should work, but maybe you can point me to the discussion so I can try to implement it somehow ;)

@timholy
Member

timholy commented May 19, 2017

"var" was meant to be the name of the dataset, e.g., "angles/99/hits". But of course there are other options, like read(dset=>T). To be honest, it's been so long since I worked closely with this package that I don't remember many details about our previous thoughts. Nor was I successful in searching for the original discussion. Sorry.

But the bottom line is simple: presumably most of the time for HDF5Compound is spent "decoding the type." The main point is that it's not straightforward to convert from a type as described by HDF5's on-disk format to a Julia type: if the Julia type is defined in some package, how is the HDF5 module supposed to know what to do?

But life gets easier if you pass the desired Julia type as an input. In that case the read operation essentially becomes just two lines,

buf = Vector{T}(length(obj))
h5d_read(obj.id, memtype_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf)

and it should be rather faster 😄. As a nice bonus, the data are already arranged in the proper format for you to work with.

You might also want to provide a function that validates that the Julia type matches the layout of the HDF5 type (similar to the logic of the current read operation), but that could be a separate function call and should not have to slow things down for every single read.
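As a rough illustration (this is not code from the package, just a sketch assuming HDF5.jl's low-level h5t_* wrappers), such a check could compare the field count, names, offsets and total size of the target type against the dataset's native compound layout:

function layout_matches(dset::HDF5.HDF5Dataset, ::Type{T}) where {T}
    filetype = HDF5.datatype(dset)                      # layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id)  # padded layout in memory
    try
        HDF5.h5t_get_size(memtype_id) == sizeof(T) || return false
        n = Int(HDF5.h5t_get_nmembers(memtype_id))
        n == length(fieldnames(T)) || return false
        for i in 1:n
            # HDF5 member indices are 0-based, Julia field indices are 1-based
            HDF5.h5t_get_member_name(memtype_id, i - 1) == String(fieldnames(T)[i]) || return false
            HDF5.h5t_get_member_offset(memtype_id, i - 1) == fieldoffset(T, i) || return false
        end
        return true
    finally
        HDF5.h5t_close(memtype_id)
    end
end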

@tamasgal
Contributor Author

Thanks for your input! I am currently playing around with the skeleton you sketched and trying to figure out what memtype_id should "look like" and how it is treated. I thought it would be some kind of memory footprint... It seems I need to dig into the HDF5 docs though ;) But I now more or less understand how it should work.
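For context, a rough sketch of the concept (not from the package): memtype_id is an HDF5 datatype identifier describing how one element is laid out in memory. Assuming HDF5.jl's low-level h5t_create/h5t_insert wrappers and its hdf5_type_id mapping, it could even be built by hand from a Julia struct:

function build_memtype(::Type{T}) where {T}
    # create an empty compound type spanning sizeof(T) bytes in memory
    memtype_id = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(T))
    for i in 1:length(fieldnames(T))
        # register each field with its Julia name, byte offset and HDF5 type
        HDF5.h5t_insert(memtype_id, String(fieldnames(T)[i]),
                        fieldoffset(T, i), HDF5.hdf5_type_id(fieldtype(T, i)))
    end
    memtype_id   # pass this to h5d_read; close it with HDF5.h5t_close afterwards
end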

@mbauman
Member

mbauman commented May 22, 2017

You have a good memory, Tim! I didn't even remember that. Since my participation here has been fairly limited, it was pretty easy to pull up the related issue by limiting the search to ones where I commented: #169

@tamasgal
Contributor Author

It is definitely a tougher job than I thought. I'll update as soon as I've got something ;)

Thanks for the cross-reference!

@ancapdev

Any progress on this, or interest in having help?

@ancapdev

ancapdev commented Jul 12, 2017

A further comment.

The suggestion appears to be passing in a target type to copy into. In my use case, the underlying HDF5 format may be extended with new fields from time to time, or differ from data source to data source.

I'd much prefer generating a compatible immutable on the fly to match the underlying data. Aside from users possibly wanting to provide their own types, are there any problems with this?

To make the generated type work well with the user's code and dispatch, I was thinking you could pass in an abstract type to inherit from.
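A minimal sketch of that idea (purely illustrative, not HDF5.jl code): generate a concrete struct subtyping a user-supplied abstract type from a list of field names and Julia types, which in practice would be derived from the compound datatype via the h5t_get_member_* functions:

abstract type AbstractHit end   # hypothetical user-supplied supertype

function generate_element_type(name::Symbol, supertype::Symbol, fields)
    # build one `fieldname::FieldType` expression per field
    fielddefs = [:($fname::$ftype) for (fname, ftype) in fields]
    @eval begin
        struct $name <: $supertype
            $(fielddefs...)
        end
        $name   # the block evaluates to the freshly created type
    end
end

# hypothetical usage; the field list would come from the HDF5 compound type
HitType = generate_element_type(:Hit408, :AbstractHit, [:time => Float64, :dom_id => Int32])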

@tamasgal
Contributor Author

I have no progress on the HDF5Compound approach, but I can of course write down my experience and how I ended up solving this issue. Sorry in advance for being a bit off topic: I took a different approach which vastly outperforms pandas, but it may help one or the other...

Reading HDF5Compound datasets is nice and easy, but in our experiment we store up to 10000 events per file, with something like 5000 hits per event.
My initial HDF5 structure used a separate HDF5Compound dataset per event (as mentioned in the first post), but it turned out that having thousands of datasets in an HDF5 file is a bad idea: to read one of them, the internal index tree (B-tree) has to be traversed every time, which in our case was one of the bottlenecks.
After a lot of trial and error, I ended up with one 1D array per hit attribute, all stored under a single group, plus an HDF5Compound index dataset which stores the index of the first hit and the number of hits per event.

Something like this (for 10000 events with 5000 hits each):

/hits/_indices   (10000, 2)
/hits/time   (50000000)
/hits/dom_id   (50000000)
/hits/pmt_id   (50000000)
/hits/tot   (50000000)
/hits/triggered   (50000000)
...

To get the hits for e.g. event #23, I look up /hits/_indices (which I keep in memory) at index #23, which gives me the first hit index (let's call it idx) and the number of hits (n). I can then easily grab the corresponding hit attributes by reading /hits/time[idx:idx+n] and so on, and finally loop over 1:n to create the Hit instances from the entries at each index (see the sketch below).

This approach was something like 50× faster than reading the hits from per-event datasets (with their B-tree access) and converting HDF5Compound instances into the final Hit instances.
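A sketch of that lookup (illustrative only; the Hit fields are assumptions, indices is the /hits/_indices content already read into memory, and 0-based on-disk indices are assumed):

struct Hit              # hypothetical hit attributes
    time::Float64
    dom_id::Int32
    pmt_id::Int32
    tot::UInt8
    triggered::Bool
end

function read_event_hits(f, event_id::Integer, indices)
    # indices is assumed to be an N×2 matrix of (first hit index, number of hits)
    idx, n = indices[event_id, 1], indices[event_id, 2]
    r = (idx + 1):(idx + n)          # shift 0-based on-disk indices to 1-based
    time      = f["hits/time"][r]
    dom_id    = f["hits/dom_id"][r]
    pmt_id    = f["hits/pmt_id"][r]
    tot       = f["hits/tot"][r]
    triggered = f["hits/triggered"][r]
    [Hit(time[i], dom_id[i], pmt_id[i], tot[i], triggered[i]) for i in 1:n]
end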

@damiendr

I hit upon this issue and wrote some code that implements @timholy's suggestion. It is indeed faster by an order of magnitude:

function read_fast(dset::HDF5.HDF5Dataset, T::DataType)
    filetype = HDF5.datatype(dset) # packed layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id) # padded layout in memory
    @assert sizeof(T) == HDF5.h5t_get_size(memtype_id) "Type sizes don't match!"
    out = Vector{T}(length(dset))
    HDF5.h5d_read(dset.id, memtype_id, HDF5.H5S_ALL, HDF5.H5S_ALL, HDF5.H5P_DEFAULT, out)
    HDF5.h5t_close(memtype_id)
    out
end

struct Element
    field1::Float32
    field2::Int64
    field3::UInt8
end

data = read_fast(h5["mydataset"], Element)

@tamasgal
Contributor Author

I just arrived here again 😆

Again, thanks @timholy for the suggestion and @damiendr for the implementation.

I changed the line out = Vector{T}(length(dset)) to out = Vector{T}(undef, length(dset)), so it's now working on Julia 0.7+ and it's really fast!
One of the main reasons is the one-shot allocation of the output array.
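For completeness, here is a sketch of the combined function with that one change applied (using the HDF5.jl API names from this thread; newer releases have since renamed some of these internals):

using HDF5

function read_fast(dset::HDF5.HDF5Dataset, T::DataType)
    filetype = HDF5.datatype(dset)                      # packed layout on disk
    memtype_id = HDF5.h5t_get_native_type(filetype.id)  # padded layout in memory
    @assert sizeof(T) == HDF5.h5t_get_size(memtype_id) "Type sizes don't match!"
    out = Vector{T}(undef, length(dset))                # one-shot allocation
    HDF5.h5d_read(dset.id, memtype_id, HDF5.H5S_ALL, HDF5.H5S_ALL, HDF5.H5P_DEFAULT, out)
    HDF5.h5t_close(memtype_id)
    out
end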

Here is my old implementation:

> @btime KM3NeT.read_tracks(fpath)
617.613 ms (4431147 allocations: 236.08 MiB)

and yours

> @btime read_tracks_fast(fpath)
39.539 ms (32 allocations: 9.03 MiB)

Note that both timings include opening the HDF5 file, which is a negligible overhead in this case.
