Provide offset for memory mapping / contiguous layout #321
Comments
Thanks for stopping by @maartenbreddels. If one created a Zarr Array with no chunks (contiguously as you say) and disabled compression, then it should be possible to memory map the entire Zarr file and access it if you like. One can check to see if an Currently we load full chunks into memory, which it sounds like you don't want. The easiest way to fix this is the proposal in issue ( #265 ). Simpler still might be to add a flag to |
Also, I'm not totally sure what you mean by offset here. Could you please clarify? Is this supposed to be the size of the binary file's header or something? |
I missed the notifications for this. But by offset I mean, given a particular file, at what location in that file does the array data start. |
There are no offsets currently. |
Hi there 👋, just gently bumping this thread to see if there's been any progress on either |
Hi @weiji14, I think this depends on a bit of technical discussion with @maartenbreddels. Zarr is intended primarily for storing data in chunks, optionally with compression. It is possible to store data without chunks (effectively a single chunk, as @jakirkham suggested) and with no compression, but then there is no benefit over using Arrow or HDF5 in contiguous layout. So I'm wondering: (1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression? (2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)? |
Hi @alimanfoo, I suppose I'm primarily interested in the out-of-core/memmap use case, which seems to be what #265 is handling. To the best of my knowledge, Vaex currently supports 2D data tables, but doesn't work so well on N-dimensional cases which is what Zarr is built to handle.
I suppose there is, if the benefit of doing out-of-core processing on memory-limited hardware outweighs that of running parallel workloads on multiple chunks. E.g. processing on a laptop rather than on a cluster.
I'm not very well educated on the technical aspects, but is it possible to memory map individual zarr chunks (assuming no compression), since they're basically just files in the filesystem, or is it more trouble than it's worth? |
Hi @weiji14
The next question is, in this case, what value would using zarr add over other backends (hdf5, arrow)?
I imagine you could memory map chunks with no compression. And you could also figure out for a given row what chunk and what offset within a chunk you need. Cheers, |
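The chunk and in-chunk offset lookup mentioned above can be sketched roughly as follows. This is only an illustration, not zarr's API: `locate_row` and its parameters are hypothetical names, and it assumes the array is chunked along the first axis only, stored uncompressed in C order, with every chunk file holding exactly `chunk_rows` rows of raw bytes and no header.

```python
def locate_row(row, chunk_rows, row_nbytes):
    """Return (chunk_index, byte_offset_within_chunk) for a row.

    Hypothetical helper. Assumes chunking along the first axis only,
    no compression or filters, C order, and no per-chunk header.
    """
    if row < 0:
        raise IndexError("row must be non-negative")
    # Which chunk holds the row, and which row inside that chunk.
    chunk_index, row_in_chunk = divmod(row, chunk_rows)
    # Byte position of the row relative to the start of the chunk.
    return chunk_index, row_in_chunk * row_nbytes


# Example: 1000 rows per chunk, each row holding 3 float64 values (24 bytes).
location = locate_row(2500, chunk_rows=1000, row_nbytes=24)
```

With those numbers, row 2500 falls in the third chunk (index 2), 500 rows in. Chunking along more than one axis would need the same calculation per dimension.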
A very good point 😄 I can only speak for the HDF5 case since I don't use arrow, but off the top of my head, what I can think of are 1) the ability to see some of the ndarray's metadata (via the external
Ok, good to know that it's a possibility. Should have probably mentioned too that I'm coming in as an |
Related: #265 #149
Thanks @rabernat for pointing me to this library.
In vaex (out-of-core dataframes) I use .hdf5 or arrow files which are memory mapped, which gives really good performance. Arrow natively supports this, and .hdf5 can be used if contiguous layout is specified. In this case, you can ask for the offset (and size) of the array in the file. Once the offsets, types, endianness and lengths/shapes are collected, the Nd-arrays can be memory mapped, linked to a numpy array, and passed around to any library, giving you lazy reading and no wasted memory out of the box.
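To make the idea concrete, here is a minimal sketch of mapping array data at a known offset, using only the Python standard library. It is not how vaex does it: `map_array`, the 16-byte header, and the demo file are made up for illustration, and the data is assumed to be native-endian float64. With numpy, `numpy.memmap` takes `dtype`, `offset` and `shape` arguments that serve the same purpose.

```python
import mmap
import os
import struct
import tempfile


def map_array(path, offset, count, fmt="d"):
    """Map `count` items of struct format `fmt` starting at byte `offset`.

    Hypothetical helper. Assumes native byte order. Returns (mm, view);
    the caller releases `view` before closing `mm`.
    """
    itemsize = struct.calcsize(fmt)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only read when touched.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slice out the array bytes and reinterpret them as typed items, no copy.
    view = memoryview(mm)[offset:offset + count * itemsize].cast(fmt)
    return mm, view


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(b"H" * 16)                               # made-up 16-byte header
        f.write(struct.pack("4d", 1.0, 2.0, 3.0, 4.0))   # the array data
    mm, view = map_array(path, offset=16, count=4)
    values = view.tolist()   # copies out of the mapping, just for the demo
    view.release()
    mm.close()
```

The only inputs needed are the ones listed above: the file, the byte offset, the item type, and the length.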
Is it an idea for zarr to support this layout and provide an API to get this offset? This would make it really easy for me to support zarr in vaex. In the case of chunked storage or compression options, the hdf5 library returns an offset of -1 (h5py translates that to None, I think).