
RFC: Optionally support memory-mapping DirectoryStore values #377

Closed
wants to merge 9 commits into from
Conversation

@jakirkham (Member) commented Jan 4, 2019

Fixes #265
Depends on #378 & #379 & #380

Adds the memmap option to DirectoryStore, which defaults to False (the current behavior). If set to True, each value returned will be memory-mapped instead of loaded entirely into memory. This behavior is extended to NestedDirectoryStore and TempStore.

Given the recent work in PR ( zarr-developers/numcodecs#128 ) and PR ( #352 ), this just works out-of-the-box. The tests needed some adjustment to be a bit more flexible, as they were strictly requiring bytes before. Also had to implement pop and popitem to ensure data was copied into memory before files were removed.
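
A sketch of the proposed usage (the path is hypothetical):

import zarr
from zarr import DirectoryStore

# with the proposed memmap=True, values are memory-mapped on read rather
# than loaded eagerly into memory
store = DirectoryStore("data/example.zarr", memmap=True)
z = zarr.open(store, mode="r")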

xref: https://github.com/zarr-developers/zarr/issues/320
xref: #321
xref: https://github.com/zarr-developers/zarr/issues/334

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@alimanfoo (Member)

Thanks, very helpful to see this.

I'm realising that I don't really understand memory mapped files very well at all, so these questions might not make sense. But these come to mind:

  • What do we gain by using numpy.memmap rather than mmap directly?
  • With memory-mapped files, the underlying file needs to be kept open (is that true?). If lots of chunks are being read, could this cause any problems due to lots of open files?
  • What about writing? Would it ever make sense to also write data via a memory map?

One thing I'm really not clear about is the division of responsibility between operating system and Python in terms of who manages the memory mapping and owns the memory. I believe it is the operating system, but I'm not clear.

If this is the case, what happens when two threads (or processes) open a memory map onto the same file? Do they access the same memory map, or are two separate memory maps created?

What about if a single thread (or process) sequentially opens a memory map onto the same file multiple times? Will the same memory map be re-used, or will it be created again from scratch each time?

@jakirkham (Member, Author)

Thanks for giving this a quick look. These are good questions. They forced me to think a bit more about this problem, which has been really helpful. Sorry for writing so much.

What do we gain by using numpy.memmap rather than mmap directly?

Initially this just seemed like the easiest way to go. After digging into the NumPy code a bit, I think we are saving a little on boilerplate as NumPy is just using mmap under the hood. You're probably right that using mmap directly is fine.

With memory-mapped files, the underlying file needs to be kept open (is that true?).

At the system level the file does need to be kept open for the memory-mapped file to work.

That said, Python specifically duplicates the file descriptor of the file when constructing the mmap object; so one can close the original file in Python (NumPy leverages this in memmap).

If lots of chunks are being read, could this cause any problems due to lots of open files?

Certainly. Though the same could be said for how we read files now. There can be an issue with having too many file descriptors in use, as noted in issue 10897. However, NumPy's memmap avoids this by just closing the old file, and we could too if we use mmap.

The main problem we can run into is if we don't explicitly close the memory-mapped file. The memory-mapped file will of course close itself when it scopes out (ref count drops to zero or garbage collection gets to it). However we could still have these things lingering in the interim, which would affect the number of used file descriptors. IDK how much of an issue this would be in practice given these likely get closed quickly (due to ref counting), but it is worth keeping in mind.
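
For illustration, a minimal sketch of closing a map deterministically rather than relying on reference counting (the file name is hypothetical):

import mmap

with open("chunk.dat", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data = bytes(mm)  # copy the contents out while the map is open
# both the duplicated descriptor and the map are released here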

What about writing? Would it ever make sense to also write data via a memory map?

It could, but I don't think it makes sense in our current implementation of DirectoryStore. This would be more useful if we wrote directly into chunks. It also could make sense if we kept memory-maps open for a while in read-write mode leveraging the cache described above.

Maybe this is a good argument for a separate store? If we did go that way, we might want to give some thought to how we compose a new store like this with other directory layouts. We'd also want to think about what it means to make an atomic change in this setup and whether that is important.

One thing I'm really not clear about is the division of responsibility between operating system and Python in terms of who manages the memory mapping and owns the memory. I believe it is the operating system, but I'm not clear.

That sounds right to me. FWIU the operating system reads from the file into kernel space and shares that memory segment with the process. With a typical read there would be a system call to read followed by a copy of the data from kernel space into process space. Memory-mapping avoids the system calls and copying. Though this is oversimplifying a lot.

Should add that the memory usage still gets counted against your process somehow.

If this is the case, what happens when two threads (or processes) open a memory map onto the same file? Do they access the same memory map, or are two separate memory maps created?

As the memory used is in the kernel (not in the process), it can be reused easily not only across threads but also across processes (as long as sharing is enabled when creating the memory map). Memory-mapping is actually a convenient way to create process-shared memory generally.
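
A small sketch of that sharing (POSIX, where MAP_SHARED is the default; the file name is hypothetical):

import mmap

with open("shared.dat", "wb") as f:
    f.write(b"\x00" * 5)

with open("shared.dat", "r+b") as f1, open("shared.dat", "r+b") as f2:
    mm1 = mmap.mmap(f1.fileno(), 0)  # MAP_SHARED is the default on POSIX
    mm2 = mmap.mmap(f2.fileno(), 0)
    mm1[:] = b"hello"
    assert mm2[:] == b"hello"  # both maps see the same kernel pages
    mm1.close()
    mm2.close()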

What about if a single thread (or process) sequentially opens a memory map onto the same file multiple times? Will the same memory map be re-used, or will it be created again from scratch each time?

It's worth pointing out that opening and closing a file multiple times generally comes at a cost (whether or not it is memory-mapped), and that is something we already do. Certainly memory-mapping a file each time has its own cost as well. To cut down on these, we might consider having an LRU cache of file or memory-map objects. ( #381 ) This would also be a good way to control the file descriptor issue described earlier.
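
A minimal sketch of that idea (the helper name _open_memmap is hypothetical, and real eviction would need more care):

import functools
import mmap

@functools.lru_cache(maxsize=128)
def _open_memmap(fn):
    # keep recently used maps open to bound both reopen costs and the
    # number of file descriptors in use; an evicted map is closed by
    # reference counting once no views of it remain
    with open(fn, "rb") as fh:
        return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)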

@jakirkham jakirkham mentioned this pull request Jan 4, 2019
@jakirkham (Member, Author)

As a side note, it would be good to get the 3 testing PRs this depends on (listed in the OP) reviewed and hopefully merged in the interim. Those seem like useful additions independent of this PR.

@alimanfoo (Member)

Thanks @jakirkham for the info, much appreciated - learning a lot. Funny that you can interact with the mmap after the original file is closed; I did see that in the numpy memmap source code but wasn't sure I was seeing it right. I.e., this works:

In [4]: with open('test-mmap.dat', 'r+b') as f:
   ...:     mm = mmap.mmap(f.fileno(), 0)

In [5]: mm[:]
Out[5]: b'asdlkjdaslkjdasljkdasjlkdas\n'

In [6]: mm[:5] = b'aaaaa'
I'll try to look at the upstream PRs ASAP.

@jakirkham (Member, Author)

Right that’s because mmap duplicates the file descriptor. It’s both useful and odd.

@alimanfoo (Member)

Just a short note to say that in reading around memory-mapped files, I happened again on information about LMDB (home page, wikipedia page). I don't understand the details, but I get the sense that if we want to explore the potential benefits of memory-mapping, then we may do well to build on existing key-value store implementations, where a lot of thought may already have gone into managing memory maps while also providing other useful properties like support for concurrent access. Zarr already has an LMDBStore class, which I have found extremely fast in some casual benchmarks (e.g., substantially faster than Berkeley DB). LevelDB also comes up in discussions of memory-mapping; Zarr doesn't have a store class for this yet, but one would probably be straightforward via plyvel.
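
For reference, a minimal sketch of using the existing store (requires the lmdb package; the path is hypothetical):

import zarr

store = zarr.LMDBStore("data/example.lmdb")
root = zarr.group(store=store)
store.close()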

This is not to stifle exploration of possible uses of Python's mmap module, but just thought I'd make this connection.

@jhamman (Member) commented Jan 5, 2019

Btw, it also sounds like MongoDB has a memory mapping storage backend. This will also be available once #372 is completed.

@jakirkham jakirkham changed the title Optionally support memory-mapping DirectoryStore values RFC: Optionally support memory-mapping DirectoryStore values Feb 14, 2019
jakirkham added 9 commits May 6, 2019 02:41
Instead of strictly expecting values returned from a store to be `bytes` instances in the tests, relax this to accept anything that supports the buffer protocol and has the expected content. This makes it possible to return things like NumPy `memmap` objects and still have the tests work correctly.
This provides users the option to memory-map the data from the `DirectoryStore` instead of reading it all into memory at once. This can be useful if the chunks are especially large. It can also be useful in combination with the zero-copy changes made to Numcodecs, which decompress data straight from disk into a decompressed buffer. In cases where the data is not compressed, this can make accessing pieces of chunks quite efficient.

To make resource management simple, we leverage NumPy's `memmap`, which
cleans up after itself once no more references are held (e.g. RAII). So
there is no need to explicitly close the file. This way if the
memory-mapped file is only used for a single operation and tossed, it
will automatically be cleaned up without needing other code outside the
store to handle this.

So as to ensure no inadvertent changes are made to the store, the
memory-mapped file is opened in read-only mode. Thus we disallow
accidental overwrites of data on disk. This constraint applies to any
views taken of the `memmap` (e.g. NumPy `ndarray`s or `memoryview`s). Of
course users can still use `__setitem__` to overwrite the data on disk.

At present, memory-mapping only occurs when reading data from disk. It is not used to write data to disk. When reading data from disk, there are clear benefits (as noted above). However, it is unclear what the benefits would be for writing data to disk via memory-mapping, particularly as we completely overwrite data on disk in an atomic fashion, which would be difficult to reproduce using a memory-mapped file.

Other stores inheriting from the `DirectoryStore` can benefit from this
change if they also pass along a `memmap` keyword argument to the
constructor and reuse its `__getitem__` either as is or in their own
subclass implementation of the method. No further changes are required.

Currently memory-mapping is opt-in. If it is seen to be generally useful, we may make it opt-out instead, though the option remains useful should users find it does or does not make sense in different contexts. Users could toggle the value of `memmap` should they need to on a case-by-case basis (e.g. only for specific values), but this is not necessarily recommended.
Ensure that `TempStore` and `NestedDirectoryStore` can also set the
`memmap` flag and thus benefit from its behavior.
Simply extend all of the tests applied to `TestDirectoryStore`, except with the `memmap` flag set to `True` instead of its default value, `False`.
As memory-mapped files become invalid once the underlying files are deleted, `DirectoryStore`'s `pop` and `popitem` need to be implemented such that they don't hold invalid references to files removed from disk. To handle this, simply call `ensure_bytes` on any value retrieved to ensure it is copied into memory. This will be a no-op if memory-mapping was disabled, as the value would already be a `bytes` instance.
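
A sketch of that approach (assuming ensure_bytes from numcodecs.compat):

from numcodecs.compat import ensure_bytes

def pop(self, key):
    # copy the value into memory before the underlying file is removed so
    # the returned object does not reference a deleted file
    value = ensure_bytes(self[key])
    del self[key]
    return value
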
As stores have various requirements, like values needing to compare equal, NumPy `memmap` objects are not ideal for this purpose. They also subclass NumPy's `ndarray`, which makes it hard to get a proper NumPy `ndarray` that views the data from things like `ensure_ndarray`. Being a less commonly used object in NumPy, `memmap` is also less familiar to users. Finally, NumPy's docs state, "...[`memmap`] doesn't quite fit properly as a subclass." However, `memoryview`s bypass these issues while also providing a convenient container for constructing a NumPy `ndarray` with `ensure_ndarray` later. Hence we coerce the NumPy `memmap` object to a `memoryview`.

ref: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.memmap.html
Instead of indirectly using things like `__getitem__` and `__delitem__` in the `DirectoryStore` implementation of `pop`, use the file system directly.
Just simplifies things a bit.
Corrects how the path is determined.
@jbaayen commented Jun 4, 2021

Is there a plan to merge this PR at some point?

@jakirkham (Member, Author)

I don't personally have time to work on this. Though if this is of interest to you and you are willing to take it up, I would encourage you to resubmit, and we will close this out.

@joshmoore joshmoore added the help wanted Issue could use help from someone with familiarity on the topic label Sep 3, 2021
@jbaayen commented Sep 8, 2021

This can be implemented in recent Zarr versions using a simple subclass:

import numpy as np
from zarr import DirectoryStore

class MemoryMappedDirectoryStore(DirectoryStore):
    def _fromfile(self, fn):
        return memoryview(np.memmap(fn, mode="r"))
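
For example (assuming an existing array at the hypothetical path data/example.zarr):

import zarr

store = MemoryMappedDirectoryStore("data/example.zarr")
z = zarr.open(store, mode="r")  # chunk reads now go through memory maps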

@joshmoore (Member)

Interesting, so @jakirkham fixed it himself in #503? 😄

@joshmoore (Member)

Shall we close then @jakirkham or is there anything you want to capture? Maybe just @jbaayen's example in the tutorial?

@jakirkham (Member, Author)

Possibly. Though we may want this supported in NestedDirectoryStore as well, which would make this still desirable.

@joshmoore (Member)

Hmmm.... I thought that would have worked through inheritance. Ok. Happy to leave this open until someone gets a chance to look at it more closely.

@jakirkham (Member, Author)

Ah, meaning one should inherit from NestedDirectoryStore and then override this method? Sure, I suppose that would work.
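
A minimal sketch of that, reusing the same _fromfile override (untested):

import numpy as np
from zarr import NestedDirectoryStore

class MemoryMappedNestedDirectoryStore(NestedDirectoryStore):
    def _fromfile(self, fn):
        # same read-only memory-mapping as in the DirectoryStore subclass above
        return memoryview(np.memmap(fn, mode="r"))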

@jakirkham jakirkham deleted the branch zarr-developers:master May 23, 2022 20:57
@jakirkham jakirkham closed this May 23, 2022
@jakirkham (Member, Author)

FWIW, a variant using the Python builtin mmap instead of np.memmap could be:

import mmap
from zarr import DirectoryStore

class MemoryMappedDirectoryStore(DirectoryStore):
    def _fromfile(self, fn):
        with open(fn, "rb") as fh:
            # prot=mmap.PROT_READ is POSIX-only; access=mmap.ACCESS_READ is portable
            return memoryview(mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ))

@jakirkham (Member, Author)

Filed a doc issue ( #1245 )
