-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/whole file memmap #1230
Feature/whole file memmap #1230
Conversation
instead of memmapping each array (which opens a new file per array memmapped), memmap the file with the python standard library mmap then use numpy.ndarray.__new__ to create views into the memmap for each array (as per docs in numpy.memmap)
Previous to the PR when memmapping each memmapped array would consume one open file. For files with large numbers of arrays this increases the liklihood of hitting the maximum number of open files for a single process and also drastically slows down file open times. This PR memmaps the entire file (when memmapping is enabled) and provides the memmap (a python mmap.mmap object from the standard library) to each memmapped array (via a call to ndarray.__new__). File updates (asdf.update) complicates the handling of this whole file memmapping as it can result in changes to the file on disk which require regeneration of the memmap. The test test_block_data_change checks for this regeneration.
On Windows if 0 is provided as the number of bytes to mmap, the file pointer end up at a negative index which causes issues with seek and an OSError on __exit__. Instead, seek to the end of the file, get the location with tell, and provide the non-zero size to mmap.
Prior code closed all memmaps when BlockManger.close was called. This was used during the edit command when the yaml grew larger than the previously allocated space. An call to close was added to allow tests of the edit command to work on Windows.
Refactors memmap handling for AsdfFile.update into flush_memmap and close_memmap added to generic_io classes. Cleans up a few comments.
a3fb9c8
to
0581bd8
Compare
asdf/asdf.py
Outdated
# close memmaps so they will regenerate | ||
if fd.can_memmap(): | ||
fd.close_memmap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be under the finally
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a much better way to do it. I'll make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully that will work. It is nice because it puts it cleanly into the context manager framework being used in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall these changes look great.
if self._memmapped and self._data is not None: | ||
if NUMPY_LT_1_7: # pragma: no cover | ||
try: | ||
self._data.flush() | ||
except ValueError: | ||
pass | ||
else: | ||
self._data.flush() | ||
if self._data._mmap is not None: | ||
self._data._mmap.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, we should try to find more of these bits left over from supporting old versions of dependencies and remove them.
asdf/generic_io.py
Outdated
def close_memmap(self): | ||
""" | ||
Close the memmapped file (if one was mapped with memmap_array) | ||
""" | ||
raise NotImplementedError() | ||
|
||
def flush_memmap(self): | ||
""" | ||
Flush any pending writes to the memmapped file (if one was mapped with | ||
memmap_array) | ||
""" | ||
raise NotImplementedError() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an error message here describing that this should be implemented if needed.
Also, do we need these implemented by all the child classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a rather generic error message to these (and the NotImplementedError in memmap_array). Improvements to the wording are appreciated.
https://github.com/braingram/asdf/blob/feature/whole_file_memmap/asdf/generic_io.py#L617
I believe all of the calls to memmap_array, flush_memmap and close_memmap are within conditional blocks that check can_memmap. RealFile is the only subclass that returns True for can_memmap (I'm not sure if others would be possible) and RealFile has all of these implemented. I want to say a mixin might make sense here but I am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike mixins in general. I was more asking in case all the subclasses where implementing their own version in which case we would want this to be an abc method which is required to be implemented by the subclasses rather than having the error message.
Since this is the case I like what you have done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
Instead of memmaping each array (which consumes one open file descriptor per array) this PR will memmap the entire file (when enabled) and provide the mmap.mmap to each array.
This has the added benefit of greatly increasing open speed for files with a large number of arrays. For example the current jwst ifupost nirspec reference file has 240 arrays. With this PR (on my machine) open takes ~1.5 seconds compared to ~7 without this PR.
I'm opening this as draft to run the downstream tests.