Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/whole file memmap #1230

Merged

Conversation

braingram
Copy link
Contributor

@braingram braingram commented Nov 14, 2022

Instead of memmaping each array (which consumes one open file descriptor per array) this PR will memmap the entire file (when enabled) and provide the mmap.mmap to each array.

This has the added benefit of greatly increasing open speed for files with a large number of arrays. For example the current jwst ifupost nirspec reference file has 240 arrays. With this PR (on my machine) open takes ~1.5 seconds compared to ~7 without this PR.

I'm opening this as draft to run the downstream tests.

instead of memmapping each array (which opens a new file per array
memmapped), memmap the file with the python standard library mmap
then use numpy.ndarray.__new__ to create views into the memmap for
each array (as per docs in numpy.memmap)
Previous to the PR when memmapping each memmapped array would
consume one open file. For files with large numbers of arrays
this increases the liklihood of hitting the maximum number of
open files for a single process and also drastically slows down
file open times.

This PR memmaps the entire file (when memmapping is enabled)
and provides the memmap (a python mmap.mmap object from the
standard library) to each memmapped array (via a call to
ndarray.__new__).

File updates (asdf.update) complicates the handling of this
whole file memmapping as it can result in changes to the file
on disk which require regeneration of the memmap. The test
test_block_data_change checks for this regeneration.
On Windows if 0 is provided as the number of bytes to mmap, the
file pointer end up at a negative index which causes issues with
seek and an OSError on __exit__. Instead, seek to the end of the
file, get the location with tell, and provide the non-zero size
to mmap.
Prior code closed all memmaps when BlockManger.close was called. This
was used during the edit command when the yaml grew larger than the
previously allocated space.

An call to close was added to allow tests of the edit command to
work on Windows.
Refactors memmap handling for AsdfFile.update into flush_memmap
and close_memmap added to generic_io classes.

Cleans up a few comments.
@braingram braingram force-pushed the feature/whole_file_memmap branch from a3fb9c8 to 0581bd8 Compare November 15, 2022 15:12
@braingram braingram marked this pull request as ready for review November 15, 2022 15:15
@braingram braingram closed this Nov 15, 2022
@braingram braingram reopened this Nov 15, 2022
asdf/asdf.py Outdated
Comment on lines 1263 to 1265
# close memmaps so they will regenerate
if fd.can_memmap():
fd.close_memmap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be under the finally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a much better way to do it. I'll make that change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully that will work. It is nice because it puts it cleanly into the context manager framework being used in that case.

Copy link
Contributor

@WilliamJamieson WilliamJamieson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall these changes look great.

Comment on lines -1227 to -1236
if self._memmapped and self._data is not None:
if NUMPY_LT_1_7: # pragma: no cover
try:
self._data.flush()
except ValueError:
pass
else:
self._data.flush()
if self._data._mmap is not None:
self._data._mmap.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, we should try to find more of these bits left over from supporting old versions of dependencies and remove them.

Comment on lines 619 to 631
def close_memmap(self):
"""
Close the memmapped file (if one was mapped with memmap_array)
"""
raise NotImplementedError()

def flush_memmap(self):
"""
Flush any pending writes to the memmapped file (if one was mapped with
memmap_array)
"""
raise NotImplementedError()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an error message here describing that this should be implemented if needed.

Also, do we need these implemented by all the child classes?

Copy link
Contributor Author

@braingram braingram Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a rather generic error message to these (and the NotImplementedError in memmap_array). Improvements to the wording are appreciated.

https://github.com/braingram/asdf/blob/feature/whole_file_memmap/asdf/generic_io.py#L617

I believe all of the calls to memmap_array, flush_memmap and close_memmap are within conditional blocks that check can_memmap. RealFile is the only subclass that returns True for can_memmap (I'm not sure if others would be possible) and RealFile has all of these implemented. I want to say a mixin might make sense here but I am open to suggestions.

Copy link
Contributor

@WilliamJamieson WilliamJamieson Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike mixins in general. I was more asking in case all the subclasses where implementing their own version in which case we would want this to be an abc method which is required to be implemented by the subclasses rather than having the error message.

Since this is the case I like what you have done.

Copy link
Contributor

@perrygreenfield perrygreenfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@WilliamJamieson WilliamJamieson merged commit 4f8481a into asdf-format:master Nov 21, 2022
@braingram braingram deleted the feature/whole_file_memmap branch November 22, 2022 00:18
braingram added a commit to braingram/asdf that referenced this pull request Dec 30, 2022
mmap access changes in whole file memmap asdf-format#1230 may have
made this flag redundant.
braingram added a commit to braingram/asdf that referenced this pull request Jan 3, 2023
mmap access changes in whole file memmap asdf-format#1230 may have
made this flag redundant.
braingram added a commit to braingram/asdf that referenced this pull request Jan 3, 2023
mmap access changes in whole file memmap asdf-format#1230 may have
made this flag redundant.
WilliamJamieson pushed a commit to braingram/asdf that referenced this pull request Jan 6, 2023
mmap access changes in whole file memmap asdf-format#1230 may have
made this flag redundant.
braingram added a commit to braingram/asdf that referenced this pull request Jan 9, 2023
mmap access changes in whole file memmap asdf-format#1230 may have
made this flag redundant.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants