Zero-copy and 32-bit address spaces #182
Does Gimli have any established "realistic" loads with which to assess the cost? If you're going to reuse cache lines, then parsing will definitely have to stop borrowing data out of the file's mmap'd image. Have you looked into mapping only the portions of the executables that contain the debug info, instead of mapping the whole file?
I think the closest would be our https://github.com/gimli-rs/addr2line#performance

Perhaps @khuey, @jvns, and @kamalmarhubi have more here.
I have vague hopes for using
I have not. It should be possible to read just the object file's header to get the offset ranges for each section and then map just those sections into memory. If the section is very large, however, we are back at square one again. I'm not sure whether it is worth pursuing this approach or not.
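The map-only-the-sections idea mostly comes down to offset arithmetic: read the section's file offset and size from the object header, then `mmap` a page-aligned window around it. A minimal sketch of that arithmetic (hypothetical helper, not gimli code; the `mmap` call itself is omitted, and a 4K page size is assumed):

```rust
// Given a section's file offset and size (from the object header),
// compute the page-aligned window to mmap and how far into that
// mapping the section's bytes begin.

const PAGE_SIZE: u64 = 4096; // assumed page size

/// Returns (aligned_offset, map_len, delta): mmap `map_len` bytes at
/// `aligned_offset`; the section starts `delta` bytes into the mapping.
fn mmap_window(section_offset: u64, section_size: u64) -> (u64, u64, u64) {
    let aligned = section_offset & !(PAGE_SIZE - 1);
    let delta = section_offset - aligned;
    (aligned, delta + section_size, delta)
}
```

This only helps while individual debug sections are small; as noted above, a single very large section puts you back at square one.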
My libxul's

Unfortunately I don't have performance data for anything other than toy programs yet with my stuff.
Can we change the API to return offsets instead of references, so that any copies are delayed? For the mmap case, you would still be able to turn that into a zero-copy reference.

"the API" == the hypothetical

So if we wanted to read 10 contiguous bytes from the file, we return (a newtype of) a tuple

This would work great for strings we get out of
I meant anything in gimli's API that returns

So the idea is we return the offset instead of immediately doing an allocation+copy. For the non-mmap case, you'll probably still need to do the allocation+copy if you want to read it, but you may not always want to.
Yes this would be for things like
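The delayed-copy idea being discussed could be sketched roughly like this (all names hypothetical, not gimli's actual API): parsing hands back an offset+length into the image instead of a byte slice, and the caller resolves it only when the bytes are actually needed, so the mmap path stays zero-copy while the non-mmap path pays its copy on demand.

```rust
use std::borrow::Cow;

/// An unresolved byte range within the image (hypothetical newtype).
#[derive(Debug, Clone, Copy)]
pub struct ImageRange {
    pub offset: usize,
    pub len: usize,
}

/// Backing storage: an mmap'd slice (zero-copy), or buffered data
/// standing in for the non-mmap case where a copy is unavoidable.
pub enum Image<'a> {
    Mapped(&'a [u8]),
    Buffered(Vec<u8>),
}

impl<'a> Image<'a> {
    /// Resolve a range: the mmap case borrows, the buffered case
    /// copies only when the caller actually asks for the bytes.
    pub fn resolve(&self, r: ImageRange) -> Option<Cow<'_, [u8]>> {
        let end = r.offset.checked_add(r.len)?;
        match self {
            Image::Mapped(bytes) => bytes.get(r.offset..end).map(Cow::Borrowed),
            Image::Buffered(bytes) => {
                bytes.get(r.offset..end).map(|s| Cow::Owned(s.to_vec()))
            }
        }
    }
}
```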
I had a play with converting the API to use offsets, and it didn't work out very well. The problem is that we need to support values that might be offsets into one of a number of sections. So the API probably needs to return a struct of
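The "struct of section plus offset" shape that this implies might look roughly like the following (names hypothetical; the real set of sections would be larger):

```rust
// A returned value must record *which* section it points into as well
// as where, since it may be an offset into any of several DWARF
// sections.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionId {
    DebugInfo,
    DebugStr,
    DebugLine,
}

/// An unresolved value: the section it lives in, and where in it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SectionOffset {
    pub section: SectionId,
    pub offset: u64,
}

impl SectionOffset {
    pub fn new(section: SectionId, offset: u64) -> Self {
        SectionOffset { section, offset }
    }
}
```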
To make this a bit more concrete, here's a precis of what I did. We have an abstract type:

```c
typedef struct _DiImage DiImage;
typedef uint64_t DiOffT;
```

You can create an image from a local file ..

```c
DiImage* ML_(img_from_local_file)(const HChar* fullpath);
```

.. and by other means not relevant here (by connecting to a

Then you can ask some basic stuff:

```c
/* Virtual size of the image. */
DiOffT ML_(img_size)(const DiImage* img);

/* Does the section [offset, +size) exist in the image? */
Bool ML_(img_valid)(const DiImage* img, DiOffT offset, SizeT size);
```

You can copy arbitrary chunks out of the image:

```c
/* Get info out of an image. If any part of the section denoted by
   [offset, +size) is invalid, does not return. */
void ML_(img_get)(/*OUT*/void* dst,
                  DiImage* img, DiOffT offset, SizeT size);
```

which is general, but pretty damn awkward. So there are various

```c
/* Fetch various sized primitive types from the image. These operate
   at the endianness and word size of the host. */
UShort ML_(img_get_UShort)(DiImage* img, DiOffT offset);

/* Do strlen of a C string in the image. */
SizeT ML_(img_strlen)(DiImage* img, DiOffT off);
```

(and many more) and you can do some not-so-obvious things that have to be done in

```c
/* Calculate the "GNU Debuglink CRC" for the image. This
   unfortunately has to be done as part of the DiImage implementation
   because it involves reading the entire image, and is therefore
   something that needs to be handed off to the remote server -- since
   to do it otherwise would imply pulling the entire image across the
   connection, making the client/server split pointless. */
UInt ML_(img_calc_gnu_debuglink_crc32)(DiImage* img);
```

and finally of course

```c
/* Destroy an existing image. */
void ML_(img_done)(DiImage*);
```
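For comparison, a rough Rust rendering of the same DiImage shape (a hypothetical trait for illustration, not gimli's actual API) might look like this; the point is that every read goes through the image abstraction rather than through a borrowed slice, so a caching or remote backend can sit behind it:

```rust
pub trait DiImage {
    /// Virtual size of the image.
    fn size(&self) -> u64;

    /// Does [offset, offset+len) exist in the image?
    fn valid(&self, offset: u64, len: u64) -> bool {
        offset.checked_add(len).map_or(false, |end| end <= self.size())
    }

    /// Copy a chunk out of the image into `dst`.
    fn get(&self, dst: &mut [u8], offset: u64) -> Result<(), ()>;

    /// Convenience accessor, analogous to ML_(img_get_UShort).
    /// (Valgrind uses host endianness; little-endian here for simplicity.)
    fn get_u16(&self, offset: u64) -> Result<u16, ()> {
        let mut buf = [0u8; 2];
        self.get(&mut buf, offset)?;
        Ok(u16::from_le_bytes(buf))
    }

    /// strlen of a C string in the image, analogous to ML_(img_strlen).
    fn strlen(&self, mut offset: u64) -> Result<u64, ()> {
        let mut n = 0;
        loop {
            let mut b = [0u8; 1];
            self.get(&mut b, offset)?;
            if b[0] == 0 {
                return Ok(n);
            }
            n += 1;
            offset += 1;
        }
    }
}

/// Simplest possible backend: the whole image held in memory.
pub struct InMemoryImage(pub Vec<u8>);

impl DiImage for InMemoryImage {
    fn size(&self) -> u64 {
        self.0.len() as u64
    }

    fn get(&self, dst: &mut [u8], offset: u64) -> Result<(), ()> {
        let start = offset as usize;
        let end = start.checked_add(dst.len()).ok_or(())?;
        let src = self.0.get(start..end).ok_or(())?;
        dst.copy_from_slice(src);
        Ok(())
    }
}
```

A real version would put a caching or remote backend behind the same trait, which is exactly the swappable-backend property described above.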
Ecch. The previous comment window seems to have auto-transformed various
(I edited the comment for formatting.)
The downside of the above is of course that you have to copy out

The fact that this all had to be done in C (hence, no templates, traits,

Obviously the slowdown depends on how much stuff you haul out of
Oh, and the DWARF line number info too.
Status update: AIUI, @julian-seward1 went with a more incremental approach using code that already exists in mozilla-central. This is still something that I'd like to have sometime before any hypothetical 1.0 release (which we haven't talked about at all and seems very far in the future). Also, I haven't had many cycles for gimli and related projects lately, so I probably won't be doing this myself super soon either.
The Valgrind implementation of the segmented image machinery recently
Assuming that the structs you're returning are POD (which I think they are?) then you ought to be able to return

I keep thinking about this because I have yet to settle on the nicest way of parsing existing binary data formats in Rust. I was fiddling with something related that may be relevant:

In that example I was worried about returning references to unaligned data, which is not that hard to work around but does require a copy, but you could use the same principle to hand back a copy when you're not using mmap.
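The copy-to-avoid-unaligned-references workaround mentioned here can be very small (hypothetical helper): instead of handing back a `&u32` pointing into possibly-unaligned bytes, copy the four bytes into a local array and return the value.

```rust
// Read a little-endian u32 from an arbitrary (possibly unaligned)
// offset by copying the bytes out rather than reinterpreting them
// in place. Returns None if the range falls outside the buffer.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let s = buf.get(offset..end)?;
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(s);
    Some(u32::from_le_bytes(bytes))
}
```

The same shape works for the non-mmap case: the copy happens inside the accessor, and the caller never holds a reference into the image at all.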
gimli currently uses
`gimli` is zero-copy[0], which has the effect that whole ELF/MachO sections must be completely loaded into memory before they are given to `gimli`. When combined with large object files and large debug sections, this is problematic on 32-bit architectures, which have a limited address space.

Some Firefox folks are investigating using `gimli` for symbolicating Firefox's builtin profiler's backtraces, and this is a limiting issue. `libxul.so` alone is 1.2GB, and there just isn't enough contiguous vm address space for `mmap`ing it whole (and often plain not enough physical memory available).

Julian Seward (one of said Firefox folks) tells me he had this same issue with Valgrind: it used to map whole object files, and it constantly failed on 32-bit architectures (particularly ARM). He replaced Valgrind's use of raw base pointers and offsets with an abstraction layer to read individual values. Under the hood, the abstraction is basically a direct-mapped cache with a few very large (64K) lines. Eventually, they even gained swappable backend implementations for the abstraction, one of which transparently fetched debug info over the network.
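As I understand the design described here, the direct-mapped cache could be sketched like this (all names hypothetical; the `fetch` closure stands in for whatever backend fills a line, whether a file read or a network request):

```rust
// A direct-mapped cache over a backing image: a few large lines, each
// caching a 64K-aligned chunk. The line index is determined purely by
// the offset, so lookup is a single comparison plus a possible refill.

const LINE_SIZE: u64 = 64 * 1024;
const NUM_LINES: usize = 16;

struct Line {
    base: u64,     // image offset this line caches, LINE_SIZE-aligned
    data: Vec<u8>, // the cached bytes
    valid: bool,   // has this line been filled yet?
}

pub struct DirectMappedCache<F: FnMut(u64, &mut [u8])> {
    lines: Vec<Line>,
    fetch: F, // fills a buffer from the backing image at a given offset
}

impl<F: FnMut(u64, &mut [u8])> DirectMappedCache<F> {
    pub fn new(fetch: F) -> Self {
        let lines = (0..NUM_LINES)
            .map(|_| Line { base: 0, data: vec![0; LINE_SIZE as usize], valid: false })
            .collect();
        DirectMappedCache { lines, fetch }
    }

    /// Read one byte, faulting the containing 64K line in if needed.
    pub fn read_u8(&mut self, offset: u64) -> u8 {
        let base = offset & !(LINE_SIZE - 1);
        let idx = ((base / LINE_SIZE) as usize) % NUM_LINES;
        let line = &mut self.lines[idx];
        if !line.valid || line.base != base {
            (self.fetch)(base, &mut line.data);
            line.base = base;
            line.valid = true;
        }
        line.data[(offset - base) as usize]
    }
}
```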
So. Can we revisit the zero-copy approach and take an approach similar to what Valgrind did?
I think we could abstract reading raw values with a `Reader` trait, and make an `MmapWholeFile` implementation of this trait to (mostly) maintain the status quo. (It isn't clear to me if we will be able to maintain the extant zero-copies, however.) Then, we could have a second `Reader` trait implementation, `DirectMappedCache`, like what Valgrind does.

This will likely introduce more error propagation and `try!`s within `gimli` code, but I suspect that it won't change the API for users very much if at all, since parsing is already fallible.

As far as performance goes: I'll quote Julian's recounting of his experience doing this transition in Valgrind:
My hope is that we could mitigate even that slowdown by using the `MmapWholeFile` implementation of the `Reader` trait on 64-bit architectures by default in our `addr2line` and `dwarfdump` ports.

Finally, if we decide to continue down this road, Julian has volunteered to contribute the necessary changes.
Thoughts @philipc @jonhoo @tromey @khuey @kamalmarhubi @jvns @jimblandy?
Thanks!
[0] True, a bunch of stuff can't be and isn't actually zero-copy, like LEB128 parsing, but we avoid copies as much as possible right now.