-
-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Size optimizations for range slices. #35
base: master
Are you sure you want to change the base?
Conversation
…irs and u32 pairs
So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze. (Although I'm still inclined to merge it, since I see I'm going to cut a new release soon with Unicode 14, but without this. (But with your |
Yeah it's hard to say how much it matters. There are two benefits:
So, it's hard to say. Overall, my hunch is that while we can avoid the padding in this case, there are probably other cases where we'd like to split things out like that (for one reason or another). So, if the complexity would help out other cases that want to separate keys from values, it's probably worth it. If it's just for this case, and probably couldn't be reused, it's a lot more dubious. I haven't looked at the code in over two years either, so I don't remember the details. Note: I haven't looked at this patch in over 2 years, it's also possible that I can make it simpler, I'll look when I get a chance. |
Note: This branch's parent is the branch from #34, so it includes that commit
too. Once that lands I can rebase this branch onto master.
This implements two size optimizations for slice range tables, described below.
Neither are that interesting or clever.
In addition to this, I also reran the table generation code (it had gone stale)
and changed
./scripts/generate-unicode-tables
so that it can now optionallytake a path to the
ucd-generate
binary. This was mostly so I could addbenchmarks containing these optimizations.
Split character ranges
Instead of generating
&[(u32, u32)]
, if--split-ranges
is passed we emit twotables; one including ranges that fit in u16, and one for those that do not. Any
that span the boundary are split in to two ranges, so that for a given
char
,you only have to test one table or the other, based on the value of that character.
Note that this is incompatible with
--char
, as there's no way to use achar
literal and have 16bit output.
In practice how much this saves depends on the table, and the upper bound is
reducing size to 50%. That said, in practice it does remarkably well for some
tables, even beating rustc's skiplist format on a few, despite how simple it is.
Searching is carried out by some function like this:
It's a little complex, but I don't think it warrants being put in a separate
library we ask them to depend on.
Note that this also works with tables that map codepoint ranges to values. For
example:
ucd-generate general-category --rust-enum --split-ranges
will producea table like:
When used like this, searching is similar. However, for cases like this, passing
the
--separate-values
is beneficial.Separated value arrays
The general category example above shows a flaw, which is that despite the fact
that
size_of::<GeneralCategory>()
is 1.This is not a flaw specific to the split ranges code, and exists for the other
mapped ranges too.
The optimization places the "value" type into a seprate array. That is, if your
table was a
&[(char, char, SomeEnum)]
, this optimization turns it into:(&[(char, char)], &[SomeEnum])
, where both slices have the same lengthTo search this, you do a binary search over the range table, and use the index
in the enum table.
The benefit here is that now, neither slice will have any padding values.
Additionally, this works with the --split-ranges option mentioned above, in
order to reduce size as far as possible.
One note is we don't emit two separate separated value arrays -- just one whose
length is the sum of the of the (u32, u32) and (u16, u16). I don't think this
matters in practice. There are slight performance benefits this way (shared
object concerns), but I'd be willing to have two -- a point of hesitation I had
that pushed me in the current direction was: it could be easy to mix up which
table was which, as they're the same type.
My hope is that if there's only one array, you'll look it up.
In practice, all this together improves performance a lot. More than I would
have expected -- it seems to be one of the faster entries in the benchmark --
(caches...), however the main reason I added it was for size reduction. As an
example, when both optimizations are used on the General_Category property value
table, it shrinks from 38388 bytes to 18703 bytes.
Edit: A couple of notes I thought of:
u32
pairs were inclusive or exclusive when I first used it.