Size optimizations for range slices. #35

Open
wants to merge 7 commits into master
Conversation

@thomcc (Contributor) commented May 13, 2020

Note: This branch's parent is the branch from #34, so it includes that commit
too. Once that lands I can rebase this branch onto master.

This implements two size optimizations for slice range tables, described below.
Neither is particularly interesting or clever.

In addition to this, I also reran the table generation code (it had gone stale)
and changed ./scripts/generate-unicode-tables so that it can now optionally
take a path to the ucd-generate binary. This was mostly so I could add
benchmarks containing these optimizations.

Split character ranges

Instead of generating &[(u32, u32)], if --split-ranges is passed we emit two
tables: one containing the ranges that fit in u16, and one for those that don't.
Any range that spans the boundary is split into two, so that for a given char
you only ever have to search one table, chosen by the value of that character.

Note that this is incompatible with --char, as there's no way to use a char
literal and have 16-bit output.
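For illustration, here's a minimal sketch of splitting one range at the u16
boundary. This is not the generator's actual code; the split_range name is
hypothetical, and it assumes inclusive (start, end) ranges, as in the search
functions below:

fn split_range(start: u32, end: u32) -> (Option<(u16, u16)>, Option<(u32, u32)>) {
    const MAX16: u32 = u16::MAX as u32;
    if end <= MAX16 {
        // Fits entirely in the 16-bit table.
        (Some((start as u16, end as u16)), None)
    } else if start > MAX16 {
        // Fits entirely in the 32-bit table.
        (None, Some((start, end)))
    } else {
        // Straddles the boundary: one piece goes in each table.
        (Some((start as u16, u16::MAX)), Some((MAX16 + 1, end)))
    }
}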

How much this saves depends on the table, and the best case is a 50% size
reduction. In practice it does remarkably well for some tables, even beating
rustc's skiplist format on a few, despite how simple it is.

Searching is carried out by some function like this:

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_property_table(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)]),
) -> bool {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        // The character fits in 16 bits, so only the u16 table can contain it.
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    } else {
        let search_char32 = search_char as u32;
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char32 {
                Ordering::Greater
            } else if e < search_char32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    }
}

It's a little complex, but I don't think it warrants being put in a separate
library that users are asked to depend on.

Note that this also works with tables that map codepoint ranges to values. For
example: ucd-generate general-category --rust-enum --split-ranges will produce
a table like:

pub const GENERAL_CATEGORY: (
    &'static [(u16, u16, GeneralCategory)],
    &'static [(u32, u32, GeneralCategory)],
) = (&[
    // ...
], &[
    // ...
]);

When used like this, searching is similar (a sketch is below). However, for
cases like this, passing the --separate-values flag is beneficial.
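For reference, a minimal sketch of that similar search, with the values still
stored inline (the function name is hypothetical, and it assumes inclusive
ranges and a Copy value type):

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_mapped_table<V: Copy>(
    search_char: char,
    table: (&[(u16, u16, V)], &[(u32, u32, V)]),
) -> Option<V> {
    if let Ok(c16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e, _)| {
            if s > c16 {
                Ordering::Greater
            } else if e < c16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).ok().map(|i| table.0[i].2)
    } else {
        let c32 = search_char as u32;
        table.1.binary_search_by(|&(s, e, _)| {
            if s > c32 {
                Ordering::Greater
            } else if e < c32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).ok().map(|i| table.1[i].2)
    }
}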

Separated value arrays

The general category example above shows a flaw: even though
size_of::<GeneralCategory>() is 1, each entry is padded out to the alignment of
the range type, so the one-byte value effectively costs two or four bytes per
entry.

This is not a flaw specific to the split ranges code, and exists for the other
mapped ranges too.
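As a quick illustration (SomeEnum here is just a stand-in for a one-byte enum
like GeneralCategory; the exact sizes assume the usual repr(Rust) layout in
practice):

use std::mem::size_of;

// A stand-in for a one-byte fieldless enum like GeneralCategory.
#[allow(dead_code)]
enum SomeEnum { A, B }

fn main() {
    assert_eq!(size_of::<SomeEnum>(), 1);
    // In practice the tuple is padded to u32 alignment, so each entry
    // costs 12 bytes even though the payload is only 9.
    assert_eq!(size_of::<(u32, u32, SomeEnum)>(), 12);
    // With a u16 range table, each entry is 6 bytes for a 5-byte payload.
    assert_eq!(size_of::<(u16, u16, SomeEnum)>(), 6);
}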

The optimization places the "value" type into a separate array. That is, if your
table was a &[(char, char, SomeEnum)], this optimization turns it into
(&[(char, char)], &[SomeEnum]), where both slices have the same length.

To search this, you do a binary search over the range table and use the
resulting index to look up the value in the enum table.

use std::cmp::Ordering;

fn search_value_table<'a, S>(
    search_char: char,
    table: (&[(u32, u32)], &'a [S]),
) -> Option<&'a S> {
    table.0.binary_search_by(|&(s, e)| {
        if s > (search_char as u32) {
            Ordering::Greater
        } else if e < (search_char as u32) {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }).ok().map(|index| &table.1[index])
}

The benefit here is that neither slice has any padding bytes.

Additionally, this works with the --split-ranges option mentioned above, in
order to reduce size as far as possible.

One note: we don't emit two separate value arrays -- just one, whose length is
the sum of the lengths of the (u16, u16) and (u32, u32) tables. I don't think
this matters in practice. There are slight performance benefits this way
(shared object concerns), but I'd be willing to emit two. The point of
hesitation that pushed me in the current direction was that it would be easy to
mix up which value array was which, since they'd have the same type.

My hope is that with only one value array, there's nothing to mix up.

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_value_table<'a, S>(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)], &'a [S]),
) -> Option<&'a S> {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).map(|i| &table.2[i]).ok()
    } else {
        let search_char32 = search_char as u32;
        // The values for the 32-bit ranges are stored after the values for
        // the 16-bit ranges, so the index into the value array is offset by
        // the length of the 16-bit table.
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char32 {
                Ordering::Greater
            } else if e < search_char32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).map(|i| &table.2[i + table.0.len()]).ok()
    }
}

In practice, all of this together improves performance quite a bit -- more than
I would have expected; it seems to be one of the faster entries in the benchmark
(caches, presumably) -- but the main reason I added it was the size reduction.
As an example, when both optimizations are used on the General_Category property
value table, it shrinks from 38388 bytes to 18703 bytes.


Edit: A couple of notes I thought of:

  1. I'm in no way tied to these names. I don't think they're particularly good, even.
  2. I wish there were a place in ucd-generate's documentation to put more detailed descriptions of the table formats, and sample usage code. I remember not even being sure if the u32 pairs were inclusive or exclusive when I first used it.

@BurntSushi (Owner)

So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze. (Although I'm still inclined to merge it, since I see ucd-generate as a dumping grounds of sorts.)

I'm going to cut a new release soon with Unicode 14, but without this. (But with your [char; 3] PR.) Once that's out, if you wanted to get this PR into shape, I'll do my best to take a closer look at it soon and try to get it into another release.

@thomcc (Contributor, Author) commented Jul 6, 2022

> So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze.

Yeah it's hard to say how much it matters. There are two benefits:

  1. It's good for performance; binary search is moderately cache hostile, so you want to ensure that the data pulled in by each fetch is relevant. That said, I doubt these are frequently on the hot path in real code, and it would be very hard to measure, since micro-benchmarks for cold-cache situations are difficult to write.

  2. More importantly, in cases where the value is 1 byte (like enums and such), it reduces total table size by avoiding the padding required to align to a char/u32/u16. This can be pretty significant, as it ends up causing overhead on each entry. For (u16, u16, u8) this is ~16% of the table spent on padding, and for (u32, u32, u8), it's 25%.

    However, the padding can also be avoided by changing u16 to [u8; 2] and u32 to [u8; 3] (well, [u8; 4], but if you're going to do that, you might as well take advantage of the fact that Unicode scalars are 21 bits while you're at it). Then, ([u8; 3], [u8; 3], u8) and ([u8; 2], [u8; 2], u8) have no padding (in practice; obviously the compiler is free to do whatever it wants with repr(Rust) types, but it's reasonable to assume it won't decide to waste bytes for no reason). There's a sketch of this encoding after this list.

    That said, that would be something I'd add in another patch, as it's starting to get to a point where Discussion: Emitting code to search tables with complex layouts #39 is useful.
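For concreteness, a minimal sketch of the packed encoding described above. The
pack/unpack names and the little-endian layout are assumptions for illustration,
not code from this PR:

use std::mem::size_of;

// Hypothetical packed entry: start and end as 3-byte little-endian scalar
// values (Unicode scalars fit in 21 bits), plus a one-byte mapped value.
type PackedEntry = ([u8; 3], [u8; 3], u8);

// Encode a scalar value into 3 little-endian bytes.
fn pack(c: char) -> [u8; 3] {
    let [b0, b1, b2, _] = (c as u32).to_le_bytes();
    [b0, b1, b2]
}

// Decode back to u32 for comparisons during a binary search.
fn unpack(b: [u8; 3]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], 0])
}

fn main() {
    // 3 + 3 + 1 = 7 bytes, with no padding in practice.
    assert_eq!(size_of::<PackedEntry>(), 7);
    // Round-trip the largest scalar value.
    assert_eq!(unpack(pack('\u{10FFFF}')), 0x10FFFF);
}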

So, it's hard to say. Overall, my hunch is that while we can avoid the padding in this case, there are probably other cases where we'd like to split things out like that (for one reason or another). So, if the complexity would help out other cases that want to separate keys from values, it's probably worth it. If it's just for this case, and probably couldn't be reused, it's a lot more dubious. I haven't looked at the code in over two years either, so I don't remember the details.

Note: I haven't looked at this patch in over 2 years, it's also possible that I can make it simpler, I'll look when I get a chance.
