Size optimizations for range slices. #35

Open
wants to merge 7 commits into master
Conversation

@thomcc (Contributor) commented May 13, 2020

Note: This branch's parent is the branch from #34, so it includes that commit
too. Once that lands I can rebase this branch onto master.

This implements two size optimizations for slice range tables, described below.
Neither is particularly interesting or clever.

In addition to this, I also reran the table generation code (it had gone stale)
and changed ./scripts/generate-unicode-tables so that it can now optionally
take a path to the ucd-generate binary. This was mostly so I could add
benchmarks containing these optimizations.

Split character ranges

Instead of generating &[(u32, u32)], if --split-ranges is passed we emit two
tables: one containing the ranges that fit in u16, and one for those that don't.
Any range that spans the boundary is split into two, so that for a given char
you only ever have to search one table, chosen by the value of that character.

Note that this is incompatible with --char, as there's no way to use a char
literal and have 16-bit output.
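For illustration, here's a minimal sketch of splitting one range at the u16
boundary. This is not the generator's actual code; the split_range name is
hypothetical, and it assumes inclusive (start, end) ranges, as in the search
functions below:

fn split_range(start: u32, end: u32) -> (Option<(u16, u16)>, Option<(u32, u32)>) {
    const MAX16: u32 = u16::MAX as u32;
    if end <= MAX16 {
        // Fits entirely in the 16-bit table.
        (Some((start as u16, end as u16)), None)
    } else if start > MAX16 {
        // Fits entirely in the 32-bit table.
        (None, Some((start, end)))
    } else {
        // Straddles the boundary: one piece goes in each table.
        (Some((start as u16, u16::MAX)), Some((MAX16 + 1, end)))
    }
}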

How much this saves depends on the table, and the best case is a 50% size
reduction. In practice it does remarkably well for some tables, even beating
rustc's skiplist format on a few, despite how simple it is.

Searching is carried out by some function like this:

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_property_table(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)]),
) -> bool {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        // The character fits in 16 bits, so only the u16 table can contain it.
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    } else {
        let search_char32 = search_char as u32;
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char32 {
                Ordering::Greater
            } else if e < search_char32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).is_ok()
    }
}

It's a little complex, but I don't think it warrants being put in a separate
library that users are asked to depend on.

Note that this also works with tables that map codepoint ranges to values. For
example: ucd-generate general-category --rust-enum --split-ranges will produce
a table like:

pub const GENERAL_CATEGORY: (
    &'static [(u16, u16, GeneralCategory)],
    &'static [(u32, u32, GeneralCategory)],
) = (&[
    // ...
], &[
    // ...
]);

When used like this, searching is similar (a sketch is below). However, for
cases like this, passing the --separate-values flag is beneficial.
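For reference, a minimal sketch of that similar search, with the values still
stored inline (the function name is hypothetical, and it assumes inclusive
ranges and a Copy value type):

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_mapped_table<V: Copy>(
    search_char: char,
    table: (&[(u16, u16, V)], &[(u32, u32, V)]),
) -> Option<V> {
    if let Ok(c16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e, _)| {
            if s > c16 {
                Ordering::Greater
            } else if e < c16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).ok().map(|i| table.0[i].2)
    } else {
        let c32 = search_char as u32;
        table.1.binary_search_by(|&(s, e, _)| {
            if s > c32 {
                Ordering::Greater
            } else if e < c32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).ok().map(|i| table.1[i].2)
    }
}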

Separated value arrays

The general category example above shows a flaw: even though
size_of::<GeneralCategory>() is 1, each entry is padded out to the alignment of
the range type, so the one-byte value effectively costs two or four bytes per
entry.

This is not a flaw specific to the split ranges code, and exists for the other
mapped ranges too.
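As a quick illustration (SomeEnum here is just a stand-in for a one-byte enum
like GeneralCategory; the exact sizes assume the usual repr(Rust) layout in
practice):

use std::mem::size_of;

// A stand-in for a one-byte fieldless enum like GeneralCategory.
#[allow(dead_code)]
enum SomeEnum { A, B }

fn main() {
    assert_eq!(size_of::<SomeEnum>(), 1);
    // In practice the tuple is padded to u32 alignment, so each entry
    // costs 12 bytes even though the payload is only 9.
    assert_eq!(size_of::<(u32, u32, SomeEnum)>(), 12);
    // With a u16 range table, each entry is 6 bytes for a 5-byte payload.
    assert_eq!(size_of::<(u16, u16, SomeEnum)>(), 6);
}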

The optimization places the "value" type into a separate array. That is, if your
table was a &[(char, char, SomeEnum)], this optimization turns it into
(&[(char, char)], &[SomeEnum]), where both slices have the same length.

To search this, you do a binary search over the range table and use the
resulting index to look up the value in the enum table.

use std::cmp::Ordering;

fn search_value_table<'a, S>(
    search_char: char,
    table: (&[(u32, u32)], &'a [S]),
) -> Option<&'a S> {
    table.0.binary_search_by(|&(s, e)| {
        if s > (search_char as u32) {
            Ordering::Greater
        } else if e < (search_char as u32) {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }).ok().map(|index| &table.1[index])
}

The benefit here is that neither slice has any padding bytes.

Additionally, this works with the --split-ranges option mentioned above, in
order to reduce size as far as possible.

One note: we don't emit two separate value arrays -- just one, whose length is
the sum of the lengths of the (u16, u16) and (u32, u32) tables. I don't think
this matters in practice. There are slight performance benefits this way
(shared object concerns), but I'd be willing to emit two. The point of
hesitation that pushed me in the current direction was that it would be easy to
mix up which value array was which, since they'd have the same type.

My hope is that with only one value array, there's nothing to mix up.

use std::cmp::Ordering;
use std::convert::TryFrom;

fn search_split_value_table<'a, S>(
    search_char: char,
    table: (&[(u16, u16)], &[(u32, u32)], &'a [S]),
) -> Option<&'a S> {
    if let Ok(search_char16) = u16::try_from(search_char as u32) {
        table.0.binary_search_by(|&(s, e)| {
            if s > search_char16 {
                Ordering::Greater
            } else if e < search_char16 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).map(|i| &table.2[i]).ok()
    } else {
        let search_char32 = search_char as u32;
        // The values for the 32-bit ranges are stored after the values for
        // the 16-bit ranges, so the index into the value array is offset by
        // the length of the 16-bit table.
        table.1.binary_search_by(|&(s, e)| {
            if s > search_char32 {
                Ordering::Greater
            } else if e < search_char32 {
                Ordering::Less
            } else {
                Ordering::Equal
            }
        }).map(|i| &table.2[i + table.0.len()]).ok()
    }
}

In practice, all of this together improves performance quite a bit -- more than
I would have expected; it seems to be one of the faster entries in the benchmark
(caches, presumably) -- but the main reason I added it was the size reduction.
As an example, when both optimizations are used on the General_Category property
value table, it shrinks from 38388 bytes to 18703 bytes.


Edit: A couple of notes I thought of:

  1. I'm in no way tied to these names. I don't think they're particularly good, even.
  2. I wish there were a place in ucd-generate's documentation to put more detailed descriptions of the table formats, and sample usage code. I remember not even being sure if the u32 pairs were inclusive or exclusive when I first used it.

@BurntSushi (Owner)

So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze. (Although I'm still inclined to merge it, since I see ucd-generate as a dumping grounds of sorts.)

I'm going to cut a new release soon with Unicode 14, but without this. (But with your [char; 3] PR.) Once that's out, if you wanted to get this PR into shape, I'll do my best to take a closer look at it soon and try to get it into another release.

@thomcc (Contributor, Author) commented Jul 6, 2022

> So this all looks great. One question I have is how much the separated values part of this helps? It does seem to add a fair bit of complexity. So what I'm wondering is whether the juice is worth the squeeze.

Yeah it's hard to say how much it matters. There are two benefits:

  1. It's good for performance; binary search is moderately cache hostile, so you want to ensure that the data pulled in by each fetch is relevant. That said, I doubt these are frequently on the hot path in real code, and it would be very hard to measure, since micro-benchmarks for cold-cache situations are difficult to write.

  2. More importantly, in cases where the value is 1 byte (like enums and such), it reduces total table size by avoiding the padding required to align to a char/u32/u16. This can be pretty significant, as it ends up causing overhead on each entry. For (u16, u16, u8) this is ~16% of the table spent on padding, and for (u32, u32, u8), it's 25%.

    However, the padding can also be avoided by changing u16 to [u8; 2] and u32 to [u8; 3] (well, [u8; 4], but if you're going to do that, you might as well take advantage of the fact that Unicode scalars are 21 bits while you're at it). Then, ([u8; 3], [u8; 3], u8) and ([u8; 2], [u8; 2], u8) have no padding (in practice; obviously the compiler is free to do whatever it wants with repr(Rust) types, but it's reasonable to assume it won't decide to waste bytes for no reason). There's a sketch of this encoding after this list.

    That said, that would be something I'd add in another patch, as it's starting to get to a point where Discussion: Emitting code to search tables with complex layouts #39 is useful.
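For concreteness, a minimal sketch of the packed encoding described above. The
pack/unpack names and the little-endian layout are assumptions for illustration,
not code from this PR:

use std::mem::size_of;

// Hypothetical packed entry: start and end as 3-byte little-endian scalar
// values (Unicode scalars fit in 21 bits), plus a one-byte mapped value.
type PackedEntry = ([u8; 3], [u8; 3], u8);

// Encode a scalar value into 3 little-endian bytes.
fn pack(c: char) -> [u8; 3] {
    let [b0, b1, b2, _] = (c as u32).to_le_bytes();
    [b0, b1, b2]
}

// Decode back to u32 for comparisons during a binary search.
fn unpack(b: [u8; 3]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], 0])
}

fn main() {
    // 3 + 3 + 1 = 7 bytes, with no padding in practice.
    assert_eq!(size_of::<PackedEntry>(), 7);
    // Round-trip the largest scalar value.
    assert_eq!(unpack(pack('\u{10FFFF}')), 0x10FFFF);
}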

So, it's hard to say. Overall, my hunch is that while we can avoid the padding in this case, there are probably other cases where we'd like to split things out like that (for one reason or another). So, if the complexity would help out other cases that want to separate keys from values, it's probably worth it. If it's just for this case, and probably couldn't be reused, it's a lot more dubious. I haven't looked at the code in over two years either, so I don't remember the details.

Note: I haven't looked at this patch in over 2 years, it's also possible that I can make it simpler, I'll look when I get a chance.
