Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --combined option #37

Merged
merged 5 commits into from
Nov 22, 2020
Merged

Add --combined option #37

merged 5 commits into from
Nov 22, 2020

Conversation

dylni
Copy link
Contributor

@dylni dylni commented May 13, 2020

Thanks for creating this very useful crate.

This adds a --combined option that works the same way as not adding the option, except all tables are combined into a single table. It can create smaller tables than --enum when it's only necessary to check if a code point is part of any variant.

It also supports related categories, which --enum doesn't.

Copy link
Contributor

@thomcc thomcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel very useful to me. The only one of these that has meaningful output at all is the SCRIPT table, and that's because we don't include Unknown/Zzzz in there (which might be a bug, IDK) -- The tables are &[(0, 0x10ffff)]. I guess maybe the idea is you could use it with --exclude / --include? Anyway, @BurntSushi's call.

If we're going to add it, it should be supported for all the other cases where we currently support --enum, probably...

Just in terms of the code itself I think it would be better if it tried harder to use the existing output functions in writer.rs.

I also suspect it would be better if the flag were on the writer.opts structure...

src/writer.rs Outdated
@@ -417,6 +458,29 @@ impl Writer {
Ok(())
}

fn range_set_slice(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is the same as ranges_slice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

src/writer.rs Outdated
/// Write a set that contains ranges of codepoints.
///
/// The smallest numeric type is used when applicable.
pub fn range_set(&mut self, name: &str, set: &[u32]) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you just turn the slice into BTreeSet instead you could use the existing ranges function here. This would make supporting trie output easy, as well as the --split-ranges option in #35, for example.

Actually supporting these might make the help text for the flags confusing though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

src/app.rs Outdated
@@ -221,6 +221,11 @@ pub fn app() -> App<'static, 'static> {
.arg(Arg::with_name("rust-enum").long("rust-enum").help(
"Emit a Rust enum and a table that maps codepoints to bidi class.",
))
.arg(
Arg::with_name("combined").long("combined").help(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better if you put this up with the other args and just clone it in. I don't know why --enum and --rust-enum don't, but they also have slightly varying help text for each one, which is probably why.

Copy link
Contributor Author

@dylni dylni May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I was using --rust-enum as a reference.

@dylni
Copy link
Contributor Author

dylni commented May 13, 2020

This doesn't feel very useful to me. The only one of these that has meaningful output at all is the SCRIPT table

I would only be using this for the general-category subcommand, which does have useful output. The others were only added for compatibility, but I can remove them if they don't make sense.

Just in terms of the code itself I think it would be better if it tried harder to use the existing output functions in writer.rs.

You're right. I didn't expect these functions to exist. It's much better to use them.

I also suspect it would be better if the flag were on the writer.opts structure...

I don't think this would make sense, but you know the code better than I do. This option is similar to --enum and --rust-enum, which also aren't on the structure. It also conflicts with both of those options.

@thomcc
Copy link
Contributor

thomcc commented May 13, 2020

Hmm... For a while I've wanted the ability to generate a custom table based on a combination of property bools / property values. For example, perl-word is an example of a custom table like this, as are a lot of things I've wanted.

Anyway, I'm going to file a bug for adding something like that -- it would have solved your use case, and some of mine. I think there are still open questions around that though, so I'm not saying that filing the bug makes this obsolete.

@BurntSushi
Copy link
Owner

@dylni This looks like a simple addition so I'm inclined to accept it if it's useful, but I don't quite understand the use case. Can you say more about why this is useful to you? It sounds like you want to use it with general-categry, but isn't every codepoint in exactly one general category? So how does this help you?

@dylni
Copy link
Contributor Author

dylni commented Nov 18, 2020

@BurntSushi Yes, I use it to generate this file. That crate only needs to check if characters belong to one of the "other" or "separator" categories, so searching one table is easier and more efficient.

If you remove the --combined option from the command, 3 tables are generated instead of 1, which is not optimal for my use case.

It sounds like you want to use it with general-categry, but isn't every codepoint in exactly one general category? So how does this help you?

Yes, I could search both tables, and that would very likely have a negligible impact on performance. However, since a smaller and simpler table can be generated, that is what I would prefer to include in my crate.

@BurntSushi
Copy link
Owner

@dylni Ah I see. I think the thing missing from your explanation is that you were only generating a subset of general category tables via the --include flag.

Can you add a little extra docs to the --combined flag here saying why someone might want to use it? Basically just a couple sentences from your comments here.

@dylni
Copy link
Contributor Author

dylni commented Nov 19, 2020

@BurntSushi

Ah I see. I think the thing missing from your explanation is that you were only generating a subset of general category tables via the --include flag.

Oh, I can see how this seems very strange without that context.

Can you add a little extra docs to the --combined flag here saying why someone might want to use it? Basically just a couple sentences from your comments here.

Sure, the last commit extends the documentation.

"Emit a single table with all included codepoint ranges. You might \
want to use this option when checking if characters belong to a \
subset of categories, since only one table will need to be checked. \
Searching the combined table should be simpler and more efficient.",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. Thank you!

Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@BurntSushi BurntSushi merged commit 957d038 into BurntSushi:master Nov 22, 2020
@BurntSushi
Copy link
Owner

This PR is on crates.io in ucd-generate 0.2.9.

@dylni
Copy link
Contributor Author

dylni commented Nov 22, 2020

Thanks! It's working very well

@dylni dylni deleted the add-combined-option branch November 22, 2020 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants