Subcommand to generate custom tables containing a combination of properties / property values. #38
Comments
Briefly, yes, I think this is a great idea. However, we unfortunately shouldn't use regex syntax for this. Or more specifically, we shouldn't use I am currently working on moving
Okay. Do you have a preference for how the parser is written? (e.g. hand-written, Also, not that it matters here, but... ucd-parse still uses
I guess I would question whether we need a parser at all, and whether we can do this via a command line interface. I guess if all we supported was a union operation, then it would be easy.
The linked issue is long, but I brought this up in it (I think), and apparently it's the conceptual dependency that's the problem? That is, any old regex engine could be substituted in One thing I was thinking about was making Otherwise, I generally hand-roll parsers. I'd prefer not bringing in any extra dependencies. But I also don't think we should re-implement the Unicode character class parser that's in So:
So, using I had then intended on using something only slightly less naive than the
I can think of a number of use cases for more advanced operations... But perhaps they aren't worth supporting. We could always add a variant later that accepted char set syntax, or something equivalently powerful.
This is also my preference, not that it seems like it will matter here.
Essentially: a flag for compiling a table that is roughly equivalent to a regex character class.
For example, currently the perl-word flag could be expressed as [\p{alpha}\p{joinc}\p{gc=digit}\p{gc=M}\p{gc=pc}], or something along those lines.
I also wanted this when I was experimenting with different options for getting a more accurate version of unicode-width (one that operates on strings and takes grapheme sequences into account, using heuristics to determine the actual rendered width on terminals).
When I've wanted this, I've resorted to just hacking it into a local branch of ucd-generate, or a separate project with a bunch of copy-pasted code. This would also make the work in #37 unnecessary, as it could be expressed as just a combination of \p{gc=...}s.
One option would be to accept regex char class syntax and use regex-syntax to parse it... The benefit here is that the syntax is quite rich and supports the full spectrum of set operations (for example, the include/exclude flags we currently support can't express intersection or inversion, and don't nest).
But I believe it would be using the version of the UCD it was compiled with? Is this true? Does it matter? I'm on my lunch break now and haven't investigated this, so it's possible it's easy and does exactly what we want.
However we do it, we'd want to express something like:
Which then could be evaluated in a fairly easy manner, ignoring many concerns around performance, error handling, etc:
Then, we'd use that to emit tables.
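To make the idea concrete, here is a minimal sketch of what such an expression tree and a naive evaluator could look like. Everything in it is an assumption for illustration: the `SetExpr` type, the `eval` function, and the `lookup` callback (standing in for "load the codepoints of a named property or property value from the UCD") are hypothetical names and not part of ucd-generate. It ignores performance and uses plain `String` errors.

```rust
use std::collections::BTreeSet;

/// Hypothetical expression tree for combining property sets.
#[derive(Debug, Clone)]
enum SetExpr {
    /// A named property or property value, e.g. "alpha" or "gc=digit".
    Prop(String),
    /// Codepoints in any of the operands.
    Union(Vec<SetExpr>),
    /// Codepoints in all of the operands.
    Intersection(Vec<SetExpr>),
    /// Codepoints in the first operand but not the second.
    Difference(Box<SetExpr>, Box<SetExpr>),
    /// Every codepoint in 0..=0x10FFFF that is not in the operand.
    Not(Box<SetExpr>),
}

/// Naively evaluate an expression into a sorted set of codepoints.
/// `lookup` is an assumed callback that resolves a property name.
fn eval<F>(expr: &SetExpr, lookup: &F) -> Result<BTreeSet<u32>, String>
where
    F: Fn(&str) -> Option<BTreeSet<u32>>,
{
    match expr {
        SetExpr::Prop(name) => {
            lookup(name).ok_or_else(|| format!("unknown property: {}", name))
        }
        SetExpr::Union(items) => {
            let mut out = BTreeSet::new();
            for item in items {
                out.extend(eval(item, lookup)?);
            }
            Ok(out)
        }
        SetExpr::Intersection(items) => {
            let mut iter = items.iter();
            // An empty intersection is treated as the empty set here.
            let mut out = match iter.next() {
                Some(first) => eval(first, lookup)?,
                None => return Ok(BTreeSet::new()),
            };
            for item in iter {
                let rhs = eval(item, lookup)?;
                out = out.intersection(&rhs).copied().collect();
            }
            Ok(out)
        }
        SetExpr::Difference(a, b) => {
            let lhs = eval(a, lookup)?;
            let rhs = eval(b, lookup)?;
            Ok(lhs.difference(&rhs).copied().collect())
        }
        SetExpr::Not(inner) => {
            let set = eval(inner, lookup)?;
            Ok((0..=0x10FFFFu32).filter(|cp| !set.contains(cp)).collect())
        }
    }
}
```

With something like this, the perl-word example would be a `Union` of five `Prop` nodes, and the resulting `BTreeSet<u32>` could be collapsed into ranges before emitting a table. A real implementation would probably work on ranges directly instead of individual codepoints.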