-
-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add opt-in astral support to Unicode addons, without separate files #29
Comments
I like it! I’m just not sure using compressed data even for the astral ranges (and decompressing it in JavaScript) is the way to go. It just seems more efficient to simply include the “raw” ranges as strings, and then build the regular expressions as needed (possibly lazily/on-demand, as you suggested). I would be interested to see how much of a difference in file size the compressed ranges make after gzip. I assume the savings gained by compressing the data would be minimal, and wouldn’t be worth the overhead of having to decompress the data in JavaScript — but maybe that’s just me. |
I just did some tests that suggest you're probably right. These are very limited tests, but here's what I've got: Unicode Categories 1.2.0 (BMP only): Gzipped, unminified: Gzipped, unminified, with "\u" prefixes added and "\xhh" instead of "\u00hh" (where possible): Numbers for astral data are harder to generate, at the moment. Here is the astral portion of the After gzipping, that comes out at 229 bytes. Here is the same data, as a pregenerated regex pattern with surrogate pairs: http://pastebin.com/C8LHSnMq After gzipping, that's 250 bytes, or 9.2% larger. I think I'm willing to live with these numbers. Lets go without compression, at least for astral data. What do you think about BMP data, though? Should that also not be compressed? Are the modest savings not worth it? Keep in mind that the difference stacks up when loading all of the Unicode addon files. But then, not using any compression would mean avoiding any need for lazy caching. I'm leaning toward removing all compression, unless you or others disagree. |
1.3 KB is something, but what really matters (IMHO) is the difference between gzipping and not gzipping after minification. Any numbers on that? I’m still leaning towards “not worth it”, but that’s just me :) Either way, for BMP data it’s less work to decompress the data, so in that case run-time performance might not be much of an issue. But I definitely wouldn’t do it for the astral data. Just my €.02. |
My rough guess is that removing BMP compression will add 4 or 5 KB to Adding minification to the numbers I posted earlier probably wouldn't make much difference, since OK, so astral compression is out. BMP compression could go either way, but since we're both leaning against it, let's leave that out, too. What would be incredibly helpful is if you could update your XRegExp-specific category and script generators (and perhaps add a block generator, too) to match the following format. Separate BMP and all-plane generators are no longer needed. Since your unicode-data project doesn't yet include a generator for most of the properties in If you're able to put up a web page somewhere that accepts a list of 21-bit code points and spits out a range with surrogate pairs (or if someone is able to point me to an existing page that does this), I can manually update the data in Here's a format that I think will work well: [
{
name: 'Name',
alias: 'Alias_Name', // included only for categories
noMerge: true, // included only when needed
bmp: '\0-\uFFFF',
astral: '[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
},
…
] Notes:
No rush at all, for any of this. If you have suggestions for changing the generated output, I'm certainly interested. You're welcome to update Thanks again for all the work you've done to make XRegExp's Unicode support kick serious internationalized booty. :-) |
No worries, I’ll gladly update the XRegExp-specific range generators. Thanks for the detailed instructions. |
Much appreciated, good sir! FYI, I've edited my last comment to add more details, including the |
Some feedback:
If we’re gonna use an extra property anyway, why not split it up like this? [
{
name: 'Name',
alias: 'Alias_Name', // included only for categories
bmp: '[BMP-only range]',
astral: '[astral-only range, based on surrogate pairs]',
orphanedHighSurrogates: '[range that matches the standalone high surrogates]'
},
…
] (I’m sure you can come up with a more concise name for
http://git.io/unicode already takes care of all that. :) It also uses the U+0020 space symbol unescaped. I don’t unescape (or simplify the escapes of) other symbols, as the hexadecimal code point values are much easier to read in regular expression ranges IMHO. It looks like you agree.
Why not just an empty string instead of // Using `null`:
all = (bmp === null ? '' : bmp) + '|' + (astral === null ? '' : astral) + '|' + (orphanedHighSurrogates === null ? '' : orphanedHighSurrogates)
// Using empty strings:
all = bmp + '|' + astral + '|' + orphanedHighSurrogates Pending your feedback on my above suggestions, I have the scripts ready to generate output for categories, blocks and scripts. Here are some numbers:
If you can provide me with the code points as an array (e.g. Some properties require parsing |
Thanks for all the feedback. Definitely want to improve this where possible.
Three reasons. In order of least to most important:
I know the There is a significant side benefit to creating a function that can combine ranges, though. Hypothetically, you could significantly reduce the file size of If you're able to tackle the
Yeah, I noticed those in your generator, which is why I mentioned them. :) I don't personally agree about literal space rather than
Empty string instead of
(I take it those are numbers for unminified files.) Thanks for posting the numbers. But that's actually worse than I expected, and is kind of a tough sell considering that there is no benefit by default (unless you opt-in to astral mode and your data includes astral code points). Oh well, I suppose it's still better than including separate files for each addon. Gzipped numbers for the current BMP-only versions of the files:
An alternative approach might be to leave unicode-base.js, unicode-categories.js, unicode-scripts.js, unicode-blocks.js, and unicode-properties.js with BMP-only data (i.e., they use the new data format, but they leave out the Your thoughts? This alternative approach might require, e.g., changing the generators to accept arguments for whether they're outputting astral, BMP, or combined data (or just including separate astral and BMP generators, as is currently done for categories).
That would be awesome--thanks! That format is totally doable. I can send you a pastebin link or some such for the property data, after the other details about the generators are worked out. |
How would you feel about renaming Btw, according to my script, this property is only needed for the I also noticed a lot of blocks are missing from the current You can preview the changes here: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support Let me know if more tweaks are needed! Re: the alternative approach, I don’t know. I think it makes the most sense to match astral code points by default, as AFAIK that’s what e.g. |
I appreciate that you're trying to hold me to a good name that both defaults to The point of If the what or why of any of this is not fully clear, it might be easier to discuss it on Skype or similar. If you think that would help, email me at [email protected] for my Skype name.
Thanks for this list. Based on these, though, the only one that needs special handling (and therefore the There are two items in {
name: 'Any',
hasIndependentSets: true,
bmp: '\0-\uFFFF',
astral: '[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
// Alternatively, BMP code points (when included in astral) can come last:
// astral: '[\uD800-\uDBFF][\uDC00-\uDFFF]|[\0-\uFFFF]'
}
I imagine you're talking about the astral-only blocks that include code points 0x10000 and higher. I haven't included them so far since, up to now, XRegExp has only supported the BMP. For such blocks (as well as astral-only scripts), the
Looks cool. Still do need a few tweaks, though. Sorry for all the hassle and for not making things clear enough the first time around. The main changes needed are for the surrogate-related stuff, described above. There's one other thing, though: I see that you've wrapped the
Only a select few regex flavors support astral code points. Java only started supporting them via, e.g., |
Ok, I made
Done.
No problem, I got rid of the brackets.
Oh, in that case I fully agree with your new proposal!
I really don’t mind tweaking the generator — if you just let me know what exactly needs to be changed, I’ll be happy to change it so you get the output you need. |
Answered: #commitcomment-1472579. Unfortunately, the answer is complicated and leads to more questions. But I think we're almost there. |
Okay, I’ve applied the suggested changes (I think!): https://github.com/mathiasbynens/XRegExp/compare/add-astral-support Let me know if I missed anything, or if more tweaks are needed. |
Wo0t! Both the output and the generator scripts are looking lovely.
A couple last issues for
{
name: 'Bamum',
bmp: '\uA6A0-\uA6F7',
astral: '\uA6A0-\uA6F7|\uD81A[\uDC00-\uDE38]'
}, That should be: {
name: 'Bamum',
bmp: '\uA6A0-\uA6F7',
astral: '\uD81A[\uDC00-\uDE38]'
},
{
name: 'C',
alias: 'Other',
hasIndependentSets: true,
bmp: '\0-\x1F...\uFFFF',
astral: '\0-\x1F...\uFFFF|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
}, That should be: {
name: 'C',
alias: 'Other',
hasIndependentSets: true,
bmp: '\0-\x1F...\uFFFF',
astral: '[\0-\x1F...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
}, In your defense, I never mentioned anything about this rule. :) As we discussed on Twitter, I'm thinking of adding |
Thanks, the example output helps a lot! Updated: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support |
Schweeeeeet! But, ack! One last thing....sorry I didn't catch this last time!! Here's the current output for {
name: 'C',
alias: 'Other',
hasIndependentSets: true,
bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
astral: '[\0-\x1F...\uD7FC-\uF8FF...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]'
}, That includes high surrogates in the first alternation option in the The easiest and probably best way to fix this is to simply put the BMP value last (for items that include {
name: 'C',
alias: 'Other',
hasIndependentSets: true,
bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
astral: '\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]|[\0-\x1F...\uD7FC-\uF8FF...\uFFFF]'
}, Another more complicated way to fix this would be to put only the high surrogates last, like this: {
name: 'C',
alias: 'Other',
hasIndependentSets: true,
bmp: '\0-\x1F...\uD7FC-\uF8FF...\uFFFF',
astral: '[\0-\x1F...\uD7FC-\uD7FF\uDC00-\uF8FF...\uFFFF]|\uD808[\uDF6F-\uDFFF]|...|\uD82C[\uDC02-\uDFFF]|[\uD800-\uDBFF]'
}, This alternative approach is probably not worth it. |
Done: https://github.com/mathiasbynens/XRegExp/compare/add-astral-support If this looks alright, I’ll submit a pull request. |
Do eeet! Looks good to me. Thanks for your longsuffering work and collaboration on this and the extensive revisions. I'll start work on Unicode Base so it can work with this data. After this is merged and the Unicode addons are updated to work with the new data, I'll submit the code point data for |
OK, I've largely rewritten Unicode Base to work with the new data. I've also added some tests, and added the newly generated data to all of the Unicode addons except Unicode Properties (which doesn't yet have the new-style data). When plugging the data into the addons, I realized that, since you're now simply pushing the entire BMP range (rather than just high surrogates) to the end of There is one additional change I'm hoping you can take care of, since I'm not sure how you'd prefer to handle it: Because the generator outputs string escapes that end up being passed to XRegExp as literal characters (in other words, the output is The hex codes for each of the 13 metacharacters are as follows:
I'll be posting the data for |
See #34. I’ve made a commit implementing the first option, then another one that switched to the latter option (which you seemed to prefer), so you can always switch back to the other approach if you want. IMHO the first option is more readable, and I doubt the byte savings of using the other option would make a difference after gzip compression, but it’s your call :) Turns out the “compact” option results in a larger file after gzip compression. Using hexadecimal escapes (e.g. $ curl -s https://raw.github.com/mathiasbynens/XRegExp/caec8eaa3366b35a32b40e29821a045951f78200/tools/output/categories.js | gzip | wc -c
14187 Using simple backslash escapes (e.g. $ curl -s https://raw.github.com/mathiasbynens/XRegExp/c9cd440fcb6bd20d69414ed25b60533ea5987b0c/tools/output/categories.js | gzip | wc -c
14200 |
Merged. I reverted the second commit, so that the first option (hexadecimal escapes, e.g. Apart from FYI, all planned changes for the XRegExp 2.1.0 package are now done, with two exceptions:
The v2.1.0 changes are listed here. |
Right back at you! ⁵ |
OK, here be the 21-bit code points for the Unicode Level-1 properties used in
Here's the [0x00AD, 0x034F, 0x115F, 0x1160, 0x17B4, 0x17B5, '180B-180D', '200B-200F', '202A-202E', '2060-206F', 0x3164, 'FE00-FE0F', 0xFEFF, 0xFFA0, 'FFF0-FFF8', '1D173-1D17A', 'E0000-E0FFF'] Note that, for ranges, I've included them as hyphen-delimited strings, rather than listing out each code point. If you'd like me to change the data to list out each code point as a number, just let me know. |
Having each code point as a number (rather than the ranges) would be much easier for me, as it would allow me to re-use the existing I already have the data for these three properties: |
No response for four days, so I’ll assume it’s not straightforward to generate the data in the format I requested. No worries, I just wrote a script that converts your data into a format You can sanity-check the output here: #36 By the way, may I ask how you generated this data manually? I’m stilll looking in to how it could be done efficiently in unicode-data. |
Thank you!! I've been travelling for the last few days, which was the main reason I hadn't gotten back to you. But the solution you came up with (using a new Python generator script to process the data I provided earlier) is easier and better than what I likely would have done. To generate the data manually, I used UnicodeSet from unicode.org. E.g., for \p{Alphabetic}. Using UnicodeSet's output, it just takes a few regex search-and-replace operations in a text editor to produce the data I provided. UnicodeSet doesn't support I suppose it's entirely possible to scrape UnicodeSet output for unicode-data, but then (AFAIK) you could only work with the latest version of Unicode. Is it possible that PyICU might give you some tools or data to work with that would be helpful toward this? At long last, this issue is closedededed. 😄 |
@mathiasbynens, @walling, this issue picks up from #25, since that's now a closed/merged pull request.
Prior to merging the opt-in astral support from the Unicode Categories Astral addon into the (default) Unicode Categories addon (which is automatically included in
xregexp-all.js
, and therefore in the npm package), following are the changes I think would be beneficial:Prerequisite:
Data for base categories like
\p{L}
needs to be added, not just\p{Ll}
,\p{Lu}
, etc.Changes:
The
XRegExp.addUnicodePackage
function in Unicode Base should change from accepting an object with BMP data and an object with optional aliases to instead accept the following array:The above data will be stored in the private
unicode
object, without any preprocessing. Two new private lookup objects will be added:bmp
andastral
. These won't be populated automatically, but instead augmented on first use of each Unicode name in a regex. In other words, these are used to cache generated data.Astral ranges with surrogate pairs will be built and cached in JavaScript code, on first use.
For scripts and blocks that exist only within astral planes, the
bmp
property of the objects accepted byaddUnicodePackage
should be set tonull
. For addons that include astral support, theastral
property should always be included, withnull
as the value for properties that have no astral code points. Theastral
property shouldn't be included by addons that don't yet include astral support.The
\p{…}
(etc.) syntax token handler in Unicode Base should be updated to checkXRegExp.isInstalled("astral")
in itshandler
(main) function. Iftrue
, combine data from thebmp
andastral
objects, and throw aSyntaxError
if the matchscope
is"class"
. Thetrigger
function currently used inunicode-categories-astral.js
will no longer be necessary.Since the BMP and astral data will be split up, these changes shouldn't inflate
unicode-categories.js
too much. At least, BMP data will not be included twice.With these changes in place, separate BMP and all-plane addons won't be needed, and users can opt in or out of astral support at any time.
The text was updated successfully, but these errors were encountered: