The code unit rules for _16 and _32 functions need to be more clearly explained #187
Comments
All functions use "number of characters" as size values, and all character buffers must be naturally aligned. Unaligned data cause slowdown or crashes (usually a bus error) depending on the CPU. The
That's a bit imprecise. It's not "number of characters" but "number of code points", because there are characters that consist of two code points.
This is easier to understand this way. Btw, characters and code points are the same thing; code units are the atomic parts of an encoding.
PCRE2 expects 16-bit data to be in vectors of uint16_t -- that is, in the natural BE/LE order for the current environment (same for 32-bit data). If a user is handling 16-bit data as individual bytes, it is their responsibility to get it into uint16_t code units before passing it to PCRE2. I will work on the documentation to try to make this clearer.
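As a minimal sketch of what "getting it into uint16_t code units" could look like, assume the bytes are known to be UTF-16LE (for example, read from a file). The helper name `utf16le_bytes_to_units` is hypothetical and not part of PCRE2. It builds each code unit arithmetically, so the result is in native order on both little-endian and big-endian hosts and the destination buffer is naturally aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): decode UTF-16LE bytes into
   native-order 16-bit code units. Returns the number of code units written.
   A trailing odd byte, if any, is ignored. */
static size_t utf16le_bytes_to_units(const unsigned char *bytes, size_t nbytes,
                                     uint16_t *units, size_t max_units)
{
    assert(bytes != NULL && units != NULL);

    size_t n = nbytes / 2;              /* two bytes per code unit */
    if (n > max_units) n = max_units;   /* never overrun the destination */

    for (size_t i = 0; i < n; i++) {
        /* Low byte first, then high byte: this is the LE wire order,
           independent of the host's own byte order. */
        units[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
    }
    return n;
}
```

The returned count is a length in code units, which is the kind of value the 16-bit matching functions expect as the subject length, with `units` as the subject pointer.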
I have added a few sentences to pcre2api.1 to try to state more clearly that string lengths are always in code units, not bytes.
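To make the three different counts concrete, here is a small sketch for a 16-bit buffer. The names `sample`, `SAMPLE_BYTES`, `SAMPLE_UNITS` and `utf16_code_points` are made up for illustration, and the counting function assumes well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "a", U+1F600 (encoded as the surrogate pair D83D DE00), "b" */
static const uint16_t sample[] = { 0x0061, 0xD83D, 0xDE00, 0x0062 };

enum {
    SAMPLE_BYTES = sizeof sample,                    /* size in bytes */
    SAMPLE_UNITS = sizeof sample / sizeof sample[0]  /* length in code units */
};

/* Count code points in a well-formed UTF-16 buffer of n code units.
   Illustrative only; PCRE2 does not need this value as a length. */
static size_t utf16_code_points(const uint16_t *s, size_t n)
{
    assert(s != NULL);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* A low (trailing) surrogate continues the previous code point,
           so only count units that are not in 0xDC00..0xDFFF. */
        if (s[i] < 0xDC00 || s[i] > 0xDFFF) count++;
    }
    return count;
}
```

Here the buffer is 8 bytes, 4 code units and 3 code points; only the code-unit count is a string length in the sense described above.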
After asking on the mailing list, I learned that I was wrong in assuming that pcre2_match_16 would take the search buffer size in byte units. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 before, and still misunderstood it.
What also misled me was that it was explained that `PCRE2_SIZE` is the same as `size_t`, which is clearly a byte size (otherwise, it could overflow for units > 1 byte).

May I ask that this is explained more clearly, with these points:
`PCRE2_MATCH_INVALID_UTF`), they also won't be found if they're not aligned (in this case, a work-around might be to search multiple times with the start offset moved by 1 or more bytes).

Also, not sure if that's explained: The search code is not Unicode-composition aware, i.e. when searching for Unicode chars with accents (or umlauts), one should perform multiple searches with all possible composition forms.
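The composition point can be sketched without PCRE2 at all. The example below uses a plain code-unit comparison as a stand-in for a matcher (it is not how PCRE2 works internally, and `find_units` is a hypothetical helper). It only shows that a precomposed "é" (U+00E9) and a decomposed "e" + U+0301 are different code-unit sequences, so a search for one form does not find the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Hypothetical helper: naive search for a code-unit sequence.
   Returns the offset in code units of the first hit, or NOT_FOUND. */
static size_t find_units(const uint16_t *hay, size_t hay_len,
                         const uint16_t *needle, size_t needle_len)
{
    assert(hay != NULL && needle != NULL);

    if (needle_len == 0 || needle_len > hay_len) return NOT_FOUND;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        /* memcmp takes a byte count, so convert from code units here. */
        if (memcmp(hay + i, needle, needle_len * sizeof *needle) == 0)
            return i;
    }
    return NOT_FOUND;
}

/* "cafe" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form) */
static const uint16_t subject_nfd[] = { 0x0063, 0x0061, 0x0066, 0x0065, 0x0301 };
static const uint16_t needle_nfc[]  = { 0x00E9 };          /* precomposed é */
static const uint16_t needle_nfd[]  = { 0x0065, 0x0301 };  /* e + accent    */
```

Searching `subject_nfd` for `needle_nfc` finds nothing, while `needle_nfd` is found at code-unit offset 3. Normalizing both pattern and subject to one form beforehand, or searching once per form, are the two obvious ways around this.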