The code unit rules for _16 and _32 functions need to be more clearly explained #187

Closed
tempelmann opened this issue Jan 11, 2023 · 5 comments

@tempelmann

After asking on the mailing list, I learned that I was wrong to assume that pcre2_match_16 takes the search buffer size in bytes. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 beforehand and still misunderstood it.

What also misled me was the explanation that PCRE2_SIZE is the same as size_t, which is clearly a byte size (otherwise, it could overflow for units > 1 byte).

May I ask that this be explained more clearly, covering these points:

  1. When using the 16-bit or 32-bit functions, the buffer length to pass is half or a quarter, respectively, of the buffer's length in bytes.
  2. The 16 or 32 bit unichars are expected at their word-alignment. That means that (a) UTF-16BE chars won't be found in a buffer containing UTF-16BE and (b) if strings are searched in binary data (with the option PCRE2_MATCH_INVALID_UTF), they also won't be found if they're not aligned (in this case, a work-around might be to search multiple times with the start offset moved by 1 or more bytes).
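
To illustrate point 1, here is a minimal sketch of the byte-to-code-unit conversion. The helper names `code_units16_from_bytes` and `code_units32_from_bytes` are hypothetical and not part of PCRE2; they only show the arithmetic a caller would do before passing a length to the 16-bit or 32-bit functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): number of 16-bit code
   units in a buffer whose size is known in bytes. */
size_t code_units16_from_bytes(size_t byte_len)
{
    /* A trailing odd byte cannot form a whole 16-bit code unit. */
    assert(byte_len % sizeof(uint16_t) == 0);
    return byte_len / sizeof(uint16_t);
}

/* Hypothetical helper: the same conversion for 32-bit code units. */
size_t code_units32_from_bytes(size_t byte_len)
{
    assert(byte_len % sizeof(uint32_t) == 0);
    return byte_len / sizeof(uint32_t);
}
```

With a `uint16_t subject[5]` array, `sizeof subject` is 10 bytes, while the length to pass is `code_units16_from_bytes(sizeof subject)`, that is, 5.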

Also, I am not sure whether this is explained: the search code is not Unicode-composition aware, i.e. when searching for Unicode characters with accents (or umlauts), one should perform multiple searches covering all possible composition forms.
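
A small sketch of why composition forms matter, assuming UTF-16 data. The names `units16_equal`, `E_ACUTE_NFC` and `E_ACUTE_NFD` are made up for this illustration. The letter "é" can be stored as the single code point U+00E9 or as U+0065 followed by the combining accent U+0301, and a comparison at the code unit level sees two different sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "é" in precomposed form: one code unit. */
const uint16_t E_ACUTE_NFC[] = { 0x00E9 };
/* "é" in decomposed form: 'e' followed by COMBINING ACUTE ACCENT. */
const uint16_t E_ACUTE_NFD[] = { 0x0065, 0x0301 };

/* Hypothetical helper: compares two UTF-16 sequences code unit by
   code unit, with lengths given in code units, not bytes. */
bool units16_equal(const uint16_t *a, size_t na,
                   const uint16_t *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    return na == nb && memcmp(a, b, na * sizeof(uint16_t)) == 0;
}
```

Here `units16_equal(E_ACUTE_NFC, 1, E_ACUTE_NFD, 2)` is false even though both sequences render as the same character, which is why a pattern written in one form would not be expected to match text stored in the other.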

@zherczeg
Collaborator

All functions use "number of characters" as size values, and all character buffers must be naturally aligned. Unaligned data causes slowdowns or crashes (usually a bus error), depending on the CPU. The size_t type often represents a number of records; see fread or fwrite in libc. Here the record is a single character.
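
The fread analogy can be sketched as follows. The wrapper `read_code_units16` is a hypothetical name used only for this example. The point is that the size_t returned by fread counts records of the given size, not bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wrapper: reads up to max_units 16-bit records from fp
   into buf. The return value is a count of records (code units),
   not a count of bytes. */
size_t read_code_units16(FILE *fp, uint16_t *buf, size_t max_units)
{
    assert(fp != NULL && buf != NULL);
    return fread(buf, sizeof(uint16_t), max_units, fp);
}
```

If a file holds three 16-bit values (6 bytes), the call returns 3, and 3 is also the length one would pass along with that buffer to a 16-bit matching function.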

@tempelmann
Author

That's a bit imprecise. It's not "number of characters" but "number of code points", because there are characters that consist of two code points.

@zherczeg
Collaborator

It is easier to understand this way. By the way, characters and code points are the same thing; code units are the atomic parts of the encoding.
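
The distinction between code units and code points can be made concrete with a short sketch for UTF-16. The function name `utf16_count_code_points` is hypothetical, and counting an unpaired surrogate as one code point is an assumption made only to keep the example simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a UTF-16 buffer whose
   length is given in code units. A high surrogate followed by a low
   surrogate is one code point stored in two code units. Unpaired
   surrogates are counted as one code point each (an assumption). */
size_t utf16_count_code_points(const uint16_t *s, size_t n_units)
{
    assert(s != NULL || n_units == 0);
    size_t count = 0;
    size_t i = 0;
    while (i < n_units) {
        int is_pair = s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                      i + 1 < n_units &&
                      s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        i += is_pair ? 2 : 1;
        count++;
    }
    return count;
}
```

For the sequence `0x0041, 0xD83D, 0xDE00` ("A" followed by U+1F600), the buffer holds 3 code units but only 2 code points. The size values discussed in this thread are the former.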

carenas added a commit to carenas/pcre2 that referenced this issue Jan 12, 2023
@PhilipHazel
Collaborator

PCRE2 expects 16-bit data to be in vectors of uint16_t, that is, in the natural BE/LE order for the current environment (the same holds for 32-bit data). If a user is handling 16-bit data as individual bytes, it is their responsibility to get it into uint16_t code units before passing it to PCRE2. I will work on the documentation to try to make this clearer.
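
As a rough sketch of that responsibility, the following assembles UTF-16BE bytes into native uint16_t code units. The function name `utf16be_bytes_to_units` is hypothetical and not part of PCRE2. Building each unit with shifts, rather than casting the byte pointer, gives the same result on little-endian and big-endian hosts and also avoids reading from an unaligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts UTF-16BE bytes into native-order
   uint16_t code units. 'out' must have room for byte_len / 2 units.
   Returns the number of code units written. */
size_t utf16be_bytes_to_units(const unsigned char *bytes, size_t byte_len,
                              uint16_t *out)
{
    assert(byte_len % 2 == 0); /* whole code units only */
    size_t n_units = byte_len / 2;
    for (size_t i = 0; i < n_units; i++) {
        /* First byte is the high half in big-endian order. */
        out[i] = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return n_units;
}
```

The returned count, not `byte_len`, is the value that corresponds to a string length in code units.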

@PhilipHazel
Collaborator

I have added a few sentences to pcre2api.1 to try to state more clearly that string lengths are always in code units, not bytes.
