The code unit rules for _16 and _32 functions need to be more clearly explained #187

tempelmann · 2023-01-11T16:21:03Z

After asking on the mailing list, I learned that I was wrong to assume that pcre2_match_16 would take the search buffer size in byte units. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 before, and still misunderstood it.

What also misled me was the explanation that PCRE2_SIZE is the same as size_t, which is clearly a byte size (otherwise, it could overflow for units > 1 byte).

May I ask that this is explained more clearly, with these points:

When using the 16- or 32-bit functions, the buffer length to pass is half or a quarter of the buffer's length in bytes.
The 16- or 32-bit unichars are expected at their word alignment. That means that (a) UTF-16BE chars won't be found in a buffer containing UTF-16BE and (b) if strings are searched in binary data (with the option PCRE2_MATCH_INVALID_UTF), they also won't be found if they're not aligned (in this case, a work-around might be to search multiple times with the start offset moved by 1 or more bytes).

Also, not sure if that's explained: The search code is not Unicode-composition aware, i.e. when searching for Unicode chars with accents (or umlauts), one should perform multiple searches with all possible composition forms.


zherczeg · 2023-01-11T17:05:43Z

All functions use "number of characters" as size values, and all character buffers must be naturally aligned. Unaligned data cause slowdown or crashes (usually bus error) depending on the CPU. The size_t type often represents a number of records, see fread or fwrite in libc. Here the record is a single character.
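As a rough illustration of the two rules above, the following sketch derives a length in code units from a byte length and checks that a pointer is naturally aligned for a given unit size. Both helper names are hypothetical and not part of PCRE2, and the alignment check assumes a platform where converting a pointer to uintptr_t yields its address, which holds on mainstream systems but is implementation-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; neither is part of PCRE2. */

/* Number of whole code units of `unit_size` bytes (1, 2 or 4) that fit in
 * `byte_len` bytes. This count, not byte_len, is the kind of length the
 * thread says the 16- and 32-bit functions expect. */
size_t code_units_in(size_t byte_len, size_t unit_size)
{
    assert(unit_size == 1 || unit_size == 2 || unit_size == 4);
    return byte_len / unit_size;
}

/* Nonzero if `p` sits on a multiple of `unit_size` bytes.
 * Assumption: the uintptr_t value of a pointer is its address. */
int is_unit_aligned(const void *p, size_t unit_size)
{
    assert(unit_size == 1 || unit_size == 2 || unit_size == 4);
    return ((uintptr_t)p % unit_size) == 0;
}
```

For a buffer declared as `uint16_t buf[4]`, `code_units_in(sizeof buf, sizeof buf[0])` gives 4 even though the buffer occupies 8 bytes.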

tempelmann · 2023-01-11T17:30:46Z

That's a bit imprecise. It's not "number of characters" but "number of code points", because there are characters that consist of two code points.


zherczeg · 2023-01-12T06:06:34Z

This is easier to understand this way. Btw characters and code points are the same thing, code units are the atomic part of encoding.


PhilipHazel · 2023-01-13T12:02:19Z

PCRE2 expects 16-bit data to be in vectors of uint16_t -- that is, in the natural BE/LE order for the current environment (same for 32-bit data). If a user is handling 16-bit data as individual bytes it is their responsibility to get it into uint16_t code units before passing it to PCRE2. I will work on the documentation to try to make this more clear.
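One way a caller might do that conversion, sketched under the assumption that the input is a byte buffer holding big-endian 16-bit units: build each uint16_t arithmetically from its two bytes, so the result is in native order whatever the host's endianness. The function name `utf16be_bytes_to_units` is hypothetical and not part of PCRE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): decode big-endian 16-bit units
 * from `src` into `dst`, which has room for `dst_units` elements. Each value
 * is assembled with shifts, so the output is in the host's native order on
 * both little- and big-endian machines. A trailing odd byte is ignored.
 * Returns the number of code units written. */
size_t utf16be_bytes_to_units(const unsigned char *src, size_t byte_len,
                              uint16_t *dst, size_t dst_units)
{
    assert(src != NULL && dst != NULL);

    size_t units = byte_len / 2;
    if (units > dst_units)
        units = dst_units;

    for (size_t i = 0; i < units; i++)
        dst[i] = (uint16_t)(((unsigned)src[2 * i] << 8) | src[2 * i + 1]);

    return units;
}
```

The return value is already a count of code units, so it can be used directly as the subject length. For little-endian input the two byte indices would be swapped.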

PhilipHazel · 2023-01-20T15:40:44Z

I have added a few sentences to pcre2api.1 to try to state more clearly that string lengths are always in code units, not bytes.

carenas added a commit to carenas/pcre2 that referenced this issue Jan 12, 2023

doc: document limitations with PCRE2_MATCH_INVALID_UTF

e1796ea

Closes: PCRE2Project#187

carenas mentioned this issue Jan 12, 2023

doc: document limitations with PCRE2_MATCH_INVALID_UTF #188

Closed

carenas added a commit to carenas/pcre2 that referenced this issue Jan 12, 2023

doc: document limitations with PCRE2_MATCH_INVALID_UTF

5ae5175

Closes: PCRE2Project#187

PhilipHazel closed this as completed Jan 20, 2023

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed
