Handling of invalid UTF-8 strings #24420
Great analysis, thanks. That change to `next` makes sense.
I have made a PR in #24423 to show how the changed `next` would work. Additionally I have identified a small related problem. I do not know if we want to care too much about invalid strings, but in general they happen in practice.
So I have a plan for this that I've been meaning to implement for quite some time. The basic idea is to allow bytes that are not part of a valid character to be represented as special `Char` values. Converting such an invalid `Char` to an integer code point would be an error. It seems like it's time to give this approach another try.
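For context, this is essentially what later shipped in Julia 1.0; the sketch below shows how such special `Char` values behave there (it post-dates this thread, so treat it as illustration rather than the design under discussion):

```julia
# Julia >= 1.0 behavior: malformed bytes decode to special, invalid Char values.
s = "\xe2\x41"      # 0xe2 expects continuation bytes; 0x41 ('A') is not one

c = s[1]            # '\xe2': a malformed Char carrying the original byte
isvalid(c)          # false
c == 'A'            # false: Chars compare by value, so invalid ones just work
print(stdout, s)    # round-trips the exact original bytes 0xe2 0x41
# UInt32(c)         # throws Base.InvalidCharError: there is no code point
```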
@StefanKarpinski It would be great to have it for 1.0. Is there a place where the design of this is laid out (given that I have recently spent some time on strings, I would gladly help work on it)? The key question is how the boundaries of invalid `Char`s would be determined.
The most straightforward approach is individual bytes. This is what many implementations already do – e.g. when you see mojibake in browsers, it is for individual invalid bytes. The most recent Unicode standard has some comment on this which, I believe, says something at odds with that approach – I'll have to dig it up.
I can't find the reference to Unicode changing its recommendation on invalid encoding handling :(
The standard http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf on page 129 recommends replacing each maximal subpart of an ill-formed subsequence with a single U+FFFD, where well-formed UTF-8 byte sequences are given by Table 3-7 there. I would recommend following the standard, as in my view it is important to be able to say that Julia 1.0 conforms 100% to it. This would mean applying that rule consistently in all string handling functions; see the sketch after this comment.
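To make the maximal-subpart rule concrete, here is a hedged sketch of such a substitution decoder in Julia (`replace_malformed` is my name for it, and the leading-byte classification is simplified: it does not enforce every range constraint of Table 3-7, so e.g. overlong encodings slip through):

```julia
# Replace each maximal subpart of an ill-formed UTF-8 subsequence with one U+FFFD.
function replace_malformed(bytes::AbstractVector{UInt8})
    out = Char[]
    i = 1
    while i <= length(bytes)
        b = bytes[i]
        # expected sequence length from the leading byte; 0 marks an invalid leader
        n = b < 0x80 ? 1 : b < 0xc2 ? 0 : b < 0xe0 ? 2 : b < 0xf0 ? 3 : b < 0xf5 ? 4 : 0
        if n == 0                      # lone continuation or invalid byte: one U+FFFD
            push!(out, '\ufffd'); i += 1; continue
        end
        j = i + 1                      # consume the continuation bytes that are present
        while j <= length(bytes) && j < i + n && bytes[j] & 0xc0 == 0x80
            j += 1
        end
        if j == i + n                  # complete sequence: decode it
            push!(out, String(bytes[i:j-1])[1])
        else                           # truncated: the whole maximal subpart -> one U+FFFD
            push!(out, '\ufffd')
        end
        i = j
    end
    return out
end

replace_malformed(b"\xe2\x88\x41")  # ['\ufffd', 'A']: "\xe2\x88" is one maximal subpart
```

For comparison, the individual-byte approach mentioned above would instead yield `['\ufffd', '\ufffd', 'A']` for the same input.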
The consequences spread across several interconnected functions. The bad thing is that (maybe except documentation, but this part is simple) probably all of it should be done in one big PR, as the changes are interconnected.
And actually #24414 will be even more important if this proposal is implemented.
Ah, thanks for finding that! I'm a bit ambivalent about the Unicode 10 recommendation here since, e.g., when processing data that is a mix of UTF-8 and Latin-1 (which is fairly common), it will emit replacement code points covering variable numbers of Latin-1 characters. I don't think the reasons they give for having a standard way to do this apply here, since this decoding will not interact with non-Julia systems. When writing an invalid character out, we should output the exact bytes that came in. This is at odds with "invalid characters are interpreted as U+FFFD in string processing": for comparison with other code points, we should use value equality; for output to a stream, we should just output the original bytes. In some sense, I'm not entirely sure that the subject of that section – i.e. "Best Practices for Using U+FFFD" – actually applies here. We're not actually replacing invalid data with `U+FFFD` at all.
I also meant not to change everything invalid to `U+FFFD`.

Given your proposal, the question is how you would convert such a string, say the first three bytes of a four-byte sequence followed by an ASCII byte.

My understanding was that according to the Unicode standard the first three bytes should be treated as a single invalid character and the last byte as a valid one.

Given my earlier proposal to set the highest byte to indicate an invalid character, it would be encoded as two characters.

How would the encoding of such an invalid UTF-8 byte sequence look in your approach?
So you mean when printing a quoted version of the string? In that situation it seems better to print it using `\x` escapes.
As three separate invalid characters followed by a valid one.
The standard specifies how to do replacement of invalid encodings with `U+FFFD`.
Sure, that's certainly an option. We may want to consider whether we can represent invalid data in other encodings as well, but perhaps that's a step too far. Invalid UTF-16 code units are easy enough since that can only be a single pair of bytes, but invalid UTF-32 code units are a little trickier since they can be arbitrarily large.
See above. I'm not sure what you consider "printing" to be here.
The representation I'm considering is zero-padded UTF-8 data in 32-bit chunks. That means you always just print 1 low byte and 2-3 additional non-zero bytes. Decoding becomes entirely a matter of deciding where to break the input stream into chunks.
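A sketch of that packing, assuming UTF-8 bytes left-justified in a `UInt32` with zero padding in the low bytes (`pack_utf8` is a hypothetical helper); this matches the `Char` representation Julia 1.0 eventually adopted, so the final check holds on Julia 1.0 and later:

```julia
# Pack 1-4 UTF-8 bytes left-justified into a UInt32, zero-padding the low bytes.
function pack_utf8(bytes::AbstractVector{UInt8})
    @assert 1 <= length(bytes) <= 4
    u = UInt32(0)
    for b in bytes
        u = (u << 8) | b                   # accumulate bytes high-to-low
    end
    return u << (8 * (4 - length(bytes)))  # left-justify: low bytes stay zero
end

pack_utf8([0xe2, 0x88, 0x80])                              # 0xe2888000, the bytes of '∀'
pack_utf8([0xe2, 0x88, 0x80]) == reinterpret(UInt32, '∀')  # true on Julia >= 1.0
```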
Amusingly, I just hit a case where the Unicode 10 behavior would be much easier to implement 😂
I like your approach and I think it is better than mine. The question of how to treat invalid UTF-8 is orthogonal - i.e. no matter which approach is chosen (the Unicode 10 approach or the single-byte approach), both can be safely implemented using the new `Char` representation.
Here's some work in progress for my proposed `Char` representation. Agree that the handling of invalid UTF-8 is orthogonal. And, yes, the Unicode 10 approach only requires a single byte of lookahead, which is considerably easier to deal with.
@StefanKarpinski When I run my benchmark given above I hit a problem (reduced only to what is relevant). Is this intended (I suppose not)?
Yup, that's a bug – thanks for testing and catching it! I've got a fix coming. Next week, after the feature freeze, I'm going to write a comprehensive test suite for all these string functions (and write faster versions of multistep `nextind` and `prevind`).
reported by bkamins in #24420
closing this as it is fixed now. |
Current handling of invalid UTF-8 strings is inconsistent between different string handling functions in Base.
I have seen similar discussions in the past, but they related to earlier string implementations, so I sum up here the kinds of problems I currently see (sorry for spamming if this issue was decided earlier).
Here goes the example, with comments:
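(A minimal reconstruction, assuming Julia 0.6-era Base; the bytes of `y2` are a plausible stand-in, not the original report's string, chosen to reproduce the behavior analyzed below.)

```julia
# Plausible stand-in (Julia 0.6 era): 0xe2 0x88 begin a three-byte sequence,
# but 0x41 ('A') is not a continuation byte, so the string is invalid UTF-8.
y2 = "\xe2\x88\x41"

length(y2)      # 2: counts non-continuation bytes (0xe2 and 0x41)
nextind(y2, 1)  # 3: assumes the next non-continuation byte starts a character
next(y2, 1)     # trusts utf8_trailing, consumes all 3 bytes: (garbage Char, 4)
collect(y2)     # preallocated with length(y2) == 2, but iteration stops after
                # one character, leaving garbage in the second element
```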
The reasons for those inconsistencies are:

- `is_valid_continuation` is used to check for valid indices (i.e. they assume that if something is not `is_valid_continuation` it starts a new character);
- `next` behaves in another way (and in particular is not consistent with `nextind`), as in `slow_utf8_next` it tries to load the whole width of the character as indicated by `utf8_trailing`, not checking if the following bytes are valid continuations.

This shows why `collect(y2)` produces garbage in the second element: `length(y2)` indicates that there are two characters, but `next(y2, 1)` jumps to `4`, beyond the end of `y2`.

I assume that there is no canonical way to handle invalid UTF-8 (since it is invalid), but I would recommend using a consistent rule in all string handling functions. The least breaking change that would make it consistent would be to change `next` so that it goes forward to the same point as `nextind` and, if the string is invalid, always returns `\ufffd` (now it may return garbage like `next(y2, 1)[1]`). A sketch of this proposed rule follows.
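A minimal sketch of that rule against the Julia 0.6-era iteration API; `next_consistent` is a hypothetical helper, not a Base function:

```julia
# Hypothetical helper: decode like `next`, but always advance to the same
# boundary as `nextind`, and emit U+FFFD whenever the data is malformed.
function next_consistent(s::String, i::Int)
    c, k = next(s, i)           # existing decoder: may overshoot and return garbage
    j = nextind(s, i)           # the boundary all index functions agree on
    if k != j || !isvalid(c)
        return ('\ufffd', j)    # malformed sequence: replacement character
    end
    return (c, j)
end

# With the reconstructed y2 = "\xe2\x88\x41":
# next_consistent(y2, 1) == ('\ufffd', 3) and next_consistent(y2, 3) == ('A', 4),
# which is consistent with length(y2) == 2.
```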