ICU-22890 MF2: Add lone surrogate test to parser #3167
@echeran you have reviewed other MessageFormatter changes before. Could you also review this PR? Thanks
Force-pushed from 8699988 to 68b5ef5.
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
I rebased, and it turns out that #3063 fully fixed the infinite loop bug, so I removed all the commits except the one with the new tests.
Could you change the title of this PR from "Fix infinity loop in parser" to "Add lone surrogate test to MessageFormatter2 Parser"? Thanks
Could you also add a similar test for Java?
@FrankYFTang First, I added checks to your ICU4C test so as to require a syntax error. I also added a test for Java. I was hoping to add this to the shared data-driven tests, but I couldn't figure out how to escape the unpaired surrogate strings for JSON. So, the tests are separate for now. ICU4J wasn't erroring out on this case (though there was no infinite loop either), so I fixed that -- cc @mihnita. But maybe that should be a separate PR, since the ICU4J bug isn't critical. What do you think?
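For readers unfamiliar with the term: a lone (unpaired) surrogate is a UTF-16 code unit in the range 0xD800..0xDFFF that is not part of a well-formed high+low pair. The sketch below shows one way such input could be detected. `hasLoneSurrogate`, its two helpers, and `loneSurrogateExamples` are hypothetical names written for this illustration; they are not part of ICU or of this PR, and the actual parser may check this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not ICU APIs.
bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Returns true if `s` contains a surrogate code unit that is not part of a
// well-formed pair (high surrogate immediately followed by low surrogate).
bool hasLoneSurrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isHighSurrogate(s[i])) {
            if (i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
                ++i;  // skip the low half of a valid pair
            } else {
                return true;  // high surrogate with no low surrogate after it
            }
        } else if (isLowSurrogate(s[i])) {
            return true;  // low surrogate with no high surrogate before it
        }
    }
    return false;
}

// Small usage example; call it from main() or from a test.
void loneSurrogateExamples() {
    assert(hasLoneSurrogate(std::u16string(1, char16_t(0xD800))));   // lone high
    assert(hasLoneSurrogate(std::u16string(1, char16_t(0xDC00))));   // lone low
    assert(!hasLoneSurrogate(std::u16string{char16_t(0xD83D), char16_t(0xDE00)}));  // valid pair
    assert(!hasLoneSurrogate(std::u16string(u"abc")));               // no surrogates
}
```

The test strings are built from explicit code units rather than `\u` escapes because C++ does not allow a surrogate value as a universal character name in a string literal.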
    .build(errorCode);
UnicodeString result = msgfmt1.formatToString({}, errorCode);
assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();
why do we reset the errorCode here?
I thought that in intltest, each test has to reset the error code so it doesn't carry over to the next test -- is that not the case?
no, errorCode is a local variable in this test.
please remove the reset
If I remove the `errorCode.reset()` call, then the test fails. I assume this has to do with the reference to `this`:
IcuTestErrorCode errorCode(*this, "testHighLoneSurrogate");
and that there's some shared state that results in an error if the error code is non-success at the end of a test.
A lot of other tests in intltest have this pattern (`errorCode` declared as a local `IcuTestErrorCode` variable, and then `errorCode.reset()` at the end of the method).
so sorry, I got confused, it should be
errorCode.expectErrorAndReset(U_MF_SYNTAX_ERROR, "testHighLoneSurrogate");
instead of
assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();
same below
OK, fixed in a9a6de2
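The exchange above is easier to follow with a small model of the behaviour being described. The following is a minimal sketch, assuming the test error code reports a failure from its destructor when it still holds a non-success value, which would be consistent with the observation that the test fails once the reset is removed. `CheckedErrorCode`, `Status`, and the helper functions are hypothetical stand-ins written for this illustration; they are not ICU's `IcuTestErrorCode`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status values; real ICU code uses UErrorCode.
enum class Status { Ok, SyntaxError };

// Hypothetical stand-in for a test error code that is checked on destruction.
class CheckedErrorCode {
public:
    CheckedErrorCode(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) {}

    // A failure still set when the object goes out of scope is logged.
    ~CheckedErrorCode() {
        if (status_ != Status::Ok) {
            log_.push_back(name_ + ": destructor: expected success");
        }
    }

    void set(Status s) { status_ = s; }
    void reset() { status_ = Status::Ok; }

    // Verifies the expected failure and clears it in one step.
    bool expectErrorAndReset(Status expected) {
        bool matched = (status_ == expected);
        if (!matched) {
            log_.push_back(name_ + ": unexpected status");
        }
        reset();
        return matched;
    }

private:
    std::vector<std::string>& log_;
    std::string name_;
    Status status_ = Status::Ok;
};

// Number of failures logged when the expected error is left in place.
std::size_t failuresWithoutReset() {
    std::vector<std::string> log;
    {
        CheckedErrorCode errorCode(log, "testHighLoneSurrogate");
        errorCode.set(Status::SyntaxError);  // the expected parse failure
    }  // destructor runs here and logs a failure
    return log.size();
}

// Number of failures logged when the error is checked and cleared.
std::size_t failuresWithExpectAndReset() {
    std::vector<std::string> log;
    {
        CheckedErrorCode errorCode(log, "testHighLoneSurrogate");
        errorCode.set(Status::SyntaxError);
        errorCode.expectErrorAndReset(Status::SyntaxError);
    }  // destructor runs here with a clean status
    return log.size();
}

// Small usage example; call it from main() or from a test.
void checkedErrorCodeExamples() {
    assert(failuresWithoutReset() == 1);
    assert(failuresWithExpectAndReset() == 0);
}
```

Under that assumption, a local error code still needs clearing before it goes out of scope, and a combined check-and-reset call does the same job as the separate `assertEquals` plus `reset()` pair in a single statement.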
    .build(errorCode);
UnicodeString result = msgfmt2.formatToString({}, errorCode);
assertEquals("testLowLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();
why do we reset the errorCode here?
See above
LGTM
Force-pushed from fb0c873 to 5d95452.
I haven't been able to reproduce the fuzzer failure in the latest CI run yet, but I'll keep trying.
Unpaired surrogates are not an error, according to the spec.
Force-pushed from 5d95452 to 135a911.
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
135a911 should fix the fuzzer error.
I think they're a syntax error?
And the definition of
Force-pushed from a9a6de2 to af45a5b.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
I won't merge this right away so that @mihnita has a chance to reply further re: whether unpaired surrogates are a syntax error.
https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md
If the spec ended up saying something else, I am really not happy about it. I don't want to reopen that discussion. And I would not make such a change so very late. One might make an argument about the C++ implementation. But Java is UTF-16 everywhere, and it does not "explode" on improperly paired surrogates. (OK, deep in the belly of
@mihnita Looks like unicode-org/message-format-wg#290 is the PR that introduced this (from Aug. 2022).
I will open an issue with the MessageFormat WG. This is really unfriendly for Java, JavaScript, and the Windows C "wide APIs" (which take a wchar_t, which is 16 bits). It is not the job of a formatting function to validate and reject UTF correctness, at least in the above-mentioned environments. For example in Java we have
I will make my case in the WG issue. But I am against making this change in Java with this PR, sorry.
Please remove the Java changes from the PR.
We normally treat unpaired surrogates like unassigned code points.
Force-pushed from af45a5b to b3efbe2.
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
OK -- I've removed the Java test and changes. Needs a re-approval.
Thank you very much!
Mihai
Add a test to ICU4C for handling of lone surrogates.
Incidentally fix uninitialized-memory bug in MessageFormatter (initialize `errors` to nullptr).
Co-authored-by: Frank Tang
Force-pushed from b3efbe2 to 99fbca1.
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
I've created unicode-org/message-format-wg#895
Also see #3166