v2.7.3 Returns u2029 or u2028 from JSON.generate with script_safe (escape_slash) set to true for some UTF-8 characters #715

Closed
nvasilevski opened this issue Dec 3, 2024 · 2 comments · Fixed by #716


nvasilevski commented Dec 3, 2024

Affected characters: ["ဨ", "ဩ", "〨", "〩", "䀨", "䀩", "倨", "倩", "怨", "怩", "瀨", "瀩", "耨", "耩", "逨", "逩", "ꀨ", "ꀩ", "뀨", "뀩", "쀨", "쀩", "퀨", "퀩", "\uE028", "\uE029", "\uF028", "\uF029"]

Expected behavior (<2.7.3)

::JSON.generate({values:["倩", "瀨"]}, script_safe: true)
=> "{\"values\":[\"倩\",\"瀨\"]}"

Actual behavior (>=2.7.3)

::JSON.generate({values:["倩", "瀨"]}, script_safe: true)
=> "{\"values\":[\"\\u2029\",\"\\u2028\"]}"

Most likely the cause

https://github.com/ruby/json/pull/629/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR74-R85

Context

We seem to be unintentionally falling into the branch:

                case 3: {
                    unsigned char b2 = ptr[pos + 1];
                    if (RB_UNLIKELY(out_script_safe && b2 == 0x80)) {
                        unsigned char b3 = ptr[pos + 2];
                        if (b3 == 0xA8) {
                            FLUSH_POS(3);
                            fbuffer_append(out_buffer, "\\u2028", 6);
                            break;
                        } else if (b3 == 0xA9) {
                            FLUSH_POS(3);
                            fbuffer_append(out_buffer, "\\u2029", 6);
                            break;
 

Because

  1. Both characters, 倩 and 瀨, are 3 bytes long: ["倩", "瀨"].map(&:bytesize) => [3, 3]
  2. The second byte of each happens to be 0x80: ["倩", "瀨"].map { _1.bytes[1].to_s(16) } => ["80", "80"]
  3. The third bytes happen to be 0xA9 and 0xA8: ["倩", "瀨"].map { _1.bytes[2].to_s(16) } => ["a9", "a8"]
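To make the three points above concrete, here is a small sketch that prints the UTF-8 bytes of the two affected characters next to the two characters script_safe is meant to escape. `hex_bytes` is a hypothetical helper name introduced only for this illustration.

```ruby
# Hypothetical helper: UTF-8 bytes of a string as two-digit hex strings.
def hex_bytes(str)
  str.bytes.map { |b| format("%02x", b) }
end

affected = ["倩", "瀨"]           # U+5029, U+7028
intended = ["\u2029", "\u2028"]   # the characters script_safe should escape

# Each pair shares the second and third byte; only the first byte differs.
affected.zip(intended).each do |a, i|
  puts "U+%04X %s  vs  U+%04X %s" %
       [a.ord, hex_bytes(a).join(" "), i.ord, hex_bytes(i).join(" ")]
end
```

Only the lead byte (0xE5 or 0xE7 versus 0xE2) distinguishes each pair, which is exactly the byte the branch does not look at.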

Possible solution:

Should the condition also check that the first byte equals 0xE2? ["\u2028", "\u2029"].map { _1.bytes.first.to_s(16) } => ["e2", "e2"]
Something like:

if (RB_UNLIKELY(out_script_safe && ch == 0xE2 && b2 == 0x80)) {
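As a rough Ruby model of the two conditions (not the actual C code), the sketch below compares the current check with the proposed one. `buggy_match?` and `fixed_match?` are hypothetical names, and the model assumes the `case 3` branch is reached for any 3-byte UTF-8 sequence.

```ruby
# Model of the current condition: 3-byte character, second byte 0x80,
# third byte 0xA8 or 0xA9. The first byte is never inspected.
def buggy_match?(str)
  b = str.bytes
  b.size == 3 && b[1] == 0x80 && [0xA8, 0xA9].include?(b[2])
end

# Model of the proposed condition: additionally require the lead byte 0xE2.
def fixed_match?(str)
  buggy_match?(str) && str.bytes[0] == 0xE2
end

["\u2028", "\u2029", "倩", "瀨"].each do |s|
  printf("U+%04X buggy=%-5s fixed=%s\n", s.ord, buggy_match?(s), fixed_match?(s))
end
```

With the extra lead-byte check, only U+2028 and U+2029 still match, while 倩 and 瀨 pass through unescaped.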

I'll look into proposing a PR, but I don't mind if someone else is eager to propose a fix.


byroot commented Dec 3, 2024

Thanks :/

I released 2.9.0 with this fix.

nvasilevski changed the title from "v2.7.3 Changes the behavior of JSON.generate with script_safe (escape_slash) set to true for some Japanese characters" to "v2.7.3 Returns u2029 or u2028 from JSON.generate with script_safe (escape_slash) set to true for some UTF-8 characters" on Dec 3, 2024
@nvasilevski

Thanks! For better discoverability I changed the title. Here is the list of other characters that should trigger the same bug because their second and third bytes match:

["ဨ", "ဩ", "〨", "〩", "䀨", "䀩", "倨", "倩", "怨", "怩", "瀨", "瀩", "耨", "耩", "逨", "逩", "ꀨ", "ꀩ", "뀨", "뀩", "쀨", "쀩", "퀨", "퀩", "\uE028", "\uE029", "\uF028", "\uF029"]
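For anyone who wants to re-derive this list, a minimal sketch follows. It assumes the bug affects every valid 3-byte UTF-8 character whose second byte is 0x80 and whose third byte is 0xA8 or 0xA9, other than U+2028 and U+2029 themselves. The variable names are illustrative only.

```ruby
# Build every 3-byte sequence <lead> 0x80 <0xA8|0xA9> for lead bytes 0xE0..0xEF.
candidates = (0xE0..0xEF).flat_map do |lead|
  [0xA8, 0xA9].map do |last|
    [lead, 0x80, last].pack("C*").force_encoding(Encoding::UTF_8)
  end
end

# Drop invalid sequences (0xE0 0x80 .. is an overlong encoding) and the two
# characters that script_safe is supposed to escape.
collisions = candidates
  .select(&:valid_encoding?)
  .reject { |s| ["\u2028", "\u2029"].include?(s) }

puts collisions.map { |s| format("U+%04X", s.ord) }.join(" ")
puts collisions.size
```

The result runs from U+1028 through U+F029 and includes the private-use code points U+E028, U+E029, U+F028 and U+F029, which may not render in every font.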

byroot added a commit to byroot/ruby that referenced this issue Dec 5, 2024
byroot added a commit to ruby/ruby that referenced this issue Dec 5, 2024