Failing to read complex Unicode string embedded in JSON #4417

Closed
0xg0nz0 opened this issue Jul 14, 2024 · 3 comments

0xg0nz0 commented Jul 14, 2024

Description

I tried to load urltestdata.json with nlohmann-json and got:

parse error at line 4853, column 45: syntax error while parsing value - invalid string: surrogate U+D800..U+DBFF must be followed by U+DC00..U+DFFF; last read: '"http://example.com/\uD800\uD801'

But this is the official WHATWG URL validation test set, and multiple JSON validators that I tried online accept it.

Reproduction steps

A simple parse of the above file reproduces it:

    std::filesystem::path testSourceLocation(__FILE__);
    auto buildDir = testSourceLocation.parent_path() / "../../build";
    auto testFixturePath = buildDir / "urltestdata.json";

    std::ifstream testFixtureIn(testFixturePath);
    nlohmann::json testFixtureJson = nlohmann::json::parse(testFixtureIn);

Expected vs. actual results

I expected this file to parse without errors.

Minimal code example

See above.

Error messages

As per above:


parse error at line 4853, column 45: syntax error while parsing value - invalid string: surrogate U+D800..U+DBFF must be followed by U+DC00..U+DFFF; last read: '"http://example.com/\uD800\uD801'


Compiler and operating system

Ubuntu 22.04 (Noble) with gcc 13.2

Library version

3.11.3 (vcpkg)

Validation

- [ ] The bug also occurs if the latest version from the [`develop`](https://github.com/nlohmann/json/tree/develop) branch is used.
- [ ] I can successfully [compile and run the unit tests](https://github.com/nlohmann/json#execute-unit-tests).

0xg0nz0 commented Jul 14, 2024

Note I do have a hacky workaround for this in my CMakeLists.txt, which nicely demonstrates that it really is just that one test case which appears to be causing issues for nlohmann-json:


  # Download the WPT horror show of URL conformance tests
  set(JSON_URL "https://raw.githubusercontent.com/web-platform-tests/wpt/master/url/resources/urltestdata.json")
  set(JSON_DEST "${CMAKE_BINARY_DIR}/urltestdata.json")
  set(JSON_FILE "urltestdata.json")

  # Download the JSON file at configure time
  file(DOWNLOAD ${JSON_URL} ${JSON_DEST}
      STATUS download_status)

  # Check if the download was successful
  list(GET download_status 0 status_code)
  if(status_code EQUAL 0)
      message(STATUS "Downloaded ${JSON_FILE} from GitHub")
  else()
      message(FATAL_ERROR "Download of ${JSON_FILE} failed: ${download_status}")
  endif()

  # Remove bad test cases from the JSON file
  message(STATUS "Removing bad test case from ${JSON_FILE} with sed")
  execute_process(COMMAND sed -i 4852,4866d ${JSON_DEST}
                  RESULT_VARIABLE sed_result
                  ERROR_VARIABLE sed_error)

  # Check if the sed command was successful
  if(NOT sed_result EQUAL "0")
      message(FATAL_ERROR "Failed to execute sed command: ${sed_error}")
  endif()
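
The sed range above is tied to the file's current line numbers, so it would silently delete the wrong lines if the upstream fixture changes. As a rough sketch (not part of the original workaround, and not nlohmann-json code), the offending lines could instead be located by scanning the raw text for `\uXXXX` escapes that are not part of a valid surrogate pair. The function names below are hypothetical, and the scan assumes the input is otherwise well-formed JSON, where backslashes only occur inside strings. It needs C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: if a \uXXXX escape starts at `pos`, return its 16-bit value.
std::optional<std::uint16_t> read_u_escape(const std::string& s, std::size_t pos) {
    if (pos + 6 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') {
        return std::nullopt;
    }
    std::uint16_t value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i) {
        const char c = s[i];
        int digit = 0;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return std::nullopt;  // not four hex digits
        value = static_cast<std::uint16_t>(value * 16 + digit);
    }
    return value;
}

// Hypothetical helper: 1-based line numbers that contain an unpaired surrogate escape.
std::vector<std::size_t> lines_with_unpaired_surrogates(const std::string& text) {
    std::vector<std::size_t> lines;
    std::size_t line = 1;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '\n') { ++line; ++i; continue; }
        if (text[i] != '\\') { ++i; continue; }
        const auto unit = read_u_escape(text, i);
        if (!unit) { i += 2; continue; }  // some other escape, e.g. \\ or \"
        i += 6;                           // consume \uXXXX
        bool unpaired = false;
        if (*unit >= 0xD800 && *unit <= 0xDBFF) {
            // High surrogate: the next escape must be a low surrogate.
            const auto next = read_u_escape(text, i);
            if (next && *next >= 0xDC00 && *next <= 0xDFFF) {
                i += 6;                   // valid pair, consume the low half too
            } else {
                unpaired = true;
            }
        } else if (*unit >= 0xDC00 && *unit <= 0xDFFF) {
            unpaired = true;              // low surrogate without a preceding high one
        }
        if (unpaired && (lines.empty() || lines.back() != line)) {
            lines.push_back(line);
        }
    }
    return lines;
}
```

Running this over the downloaded file and printing the result would show which lines trip the parser. A test case spans several lines, so the reported line still has to be mapped to the enclosing JSON object before anything is removed.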

nlohmann commented Jul 15, 2024

The error message states the problem: \uD800\uD801 is an invalid surrogate pair.

  • \uD800 is a high-surrogate code unit, as it is in the range D800..DBFF
  • \uD801 is not a low-surrogate code unit, as it is not in the range DC00..DFFF.

This check is implemented in https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/input/lexer.hpp#L331.
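
The two range checks in the bullets above can be sketched in a few lines. This is an illustration of the rule only, not the code in lexer.hpp. The function names are hypothetical, and C++17 is assumed for `std::optional`.

```cpp
#include <cstdint>
#include <optional>

// High (leading) surrogate code units lie in D800..DBFF.
constexpr bool is_high_surrogate(std::uint16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Low (trailing) surrogate code units lie in DC00..DFFF.
constexpr bool is_low_surrogate(std::uint16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a surrogate pair into a code point, or return nullopt if the
// two code units do not form a valid high/low pair.
std::optional<char32_t> combine_surrogates(std::uint16_t high, std::uint16_t low) {
    if (!is_high_surrogate(high) || !is_low_surrogate(low)) {
        return std::nullopt;
    }
    const std::uint32_t code_point =
        0x10000u
        + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
        + (static_cast<std::uint32_t>(low) - 0xDC00u);
    return static_cast<char32_t>(code_point);
}
```

With these definitions, `combine_surrogates(0xD800, 0xD801)` yields no value because the second unit is itself a high surrogate, while a well-formed pair such as `0xD83D, 0xDE00` combines to U+1F600.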

I don't know why other validators accept this JSON, but it contains invalid UTF-8.

Update: References: https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G2630

0xg0nz0 commented Jul 15, 2024

Thanks @nlohmann. I will close this issue and report it directly to WHATWG. I can confirm that only this test case has the problem; possibly other validators do not check Unicode conformance as strictly.
