Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fparse fails when set with 6 decimal points is followed by set with string value #93

Closed
rjonesey opened this issue Jul 17, 2024 · 31 comments

Comments

@rjonesey
Copy link

Encountered an issue where this specific json won't parse when there are 6 decimal places followed by a string value. It will successfully parse when there five numbers after the decimal.

Please let me know if there is anything else I can do to help.

Thank you

RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.123456
    },
    {
      "value":"test"
    }
  ]
})"
)
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : basic_string

RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.12345
    },
    {
      "value":"test"
    }
  ]
})"
)
#> $tags
#>      value
#> 1 0.123450
#> 2     test

Created on 2024-07-16 with reprex v2.0.2

@eddelbuettel
Copy link
Owner

eddelbuettel commented Jul 17, 2024

Thanks, can reproduce. But now scratching my head as I cannot find (substrings of) the error message in either our glue code in src/ or in the included simdjson in inst/include.

@lemire Any quick guess where the string length constraint may come from?

PS 0: On my system the error is

> Error in .deserialize_json(json = json, query = query, empty_array = empty_array,  : 
  basic_string::erase: __pos (which is 9) > this->size() (which is 8)
> 

PS 1: Oh I see now. This is specific to text columns. If it is all numbers we have no issue:

> RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.123456
    },
    {
      "value":1.234567
    }
  ]
})"
)
. + > $tags
     value
1 0.123456
2 1.234567

> 

That opens more avenue for mischief: maybe it is Rcpp forming the column vector? Still can't locate the error though.

@lemire
Copy link
Collaborator

lemire commented Jul 17, 2024

@eddelbuettel The simdjson library itself avoids std::string as much as possible.

@eddelbuettel
Copy link
Owner

We use them / need them before materializing to R.

@rjonesey I don't think there will be a simple or quick fix. Sorry.

@lemire
Copy link
Collaborator

lemire commented Jul 17, 2024

out.erase(out.find_last_not_of('0') + 2, std::string::npos);

@lemire
Copy link
Collaborator

lemire commented Jul 17, 2024

I don't think there will be a simple or quick fix. Sorry.

You might be overly pessimistic. :-)

@lemire
Copy link
Collaborator

lemire commented Jul 17, 2024

I'll prepare a PR.

@eddelbuettel
Copy link
Owner

Wowser. Nice catch!!

@JosiahParry
Copy link

You all rock! Thank you so much. And thank you @rjonesey for filing an issue.

This was discovered while we were testing new functionality in arcgislayers::query_layer_attachments() to access image metadata. I will update the minimum required version of RcppSimdJson when a PR is made.

Thanks again for this wonderful package!

@rjonesey
Copy link
Author

rjonesey commented Jul 17, 2024

Thanks everyone for the quick work.

Apologies, did not mean to close the issue on comment.

@lemire
Copy link
Collaborator

lemire commented Jul 17, 2024

#94

@eddelbuettel
Copy link
Owner

Tagged as micro-release 0.1.12.1 which you can install from the repo or (within the hour) from the r-universe page.

@rjonesey
Copy link
Author

The micro-release is now causing a new error with backslashes.

  RcppSimdJson::fparse(
    r"({
  "description":"\u0000"
})"
  )
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : Embedded NUL in string.
  
  RcppSimdJson::fparse(
    r"({
  "description":"\u"
})"
  )
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : STRING_ERROR: Problem while parsing a string
  
  RcppSimdJson::fparse(
    r"({
  "description":"u0000"
})"
  )
#> $description
#> [1] "u0000"

Created on 2024-07-18 with reprex v2.0.2

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

Can R support null chars in its strings?

@eddelbuettel
Copy link
Owner

eddelbuettel commented Jul 18, 2024

I think so -- utf-8, being newer, is available but you may need to enable it via a call to encoding. In the Rcpp package we have a few simple unit tests (and had them for a decade+):

// [[Rcpp::export]]
String test_String_ctor_encoding2() {
    String y("å");
    y.set_encoding(CE_UTF8);
    return y;
}

@eddelbuettel
Copy link
Owner

Turns out that these days it also works without the encoding call. Welcome to 2024.

> Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; s.set_encoding(CE_UTF8); return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('std::string doesThisWork() { auto s = std::string{"å"}; return s; }')
> doesThisWork()
[1] "å"
> 
```

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

@eddelbuettel How do you feel about what this guy wrote in 2021?

Capture d’écran, le 2024-07-18 à 16 36 55

It seems that he thought that null characters were not allowed. I have no idea whether he was right.

@eddelbuettel
Copy link
Owner

Was that 2021 post from our repo here?

I am a little fuzzy on null characters -- but they are allowed in some strings. It may require a raw string though (so we would have to convert to and from). This comes up sometimes when serializing / deserializing. Digging ...

@eddelbuettel
Copy link
Owner

eddelbuettel commented Jul 18, 2024

Here is something going to same direction (i.e. raw): https://stackoverflow.com/a/55006350/23224962

But yes, R being a C-based product sticks with C vectors so nul-terminated strings make it hard/impossible to assign it to standard character variables.

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

Was that 2021 post from our repo here?

Yes. @rjonesey's comment was once an issue that you closed...

#68

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

Or maybe the reporter closed it:
Capture d’écran, le 2024-07-18 à 17 01 15

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

It is just not a bug, it is a specification issue. @rjonesey is trying to load a JSON which has a string with a nul in it. Yet R does not accept strings with nuls.

So what do you do?

Note that the JSON specification is very clear, it does allow nuls. It seems that R is very clear that it does not.

@eddelbuettel
Copy link
Owner

eddelbuettel commented Jul 18, 2024

The key issue is still the R parser. You can't do it at the command-line:

> a <- "\u0000"
Error: nul character not allowed (<input>:1:12)
>

As demonstrated in #68, "below" R in an extension library such as simdjson we can of course do whatever we want. But there is an issue in how you'd prepare any such results for return back to R in particular in an R application such as a JSON parser expected to return strings as you cannot form such strings.

So I think if we wanted to support this we'd have to "invent" a new representation. Which will of course not be portable as it will be our one-off.

So I fear the standard answer of "there are limits in what we can parse as there are limits in what we can return" still holds.

@lemire
Copy link
Collaborator

lemire commented Jul 18, 2024

@rjonesey Given that R’s strings disallow nuls, can you elaborate on what your application is doing?

@rjonesey
Copy link
Author

Sure, I'll do my best here. Please let me know if the details I provide are insufficient. The fparse function is attempting to parse a response that contains a lot of metadata from images in a cloud service. The images are from hundreds of user uploads from their mobile device using an application called Survey123, an Esri product. These images are stored in an ArcOnline feature service, a sort of cloud object custom to ArcOnline. I am attempting to load the URLs of the images from the ArcOnline feature layer to then download the images themselves. I do not know why the metadata in the images contains so many nul characters. I think it has to do with the camera/phone settnigs of the individual users, as I can select specific records, and obtain the photos, while others fail. In the current function, if one photo fails the entire function fails.

@eddelbuettel
Copy link
Owner

For completeness here is what the the widely-used (yet slower) jsonlite says:

> library(jsonlite)
> fromJSON(r"( { "description":"\u0000" } )")
$description
[1] ""

> fromJSON(r"( { "description":"\u" } )")
Error: lexical error: invalid (non-hex) character occurs after '\u' inside string.
                     { "description":"\u" } 
                     (right here) ------^
> fromJSON(r"( { "description":"u0000" } )")
$description
[1] "u0000"

> 

@rjonesey
Copy link
Author

@JosiahParry may be able to speak more to the details and functionality. He developed the function and package being used. Thank you all.

@rjonesey
Copy link
Author

For completeness here is what the the widely-used (yet slower) jsonlite says:

> library(jsonlite)
> fromJSON(r"( { "description":"\u0000" } )")
$description
[1] ""

> fromJSON(r"( { "description":"\u" } )")
Error: lexical error: invalid (non-hex) character occurs after '\u' inside string.
                     { "description":"\u" } 
                     (right here) ------^
> fromJSON(r"( { "description":"u0000" } )")
$description
[1] "u0000"

> 

Very odd. The speed up with rcppsimdjson is nice. My example is likely flawed in the second part with the \u I was going for simplicity of example, but I do not know if there are any standalone \u strings in the metadata. Reading it as an empty string as it does in the first example would be acceptable for my specific use case. But I do not know if that is acceptable here.

@eddelbuettel
Copy link
Owner

This may end up being a local deployment problem for you. I searched a little -- and some answer out there show using an argument 'skipNul' (or alike) where is exists (readLines, similar in data.table::fread) or advocate for a quick filter over the input data.

@lemire
Copy link
Collaborator

lemire commented Jul 19, 2024

@rjonesey

The micro-release is now causing a new error with backslashes.
But I do not know if that is acceptable here.

The micro-release is not related to this new issue. In my view, it is not a bug in rcppsimdjson. The problem is that you are providing JSON data with nuls in its strings. So \u0000 is valid in JSON but it cannot be represented inside an R string. Thus it is an error. This is uncommon in my experience and likely an indication that your system is buggy. And if you really are meant to have nuls in your systems, and to ingest them in R, then I submit to you that you have to design issue. As far rcppsimdjson is concerned, I think we want an error because there is something off.

If you want the parser to act as if the \u0000 were not present, then you can include in your code a fast filtering function that removes instances of \u0000. This can be done at very high speed.

@eddelbuettel
Copy link
Owner

100% agree. It is preferable to error out on unpresentable input than to continue and pretend the data were different.

@JosiahParry
Copy link

Thanks all for the very thoughtful conversation! As mentioned previously, we are using RcppSimdJson as our primary way of processing json from many different services. https://github.com/R-ArcGIS/

In this case, we have a product called Survey123 which is used to collect data from the "field" (anywhere in the world collected via phones and other devices). We are parsing metadata about imagery that is collected from Survey123 using RcppSimdJson.

Many folks are capturing drone imagery and the metadata is in the exif format. So much of wonkiness is actually coming from the drone rather than anything else along in the process.

In this case, our workaround is to make returning metadata optional which will address @rjonesey's needs.

If this becomes a more persistent issue, I will see if I can find a lightweight R package—or make one—that can address embedded null characters in the JSON string.

Again, and as always, thank you for the great work and being so helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants