-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fparse fails when set with 6 decimal points is followed by set with string value #93
Comments
Thanks, can reproduce. But now scratching my head as I cannot find (substrings of) the error message in either our glue code in @lemire Any quick guess where the string length constraint may come from? PS 0: On my system the error is > Error in .deserialize_json(json = json, query = query, empty_array = empty_array, :
basic_string::erase: __pos (which is 9) > this->size() (which is 8)
> PS 1: Oh I see now. This is specific to text columns. If it is all numbers we have no issue: > RcppSimdJson::fparse(
r"({
"tags":
[
{
"value":0.123456
},
{
"value":1.234567
}
]
})"
)
. + > $tags
value
1 0.123456
2 1.234567
> That opens more avenue for mischief: maybe it is Rcpp forming the column vector? Still can't locate the error though. |
@eddelbuettel The simdjson library itself avoids |
We use them / need them before materializing to R. @rjonesey I don't think there will be a simple or quick fix. Sorry. |
|
You might be overly pessimistic. :-) |
I'll prepare a PR. |
Wowser. Nice catch!! |
You all rock! Thank you so much. And thank you @rjonesey for filing an issue. This was discovered while we were testing new functionality in Thanks again for this wonderful package! |
Thanks everyone for the quick work. Apologies, did not mean to close the issue on comment. |
Tagged as micro-release 0.1.12.1 which you can install from the repo or (within the hour) from the r-universe page. |
The micro-release is now causing a new error with backslashes. RcppSimdJson::fparse(
r"({
"description":"\u0000"
})"
)
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : Embedded NUL in string.
RcppSimdJson::fparse(
r"({
"description":"\u"
})"
)
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : STRING_ERROR: Problem while parsing a string
RcppSimdJson::fparse(
r"({
"description":"u0000"
})"
)
#> $description
#> [1] "u0000" Created on 2024-07-18 with reprex v2.0.2 |
Can R support null chars in its strings? |
I think so -- utf-8, being newer, is available but you may need to enable it via a call to encoding. In the Rcpp package we have a few simple unit tests (and had them for a decade+): // [[Rcpp::export]]
String test_String_ctor_encoding2() {
String y("å");
y.set_encoding(CE_UTF8);
return y;
} |
Turns out that these days it also works without the encoding call. Welcome to 2024. > Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; s.set_encoding(CE_UTF8); return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('std::string doesThisWork() { auto s = std::string{"å"}; return s; }')
> doesThisWork()
[1] "å"
>
``` |
@eddelbuettel How do you feel about what this guy wrote in 2021? ![]() It seems that he thought that null characters were not allowed. I have no idea whether he was right. |
Was that 2021 post from our repo here? I am a little fuzzy on null characters -- but they are allowed in some strings. It may require a |
Here is something going to same direction (i.e. But yes, R being a C-based product sticks with C vectors so nul-terminated strings make it hard/impossible to assign it to standard character variables. |
It is just not a bug, it is a specification issue. @rjonesey is trying to load a JSON which has a string with a nul in it. Yet R does not accept strings with nuls. So what do you do? Note that the JSON specification is very clear, it does allow nuls. It seems that R is very clear that it does not. |
The key issue is still the R parser. You can't do it at the command-line: > a <- "\u0000"
Error: nul character not allowed (<input>:1:12)
> As demonstrated in #68, "below" R in an extension library such as So I think if we wanted to support this we'd have to "invent" a new representation. Which will of course not be portable as it will be our one-off. So I fear the standard answer of "there are limits in what we can parse as there are limits in what we can return" still holds. |
@rjonesey Given that R’s strings disallow nuls, can you elaborate on what your application is doing? |
Sure, I'll do my best here. Please let me know if the details I provide are insufficient. The fparse function is attempting to parse a response that contains a lot of metadata from images in a cloud service. The images are from hundreds of user uploads from their mobile device using an application called Survey123, an Esri product. These images are stored in an ArcOnline feature service, a sort of cloud object custom to ArcOnline. I am attempting to load the URLs of the images from the ArcOnline feature layer to then download the images themselves. I do not know why the metadata in the images contains so many nul characters. I think it has to do with the camera/phone settnigs of the individual users, as I can select specific records, and obtain the photos, while others fail. In the current function, if one photo fails the entire function fails. |
For completeness here is what the the widely-used (yet slower) > library(jsonlite)
> fromJSON(r"( { "description":"\u0000" } )")
$description
[1] ""
> fromJSON(r"( { "description":"\u" } )")
Error: lexical error: invalid (non-hex) character occurs after '\u' inside string.
{ "description":"\u" }
(right here) ------^
> fromJSON(r"( { "description":"u0000" } )")
$description
[1] "u0000"
> |
@JosiahParry may be able to speak more to the details and functionality. He developed the function and package being used. Thank you all. |
Very odd. The speed up with rcppsimdjson is nice. My example is likely flawed in the second part with the |
This may end up being a local deployment problem for you. I searched a little -- and some answer out there show using an argument 'skipNul' (or alike) where is exists ( |
The micro-release is not related to this new issue. In my view, it is not a bug in If you want the parser to act as if the |
100% agree. It is preferable to error out on unpresentable input than to continue and pretend the data were different. |
Thanks all for the very thoughtful conversation! As mentioned previously, we are using In this case, we have a product called Survey123 which is used to collect data from the "field" (anywhere in the world collected via phones and other devices). We are parsing metadata about imagery that is collected from Survey123 using RcppSimdJson. Many folks are capturing drone imagery and the metadata is in the exif format. So much of wonkiness is actually coming from the drone rather than anything else along in the process. In this case, our workaround is to make returning metadata optional which will address @rjonesey's needs. If this becomes a more persistent issue, I will see if I can find a lightweight R package—or make one—that can address embedded null characters in the JSON string. Again, and as always, thank you for the great work and being so helpful! |
Encountered an issue where this specific json won't parse when there are 6 decimal places followed by a string value. It will successfully parse when there five numbers after the decimal.
Please let me know if there is anything else I can do to help.
Thank you
Created on 2024-07-16 with reprex v2.0.2
The text was updated successfully, but these errors were encountered: