Error with reading large integer? #369

Closed
bsbkeven opened this issue May 22, 2014 · 18 comments

@bsbkeven

Hello,

I am running jq (version 1.3), and find that it seems to be reading large integers (>= 17 digits) incorrectly. For example, if I run
jq '.id'
on the following input:

    {"id" : 125276004817190914}
    {"id" : 12527600481719091}
    {"id" : 1252760048171909}
    {"id" : 12527.6004817190914}

the output is:

    125276004817190910
    12527600481719092
    1252760048171909
    12527.600481719091

Results on the first and second lines (18 and 17 digits) are incorrect, but the third line (16-digit integer) and the fourth (floating-point number) are correct.

Seems like this hasn't been reported before. So I just want to bring it up and wonder if anyone has had this issue as well? Thanks!

@nicowilliams
Contributor

It's been reported many times, actually :)

The problem is that jq uses C doubles to represent numbers, and on pretty
much all modern systems that's an IEEE 754 double, which can only represent
integers without loss between -2^53..2^53. 125276004817190914 is about 14
times larger than the largest integer that jq can represent losslessly,
therefore jq can only approximate it.
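
The 2^53 limit can be checked without jq. The following is a minimal Python sketch, assuming only that Python's `float` is an IEEE 754 double (true on mainstream platforms); the helper name `through_double` is made up for illustration. It pushes the three integers from the report through a double and back.

```python
# Python's float is a C double, so int -> float -> int shows what a
# double-only number representation can preserve.

def through_double(n: int) -> int:
    """Round-trip an integer through an IEEE 754 double."""
    return int(float(n))

LIMIT = 2**53  # 9007199254740992

for n in (125276004817190914, 12527600481719091, 1252760048171909):
    result = through_double(n)
    status = "exact" if result == n else "rounded"
    print(n, "->", result, status)

# Above 2**53, neighbouring integers start sharing one double:
print(float(LIMIT) == float(LIMIT + 1))  # True
```

The first value comes back as 125276004817190912, which is consistent with the 125276004817190910 in the report if that double is then printed with 17 significant digits.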

@bsbkeven
Author

Oh I see. Thanks for the clarification, and thanks for the great work!

Cheers,
Yiye


@pkoppstein
Contributor

jq's handling of numbers is simply wrong, at least with respect to its own documentation. The description of "." is:

This is a filter that takes its input and produces it unchanged as output.

Since the documentation is in accordance with the philosophy of JSON, it seems to me that jq should do what the documentation says. Of course it would be more than acceptable if there were an option that would determine how numbers are to be handled.

@nicowilliams
Contributor

. is an expression in a jq program, but it does no parsing -- it deals in
already-parsed values only.

jq programs consume parsed JSON values. It is the jq processor that
arranges to do parsing and encoding, as well as the to_json, from_json,
and @json functions/expressions. In terms of internal code organization,
a jq program is run by the jq_start() and jq_next() C functions, which are
called from the main program (main()); the latter parses values with
jv_parse*(), and passes parsed values to jq_start(). I can see that the
difference between "a jq program" (e.g., .[][]) and "the jq program"
(i.e., the executable named jq) is confusing, and the manual doesn't
document it.

The jq parser (i.e., the jv_parse*() C functions) only supports IEEE 754
double representations of numbers. Many JSON implementations do this, but
not all; some use 32 and 64 bit signed integers, others use bignum
representations (exact for any rational that fits in memory). There are
interoperability considerations as a result, and these aren't really jq's
fault. Of course, jq could switch to a bignum representation, but that'd
be a fair bit of work, and would require a suitable bignum library with
acceptable license.
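
The effect of the parser's number representation can be illustrated outside jq. This is a rough Python sketch, not jq's code: it uses the standard `json` module, whose `parse_int` hook decides what an integer literal becomes. Passing `float` stands in for a double-only parser; the default stands in for a bignum parser.

```python
import json

doc = '{"id": 125276004817190914}'

# Double-only parser: every integer literal goes through float().
as_double = json.loads(doc, parse_int=float)

# Default behaviour: integer literals become arbitrary-precision ints.
as_bignum = json.loads(doc)

print(int(as_double["id"]))  # 125276004817190912
print(as_bignum["id"])       # 125276004817190914
```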

As for options... jq will almost certainly never have any run-time
options for number representation and handling. A build-time option for
IEEE 754 double or bignum is about the most that can be expected.

@pkoppstein
Contributor

In response to nicowilliams -- If the problem is one of design, so be it, but it is a problem. Infinite precision arithmetic would obviate the problem for integers; perhaps decimals should be parsed as strings, and only converted to floats when jq is given an instruction to perform an arithmetic operation.

To highlight the fact that some JSON tools get it right, consider:

 $ echo 12311111111111111111111111321 | jq -M .
 12311111111111112000000000000

 $ echo 12311111111111111111111111321 | jsonpp
 12311111111111111111111111321

@nicowilliams
Contributor

Indeed, some JSON tools do what you like. However, many don't. The only
numbers that reliably interoperate with all JSON implementations are ones
with exact 32-bit signed integer representations. Next most portable are
signed integers in the -2^53..2^53 range (the range of integers that IEEE 754
doubles can represent exactly). After that come real numbers with exact IEEE
754 double representations. Beyond that you're not likely to interop with
many implementations at all.
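
These tiers can be turned into a small check. The sketch below is illustrative only: the function name `interop_tier` and the tier labels are invented here, and it assumes Python's `float` is an IEEE 754 double. It classifies a JSON number literal by the most portable tier it falls into.

```python
from decimal import Decimal

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
SAFE_INT_LIMIT = 2**53

def interop_tier(literal: str) -> str:
    """Classify a JSON number literal by how widely it should interoperate."""
    value = Decimal(literal)  # exact decimal value of the literal
    is_integer = value == value.to_integral_value()
    if is_integer and INT32_MIN <= value <= INT32_MAX:
        return "int32"
    if is_integer and abs(value) <= SAFE_INT_LIMIT:
        return "safe-integer"
    # Decimal(float) is the exact value of the double, so equality here
    # means the literal survives conversion to a double unchanged.
    if Decimal(float(literal)) == value:
        return "exact-double"
    return "lossy"

for lit in ("42", "1252760048171909", "0.5", "125276004817190914", "0.1"):
    print(lit, interop_tier(lit))
```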

The problem here isn't just jq. It's JSON. JSON didn't (RFC4627) and
doesn't (RFC7159) specify that arbitrary bignums must be supported.

JavaScript implementations, for example, only handle IEEE 754 doubles.

@pkoppstein
Contributor

This is a response to nicowilliams's post.

There is a distinction to be drawn between JSON as defined at json.org and specified by ECMA (http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf), and JSON as envisioned by http://tools.ietf.org/html/rfc7159.

JSON as defined at json.org and by ECMA just has a number type. There is no concept of "precision" or any limit specifically for numbers. As far as the definition at json.org is concerned, JSON numbers are essentially strings for representing decimals with a finite representation. As the ECMA specification says:

 JSON is agnostic about numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal. That can make interchange between different programming languages difficult. JSON instead offers only the representation of numbers that humans use: a sequence of digits. All programming languages know how to make sense of digit sequences even if they disagree on internal representations. That is enough to allow interchange. 

On the other hand, the draft specification at http://tools.ietf.org/html/rfc7159 does allow "implementations to set limits on the range and precision of numbers accepted", so perhaps the intent of jq is to be such an implementation. If that is the case, then I believe the documentation should be more upfront about the issue. I certainly was misled by the prominence given to this statement very early in the jq Manual:

  Since jq by default pretty-prints all output, this trivial program [jq -M .] can be a useful way of formatting JSON output 

A case could also be made that "implementations" that "set limits" should emit error or warning messages if the limits are breached.
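
As a rough illustration of what such a warning could look like (a Python sketch, not a proposal for jq's C code; `checked_int` and `loads_as_doubles` are hypothetical names), a parser that stores integers as doubles can compare each literal with its converted value and warn when they differ:

```python
import json
import warnings

def checked_int(literal: str) -> float:
    """Convert an integer literal to a double, warning if precision is lost."""
    exact = int(literal)
    approx = float(exact)  # raises OverflowError beyond the double range
    if int(approx) != exact:
        warnings.warn(
            f"{literal} is not exactly representable as a double; "
            f"stored as {int(approx)}"
        )
    return approx

def loads_as_doubles(text: str):
    """Parse JSON with all integers stored as doubles, as jq does."""
    return json.loads(text, parse_int=checked_int)

value = loads_as_doubles('{"id": 1252760048171909}')  # exact, no warning
print(int(value["id"]))  # 1252760048171909
```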

It seems to me, however, that jq would be much more useful if it retained precision AT LEAST in the absence of arithmetic operations.

That is, it would be very nice if "jq -M ." really could be used to format JSON without ever altering any values.

Thanks.

@tischwa

tischwa commented May 24, 2014

On 05/24/14 08:19, pkoppstein wrote:

It seems to me, however, that jq would be much more useful if it retained precision AT LEAST in the absence of arithmetic operations.

+1

There was quite a bit of discussion about this in the past; you can find
it in the issue list on GitHub.

I think there was a prototypical implementation already from someone
exactly doing what you propose: Read number as strings and convert to
double only if arithmetic is involved. I'm not sure why that
implementation didn't make it in.

BTW, awk does it that way too:

%echo '111111111111111111' | awk '{print $1, 1*$1}'
111111111111111111 111111111111111104
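
The awk behaviour above (keep the text, convert only when arithmetic happens) can be sketched in a few lines of Python. The class name `LazyNumber` is invented for this illustration and is not taken from any jq patch; only multiplication is implemented to keep the sketch short.

```python
class LazyNumber:
    """Keeps a number's original text; converts to a double only for arithmetic."""

    def __init__(self, literal: str):
        self.literal = literal

    def __str__(self) -> str:
        # Printing without arithmetic reproduces the input exactly.
        return self.literal

    def __float__(self) -> float:
        return float(self.literal)

    def __mul__(self, other) -> float:
        # Arithmetic forces the conversion, and with it the rounding.
        return float(self) * float(other)

    __rmul__ = __mul__

n = LazyNumber("111111111111111111")
print(n, int(1 * n))  # 111111111111111111 111111111111111104
```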

Regards,

 Tilo

@pkoppstein
Contributor

Thanks, Tilo, especially for mentioning "awk" -- for two reasons.

First, the following text from jq's "home page" advertises jq as being like awk and sed:

jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

Contrast:

$ sed '' <<< 11111111111111111
11111111111111111

$ jq -M . <<< 11111111111111111
11111111111111112

Second, and more importantly, regular expressions are intrinsic to both sed and awk. I've mentioned that elsewhere (#164), so by way of summary I'll just say that although I really do like PEGs, jq, being modern and unconstrained by backwards-compatibility issues, really ought to support regular expressions with named captures, hopefully in the manner of ruby, and hopefully sooner rather than later :-)

@nicowilliams
Contributor

RFC7159 also doesn't say anything about IEEE 754 being the standard for
JSON. All it does is note the variance in implementations.

No one says that jq shouldn't have bignums. The point is that you can't
expect bignum support universally.

@nicowilliams
Contributor

A possible candidate library would be https://github.com/libtom/libtomfloat, but it seems to be abandoned, and the WARNING seems scary (but we can always write tests for it and fix it).

Another option is to take one of the many bignum integer libraries, like libtommath or bsdnt, and build our own bignum real library on top of it.

@jrdriscoll

I don't have a lot to add to this discussion other than to note that I was shocked that echo 11111111111111111 | jq '.' produced something other than its input. That being said, I never expected infinite precision arithmetic.

@dequis

dequis commented Oct 3, 2016

Submitted #1246, which solves this issue in a conservative way - no bigint libraries, only for ints up to 64 bits, and only if no operations are done over them, much like awk's behavior in previous comments:

$ echo 111111111111111111 | jq -c '[., 1*.]'
[111111111111111111,111111111111111100]

zifeishan added a commit to HazyResearch/deepdive that referenced this issue Jan 12, 2017
`jq` has a precision bug while loading probabilities with big vids:

```
$ printf '1 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294272    200    300
$ printf '2 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294274    200    300
$ printf '3 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '4 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '5 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '6 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294278    200    300
$ printf '7 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294280    200    300
```

this is a known issue and shockingly there seems to be no good way in jq to support better precision:
jqlang/jq#369

We've tried using awk but versioning seems a problem. Here we just try to use python3 to do the job.
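
The repeated and skipped ids in the transcript above follow directly from double arithmetic: `37 << 48` lies between 2^53 and 2^54, where doubles are spaced 2 apart, so odd sums are rounded to a neighbour. A short Python sketch (assuming Python's `float` is an IEEE 754 double; this is not the code the commit actually used) reproduces the jq column next to the exact integer result:

```python
SHARD_BASE = 37 << 48  # 10414574138294272

for vid in range(1, 8):
    as_double = int(float(vid) + float(SHARD_BASE))  # double arithmetic, as in jq
    exact = vid + SHARD_BASE                         # Python int arithmetic
    print(vid, as_double, exact)
```

The middle column matches the jq output above, while the last column gives the intended ids.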
Just used a poor

@sp-james-mcmurray

I had this issue today with jq-1.5.

For large integers, it changes the trailing digits to 0s; even converting to strings doesn't help (as it reads the value as a number first):

"3265374331746778600"
"3349146353896582000"
"3658445187091539500"
"381327942920288"
"4540495826739245000"
"4609284046671461000"

Should be:

3265374331746778747
3349146353896582298
3658445187091539618
4540495826739245237
4609284046671461009

@nicowilliams
Contributor

@sp-james-mcmurray Internally jq uses IEEE754 doubles for number representation. Integers whose absolute values are larger than 2^53 are not guaranteed to be faithfully represented.

@mitar

mitar commented Feb 19, 2019

I just use Python and its json module. Now that Python maintains the order in its dicts, it is easy to modify JSON and get output that matches the input, except for the changes you want.
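
A minimal sketch of that approach, using only the standard `json` module (the sample document and field names are made up for illustration). Integer literals are kept exactly because Python ints are arbitrary precision; non-integer literals still pass through `float`, so they are not covered by this.

```python
import json

text = '{"id": 125276004817190914, "name": "a", "tags": ["x"]}'

doc = json.loads(text)  # dicts keep insertion order (Python 3.7+)
doc["name"] = "b"       # the one intended change
print(json.dumps(doc))
# {"id": 125276004817190914, "name": "b", "tags": ["x"]}
```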

@nicowilliams
Contributor

@mitar please don't leave this comment on every issue that deals with jq and IEEE754. Thanks.

@umairrafiq

One workaround for this issue is to double-quote numbers as strings before sending them to jq.
An easy way to do that in bash using perl:
perl -pe 's/("(?:\\.|[^"])*")|-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/$1||qq("$&")/ge'
sourced from : https://unix.stackexchange.com/a/504446/368108
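
For readers who prefer Python to perl, the same idea can be sketched with `re.sub`. This is an illustrative translation, not the code from the linked answer, and `quote_numbers` is a hypothetical name. The first alternative matches whole JSON strings (including escaped quotes) and leaves them alone, so digits inside strings are not touched; the second matches bare number literals and wraps them in quotes.

```python
import re

# Group 1: a complete JSON string. Otherwise: a bare JSON number literal.
TOKEN = re.compile(r'("(?:\\.|[^"\\])*")|-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')

def quote_numbers(json_text: str) -> str:
    """Wrap every bare number in double quotes, leaving strings untouched."""
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return match.group(1)       # existing string: keep as-is
        return '"' + match.group(0) + '"'  # number: quote it
    return TOKEN.sub(replace, json_text)

print(quote_numbers('{"id": 125276004817190914, "name": "abc 123"}'))
# {"id": "125276004817190914", "name": "abc 123"}
```

The quoted values then pass through jq as strings, so no digits are lost, at the cost of the output containing strings where the input had numbers.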
