Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling floating point precision in sqlite #150

Closed
gmaclennan opened this issue Aug 8, 2023 · 11 comments
Closed

Handling floating point precision in sqlite #150

gmaclennan opened this issue Aug 8, 2023 · 11 comments
Assignees
Labels
mvp Requirement for MVP

Comments

@gmaclennan
Copy link
Member

Description

Writing a floating point number (e.g. non-decimal) to sqlite and reading it again results in a different value due to floating point conversion errors. This is noticeable when storing lat/lon coordinates. We are currently using the SQLite REAL data type for storing lat and lon. More about how SQLite handles floating point here: https://www.sqlite.org/floatingpoint.html

This might also be related to encoding of floating point numbers in Protobuf, which is also involved in the round-trip of writing-reading data in Mapeo Core.

I think the issue is compounded by Javascript and Sqlite using different ways to store floating points.

It's not ideal that when storing data with lat/lon coordinates, the values that come back are different to those that went in. I wonder that maybe we switch to using an integer field for lat and lon, multiplying as needed for a fixed precision. e.g. we could support 6 decimal places of precision by storing lat / lon as value * Math.pow(10,6).

I think this is probably preferable to getting different values back than what you write, which is not what people would expect.

Suggestions about the best way forwards?

@achou11
Copy link
Member

achou11 commented Aug 8, 2023

I wonder that maybe we switch to using an integer field for lat and lon, multiplying as needed for a fixed precision. e.g. we could support 6 decimal places of precision by storing lat / lon as value * Math.pow(10,6)

I think this would make sense! A quick search online seems to confirm that this is a reasonable approach (albeit annoying)

I think this is probably preferable to getting different values back than what you write, which is not what people would expect.

Agreed!

@gmaclennan
Copy link
Member Author

I'm thinking maybe 7 decimal places for lat/lon is good. This gives precision of at least 0.011112m (1.1cm) - with greater precision for longitude away from the equator. This is way more accurate than most GPS, which have a max precision of about 5m, however survey GPS and even RTK can get down to millimeter precision (although the most common standard for surveys is Trimble devices, which have about 2-3cm precision I think).

The advantage of 7 decimal places is that we can fit them in a 32-bit signed integers (decoding 64-bit integers to Javascript can be complicated). Min, max of sint32 is -2,147,483,647 to 2,147,483,647.

7 decimal precision lat/lon is going to be -1,800,000,000 to 1,800,000,000

We could go 8 decimal places for millimeter precision, but I think this is overkill for Mapeo, given the overhead of then needing 64-bit integers.

@sethvincent
Copy link
Contributor

In what ways will this impact queries?
What about using strings? 💩

@gmaclennan
Copy link
Member Author

Using strings in some ways is easiest - ensures round-trip accuracy. It would mean we can't query lat lon though (except via json - see below), but we don't right now anyway. Using integers I think would just work due to the way drizzle transforms values. One option with strings is that SQLite json queries could probably parse it into a number and run queries on it. It wouldn't be indexed but it would work.

@tomasciccola
Copy link
Contributor

ts-proto also introduces precison errors on the roundtrip. If we're using ints, I agree using int32 would be the best and recommended by ts-proto (see https://github.com/stephenh/ts-proto section NumberTypes).
For tests to pass now I'm doing the weird thing of counting decimals of the encoded value and then removing extra decimals on the decoded value for comparison, which is not ideal.
In observation/{lat,lon} is the only place I found this issue (There are other places where we use numbers but I don't think there's a change for them to be floats)
RE: using strings instead of ints I don't have a strong opinion. But if we want to do smth like spacial querying we would need to transform a string to int for that, right?

@gmaclennan
Copy link
Member Author

Can you try double and see how it changes the precision / difference.

@tomasciccola
Copy link
Contributor

tomasciccola commented Aug 9, 2023

Can you try double and see how it changes the precision / difference.

yeap, already tried here and it worked! Don't know if we should keep this conversation open related SQL handling of floats

@gmaclennan
Copy link
Member Author

Will publish the mapeo/schema changes tomorrow and see how that affects the tests here and if it's still an issue. I guess it will become an issue at a certain level of precision, but not sure what that is?

@gmaclennan
Copy link
Member Author

Switching protobuf storage to double seems to have solved the problem for many situations, however it might still be an issue when trying to store very high precision coordinates. Some other references on this issue that I found:

Going to leave this open to potentially address in the future, although migration might be tricky.

@gmaclennan gmaclennan added the post-mvp de-scoped to after MVP label Aug 22, 2023
@EvanHahn EvanHahn added mvp Requirement for MVP and removed post-mvp de-scoped to after MVP labels Mar 14, 2024
EvanHahn added a commit that referenced this issue Mar 28, 2024
If we use the ["integer" JSON Schema type][0], we should save that as an
INTEGER column, not a REAL column. Both will work for many numbers, but
larger integers (like `2**63`) can only be safely stored in INTEGERs.
This should be a bit more reliable.

As of this patch, we don't have any integer types in our schemas, so
this has no effect. However, that may change as we address [upcoming
issues][1].

[0]: https://json-schema.org/understanding-json-schema/reference/numeric#integer
[1]: #150
@EvanHahn
Copy link
Contributor

tl;dr: I think we're all good here.


I anticipate two types of bug here:

  1. Data serialization bugs. For example, we lose data when going from JS ↔ SQLite, or JS ↔ protobuf.
  2. Data precision bugs. For example, we can't represent small distances.

I've don't think either of these are problems for us.

Bug 1: data serialization

We should be okay as long as we use standard floats of the same type, which we do:

We could have problems going between, say, float32 and float64, but we're not doing that as far as I can tell.

More details

I don't always trust docs, so here's a script that verifies that we don't lose data going between JS and SQLite:

import Database from 'better-sqlite3'
import assert from 'node:assert/strict'

const db = Database(':memory:')

db.exec('CREATE TABLE numbers (value REAL NOT NULL) STRICT')
const insert = db.prepare('INSERT INTO numbers (value) VALUES (?)')
const select = db.prepare('SELECT value FROM numbers LIMIT 1').pluck()
const clear = db.prepare('DELETE FROM numbers')

;[
  123,
  0.1,
  0.1 + 0.2,
  Number.MIN_SAFE_INTEGER,
  Number.MAX_SAFE_INTEGER,
  Number.EPSILON,
  Number.MIN_VALUE,
].forEach((number) => {
  clear.run()
  insert.run(number)

  assert.equal(select.get(), number)
})

Bug 2: precision

64-bit floats are accurate to within a single digit number of nanometers.1 This is probably fine for our purposes.

How I came to this

IEEE floats have limited precision.

This imprecision isn't constant. Numbers lose precision as their magnitude gets larger.2 Therefore, if we want to test the worst case precision, we should test the largest number we represent. In our case, that number is 180.

For us, the largest possible imprecision would be between (90°, 180°) and (89.9999999999999857891°, 179.999999999999971578°).3

I wrote this short Python program to compute the Haversine distance between these points, in meters:

from sympy import asin, sqrt, sin, cos, pi, sympify

r = 6378.137 * 1000


def degrees_to_radians(degrees):
    return degrees * pi / 180


def haversine(p1, p2):
    lat_1 = degrees_to_radians(p1[0])
    lng_1 = degrees_to_radians(p1[1])
    lat_2 = degrees_to_radians(p2[0])
    lng_2 = degrees_to_radians(p2[1])

    dlat = lat_1 - lat_2
    dlng = lng_1 - lng_2

    return sympify(2 * r) * asin(
        sqrt(sin(dlat / 2) ** 2 + cos(lat_1) * cos(lat_2) * sin(dlng / 2) ** 2)
    )


def main():
    p1 = sympify((90, 180))
    p2 = sympify(("89.9999999999999857891", "179.999999999999971578"))

    distance_in_meters = haversine(p1, p2)

    print("meters:", distance_in_meters.evalf())
    print("centimeters:", (distance_in_meters * 100).evalf())
    print("nanometers:", (distance_in_meters * 1e9).evalf())


if __name__ == "__main__":
    main()

This returned about 1.58 nanometers. The Haversine distance isn't perfectly accurate, but I think it's good enough for us.

(I also tried the same thing with float32 values and it's about 85 centimeters. Might also be okay for us, but I wouldn't want to risk that.)

Footnotes

  1. According to the United States National Nanotechnology Initiative, "A sheet of paper is about 100,000 nanometers thick".

  2. This MDN example shows this in a bit more detail.

  3. 179.999999999999971578 is the float64 number closest to 180. Same for 89.9999999999999857891. See float.exposed to play around with this.

@EvanHahn EvanHahn self-assigned this Mar 28, 2024
EvanHahn added a commit that referenced this issue Mar 28, 2024
If we use the ["integer" JSON Schema type][0], we should save that as an
INTEGER column, not a REAL column. Both will work for many numbers, but
larger integers (like `2**63`) can only be safely stored in INTEGERs.
This should be a bit more reliable.

As of this patch, we don't have any integer types in our schemas, so
this has no effect. However, that may change as we address [upcoming
issues][1].

[0]: https://json-schema.org/understanding-json-schema/reference/numeric#integer
[1]: #150
@EvanHahn
Copy link
Contributor

EvanHahn commented Apr 4, 2024

I think we're all good here. Closing.

@EvanHahn EvanHahn closed this as completed Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mvp Requirement for MVP
Projects
None yet
Development

No branches or pull requests

5 participants