Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPKG: creation of spatial index is quite slow #7614

Closed
theroggy opened this issue Apr 18, 2023 · 16 comments · Fixed by #8596
Closed

GPKG: creation of spatial index is quite slow #7614

theroggy opened this issue Apr 18, 2023 · 16 comments · Fixed by #8596

Comments

@theroggy
Copy link
Contributor

theroggy commented Apr 18, 2023

Not really a new observation, but the creation of a spatial index on a (larger) Geopackage file is very slow. I made a test script in python to illustrate this.

It creates a grid with 1.002.001 squares, writes this to .shp and .gpkg and creates spatial indexes on them. The output:

writing test.shp without spatial index took 0:00:14.067565
create spatial index on test.shp took 0:00:04.890439
writing test.gpkg without spatial index took 0:00:06.265831
writing test_si.gpkg with spatial index took 0:00:38.086723
create spatial index on test.gpkg with default sqlite cache took 0:00:36.834020
create spatial index on test.gpkg with 512 MB sqlite cache took 0:00:37.406399

I wonder, is it clear why this is slow or where the bottleneck is located? In GDAL, Spatialite, SqLite, ???

Where would be a logical place to post this issue to try to get this improved? Or have such efforts already been tried?

@rouault
Copy link
Member

rouault commented Apr 18, 2023

or where the bottleneck is located?

in SQLite (Spatialite isn't used at all for that). GDAL has already tried optimizing this as best as it can, but there isn't nothing more we can do on our side I can think of. This is probably coming from the write side of SQLite and/or its RTree building
You might try to post on https://sqlite.org/forum/, but you'll probably need to prepare a fully self contained SQLite SQL text file (independent of GDAL GeoPackage SQL functions) with raw statements like:

CREATE VIRTUAL TABLE my_rtree USING rtree(id, minx, maxx, miny, maxy);
INSERT INTO my_rtree VALUES (0,1,2,3,4);
[...]

(you might use a WITH RECURSIVE ... INSERT INTO my_rtree SELECT ... syntax to insert in a compact way lots of records)

@theroggy
Copy link
Contributor Author

I was able to reproduce without any gdal or spatialite involved that the time being taken to create the index is indeed ~100% the "INSERT INTO my_tree...", so 100% sqlite.

I had another look at the qix format, and as far as I can tell it is a slightly different tree implementation (quadtree)... so I'll have a look if I can get some timings using e.g. https://rtree.readthedocs.io/en/latest/ so the time comparison is as "comparable" as possible.

@jratike80
Copy link
Collaborator

The R*Tree module https://www.sqlite.org/rtree.html is probably more complicated than it may feel. Maybe the fastest possible writing has not been the main design goal. Normally indexes are read much more often that written, but naturally a data provider has a different point of view.
I wonder if it could be faster to create simple indexed minx/maxx/miny/maxy columns and how much slower it would be use those in bbox queries.

@rouault
Copy link
Member

rouault commented Apr 19, 2023

I wonder if it could be faster to create simple indexed minx/maxx/miny/maxy columns and how much slower it would be use those in bbox queries.

Cf the NGA GeoInt Geometry Index extension: http://ngageoint.github.io/GeoPackage/docs/extensions/geometry-index.html . It probably needs also additional creation of regular index on the min_x, max_x, min_y, max_y columns

@rouault
Copy link
Member

rouault commented Apr 20, 2023

It probably needs also additional creation of regular index on the min_x, max_x, min_y, max_y columns

actually no, that's detrimental to performance in some cases.
And the performance boost of using nga_geometry_index is "just" a x1.6 over using the bounding box included in the GeoPackage geometry blob (with #7621 applied):

$ time ogrinfo nz-building-outlines-no-index.gpkg -sql "select count(*) from \"nz-building-outlines\" where ST_EnvelopesIntersects(geometry, 1500000.0,5200000.0,1800000.0,5800000.0)" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701300


real	0m0,563s
user	0m0,294s
sys	0m0,267s

$ time ogrinfo nz-building-outlines-ngageoint.gpkg -sql "select count(*) from nga_geometry_index where max_x>=1500000 and max_y>= 5200000 and min_x <=1800000 and min_y <= 5800000" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701300


real	0m0,360s
user	0m0,290s
sys	0m0,068s

@jratike80
Copy link
Collaborator

It would be nice to see also the timing for R*Tree.

@rouault
Copy link
Member

rouault commented Apr 20, 2023

It would be nice to see also the timing for R*Tree.

$ time ogrinfo nz-building-outlines.gpkg -sql "SELECT count(*) FROM \"rtree_nz-building-outlines_geometry\" WHERE maxx >= 1500000 AND minx <= 1800000 AND maxy >= 5200000 AND miny <= 5800000" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701301


real	0m0,146s
user	0m0,120s
sys	0m0,024s

And the RTree is magnitude order much faster than linear scanning when selecting a very small number of objects (here this selects ~ 20% of the table)

@jratike80
Copy link
Collaborator

So it feels like the creation of the R*Tree index is slower for a reason and it gets paid back rather soon when the data are queried.

@theroggy
Copy link
Contributor Author

theroggy commented Apr 20, 2023

As mentioned above I had a look to find an rtree implementation to benchmark against because .qix uses a quandtree. I don't know any details on the difference, but comparing the same algoritm should be a fairer comparison.

In geopandas one of the available indexes is an rtree, so I had a look at that. Apparently the rtree library used under the hood is a python wrapper around a c library. Not sure how fast or slow it is... but for the sample data above creating the index takes ~9 seconds compared to the 38 seconds to fill up the index in sqlite.

The 9 s is without serializing to file, but still it gives the impression there might be room for improvement in the sqlite implementation.

An important thing to note is that the "rtree" library has an explicit bulk/stream load option that would be a lot faster. I tested both, and the 9 seconds is for the bulk load way, filling the index in a row-by-row loop way took over 2 minutes... Not sure if python overhead in the loop plays a significant role or not, but that way it is a lot slower than sqlite.

Possibly such a bulk load option might be the trick that could boost the initial spatial index creation in sqlite as well?

Code snippet of my rtree test:

import datetime
from rtree import index

# Prepare test data
x, y = (0, 0)
rects = []
size = 10
while y <= 10000:
    while x <= 10000:
        rects.append((x, y, x + size, y + size))
        x += size
    x = 0
    y += size

# Create rtree
start = datetime.datetime.now()
data = ((i, item, None) for i, item in enumerate(rects))
index_properties = index.Property(type=index.RT_RTree)
idx = index.Index(data, properties=index_properties)
print(f"create rtree took {datetime.datetime.now() - start}")

@jratike80
Copy link
Collaborator

jratike80 commented Apr 20, 2023

I fear that for making the creation of the R*Tree index in SQLite faster one should have a thorough understanding about the virtual table mechanism of SQLite https://www.sqlite.org/vtab.html, and what the R*Tree module is doing when it creates the contents to the shadow tables ...node, ...parent, and ...rowid.
I suggest to write mail to SQLite developers and ask about the design principles of the R*Tree module.

There may well be alternative methods for writing a spatial index faster. I guess that such system would require a vendor specific extension because the GeoPackage standard requires that the gpkg_rtree_index extension is using the standard module

Requirements
This extension uses the rtree implementation provided by the SQLite R*Tree Module extension
documented at http://www.sqlite.org/rtree.html.

BTW, maybe a useless test, but copying the biggest shadow table with 6.8 million rows took 9 seconds on my laptop. Maybe it suggests that SQLite is using time for creating RTree that is optimized for fast queries, and the bottle neck is not in writing.

create table test_index
as select * from 
rtree_korkeuskayra_geom_node

@jratike80
Copy link
Collaborator

jratike80 commented Apr 20, 2023

Hey @rouault , what if you continued what you did in #7621, and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already? Where do the ST_MinX etc. functions find the values?

@theroggy
Copy link
Contributor Author

theroggy commented Apr 20, 2023

I suggest to write mail to SQLite developers

It is indeed the plan to post in the SQLite support forum and ask if they see an option to speed up the creation of the index. I'm busy collecting info to give the post some body.

... and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already?

The R*Tree index is already populated using the bounding boxes.

Where do the ST_MinX etc. functions find the values?

The GeoPackage BLOB (typically) already contains the bounding box, so ST_MinX etc. can just read them directly without any further processing. Additionally I tested a few days ago if it made a difference to first copy the ST_MinX,... of the geometries to a simple table, and then create the rtree index based on that, but this didn't give any difference, so hence my conclusion/confirmation that ~100% of the time is spent building/inserting into the rtree index.

@rouault
Copy link
Member

rouault commented Apr 20, 2023

and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already?

yes, that's already done. In https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/gpkg/ogrgeopackagetablelayer.cpp#L2427, in CreateFeature(), OGR collects the bounding box computed to create the header GeoPackage geometry blob, store that in a memory queue, builds a RTree in a dummy in-memory sqlite database in a thread, and then copies the shadow tables to the real database at the end. That was implemented per #6573. There's really nothing more we can done on OGR side, without going into really low level details of the SQLite R*Tree implementation. That is messing directly with the content of the shadow tables (which, as far as I can see, isn't documented in plain language). Non trivial effort, and it is not obvious if we could beat SQLite implementation (perhaps...). It would be better if SQLite itself could have a fast bulk mode implemented, if that's ever possible given its architecture

@theroggy
Copy link
Contributor Author

theroggy commented Apr 21, 2023

Hopefully there is still a bigger speedup possible, but I had an interesting find.

I'm using python via conda(-forge) on windows for my tests and timings. When running my sql-only (gpkg independent) script I noticed that the "default" sqlite.exe downloaded from sqlite.org was significantly faster than the conda version: 18 secs instead of 31 secs for the insert in the rtree.

So, I already posted an issue for that aspect here: conda-forge/sqlite-feedstock#89

@theroggy
Copy link
Contributor Author

I just posted in the sqlite forum checking whether it would be possible to add a bulk load functionality for rtree's:
https://sqlite.org/forum/forumpost/af937a36b3

Crossing fingers ;-)

@theroggy theroggy changed the title GPKG: creation of spatial index is slow GPKG: creation of spatial index is quite slow Sep 14, 2023
rouault added a commit to rouault/gdal that referenced this issue Oct 22, 2023
```
$ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m16,433s
user    0m18,093s
sys     0m2,135s
```

vs without optimization. With sqlite 3.43.2, with disabling of RTree forced reinsertion
from SQLite master sqlite/sqlite@7de8ae2).
Otherwise it is 38 seconds

```
$ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m28,000s
user    0m40,287s
sys     0m5,244s
```

Fixes OSGeo#7614
@rouault
Copy link
Member

rouault commented Oct 22, 2023

Fast RTree creation implemented in #8596

rouault added a commit to rouault/gdal that referenced this issue Oct 22, 2023
```
$ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m16,433s
user    0m18,093s
sys     0m2,135s
```

vs without optimization. With sqlite 3.43.2, with disabling of RTree forced reinsertion
from SQLite master sqlite/sqlite@7de8ae2).
Otherwise it is 38 seconds

```
$ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m28,000s
user    0m40,287s
sys     0m5,244s
```

Fixes OSGeo#7614
rouault added a commit to rouault/gdal that referenced this issue Oct 22, 2023
```
$ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m16,433s
user    0m18,093s
sys     0m2,135s
```

vs without optimization. With sqlite 3.43.2, with disabling of RTree forced reinsertion
from SQLite master sqlite/sqlite@7de8ae2).
Otherwise it is 38 seconds

```
$ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg
real    0m28,000s
user    0m40,287s
sys     0m5,244s
```

Fixes OSGeo#7614
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants