GPKG: creation of spatial index is quite slow #7614

theroggy · 2023-04-18T12:58:40Z

Not really a new observation, but the creation of a spatial index on a (larger) Geopackage file is very slow. I made a test script in python to illustrate this.

It creates a grid with 1.002.001 squares, writes this to .shp and .gpkg and creates spatial indexes on them. The output:

writing test.shp without spatial index took 0:00:14.067565
create spatial index on test.shp took 0:00:04.890439
writing test.gpkg without spatial index took 0:00:06.265831
writing test_si.gpkg with spatial index took 0:00:38.086723
create spatial index on test.gpkg with default sqlite cache took 0:00:36.834020
create spatial index on test.gpkg with 512 MB sqlite cache took 0:00:37.406399

I wonder, is it clear why this is slow or where the bottleneck is located? In GDAL, Spatialite, SqLite, ???

Where would be a logical place to post this issue to try to get this improved? Or have such efforts already been tried?

The text was updated successfully, but these errors were encountered:

rouault · 2023-04-18T13:40:41Z

or where the bottleneck is located?

in SQLite (Spatialite isn't used at all for that). GDAL has already tried optimizing this as best as it can, but there isn't nothing more we can do on our side I can think of. This is probably coming from the write side of SQLite and/or its RTree building
You might try to post on https://sqlite.org/forum/, but you'll probably need to prepare a fully self contained SQLite SQL text file (independent of GDAL GeoPackage SQL functions) with raw statements like:

CREATE VIRTUAL TABLE my_rtree USING rtree(id, minx, maxx, miny, maxy);
INSERT INTO my_rtree VALUES (0,1,2,3,4);
[...]

(you might use a WITH RECURSIVE ... INSERT INTO my_rtree SELECT ... syntax to insert in a compact way lots of records)

theroggy · 2023-04-19T07:22:26Z

I was able to reproduce without any gdal or spatialite involved that the time being taken to create the index is indeed ~100% the "INSERT INTO my_tree...", so 100% sqlite.

I had another look at the qix format, and as far as I can tell it is a slightly different tree implementation (quadtree)... so I'll have a look if I can get some timings using e.g. https://rtree.readthedocs.io/en/latest/ so the time comparison is as "comparable" as possible.

jratike80 · 2023-04-19T07:46:27Z

The R*Tree module https://www.sqlite.org/rtree.html is probably more complicated than it may feel. Maybe the fastest possible writing has not been the main design goal. Normally indexes are read much more often that written, but naturally a data provider has a different point of view.
I wonder if it could be faster to create simple indexed minx/maxx/miny/maxy columns and how much slower it would be use those in bbox queries.

rouault · 2023-04-19T09:11:11Z

I wonder if it could be faster to create simple indexed minx/maxx/miny/maxy columns and how much slower it would be use those in bbox queries.

Cf the NGA GeoInt Geometry Index extension: http://ngageoint.github.io/GeoPackage/docs/extensions/geometry-index.html . It probably needs also additional creation of regular index on the min_x, max_x, min_y, max_y columns

rouault · 2023-04-20T02:13:57Z

It probably needs also additional creation of regular index on the min_x, max_x, min_y, max_y columns

actually no, that's detrimental to performance in some cases.
And the performance boost of using nga_geometry_index is "just" a x1.6 over using the bounding box included in the GeoPackage geometry blob (with #7621 applied):

$ time ogrinfo nz-building-outlines-no-index.gpkg -sql "select count(*) from \"nz-building-outlines\" where ST_EnvelopesIntersects(geometry, 1500000.0,5200000.0,1800000.0,5800000.0)" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701300


real	0m0,563s
user	0m0,294s
sys	0m0,267s

$ time ogrinfo nz-building-outlines-ngageoint.gpkg -sql "select count(*) from nga_geometry_index where max_x>=1500000 and max_y>= 5200000 and min_x <=1800000 and min_y <= 5800000" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701300


real	0m0,360s
user	0m0,290s
sys	0m0,068s

jratike80 · 2023-04-20T07:00:32Z

It would be nice to see also the timing for R*Tree.

rouault · 2023-04-20T10:49:29Z

It would be nice to see also the timing for R*Tree.

$ time ogrinfo nz-building-outlines.gpkg -sql "SELECT count(*) FROM \"rtree_nz-building-outlines_geometry\" WHERE maxx >= 1500000 AND minx <= 1800000 AND maxy >= 5200000 AND miny <= 5800000" -al -q

Layer name: SELECT
OGRFeature(SELECT):0
  count(*) (Integer) = 701301


real	0m0,146s
user	0m0,120s
sys	0m0,024s

And the RTree is magnitude order much faster than linear scanning when selecting a very small number of objects (here this selects ~ 20% of the table)

jratike80 · 2023-04-20T11:07:10Z

So it feels like the creation of the R*Tree index is slower for a reason and it gets paid back rather soon when the data are queried.

theroggy · 2023-04-20T19:14:08Z

As mentioned above I had a look to find an rtree implementation to benchmark against because .qix uses a quandtree. I don't know any details on the difference, but comparing the same algoritm should be a fairer comparison.

In geopandas one of the available indexes is an rtree, so I had a look at that. Apparently the rtree library used under the hood is a python wrapper around a c library. Not sure how fast or slow it is... but for the sample data above creating the index takes ~9 seconds compared to the 38 seconds to fill up the index in sqlite.

The 9 s is without serializing to file, but still it gives the impression there might be room for improvement in the sqlite implementation.

An important thing to note is that the "rtree" library has an explicit bulk/stream load option that would be a lot faster. I tested both, and the 9 seconds is for the bulk load way, filling the index in a row-by-row loop way took over 2 minutes... Not sure if python overhead in the loop plays a significant role or not, but that way it is a lot slower than sqlite.

Possibly such a bulk load option might be the trick that could boost the initial spatial index creation in sqlite as well?

Code snippet of my rtree test:

import datetime
from rtree import index

# Prepare test data
x, y = (0, 0)
rects = []
size = 10
while y <= 10000:
    while x <= 10000:
        rects.append((x, y, x + size, y + size))
        x += size
    x = 0
    y += size

# Create rtree
start = datetime.datetime.now()
data = ((i, item, None) for i, item in enumerate(rects))
index_properties = index.Property(type=index.RT_RTree)
idx = index.Index(data, properties=index_properties)
print(f"create rtree took {datetime.datetime.now() - start}")

jratike80 · 2023-04-20T20:34:26Z

I fear that for making the creation of the R*Tree index in SQLite faster one should have a thorough understanding about the virtual table mechanism of SQLite https://www.sqlite.org/vtab.html, and what the R*Tree module is doing when it creates the contents to the shadow tables ...node, ...parent, and ...rowid.
I suggest to write mail to SQLite developers and ask about the design principles of the R*Tree module.

There may well be alternative methods for writing a spatial index faster. I guess that such system would require a vendor specific extension because the GeoPackage standard requires that the gpkg_rtree_index extension is using the standard module

Requirements
This extension uses the rtree implementation provided by the SQLite R*Tree Module extension
documented at http://www.sqlite.org/rtree.html.

BTW, maybe a useless test, but copying the biggest shadow table with 6.8 million rows took 9 seconds on my laptop. Maybe it suggests that SQLite is using time for creating RTree that is optimized for fast queries, and the bottle neck is not in writing.

create table test_index
as select * from 
rtree_korkeuskayra_geom_node

jratike80 · 2023-04-20T20:51:57Z

Hey @rouault , what if you continued what you did in #7621, and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already? Where do the ST_MinX etc. functions find the values?

theroggy · 2023-04-20T21:06:24Z

I suggest to write mail to SQLite developers

It is indeed the plan to post in the SQLite support forum and ask if they see an option to speed up the creation of the index. I'm busy collecting info to give the post some body.

... and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already?

The R*Tree index is already populated using the bounding boxes.

Where do the ST_MinX etc. functions find the values?

The GeoPackage BLOB (typically) already contains the bounding box, so ST_MinX etc. can just read them directly without any further processing. Additionally I tested a few days ago if it made a difference to first copy the ST_MinX,... of the geometries to a simple table, and then create the rtree index based on that, but this didn't give any difference, so hence my conclusion/confirmation that ~100% of the time is spent building/inserting into the rtree index.

rouault · 2023-04-20T21:11:33Z

and fool the standard R*Tree to index the bounding boxes inside the GeoPackage BLOBs instead of the real geometries? Or is it done already?

yes, that's already done. In https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/gpkg/ogrgeopackagetablelayer.cpp#L2427, in CreateFeature(), OGR collects the bounding box computed to create the header GeoPackage geometry blob, store that in a memory queue, builds a RTree in a dummy in-memory sqlite database in a thread, and then copies the shadow tables to the real database at the end. That was implemented per #6573. There's really nothing more we can done on OGR side, without going into really low level details of the SQLite R*Tree implementation. That is messing directly with the content of the shadow tables (which, as far as I can see, isn't documented in plain language). Non trivial effort, and it is not obvious if we could beat SQLite implementation (perhaps...). It would be better if SQLite itself could have a fast bulk mode implemented, if that's ever possible given its architecture

theroggy · 2023-04-21T20:05:44Z

Hopefully there is still a bigger speedup possible, but I had an interesting find.

I'm using python via conda(-forge) on windows for my tests and timings. When running my sql-only (gpkg independent) script I noticed that the "default" sqlite.exe downloaded from sqlite.org was significantly faster than the conda version: 18 secs instead of 31 secs for the insert in the rtree.

So, I already posted an issue for that aspect here: conda-forge/sqlite-feedstock#89

theroggy · 2023-04-24T15:29:59Z

I just posted in the sqlite forum checking whether it would be possible to add a bulk load functionality for rtree's:
https://sqlite.org/forum/forumpost/af937a36b3

Crossing fingers ;-)

``` $ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg real 0m16,433s user 0m18,093s sys 0m2,135s ``` vs without optimization. With sqlite 3.43.2, with disabling of RTree forced reinsertion from SQLite master sqlite/sqlite@7de8ae2). Otherwise it is 38 seconds ``` $ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg real 0m28,000s user 0m40,287s sys 0m5,244s ``` Fixes OSGeo#7614

rouault · 2023-10-22T21:42:46Z

Fast RTree creation implemented in #8596

``` $ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg real 0m16,433s user 0m18,093s sys 0m2,135s ``` vs without optimization. With sqlite 3.43.2, with disabling of RTree forced reinsertion from SQLite master sqlite/sqlite@7de8ae2). Otherwise it is 38 seconds ``` $ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg real 0m28,000s user 0m40,287s sys 0m5,244s ``` Fixes OSGeo#7614

theroggy mentioned this issue Apr 21, 2023

Performance difference on windows compared to sqlite.exe from sqlite.org conda-forge/sqlite-feedstock#89

Closed

theroggy changed the title ~~GPKG: creation of spatial index is slow~~ GPKG: creation of spatial index is quite slow Sep 14, 2023

rouault mentioned this issue Oct 22, 2023

GPKG: use much faster creation of RTree with a in-memory RTree building #8596

Merged

rouault closed this as completed in #8596 Oct 27, 2023

theroggy mentioned this issue Jan 27, 2024

ENH: increase write performance of geopackage (spatial index) geopandas/pyogrio#71

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPKG: creation of spatial index is quite slow #7614

GPKG: creation of spatial index is quite slow #7614

theroggy commented Apr 18, 2023 •

edited

Loading

rouault commented Apr 18, 2023

theroggy commented Apr 19, 2023

jratike80 commented Apr 19, 2023

rouault commented Apr 19, 2023

rouault commented Apr 20, 2023

jratike80 commented Apr 20, 2023

rouault commented Apr 20, 2023

jratike80 commented Apr 20, 2023

theroggy commented Apr 20, 2023 •

edited

Loading

jratike80 commented Apr 20, 2023 •

edited

Loading

jratike80 commented Apr 20, 2023 •

edited

Loading

theroggy commented Apr 20, 2023 •

edited

Loading

rouault commented Apr 20, 2023

theroggy commented Apr 21, 2023 •

edited

Loading

theroggy commented Apr 24, 2023

rouault commented Oct 22, 2023

GPKG: creation of spatial index is quite slow #7614

GPKG: creation of spatial index is quite slow #7614

Comments

theroggy commented Apr 18, 2023 • edited Loading

rouault commented Apr 18, 2023

theroggy commented Apr 19, 2023

jratike80 commented Apr 19, 2023

rouault commented Apr 19, 2023

rouault commented Apr 20, 2023

jratike80 commented Apr 20, 2023

rouault commented Apr 20, 2023

jratike80 commented Apr 20, 2023

theroggy commented Apr 20, 2023 • edited Loading

jratike80 commented Apr 20, 2023 • edited Loading

jratike80 commented Apr 20, 2023 • edited Loading

theroggy commented Apr 20, 2023 • edited Loading

rouault commented Apr 20, 2023

theroggy commented Apr 21, 2023 • edited Loading

theroggy commented Apr 24, 2023

rouault commented Oct 22, 2023

theroggy commented Apr 18, 2023 •

edited

Loading

theroggy commented Apr 20, 2023 •

edited

Loading

jratike80 commented Apr 20, 2023 •

edited

Loading

jratike80 commented Apr 20, 2023 •

edited

Loading

theroggy commented Apr 20, 2023 •

edited

Loading

theroggy commented Apr 21, 2023 •

edited

Loading