GPKG: creation of spatial index is quite slow #7614
The spatial index creation happens in SQLite (Spatialite isn't used at all for that). GDAL has already tried optimizing this as best as it can, but there isn't anything more we can do on our side that I can think of. This is probably coming from the write side of SQLite and/or its R*Tree building (you might use a `WITH RECURSIVE ... INSERT INTO my_rtree SELECT ...` syntax to insert lots of records in a compact way).
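A minimal sketch of what that suggestion could look like from Python's stdlib `sqlite3` (the table name `my_rtree` and the generated boxes are invented for the illustration; this assumes SQLite was built with the R*Tree module, which most builds are):

```python
# Sketch of the "WITH RECURSIVE ... INSERT INTO my_rtree SELECT ..." idea:
# generate many bounding boxes in a single statement so the R*Tree is filled
# by one INSERT instead of one INSERT per record.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE my_rtree USING rtree(id, minx, maxx, miny, maxy)")

# One compact INSERT: a recursive CTE produces rows 1..n, each mapped to a
# small square; SQLite consumes them as a stream.
n = 100_000
con.execute(
    """
    WITH RECURSIVE seq(i) AS (
        SELECT 1 UNION ALL SELECT i + 1 FROM seq WHERE i < ?
    )
    INSERT INTO my_rtree
    SELECT i, i % 1000, i % 1000 + 1.0, i / 1000, i / 1000 + 1.0 FROM seq
    """,
    (n,),
)
con.commit()
print(con.execute("SELECT count(*) FROM my_rtree").fetchone()[0])  # 100000
```

Whether this is actually faster than `executemany` depends on where the time goes; it removes per-statement overhead but not the tree-building work inside the R*Tree module itself.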
I was able to reproduce, without any GDAL or Spatialite involved, that the time taken to create the index is indeed ~100% the `INSERT INTO my_rtree ...`, so 100% SQLite. I had another look at the .qix format, and as far as I can tell it is a slightly different tree implementation (a quadtree)... so I'll have a look if I can get some timings using e.g. https://rtree.readthedocs.io/en/latest/ so the time comparison is as "comparable" as possible.
The R*Tree module (https://www.sqlite.org/rtree.html) is probably more complicated than it may feel. Maybe the fastest possible writing has not been the main design goal. Normally indexes are read much more often than written, but naturally a data provider has a different point of view.
Cf the NGA GeoInt Geometry Index extension: http://ngageoint.github.io/GeoPackage/docs/extensions/geometry-index.html . It probably also needs additional creation of a regular index on the min_x, max_x, min_y, max_y columns.
Actually no, that's detrimental to performance in some cases.
It would be nice to also see the timing for the R*Tree.
And the R*Tree is orders of magnitude faster than linear scanning when selecting a very small number of objects (here this selects ~20% of the table).
So it feels like the creation of the R*Tree index is slower for a reason and it gets paid back rather soon when the data are queried. |
As mentioned above, I had a look to find an rtree implementation to benchmark against, because .qix uses a quadtree. I don't know any details about the difference, but comparing the same algorithm should be a fairer comparison. In geopandas one of the available indexes is an rtree, so I had a look at that. Apparently the rtree library used under the hood is a Python wrapper around a C library. Not sure how fast or slow it is... but for the sample data above, creating the index takes ~9 seconds compared to the 38 seconds to fill up the index in SQLite. The 9 s is without serializing to file, but it still gives the impression there might be room for improvement in the SQLite implementation. An important thing to note is that the rtree library has an explicit bulk/stream load option that is a lot faster. I tested both: the 9 seconds is for the bulk load; filling the index in a row-by-row loop took over 2 minutes... Not sure if Python overhead in the loop plays a significant role, but that way it is a lot slower than SQLite. Possibly such a bulk load option might be the trick that could boost the initial spatial index creation in SQLite as well? Code snippet of my rtree test:
I fear that to make the creation of the R*Tree index in SQLite faster, one should have a thorough understanding of the virtual table mechanism of SQLite (https://www.sqlite.org/vtab.html) and of what the R*Tree module is doing when it creates the contents of the shadow tables ...node, ...parent, and ...rowid. There may well be alternative methods for writing a spatial index faster. I guess that such a system would require a vendor specific extension, because the GeoPackage standard requires that the gpkg_rtree_index extension uses the standard module.
BTW, maybe a useless test, but copying the biggest shadow table with 6.8 million rows took 9 seconds on my laptop. Maybe it suggests that SQLite is spending the time creating an R*Tree that is optimized for fast queries, and that the bottleneck is not in writing.
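For readers unfamiliar with those shadow tables: creating an R*Tree virtual table also creates three ordinary tables holding the actual tree pages, which can be listed via `sqlite_master`. A quick check from Python's stdlib `sqlite3` (the name `idx` is arbitrary):

```python
# Creating an R*Tree virtual table "idx" also creates the shadow tables
# idx_node, idx_parent and idx_rowid that store the tree itself.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE idx USING rtree(id, minx, maxx, miny, maxy)")
shadow = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type='table' AND name LIKE 'idx_%' ORDER BY name"
)]
print(shadow)  # ['idx_node', 'idx_parent', 'idx_rowid']
```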
It is indeed the plan to post in the SQLite support forum and ask if they see an option to speed up the creation of the index. I'm busy collecting info to give the post some body.
The R*Tree index is already populated using the bounding boxes.
The GeoPackage BLOB (typically) already contains the bounding box, so ST_MinX etc. can just read it directly without any further processing. Additionally, I tested a few days ago whether it made a difference to first copy the ST_MinX, ... values of the geometries to a simple table and then create the rtree index based on that, but this didn't make any difference. Hence my conclusion/confirmation that ~100% of the time is spent building/inserting into the rtree index.
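To illustrate why reading the bounding box from the blob is cheap: the standard GeoPackage binary header starts with `GP`, a version byte, a flags byte, the srs_id, and (optionally) the envelope, so the bbox sits in the first few bytes, before the WKB geometry. A rough sketch of a parser for the common envelope type 1 (the helper name and the fabricated header are mine; types 2 and 3 carry extra z/m doubles and are not handled here):

```python
# Minimal sketch of reading the envelope straight out of a GeoPackage
# binary header, per the GeoPackage encoding standard's header layout:
# "GP" magic, version, flags, 4-byte srs_id, then the envelope doubles.
import struct

def gpkg_envelope(blob: bytes):
    assert blob[:2] == b"GP"
    flags = blob[3]
    endian = "<" if flags & 1 else ">"  # bit 0: byte order of srs_id/envelope
    env_ind = (flags >> 1) & 0b111      # bits 1-3: envelope contents indicator
    if env_ind == 0:
        return None                     # no envelope stored in the header
    # Envelope type 1: [minx, maxx, miny, maxy] as doubles after srs_id.
    return struct.unpack(endian + "dddd", blob[8:40])

# Build a fake header for the demo: little endian (bit 0), envelope type 1.
hdr = (b"GP" + bytes([0, 0b11]) + struct.pack("<i", 4326)
       + struct.pack("<dddd", 1.0, 2.0, 3.0, 4.0))
print(gpkg_envelope(hdr))  # (1.0, 2.0, 3.0, 4.0)
```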
Yes, that's already done. In https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/gpkg/ogrgeopackagetablelayer.cpp#L2427, in CreateFeature(), OGR collects the bounding box computed to create the GeoPackage geometry blob header, stores it in a memory queue, builds an RTree in a dummy in-memory SQLite database in a thread, and then copies the shadow tables to the real database at the end. That was implemented per #6573. There's really nothing more we can do on the OGR side without going into really low-level details of the SQLite R*Tree implementation, that is, messing directly with the content of the shadow tables (which, as far as I can see, isn't documented in plain language). Non-trivial effort, and it is not obvious whether we could beat the SQLite implementation (perhaps...). It would be better if SQLite itself could have a fast bulk mode implemented, if that's ever possible given its architecture.
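The build-in-memory-then-copy-shadow-tables trick described above can be sketched in stdlib `sqlite3` (GDAL does this in C++; the table name `rtree_demo` and the data are invented, and direct writes to shadow tables assume SQLite's defensive mode is off, which is the default):

```python
# Rough sketch: fill the R*Tree in an in-memory scratch database, then
# bulk-copy its shadow tables (_node, _rowid, _parent) into the target.
import sqlite3

con = sqlite3.connect(":memory:")        # "main" stands in for the target GPKG
con.execute("ATTACH ':memory:' AS mem")  # scratch database for the build

# Build the index in the scratch database.
con.execute("CREATE VIRTUAL TABLE mem.rtree_demo USING rtree(id, minx, maxx, miny, maxy)")
con.executemany("INSERT INTO mem.rtree_demo VALUES (?,?,?,?,?)",
                [(i, i, i + 1, i, i + 1) for i in range(1000)])

# Create an empty R*Tree in the target, then overwrite its shadow tables
# with the prebuilt ones.
con.execute("CREATE VIRTUAL TABLE main.rtree_demo USING rtree(id, minx, maxx, miny, maxy)")
for shadow in ("node", "rowid", "parent"):
    con.execute(f"DELETE FROM main.rtree_demo_{shadow}")
    con.execute(f"INSERT INTO main.rtree_demo_{shadow} "
                f"SELECT * FROM mem.rtree_demo_{shadow}")
con.commit()

# The copied index answers spatial queries through the virtual table.
print(con.execute("SELECT count(*) FROM main.rtree_demo WHERE minx < 10").fetchone()[0])
```

This is exactly the "messing with shadow table content" caveat from the comment: it relies on the tables being copied as a consistent set, and nothing in the documented API guarantees the layout stays stable across SQLite versions.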
Hopefully there is still a bigger speedup possible, but I had an interesting find. I'm using Python via conda(-forge) on Windows for my tests and timings. When running my SQL-only (GPKG independent) script, I noticed that the "default" sqlite.exe downloaded from sqlite.org was significantly faster than the conda version: 18 secs instead of 31 secs for the insert into the rtree. So I already posted an issue for that aspect here: conda-forge/sqlite-feedstock#89
I just posted in the SQLite forum checking whether it would be possible to add a bulk load functionality for rtrees. Crossing fingers ;-)
```
$ time ogr2ogr tmp.gpkg nz-building-outlines.gpkg

real	0m16,433s
user	0m18,093s
sys	0m2,135s
```

(with SQLite 3.43.2, with disabling of RTree forced reinsertion from SQLite master, sqlite/sqlite@7de8ae2; otherwise it is 38 seconds)

vs without the optimization:

```
$ time OGR_GPKG_MAX_RAM_USAGE_RTREE=0 ogr2ogr tmp.gpkg nz-building-outlines.gpkg

real	0m28,000s
user	0m40,287s
sys	0m5,244s
```

Fixes OSGeo#7614
Fast RTree creation implemented in #8596
Not really a new observation, but the creation of a spatial index on a (larger) GeoPackage file is very slow. I made a test script in Python to illustrate this.
It creates a grid with 1,002,001 squares, writes this to .shp and .gpkg, and creates spatial indexes on them. The output:
I wonder, is it clear why this is slow or where the bottleneck is located? In GDAL, Spatialite, SQLite, ...?
Where would be a logical place to post this issue to try to get this improved? Or have such efforts already been tried?
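The original test script isn't included above. A stdlib-only sketch of the same shape of experiment, timing only the R*Tree fill that the discussion later identifies as the bottleneck (this skips the .shp/.gpkg writing from the real script; function name and grid layout are mine):

```python
# Time how long it takes to fill an SQLite R*Tree with a regular grid of
# squares, isolated from any GDAL/Spatialite involvement.
import sqlite3
import time

def time_rtree_fill(n_side: int) -> float:
    """Insert n_side * n_side unit squares into a fresh R*Tree; return seconds."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE grid_idx USING rtree(id, minx, maxx, miny, maxy)")
    rows = ((y * n_side + x, x, x + 1, y, y + 1)
            for y in range(n_side) for x in range(n_side))
    t0 = time.perf_counter()
    con.executemany("INSERT INTO grid_idx VALUES (?,?,?,?,?)", rows)
    con.commit()
    return time.perf_counter() - t0

# n_side=1001 gives the 1,002,001 squares from the report; a smaller grid
# is enough to see the per-row insert cost.
print(f"{time_rtree_fill(100):.3f} s for 10,000 squares")
```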