FID filtering on formats like .shp is slow #8590

theroggy · 2023-10-21T15:30:05Z

Filtering on fid in an SQL statement is quite slow on larger files for binary formats like .shp.

For text-based files it is normal for this to be slow, because the entire file needs to be parsed to find the right fids.
For database-like file types such as .gpkg it is fast (using the "SQLite" dialect), as the fid is the primary key of the table.
For binary files like shapefile it is relatively slow, but the fid is essentially an offset in the file, so I imagine that in theory this could be fast?
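To illustrate why the fid can act as an offset for shapefiles: the .shx index that accompanies a .shp has a 100-byte header followed by one fixed-size 8-byte record per feature, so the position of a feature can be computed directly from its fid. The following is only a minimal sketch of that lookup, not GDAL's code; the `shx_lookup` helper and the synthetic index bytes are made up for illustration.

```python
import struct
from typing import Tuple

SHX_HEADER_SIZE = 100  # fixed-size header of a .shx file
SHX_RECORD_SIZE = 8    # each index record: offset + content length


def shx_lookup(shx_bytes: bytes, fid: int) -> Tuple[int, int]:
    """Return (byte offset, content length in bytes) in the .shp for a 0-based fid."""
    pos = SHX_HEADER_SIZE + SHX_RECORD_SIZE * fid
    if fid < 0 or pos + SHX_RECORD_SIZE > len(shx_bytes):
        raise IndexError(f"fid {fid} out of range")
    # Both values are big-endian 32-bit integers counted in 16-bit words.
    offset_words, length_words = struct.unpack(
        ">ii", shx_bytes[pos:pos + SHX_RECORD_SIZE]
    )
    return offset_words * 2, length_words * 2


# Tiny synthetic index (hypothetical values): header plus two records.
fake_shx = (
    bytes(SHX_HEADER_SIZE)
    + struct.pack(">ii", 50, 10)
    + struct.pack(">ii", 64, 24)
)
print(shx_lookup(fake_shx, 1))
```

The lookup cost does not depend on the number of features, which is why random access by fid could in principle be cheap for this format.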

I used some files downloaded from here to test it, but any reasonably large file can be used.

Test script:

from datetime import datetime
from pathlib import Path
from osgeo import gdal

gdal.UseExceptions()

# Paths: source GeoPackage and a scratch output file
src_orig = Path("C:/temp/prc2023.gpkg")
dst = "C:/temp/dst.gpkg"

# Run the same fid filter against each format
for ext in [".gpkg", ".shp", ".fgb"]:
    src = src_orig.parent / f"{src_orig.stem}{ext}"
    if not src.exists():
        # Convert the source to this format once; not included in the timing
        ds_output = gdal.VectorTranslate(srcDS=str(src_orig), destNameOrDestDS=str(src))
        ds_output = None  # close the dataset so it is flushed to disk

    # Time only the filtered translate
    start = datetime.now()
    where = "fid IN (1, 100000, 500000)"
    options = gdal.VectorTranslateOptions(where=where)
    ds_output = gdal.VectorTranslate(srcDS=str(src), destNameOrDestDS=str(dst), options=options)
    ds_output = None  # close the output before stopping the clock

    print(f"for {ext}: took {datetime.now() - start}")

Output:

for .gpkg: took 0:00:00.080532
for .shp: took 0:00:10.678047
for .fgb: took 0:00:04.108473


rouault · 2023-10-21T17:35:27Z

Improving that would require non-trivial effort, at least for the SetAttributeFilter() API, since it would require adding specific behavior in all drivers (or at least the ones where it makes sense, that is, the ones that declare the OLCRandomRead capability).
Doing that for ExecuteSQL("SELECT .... FROM ... WHERE fid IN (....)") would probably be a bit simpler, as the specific behaviour would be in a single place (the OGRGenSQLLayer).
It might be easier on your side to just call GetFeature(fid) on drivers that declare OLCRandomRead.
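A minimal sketch of that GetFeature(fid) approach, assuming the standard osgeo Python bindings. The helper names `get_features_by_fid` and `demo` are made up for illustration; "RandomRead" is the string value of ogr.OLCRandomRead.

```python
def get_features_by_fid(layer, fids):
    """Fetch features one by one with GetFeature(); return those that exist."""
    # "RandomRead" is the value of ogr.OLCRandomRead.
    if not layer.TestCapability("RandomRead"):
        raise ValueError("layer does not declare OLCRandomRead")
    found = []
    for fid in fids:
        try:
            feature = layer.GetFeature(fid)
        except RuntimeError:
            # With gdal.UseExceptions(), an unknown fid may raise
            # instead of returning None.
            feature = None
        if feature is not None:
            found.append(feature)
    return found


def demo(path, fids):
    """Hypothetical usage: return the fids that were actually found in `path`."""
    # Imported here so the helper above does not need GDAL to be importable.
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.OpenEx(path, gdal.OF_VECTOR)
    layer = ds.GetLayer(0)
    return [feature.GetFID() for feature in get_features_by_fid(layer, fids)]
```

Note that fid numbering differs between formats (shapefile fids start at 0, GeoPackage fids usually start at 1), so the same list of fids does not necessarily select the same rows in each file.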

theroggy · 2023-10-21T20:41:45Z

This is in the context of implementing ArrowStream support. As far as I know, it is not possible to use GetFeature(fid) with the ArrowStream interface.
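For the Arrow path, the commit referenced in this issue describes a fast 'FID IN (...)' / 'FID = ...' attribute filter in the generic GetNextArrowArray(). A rough sketch of how that might be used from Python follows, assuming a GDAL build that includes that change and the Layer.GetArrowStreamAsNumPy() binding (which needs numpy). The `iter_arrow_batches_for_fids` helper is hypothetical.

```python
def iter_arrow_batches_for_fids(layer, fids):
    """Yield Arrow batches (as dicts of arrays) restricted to the given fids."""
    fid_list = [str(int(fid)) for fid in fids]
    if not fid_list:
        return  # "FID IN ()" is not valid, so yield nothing
    layer.SetAttributeFilter(f"FID IN ({', '.join(fid_list)})")
    try:
        # The attribute filter stays set while the stream is consumed.
        for batch in layer.GetArrowStreamAsNumPy():
            yield batch
    finally:
        # Clear the filter so later reads on the layer are unaffected.
        layer.SetAttributeFilter(None)
```

The batches are yielded while the filter is still active, and the filter is cleared once the generator is exhausted or closed.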

rouault self-assigned this Oct 21, 2023

rouault added a commit to rouault/gdal that referenced this issue Oct 21, 2023

ArrowArray: implement fast 'FID IN (...)' / 'FID = ...' attribute filter in generic GetNextArrowArray(), and use it for FlatGeoBuf one too (when it has a spatial index) (fixes OSGeo#8590)

bc5aadd

rouault mentioned this issue Oct 21, 2023

ArrowArray: implement fast 'FID IN (...)' / 'FID = ...' attribute filter… #8593

Merged

rouault added a commit to rouault/gdal that referenced this issue Oct 21, 2023

ArrowArray: implement fast 'FID IN (...)' / 'FID = ...' attribute filter in generic GetNextArrowArray(), and use it for FlatGeoBuf one too (when it has a spatial index) (fixes OSGeo#8590)

e07940d

theroggy mentioned this issue Oct 25, 2023

Add support for fids filter with use_arrow=True geopandas/pyogrio#304

Merged

rouault closed this as completed in #8593 Oct 27, 2023

rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023

ogrinfo: really honours -if (refs OSGeo#8590)

312a3df

rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023

Doc: clarify that -if does not relax potential restrictions on file extensions (fixes OSGeo#8590)

d086421

rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023

ogrinfo: really honours -if (refs OSGeo#8590)

6640cac

rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023

Doc: clarify that -if does not relax potential restrictions on file extensions (fixes OSGeo#8590)

a8fa752

rouault added a commit that referenced this issue Dec 11, 2023

ogrinfo: really honours -if (refs #8590)

7147f69

rouault added a commit that referenced this issue Dec 11, 2023

Doc: clarify that -if does not relax potential restrictions on file extensions (fixes #8590)

6551e47

ralphraul pushed a commit to 1SpatialGroupLtd/gdal that referenced this issue Mar 11, 2024

ogrinfo: really honours -if (refs OSGeo#8590)

e093987

ralphraul pushed a commit to 1SpatialGroupLtd/gdal that referenced this issue Mar 11, 2024

Doc: clarify that -if does not relax potential restrictions on file extensions (fixes OSGeo#8590)

81a5f8d
