Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FID filtering on formats like .shp is slow #8590

Closed
theroggy opened this issue Oct 21, 2023 · 2 comments · Fixed by #8593
Closed

FID filtering on formats like .shp is slow #8590

theroggy opened this issue Oct 21, 2023 · 2 comments · Fixed by #8593
Assignees

Comments

@theroggy
Copy link
Contributor

Filtering on fid in an sql statement is quite slow in larger files for binary formats like .shp.

For text based files it is normal to be slow because the entire file needs to be parsed to be able to find the right fids.
For database-like file types like .gpkg it is fast (using "SQLite" dialect) as the fid is the primary key of the table.
For binary files like shapefile it is relatively slow, but the fid is essentially an offset in the file, so I imagine that in theory this could be fast?

I used some files downloaded from here to test it, but any slightly larger files can obviously be used.

Test script:

from datetime import datetime
from pathlib import Path
from osgeo import gdal

gdal.UseExceptions()

# Paths
src_orig = Path("C:/temp/prc2023.gpkg")
dst = "C:/temp/dst.gpkg"

# Run test
for ext in [".gpkg", ".shp", ".fgb"]:
    src = src_orig.parent / f"{src_orig.stem}{ext}"
    if not src.exists():
        ds_output = gdal.VectorTranslate(srcDS=str(src_orig), destNameOrDestDS=str(src))
        ds_output = None

    start = datetime.now()
    where = "fid IN (1, 100000, 500000)"
    options = gdal.VectorTranslateOptions(where=where)
    ds_output = gdal.VectorTranslate(srcDS=str(src), destNameOrDestDS=str(dst), options=options)
    ds_output = None

    print(f"for {ext}: took {datetime.now() - start}")

Output:

for .gpkg: took 0:00:00.080532
for .shp: took 0:00:10.678047
for .fgb: took 0:00:04.108473
@rouault
Copy link
Member

rouault commented Oct 21, 2023

Improving that would require non trivial efforts, at least for the SetAttributeFilter() API, since it would require adding a specific behavior in all drivers (or at least the ones where it makes sense, that is the one that declare the OLCRandomRead capability).
Doing that for ExecuteSQL("SELECT .... FROM ... WHERE fid IN (....)") would probably be a bit simpler as the specific behaviour would be in a single place (the OGRGenSQLLayer).
It might be easier on your side to just call GetFeature(fid) on drivers that declare OLCRandomRead

@theroggy
Copy link
Contributor Author

theroggy commented Oct 21, 2023

It is in the context of implementing ArrowStream support... using the arrowstream interface it is not possible to use GetFeature(fid) as far as I know.

@rouault rouault self-assigned this Oct 21, 2023
rouault added a commit to rouault/gdal that referenced this issue Oct 21, 2023
…ter in generic GetNextArrowArray(), and use it for FlatGeoBuf one too (when it has a spatial index) (fixes OSGeo#8590)
rouault added a commit to rouault/gdal that referenced this issue Oct 21, 2023
…ter in generic GetNextArrowArray(), and use it for FlatGeoBuf one too (when it has a spatial index) (fixes OSGeo#8590)
rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023
rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023
rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023
rouault added a commit to rouault/gdal that referenced this issue Dec 10, 2023
ralphraul pushed a commit to 1SpatialGroupLtd/gdal that referenced this issue Mar 11, 2024
ralphraul pushed a commit to 1SpatialGroupLtd/gdal that referenced this issue Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants