What is the bug?
I found recently that I am getting corrupted TIFs occasionally out of my AWS workflows, but with no errors other than log messages like:
ERROR 1: Request for 363594754-363909793 failed with response_code=0
Here is one such example where a few lines are duplicated, although I have also seen cases where the failed reads put blackfill in the output.
The nature of my setup is that we have fairly large GeoTIFFs stored in S3, and I am translating smaller tiles out for batch processing. The translate is using /vsis3/, although I was able to duplicate the problem with /vsicurl/ which I will outline below. I suspect that this is some sort of intermittent communication error, for example maybe some random S3 HTTP server is dying or being disconnected for some reason.
I would expect this to result in a failure and a retry, but it doesn't... the translate just continues successfully to completion. I have also narrowed down that this behavior only happens when GDAL_NUM_THREADS is set. I've tried ALL_CPUS and 8, and both behave the same. Maybe this is due to some multi-threaded /vsicurl/ code that behaves differently from its single-threaded counterpart.
Steps to reproduce the issue
Test setup is a little complicated due to the nature of intermittent server disconnects, but the steps below allow me to replicate it reliably. Hopefully this translates to other systems.
The first step is to create an image similar to our large S3 TIFs. I realize that this command will create a non-tiled, uncompressed TIF, and that this is not very efficient for random access in S3. But this is how our production system works and it has some code that is not easily changed. This will create a relatively large file that takes a few seconds to translate, with non-zero pixel values:
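The exact command isn't preserved here, but a sketch along these lines produces an equivalent file (a single Float64 band is an assumption, though it matches the 80000-byte strips and ~800 MB of range requests in the errors below):

```
# ~800 MB GeoTIFF: 10000 x 10000, one Float64 band, every pixel set to 10000,
# non-tiled and uncompressed (the GTiff defaults).
gdal_create -of GTiff -outsize 10000 10000 -bands 1 -ot Float64 -burn 10000 input.tif
```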
Now, in the directory that the TIF was created, run this command to start a Docker container with nginx that will serve the TIF. I just used the nginx:latest image for this. The --name is important as it will be used to abruptly kill the server in the next command.
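For example (the container name gdal-nginx is arbitrary; any name works as long as the kill command below matches it):

```
# Serve the current directory over HTTP on port 80; -d runs it in the background.
docker run --rm -d --name gdal-nginx -p 80:80 \
  -v "$PWD":/usr/share/nginx/html:ro nginx:latest
```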
Now translate the TIF via /vsicurl/. This is a two-part command where the two halves run simultaneously. The first part will sleep 2 seconds and then terminate the nginx server. The second part starts a gdal_translate command.
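A sketch of that two-part command, reusing the container name from above:

```
# Part 1 (backgrounded): kill the web server 2 seconds into the transfer.
# Part 2: translate the TIF over /vsicurl/ with multi-threading enabled.
(sleep 2 && docker kill gdal-nginx) & \
  gdal_translate --config GDAL_NUM_THREADS ALL_CPUS \
    /vsicurl/http://localhost/input.tif output.tif
```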
It may be necessary to change the sleep if your system is somehow fast enough to translate the TIF entirely within 2 seconds. But for me it only gets about 30% through. The input being uncompressed slows things down, too.
What happens next is that the translate will report a bunch of errors, but then complete anyway.
The errors:
Input file size is 10000, 10000
0...10...20...30ERROR 1: Request for 250080180-260080179 failed with response_code=0
.ERROR 1: Request for 260080180-270080179 failed with response_code=0
ERROR 1: Request for 270080180-280080179 failed with response_code=0
.ERROR 1: Request for 280080180-290080179 failed with response_code=0
ERROR 1: Request for 290080180-300080179 failed with response_code=0
.ERROR 1: Request for 300080180-310080179 failed with response_code=0
ERROR 1: Request for 310080180-320080179 failed with response_code=0
40ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4054
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4055
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4056
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4057
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4058
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4059
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4060
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4061
The output image:
Every pixel in the input is 10000. The output is a valid TIF but with blackfill, because the translate got interrupted.
If you repeat this test without setting GDAL_NUM_THREADS=ALL_CPUS, it will correctly throw an error and exit without creating output.tif:
Input file size is 10000, 10000
0...10...20..ERROR 1: TIFFReadEncodedStrip:Read error at scanline 4294967295; got 0 bytes, expected 80000
ERROR 1: TIFFReadEncodedStrip() failed.
ERROR 1: /vsicurl/http://localhost/input.tif, band 1: IReadBlock failed at X offset 0, Y offset 2625: TIFFReadEncodedStrip() failed.
I also tried, as suggested by Even, setting GDAL_HTTP_MULTIRANGE=SERIAL and GDAL_HTTP_RETRY_CODES=ALL, but the behavior is the same. If GDAL_NUM_THREADS is set to something, it will complete with a successful exit code.
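For reference, this is the shape of the invocation I tried (a sketch, reusing the hypothetical names from the repro above):

```
# Serial ranged reads plus retry on every response code, as suggested.
(sleep 2 && docker kill gdal-nginx) & \
  gdal_translate --config GDAL_NUM_THREADS ALL_CPUS \
    --config GDAL_HTTP_MULTIRANGE SERIAL \
    --config GDAL_HTTP_RETRY_CODES ALL \
    /vsicurl/http://localhost/input.tif output.tif
```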
Versions and provenance
Observed with GDAL 3.10.0 in the official Docker image ghcr.io/osgeo/gdal:ubuntu-small-3.10.0, and also with a local build of GDAL 3.8.5 on a 2019 MacBook Pro.
Additional context
Even and I had a short conversation about this on the gdal-dev mailing list, but I forgot to hit 'reply all' after the first response. Here is the thread for posterity:
Me:
I am using Python and translating small chunks of imagery out of S3 and occasionally run into errors like this:
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 372390946-372700449 failed with response_code=0
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 2028508162-2028817665 failed with response_code=0
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 2030984194-2031293697 failed with response_code=0
2024-12-17 18:06:16.202 MST: ERROR 1: Request for 1476778594-1477088097 failed with response_code=0
2024-12-17 18:06:16.202 MST: ERROR 1: Request for 374247970-374557473 failed with response_code=0
Even with gdal.UseExceptions() and GDAL_HTTP_MAX_RETRY / GDAL_HTTP_RETRY_DELAY set, it doesn’t seem to do anything other than log this error and proceed. The end result is that my output file has blank pixels in it.
I have been staring at the code a bit to understand what it means, particularly the response_code being 0. I see this in the libcurl docs:
The stored value is zero if no server response code has been received.
I still don’t quite know why this happens but could be that some AWS server closed a connection early or something. But is there any way to force a retry? I see that GDAL_HTTP_RETRY_CODES was added recently. If I set it to “ALL” will that ensure that a situation like this results in a retry even on a closed connection? I am not set up to easily reproduce this or even to see a stack trace, so I’m not sure what leads to this error. It doesn’t happen very often, but it’s disruptive when it does happen because it’s hard to catch.
Even:
Tim,
as far as I can see this error message comes from the specific ReadMultiRange() method, which has no retry capabilities. It probably should, although I'm unsure from an implementation point of view how that would work with the multi-request curl API. I guess it can. Anyway, just to say that you're going to be out of luck there.
A workaround would be to set GDAL_HTTP_MULTIRANGE=SERIAL to go back to the classic code path (obviously you'll lose the parallel network request aspect!), and there GDAL_HTTP_RETRY_CODES=ALL should hopefully work even for response_code=0.
Me:
Thanks, Even. I'm not sure I want to give up the speed in order to get it to retry. But even throwing an exception would be better than what it does now. Is it possible, for example with some configuration setting, to make it fail? Or is that curl code too low-level to make a decision like that?
Even:
Tim,
I'm surprised it doesn't fail for you. As far as I can see a CPLError() is emitted, -1 is returned by the ReadMultiRange() method, and the GeoTIFF driver tests that error code and propagates it. I might be missing something, obviously, but at first sight error propagation looks appropriate.
Me:
Hmm, interesting... thanks for looking. I was trying to trace through the code but I'm not familiar enough with it.
I have code that calls gdal.Translate which pulls from an S3 file into /vsimem/tile.tif, and there is retry logic around that gdal.Translate call. I wonder if it does fail, and during one of my retries it writes into an existing /vsimem/tile.tif? But I would expect it to overwrite the entire file, not to keep pixels from a previous failed attempt. Maybe I should try using a unique filename for each retry...
Anyway, thank you for the help. I suspect it's user error on my part...
Even:
Tim,
so I gave that a try by explicitly simulating the code=0 error. And it is interesting... So I missed that the call to ReadMultiRange() is actually an optimization. If that method fails, then the GeoTIFF driver will fall back to the tile-after-tile acquisition logic (which might benefit from retries), and hence from an end-user perspective this isn't an error and you'll get the appropriate pixel values. One could argue that, in the context of its use in the GeoTIFF driver, ReadMultiRange() shouldn't emit a CPLError() at all, at least not a Failure one but maybe just a Warning, since this is not actually fatal. Hope that makes sense!
Me:
I did notice in the GTiff code there is a section that calls ReadMultiRange() and then has a comment below "// Retry without optimization", but I wasn't sure if that meant what I thought it did. I am still confused how I get the results that I do... it is admittedly very rare and I don't really have a good way to reproduce it. I still may try switching to a unique name for the in-memory dataset I'm translating into, just to be safe.
Thank you for all the info and explanation... it has been very helpful.