What is the bug?
I found recently that I am getting corrupted TIFs occasionally out of my AWS workflows, but with no errors other than log messages like:
ERROR 1: Request for 363594754-363909793 failed with response_code=0
Here is one such example where a few lines are duplicated, although I have also seen cases where the failed reads put blackfill in the output.
The nature of my setup is that we have fairly large GeoTIFFs stored in S3, and I am translating smaller tiles out for batch processing. The translate is using /vsis3/, although I was able to duplicate the problem with /vsicurl/ which I will outline below. I suspect that this is some sort of intermittent communication error, for example maybe some random S3 HTTP server is dying or being disconnected for some reason.
I would expect this to result in a failure and a retry, but it doesn't... the translate just continues successfully to completion. I have also narrowed down that this behavior only happens when GDAL_NUM_THREADS is set. I've tried ALL_CPUS and 8, and both behave the same. Maybe this is due to some multi-threaded /vsicurl/ code that behaves differently from its single-threaded counterpart.
Steps to reproduce the issue
Test setup is a little complicated due to the nature of intermittent server disconnects, but the steps below allow me to replicate it reliably. Hopefully this translates to other systems.
The first step is to create an image similar to our large S3 TIFs. I realize that this command will create a non-tiled, uncompressed TIF, and that this is not very efficient for random access in S3. But this is how our production system works and it has some code that is not easily changed. This will create a relatively large file that takes a few seconds to translate, with non-zero pixel values:
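The exact command isn't preserved here, but a sketch along these lines produces an equivalent file (a single Float64 band is an assumption, though it matches the 80000-byte strips and ~800 MB of range requests in the errors below):

```
# ~800 MB GeoTIFF: 10000 x 10000, one Float64 band, every pixel set to 10000,
# non-tiled and uncompressed (the GTiff defaults).
gdal_create -of GTiff -outsize 10000 10000 -bands 1 -ot Float64 -burn 10000 input.tif
```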
Now, in the directory that the TIF was created, run this command to start a Docker container with nginx that will serve the TIF. I just used the nginx:latest image for this. The --name is important as it will be used to abruptly kill the server in the next command.
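For example (the container name gdal-nginx is arbitrary; any name works as long as the kill command below matches it):

```
# Serve the current directory over HTTP on port 80; -d runs it in the background.
docker run --rm -d --name gdal-nginx -p 80:80 \
  -v "$PWD":/usr/share/nginx/html:ro nginx:latest
```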
Now translate the TIF via /vsicurl/. This is a two-part command where the two halves run simultaneously. The first part will sleep 2 seconds and then terminate the nginx server. The second part starts a gdal_translate command.
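A sketch of that two-part command, reusing the container name from above:

```
# Part 1 (backgrounded): kill the web server 2 seconds into the transfer.
# Part 2: translate the TIF over /vsicurl/ with multi-threading enabled.
(sleep 2 && docker kill gdal-nginx) & \
  gdal_translate --config GDAL_NUM_THREADS ALL_CPUS \
    /vsicurl/http://localhost/input.tif output.tif
```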
It may be necessary to change the sleep if your system is somehow fast enough to translate the TIF entirely within 2 seconds. But for me it only gets about 30% through. The input being uncompressed slows things down, too.
What happens next is that the translate will report a bunch of errors, but then complete anyway.
The errors:
Input file size is 10000, 10000
0...10...20...30ERROR 1: Request for 250080180-260080179 failed with response_code=0
.ERROR 1: Request for 260080180-270080179 failed with response_code=0
ERROR 1: Request for 270080180-280080179 failed with response_code=0
.ERROR 1: Request for 280080180-290080179 failed with response_code=0
ERROR 1: Request for 290080180-300080179 failed with response_code=0
.ERROR 1: Request for 300080180-310080179 failed with response_code=0
ERROR 1: Request for 310080180-320080179 failed with response_code=0
40ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4054
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4055
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4056
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4057
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4058
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4059
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4060
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~4061
The output image:
Every pixel in the input is 10000. The output is a valid TIF but with blackfill, because the translate got interrupted.
If you repeat this test without setting GDAL_NUM_THREADS=ALL_CPUS, it will correctly throw an error and exit without creating output.tif:
Input file size is 10000, 10000
0...10...20..ERROR 1: TIFFReadEncodedStrip:Read error at scanline 4294967295; got 0 bytes, expected 80000
ERROR 1: TIFFReadEncodedStrip() failed.
ERROR 1: /vsicurl/http://localhost/input.tif, band 1: IReadBlock failed at X offset 0, Y offset 2625: TIFFReadEncodedStrip() failed.
I also tried, as suggested by Even, setting GDAL_HTTP_MULTIRANGE=SERIAL and GDAL_HTTP_RETRY_CODES=ALL, but the behavior is the same. If GDAL_NUM_THREADS is set to something, it will complete with a successful exit code.
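For reference, this is the shape of the invocation I tried (a sketch, reusing the hypothetical names from the repro above):

```
# Serial ranged reads plus retry on every response code, as suggested.
(sleep 2 && docker kill gdal-nginx) & \
  gdal_translate --config GDAL_NUM_THREADS ALL_CPUS \
    --config GDAL_HTTP_MULTIRANGE SERIAL \
    --config GDAL_HTTP_RETRY_CODES ALL \
    /vsicurl/http://localhost/input.tif output.tif
```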
Versions and provenance
Observed with GDAL 3.10.0 in the official Docker image ghcr.io/osgeo/gdal:ubuntu-small-3.10.0, and also with a local build of GDAL 3.8.5 on a 2019 MacBook Pro.
Additional context
Even and I had a short conversation about this on the gdal-dev mailing list, but I forgot to hit 'reply all' after the first response. Here is the thread for posterity:
Me:
I am using Python and translating small chunks of imagery out of S3 and occasionally run into errors like this:
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 372390946-372700449 failed with response_code=0
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 2028508162-2028817665 failed with response_code=0
2024-12-17 18:06:16.201 MST: ERROR 1: Request for 2030984194-2031293697 failed with response_code=0
2024-12-17 18:06:16.202 MST: ERROR 1: Request for 1476778594-1477088097 failed with response_code=0
2024-12-17 18:06:16.202 MST: ERROR 1: Request for 374247970-374557473 failed with response_code=0
Even with gdal.UseExceptions() and GDAL_HTTP_MAX_RETRY / GDAL_HTTP_RETRY_DELAY set, it doesn’t seem to do anything other than log this error and proceed. The end result is that my output file has blank pixels in it.
I have been staring at the code a bit to understand what it means, particularly the response_code being 0. I see this in the libcurl docs:
The stored value is zero if no server response code has been received.
I still don’t quite know why this happens but could be that some AWS server closed a connection early or something. But is there any way to force a retry? I see that GDAL_HTTP_RETRY_CODES was added recently. If I set it to “ALL” will that ensure that a situation like this results in a retry even on a closed connection? I am not set up to easily reproduce this or even to see a stack trace, so I’m not sure what leads to this error. It doesn’t happen very often, but it’s disruptive when it does happen because it’s hard to catch.
Even:
Tim,
as far as I can see this error message comes from the specific ReadMultiRange() method, which has no retry capabilities. It probably should, although I'm unsure from an implementation point of view how that would work with the multi-request curl API. I guess it can. Anyway, just to say that you're going to be out of luck there.
A workaround would be to set GDAL_HTTP_MULTIRANGE=SERIAL to go back to the classic code path (obviously you'll lose the parallel network request aspect!), and there GDAL_HTTP_RETRY_CODES=ALL should hopefully work even for response_code=0.
Me:
Thanks, Even. I'm not sure I want to give up the speed in order to get it to retry. But even throwing an exception would be better than what it does now. Is it possible, for example with some configuration setting, to make it fail? Or is that curl code too low-level to make a decision like that?
Even:
Tim,
I'm surprised it doesn't fail for you. As far as I can see a CPLError() is emitted, -1 is returned by the ReadMultiRange() method, and the GeoTIFF driver tests that error code and propagates it. I might be missing something, obviously, but at first sight error propagation looks appropriate.
Me:
Hmm, interesting... thanks for looking. I was trying to trace through the code but I'm not familiar enough with it.
I have code that calls gdal.Translate which pulls from an S3 file into /vsimem/tile.tif, and there is retry logic around that gdal.Translate call. I wonder if it does fail, and during one of my retries it writes into an existing /vsimem/tile.tif? But I would expect it to overwrite the entire file, not to keep pixels from a previous failed attempt. Maybe I should try using a unique filename for each retry...
Anyway, thank you for the help. I suspect it's user error on my part...
Even:
Tim,
so I gave that a try by explicitly simulating the code=0 error. And it is interesting... So I missed that the call to ReadMultiRange() is actually an optimization. If that method fails, then the GeoTIFF driver will fall back to the tile-after-tile acquisition logic (which might benefit from retries), and hence from an end-user perspective this isn't an error and you'll get the appropriate pixel values. One could argue that, in the context of its use in the GeoTIFF driver, ReadMultiRange() shouldn't emit a CPLError() at all, at least not a Failure one but maybe just a Warning, since this is not actually fatal. Hope that makes sense!
Me:
I did notice in the GTiff code there is a section that calls ReadMultiRange() and then has a comment below "// Retry without optimization", but I wasn't sure if that meant what I thought it did. I am still confused how I get the results that I do... it is admittedly very rare and I don't really have a good way to reproduce it. I still may try switching to a unique name for the in-memory dataset I'm translating into, just to be safe.
Thank you for all the info and explanation... it has been very helpful.