
[Bug]: ReadAllFiles does not fully read gzipped files from GCS #31040

Closed
janowskijak opened this issue Apr 18, 2024 · 34 comments · Fixed by #33384

@janowskijak

What happened?

Since the refactor of gcsio (2.52?), ReadAllFiles does not fully read gzipped files from GCS. Part of the file is returned correctly, but the rest goes missing.

I presume this is caused by GCS performing decompressive transcoding while _ExpandIntoRanges uses the GCS object's metadata to determine the read range. This means that the file size we actually receive is larger than the maximum of the read range.

For example, a gzip file on GCS might have a stored size of 1 MB, and this is the object size in the metadata, so the maximum of the read range will be 1 MB. However, when Beam opens the file it has already been decompressed by GCS, so the actual size might be 1.5 MB, and the last 0.5 MB is never read, causing data loss.
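To illustrate the mismatch, here is a minimal sketch using the google-cloud-storage client (bucket and object names are placeholders): for an object stored with Content-Encoding: gzip, the size in the metadata is the compressed size, while a default download returns the decompressed bytes.

# a sketch of the metadata/content size mismatch; names are hypothetical
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/file.txt.gz")

print(blob.size)                 # stored (compressed) size, used to compute the read range
print(blob.content_encoding)     # "gzip" enables decompressive transcoding on download
data = blob.download_as_bytes()  # served decompressed by default for such objects
print(len(data))                 # typically larger than blob.size, hence the truncation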

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn
Contributor

Abacn commented Apr 19, 2024

Thanks for reporting. Agree this is a P1 bug as it causes data loss.

@Abacn
Contributor

Abacn commented Apr 19, 2024

Is it possible to provide a working example that reproduces the issue? That would help with triage.

@liferoad
Contributor

@shunping FYI

@janowskijak
Author

janowskijak commented Apr 22, 2024

Is it possible to provide a working example that reproduces the issue? That would help with triage.

@Abacn I don't have a working example; however, the steps to reproduce are:

  1. Upload a gzip file to GCS. Make sure that the unzipped file is large enough, e.g. a few MB.
  2. Create a Beam pipeline using the Python SDK that reads the file from step 1 using ReadAllFromText.
  3. Print or write the output of ReadAllFromText.
  4. Observe that the file is not fully read.

EDIT: This issue will probably appear for any compression type. I just encountered it with gzip but did not test with other compression algorithms.

@liferoad
Contributor

liferoad commented Apr 22, 2024

I uploaded one test file here: gs://apache-beam-samples/gcs/bigfile.txt.gz (~7 MB), which has 100,000 lines, but I cannot reproduce this:

# standard libraries
import logging

# third party libraries
import apache_beam as beam
from apache_beam import Create, Map
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import Count

logger = logging.getLogger()
logger.setLevel(logging.INFO)

elements = [
    "gs://apache-beam-samples/gcs/bigfile.txt.gz",
]

options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | Create(elements)
        | "Read File from GCS" >> ReadAllFromText()
        | Count.Globally()
        | "Log" >> Map(lambda x: logging.info("Total lines %d", x))
    )

This shows:

INFO:root:Total lines 100000

@Michal-Nguyen-airspace-intelligence

So I double-checked, and there are differences between your example and our case:

  • We use content encoding gzip when saving our files to GCS; your file has no encoding specified.
  • This leads us to use ReadAllFromText with the parameter compression_type=CompressionTypes.UNCOMPRESSED, since the downloaded file seems to be already uncompressed (it doesn't work with CompressionTypes.AUTO), in line with the GCS policy.
  • This further results in reading only a fragment of the file.

Furthermore, after removing the encoding type from our file and using CompressionTypes.AUTO, it worked properly.
To make your example represent our situation, please add content encoding gzip to your file's metadata, as sketched below.
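One way to do that (a hypothetical sketch with the google-cloud-storage client; gsutil setmeta would also work):

# set Content-Encoding: gzip on an existing object to trigger decompressive transcoding
from google.cloud import storage

bucket = storage.Client().bucket("apache-beam-samples")
blob = bucket.get_blob("gcs/bigfile.txt.gz")
blob.content_encoding = "gzip"
blob.patch()  # persist the metadata change on the object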

@Michal-Nguyen-airspace-intelligence

For a quick patch we use the following solution:

from apache_beam.io.textio import ReadAllFromText


class ReadAllFromTextNotSplittable(ReadAllFromText):
    """This class doesn't take advantage of splitting files into bundles because,
    when doing so, Beam was taking the compressed file size as reference, resulting
    in reading only a fraction of the uncompressed file."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._read_all_files._splittable = False
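A hypothetical usage of this workaround, mirroring the pipelines elsewhere in this thread (the bucket/object name is a placeholder):

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["gs://my-bucket/data.txt.gz"])  # object with Content-Encoding: gzip
        | "Read" >> ReadAllFromTextNotSplittable(
            compression_type=CompressionTypes.UNCOMPRESSED
        )
    )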

@liferoad
Contributor

What does your metadata look like?

I tried this:
(screenshot: object metadata updated with Content-Encoding: gzip)

Then I got this error:

ERROR:apache_beam.runners.common:Error -3 while decompressing data: incorrect header check [while running '[6]: Read File from GCS/ReadAllFiles/ReadRange']

@Michal-Nguyen-airspace-intelligence

This is expected, as I mentioned earlier:
This leads us to use ReadAllFromText with the parameter compression_type=CompressionTypes.UNCOMPRESSED, since the downloaded file seems to be already uncompressed (it doesn't work with CompressionTypes.AUTO), in line with the GCS policy.
I presume that while downloading the file from GCS it is already decompressed, hence the decompression error in Beam.

@Michal-Nguyen-airspace-intelligence

Metadata is as follows (also please note we checked both text/plain and application/x-gzip, both were only partially read):
(screenshot: object metadata with Content-Encoding: gzip; Content-Type was tested as both text/plain and application/x-gzip)

@liferoad
Contributor

liferoad commented Apr 27, 2024

I see. We need to check decompressive transcoding for the GCS file to determine whether the content is compressed rather than relying on the file extension.

# standard libraries
import logging

# third party libraries
import apache_beam as beam
from apache_beam import Create, Map
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import Count

logger = logging.getLogger()
logger.setLevel(logging.INFO)

elements = [
    # "gs://apache-beam-samples/gcs/bigfile.txt.gz",
    # "gs://apache-beam-samples/gcs/bigfile_with_encoding.txt.gz",
    "gs://apache-beam-samples/gcs/bigfile_with_encoding_plain.txt.gz",
]

options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | Create(elements)
        | "Read File from GCS"
        >> ReadAllFromText(
            compression_type=beam.io.filesystem.CompressionTypes.UNCOMPRESSED
        )
        | Count.Globally()
        | "Log" >> Map(lambda x: logging.info("Total lines %d", x))
    )

This only loads 75,601 lines.

#19413 could be related for uploading the file to GCS.
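A minimal sketch of the kind of metadata check mentioned above (querying the google-cloud-storage client directly, not Beam internals):

from google.cloud import storage

blob = storage.Client().bucket("apache-beam-samples").get_blob(
    "gcs/bigfile_with_encoding.txt.gz")
# Content-Encoding: gzip means GCS may decompress the object on download
# (decompressive transcoding), regardless of the .gz file extension.
transcoded = blob.content_encoding == "gzip"
print(transcoded)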

@liferoad liferoad added this to the 2.57.0 Release milestone Apr 29, 2024
@shunping
Contributor

.take-issue

@kennknowles
Member

Have we reproduced this?

@liferoad
Contributor

liferoad commented Jun 7, 2024

Yes, see my above link: #31040 (comment)

@kennknowles
Member

Is there hope of a fix for a 2.57.0 cherry-pick? I would guess this is a longstanding issue, so getting it fixed in a very thorough way for 2.58.0 is actually the best thing to do. I recall we had decompressive transcoding bugs in the past, so we should make sure we really get it right this time. And users can mitigate by configuring GCS not to do the transcoding.

@liferoad
Contributor

Moved this to 2.58.0. Thanks!

@jrmccluskey
Contributor

Has any progress been made on this?

@liferoad
Contributor

liferoad commented Jul 2, 2024

Not yet. We can move this to 2.59.0.

@lostluck
Contributor

Has any progress been made on this?

@lostluck
Contributor

Moved to 2.60.0

@kennknowles kennknowles removed this from the 2.60.0 Release milestone Aug 22, 2024
@kennknowles
Member

Based on this getting pushed from release to release, it is clearly not a true release-blocker.

@serratedserenade

serratedserenade commented Nov 29, 2024

Has there been any progress on this? If there is none, can anyone suggest a monkey patch or some sort of hotfix that we can apply ourselves?

EDIT: I did some disgusting hacks, and it seems that if we're able to somehow pass raw_download as a parameter to the Blob.download_as_bytes function, everything works fine.
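For reference, raw_download is an existing parameter of Blob.download_as_bytes in google-cloud-storage; a minimal illustration of the difference (bucket/object names are placeholders):

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("data/file.json.gz")

served = blob.download_as_bytes()                   # decompressed by GCS transcoding
stored = blob.download_as_bytes(raw_download=True)  # gzip bytes exactly as stored

print(len(served), len(stored), blob.size)  # blob.size matches the raw (stored) length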

@liferoad
Contributor

#31040 (comment) does not work for you?

@serratedserenade

serratedserenade commented Nov 29, 2024

#31040 (comment) does not work for you?

That's what I was using before, until we realised this method was causing part of the data we were reading to be skipped. Basically the same problem as here: #31040 (comment)

However, just to make sure, I will double check.

Just to add details:

  • Using Beam 2.61 locally.
  • Reading from a GCS bucket into which a data provider pushes data with the following metadata (there is no option for them to avoid this or change how they set the headers):
    • Content-Type: application/json
    • Content-Encoding: gzip

EDIT: I did a quick check and realised that I was using both the workaround you pointed at and the UNCOMPRESSED flag; however, removing the UNCOMPRESSED flag just gives me Error -3 while decompressing data: incorrect header check

@kennknowles
Member

This issue came up and was fixed in the Java SDK file IO many years ago, so we should have some reference material to work with. I am trying to find it.

And presumably also related to #18390.

@shunping
Contributor

shunping commented Dec 15, 2024

Picking up this issue again. I confirm that I am able to reproduce the problem with the code in #31040 (comment) on the latest Beam.

@shunping
Contributor

shunping commented Dec 15, 2024

I briefly debugged the issue; it seems that the file is accessed with decompressive transcoding (https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding).

I checked the metadata of the GCS file in use, and the result is as follows.

$ gcloud storage objects describe gs://apache-beam-samples/gcs/bigfile_with_encoding_plain.txt.gz
acl:
...
bucket: apache-beam-samples
content_encoding: gzip
content_type: text/plain
crc32c_hash: /REfiQ==
creation_time: 2024-04-27T20:19:48+0000
...
metageneration: 1
name: gcs/bigfile_with_encoding_plain.txt.gz
size: 7635685
storage_class: STANDARD
storage_class_update_time: 2024-04-27T20:19:48+0000
storage_url: gs://apache-beam-samples/gcs/bigfile_with_encoding_plain.txt.gz#1714249188497359
update_time: 2024-04-27T20:19:48+0000

The size field shows the size of the gzip file, i.e. 7635685. The original file size prior to gzip should be 10100000.

$ wc bigfile_with_encoding_plain.txt
  100000  252788 10100000 bigfile_with_encoding_plain.txt

When debugging in the DoFn _ReadRange(), I see the initial range matches the stored (compressed) file size.

(screenshot: debugger in _ReadRange showing the read range capped at 7635685 bytes)

I think the size is the problem here. Because the file metadata meets the criteria for decompressive transcoding, GCS sends uncompressed data when we request the object, while our DoFn thinks the file is only 7635685 bytes and stops reading there.

@shunping
Contributor

Just checked around. It seems GCS decompressive transcoding causes similar issues elsewhere.

@shunping
Contributor

The range is determined by the size obtained from GCSIO (file_metadata = self._gcsIO()._status(path), and then file_status['size'] = gcs_object.size), which in turn calls the GCS Python client library, google-cloud-storage.

The size property in the Blob object there comes from "Content-Length". (https://cloud.google.com/storage/docs/json_api/v1/objects)
(screenshot: GCS JSON API objects resource documentation for the size field)

@shunping
Contributor

shunping commented Jan 3, 2025

To clarify the behavior of textio with various content encoding, content type, and compression settings, I've expanded the table in the Apache Beam GitHub issue #18390. This table compares the behavior across two Beam SDK versions: 2.52.0 (prior to the GCSIO migration) and 2.62.0 (the upcoming release). I also include the proposed behavior of my fix in the last column.

(screenshot: table of textio behavior for each content-encoding / content-type / compression-type combination in Beam 2.52.0, 2.62.0, and with the proposed fix)

A few notes about how the data was generated.

  • For the first 3 x 2 x 3 rows, the text data is gzipped locally and then uploaded to GCS. Then the metadata values of content-type and content-encoding are manually adjusted.
  • For the row marked as "copy default text file", the text data is directly copied/uploaded to GCS without gzip.
  • For the row marked as "copy default gzip file", the gzipped text data is copied/uploaded to GCS.
  • For the row marked as "copy default text file with gzip-local flag", the text data is uploaded to GCS with that flag:
    gcloud storage cp -Z ./textio-test-data.1k.txt gs://apache-beam-samples/textio/textio-test-data.gzip-local.1k.txt.gz

@shunping
Contributor

shunping commented Jan 3, 2025

Notice that when content-type=application/gzip and content-encoding=gzip, GCS considers the file to be doubly compressed (https://cloud.google.com/storage/docs/transcoding#gzip-gzip), which does not match the actual content we store here (i.e., a gzipped text file). Therefore, an exception is thrown.

As for the UnicodeDecodeError cases, most of them are due to users specifying "UNCOMPRESSED" for a gzip file. We can leave it this way, or we can give a better, more informative error when "content-type" and/or "content-encoding" does not match the specified compression type.
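A hypothetical sketch of such a check (not actual Beam code; the function and message are illustrative assumptions):

from apache_beam.io.filesystem import CompressionTypes

def check_compression(content_encoding, compression_type, path):
    # Content-Encoding: gzip triggers decompressive transcoding, so the bytes
    # Beam receives are already decompressed; UNCOMPRESSED then truncates reads.
    if content_encoding == "gzip" and compression_type == CompressionTypes.UNCOMPRESSED:
        raise ValueError(
            f"{path} has Content-Encoding: gzip; use CompressionTypes.GZIP or AUTO "
            "instead of UNCOMPRESSED to avoid truncated or garbled reads.")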

@kennknowles WDYT?

@kennknowles
Member

I agree with all of your proposals that replace "Data Loss" with "UnicodeDecodeError"

@kennknowles
Member

In all the cases where GCS transcoding causes a zlib.error, I also agree with making them function correctly. Basically, if the user says it is GZIP and the data really is GZIP, but we know that GCS is going to decode it, then we do not do a redundant decode.

@shunping
Contributor

shunping commented Jan 4, 2025

I agree with all of your proposals that replace "Data Loss" with "UnicodeDecodeError"

Great! GCS decompressive transcoding is a bit unintuitive to Beam users here, and when it happens, we see data loss. I think it is more natural to expect users to specify GZIP or AUTO in those cases rather than UNCOMPRESSED, as shown in the proposal.

Basically, if the user says it is GZIP and the data really is GZIP, but we know that GCS is going to decode it, then we do not do a redundant decode.

Yep, that's the idea, but it is implemented differently in the proposed fix (#33384).

In my fix (#33384), I let the GCS client library skip the decoding step and rely on Beam's decoding mechanism in CompressedFile to process the file. I believe this is more intuitive for our users, and the end result of this approach is exactly the same as what was proposed in the previous table.
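A rough sketch of that idea outside of Beam (not the actual PR #33384 code; bucket/object names are placeholders): fetch the raw, still-compressed bytes from GCS and decompress them locally, which is the role CompressedFile plays inside Beam.

import gzip
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("data/file.txt.gz")

# raw_download=True returns the gzip bytes exactly as stored, skipping GCS
# decompressive transcoding; decompression then happens on the client side.
raw = blob.download_as_bytes(raw_download=True)
text = gzip.decompress(raw).decode("utf-8")
print(len(text.splitlines()))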

@github-actions github-actions bot added this to the 2.63.0 Release milestone Jan 9, 2025