[Bug]: ReadAllFiles does not fully read gzipped files from GCS #31040
Comments
Thanks for reporting. Agree this is a P1 bug as it causes data loss. |
Is it possible to provide a working example that reproduces the issue? That would help triage. |
@shunping FYI |
@Abacn I don't have a working example; however, the steps to reproduce are:
EDIT: This issue will probably appear for any compression type. I just encountered it with gzip but did not test with other compression algorithms. |
I uploaded one test file here:

```python
# standard libraries
import logging

# third party libraries
import apache_beam as beam
from apache_beam import Create, Map
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import Count

logger = logging.getLogger()
logger.setLevel(logging.INFO)

elements = [
    "gs://apache-beam-samples/gcs/bigfile.txt.gz",
]

options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | Create(elements)
        | "Read File from GCS" >> ReadAllFromText()
        | Count.Globally()
        | "Log" >> Map(lambda x: logging.info("Total lines %d", x))
    )
```

This shows:
|
So I double-checked, and there are differences between your example and our case.
Furthermore, after removing the encoding type from our file and using |
For a quick patch we use the following solution:
|
This is expected, as I mentioned earlier |
I see. We need to check decompressive transcoding for the GCS file to determine whether the content is compressed, rather than relying on the file extension.
This only loads 75,601 lines. #19413 could be related to uploading the file to GCS. |
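A minimal sketch of that kind of check, using the google-cloud-storage client directly (the bucket and object name are the ones from the reproduction above; the actual integration point inside Beam would differ):

```python
# Hypothetical check, outside of Beam: decompressive transcoding applies when the
# object is stored with Content-Encoding: gzip, regardless of the file extension.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("apache-beam-samples").get_blob("gcs/bigfile.txt.gz")

if blob is not None and blob.content_encoding == "gzip":
    print("GCS will serve this object decompressed unless a raw download is requested")
```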
.take-issue |
Have we reproduced this? |
Yes, see my above link: #31040 (comment) |
Is there hope of a fix for a 2.57.0 cherry-pick? I would guess this is a longstanding issue, so getting it fixed in a very thorough way for 2.58.0 is actually the best thing to do. I recall we had decompressive transcoding bugs in the past, so we should make sure we really get it right this time. And the user can mitigate by configuring GCS not to do the transcoding. |
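For reference, one way to apply the mitigation mentioned above (configuring GCS not to transcode) is to set Cache-Control: no-transform on the object, which per the GCS transcoding documentation makes GCS serve the stored bytes as-is. A hedged sketch with the google-cloud-storage client, using placeholder bucket and object names:

```python
# Sketch of the mitigation: mark the object so GCS serves the stored (gzip) bytes
# as-is instead of transcoding them. Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("your-bucket").blob("path/to/file.txt.gz")

blob.cache_control = "no-transform"  # disables decompressive transcoding for this object
blob.patch()                         # persist the metadata change
```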
Moved this to 2.58.0. Thanks! |
Has any progress been made on this? |
Not yet. We can move this to 2.59.0. |
Has any progress been made on this? |
Moved to 2.60.0 |
Based on this getting pushed from release to release, it is clearly not a true release-blocker. |
Has there been any progress on this? If there is none, can anyone suggest a monkey patch or some sort of hotfix that we can apply ourselves?
EDIT: Did some disgusting hacks and it seems that if we're able to somehow pass |
#31040 (comment) does not work for you? |
That's what I was using previously, until we realised this method was causing part of the data we were reading to be skipped. Basically the same problem as here: #31040 (comment). However, just to make sure, I will double-check. Just to add details:
EDIT: Did a quick check and realised that I was using both the one you pointed at and the UNCOMPRESSED flag; however, removing the UNCOMPRESSED flag just gives me the |
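For reference, the UNCOMPRESSED flag mentioned above is passed roughly like this (a sketch only; it assumes GCS is transcoding the object so Beam already receives plain text, and as the thread shows, results with this workaround varied):

```python
# Sketch of the workaround discussed above: tell Beam not to decompress the data
# itself, on the assumption that GCS already decompressed it via transcoding.
import apache_beam as beam
from apache_beam import Create
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    (
        p
        | Create(["gs://apache-beam-samples/gcs/bigfile.txt.gz"])
        | "Read" >> ReadAllFromText(compression_type=CompressionTypes.UNCOMPRESSED)
    )
```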
This issue came up and was fixed in the Java SDK file IO many years ago, so we should have some reference material to work with. I am trying to find it. And presumably also related to #18390. |
Picking this issue up again. I confirm that I am able to reproduce the problem with the code in #31040 (comment) on the latest Beam. |
I briefly debugged the issue; it seems that the file is accessed with decompressive transcoding (https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding). I checked the metadata of the GCS file in use, and the result is as follows:

```
$ gcloud storage objects describe gs://apache-beam-samples/gcs/bigfile_with_encoding_plain.txt.gz
acl:
...
bucket: apache-beam-samples
content_encoding: gzip
content_type: text/plain
crc32c_hash: /REfiQ==
creation_time: 2024-04-27T20:19:48+0000
...
metageneration: 1
name: gcs/bigfile_with_encoding_plain.txt.gz
size: 7635685
storage_class: STANDARD
storage_class_update_time: 2024-04-27T20:19:48+0000
storage_url: gs://apache-beam-samples/gcs/bigfile_with_encoding_plain.txt.gz#1714249188497359
update_time: 2024-04-27T20:19:48+0000
```

The size field shows the size of the gzip file, i.e. 7635685. The original file size prior to gzip should be 10100000.
When debugging in the DoFn, I think the size is the problem here. Because the file metadata meets the criteria for decompressive transcoding, GCS sends uncompressed data when we request the object, while our DoFn thinks the decompressed file size is only 7635685 bytes. |
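That mismatch can be seen outside of Beam as well; a small sketch with the google-cloud-storage client (object name and sizes taken from the metadata above):

```python
# Sketch: the metadata reports the stored (gzip) size, but a default download
# returns the decompressed bytes because of decompressive transcoding.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("apache-beam-samples").get_blob(
    "gcs/bigfile_with_encoding_plain.txt.gz")

print("metadata size:  ", blob.size)   # 7635685 (the gzip size)
data = blob.download_as_bytes()        # served decompressed by GCS
print("downloaded size:", len(data))   # ~10100000 (the original size)
```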
Just checked around. It seems GCS decompressive transcoding causes similar issues elsewhere. |
The range is determined by the size obtained from GCSIO (beam/sdks/python/apache_beam/io/gcp/gcsio.py, line 477 in d7502fa), which uses google-cloud-storage. The size property in the Blob object there comes from "Content-Length" (https://cloud.google.com/storage/docs/json_api/v1/objects). |
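In other words, the size Beam plans its read range around is the stored object's Content-Length. A sketch showing that a raw download, which bypasses the client library's transcoding handling, returns exactly that many bytes, still compressed:

```python
# Sketch: raw_download=True asks for the stored bytes as-is, so their length
# matches the Blob's metadata size, unlike the transcoded (decompressed) download.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("apache-beam-samples").get_blob(
    "gcs/bigfile_with_encoding_plain.txt.gz")

raw = blob.download_as_bytes(raw_download=True)  # still-gzipped payload
assert len(raw) == blob.size                     # 7635685, the size gcsio sees
```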
To clarify the behavior of textio with various content encoding, content type, and compression settings, I've expanded the table in the Apache Beam GitHub issue #18390. This table compares the behavior across two Beam SDK versions: 2.52.0 (prior to the GCSIO migration) and 2.62.0 (the upcoming release). I also include the proposed behavior of my fix in the last column. A few notes about how the data is generated.
|
Notice that when For any @kennknowles WDYT? |
I agree with all of your proposals that replace "Data Loss" with "UnicodeDecodeError" |
In all the cases where GCS transcoding causes a zlib.error, I also agree with making them function correctly. Basically, if the user says it is GZIP and the data really is GZIP, but we know that GCS is going to decode it, then we do not do a redundant decode. |
Great! GCS decompressive transcoding is a bit unintuitive to Beam users here, and when it happens, we see data loss. I think it is more natural to expect users to specify GZIP or AUTO in those cases rather than UNCOMPRESSED, as shown in the proposal.
Yep, that's the idea, but it is implemented differently in the proposed fix (#33384).
In my fix (#33384), I let the GCS client library skip its own decoding step and rely on Beam's decoding mechanism in CompressedFile to process the file. I believe this is more intuitive for our users, and the end result of this approach is exactly the same as what was proposed in the previous table. |
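Not the actual change in #33384, but a rough sketch of the idea described above, combining a raw (non-decoded) download with Beam's CompressedFile wrapper so that sizes and offsets refer to the compressed stream:

```python
# Sketch of the approach: skip the client library's decoding (raw download) and let
# Beam's CompressedFile handle gzip decompression. Object name is the test file above.
import io

from apache_beam.io.filesystem import CompressedFile, CompressionTypes
from google.cloud import storage

client = storage.Client()
blob = client.bucket("apache-beam-samples").get_blob(
    "gcs/bigfile_with_encoding_plain.txt.gz")

raw = blob.download_as_bytes(raw_download=True)   # compressed bytes; len(raw) == blob.size
compressed = CompressedFile(io.BytesIO(raw), compression_type=CompressionTypes.GZIP)
first_line = compressed.readline()                # decompressed on the Beam side
```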
What happened?
Since the refactor of gcsio (2.52?), ReadAllFiles does not fully read gzipped files from GCS. Part of the file will be correctly returned, but the rest will go missing.
I presume this is caused by the fact that GCS performs decompressive transcoding while _ExpandIntoRanges uses the GCS object's metadata to determine the read range. This means that the file size we receive is larger than the maximum of the read range.
For example, a gzip file on GCS might have a file size of 1 MB, and this will be the object size in the metadata. Thus the maximum of the read range will be 1 MB. However, when Beam opens the file, it is already decompressed by GCS, so the file size will be 1.5 MB and we won't read 0.5 MB of it, causing data loss.
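The mechanism can be illustrated with a self-contained toy example (no GCS involved): if the read range is capped at the compressed size from the metadata while the stream being read is already decompressed, the tail of the file is silently dropped.

```python
# Toy illustration of the failure mode: cap the read at the *compressed* size while
# reading an already-decompressed stream, and count how many lines survive.
import gzip

original = b"\n".join(b"line %d" % i for i in range(100_000)) + b"\n"
compressed_size = len(gzip.compress(original))   # what the object metadata would report

served = original                                # what GCS serves after transcoding
read_range = served[:compressed_size]            # range capped at the metadata size

print("original lines:", original.count(b"\n"))
print("lines read:    ", read_range.count(b"\n"))  # fewer -> silent data loss
```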
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components