Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header #18390

Open
kennknowles opened this issue Jun 3, 2022 · 15 comments

Comments

@kennknowles
Member

We have gzipped text files in Google Cloud Storage that have the following metadata headers set:


Content-Encoding: gzip
Content-Type: application/octet-stream

Trying to read these with apache_beam.io.ReadFromText yields the following error:


ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz: Cannot have start index greater than total size
Traceback (most recent call last):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in _fetch_to_queue
    value = func(*args)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 610, in _get_segment
    downloader.GetRange(start, end)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 477, in GetRange
    progress, end_byte = self.__NormalizeStartEnd(start, end)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 340, in __NormalizeStartEnd
    'Cannot have start index greater than total size')
TransferInvalidError: Cannot have start index greater than total size

WARNING:root:Task failed: Traceback (most recent call last):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py", line 300, in __call__
    result = evaluator.finish_bundle()
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 206, in finish_bundle
    bundles = _read_values_to_bundles(reader)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 196, in _read_values_to_bundles
    read_result = [GlobalWindows.windowed_value(e) for e in reader]
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py", line 79, in read
    range_tracker.sub_range_tracker(source_ix)):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 155, in read_records
    read_buffer)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 245, in _read_record
    sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 190, in _find_separator_bounds
    file_to_read, read_buffer, current_pos + 1):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 212, in _try_to_ensure_num_bytes_in_buffer
    read_data = file_to_read.read(self._buffer_size)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 460, in read
    self._fetch_to_internal_buffer(num_bytes)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 420, in _fetch_to_internal_buffer
    buf = self._file.read(self._read_size)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 472, in read
    return self._read_inner(size=size, readline=False)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 516, in _read_inner
    self._fetch_next_if_buffer_exhausted()
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 577, in _fetch_next_if_buffer_exhausted
    raise exn
TransferInvalidError: Cannot have start index greater than total size

After removing the Content-Encoding header the read works fine.

Imported from Jira BEAM-1874. Original Jira may contain additional context.
Reported by: smphhh.

@linamartensson

Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.

@kennknowles
Member Author

Bringing over some context from https://cloud.google.com/storage/docs/transcoding it seems like there are the following consistent situations:

  1. GCS transcodes and Beam works with this transparently.
    • Content-encoding: gzip
    • Content-type: X
    • Beam's IO reads it expecting contents to be X. I believe the problem is that GCS serves metadata that results in wrong splits.
  2. GCS does not transcode because the metadata is set to not transcode (current recommendation)
    • Content-encoding: <empty>
    • Content-type: gzip
    • Beam's IO reads and the user specifies gzip or it is autodetected by the IO
  3. GCS does not transcode because the Beam IO requests no transcoding
    • Content-encoding: gzip
    • Content-type: X
    • Beam's IO passes the header Accept-Encoding: gzip

I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
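
To illustrate the suspected mismatch in situation 1, here is a minimal sketch using only the Python standard library. It assumes, as one plausible reading of the error above, that a reader plans its byte ranges from the stored (compressed) object size while a transcoding server sends the decompressed bytes. The payload, the `plan_ranges` helper, and the chunk size are hypothetical and are not Beam or GCS code.

```python
import gzip

# Hypothetical payload standing in for a text file stored in GCS.
plain = b"line of text\n" * 10_000
stored = gzip.compress(plain)

stored_size = len(stored)  # size a metadata lookup would report (compressed)
served_size = len(plain)   # size of the body after decompressive transcoding


def plan_ranges(total_size, chunk):
    """Split [0, total_size) into (start, end) byte ranges, end exclusive."""
    return [(start, min(start + chunk, total_size))
            for start in range(0, total_size, chunk)]


# Ranges planned from the stored size only cover the compressed length.
ranges_from_metadata = plan_ranges(stored_size, chunk=64)
covered = ranges_from_metadata[-1][1]

print("stored size:", stored_size)
print("served size:", served_size)
print("bytes covered by planned ranges:", covered)
```

Ranges planned from the stored size stop well short of the decompressed length, so offsets derived from one view of the object do not line up with the other.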

@sqlboy

sqlboy commented Nov 26, 2022

Guys this is a major issue.

@daniels-cysiv

This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"?

@sqlboy

sqlboy commented Jan 10, 2023

The way to fix this is to use the Python GCS library instead of the GCS client in Beam, assuming you can and that the read is not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread safe; it looks like it has been moved off httplib2.
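
A minimal sketch of that approach, assuming the google-cloud-storage package and its `Blob.download_as_bytes(raw_download=True)` option, which asks for the stored bytes without decompressive transcoding. The helper name `read_gzip_lines` and the bucket and object names in the usage comment are hypothetical.

```python
import gzip


def read_gzip_lines(blob, encoding="utf-8"):
    """Return the text lines of a gzip-compressed object.

    `blob` is expected to behave like google.cloud.storage.Blob; only
    download_as_bytes(raw_download=True) is called on it.
    """
    # raw_download=True requests the stored (still compressed) bytes,
    # so decompression happens here rather than on the server.
    raw = blob.download_as_bytes(raw_download=True)
    return gzip.decompress(raw).decode(encoding).splitlines()


# Usage with the real client (hypothetical bucket and object names):
#   from google.cloud import storage
#   blob = storage.Client().bucket("example-bucket").blob("data/sample.csv.gz")
#   for line in read_gzip_lines(blob):
#       print(line)
```

This loads the whole object into memory, so it only suits files that fit comfortably in a worker's RAM.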

@kennknowles
Member Author

Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.
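
A rough sketch of what option 3 could look like at the HTTP level, using only the Python standard library rather than Beam's GCS client. Per the transcoding documentation linked above, a request carrying Accept-Encoding: gzip should receive the stored gzip bytes along with a Content-Encoding: gzip response header. The function names and the example URL are hypothetical, and no request is actually sent here.

```python
import gzip
import urllib.request


def build_raw_request(url):
    """Build a GET request that asks the server not to decompress the object."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Decompress the body locally if the server sent it gzip-encoded."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical object URL, for illustration only.
request = build_raw_request(
    "https://storage.googleapis.com/example-bucket/sample.csv.gz")

# Sending it would look roughly like this (authentication omitted):
#   with urllib.request.urlopen(request) as response:
#       text = decode_body(response.read(),
#                          response.headers.get("Content-Encoding"))
```

With this shape the byte offsets seen by the client refer to the stored compressed bytes, which is the same view the object's size metadata describes.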

@chavdaparas

You can upload the object to GCS with the Content-Type set to indicate compression and no Content-Encoding at all, which the GCS documentation describes as a best practice:

Content-Encoding: <empty>
Content-Type: application/gzip

In this case the only thing immediately known about the object is that it is gzip-compressed, with no information regarding the underlying object type. Moreover, the object is not eligible for decompressive transcoding.
Reference: https://cloud.google.com/storage/docs/transcoding

Beam's ReadFromText with compression_type=CompressionTypes.GZIP works fine with the above option:

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

p | "Read GCS File" >> beam.io.ReadFromText(file_pattern=file_path, compression_type=CompressionTypes.GZIP, skip_header_lines=int(skip_header))

Ways to compress the file

  1. Implicitly by specifying gsutil cp -Z <filename> <bucket>
  2. Explicitly by compressing the file first like gzip <filename> and load it to GCS
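
For the explicit route, a small sketch of doing the compression from Python with the standard library instead of the gzip command. The helper name `gzip_file` is hypothetical, and unlike the gzip command this version leaves the original file in place.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Write a gzip-compressed copy of src_path and return the new path."""
    dst_path = dst_path or src_path + ".gz"
    # Stream the bytes across so large files are not loaded into memory.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

The resulting .gz file can then be uploaded like any other object.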

For more details around which combination works please see the table below :

[Screenshot of the table, dated 2023-02-08; contents not reproduced here]

@Murli16

Murli16 commented Feb 10, 2023

Hi @kennknowles @sqlboy ,

The option that works correctly so far is as below

  1. Do an explicit compression of the file - gzip
  2. Upload the file to GCS with the correct content type - application/gzip
     gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
  3. Content encoding will not be set
     gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz

bucket: gcp-sandbox-1-359004
contentType: application/gzip
crc32c: v1lNUQ==
etag: CLnDx+CIif0CEAE=
generation: '1675967308358073'
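
As a way to sanity-check output like the above, here is a simplified sketch of the eligibility rules described in the GCS transcoding documentation: an object is decompressed on download only if it is stored with Content-Encoding: gzip, is not marked Cache-Control: no-transform, and the request does not send Accept-Encoding: gzip. The function name is hypothetical, the `contentEncoding` and `cacheControl` keys are assumed to follow the same naming as `contentType` above, and the real service may apply further conditions.

```python
def will_be_transcoded(metadata, request_headers=None):
    """Guess whether GCS would decompress this object when serving it."""
    headers = {k.lower(): v for k, v in (request_headers or {}).items()}
    # Only objects stored with Content-Encoding: gzip are candidates.
    if (metadata.get("contentEncoding") or "").lower() != "gzip":
        return False
    # Cache-Control: no-transform tells GCS to serve the bytes as stored.
    if "no-transform" in (metadata.get("cacheControl") or "").lower():
        return False
    # A client that accepts gzip receives the compressed bytes unchanged.
    if "gzip" in headers.get("accept-encoding", "").lower():
        return False
    return True


# Metadata as shown in the describe output above: no content encoding set.
described = {"bucket": "gcp-sandbox-1-359004", "contentType": "application/gzip"}
print(will_be_transcoded(described))
```

For the object described above the check reports that no transcoding would take place, which matches the caveat that downloads return the .gz file as stored.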

The only caveat is that the user loses the benefit of transcoding: anyone downloading the object from the bucket gets a .gz file.

While we explore this caveat with the client, we wanted to check whether Option 1 mentioned in the comment (#18390 (comment)) can be fixed.

That option would give the best of both worlds: Dataflow would be able to read a compressed file and the user could still benefit from transcoding.

Please let me know if you have any alternative suggestions.

@BjornPrime
Contributor

.take-issue

@liferoad
Collaborator

@BjornPrime is working on fixing #25676, which might fix this issue as well.

@BjornPrime
Contributor

Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.

@kennknowles
Member Author

I haven't thought about this in a while, but is there a problem with always passing Accept-encoding: gzip ?

@chaitanya1293

I am encountering a similar issue when uploading my SQL files from GitHub via CI. Not sure if this issue has been fixed yet. I tried setting the parameter

headers: |-
  content-type: application/octet-stream

but it didn't make any change in the error.

@liferoad
Collaborator

same as #31040

@liferoad
Collaborator

liferoad commented Dec 2, 2024

cc @shunping
