Uploading .gz file to S3 results in file being stored uncompressed #451

Closed
skruger opened this issue Jan 19, 2018 · 14 comments
Comments

@skruger

skruger commented Jan 19, 2018

I have compressed log files that I'm trying to store in S3 using the django-storages S3 backend, but when I inspect the files I discover that S3 stored them in their uncompressed form. I have done some digging and found that django-storages correctly identifies my files as gzipped, but then passes that as a ContentEncoding argument, so S3 interprets the data as gzip transfer encoding and decompresses it at the HTTP layer at put time.

Content encoding detection:
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L421-L433
In the _save() method the encoding is detected with mimetypes.guess_type(), which in my case returns 'gzip'.
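
For reference, this is the standard-library behaviour being described (a quick check with illustrative filenames; the exact content type guessed depends on the platform's mimetypes tables):

import mimetypes

# guess_type() returns (content_type, encoding); the backend forwards the
# second element as the ContentEncoding object parameter.
print(mimetypes.guess_type("data.csv.gz"))  # ('text/csv', 'gzip')
print(mimetypes.guess_type("app.log.gz"))   # typically (None, 'gzip'); .log has no default mapping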

If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES, then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called. That is undesirable with 100 MB compressed log files.

The bucket object obj that is created is used to call upload_fileobj with the ContentEncoding parameter, which eventually calls Client.put_object with ContentEncoding set to gzip:
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
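
Reduced to a standalone boto3 sketch, the call path being described looks roughly like this (bucket name and key are placeholders, not anything from django-storages itself):

import boto3

s3 = boto3.resource("s3")
obj = s3.Object("example-bucket", "logs/access.log.gz")  # placeholder bucket/key

with open("access.log.gz", "rb") as fileobj:
    # ExtraArgs is forwarded to the underlying upload, so the stored object
    # carries ContentEncoding: gzip metadata -- the behaviour at issue here.
    obj.upload_fileobj(fileobj, ExtraArgs={"ContentEncoding": "gzip"})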

I can create a pull request that would fix this for my use case, but I would like to understand what is expected and which option is least likely to break other people's use cases.
Options for fixing this:

  1. Detect the .gz file extension and do not set the ContentEncoding value
  2. Disable ContentEncoding detection entirely if AWS_IS_GZIPPED == False - Is there a use case for locally compressed assets like stylesheets that need to be decompressed on S3?
  3. Other options I haven't thought of?

If you have any thoughts on what kind of approach you would like to see taken, I can incorporate them and get a pull request submitted for review.

@skruger
Author

skruger commented Jan 19, 2018

This appears to be related to #263 as opened by @ldng.

Perhaps understanding the use case that was supported by that change could help me know how to handle this for my use case.

skruger added a commit to skruger/django-storages that referenced this issue Jan 20, 2018
skruger added a commit to skruger/django-storages that referenced this issue Jul 12, 2018
skruger added a commit to skruger/django-storages that referenced this issue Aug 16, 2018
@jschneier
Owner

jschneier commented Aug 16, 2018

Hi @skruger, sorry, can we discuss here before adding a setting in #453? I think the proper fix is that we shouldn't override any setting given by the user. So you could add something like this to your file field or your storage:

headers = {
  'Content-Encoding': 'identity'
}

Thoughts on that approach?

@skruger
Author

skruger commented Aug 22, 2018

We can certainly discuss this. I added the configuration option in #453 to allow someone to configure the broken behavior if they were dependent on it.

I don't think it makes sense to require setting headers on my storage object just so that uploading .gz files (among others) actually stores .gz files. There appears to be some confusion about what the Content-Encoding header means, which is why the current behavior, introduced after the 1.5.2 release, broke my application: it forces Content-Encoding: gzip when it detects that I'm uploading a gzip-encoded file, which causes S3 to unzip the file before storage.

@skruger
Author

skruger commented Aug 23, 2018

@jschneier I did a little more research to see if I could understand the problem being solved. If needed, I could work on a change that actually solves the problem of how to serve gzip-encoded assets. I found a post that explains how to serve gzip-encoded assets from S3 using Content-Encoding in the object's metadata.

https://zanon.io/posts/serving-gzipped-files-in-amazon-s3-cloudfront

The correct behavior for serving static assets is to store the compressed asset, set Content-Encoding in the object's metadata, and return Content-Encoding: gzip when a user downloads the asset. Instead, the library sends the header on the upload request to S3, so the asset is decompressed for storage.
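
For reference, a minimal sketch of that correct flow using plain boto3 (bucket and key names are made up; this mirrors the approach described in the linked post, not anything django-storages currently does):

import gzip

import boto3

s3 = boto3.client("s3")

# Compress the asset locally, then store the *compressed* bytes along with the
# metadata that tells browsers and CDNs to decompress on download.
with open("styles.css", "rb") as f:
    body = gzip.compress(f.read())

s3.put_object(
    Bucket="example-bucket",     # placeholder
    Key="static/styles.css",
    Body=body,
    ContentType="text/css",
    ContentEncoding="gzip",      # served back as Content-Encoding: gzip
)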

Anyone who was trying to compress their assets probably missed that it wasn't really working: when they referenced their assets, the files came down uncompressed with no content encoding instead of compressed with Content-Encoding: gzip. If you never look at the headers and body directly, the two states are indistinguishable at the browser layer.

The current behavior breaks my use case of uploading .gz files by inserting incorrect headers, and it doesn't even work as intended for the use case of serving compressed static assets that the original change claims to handle.

@normangilmore

@skruger said "If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called."

It seems to me that it doesn't just attempt it; it actually does compress it again.

So with the setting

AWS_IS_GZIPPED = True
GZIP_CONTENT_TYPES = [
    'text/plain',
    'text/csv',
]

and noting that mimetypes.guess_type("myfile.csv.gz") is ('text/csv', 'gzip'),

it seems that the result is that my uploaded file is gzipped twice but still has the filename myfile.csv.gz. At least, I have to add a .gz extension and gunzip it twice to get my data back out after I use aws s3 cp to copy it back out of S3.
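
To make the double compression concrete, here is a small self-contained sketch (not django-storages code) showing that a twice-gzipped payload has to be decompressed twice to recover the original bytes:

import gzip

original = b"timestamp,level,message\n"   # stand-in for myfile.csv
once = gzip.compress(original)            # what was uploaded: myfile.csv.gz
twice = gzip.compress(once)               # what _compress_content() produces on top

# One gunzip is not enough; the data comes back still gzipped.
assert gzip.decompress(twice) == once
# Only a second decompression yields the original data.
assert gzip.decompress(gzip.decompress(twice)) == original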

So I support #453 or something similar being merged. It should be possible to upload .gz files without triggering re-compression based on the underlying mime-type.

@normangilmore

normangilmore commented Oct 11, 2018

Code in question is at
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L482

_type, encoding = mimetypes.guess_type(name)
...

if self.gzip and content_type in self.gzip_content_types:
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})
elif encoding:
    # If the content already has a particular encoding, set it
    parameters.update({'ContentEncoding': encoding})

So if the file is file.csv.gz, and text/csv is in self.gzip_content_types, then it will be compressed again in the first if clause. A normal download seems transparent, because the content encoding is reversed once by the browser, and you get your inner .gz file back with no surprises.

But if you are having the files scanned on S3, by, say, AWS Glue, then it will be totally blocked by the double compression.

Now look at the elif encoding: clause.
If the file is file.csv.gz, then the content encoding header will be added. On download, the file will be gunzipped by the browser, but the filename is now wrong: the OS still sees the .gz extension, and when you try to decompress the already-decoded file, it fails. You have to manually remove the .gz.

So to be correct, the first branch of the if should add a .gz extension to reflect that the double compression exists on S3, and the elif branch should remove the .gz extension to reflect that the ContentEncoding will be undone on download. Basically, this if block prevents .gz files from being moved back and forth to S3 without modification.

@normangilmore

normangilmore commented Oct 11, 2018

To be clear, I am not advocating that the extension .gz actually be added or removed. I'm just pointing out that would be the logical entailment of what is being attempted.

It would be much better if, when the .gz extension is detected, nothing is done to the file: no re-compression, obviously, and no ContentEncoding header either.
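
A minimal sketch of that idea, reusing the names from the branch quoted above (this is an illustration of the proposal, not necessarily what #615 implements):

content_type, encoding = mimetypes.guess_type(name)

if encoding is not None:
    # The file is already compressed (e.g. .gz): upload the bytes untouched and
    # set no ContentEncoding, so S3 stores and returns them exactly as given.
    pass
elif self.gzip and content_type in self.gzip_content_types:
    # Only compress content that is not already compressed.
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})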

normangilmore added a commit to normangilmore/django-storages that referenced this issue Oct 11, 2018
When uploading a compressed file to S3, don't add ContentEncoding header.

These changes allow .gz files to be uploaded and downloaded without modification.
@normangilmore

See #615 for my proposed fixes.

@svanscho

svanscho commented Sep 27, 2019

We had the same issue. I don't understand why you would set the header on a gzipped or tar-gzipped file in the first place. You are saving a file that comes out decompressed on the other side. Either django-storages compresses the content itself and sets the header accordingly, or it uploads the file unaltered and doesn't set the header. Am I missing something?

I do get that the compression comes in handy for large text-based files. But then the header should only be set because the content gets gzipped by django-storages itself.

@skruger
Author

skruger commented Sep 27, 2019

@svanscho I’m a little lost myself as to why this clearly broken behavior is accepted as the default when it doesn’t even work for what it claims to enable. I’m stuck on 1.5.2 (which was released in January of 2017) until this is fixed. If I start running into problems with newer versions of Django I’m going to be forced to maintain a fork until any kind of fix for this is accepted.

@svanscho

svanscho commented Sep 27, 2019

We have implemented a workaround by using custom extensions, which aren't MIME-type guessed, so the gzip header is not set or sent. Luckily we could change the extensions since we control both the server and client behaviour. Still clearly broken behaviour indeed. Thanks for the feedback.

@mmacvicar-splunk

Rolling back to django-storages==1.5.2 worked around this issue for us.

@hanvic

hanvic commented Dec 4, 2019

I resolved it by changing the extension from .gz to .gzip when uploading the file.
.gz is the common extension name, but .gzip is also used for GZIP files.
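
For context, that rename sidesteps the detection because Python's mimetypes module only maps .gz (not .gzip) to the gzip encoding, so no ContentEncoding is guessed for the renamed file (a quick check, assuming the default mimetypes tables):

import mimetypes

# .gz is in the encodings map, so an encoding is guessed...
print(mimetypes.guess_type("myfile.csv.gz"))    # ('text/csv', 'gzip')
# ...but .gzip is not, so neither a type nor an encoding is guessed.
print(mimetypes.guess_type("myfile.csv.gzip"))  # (None, None)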

@jschneier
Owner

If an explicit ContentEncoding is set then gzip will be skipped as of #828.
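
For anyone landing here later, one way to make the ContentEncoding explicit is through the storage's object parameters; a sketch, assuming your django-storages version supports the AWS_S3_OBJECT_PARAMETERS setting (check the docs for the release you run):

# settings.py -- sketch only; verify AWS_S3_OBJECT_PARAMETERS exists in your version
AWS_S3_OBJECT_PARAMETERS = {
    # Declaring the encoding yourself (e.g. 'gzip' or 'identity') means the
    # automatic gzip/ContentEncoding handling is skipped, per the fix in #828.
    'ContentEncoding': 'identity',
}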
