Uploading .gz file to S3 results in file being stored uncompressed #451

Closed
skruger opened this issue Jan 19, 2018 · 14 comments
Comments

@skruger

skruger commented Jan 19, 2018

I have compressed log files that I'm trying to store in S3 using the django-storages S3 backend, but when I inspect the files I discover that S3 stored them in their uncompressed form. I have done some digging and found that django-storages correctly identifies my files as gzipped, but then passes that as a ContentEncoding argument, so S3 interprets the data as gzip transfer encoding and decompresses it at the HTTP layer at put time.

Content encoding detection:
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L421-L433
In the _save() method the encoding is detected with mimetypes.guess_type(), which in my case returns 'gzip'.
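
For reference, this is the standard-library behaviour being described (a quick check with illustrative filenames; the exact content type guessed depends on the platform's mimetypes tables):

import mimetypes

# guess_type() returns (content_type, encoding); the backend forwards the
# second element as the ContentEncoding object parameter.
print(mimetypes.guess_type("data.csv.gz"))  # ('text/csv', 'gzip')
print(mimetypes.guess_type("app.log.gz"))   # typically (None, 'gzip'); .log has no default mapping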

If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES, then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called. That is undesirable with 100 MB compressed log files.

The bucket object obj that is created is used to call upload_fileobj with the ContentEncoding parameter, which eventually calls Client.put_object with ContentEncoding set to gzip:
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
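
Reduced to a standalone boto3 sketch, the call path being described looks roughly like this (bucket name and key are placeholders, not anything from django-storages itself):

import boto3

s3 = boto3.resource("s3")
obj = s3.Object("example-bucket", "logs/access.log.gz")  # placeholder bucket/key

with open("access.log.gz", "rb") as fileobj:
    # ExtraArgs is forwarded to the underlying upload, so the stored object
    # carries ContentEncoding: gzip metadata -- the behaviour at issue here.
    obj.upload_fileobj(fileobj, ExtraArgs={"ContentEncoding": "gzip"})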

I can create a pull request that would fix this for my use case, but I would like to understand what is expected and which option is least likely to break other people's use cases.
Options for fixing this:

  1. Detect the .gz file extension and do not set the ContentEncoding value
  2. Disable ContentEncoding detection entirely if AWS_IS_GZIPPED == False - Is there a use case for locally compressed assets like stylesheets that need to be decompressed on S3?
  3. Other options I haven't thought of?

If you have any thoughts on what kind of approach you would like to see taken, I can incorporate them and get a pull request submitted for review.

@skruger
Author

skruger commented Jan 19, 2018

This appears to be related to #263 as opened by @ldng.

Perhaps understanding the use case that was supported by that change could help me know how to handle this for my use case.

skruger added a commit to skruger/django-storages that referenced this issue Jan 20, 2018
skruger added a commit to skruger/django-storages that referenced this issue Jul 12, 2018
skruger added a commit to skruger/django-storages that referenced this issue Aug 16, 2018
@jschneier
Owner

jschneier commented Aug 16, 2018

Hi @skruger, sorry, can we discuss here before adding a setting in #453? I think the proper fix is that we shouldn't override any setting given by the user. So you could add something like this to your file field or your storage:

headers = {
  'Content-Encoding': 'identity'
}

Thoughts on that approach?

@skruger
Author

skruger commented Aug 22, 2018

We can certainly discuss this. I added the configuration option in #453 to allow someone to configure the broken behavior if they were dependent on it.

I don't think it makes sense to require setting headers on my storage object just so that uploading .gz files (among others) actually stores .gz files. There appears to be some confusion about what the Content-Encoding header means, which is why the current behavior, introduced after the 1.5.2 release, broke my application: it forces Content-Encoding: gzip when it detects that I'm uploading a gzip-encoded file, which causes S3 to unzip the file before storage.

@skruger
Author

skruger commented Aug 23, 2018

@jschneier I did a little more research to see if I could understand the problem being solved. If needed, I could work on a change that actually solves the problem of how to serve gzip-encoded assets. I found a post that explains how to serve gzip-encoded assets from S3 using Content-Encoding in the object's metadata.

https://zanon.io/posts/serving-gzipped-files-in-amazon-s3-cloudfront

The correct behavior for serving static assets is to store the compressed asset, set Content-Encoding in the object's metadata, and return Content-Encoding: gzip when a user downloads the asset. Instead, the library sends the header on the upload request to S3, so the asset is decompressed for storage.
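
For reference, a minimal sketch of that correct flow using plain boto3 (bucket and key names are made up; this mirrors the approach described in the linked post, not anything django-storages currently does):

import gzip

import boto3

s3 = boto3.client("s3")

# Compress the asset locally, then store the *compressed* bytes along with the
# metadata that tells browsers and CDNs to decompress on download.
with open("styles.css", "rb") as f:
    body = gzip.compress(f.read())

s3.put_object(
    Bucket="example-bucket",     # placeholder
    Key="static/styles.css",
    Body=body,
    ContentType="text/css",
    ContentEncoding="gzip",      # served back as Content-Encoding: gzip
)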

Anyone who was trying to compress their assets probably missed that it wasn't really working: when they referenced their assets, the files came down uncompressed with no content encoding instead of compressed with Content-Encoding: gzip. If you never look at the headers and body directly, the two states are indistinguishable at the browser layer.

The current behavior breaks my use case of uploading .gz files by inserting incorrect headers, and it doesn't even work as intended for the use case of serving compressed static assets that the original change claims to handle.

@normangilmore

@skruger said "If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called."

It seems to me that it doesn't just attempt it; it actually does compress it again.

So with the setting

AWS_IS_GZIPPED = True
GZIP_CONTENT_TYPES = [
    'text/plain',
    'text/csv',
]

and noting that mimetypes.guess_type("myfile.csv.gz") is ('text/csv', 'gzip'),

it seems that the result is that my uploaded file is gzipped twice but still has the filename myfile.csv.gz. At least, I have to add a .gz extension and gunzip it twice to get my data back out after I use aws s3 cp to copy it back out of S3.
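
To make the double compression concrete, here is a small self-contained sketch (not django-storages code) showing that a twice-gzipped payload has to be decompressed twice to recover the original bytes:

import gzip

original = b"timestamp,level,message\n"   # stand-in for myfile.csv
once = gzip.compress(original)            # what was uploaded: myfile.csv.gz
twice = gzip.compress(once)               # what _compress_content() produces on top

# One gunzip is not enough; the data comes back still gzipped.
assert gzip.decompress(twice) == once
# Only a second decompression yields the original data.
assert gzip.decompress(gzip.decompress(twice)) == original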

So I support #453 or something similar being merged. It should be possible to upload .gz files without triggering re-compression based on the underlying mime-type.

@normangilmore

normangilmore commented Oct 11, 2018

Code in question is at
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L482

_type, encoding = mimetypes.guess_type(name)
...

if self.gzip and content_type in self.gzip_content_types:
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})
elif encoding:
    # If the content already has a particular encoding, set it
    parameters.update({'ContentEncoding': encoding})

So if the file is file.csv.gz, and text/csv is in self.gzip_content_types, then it will be compressed again in the first if clause. A normal download seems transparent, because the content encoding is reversed once by the browser, and you get your inner .gz file back with no surprises.

But if you are having the files scanned on S3, by, say, AWS Glue, then it will be totally blocked by the double compression.

Now look at the elif encoding: clause.
If the file is file.csv.gz, then the content encoding header will be added. On download, the file will be gunzipped by the browser, but the filename is now wrong: the OS still sees the .gz extension, and when you try to decompress the already-decoded file, it fails. You have to manually remove the .gz.

So to be correct, the first branch of the if should add a .gz extension to reflect that the double compression exists on S3, and the elif branch should remove the .gz extension to reflect that the ContentEncoding will be undone on download. Basically, this if block prevents .gz files from being moved back and forth to S3 without modification.

@normangilmore

normangilmore commented Oct 11, 2018

To be clear, I am not advocating that the extension .gz actually be added or removed. I'm just pointing out that would be the logical entailment of what is being attempted.

It would be much better if, when the .gz extension is detected, nothing is done to the file: no re-compression, obviously, and no ContentEncoding header either.
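
A minimal sketch of that idea, reusing the names from the branch quoted above (this is an illustration of the proposal, not necessarily what #615 implements):

content_type, encoding = mimetypes.guess_type(name)

if encoding is not None:
    # The file is already compressed (e.g. .gz): upload the bytes untouched and
    # set no ContentEncoding, so S3 stores and returns them exactly as given.
    pass
elif self.gzip and content_type in self.gzip_content_types:
    # Only compress content that is not already compressed.
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})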

normangilmore added a commit to normangilmore/django-storages that referenced this issue Oct 11, 2018
When uploading a compressed file to S3, don't add ContentEncoding header.

These changes allow .gz files to be uploaded and downloaded without modification.
@normangilmore

See #615 for my proposed fixes.

@svanscho

svanscho commented Sep 27, 2019

We had the same issue. I don't understand why you would set the header on a gzipped or tar-gzipped file in the first place. You are saving a file that comes out decompressed on the other side. Either django-storages compresses the content itself and sets the header accordingly, or it uploads the file unaltered and doesn't set the header. Am I missing something?

I do get that the compression comes in handy for large text-based files. But then the header should only be set because the content gets gzipped by django-storages itself.

@skruger
Author

skruger commented Sep 27, 2019

@svanscho I’m a little lost myself as to why this clearly broken behavior is accepted as the default when it doesn’t even work for what it claims to enable. I’m stuck on 1.5.2 (which was released in January of 2017) until this is fixed. If I start running into problems with newer versions of Django I’m going to be forced to maintain a fork until any kind of fix for this is accepted.

@svanscho

svanscho commented Sep 27, 2019

We have implemented a workaround by using custom extensions, which aren't MIME-type guessed, so the gzip header is not set or sent. Luckily we could change the extensions since we control both the server and client behaviour. Still clearly broken behaviour indeed. Thanks for the feedback.

@mmacvicar-splunk

Rolling back to django-storages==1.5.2 worked around this issue for us.

@hanvic

hanvic commented Dec 4, 2019

I resolved it by changing the extension from .gz to .gzip when uploading the file.
.gz is the common extension name, but .gzip is also used for GZIP files.
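
For context, that rename sidesteps the detection because Python's mimetypes module only maps .gz (not .gzip) to the gzip encoding, so no ContentEncoding is guessed for the renamed file (a quick check, assuming the default mimetypes tables):

import mimetypes

# .gz is in the encodings map, so an encoding is guessed...
print(mimetypes.guess_type("myfile.csv.gz"))    # ('text/csv', 'gzip')
# ...but .gzip is not, so neither a type nor an encoding is guessed.
print(mimetypes.guess_type("myfile.csv.gzip"))  # (None, None)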

@jschneier
Owner

If an explicit ContentEncoding is set then gzip will be skipped as of #828.
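
For anyone landing here later, one way to make the ContentEncoding explicit is through the storage's object parameters; a sketch, assuming your django-storages version supports the AWS_S3_OBJECT_PARAMETERS setting (check the docs for the release you run):

# settings.py -- sketch only; verify AWS_S3_OBJECT_PARAMETERS exists in your version
AWS_S3_OBJECT_PARAMETERS = {
    # Declaring the encoding yourself (e.g. 'gzip' or 'identity') means the
    # automatic gzip/ContentEncoding handling is skipped, per the fix in #828.
    'ContentEncoding': 'identity',
}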
