Uploading .gz file to S3 results in file being stored uncompressed #451
Comments
We can certainly discuss this. I added the configuration option in #453 to allow someone to opt back into the broken behavior if they were dependent on it. I don't think it makes sense to require a setting just to get the correct behavior.
@jschneier I did a little more research to see if I could understand the problem being solved. If needed, I could work on a change that actually solves the problem of how to serve gzip-encoded assets. I found a post that describes serving gzip-encoded assets by setting Content-Encoding in the object's metadata in S3: https://zanon.io/posts/serving-gzipped-files-in-amazon-s3-cloudfront

The correct behavior for serving static assets is to store the compressed asset and set Content-Encoding in the object's metadata so the header is returned on download. Anyone who was trying to compress their assets probably missed that it wasn't really working, because when they referenced their assets they came down uncompressed with no content encoding instead of compressed with Content-Encoding: gzip. If they never look at the headers and body directly, the two states are indistinguishable at the browser layer.

The current behavior breaks my use case of uploading .gz files by inserting incorrect headers, and it doesn't even work as intended for the use case of serving compressed static assets that the original change claims to handle.
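For reference, here is a minimal sketch of the approach the linked post describes, using plain boto3 (the bucket name, key, and file are hypothetical): compress the asset yourself, store the compressed bytes, and record Content-Encoding in the object metadata.

```python
import gzip

import boto3

s3 = boto3.client("s3")

# Compress the asset ourselves so we control exactly what is stored.
with open("app.js", "rb") as f:
    compressed = gzip.compress(f.read())

# Store the gzipped bytes and record the encoding in the object
# metadata; S3 then returns Content-Encoding: gzip on download and
# browsers decompress transparently.
s3.put_object(
    Bucket="my-static-bucket",        # hypothetical bucket
    Key="static/app.js",
    Body=compressed,
    ContentType="application/javascript",
    ContentEncoding="gzip",
)
```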
@skruger said: "If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called." It seems to me it doesn't just attempt it, it actually does compress it again: with AWS_IS_GZIPPED set, the result is that my uploaded file is gzipped twice while keeping its original filename.

So I support #453 or something similar being merged. It should be possible to upload an already-gzipped file and get the same bytes back.
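To make the double-compression concrete, here is a small standalone demonstration (plain Python, independent of django-storages):

```python
import gzip

payload = b"line 1\nline 2\n" * 1000

once = gzip.compress(payload)    # the .gz file the user uploads
twice = gzip.compress(once)      # what a second compression pass yields

# One decompression step only gets back to the .gz bytes; the gzip
# magic number (0x1f 0x8b) is still at the front.
step = gzip.decompress(twice)
print(step[:2] == b"\x1f\x8b")           # True: still gzip data
print(gzip.decompress(step) == payload)  # True: a second pass is needed
```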
The code in question is the compression branch of the S3 backend's _save() path (see _compress_content(), quoted above).

So if the file is fetched over HTTP, the double compression can go unnoticed, because the Content-Encoding header makes the client transparently strip one layer. But if you are having the files scanned on S3 by, say, AWS Glue, then it will be totally blocked by the double compression. So to be correct, the first branch of that logic should leave content that is already gzipped untouched.
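A minimal sketch of that idea (my own illustration, not django-storages' actual code): check for the gzip magic number before compressing, and only set the header when the backend itself did the compressing.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def maybe_compress(content: bytes):
    """Return (bytes_to_upload, content_encoding_or_None).

    Hypothetical helper: skip compression when the payload is
    already a gzip stream, and only report a Content-Encoding
    when we compressed the content ourselves.
    """
    if content[:2] == GZIP_MAGIC:
        # Already gzipped (e.g. a user-supplied .gz file): upload
        # unaltered and set no Content-Encoding header.
        return content, None
    return gzip.compress(content), "gzip"
```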
To be clear, I am not advocating that the extension alone should decide whether a file is treated as gzipped. It would be much better if the extension were not what triggers the Content-Encoding header in the first place.
When uploading a compressed file to S3, don't add the ContentEncoding header. These changes allow .gz files to be uploaded and downloaded without modification. See #615 for my proposed fixes.
We had the same issue. I don't understand why you would set the header on a gzipped or tar-gzipped file in the first place: you save a file, and it comes out decompressed on the other side. Either django-storages compresses the content itself and sets the header accordingly, or it uploads the file unaltered and doesn't set the header. Am I missing something? I do get that the compression comes in handy for large text-based files, but again, the header should only be set when the content is zipped by django-storages itself.
@svanscho I'm a little lost myself as to why this clearly broken behavior is accepted as the default when it doesn't even work for what it claims to enable. I'm stuck on 1.5.2 (released in January 2017) until this is fixed. If I start running into problems with newer versions of Django, I'll be forced to maintain a fork until some kind of fix for this is accepted.
We have implemented a workaround by using custom extensions, which aren't MIME-type guessed, and hence the gzip header is neither set nor sent. Luckily we could change the extensions, as we control both the server and client behaviour. Still clearly broken behaviour indeed. Thanks for the feedback.
Rolling back to django-storages==1.5.2 worked around this issue for us.
I resolved it by changing the extension from .gz to .gzip when uploading the file.
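Both workarounds above (custom extensions, and renaming .gz to .gzip) work for the same reason: Python's mimetypes module only associates the gzip encoding with extensions it knows about. A quick check:

```python
import mimetypes

# .gz is in mimetypes' encodings_map, so guess_type() reports an
# encoding of "gzip" -- this is what triggers the header.
print(mimetypes.guess_type("logs.txt.gz"))    # ('text/plain', 'gzip')

# Unknown extensions have no registered encoding, so nothing is
# guessed and no ContentEncoding gets set.
print(mimetypes.guess_type("logs.txt.gzip"))  # (None, None)
print(mimetypes.guess_type("logs.txt.mygz"))  # (None, None)
```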
If an explicit Content-Encoding is set by the caller, it should be honored rather than replaced by the guessed one.
Original issue description (@skruger):

I have compressed log files I'm trying to store in S3 using the django-storages S3 backend, but when I inspect the files I discover that S3 stored them in their uncompressed form. I did some digging and found that django-storages correctly identifies my files as gzipped, but then passes that along as a ContentEncoding argument, so S3 interprets the gzip as an HTTP transfer encoding and the data is decompressed at the HTTP layer at PUT time.
Content encoding detection:
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L421-L433
In the _save() method, the encoding is detected with mimetypes.guess_type(), which in my case results in 'gzip'.
If I were to set AWS_IS_GZIPPED and add my gzip file type to GZIP_CONTENT_TYPES then it is going to attempt to gzip my data a second time for transfer when self._compress_content() is called. This feels undesirable with 100 MB compressed log files.
The bucket object obj that is created is used to call upload_fileobj() with the ContentEncoding parameter, which eventually calls Client.put_object() with ContentEncoding set to gzip: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
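To confirm the reported symptom on an affected object, one can compare the stored metadata with the first bytes of the body (a debugging sketch; the bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key for an uploaded .gz log file.
head = s3.head_object(Bucket="my-log-bucket", Key="logs/app.log.gz")
print(head.get("ContentEncoding"))   # 'gzip' -- the guessed header

obj = s3.get_object(Bucket="my-log-bucket", Key="logs/app.log.gz")
first_bytes = obj["Body"].read(2)

# A real gzip stream starts with 0x1f 0x8b; if the bytes stored in
# S3 don't, the object was decompressed somewhere along the way.
print(first_bytes == b"\x1f\x8b")
```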
I can create a pull request that would fix this for my use case, but I would like to understand what is expected and which option is least likely to break other people's use cases.
I can think of a few options for fixing this; if you have any thoughts on what kind of approach you would like to see taken, I can get a pull request submitted for review.