Etag/hash calculation/comparison for multipart files is wrong #45
Comments
Actually, I'm not sure if you'd rather calculate the ETag if it's >5GB and put that in the hash, or have the md5cksum be in the metadata of all archives and check that instead?
Hrmm.. interesting. It's been a while since I looked at this, but from what I can tell the chunking is just at upload time and it gets turned into one physical file, so I would have thought the ETag/MD5 would be the same. Looking at https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html they say: "If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption." So it may just be that we have to ignore MD5/ETags on multiparts and just "trust" it is right based on maybe file size?
No, there are two options: we can calculate the ETag locally and save that as the hash in the DB, or we can ignore the ETag and save the hash as added metadata.
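(For reference, a minimal Go sketch of the first option, i.e. reproducing the multipart ETag locally. It is illustrative only and assumes the part size matches whatever the uploader actually used; if the part boundaries differ, the computed value will not match S3's. The filename and the 5GB part size are placeholders.)

```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// partSize must match the part size the uploader used; 5GB here is an assumption.
const partSize = 5 * 1024 * 1024 * 1024

// multipartETag reproduces an S3 multipart ETag for a local file: the MD5 of
// the concatenated per-part MD5 digests, suffixed with "-<number of parts>".
func multipartETag(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	var concat []byte // raw 16-byte MD5 digests of each part, concatenated
	parts := 0
	for {
		h := md5.New()
		n, err := io.Copy(h, io.LimitReader(f, partSize))
		if err != nil {
			return "", err
		}
		if n == 0 {
			break
		}
		concat = append(concat, h.Sum(nil)...)
		parts++
	}

	switch parts {
	case 0:
		return fmt.Sprintf("%x", md5.Sum(nil)), nil // empty file
	case 1:
		// single-part upload: the ETag is just the plain hex MD5 of the content
		return fmt.Sprintf("%x", concat), nil
	}
	return fmt.Sprintf("%x-%d", md5.Sum(concat), parts), nil
}

func main() {
	etag, err := multipartETag("archive.jsonl.gz") // placeholder filename
	fmt.Println(etag, err)
}
```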
What's the reason for storing the hash as metadata on the files below 5GB (as base64) but not on files over 5GB?
Not sure I understand what you are asking. We save the MD5 either way as far as I can tell; it's just a matter of whether it matches S3, because S3 doesn't make ETag == MD5 for files uploaded as chunks. From what I can tell this becomes a problem when deleting the rows (where we check that the archive is still valid). Maybe we just back down to file size in the cases where the size is over 1 gig and we uploaded in chunks?
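(A sketch of that size-only fallback, assuming the aws-sdk-go v1 S3 client; the bucket, key, and expected size in the example are placeholders, not the archiver's actual configuration.)

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// validateBySize only confirms the object exists and has the expected size,
// for archives uploaded in chunks where the ETag is not a plain MD5.
func validateBySize(svc *s3.S3, bucket, key string, expectedSize int64) (bool, error) {
	head, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return false, err
	}
	return aws.Int64Value(head.ContentLength) == expectedSize, nil
}

func main() {
	sess := session.Must(session.NewSession())
	ok, err := validateBySize(s3.New(sess), "my-archive-bucket", "archives/daily.jsonl.gz", 6500000000)
	fmt.Println(ok, err)
}
```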
What I'm asking is I don't understand why this line: I think the overall intent is right and good (hey, before we delete the records out of the DB that are supposedly in this archive file, let's make sure the file is right). So what I'm proposing is either:
I'm more partial to the latter because it's more backward compatible. Basically we'd replace the
Ya, the metadata is just that, metadata. The code checks the ETag, which is the MD5 as calculated by S3. You are right that the metadata hash could be set for the multi-upload case, but it doesn't really inform anything since it isn't calculated on the S3 side.
That's a good point on option 1. Option 2 doesn't require any re-downloading; it's just go-forward, putting the right ETag value in the archive hash field in the DB. For any files created before adding this, you'd have to do what I did, which is to change the value in the DB to match the ETag. We only had 3-4 files this year cross the 5GB mark (monthly rollups).
Re-looking at this: shouldn't this be removed too? Line 711 in 52e49ad
We're hitting this now on dailies from some of our peak days two months ago. Since the S3 uploader is already configured to chunk over 5GB, that should be fine.
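(For context, that kind of chunking is typically driven by the uploader's part size. A sketch assuming the aws-sdk-go s3manager uploader, with illustrative values only; this is not necessarily how the archiver wires it up.)

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())

	// Files larger than PartSize are uploaded as multiple parts, which is why
	// the resulting ETag is no longer a plain MD5. 5GB is illustrative.
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 5 * 1024 * 1024 * 1024 // bytes per part
		u.Concurrency = 2                   // parts uploaded in parallel
	})
	_ = uploader
}
```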
OK. We're also now getting some context deadline exceeded errors while building rollups. I'm going to PR some updates if this sounds fine:
Sound OK?
Archive files over 5GB are appropriately chunked by the archiver, but this breaks the hash comparison. ETag == MD5 hash is only true for single-part files; above that, the ETag is a hash of the hashes of the parts, with the number of parts appended at the end.
Cool for a PR that will fix this?
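(A small sketch of how a validity check could tell the two cases apart based only on the ETag's shape; strictly illustrative, the sample ETag values are made up.)

```go
package main

import (
	"fmt"
	"strings"
)

// isMultipartETag reports whether an S3 ETag came from a multipart upload:
// multipart ETags carry a "-<part count>" suffix, while single-part ETags
// are a plain 32-character hex MD5.
func isMultipartETag(etag string) bool {
	return strings.Contains(strings.Trim(etag, `"`), "-")
}

func main() {
	fmt.Println(isMultipartETag(`"5d41402abc4b2a76b9719d911017c592"`))   // false: plain MD5
	fmt.Println(isMultipartETag(`"ceb8853ddc5086cc4ab9e149f8f09c88-2"`)) // true: 2-part upload
}
```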