Etag/hash calculation/comparison for multipart files is wrong #45

Closed
tybritten opened this issue Sep 25, 2020 · 9 comments

@tybritten
Contributor

Archive files over 5 GB are appropriately chunked by the archiver, but that breaks the hash comparison. ETag == MD5 hash only holds for single-part files. Above that, the ETag is an MD5 of the concatenated MD5s of the parts, with the number of parts appended at the end.
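For reference, a minimal sketch (my own illustration, not the archiver's code) of how S3 derives that multipart ETag:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// multipartETag mimics how S3 builds the ETag for a multipart upload:
// MD5 each part, MD5 the concatenation of those digests, then append
// "-<number of parts>". A single part is just a plain MD5.
func multipartETag(parts [][]byte) string {
	if len(parts) == 1 {
		return fmt.Sprintf("%x", md5.Sum(parts[0]))
	}
	var digests []byte
	for _, part := range parts {
		sum := md5.Sum(part)
		digests = append(digests, sum[:]...)
	}
	return fmt.Sprintf("%x-%d", md5.Sum(digests), len(parts))
}

func main() {
	parts := [][]byte{[]byte("part one"), []byte("part two")}
	fmt.Println(multipartETag(parts)) // prints "<hex digest>-2"
}
```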

Cool with a PR that fixes this?

@tybritten
Contributor Author

Actually, I'm not sure whether you'd rather calculate the ETag when it's >5 GB and store that in the hash field, or have the md5chksum in the metadata of all archives and check that instead?

@nicpottier
Contributor

Hrmm.. interesting. It's been a while since I looked at this, but from what I can tell the chunking only happens at upload time; it gets turned into one physical file. So I would have thought the ETag/MD5 would be the same.

Looking at https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html they say:
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.

So it may just be that we have to ignore MD5/ETags on multipart uploads and just "trust" that they're right based on, say, file size?

@tybritten
Contributor Author

tybritten commented Sep 28, 2020 via email

@nicpottier
Contributor

nicpottier commented Sep 28, 2020

Not sure I understand what you are asking. We save the MD5 either way as far as I can tell; it's just a matter of whether it matches S3, because S3 doesn't make ETag == MD5 for files uploaded in chunks.

From what I can tell this becomes a problem when deleting the rows (where we check that the archive is still valid)... maybe we just back down to file size in the cases where the size is over 1 gig and we uploaded in chunks?
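To make that fallback concrete, a rough sketch (the type, field, and function names here are hypothetical, not the archiver's):

```go
package main

import (
	"fmt"
	"strings"
)

// Archive is a stand-in for the archiver's own record type; the field
// names are illustrative only.
type Archive struct {
	Hash string // MD5 we stored when the archive was built
	Size int64  // size in bytes we stored when the archive was built
}

// archiveLooksValid falls back to a size comparison when the S3 ETag has
// a "-N" suffix, i.e. the object was uploaded in parts and the ETag is no
// longer a plain MD5 of the content.
func archiveLooksValid(a Archive, s3ETag string, s3Size int64) bool {
	etag := strings.Trim(s3ETag, `"`)
	if strings.Contains(etag, "-") {
		return a.Size == s3Size
	}
	return a.Hash == etag
}

func main() {
	a := Archive{Hash: "0cc175b9c0f1b6a831c399e269772661", Size: 6_442_450_944}
	fmt.Println(archiveLooksValid(a, `"abc123-2"`, 6_442_450_944)) // true: sizes match
}
```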

@tybritten
Contributor Author

What I'm asking is that I don't understand why this line:
https://github.com/nyaruka/rp-archiver/blob/master/s3.go#L91 is there for <5 GB but not for >5 GB. Also, it doesn't seem that the code ever checks that metadata.

I think the overall intent is right and good (hey, before we delete the records out of the DB that are supposedly in this archive file, let's make sure the file is right).

So what I'm proposing is either:

  1. make sure we set the md5chksum as on line 91 regardless of file size, and change the code to compare against that instead of the ETag
  2. replace the current MD5 hash calculation with an ETag-style one: it will still match the hash for single-part files, but will now also match multipart ETags

I'm more partial to the latter because it's more backward compatible. Basically we'd replace the md5.New() references with something like this: https://github.com/peak/s3hash/blob/master/s3hash.go
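To illustrate option 2, a hedged sketch in the spirit of peak/s3hash (the function, filename, and part size are illustrative; the part size would have to match whatever the uploader actually chunks with):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// partSize must match the uploader's part size, otherwise the computed
// value won't line up with the ETag S3 reports.
const partSize = 5 * 1024 * 1024 * 1024

// s3ETag hashes the reader in partSize chunks. A single chunk yields the
// plain MD5 (same as today); multiple chunks yield the MD5 of the
// concatenated part digests plus "-<part count>", matching multipart ETags.
func s3ETag(r io.Reader) (string, error) {
	var digests []byte
	parts := 0
	for {
		h := md5.New()
		n, err := io.CopyN(h, r, partSize)
		if n > 0 {
			digests = append(digests, h.Sum(nil)...)
			parts++
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	if parts == 0 {
		return fmt.Sprintf("%x", md5.Sum(nil)), nil // empty input
	}
	if parts == 1 {
		return fmt.Sprintf("%x", digests), nil
	}
	return fmt.Sprintf("%x-%d", md5.Sum(digests), parts), nil
}

func main() {
	f, err := os.Open("archive.jsonl.gz") // illustrative filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	etag, err := s3ETag(f)
	if err != nil {
		panic(err)
	}
	fmt.Println(etag)
}
```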

@nicpottier
Contributor

Ya, the metadata is just that, metadata. The code checks the ETag, which is the MD5 calculated by S3. You are right that it could be set for the multipart-upload case, but it doesn't really inform anything since it isn't calculated on the S3 side.

  1. is just checking that what you set is still what you set, which I don't find particularly interesting. I think size is at least a somewhat better check for things greater than 1 gig.
  2. Downloading the entire file from S3 just to recalculate the hash seems a bit expensive to me (in both computational and dollar costs). I think file size is a close enough approximation for those cases.

@tybritten
Contributor Author

That's a good point on option 1.

Option 2 doesn't require any re-downloading. It's just a go-forward change: put the right ETag value in the archive's hash field in the DB. For any files created before adding this, you'd have to do what I did, which is change the value in the DB to match the ETag. We only had 3-4 files cross the 5 GB mark this year (monthly rollups).

@tybritten
Contributor Author

Re-looking at this: shouldn't this be removed too?

return fmt.Errorf("archive too large, must be smaller than 5 gigs, build dailies if possible")

We're hitting this now on dailies from some of our peak days two months ago. Since the S3 uploader is already configured to chunk over 5 GB, that should be fine.
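For context, a hedged sketch of the kind of uploader configuration that handles this (aws-sdk-go v1's s3manager; not necessarily how rp-archiver wires it up, and the part size, filename, bucket, and key are illustrative):

```go
package main

import (
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())

	// The uploader splits anything larger than PartSize into a multipart
	// upload on its own, so a hard 5 GB size check isn't needed.
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 256 * 1024 * 1024 // 256 MB parts; illustrative value
		u.Concurrency = 3
	})

	f, err := os.Open("archive.jsonl.gz") // illustrative filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-archive-bucket"),          // illustrative bucket
		Key:    aws.String("archives/monthly.jsonl.gz"),  // illustrative key
		Body:   f,
	})
	if err != nil {
		panic(err)
	}
}
```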

@tybritten
Contributor Author

OK, we're also now getting some context deadline exceeded errors while building rollups. I'm going to PR some updates if this sounds fine:

  • remove the 5 GB limit and update the hashing as mentioned earlier to save the ETag if it's multipart
  • extend context timeouts for larger operations (rough sketch below)

Sound ok?
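A rough sketch of the second bullet, with an illustrative duration and a hypothetical uploadArchive helper:

```go
package main

import (
	"context"
	"time"
)

// uploadArchive stands in for whatever function actually does the upload.
func uploadArchive(ctx context.Context, path string) error {
	// ... actual upload elided ...
	return nil
}

func main() {
	// Give multi-gigabyte uploads a longer deadline; the duration here is
	// illustrative, not a measured number.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	_ = uploadArchive(ctx, "archive.jsonl.gz")
}
```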
