Etag/hash calculation/comparison for multipart files is wrong #45

Closed
tybritten opened this issue Sep 25, 2020 · 9 comments

@tybritten
Contributor

Archive files over 5 GB are appropriately chunked by the archiver, but that breaks the hash comparison. ETag == MD5 hash only holds for single-part files. Above that, the ETag is an MD5 of the concatenated MD5s of the parts, with the number of parts appended at the end.
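For reference, a minimal sketch (my own illustration, not the archiver's code) of how S3 derives that multipart ETag:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// multipartETag mimics how S3 builds the ETag for a multipart upload:
// MD5 each part, MD5 the concatenation of those digests, then append
// "-<number of parts>". A single part is just a plain MD5.
func multipartETag(parts [][]byte) string {
	if len(parts) == 1 {
		return fmt.Sprintf("%x", md5.Sum(parts[0]))
	}
	var digests []byte
	for _, part := range parts {
		sum := md5.Sum(part)
		digests = append(digests, sum[:]...)
	}
	return fmt.Sprintf("%x-%d", md5.Sum(digests), len(parts))
}

func main() {
	parts := [][]byte{[]byte("part one"), []byte("part two")}
	fmt.Println(multipartETag(parts)) // prints "<hex digest>-2"
}
```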

Cool with a PR that fixes this?

@tybritten
Contributor Author

Actually, I'm not sure whether you'd rather calculate the ETag when it's >5 GB and store that in the hash field, or have the md5chksum in the metadata of all archives and check that instead?

@nicpottier
Contributor

Hrmm.. interesting. It's been a while since I looked at this, but from what I can tell the chunking only happens at upload time; it gets turned into one physical file. So I would have thought the ETag/MD5 would be the same.

Looking at https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html they say:
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.

So it may just be that we have to ignore MD5/ETags on multipart uploads and just "trust" that they're right based on, say, file size?

@tybritten
Contributor Author

tybritten commented Sep 28, 2020 via email

@nicpottier
Contributor

nicpottier commented Sep 28, 2020

Not sure I understand what you are asking. We save the MD5 either way as far as I can tell; it's just a matter of whether it matches S3, because S3 doesn't make ETag == MD5 for files uploaded in chunks.

From what I can tell this becomes a problem when deleting the rows (where we check that the archive is still valid)... maybe we just back down to file size in the cases where the size is over 1 gig and we uploaded in chunks?
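To make that fallback concrete, a rough sketch (the type, field, and function names here are hypothetical, not the archiver's):

```go
package main

import (
	"fmt"
	"strings"
)

// Archive is a stand-in for the archiver's own record type; the field
// names are illustrative only.
type Archive struct {
	Hash string // MD5 we stored when the archive was built
	Size int64  // size in bytes we stored when the archive was built
}

// archiveLooksValid falls back to a size comparison when the S3 ETag has
// a "-N" suffix, i.e. the object was uploaded in parts and the ETag is no
// longer a plain MD5 of the content.
func archiveLooksValid(a Archive, s3ETag string, s3Size int64) bool {
	etag := strings.Trim(s3ETag, `"`)
	if strings.Contains(etag, "-") {
		return a.Size == s3Size
	}
	return a.Hash == etag
}

func main() {
	a := Archive{Hash: "0cc175b9c0f1b6a831c399e269772661", Size: 6_442_450_944}
	fmt.Println(archiveLooksValid(a, `"abc123-2"`, 6_442_450_944)) // true: sizes match
}
```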

@tybritten
Contributor Author

What I'm asking is that I don't understand why this line:
https://github.com/nyaruka/rp-archiver/blob/master/s3.go#L91 is there for <5 GB but not for >5 GB. Also, it doesn't seem that the code ever checks that metadata.

I think the overall intent is right and good (hey, before we delete the records out of the DB that are supposedly in this archive file, let's make sure the file is right).

So what I'm proposing is either:

  1. make sure we set the md5chksum as on line 91 regardless of file size, and change the code to compare against that instead of the ETag
  2. replace the current MD5 hash calculation with an ETag-style one: it will still match the hash for single-part files, but will now also match multipart ETags

I'm more partial to the latter because it's more backward compatible. Basically we'd replace the md5.New() references with something like this: https://github.com/peak/s3hash/blob/master/s3hash.go
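To illustrate option 2, a hedged sketch in the spirit of peak/s3hash (the function, filename, and part size are illustrative; the part size would have to match whatever the uploader actually chunks with):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// partSize must match the uploader's part size, otherwise the computed
// value won't line up with the ETag S3 reports.
const partSize = 5 * 1024 * 1024 * 1024

// s3ETag hashes the reader in partSize chunks. A single chunk yields the
// plain MD5 (same as today); multiple chunks yield the MD5 of the
// concatenated part digests plus "-<part count>", matching multipart ETags.
func s3ETag(r io.Reader) (string, error) {
	var digests []byte
	parts := 0
	for {
		h := md5.New()
		n, err := io.CopyN(h, r, partSize)
		if n > 0 {
			digests = append(digests, h.Sum(nil)...)
			parts++
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	if parts == 0 {
		return fmt.Sprintf("%x", md5.Sum(nil)), nil // empty input
	}
	if parts == 1 {
		return fmt.Sprintf("%x", digests), nil
	}
	return fmt.Sprintf("%x-%d", md5.Sum(digests), parts), nil
}

func main() {
	f, err := os.Open("archive.jsonl.gz") // illustrative filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	etag, err := s3ETag(f)
	if err != nil {
		panic(err)
	}
	fmt.Println(etag)
}
```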

@nicpottier
Contributor

Ya, the metadata is just that, metadata. The code checks the ETag, which is the MD5 calculated by S3. You are right that it could be set for the multipart-upload case, but it doesn't really inform anything since it isn't calculated on the S3 side.

  1. is just checking that what you set is still what you set, which I don't find particularly interesting. I think size is at least a somewhat better check for things greater than 1 gig.
  2. Downloading the entire file from S3 just to recalculate the hash seems a bit expensive to me (in both computational and dollar costs). I think file size is a close enough approximation for those cases.

@tybritten
Contributor Author

That's a good point on option 1.

Option 2 doesn't require any re-downloading. It's just a go-forward change: put the right ETag value in the archive's hash field in the DB. For any files created before adding this, you'd have to do what I did, which is change the value in the DB to match the ETag. We only had 3-4 files cross the 5 GB mark this year (monthly rollups).

@tybritten
Contributor Author

Re-looking at this: shouldn't this be removed too?

return fmt.Errorf("archive too large, must be smaller than 5 gigs, build dailies if possible")

We're hitting this now on dailies from some of our peak days two months ago. Since the S3 uploader is already configured to chunk over 5 GB, that should be fine.
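For context, a hedged sketch of the kind of uploader configuration that handles this (aws-sdk-go v1's s3manager; not necessarily how rp-archiver wires it up, and the part size, filename, bucket, and key are illustrative):

```go
package main

import (
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())

	// The uploader splits anything larger than PartSize into a multipart
	// upload on its own, so a hard 5 GB size check isn't needed.
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 256 * 1024 * 1024 // 256 MB parts; illustrative value
		u.Concurrency = 3
	})

	f, err := os.Open("archive.jsonl.gz") // illustrative filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-archive-bucket"),          // illustrative bucket
		Key:    aws.String("archives/monthly.jsonl.gz"),  // illustrative key
		Body:   f,
	})
	if err != nil {
		panic(err)
	}
}
```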

@tybritten
Contributor Author

OK, we're also now getting some context deadline exceeded errors while building rollups. I'm going to PR some updates if this sounds fine:

  • remove the 5 GB limit and update the hashing as mentioned earlier to save the ETag if it's multipart
  • extend context timeouts for larger operations (rough sketch below)

Sound ok?
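A rough sketch of the second bullet, with an illustrative duration and a hypothetical uploadArchive helper:

```go
package main

import (
	"context"
	"time"
)

// uploadArchive stands in for whatever function actually does the upload.
func uploadArchive(ctx context.Context, path string) error {
	// ... actual upload elided ...
	return nil
}

func main() {
	// Give multi-gigabyte uploads a longer deadline; the duration here is
	// illustrative, not a measured number.
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	_ = uploadArchive(ctx, "archive.jsonl.gz")
}
```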
