Skip to content
This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

Rabin Fingerprinting #1

Closed
jbenet opened this issue Apr 15, 2015 · 10 comments
Closed

Rabin Fingerprinting #1

jbenet opened this issue Apr 15, 2015 · 10 comments

Comments

@jbenet
Copy link
Member

jbenet commented Apr 15, 2015

In use by LBFS and other systems.

@gwillen
Copy link

gwillen commented Apr 15, 2015

I looked at https://github.com/ipfs/go-ipfs/blob/master/importer/chunk/rabin.go -- it looks not-crazy, but I'll note from Wikipedia (in http://en.wikipedia.org/wiki/Rolling_hash): "The choice of a and n is critical to get good hashing; see linear congruential generator for more discussion."

Wikipedia's "a and n" are 'a' and 'MOD' here.

Looking at the LCG article, it suggests:

  • a - 1 is divisible by all prime factors of MOD
  • a - 1 is a multiple of 4 if MOD is a multiple of 4.

There is also a table of popular values in the LCG article; note that the column for 'c' isn't relevant to this application since the next byte being hashed takes the place of 'c'. (I'm not sure what effect that has on the applicability of the article, and the article on rolling hashes doesn't really explain further.)

@gwillen
Copy link

gwillen commented Apr 15, 2015

I'll also point out for the record that, since this hash is non-cryptographic, it's easy for an adversary to create a file whose breakpoints are in any place of the adversary's choice. Since you have min and max block size parameters, this probably just means the adversary gets suboptimal content deduplication, but keep it in mind.

@jbenet
Copy link
Member Author

jbenet commented Jul 4, 2015

https://github.com/restic/restic/tree/master/chunker <-- working implementation of rabin fingerprinting. maybe we can move it into https://github.com/jbenet/go-chunk

@whyrusleeping
Copy link
Member

@jbenet nice, how did you manage to find that?

@jbenet
Copy link
Member Author

jbenet commented Jul 4, 2015

@fd0 (author) came by #ipfs yesterday and pointed me to it (thanks!!). Backlog: https://botbot.me/freenode/ipfs/2015-07-04/?msg=43646408&page=3 (including link to hilarious comment :)

@fd0
Copy link

fd0 commented Jul 5, 2015

Are you interested if I split out the chunker into another package (like github.com/restic/chunker) that you could just use?

What average chunk size are you aiming for? It's about 1MiB for restic (a backup program).

Another thing: I found the Wikipedia article misleading, and nearly all implementations of rabin fingerprinting are (unfortunately, I must say) wrong. I read the original paper by Rabin and it's awesome, albeit hard to understand for people without a solid understanding of Algebra:
http://www.xmailserver.org/rabin.pdf

Other hints for optimization techniques can be found in the documentation:
https://godoc.org/github.com/restic/restic/chunker

@jbenet
Copy link
Member Author

jbenet commented Jul 5, 2015

Are you interested if I split out the chunker into another package (like github.com/restic/chunker) that you could just use?

Wither fine. We'll always vendor things anyway when we use them, so don't worry about changing urls, etc. :)

We may put it as part of https://github.com/jbenet/go-chunk if i make it

What average chunk size are you aiming for? It's about 1MiB for restic (a backup program).

I think LBFS used 2-64KiB. We'll use 2-256KiB I think.

nearly all implementations of rabin fingerprinting are (unfortunately, I must say) wrong

Even the LBFS implementation?

I read the original paper by Rabin and it's awesome, albeit hard to understand for people without a solid understanding of Algebra:

Thank you! really great paper, even though my algebra is not as solid.

@fd0
Copy link

fd0 commented Jul 5, 2015

Even the LBFS implementation?

No, the LBFS implementation was one of the few correct ones. Btw, I had trouble finding the source code for lbfs, so once I found it I converted and archived their CVS at https://github.com/fd0/lbfs

@jbenet
Copy link
Member Author

jbenet commented Feb 20, 2016

this is already in go-ipfs, thanks again @whyrusleeping and @fd0

@jbenet jbenet closed this as completed Feb 20, 2016
@meyerzinn
Copy link

Would a better implementation that takes advantage of SSE2 on Intel chips be considered for this? The current implementation takes a grueling hour to do some 50GB on a computer from DigitalOcean with 16GB ram and 8 CPUs. I feel like that's unnecessarily slow, but it's also possible I've been too used to the normal split-the-block method.

The other implementation I was referring to was https://github.com/kevinko/rabin.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants