Expand rolling chunker documentation #8952
@@ -88,6 +88,31 @@
Buzhash or Rabin fingerprint chunker for content defined chunking by
specifying buzhash or rabin-[min]-[avg]-[max] (where min/avg/max refer
to the desired chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

If you just specify 'rabin' without parameters the default sizes are
'rabin-87381-262144-393216' and the parameters used for buzhash are
always min: 131072 and max: 524288 with a mask of 65536. Buzhash
is the significantly lighter algorithm with comparable performance.

Review comment: Note that these default sizes are far from optimal for
deduplication or speed according to my testing at
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst. The
real average block size is actually min+avg (ignoring minimal max
truncation contributions), so I prefer to use the term "target" or "tgt"
instead of "avg". For any rollsum using the normal chunker (i.e. no fancy
dynamic match like FastCDC or truncation handler like Regression Chunking)
the optimal sizes for both deduplication and speed (from cut-point-skipping)
for a given average chunk size is with [...]. The code currently also seems
to support specifying just [...]. For Buzhash the [...]. So the current
defaults for rabin give an average block size of 341K, and buzhash (due to
the bug mentioned above) of 128K+2. So better defaults for an average block
size of 256K would be rabin-131071-131072-1048576, and for buzhash would be
min=131072, max=1048576, mask=(1<<17)-1. I know changing these defaults now
will break deduplication against content already added with the old
defaults, but given buzhash has effectively been broken by a bug, fixing it
is probably fine. Fixing rabin is more contentious, but you could make [...]

Reply: Yeah, I was wondering what the -1 on the bitshift is supposed to do
as well... Buzhash isn't that terrible right now... It actually performed
as well as rabin overall in my tests: ipfs/go-ipfs-chunker#31 (comment).
But maybe @Kubuxu can give this a look? :)

Follow-up: Actually, it seems (see https://go.dev/ref/spec#Operators) [...].
I was misled by incorrect documentation here:
https://www.tutorialspoint.com/go/go_operators_precedence.htm. So it seems
the default buzhash mask is actually 131071, not 65536 as you have
documented here.
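A minimal sketch (not from the PR) of the two arithmetic points raised in the review discussion above: in Go the shift operator binds tighter than subtraction, so a mask written `1<<17 - 1` is a 17-bit mask of 131071, and the first comment's estimate of the real average block size is roughly min + avg:

```go
package main

import "fmt"

func main() {
	// In Go, << has higher precedence than -, so this parses as
	// (1 << 17) - 1, i.e. a 17-bit mask, not 1 << 16.
	mask := 1<<17 - 1
	fmt.Println(mask) // 131071

	// Per the review comment, a rolling chunker only starts searching
	// for a cut point after 'min' bytes, so the real average block size
	// is roughly min + avg. With the rabin defaults
	// 'rabin-87381-262144-393216':
	fmt.Println(87381 + 262144) // 349525 bytes, ~341 KiB
}
```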

The parameters chosen won't be stored in the resulting data; the chunker
only determines the cut-marks, so no special support is needed in the
reading client for any of these. With a rolling chunker your data will be
stored with variable block sizes within a file, while with a fixed
'size-*' chunker all blocks will have the same length.
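To make the fixed-size case concrete, here is a small illustrative sketch (not kubo's implementation) of what a 'size-N' chunker does; a rolling chunker would instead place cuts at content-defined positions, so block lengths vary:

```go
package main

import "fmt"

// splitFixed cuts data into blocks of exactly size bytes (the last block
// may be shorter), mimicking a 'size-N' chunker.
func splitFixed(data []byte, size int) [][]byte {
	var blocks [][]byte
	for len(data) > size {
		blocks = append(blocks, data[:size])
		data = data[size:]
	}
	if len(data) > 0 {
		blocks = append(blocks, data)
	}
	return blocks
}

func main() {
	// A 10-byte input with size-4 chunking yields blocks of 4, 4 and 2 bytes.
	for _, b := range splitFixed(make([]byte, 10), 4) {
		fmt.Println(len(b))
	}
}
```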

Choosing a smaller chunk size might lead to better deduplication results
if you store similar files, at the cost of additional storage overhead,
and may result in slower transmission because more objects need to be
transmitted.
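As a rough illustration of that object-count tradeoff (the chunk sizes below are hypothetical examples, not kubo defaults):

```go
package main

import "fmt"

func main() {
	const fileSize = 1 << 30 // a 1 GiB file

	// Fewer, larger chunks mean fewer objects to store and transfer...
	fmt.Println(fileSize / (256 << 10)) // 4096 blocks at 256 KiB

	// ...while smaller chunks may deduplicate better but multiply the
	// number of objects and their per-block overhead.
	fmt.Println(fileSize / (16 << 10)) // 65536 blocks at 16 KiB
}
```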

Files stored with different chunkers in IPFS will NOT deduplicate against
each other, as the per-chunk hashes will be different, resulting in
different CIDs for each chunk and for the resulting file.

If you store large files that are likely to contain duplicate data in
plaintext form - i.e. not compressed - it is recommended to use a
rolling chunker.

The following examples use very small byte sizes to demonstrate the
properties of the different chunkers on a small file. You'll likely want
to use chunk sizes around 1024 times larger for most files.

Review comment: Specifying the default rabin and buzhash parameters here
seems good to me to help people understand what's going on.