buzhash: reduce target size and cutoff size #31
Conversation
Thank you for submitting this PR!
Getting other community members to do a review would be a great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate the priority and status of your contribution.
If we're going to change this, we need to make it configurable (like rabin and size) so users can restore the defaults.
Hey @Stebalien, the previous settings are way too large to do sensible deduplication. There's a study linked in ipfs/kubo#7966 which states that 64 KByte is a good choice for deduplication when compression is applied too. But that's for highly mixed data and with very limited memory per terabyte of storage. We don't have such restrictions: users can choose buzhash for data that chunks well, like ISO images, VM images, and uncompressed tar archives, as well as HTML, CSS, and other text-based files. With the new buzhash size those should deduplicate very well. For any compressed or encrypted data format deduplication isn't possible anyway, so that data can be stored in the default 256 K blocks and doesn't need the overhead of being stored in smaller blocks.
I agree we should change the defaults (although please also consider that we're trading off per-block overhead in terms of access times, network overhead, latencies, etc.). However, if we're going to change the defaults, I'd like to let users revert to the old defaults (e.g., to ensure hashes converge with previously imported data).
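To make that concrete, here is a small usage sketch against the existing `FromString` parser in this package. Today only the plain `buzhash` spec exists; the parameterized `buzhash-<min>-<mask>-<max>` spec mentioned in the closing comment is hypothetical, purely to illustrate what "restore the old defaults" could look like, and `example.iso` is a made-up input file.

```go
package main

import (
	"fmt"
	"os"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

func main() {
	// Hypothetical input file, just for illustration.
	f, err := os.Open("example.iso")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Today: plain "buzhash" uses whatever min/mask/max constants are compiled
	// in, so changing the defaults silently changes the hashes users get.
	spl, err := chunk.FromString(f, "buzhash")
	if err != nil {
		panic(err)
	}
	blk, err := spl.NextBytes()
	if err != nil {
		panic(err)
	}
	fmt.Printf("first chunk: %d bytes\n", len(blk))

	// A hypothetical parameterized spec, e.g. "buzhash-131072-131071-524288"
	// (min 128K, mask 1<<17-1, max 512K from the legacy constants in this PR),
	// would let users pin the old behaviour and keep reproducing old CIDs
	// after the defaults change.
}
```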
I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, ISO images, etc. properly deduplicate.

Latency on disk shouldn't be an issue: an average of 8K still performs well on an SSD - that's what a database typically uses as its block size, so operating systems are already optimized for requests of that size. More deduplication also means we can cache the same amount of data more efficiently, which might even improve performance, depending on the amount of duplicate data. We also reduce the transfer size by being able to serve unchanged data from the local disk.

For large file transfers, we might want to tweak the network settings to think in terms of data size rather than individual blocks, so we would simply fetch more blocks covering the same amount of data. The target size of this new setting is 8K because, as mentioned above, that's what databases use, which means the operating system is optimized for this request size. When the blocks are stored in a database they are stored sequentially anyway, so there's no real loss in write speed. With flatfs we might hit some barriers, since we probably sync after each block has been written, so we might want to recommend turning off sync if there are performance issues.

Regarding old data: buzhash is fairly new in ipfs and I don't think there are many users, especially since rabin gives a bigger deduplication ratio at the current settings. I don't think we need backward compatibility here.
> I mean, sure, there's some overhead involved - but on small files, this is probably not significant. It's more important that many different versions of SQL dumps, VM images, ISO images, etc. properly deduplicate.
"Probably" only matters if you have enough information to make a guess. Please test `ipfs cat` (downloading from a remote host) on files with small chunks versus files with large chunks over a moderate-latency (100 ms) connection; you'll notice a large difference.
1. Deduplication is not a given and killing perf for encrypted files just to help with files that deduplicate well is not an option.
2. Prefetching is _hard_. The smaller the blocks, the worse we are at prefetching. `ipfs cat` currently prefetches 5-15 blocks ahead with bitswap. I welcome patches to improve this.
Setting the target block size to 8KiB will almost certainly reduce throughput on any connection with reasonable latency.
In terms of defaults, people _depend_ on being able to reproduce old hashes/blocks, both to deduplicate against existing content and for security/reproducibility. Given that making this configurable at runtime is so easy, I'm not going to lower the bar here.
I encourage you to read through old issues mentioning "block sizes and chunking" for more context on this issue.
@Stebalien wrote:
Fair enough, but I think we can improve the performance, since it's just about latency. It's basically the same issue as TCP with a too-small receive window: we just have to send more data "blindly". There's no reason it has to be (much) slower than larger chunks.
True. But there's always the option to use just 256K blocks (or even larger ones) for these types of data. There's basically no point in using a rolling hash for such data to begin with. So maybe fixed 256K chunks should be the default, and when we're dealing with files we could switch the chunker based on the file's mime type if no chunker is set manually by the user?
Okay.
Will do.
@Stebalien I'll first verify my assumption that such small blocks are necessary for good deduplication with some sample data before I proceed. I just don't trust this Microsoft paper and still think 4-16 K is pretty much what we need. But will we actually "snap" to 4K blocks, or 512 bytes, or will it output a random byte length?
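For reference on the "snap" question, here is a simplified sketch of how a buzhash-style chunker picks boundaries; it is not the exact code in this package, just an illustration under the usual assumptions (32-byte window, 32-bit state, boundary where the masked hash is zero). The rolling hash is tested on every byte after the minimum size, so chunk lengths come out as arbitrary byte counts between min and max, not multiples of 4K or 512 bytes.

```go
package sketch

import "math/bits"

// findBoundary is a simplified sketch of buzhash-style boundary selection,
// not the exact implementation in this package. A 32-byte rolling hash is
// tested against a bit mask on every byte after min, so the returned chunk
// length is an arbitrary byte count in [min, max].
func findBoundary(buf []byte, min, max int, mask uint32, table *[256]uint32) int {
	const window = 32
	if min < window {
		min = window
	}
	if len(buf) <= min {
		return len(buf)
	}
	if len(buf) > max {
		buf = buf[:max]
	}

	// Prime the hash with the 32 bytes ending at the min offset.
	var state uint32
	for _, b := range buf[min-window : min] {
		state = bits.RotateLeft32(state, 1) ^ table[b]
	}

	// Slide one byte at a time: rotate, drop the outgoing byte, add the
	// incoming one (rotating a 32-bit value by 32 is a no-op, so the
	// outgoing contribution is XORed out directly).
	for i := min; i < len(buf); i++ {
		state = bits.RotateLeft32(state, 1) ^ table[buf[i-window]] ^ table[buf[i]]
		if state&mask == 0 {
			return i + 1 // cut here: any byte offset can win
		}
	}
	return len(buf) // no match before max (or end of data)
}
```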
@Stebalien I'm finally done - this took longer than expected. TL;DR:
LMK if there's any use-case I missed that would greatly benefit from deduplication on the block level. :)
ipfs rolling hash test - Sheet1.pdf /ipfs/bafykbzacedsteoatejnapzdhnhoaiyc2y5wsypl6jsif4pf5uwjsztdsk3owa
Ah, and fixed-size 8K and 16K are actually pretty great for databases. So in theory we could recommend those for maximum deduplication of database data. They even outperform […]
Awesome work! Were you able to test transfer times as well?
This would have taken too long for all the options I've tested. I can run this test, but only for a limited number of cases. Do you agree with my candidates?
@Stebalien, I forgot the ping :)
@Stebalien here's my pin time test: ipfs rolling hash pin-time - Sheet1.pdf /ipfs/QmPMsvgjoQHQPvw6vWjgHRvC3YddErgAacg9iGu4YLzQsp

Note that this is the worst-case scenario, since this connection is plagued with bufferbloat. Additionally, as the corpus is only one big file, there are only large blocks; in my first test I also had folders with many small files, in which case the results would be closer. I only tested TCP, to get more reliable results, as UDP is sometimes deprioritized by providers.

Result for me: go with 12k-24k-48k, as it's a fair compromise between small chunks and overhead. The performance penalty for transfers has nothing to do with processing power, just the missing handling of longer latencies. So it should be fixable.
Note: I agree it's "fixable" - that's not the problem. The problem is that it needs to be fixed first, before we can start using it; otherwise we just degrade perf for everyone in the meantime. Given the numbers I'm seeing, it sounds like […]
(but I'd still need the answers to the rest of those questions to tell)
@Stebalien wrote:
I've tested the latency with the git commit which is in the profile of the node on the right.
The latency of the connection is in the document, once idle and once while downloading (with 256K blocks - which did hit the ceiling on this connection).
The file size is 742 MB; the average block size should determine how many blocks each chunker created, but I can look that up if the exact number is important.
I've used […]
Okay
Well, I think we shouldn't make this configurable. A new and an old buzhash should be enough. If everyone experiments with the settings, we end up with the same situation as with rabin, where basically every dataset has different chunker settings - which doesn't help network-wide deduplication. I actually think we should try to determine the mime type of the files being added and select static chunks or the smaller buzhash accordingly for the user. So compressed/encrypted or otherwise non-deduplicatable files get large chunks with good performance, and if we encounter HTML/CSS/JS/SQL/XML/etc. we switch to buzhash. This could be named "adaptive", for example.
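As a rough illustration of what such an "adaptive" selection could look like, here is a sketch of an assumed helper (not part of go-ipfs-chunker or this PR) that sniffs the input with the standard library's `http.DetectContentType` and falls back to fixed-size chunks for data that is unlikely to deduplicate:

```go
package adaptive

import (
	"bytes"
	"io"
	"net/http"
	"strings"

	chunk "github.com/ipfs/go-ipfs-chunker"
)

// NewAdaptiveSplitter is a hypothetical helper: it sniffs the first 512 bytes
// of the input and picks buzhash for text-like data that is likely to
// deduplicate, and plain fixed-size 256K chunks for everything else
// (compressed or encrypted formats gain nothing from a rolling hash).
func NewAdaptiveSplitter(r io.Reader) (chunk.Splitter, error) {
	head := make([]byte, 512)
	n, err := io.ReadFull(r, head)
	if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
		return nil, err
	}
	head = head[:n]

	// Stitch the sniffed prefix back in front of the remaining stream.
	full := io.MultiReader(bytes.NewReader(head), r)

	mime := http.DetectContentType(head)
	switch {
	case strings.HasPrefix(mime, "text/"),
		strings.Contains(mime, "xml"),
		strings.Contains(mime, "json"),
		strings.Contains(mime, "javascript"):
		return chunk.NewBuzhash(full), nil
	default:
		return chunk.NewSizeSplitter(full, chunk.DefaultBlockSize), nil
	}
}
```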
To continue this PR I think it makes sense to change […]. How does this sound?
@Stebalien wrote:
256k: 3136 blocks
Note that I've done extensive testing of chunker sizes here: https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst In my tests the optimal min-tgt-max sizes for a given avg chunk size are […], which does NOT match what you are proposing. I note you didn't test using settings like this... it would be interesting to see how it compares.
Some observations about the rabin settings you were using in your testing in https://github.com/ipfs/go-ipfs-chunker/files/6819330/ipfs.rolling.hash.test.-.Sheet1.pdf
If you want an average block size of 32K, I'd recommend 16K-16K-128K, and for an average block size of 64K (which is a sweet spot identified by Microsoft in their chunking tests) I'd go with 32K-32K-256K. Though in both cases you can set the max higher if you want. If you want to compare against the default 256K average chunk size, I'd use 128K-128K-1M.
```go
buzMaxDefault = 64 << 10  // 64 KiB: new default cutoff
buzMinLegacy  = 128 << 10 // 128 KiB: old minimum
buzMaxLegacy  = 512 << 10 // 512 KiB: old cutoff
buzMask       = 1<<17 - 1 // 17-bit mask, i.e. a 128 KiB target
```
Note that buzMask affects the "target" size, which is the average distance after the min size where chunk boundaries will be found. This setting gives tgt=2^17 or 128K. This means the average chunk boundary will be at 128K + 16K = 144K. Since this is larger than your max of 64K, this means most chunks will be truncated to 64K.
You also need to set `buzMaskLegacy = 1<<17 - 1` and change `buzMaskDefault = 1<<14 - 1` for tgt=16K, which gives a default average block size of 32K. I'd also bump up `buzMaxDefault = 128 << 10`.
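A small back-of-the-envelope sketch (not part of the patch) of the arithmetic being described here, using the values from the diff above:

```go
package main

import "fmt"

func main() {
	// Values from the diff under review.
	const (
		buzMin  = 16 << 10  // assumed new minimum discussed in this thread
		buzMask = 1<<17 - 1 // 17 bits set => target of 2^17 = 128 KiB
		buzMax  = 64 << 10  // new default cutoff from the diff
	)

	tgt := buzMask + 1 // expected distance after min until a boundary matches
	avg := buzMin + tgt // expected (untruncated) average chunk size
	fmt.Printf("target=%d KiB, expected boundary around %d KiB\n", tgt>>10, avg>>10)

	if avg > buzMax {
		// As pointed out above: the expected boundary lies well past the
		// cutoff, so most chunks simply get truncated to buzMax.
		fmt.Printf("expected boundary exceeds max=%d KiB: most chunks get truncated\n", buzMax>>10)
	}
}
```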
Shouldn't that be (1<<14)-1?
Ha! It turns out `<<` does have higher precedence than `-`. See ipfs/kubo#8952 (comment)
@dbaarda wrote:
Okay, but that doesn't really matter, as the CIDs are all the same, right? So you get like 32391 identical CIDs in a row.

Thanks for all the feedback. Definitely worth going over this again. :)
Sorry for the late reply: yes, all the min-size zero blocks have the same CID. The (maybe minor) problem is that this gives you more blocks, and thus the per-block metadata/etc. overheads are worse. But that might be insignificant... I dunno.

Also note that buzhash does this not just for zeros: it will return zero for any 32-byte run of any byte value for which the uint8->uint32 map it uses returns a value with an even number of bits set. Given the way that map is generated, I think that means at least half of all possible byte values will do this.
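A quick way to check this claim about the byte-mapping table, as a standalone sketch: `table` here is a random stand-in, since the actual buzhash byte map is unexported in this package, but the same count could be run against it.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// countEvenParity returns how many entries of a uint8->uint32 mapping have an
// even number of set bits. For a buzhash with a 32-byte window and 32-bit
// state, a run of 32 identical bytes hashes to zero exactly when the mapped
// value of that byte has even parity (the XOR of all 32 rotations cancels out).
func countEvenParity(table [256]uint32) int {
	n := 0
	for _, v := range table {
		if bits.OnesCount32(v)%2 == 0 {
			n++
		}
	}
	return n
}

func main() {
	// Stand-in table: random values, roughly what a generated map looks like.
	var table [256]uint32
	rng := rand.New(rand.NewSource(1))
	for i := range table {
		table[i] = rng.Uint32()
	}
	fmt.Println("entries with even parity:", countEvenParity(table))
}
```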
Interesting observation - we could generate a new table to avoid that property. The primary constraint during generation was keeping a 50/50 split between one and zero bits across the whole table.
This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your use case. You can learn more in the FAQs for the Boxo repo copying/consolidation effort.
Fixes ipfs/kubo#7966