-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rabin chunker - object size limit exceeded #6841
Comments
Hm. This is because the maximum block size is 1MiB.
Yep, it's trying to allocate a massive file. @ribasushi could you fix this? We should probably error in github.com/ipfs/go-ipfs-chunker (maybe in FromString?). |
Thanks! I knew I was holding it wrong ... somehow. I can confirm that this is working fine:
|
Was going to say that was a massive chunk size. Deduping content with rabin
chunking will work better with smaller chunks. The ideal chunk size is
probably the same as used for the recommended block size by rsync, which is
sqrt(filesize). However the sweet-spot compromise between number of blocks
and amount of deduping for IPFS would be different. You probably don't want
chunks smaller than about 16k.
On a side note, I've looked at the code for rabin chunking and it's using
rabin fingerprint as the rollsum for chunking. This is way more complicated
and slower for little-or-no benefit than the rabin-karp rollsum. I recently
added rabin-karp support to librsync, and did a bunch of testing of several
rollsums before I chose it.
Is there any interest in a faster/simpler rollsum for rabin chunking? Of
course changing it would make it incompatible with the old chunking.
…On Tue, 21 Jan 2020, 10:48 am @RubenKelevra, ***@***.***> wrote:
Thanks!
I knew I was holding it wrong ... somehow.
I can confirm that this is working fine:
ipfs add --nocopy -t --chunker=rabin-262144-524288-1048576 bigfile.bin
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#6841>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAPTTM6WS4AB5ZSKM6536BTQ6YZ2XANCNFSM4KJLCFWA>
.
|
We've just addesd support for a buzhash backed chunker (in master). It's significantly faster and no worse, from what we can tell. @ribasushi is working on a custom content-based chunker with a bunch of different heuristics. For now, I'd hold off on adding additional chunking algorithms directly. However, please talk with @ribasushi about this algorithm. He's writing a testing tool for comparing chunking algorithms. |
@dbaarda I'll flag you on the issue containing the test-bed, when it is ready ( couple days, deity-willing ) |
Yeah, I was just "whatevery, maybe a large number will throw a different error" and it did. ;) I think we should just do a sanity check of the input values than crashing, and the errormessage could be better than "object size limit exceeded". |
Cool! How 'stable' do you think is this implementation? The Wikipedia mirror project is thinking about fetching always the latest versions and running an ipfs cluster. On this file types a rolling checksum is key to get the diffs small. |
@RubenKelevra while buzhash itself is stable, there will be a large amount of "extra considerations" landing in the next ~4 weeks or so, that are likely to lead to a substantially different approach to chunking. You happen to ask these questions right in the middle of my work on this checklist ipfs/specs#227 (comment), hence the "cagey answers". TLDR: Stay tuned for a bit more, before making decisions |
The mirror is outdated since mid 2017. A month or two doesn't really matter much 😂 Since we're talking about multiple terabyte of data, it's fair to say that it's probably wise to wait for some upcoming major changes, to avoid converting a cluster with maybe dozens of machines. I'm currently aware of CIDv2, buzhash, UnixFSv2 and some other experimental features which are going to be standard in the future. The ticket where we try to plan the cluster: ipfs/distributed-wikipedia-mirror#68 I'm pretty new to this project and like to help organize some stuff going onto clusters of users willing to help a project with some storage - since I don't know Go - that's basically all I can do... And write some nice bullet point lists of features I like to have hint hint |
Define "stable". We're not planning on removing it, but it's not likely to be the "final" default chunker.
|
@Stebalien If selected it will give stable outputs of the same content in the future, that we don't have to convert TB of data just because the way it's been calculated has been changed. |
@RubenKelevra in terms of "too much data" from your list the most bang-for-buck would be the chunking research. CIDv1+base32 and UnixFSv2 are important endeavors but are not directly relevant to what you are doing ( a preexisting encode of historical data ought to keep working "forever", though we are working on exact definitions of "forever" so that project like yours can make informed decisions, ping @momack2 ) In addition there is a concerted push to get the content discovery ( DHT ) work better, slated for an IPFS release in ~mid-March ( #6776 (comment), scroll down to |
@ribasushi thanks for all the input! This gives a lot of time to find some people willing to donate bandwidth and storage to build a cluster for it! |
I don't foresee us changing _this_ algorithm ever. We'll just add additional chunkers.
|
Interesting reading in all the links here.
I too did my own analysis of different rollsums for librsync. I rejected
buzzhash (AKA cyclic polynomial) because it had really bad collision rates
for ASCII data;
https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst
I didn't even test rabin fingerprint because initially I couldn't find a
decent description of the algorithm and there was some confusion around it
and the rabinkarp (AKA polynomial rolling hash) rollsum used in the
Rabin-Karp algorithm documentation. When i did find it my analysis was it
was too complicated to be a viable option for librsync and I couldn't see
any benefits over the already nearly optimal rabinkarp.
However, librsync's rollsum requirements are a little different to just
chunking (it uses them for matching), so buzzhash might be fine for
chunking.
…On Wed, 22 Jan 2020, 5:28 am Steven Allen, ***@***.***> wrote:
I don't foresee us changing _this_ algorithm ever. We'll just add
additional chunkers.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6841>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAPTTM2WCC5B4GH57LGIJ7LQ645FTANCNFSM4KJLCFWA>
.
|
I think we should import some real world data, like diffs from articles on Wikipedia or the Linux Kernel git, and apply the diffs to pinned objects. Then we measure how much storage requirement each setup takes. We could even build a test out of this, to see if a commit change these known numbers significantly, to find some unexpected bugs, or automatically test storage optimizations/new rolling hashes which are added to the project. |
@RubenKelevra this is literally what I am working on, but it is slower going than I'd like. With a cleared up weekend I should have a preliminary version others can hack on before Monday... |
@dbaarda I think rollsum differs quite a bit with how we've implemented buzhash so you might want to give it a look. It uses a balanced random map for 8bit to 32bit mapping and we also use xor-shit instead of add. |
@Kubuxu in librsync there are now 2 different rollsums; the original (not very good) rsync rollsum which is a hybrid of adler32 and fletcher, and a RabinKarp rollsum (AKA polynomial rolling hash, as used in the RabinKarp algorithm doc, different to the Rabin Fingerprint). It doesn't use buzzhash (AKA cyclic polynomial) because my testing showed it had unacceptably high collisions for ASCII data. I think the reason buzzhash has high collisions is ASCII data uses only a small subset of byte values, the shift/rotate operations cycle every 32 rotates, and xor doesn't mix adjacent bits. Together this means there is a high probability of bytes to repeat in the input stream in a way that align so the xor cancels them both out of the sum. Also, just swapping any 2 bytes that are 32 bytes apart will give the same hash. It means it's less sensitive to the order of the bytes in the stream, and with ASCII's reduced range of values, the chance of ending up with the same bytes in a different order is much higher. Using a balanced random map for 8->32bits doesn't help with collisions, since the same bytes always map to the same bit pattern you xor with (though it does help to better distribute the hashes over the full 32bit range). The RabinKarp rollsum uses multiplies and adds instead of rotates and xor's, which both do propagate to adjacent bits, making it much more sensitive to the order of the bytes in the stream. It gets a good hash distribution without needing an 8->32bit balanced random mapping. However, this might not matter for chunking, which probably doesn't care about collisions. What matters most is a good enough solution that is fast and efficient. Buzzhash's rotate and xor is probably a tiny bit faster than RabinKarp's multiply and add (multiplies can be expensive). However, an 8->32bit lookup table is something that ends up stealing CPU cache space from other parts of the process, and that might be worse than the multiply overheads. You'd need to test it. For a summary of the rollsums and their names see https://en.wikipedia.org/wiki/Rolling_hash |
A periodic reminder for folks following along: the actual rolling hash calculation never exists in a vacuum. There is the outer context of hashing/taggin/moving the resulting chunks around, and this almost always takes place synchronously with the chunking itself due to stream-processing. In other words focusing on the clock-count of a particular rolling hash algorithm is very misleading. I am still working on the tool I mentioned higher up, that makes it more convenient to quantify the above claim. A lot of progress has been made, but realistically at least another full week out before a |
This sounds like it's probably wise to run a test on a single CPU core and run synthetic tests additionally along, to measure how the impact on additionally jobs running in parallel would be. Prime95 comes to mind a synthetic load. I did some testing a while back and used BOINC which would run just one workunit of Seti@home as a background load with high nice setting. The interesting value for me was the system load, one minute into the test. The issue with using something like BOINC is, you can't start the same work units over and over again in an automated test to see if changes to the rest of the code may have an impact. Prime95 could run the same work on each test, helping to find performance bugs.
Sounds like you made some progress, thanks for your work! |
Version information:
go-ipfs version: 0.4.22-4e981576b-dirty
Repo version: 7
System version: amd64/linux
Golang version: go1.12.8
Description:
using the rabin chunker seems to work only on small files right now:
ipfs add --nocopy -t --chunker=rabin-262144-1048576-8388608 bigfile.bin
bigfile.bin is 670 MB, the progress freezes at some MB and this is the error message:
Additionally running this command will crash ipfs instantly:
ipfs add --nocopy -t --chunker=rabin-262144-1048576-99999999999 bigfile.bin
The text was updated successfully, but these errors were encountered: