Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rabin chunker - object size limit exceeded #6841

Closed
RubenKelevra opened this issue Jan 20, 2020 · 21 comments · Fixed by #6884
Closed

rabin chunker - object size limit exceeded #6841

RubenKelevra opened this issue Jan 20, 2020 · 21 comments · Fixed by #6884
Assignees
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@RubenKelevra
Copy link
Contributor

RubenKelevra commented Jan 20, 2020

Version information:

go-ipfs version: 0.4.22-4e981576b-dirty
Repo version: 7
System version: amd64/linux
Golang version: go1.12.8

Description:

using the rabin chunker seems to work only on small files right now:

ipfs add --nocopy -t --chunker=rabin-262144-1048576-8388608 bigfile.bin

bigfile.bin is 670 MB, the progress freezes at some MB and this is the error message:

Error: object size limit exceeded

Additionally running this command will crash ipfs instantly:

ipfs add --nocopy -t --chunker=rabin-262144-1048576-99999999999 bigfile.bin

@RubenKelevra RubenKelevra added the kind/bug A bug in existing code (including security flaws) label Jan 20, 2020
@Stebalien
Copy link
Member

Hm. This is because the maximum block size is 1MiB.

Additionally running this command will crash ipfs instantly:

Yep, it's trying to allocate a massive file.


@ribasushi could you fix this? We should probably error in github.com/ipfs/go-ipfs-chunker (maybe in FromString?).

@RubenKelevra
Copy link
Contributor Author

Thanks!

I knew I was holding it wrong ... somehow.

I can confirm that this is working fine:

ipfs add --nocopy -t --chunker=rabin-262144-524288-1048576 bigfile.bin

@dbaarda
Copy link

dbaarda commented Jan 21, 2020 via email

@Stebalien
Copy link
Member

We've just addesd support for a buzhash backed chunker (in master). It's significantly faster and no worse, from what we can tell.

@ribasushi is working on a custom content-based chunker with a bunch of different heuristics.

For now, I'd hold off on adding additional chunking algorithms directly. However, please talk with @ribasushi about this algorithm. He's writing a testing tool for comparing chunking algorithms.

@ribasushi
Copy link
Contributor

@dbaarda I'll flag you on the issue containing the test-bed, when it is ready ( couple days, deity-willing )

@RubenKelevra
Copy link
Contributor Author

RubenKelevra commented Jan 21, 2020

Was going to say that was a massive chunk size. Deduping content with rabin chunking will work better with smaller chunks.

Yeah, I was just "whatevery, maybe a large number will throw a different error" and it did. ;)

I think we should just do a sanity check of the input values than crashing, and the errormessage could be better than "object size limit exceeded".

@RubenKelevra
Copy link
Contributor Author

We've just addesd support for a buzhash backed chunker (in master). It's significantly faster and no worse, from what we can tell.

Cool! How 'stable' do you think is this implementation? The Wikipedia mirror project is thinking about fetching always the latest versions and running an ipfs cluster. On this file types a rolling checksum is key to get the diffs small.

@ribasushi
Copy link
Contributor

@RubenKelevra while buzhash itself is stable, there will be a large amount of "extra considerations" landing in the next ~4 weeks or so, that are likely to lead to a substantially different approach to chunking.

You happen to ask these questions right in the middle of my work on this checklist ipfs/specs#227 (comment), hence the "cagey answers".

TLDR: Stay tuned for a bit more, before making decisions

@RubenKelevra
Copy link
Contributor Author

The mirror is outdated since mid 2017. A month or two doesn't really matter much 😂

Since we're talking about multiple terabyte of data, it's fair to say that it's probably wise to wait for some upcoming major changes, to avoid converting a cluster with maybe dozens of machines.

I'm currently aware of CIDv2, buzhash, UnixFSv2 and some other experimental features which are going to be standard in the future.

The ticket where we try to plan the cluster:

ipfs/distributed-wikipedia-mirror#68

I'm pretty new to this project and like to help organize some stuff going onto clusters of users willing to help a project with some storage - since I don't know Go - that's basically all I can do...

And write some nice bullet point lists of features I like to have hint hint

ipfs/notes#397

@Stebalien
Copy link
Member

Stebalien commented Jan 21, 2020 via email

@RubenKelevra
Copy link
Contributor Author

@Stebalien If selected it will give stable outputs of the same content in the future, that we don't have to convert TB of data just because the way it's been calculated has been changed.

@ribasushi
Copy link
Contributor

@RubenKelevra in terms of "too much data" from your list the most bang-for-buck would be the chunking research. CIDv1+base32 and UnixFSv2 are important endeavors but are not directly relevant to what you are doing ( a preexisting encode of historical data ought to keep working "forever", though we are working on exact definitions of "forever" so that project like yours can make informed decisions, ping @momack2 )

In addition there is a concerted push to get the content discovery ( DHT ) work better, slated for an IPFS release in ~mid-March ( #6776 (comment), scroll down to Todo). While it will not change how you encode Wikipedia's data "at rest", it will make a massive difference in terms of usability, so any evaluation you do right now, may need to be re-done at that point.

@RubenKelevra
Copy link
Contributor Author

@ribasushi thanks for all the input!

This gives a lot of time to find some people willing to donate bandwidth and storage to build a cluster for it!

@Stebalien
Copy link
Member

Stebalien commented Jan 21, 2020 via email

@dbaarda
Copy link

dbaarda commented Jan 21, 2020 via email

@RubenKelevra
Copy link
Contributor Author

However, librsync's rollsum requirements are a little different to just chunking (it uses them for matching), so buzzhash might be fine for chunking.

I think we should import some real world data, like diffs from articles on Wikipedia or the Linux Kernel git, and apply the diffs to pinned objects.

Then we measure how much storage requirement each setup takes.

We could even build a test out of this, to see if a commit change these known numbers significantly, to find some unexpected bugs, or automatically test storage optimizations/new rolling hashes which are added to the project.

@ribasushi
Copy link
Contributor

ribasushi commented Jan 22, 2020

Then we measure how much storage requirement each setup takes.

@RubenKelevra this is literally what I am working on, but it is slower going than I'd like. With a cleared up weekend I should have a preliminary version others can hack on before Monday...

@Kubuxu
Copy link
Member

Kubuxu commented Feb 8, 2020

@dbaarda I think rollsum differs quite a bit with how we've implemented buzhash so you might want to give it a look. It uses a balanced random map for 8bit to 32bit mapping and we also use xor-shit instead of add.

@dbaarda
Copy link

dbaarda commented Feb 8, 2020

@Kubuxu in librsync there are now 2 different rollsums; the original (not very good) rsync rollsum which is a hybrid of adler32 and fletcher, and a RabinKarp rollsum (AKA polynomial rolling hash, as used in the RabinKarp algorithm doc, different to the Rabin Fingerprint). It doesn't use buzzhash (AKA cyclic polynomial) because my testing showed it had unacceptably high collisions for ASCII data.

I think the reason buzzhash has high collisions is ASCII data uses only a small subset of byte values, the shift/rotate operations cycle every 32 rotates, and xor doesn't mix adjacent bits. Together this means there is a high probability of bytes to repeat in the input stream in a way that align so the xor cancels them both out of the sum. Also, just swapping any 2 bytes that are 32 bytes apart will give the same hash. It means it's less sensitive to the order of the bytes in the stream, and with ASCII's reduced range of values, the chance of ending up with the same bytes in a different order is much higher. Using a balanced random map for 8->32bits doesn't help with collisions, since the same bytes always map to the same bit pattern you xor with (though it does help to better distribute the hashes over the full 32bit range).

The RabinKarp rollsum uses multiplies and adds instead of rotates and xor's, which both do propagate to adjacent bits, making it much more sensitive to the order of the bytes in the stream. It gets a good hash distribution without needing an 8->32bit balanced random mapping.

However, this might not matter for chunking, which probably doesn't care about collisions. What matters most is a good enough solution that is fast and efficient. Buzzhash's rotate and xor is probably a tiny bit faster than RabinKarp's multiply and add (multiplies can be expensive). However, an 8->32bit lookup table is something that ends up stealing CPU cache space from other parts of the process, and that might be worse than the multiply overheads. You'd need to test it.

For a summary of the rollsums and their names see https://en.wikipedia.org/wiki/Rolling_hash

@ribasushi
Copy link
Contributor

... is something that ends up stealing CPU cache space from other parts of the process...

A periodic reminder for folks following along: the actual rolling hash calculation never exists in a vacuum. There is the outer context of hashing/taggin/moving the resulting chunks around, and this almost always takes place synchronously with the chunking itself due to stream-processing.

In other words focusing on the clock-count of a particular rolling hash algorithm is very misleading.

I am still working on the tool I mentioned higher up, that makes it more convenient to quantify the above claim. A lot of progress has been made, but realistically at least another full week out before a v0.1.

@RubenKelevra
Copy link
Contributor Author

... is something that ends up stealing CPU cache space from other parts of the process...

A periodic reminder for folks following along: the actual rolling hash calculation never exists in a vacuum. There is the outer context of hashing/taggin/moving the resulting chunks around, and this almost always takes place synchronously with the chunking itself due to stream-processing.
In other words focusing on the clock-count of a particular rolling hash algorithm is very misleading.

This sounds like it's probably wise to run a test on a single CPU core and run synthetic tests additionally along, to measure how the impact on additionally jobs running in parallel would be.

Prime95 comes to mind a synthetic load.

I did some testing a while back and used BOINC which would run just one workunit of Seti@home as a background load with high nice setting.

The interesting value for me was the system load, one minute into the test.

The issue with using something like BOINC is, you can't start the same work units over and over again in an automated test to see if changes to the rest of the code may have an impact.

Prime95 could run the same work on each test, helping to find performance bugs.

I am still working on the tool I mentioned higher up, that makes it more convenient to quantify the above claim. A lot of progress has been made, but realistically at least another full week out before a v0.1.

Sounds like you made some progress, thanks for your work!

ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants