Performance regression in v1.9.0 #1869
Comments
That sounds likely, yes. The terrible hack for lagrange basis caching probably needs to be redone in a disciplined way. I'll discuss with @45930.
Here's one idea: The last time I had a performance regression related to Lagrange basis creation, the problem was that they were generated in a single thread. Multi-threading really helps with making them faster. The recent PR probably increased the number of places where LB creation gets triggered. So maybe now we calculate LB eagerly in some place that only uses a single thread, when previously it used to be deferred and then calculated in multiple threads. I don't think the LB caching is a terrible hack btw!
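To illustrate the eager-vs-deferred point, here is a minimal sketch (illustrative only, not the actual bindings code): deferring the computation behind a memoized thunk means it only runs when first forced, e.g. later inside proving where the multi-threaded backend is already set up, instead of at whatever call site triggered it.

```ts
// Illustrative sketch only; not the actual o1js/Mina bindings code.
type LagrangeBasis = Uint8Array; // stand-in for the real basis data

// eager: the basis is computed right away, in whatever context triggered it
// (possibly one where only a single thread is available)
function eagerLB(compute: () => LagrangeBasis): LagrangeBasis {
  return compute();
}

// deferred: wrap the computation in a memoized thunk; it only runs when first
// forced, e.g. inside proving where the thread pool already exists
function deferredLB(compute: () => LagrangeBasis): () => LagrangeBasis {
  let cached: LagrangeBasis | undefined;
  return () => (cached ??= compute());
}
```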
Yes, the ZkProgram that compiles 3x slower in my case is also calculating recursive proofs.
According to the research by @jmikedupont2, in 1.9.0 there is a slow process that barely uses the CPU, inserted at the beginning of compilation:
Update: I confirmed (by looking at a flamegraph in Chrome) that the long time is spent in ... Now tracing this down is simple: we can put a ...
It's now pretty obvious what is going on: previously, ... With @mrmr1993's change, it seems that ... (@mrmr1993: also worth noting that this has nothing to do with the "terrible hack for lagrange basis caching", but rather with not being careful about performance on the Rust side.)
This is called from inside the ...
@mitschabaude the original issue that made us implement this new behavior was #1411. Making sure that the LB is there when it's asked for, and never missing, did resolve that issue but introduced this performance bug. Another idea we were discussing is computing the LB and shipping it in o1js so that it never needs to be computed at all. This would come with a bundle size cost. I don't want to regress #1411 and keep flip-flopping between the issues, so I'm reticent to just revert the changes. Is it possible to revert the change to ...?
Yeah of course, we can't just revert everything. Both issues have to be fixed at the same time, and this really shouldn't be a big problem if one takes the effort to understand exactly where the original version went wrong. I didn't look into the original bug, so I can't say whether only reverting ... would be enough. If that doesn't work, that just means more effort is needed to fix the performance regression. We can't keep the performance as it is now. The "maybe" method is fundamentally needed for caching: the file system cache is slow, so you can't read from it every single time. Therefore, you need a fast cache that takes over once the file system cache has been read from the file system once. That fast cache is the in-memory LB. To act as a cache though, it needs to support cache misses. Cache misses (= returning ...)
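A minimal sketch of the two-tier lookup described in that comment, with hypothetical names (`maybeLagrangeBasis`, `getLagrangeBasis`, and the callbacks are illustrative, not the actual bindings API): the fast in-memory cache answers repeat lookups, the "maybe"-style read returns `undefined` on a miss, and only then do we fall back to the file-system cache or recompute.

```ts
// Illustrative sketch only; not the actual o1js/Mina bindings code.
type LagrangeBasis = Uint8Array; // stand-in for the real basis data

const inMemoryLB = new Map<string, LagrangeBasis>();

// "maybe" lookup: undefined = cache miss, so the caller decides what to do next
function maybeLagrangeBasis(key: string): LagrangeBasis | undefined {
  return inMemoryLB.get(key);
}

function getLagrangeBasis(
  key: string,
  readFromFsCache: (key: string) => LagrangeBasis | undefined,
  compute: (key: string) => LagrangeBasis
): LagrangeBasis {
  // fast path: in-memory hit, no file system access at all
  const hit = maybeLagrangeBasis(key);
  if (hit !== undefined) return hit;

  // slow path: read the file-system cache once, or recompute if it's not there
  const lb = readFromFsCache(key) ?? compute(key);
  inMemoryLB.set(key, lb); // populate the fast cache for all later lookups
  return lb;
}
```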
@45930 I tried it out, it works: this PR fixes performance and also keeps the verification test working: MinaProtocol/mina#16261. Can you land it in o1js?
@mitschabaude Thanks for the PR. Looks like I still can't execute CI on Mina :( But I will organize the right people to get eyes on it at least.
@mitschabaude I built your fix here:
I was looking for a good way to unit test the cache, but it seems too deeply rooted in the bindings code. The performance fix and the cache-read feature both work on my machine.
I think it's fine; it has been working all this time without problems, so it's probably not worth unit testing now since it didn't change.
Actually @45930 I take that back, there's a way you can test the cache: The file system cache used for LBs is actually a custom object passed in and controlled by the user. Currently, compile() sets that to the "srsCache", then runs things where it expects the cache to be used, and then unsets the cache again. For testing you could create a custom 'Cache' that also records all the cache read attempts, and pass it to compile(). That way you can assert that cache reads are attempted, which would catch the regression we just had, and similar tests.
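A sketch of that test idea, assuming the o1js `Cache` shape (`read`/`write`/`canWrite`) and the `compile({ cache })` option; the exact signatures and the `persistentId` field on the cache header are assumptions to verify against the o1js version in use, and `MyProgram` is a placeholder ZkProgram.

```ts
import { Cache } from 'o1js';

// Wrap a real cache so every read attempt is recorded.
function recordingCache(base: Cache): { cache: Cache; reads: string[] } {
  const reads: string[] = [];
  const cache: Cache = {
    read(header) {
      reads.push(header.persistentId); // assumes the cache header carries a persistentId
      return base.read(header);
    },
    write(header, value) {
      base.write(header, value);
    },
    canWrite: base.canWrite,
  };
  return { cache, reads };
}

// usage: pass the wrapper to compile() and assert that reads were attempted
// const { cache, reads } = recordingCache(Cache.FileSystemDefault);
// await MyProgram.compile({ cache });
// if (reads.length === 0) throw Error('expected Lagrange basis cache reads during compile()');
```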
Thanks, I wrote a test and confirmed it passes on 1.8.0 and on my branch, and fails on 1.9.0. I think it's good to have a test since we know this caused a performance regression once. Adding a timing-based test for performance may also be nice, but if we can pinpoint the root cause, I'm for adding a test to prevent it in the future.
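A minimal sketch of that timing-based idea, as a coarse smoke test rather than a benchmark: the budget is machine-dependent and would need tuning for CI hardware, and `MyProgram` is again a placeholder.

```ts
// Fail the test if compilation exceeds a generous, machine-dependent budget.
async function assertCompilesWithin(compileFn: () => Promise<unknown>, budgetMs: number) {
  const start = performance.now();
  await compileFn();
  const elapsed = performance.now() - start;
  if (elapsed > budgetMs) {
    throw new Error(`compile took ${Math.round(elapsed)} ms, expected < ${budgetMs} ms`);
  }
}

// usage (hypothetical): await assertCompilesWithin(() => MyProgram.compile(), 120_000);
```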
Working on adding archiving of the timing tests generally to the repo, and I'm testing your patch here: https://github.com/jmikedupont2/o1js/actions/runs/11436079285. While setting up the tests, one thing I noted was that one test (./dist/node/lib/proof-system/cached-verification.unit-test.js) stood out as supremely slow (3.5h) in 1.9: https://github.com/jmikedupont2/o1js/actions/runs/11429190936/job/31795307052. Not sure if it is the same root cause, but I will have more to say in the next days. These patches write perf logs to the GitHub artifacts so we can trace individual function performance changes across commits, and they also split up the work into chunks that can be handled without timing out. We can report on those artifacts with other GitHub Actions or process them locally.
@jmikedupont2 that test was introduced to fix a process that hangs forever, which seems to be what's happening on your branch. The test passes quickly on the main branch and in my PR branch: https://github.com/o1-labs/o1js/actions/runs/11418129637/job/31771226595. I'd double check you have the right branch merged on your fork.
v1.9.1 is on its way out!
I'm experiencing a crazy performance regression when compiling a recursive circuit, which occurs when switching to the latest 1.9.0 release from the CD release https://pkg.pr.new/o1-labs/o1js@0f8ff81 that came just before it. I'm seeing compilation time go from 17s to 180s. This is reproducible when switching between the two installed versions.
The obvious candidate PR that could have introduced such a regression is #1857 @mrmr1993