Performance regression in v1.9.0 #1869

Closed
mitschabaude opened this issue Oct 16, 2024 · 18 comments · Fixed by #1874

Comments

@mitschabaude
Collaborator

I'm experiencing a crazy performance regression when compiling a recursive circuit, which occurs when switching to the latest 1.9.0 release from the CD release https://pkg.pr.new/o1-labs/o1js@0f8ff81 that came just before it.

I'm seeing compilation time go from 17s to 180s. This is reproducible when switching between the two installed versions.

The obvious candidate PR that could have introduced such a regression is #1857 @mrmr1993

@mrmr1993
Member

That sounds likely, yes. The terrible hack for Lagrange basis caching probably needs to be redone in a disciplined way. I'll discuss with @45930.

@mitschabaude
Collaborator Author

Here's one idea: the last time I had a performance regression related to Lagrange basis creation, the problem was that the bases were generated in a single thread; multi-threading really helps with making them faster. The recent PR probably increased the number of places where LB creation gets triggered. So maybe we now calculate the LB eagerly in some place that only uses a single thread, where previously it was deferred and then calculated across multiple threads.

I don't think the LB caching is a terrible hack btw!

@dfstio

dfstio commented Oct 17, 2024

Yes, the ZkProgram that compiles 3x slower in my case is also calculating recursive proofs.

@dfstio

dfstio commented Oct 18, 2024

> The last time I had a performance regression related to Lagrange basis creation, the problem was that the bases were generated in a single thread.

According to the investigation by @jmikedupont2, in 1.9.0 a slow process that barely uses the CPU has been inserted at the beginning of compilation:

#1870 (comment)

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

Update: I confirmed (by looking at a flamegraph in Chrome) that the long time is spent in caml_fq_srs_maybe_lagrange_commitment, called via lagrangeCommitment() in o1js-bindings, and that it only uses a single CPU.

Tracing this down is simple: we can put a console.trace('') inside lagrangeCommitment() and see where it is called from (I recommend running with node --enable-source-maps to get OCaml traces).
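
For reference, a minimal sketch of that tracing trick; the traced() helper and the commented-out usage are illustrative assumptions, not part of the actual o1js bindings:

```ts
// Minimal illustration of the debugging trick: wrap a suspect function so that
// every call prints a stack trace. Run with `node --enable-source-maps` so that
// frames from the compiled OCaml/Rust bindings map back to their sources.
// `traced` is a hypothetical helper, not part of o1js.
function traced<F extends (...args: any[]) => any>(label: string, fn: F): F {
  return ((...args: any[]) => {
    console.trace(label); // logs the call site of every invocation
    return fn(...args);
  }) as F;
}

// Hypothetical usage inside the bindings:
// lagrangeCommitment = traced('lagrangeCommitment', lagrangeCommitment);
```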

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

It's now pretty obvious what is going on: previously, maybe_lagrange_commitment only checked whether the Lagrange commitment exists and returned it if so. So it was fine to run it on a single thread, and the kimchi bindings do so.

With @mrmr1993's change, it seems that maybe_lagrange_commitment actually computes the LB if it's not there. Doing this on a single thread is not OK anymore!

(@mrmr1993 also worth noting that this has nothing to do with the "terrible hack for Lagrange basis caching", but rather with not being careful about performance on the Rust side)
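
To make the behavioral difference concrete, here is a TypeScript-style sketch; the real code lives in the Rust kimchi bindings, so every name and signature below is an assumption:

```ts
// Placeholders, not the real bindings API:
type LagrangeBasis = unknown;
declare const lagrangeBasisCache: Map<string, LagrangeBasis>;
declare function computeLagrangeBasis(domainSize: number): LagrangeBasis; // expensive

// Old behavior: a pure lookup, cheap enough to run on a single thread.
function maybeLagrangeCommitmentOld(key: string): LagrangeBasis | undefined {
  return lagrangeBasisCache.get(key);
}

// New behavior (roughly, after #1857): computes the basis when it is missing,
// so the expensive work now runs on the single-threaded code path.
function maybeLagrangeCommitmentNew(key: string, domainSize: number): LagrangeBasis {
  let lb = lagrangeBasisCache.get(key);
  if (lb === undefined) {
    lb = computeLagrangeBasis(domainSize);
    lagrangeBasisCache.set(key, lb);
  }
  return lb;
}
```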

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

This is called from inside step_verifier.ml. The performance is fixed when, inside lagrangeCommitment(), we directly call caml_fq_srs_lagrange_commitment instead of the "maybe" variant. But that breaks caching, so the proper fix is to reintroduce an actual, fast "maybe" variant that we can use before trying to read from the cache.

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude the original issue that made us implement this new behavior was #1411. Making sure the LB is always there when it's asked for did resolve that issue, but it introduced this performance bug.

Another idea we were discussing is computing the LB and shipping it in o1js, so that it never needs to be computed at runtime. This would come with a bundle-size cost.

I don't want to regress #1411 and keep flip-flopping between the issues, so I'm reluctant to just revert the changes. Is it possible to revert the change to maybe_lagrange_commitment without regressing the cache behavior?

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

> I don't want to regress #1411 and keep flip-flopping between the issues, so I'm reluctant to just revert the changes. Is it possible to revert the change to maybe_lagrange_commitment without regressing the cache behavior?

Yeah of course, we can't just revert everything. Both issues have to be fixed at the same time, and this really shouldn't be a big problem if one takes the effort to understand exactly where the original version went wrong.

I didn't look into the original bug, so I can't say whether only reverting maybe_lagrange_commitment will regress to the original bug. But it seems worth a try! Simple enough to revert only that function, rebuild and run the regression test.

If that doesn't work, that just means more effort is needed to fix the performance regression. We can't keep the performance as it is now.

The "maybe" method is fundamentally needed for caching. The file system cache is slow: you can't read from it on every single call. Therefore, you need a fast cache that takes over once the file system cache has been read once. That fast cache is the in-memory LB. To act as a cache, though, it needs to support cache misses. Cache misses (i.e., returning undefined) used to be supported by maybeLagrangeBasis(), but the current version never misses, so the file system cache is never used. See the sketch below.
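
A minimal sketch of that two-level pattern, with purely illustrative names and signatures (not the actual o1js/kimchi API):

```ts
type LagrangeBasis = Uint8Array;
type FsCache = {
  read(key: string): LagrangeBasis | undefined;
  write(key: string, value: LagrangeBasis): void;
};

// Fast in-memory cache; the "maybe" lookup must be able to miss (return
// undefined), otherwise the file system cache below is never consulted.
const inMemory = new Map<string, LagrangeBasis>();

function maybeLagrangeBasis(key: string): LagrangeBasis | undefined {
  return inMemory.get(key);
}

function getLagrangeBasis(
  key: string,
  fsCache: FsCache,
  compute: () => LagrangeBasis
): LagrangeBasis {
  // 1. fast in-memory cache
  let lb = maybeLagrangeBasis(key);
  if (lb !== undefined) return lb;
  // 2. slower file system cache, read at most once per key
  lb = fsCache.read(key);
  if (lb === undefined) {
    // 3. only now pay for the expensive (ideally multi-threaded) computation
    lb = compute();
    fsCache.write(key, lb);
  }
  inMemory.set(key, lb);
  return lb;
}
```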

@mitschabaude
Collaborator Author

@45930 I tried it out, it works - this PR fixes performance and also keeps the verification test working: MinaProtocol/mina#16261

Can you land it in o1js?

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude Thanks for the PR. Looks like I still can't execute CI on Mina :( But I will organize the right people to get eyes on it at least.

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude I built your fix here:

I was looking for a good way to unit test the cache, but it seems too deeply rooted in the bindings code. The performance fix and the cache-read behavior both seem to work on my machine.

@mitschabaude
Collaborator Author

> I was looking for a good way to unit test the cache

I think it's fine; it has been working all this time without problems, so it's probably not worth unit testing now, since it didn't change.

@mitschabaude
Collaborator Author

Actually @45930 I take that back, there's a way you can test the cache:

The file system cache used for LBs is actually a custom object passed in and controlled by the user. Currently, compile() sets that to the "srsCache", then runs things where it expects the cache to be used, and then unsets the cache again.

For testing, you could create a custom 'Cache' that also records all the cache read attempts, and pass it to compile(). That way you can assert that cache reads are attempted, which would catch the regression we just had, and similar ones; see the sketch below.
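
A sketch of such a recording cache, assuming the o1js Cache interface exposes read(header), write(header, value) and canWrite (check the actual definitions in the repo; the commented-out usage is hypothetical, not the shipped test):

```ts
import { Cache } from 'o1js';

// Derive the header type from Cache itself, to avoid guessing exported names.
type Header = Parameters<Cache['read']>[0];

// Wraps an existing cache and records every read attempt.
function recordingCache(base: Cache): Cache & { reads: Header[] } {
  const reads: Header[] = [];
  return {
    reads,
    canWrite: base.canWrite,
    read(header) {
      reads.push(header); // remember every cache read attempt
      return base.read(header);
    },
    write(header, value) {
      base.write(header, value);
    },
  };
}

// Hypothetical usage: pass the recording cache to compile() and assert that
// cache reads were attempted during compilation.
// const cache = recordingCache(Cache.FileSystemDefault);
// await MyProgram.compile({ cache });
// if (cache.reads.length === 0) throw Error('expected cache reads during compile');
```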

@45930
Contributor

45930 commented Oct 19, 2024

@mitschabaude

Thanks, I wrote a test and confirmed that it passes on 1.8.0 and on my branch, and fails on 1.9.0:

https://github.com/o1-labs/o1js/pull/1874/files#diff-e7d3bfd05832c7137d897ba35e98b9b92dbfff4a541efec13eea2701c126f8caR50

I think it's good to have a test, since we know this caused a performance regression once. Adding a timing-based test for performance might also be nice, but since we can pinpoint the root cause, I'm for adding a targeted test to prevent it in the future.

@jmikedupont2

jmikedupont2 commented Oct 21, 2024

I'm working on adding general archiving of the timing tests to the repo, and am testing your patch here: https://github.com/jmikedupont2/o1js/actions/runs/11436079285 .

While setting up the tests, I noticed that one test (./dist/node/lib/proof-system/cached-verification.unit-test.js) stood out as extremely slow (3.5h) in 1.9: https://github.com/jmikedupont2/o1js/actions/runs/11429190936/job/31795307052. I'm not sure if it has the same root cause, but I will have more to say in the next few days.

These patches write perf logs to the GitHub artifacts so we can trace individual function performance changes across commits, and also split the work into chunks that can run without timing out. We can report on those artifacts with other GitHub Actions or process them locally.

@45930
Contributor

45930 commented Oct 21, 2024

@jmikedupont2 that test was introduced to fix a process that hangs forever, which seems to be what's happening on your branch.

The test passes quickly on the main branch and in my PR branch https://github.com/o1-labs/o1js/actions/runs/11418129637/job/31771226595

I'd double-check that you have the right branches merged on your fork.

@45930
Contributor

45930 commented Oct 21, 2024

v1.9.1 is on its way out!
