Performance regression in v1.9.0 #1869

Closed
mitschabaude opened this issue Oct 16, 2024 · 18 comments · Fixed by #1874

Comments

@mitschabaude
Collaborator

I'm experiencing a crazy performance regression when compiling a recursive circuit, which occurs when switching to the latest 1.9.0 release from the CD release https://pkg.pr.new/o1-labs/o1js@0f8ff81 that came just before it.

I'm seeing compilation time go from 17s to 180s. This is reproducible when switching between the two installed versions.

The obvious candidate PR that could have introduced such a regression is #1857 @mrmr1993

@mrmr1993
Member

That sounds likely, yes. The terrible hack for Lagrange basis caching probably needs to be redone in a disciplined way. I'll discuss with @45930.

@mitschabaude
Collaborator Author

Here's one idea: the last time I had a performance regression related to Lagrange basis creation, the problem was that the bases were generated in a single thread; multi-threading really helps with making them faster. The recent PR probably increased the number of places where LB creation gets triggered. So maybe we now calculate the LB eagerly in some place that only uses a single thread, where previously it was deferred and then calculated across multiple threads.

I don't think the LB caching is a terrible hack btw!

@dfstio

dfstio commented Oct 17, 2024

Yes, the ZkProgram that compiles 3x slower in my case is also calculating recursive proofs.

@dfstio

dfstio commented Oct 18, 2024

> The last time I had a performance regression related to Lagrange basis creation, the problem was that the bases were generated in a single thread.

According to the investigation by @jmikedupont2, in 1.9.0 a slow process that barely uses the CPU has been inserted at the beginning of compilation:

#1870 (comment)

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

Update: I confirmed (by looking at a flamegraph in Chrome) that the long time is spent in caml_fq_srs_maybe_lagrange_commitment, called via lagrangeCommitment() in o1js-bindings, and that it only uses a single CPU.

Tracing this down is simple: we can put a console.trace('') inside lagrangeCommitment() and see where it is called from (I recommend running with node --enable-source-maps to get OCaml traces).
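
For reference, a minimal sketch of that tracing trick; the traced() helper and the commented-out usage are illustrative assumptions, not part of the actual o1js bindings:

```ts
// Minimal illustration of the debugging trick: wrap a suspect function so that
// every call prints a stack trace. Run with `node --enable-source-maps` so that
// frames from the compiled OCaml/Rust bindings map back to their sources.
// `traced` is a hypothetical helper, not part of o1js.
function traced<F extends (...args: any[]) => any>(label: string, fn: F): F {
  return ((...args: any[]) => {
    console.trace(label); // logs the call site of every invocation
    return fn(...args);
  }) as F;
}

// Hypothetical usage inside the bindings:
// lagrangeCommitment = traced('lagrangeCommitment', lagrangeCommitment);
```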

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

It's now pretty obvious what is going on: previously, maybe_lagrange_commitment only checked whether the Lagrange commitment exists and returned it if so. So it was fine to run it on a single thread, and the kimchi bindings do so.

With @mrmr1993's change, it seems that maybe_lagrange_commitment actually computes the LB if it's not there. Doing this on a single thread is not OK anymore!

(@mrmr1993 also worth noting that this has nothing to do with the "terrible hack for Lagrange basis caching", but rather with not being careful about performance on the Rust side)
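
To make the behavioral difference concrete, here is a TypeScript-style sketch; the real code lives in the Rust kimchi bindings, so every name and signature below is an assumption:

```ts
// Placeholders, not the real bindings API:
type LagrangeBasis = unknown;
declare const lagrangeBasisCache: Map<string, LagrangeBasis>;
declare function computeLagrangeBasis(domainSize: number): LagrangeBasis; // expensive

// Old behavior: a pure lookup, cheap enough to run on a single thread.
function maybeLagrangeCommitmentOld(key: string): LagrangeBasis | undefined {
  return lagrangeBasisCache.get(key);
}

// New behavior (roughly, after #1857): computes the basis when it is missing,
// so the expensive work now runs on the single-threaded code path.
function maybeLagrangeCommitmentNew(key: string, domainSize: number): LagrangeBasis {
  let lb = lagrangeBasisCache.get(key);
  if (lb === undefined) {
    lb = computeLagrangeBasis(domainSize);
    lagrangeBasisCache.set(key, lb);
  }
  return lb;
}
```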

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

This is called from inside step_verifier.ml. The performance is fixed when, inside lagrangeCommitment(), we directly call caml_fq_srs_lagrange_commitment instead of the "maybe" variant. But that breaks caching, so the proper fix is to reintroduce an actual, fast "maybe" variant that we can use before trying to read from the cache.

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude the original issue that made us implement this new behavior was #1411. Making sure the LB is always there when it's asked for did resolve that issue, but it introduced this performance bug.

Another idea we were discussing is computing the LB and shipping it in o1js, so that it never needs to be computed at runtime. This would come with a bundle-size cost.

I don't want to regress #1411 and keep flip-flopping between the issues, so I'm reluctant to just revert the changes. Is it possible to revert the change to maybe_lagrange_commitment without regressing the cache behavior?

@mitschabaude
Collaborator Author

mitschabaude commented Oct 18, 2024

> I don't want to regress #1411 and keep flip-flopping between the issues, so I'm reluctant to just revert the changes. Is it possible to revert the change to maybe_lagrange_commitment without regressing the cache behavior?

Yeah of course, we can't just revert everything. Both issues have to be fixed at the same time, and this really shouldn't be a big problem if one takes the effort to understand exactly where the original version went wrong.

I didn't look into the original bug, so I can't say whether only reverting maybe_lagrange_commitment will regress to the original bug. But it seems worth a try! Simple enough to revert only that function, rebuild and run the regression test.

If that doesn't work, that just means more effort is needed to fix the performance regression. We can't keep the performance as it is now.

The "maybe" method is fundamentally needed for caching. The file system cache is slow: you can't read from it on every single call. Therefore, you need a fast cache that takes over once the file system cache has been read once. That fast cache is the in-memory LB. To act as a cache, though, it needs to support cache misses. Cache misses (i.e., returning undefined) used to be supported by maybeLagrangeBasis(), but the current version never misses, so the file system cache is never used. See the sketch below.
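
A minimal sketch of that two-level pattern, with purely illustrative names and signatures (not the actual o1js/kimchi API):

```ts
type LagrangeBasis = Uint8Array;
type FsCache = {
  read(key: string): LagrangeBasis | undefined;
  write(key: string, value: LagrangeBasis): void;
};

// Fast in-memory cache; the "maybe" lookup must be able to miss (return
// undefined), otherwise the file system cache below is never consulted.
const inMemory = new Map<string, LagrangeBasis>();

function maybeLagrangeBasis(key: string): LagrangeBasis | undefined {
  return inMemory.get(key);
}

function getLagrangeBasis(
  key: string,
  fsCache: FsCache,
  compute: () => LagrangeBasis
): LagrangeBasis {
  // 1. fast in-memory cache
  let lb = maybeLagrangeBasis(key);
  if (lb !== undefined) return lb;
  // 2. slower file system cache, read at most once per key
  lb = fsCache.read(key);
  if (lb === undefined) {
    // 3. only now pay for the expensive (ideally multi-threaded) computation
    lb = compute();
    fsCache.write(key, lb);
  }
  inMemory.set(key, lb);
  return lb;
}
```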

@mitschabaude
Collaborator Author

@45930 I tried it out, it works - this PR fixes performance and also keeps the verification test working: MinaProtocol/mina#16261

Can you land it in o1js?

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude Thanks for the PR. Looks like I still can't execute CI on Mina :( But I will organize the right people to get eyes on it at least.

@45930
Contributor

45930 commented Oct 18, 2024

@mitschabaude I built your fix here:

I was looking for a good way to unit test the cache, but it seems too deeply rooted in the bindings code. The performance fix and the cache-read behavior both seem to work on my machine.

@mitschabaude
Collaborator Author

> I was looking for a good way to unit test the cache

I think it's fine; it has been working all this time without problems, so it's probably not worth unit testing now, since it didn't change.

@mitschabaude
Collaborator Author

Actually @45930 I take that back, there's a way you can test the cache:

The file system cache used for LBs is actually a custom object passed in and controlled by the user. Currently, compile() sets that to the "srsCache", then runs things where it expects the cache to be used, and then unsets the cache again.

For testing, you could create a custom 'Cache' that also records all the cache read attempts, and pass it to compile(). That way you can assert that cache reads are attempted, which would catch the regression we just had, and similar ones; see the sketch below.
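
A sketch of such a recording cache, assuming the o1js Cache interface exposes read(header), write(header, value) and canWrite (check the actual definitions in the repo; the commented-out usage is hypothetical, not the shipped test):

```ts
import { Cache } from 'o1js';

// Derive the header type from Cache itself, to avoid guessing exported names.
type Header = Parameters<Cache['read']>[0];

// Wraps an existing cache and records every read attempt.
function recordingCache(base: Cache): Cache & { reads: Header[] } {
  const reads: Header[] = [];
  return {
    reads,
    canWrite: base.canWrite,
    read(header) {
      reads.push(header); // remember every cache read attempt
      return base.read(header);
    },
    write(header, value) {
      base.write(header, value);
    },
  };
}

// Hypothetical usage: pass the recording cache to compile() and assert that
// cache reads were attempted during compilation.
// const cache = recordingCache(Cache.FileSystemDefault);
// await MyProgram.compile({ cache });
// if (cache.reads.length === 0) throw Error('expected cache reads during compile');
```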

@45930
Contributor

45930 commented Oct 19, 2024

@mitschabaude

Thanks, I wrote a test and confirmed that it passes on 1.8.0 and on my branch, and fails on 1.9.0:

https://github.com/o1-labs/o1js/pull/1874/files#diff-e7d3bfd05832c7137d897ba35e98b9b92dbfff4a541efec13eea2701c126f8caR50

I think it's good to have a test, since we know this caused a performance regression once. Adding a timing-based test for performance might also be nice, but since we can pinpoint the root cause, I'm for adding a targeted test to prevent it in the future.

@jmikedupont2

jmikedupont2 commented Oct 21, 2024

I'm working on adding general archiving of the timing tests to the repo, and am testing your patch here: https://github.com/jmikedupont2/o1js/actions/runs/11436079285 .

While setting up the tests, I noticed that one test (./dist/node/lib/proof-system/cached-verification.unit-test.js) stood out as extremely slow (3.5h) in 1.9: https://github.com/jmikedupont2/o1js/actions/runs/11429190936/job/31795307052. I'm not sure if it has the same root cause, but I will have more to say in the next few days.

These patches write perf logs to the GitHub artifacts so we can trace individual function performance changes across commits, and also split the work into chunks that can run without timing out. We can report on those artifacts with other GitHub Actions or process them locally.

@45930
Contributor

45930 commented Oct 21, 2024

@jmikedupont2 that test was introduced to fix a process that hangs forever, which seems to be what's happening on your branch.

The test passes quickly on the main branch and in my PR branch https://github.com/o1-labs/o1js/actions/runs/11418129637/job/31771226595

I'd double-check that you have the right branches merged on your fork.

@45930
Contributor

45930 commented Oct 21, 2024

v1.9.1 is on its way out!
