Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

[ci] Add weights job #13316

Closed
wants to merge 3 commits into from
Closed

[ci] Add weights job #13316

wants to merge 3 commits into from

Conversation

alvicsam
Copy link
Contributor

@alvicsam alvicsam commented Feb 6, 2023

PR adds job that runs benchmarks and creates artifacts with git diff. The job runs on new GCP runners, the goal is to deprecate bm* machines. Weights generation is currently in progress

https://github.com/paritytech/ci_cd/issues/697
https://github.com/paritytech/ci_cd/issues/733

@mordamax @athei

@alvicsam alvicsam requested a review from a team as a code owner February 6, 2023 10:50
@github-actions github-actions bot added the A0-please_review Pull request needs code review. label Feb 6, 2023
@mordamax
Copy link
Contributor

mordamax commented Feb 6, 2023

@alvicsam worth to re-generate weights in same PR?

@alvicsam
Copy link
Contributor Author

alvicsam commented Feb 6, 2023

Sure, I ran the new job so now I'm waiting for the new weights

@alvicsam alvicsam added B0-silent Changes should not be mentioned in any release notes A3-in_progress and removed A0-please_review Pull request needs code review. labels Feb 6, 2023
@athei
Copy link
Member

athei commented Feb 6, 2023

How would that pipeline be triggered? Is it manually?

@mordamax
Copy link
Contributor

mordamax commented Feb 6, 2023

How would that pipeline be triggered? Is it manually?

very soon, I hope, you'll be able to run them through command bot too. bot bench $ all something like this
But today/now yes - through gitlab manually

@athei
Copy link
Member

athei commented Feb 6, 2023

That is fine! Just wanted to make sure it is not run on every commit or something 😄

@athei
Copy link
Member

athei commented Feb 7, 2023

I pushed the weight results. Is it expected that the weights get worse by that much?

This also doesn't look promising:
Screenshot 2023-02-06 at 21 58 49

@alvicsam
Copy link
Contributor Author

alvicsam commented Feb 7, 2023

How would that pipeline be triggered? Is it manually?

Yes, the job is manual and available in every PR and commit to be run.

I pushed the weight results. Is it expected that the weights get worse by that much?

It can be, afaiu consistency matters. We can run benchmarks one more time and compare results of the runs on new runners. However, results on polkadot and cumulus were consistent.

@mordamax can we run only some of the benchmarks with the bot in substrate? Where bot will run the benchmark (I mean on which runner)?

This also doesn't look promising

@oleg-plakida can tell more about this benchmark. AFAIR the results weren't consistent and different glibc version could change the output

@athei
Copy link
Member

athei commented Feb 7, 2023

It can be, afaiu consistency matters.

So how do we continue here? This doesn't sound very confident.

We can run benchmarks one more time and compare results of the runs on new runners.

What good can come from this?

If it is the same numbers: Why does a supposedly faster machine perform 20% worse in many benchmarks?

If it is completely different numbers: The numbers are not consistent.

However, results on polkadot and cumulus were consistent.

That dosn't help us here in substrate. I suggest reverting the bot to the old runners until this is figured out. This is blocking all of my PRs.

@alvicsam alvicsam added A0-please_review Pull request needs code review. A0-please_review and removed A3-in_progress A0-please_review Pull request needs code review. labels Feb 7, 2023
@alvicsam
Copy link
Contributor Author

alvicsam commented Feb 7, 2023

So how do we continue here? This doesn't sound very confident.

Results are consistent in polkadot and cumulus. I see no reason why they won't be consistent here. If you have doubts I suggested options for confirming the stability of the results.

What good can come from this?

The machines are generated dynamically so several ppl can run benchmarks in the same time in different PRs. Also it's possible to parallelise benchmarks in the future.

If it is the same numbers: Why does a supposedly faster machine perform 20% worse in many benchmarks?

Because benchmarks run only on one core. And core frequency on new runners is lower than on the old ones.

@athei
Copy link
Member

athei commented Feb 7, 2023

If you have doubts I suggested options for confirming the stability of the results.

Okay let's re-run then. I tried to trigger. I hope I did it the right way:
https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2358232

What good can come from this?

The machines are generated dynamically so several ppl can run benchmarks in the same time in different PRs. Also it's possible to parallelise benchmarks in the future.

Okay this might be a misunderstanding. I didn't mean from the whole endeavour. I understand that bare metal machines are annoying for you to manage. I meant from re-running the benchmarks.

@mateo-moon
Copy link
Contributor

mateo-moon commented Feb 7, 2023

I pushed the weight results. Is it expected that the weights get worse by that much?

This also doesn't look promising: Screenshot 2023-02-06 at 21 58 49

I've ran tests with benchmark machine on different setups and machines. The results i got show that this particular benchnmarking approach isn't correct and tell nothing about real runtime performance of pallets. And there are 2 different evidence of this:

  1. I've got better results in benchmark machine but worse result for benchmark pallet for "AMD Milan" and vise versa for "Intel Ice Lake". So we can switch the machine CPU to "AMD Milan" and you will have results like in the screenshot, but performance of real test will be worse. In my opinion this is enough to say that benchmark machine results literally says nothing about final performance of the runtime node. But there is also second point:
  2. The performance of this result tigthly coupled with libc library version of the particular environment where node will be ran. I've got different results with different libc libraries. That means that results will be different from host to host unless there are completely identical. But this doesn't matter, because if result of the measure is dependent on some external parameter and you can't garant consistency of this parameter the result is also can't be trusted.

Screen Shot 2022-12-08 at 16 57 37

@athei
Copy link
Member

athei commented Feb 7, 2023

@oleg-plakida So I guess we should just remove this then?

@mateo-moon
Copy link
Contributor

@oleg-plakida So I guess we should just remove this then?

The command or our concern?

@athei
Copy link
Member

athei commented Feb 7, 2023

This micro benchmark which is not reflective of the actual performance we are interested in.

@mateo-moon
Copy link
Contributor

This micro benchmark which is not reflective of the actual performance we are interested in.

I would say that it would be nice to have benchmark like this, and i suppose that was the idea at the beginning, but we shouldn't trust it a lot right now at least until we bring it to the consistency. And i assume it's hard challenge. But your question is realy interesting. Does anyone really use this benchmark for node testing!?

@athei
Copy link
Member

athei commented Feb 7, 2023

Once the benchmarks are finished I will make a new PR into this one with them. This way we can use the weight UI to make a comparison. Please don't commit them to this PR.

@mateo-moon
Copy link
Contributor

But i suppose consistency the only matters for us. The setup which is used for measuring and not the performance of the setup. As long as we compare result produced in the the reference set up we can measure code performance.

@athei
Copy link
Member

athei commented Feb 8, 2023

Committed the new weights here: #13336

pallet-contracts just went up by 50% for some benchmarks. This never happened on the old machines.

@ggwpez
Copy link
Member

ggwpez commented Feb 8, 2023

The benchmark machine is not updated yet for ths new specs, so it will report as failing: #13317

The single-threaded CPU speed is expected to be slower than old ref hardware because it is using cloud VM which have server CPUs.
With faster disk speed we should get a bit cheaper read&writes to make up for it.

Our current goal is indeed to have consistent results, so lets re-run them a few times.

@alvicsam alvicsam mentioned this pull request Feb 8, 2023
@athei
Copy link
Member

athei commented Feb 8, 2023

@ggwpez Have you checked the link above? We ran gitlab-update_substrate_weights twice. We should get consistent results, right?

@ggwpez
Copy link
Member

ggwpez commented Feb 8, 2023

@ggwpez Have you checked the link above? We ran gitlab-update_substrate_weights twice. We should get consistent results, right?

Ah, thought you compared against master 🤦‍♂️
Going to double-check the consistency on old bm2.

@ggwpez ggwpez mentioned this pull request Feb 8, 2023
@ggwpez
Copy link
Member

ggwpez commented Feb 8, 2023

I re-run the worst offender instr_i64add a dozen times on bm2 and it always fluctuated between 0.9 - 1.1 µs base and consistent 756 ns component.
The change was probably from some rust version update or other change, so the new weights for that one are good.
We could also re-run all on bm2 bm* with the env overwrite.

@athei
Copy link
Member

athei commented Feb 8, 2023

I re-run the worst offender instr_i64add a dozen times on bm2 and it always fluctuated between 0.9 - 1.1 µs base and consistent 756 ns component.
The change was probably from some rust version update or other change, so the new weights for that one are good.
We could also re-run all on bm2 bm* with the env overwrite.

Sorry I am confused now. There was no version update between the two runs we did in this PR. You think we should merge this PR as-is?

@ggwpez
Copy link
Member

ggwpez commented Feb 8, 2023

Sorry I am confused now. There was no version update between the two runs we did in this PR. You think we should merge this PR as-is?

I tested it on master, so there was an update since then which now pollutes these results. But that wont explain inconsistencies here, yea.
Can you use #13336 (comment) in the meantime?
Then we can still re-run it a few times just to be sure.

@athei
Copy link
Member

athei commented Feb 8, 2023

I tested it on master, so there was an update since then which now pollutes these results.

Update to rustc? Wasmi is sometimes really senstive to those. It relies on some things being a tail call and rust gives no guarentee for that :(. @Robbepop Is that still the case. I remember faintly that you found a workaround for this.

Can you use #13336 (comment) in the meantime?

Yeah. @alvicsam is trying to get it running here: #13268

@Robbepop
Copy link
Contributor

Robbepop commented Feb 8, 2023

Update to rustc? Wasmi is sometimes really senstive to those. It relies on some things being a tail call and rust gives no guarentee for that :(. @Robbepop Is that still the case. I remember faintly that you found a workaround for this.

Yes, wasmi is still super sensitive to changes such as used rustc and LLVM versions. Unfortunately it is not possible to fix this reliably without Rust support for guaranteed tail calls which are not a thing in Rust, yet. Until then the best we can hope for is that future updates won't degrade the performance without a way to fix it. Fingers crossed.

The good thing is that the Rust devs take performance regressions very seriously. The bad thing is that so far I always had to dig out myself when they popped up. Rustc issues are still open and nobody working on them although having high priority.

@alvicsam
Copy link
Contributor Author

bot help

@command-bot
Copy link

command-bot bot commented Feb 10, 2023

Here's a link to docs

@the-right-joyce the-right-joyce added A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. and removed A0-please_review labels Feb 13, 2023
@alvicsam alvicsam closed this Feb 16, 2023
@athei athei deleted the as-weights-gcp branch March 10, 2023 11:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants