Performance: Use median instead of average to stabilize First Block metric #54157
Conversation
I think it'd be good to figure out the reason for the outliers. What's different from the Puppeteer values? Other than that, I'm fine with moving to "median", but I'm not sure about doing it only for a single metric while leaving the rest using "average".
Flaky tests detected in 857438c. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/6075034739
Yep, I was planning to continue investigating. The difference is that we now load each page in a new, isolated browser context. With Puppeteer, we were using a single/default context, so there was likely some sort of caching going on.
To be clear, it's not the only metric that we're currently using the median for - we also do that for TTFB and LCP. I'm not sure why those metrics were kept being calculated via the median (maybe just an oversight), but from my current observations, I'd be keen to move all the loading-related metrics to the median.
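To illustrate the difference described above (a minimal sketch with illustrative names, not the actual performance-test utilities in this repo): with Playwright each sample runs in a freshly created, isolated browser context, so nothing is cached between page loads, whereas Puppeteer's single default context let later samples benefit from warm caches.

```js
// Sketch only: per-sample isolation vs. a shared context. Names are
// illustrative and not taken from the Gutenberg performance tests.
const { chromium } = require( 'playwright' );

async function collectLoadSamples( url, rounds ) {
	const browser = await chromium.launch();
	const samples = [];
	for ( let i = 0; i < rounds; i++ ) {
		// New isolated context per sample: no shared cache, cookies, or storage.
		const context = await browser.newContext();
		const page = await context.newPage();
		const start = Date.now();
		await page.goto( url );
		samples.push( Date.now() - start );
		await context.close();
	}
	await browser.close();
	return samples;
}
// With Puppeteer we previously took every sample in the single default
// context, so later page loads could hit warmed-up caches and look faster.
```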
This is more reason why distributions are going to be more helpful for us. The numbers on my latest PR were wildly different from each other for many metrics, when I know they should all have been identical because the change didn't impact any JS. These aren't small differences either - some are around 30%+.
I still think `min` is more appropriate for some tests, as we are theoretically looking at what impact the code might have on runtime performance, not at environmental factors, and we know that the minimum is the fastest it can get based on the given code.
This should all get better anyway if we do significantly more test rounds, as the numbers should converge. What about setting up a daily task that runs the tests something like 30 times (or as many as we can fit inside GitHub's six-hour runner limit)? That has come up before as a way to get better measurements.
It would lose the close association with PRs, but honestly, our numbers have consistently struggled to be reliable on a small scale, so it's arguable whether this connection even exists today. If we ran daily runs with better statistical grouping, we might have a better chance at identifying precisely when performance improves or degrades.
I'm personally still concerned that there's a change we don't understand here and that the median might be hiding real issues. If we know for sure that it's caching, maybe we can ignore the first run or something like that, but at least we'd know what we're doing. The details are super important for performance tests.
I will keep investigating this and try to figure out what causes those spikes in CI (I still can't repro locally). We're already filtering out the first sample for metrics where we identified it as an outlier, but for the FBL metric those spikes occur at (seemingly) random samples, so we can't really do that. That said, the fact that I cannot repro locally makes me think this isn't a real issue, only something specific to the CI environment.
If it were specific to the CI environment, why wasn't it happening when we were using Puppeteer?
As I mentioned above, Puppeteer's stability for the loading metrics was likely related to reusing the browser context for each sample, which we're not doing currently with Playwright.
@youknowriad, looks like I was wrong - there was a real issue 😄 #54188
What?
The First Block Loaded metric has been unstable since the switch to Playwright. From https://www.codevitals.run/project/gutenberg:
After looking into the raw values of the metrics, I found some spikes distorting the otherwise stable values. To address that, let's start by calculating the median instead of the average to filter out those outliers.
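As a rough illustration of why the median helps here (a sketch with made-up sample values, not the actual metric aggregation code), a single CI spike drags the average up noticeably while the median stays close to the typical run:

```js
// Illustrative only: how one spike distorts the average but not the median.
function average( values ) {
	return values.reduce( ( sum, value ) => sum + value, 0 ) / values.length;
}

function median( values ) {
	const sorted = [ ...values ].sort( ( a, b ) => a - b );
	const middle = Math.floor( sorted.length / 2 );
	return sorted.length % 2
		? sorted[ middle ]
		: ( sorted[ middle - 1 ] + sorted[ middle ] ) / 2;
}

// Hypothetical First Block Loaded samples (ms) with one random CI spike.
const samples = [ 120, 118, 122, 119, 480, 121 ];
console.log( average( samples ) ); // 180 - skewed by the single outlier.
console.log( median( samples ) ); // 120.5 - reflects the typical sample.
// Math.min( ...samples ) would give the theoretical best case mentioned above.
```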