docs(lantern): update accuracy data #4180

Merged
merged 2 commits into from
Jan 5, 2018
33 changes: 16 additions & 17 deletions docs/lantern.md
@@ -6,44 +6,43 @@ Project Lantern is an ongoing effort to reduce the run time of Lighthouse and im

## Accuracy

All of the following accuracy stats are reported on a set of 1500 URLs sampled from the Alexa top 1000, HTTPArchive dataset, and miscellaneous ad landing pages. Trace and load data were collected for *a single run* in one environment and compared to the trace and load data of *a single run* in a second environment. Some natural variation is expected and is captured by the reference stats in the table below. The most errant 10% of observations were excluded from all comparisons as outliers. For more on the methodology and reasoning, see the [Lantern design doc](https://docs.google.com/a/chromium.org/document/d/1pHEjtQjeycMoFOtheLfFjqzggY8VvNaIRfjC7IgNLq0/edit?usp=sharing).
All of the following accuracy stats are reported on a set of 300 URLs sampled from the Alexa top 1000, HTTPArchive dataset, and miscellaneous ad landing pages. Median was collected for *9 runs* in one environment and compared to the median of *9 runs* in a second environment.

Stats were collected using the [trace-evaluation](https://github.com/patrickhulce/lighthouse-trace-evaluations) scripts. Table cells contain [Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) and [MAPE](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) for the respective metric.
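
As a rough illustration of what each table cell reports, the two statistics can be sketched in plain Python. This is not the trace-evaluation scripts' implementation; the function names and the sample values are hypothetical, chosen only to show how a `rho : MAPE` pair is derived from predicted and observed metric values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mape(predicted, observed):
    """Mean absolute percentage error of predictions, in percent."""
    return 100 * mean(abs(p - o) / o for p, o in zip(predicted, observed))


# Hypothetical metric values in milliseconds for four sites (not real data).
predicted_ms = [1200, 2500, 1800, 4000]
observed_ms = [1000, 2000, 2400, 5000]
print(f"{spearman_rho(predicted_ms, observed_ms):.3f} : {mape(predicted_ms, observed_ms):.1f}%")
```

Rho only rewards getting the ordering of sites right, while MAPE penalizes the size of each miss, which is why the tables report both.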

### Accuracy Stats
### Lantern Accuracy Stats
| Comparison | FCP | FMP | TTI |
| -- | -- | -- | -- |
| Lantern predicting Default LH | .850 : 19.6% | .866 : 21.0% | .907 : 26.9% |
| Lantern predicting LH on WPT | .764 : 34.4% | .795 : 32.5% | .879 : 33.1% |
| Lantern w/adjusted settings<sup>1</sup> predicting LH on WPT | .769 : 32.9% | .808 : 31.1% | .879 : 32.6% |
| Lantern predicting Default LH | .811 : 23.1% | .811 : 23.6% | .869 : 42.5% |
| Lantern predicting LH on WPT | .785 : 28.3% | .761 : 33.7% | .854 : 45.4% |

### Reference Stats
| Comparison | FCP | FMP | TTI |
| -- | -- | -- | -- |
| Unthrottled LH correlation with Unthrottled LH<sup>2</sup> | .881 : 30.8% | .860 : 30.0% | .845 : 36.5% |
| WPT correlation with WPT | .805 : 28.7% | .823 : 30.57% | .795 : 43.7% |
| Default LH correlation with LH on WPT<sup>2</sup> | .808 : 30.0% | .818 : 31.3% | .819 : 39.5% |
| Unthrottled LH correlation with LH on WPT | .643 : 36.3% | .625 : 40.1% | .731 : 58.4% |

<sup>1</sup> 320 ms RTT, 1.3 mbps, 5x CPU

<sup>2</sup> Two trace sets were captured several weeks apart, so some site changes may have occurred that skew these stats
| Unthrottled LH predicting Default LH | .738 : 27.1% | .694 : 33.8% | .743 : 62.0% |
| Unthrottled LH predicting WPT | .691 : 33.8% | .635 : 33.7% | .712 : 66.4% |
| Default LH predicting WPT | .855 : 22.3% | .813 : 27.0% | .889 : 32.3% |

## Conclusions

### Lantern Accuracy Conclusions
We conclude that Lantern is ~6-13% more inaccurate than DevTools throttling. When evaluating rank performance, Lantern achieves correlations within ~.04-.07 of DevTools throttling.

* For the single view use case, we conclude that Lantern is roughly as accurate at predicting the rank of a website the next time you visit it as the metrics themselves. That is to say, the average error we observe between a Lantern performance score and a LH on DevTools performance score is within the expectation for standard deviation, which is the highest goal we set out to achieve. As a sanity check, we also see that using the unthrottled metrics to predict throttled performance has a significantly lower correlation than Lantern does.
* For the repeat view use case, we require more data to reach a conclusion, but the high correlation of the single view use case suggests the accuracy meets our correlation requirements even if some sites may diverge.
* For the single view use case, our original conclusion that Lantern's inaccuracy is roughly equal to the inaccuracy introduced by expected variance seems to hold. The standard deviation of single observations from DevTools throttling is ~9-13%, and given Lantern's much lower variance, single observations from Lantern are not significantly more inaccurate on average than single observations from DevTools throttling.
* For the repeat view use case, we can conclude that Lantern is systematically off by ~6-13% more than DevTools throttling.

### Metric Variability Conclusions
The reference stats demonstrate a high degree of variability in the user-centric metrics and strengthen the position that every load is just one observation drawn from a distribution. To understand the entire experience, multiple draws must be taken, i.e. multiple runs are needed to have sufficiently small error bounds on the median load experience.

## Future Work
Conducting this same analysis with a 3/5/9/21 run dataset blocks much of the future work here. Future investments in Lantern accuracy would be ill-spent without this larger dataset to validate their efficacy.
The current sizes of the confidence intervals for DevTools throttled performance scores are as follows.

* 95% confidence interval for **1-run** of site at median: 50 **+/- 15** = 35-65
* 95% confidence interval for **3-runs** of site at median: 50 **+/- 11** = 39-61
* 95% confidence interval for **5-runs** of site at median: 50 **+/- 8** = 42-58
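
The narrowing of these intervals with more runs can be illustrated with a small simulation. This is a sketch under stated assumptions, not the analysis behind the figures above: it assumes run-to-run noise in the score is normally distributed, and it derives a hypothetical per-run standard deviation (`SIGMA`) from the 1-run interval as 15 / 1.96. Because real score distributions need not be normal, the 3-run and 5-run outputs are not expected to match the listed values exactly.

```python
import random
from statistics import median

# Assumption: per-run standard deviation implied by the 1-run interval
# (+/- 15 at 95%) if the noise were normally distributed.
SIGMA = 15 / 1.96


def median_interval_half_width(runs, sigma=SIGMA, trials=20000, seed=0):
    """Estimate the 95% half-width of the median of `runs` noisy scores."""
    rng = random.Random(seed)
    # Simulate many sessions; each session takes the median of `runs` draws.
    medians = sorted(
        median(rng.gauss(0.0, sigma) for _ in range(runs))
        for _ in range(trials)
    )
    lo = medians[int(0.025 * trials)]
    hi = medians[int(0.975 * trials)]
    return (hi - lo) / 2


for n in (1, 3, 5):
    print(f"{n} run(s): 50 +/- {median_interval_half_width(n):.1f}")
```

Taking the median of several runs shrinks the interval, but with diminishing returns per additional run.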

## Links

* [Lighthouse Variability and Accuracy Analysis](https://docs.google.com/document/d/1BqtL-nG53rxWOI5RO0pItSRPowZVnYJ_gBEQCJ5EeUE/edit?usp=sharing)
* [Lantern Deck](https://docs.google.com/presentation/d/1EsuNICCm6uhrR2PLNaI5hNkJ-q-8Mv592kwHmnf4c6U/edit?usp=sharing)
* [Lantern Design Doc](https://docs.google.com/a/chromium.org/document/d/1pHEjtQjeycMoFOtheLfFjqzggY8VvNaIRfjC7IgNLq0/edit?usp=sharing)
* [WPT Trace Data Set Half 1](https://drive.google.com/open?id=1Y_duiiJVljzIEaYWEmiTqKQFUBFWbKVZ) (access on request)