GoogleChrome · paulirish · Jan 5, 2018
diff --git a/docs/lantern.md b/docs/lantern.md
@@ -6,44 +6,43 @@ Project Lantern is an ongoing effort to reduce the run time of Lighthouse and im
 
 ## Accuracy
 
-All of the following accuracy stats are reported on a set of 1500 URLs sampled from the Alexa top 1000, HTTPArchive dataset, and miscellaneous ad landing pages. Trace and load data were collected for *a single run* in one environment and compared to the trace and load data of *a single run* in a second environment. Some natural variation is expected and is captured by the reference stats in the table below. The most errant 10% of observations were excluded from all comparisons as outliers. For more on the methodology and reasoning, see the [Lantern design doc](https://docs.google.com/a/chromium.org/document/d/1pHEjtQjeycMoFOtheLfFjqzggY8VvNaIRfjC7IgNLq0/edit?usp=sharing).
+All of the following accuracy stats are reported on a set of 300 URLs sampled from the Alexa top 1000, the HTTPArchive dataset, and miscellaneous ad landing pages. The median of *9 runs* in one environment was compared to the median of *9 runs* in a second environment.
 
 Stats were collected using the [trace-evaluation](https://github.com/patrickhulce/lighthouse-trace-evaluations) scripts. Table cells contain [Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) and [MAPE](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) for the respective metric.
 
-### Accuracy Stats
+### Lantern Accuracy Stats
 | Comparison | FCP | FMP | TTI |
 | -- | -- | -- | -- |
-| Lantern predicting Default LH | .850 : 19.6% | .866 : 21.0% | .907 : 26.9% |
-| Lantern predicting LH on WPT | .764 : 34.4% | .795 : 32.5% | .879 : 33.1% |
-| Lantern w/adjusted settings<sup>1</sup> predicting LH on WPT | .769 : 32.9% | .808 : 31.1% | .879 : 32.6% |
+| Lantern predicting Default LH | .811 : 23.1% | .811 : 23.6% | .869 : 42.5% |
+| Lantern predicting LH on WPT | .785 : 28.3% | .761 : 33.7% | .854 : 45.4% |
 
 ### Reference Stats
 | Comparison | FCP | FMP | TTI |
 | -- | -- | -- | -- |
-| Unthrottled LH correlation with Unthrottled LH<sup>2</sup> | .881 : 30.8% | .860 : 30.0% | .845 : 36.5% |
-| WPT correlation with WPT | .805 : 28.7% | .823 : 30.57% | .795 : 43.7% |
-| Default LH correlation with LH on WPT<sup>2</sup> | .808 : 30.0% | .818 : 31.3% | .819 : 39.5% |
-| Unthrottled LH correlation with LH on WPT | .643 : 36.3% | .625 : 40.1% | .731 : 58.4% |
-
-<sup>1</sup> 320 ms RTT, 1.3 mbps, 5x CPU
-
-<sup>2</sup> Two trace sets were captured several weeks apart, so some site changes may have occurred that skew these stats
+| Unthrottled LH predicting Default LH | .738 : 27.1% | .694 : 33.8% | .743 : 62.0% |
+| Unthrottled LH predicting WPT | .691 : 33.8% | .635 : 33.7% | .712 : 66.4% |
+| Default LH predicting WPT | .855 : 22.3% | .813 : 27.0% | .889 : 32.3% |
 
 ## Conclusions
 
 ### Lantern Accuracy Conclusions
+We conclude that Lantern is less accurate than DevTools throttling by ~6-13 percentage points of MAPE. When evaluating rank performance, Lantern achieves correlations within ~.04-.07 of DevTools throttling.
 
-* For the single view use case, we conclude that Lantern is roughly as accurate at predicting the rank of a website the next time you visit it as the metrics themselves. That is to say, the average error we observe between a Lantern performance score and a LH on DevTools performance score is within the expectation for standard deviation, which is the highest goal we set out to achieve. As a sanity check, we also see that using the unthrottled metrics to predict throttled performance has a significantly lower correlation than Lantern does.
-* For the repeat view use case, we require more data to reach a conclusion, but the high correlation of the single view use case suggests the accuracy meets our correlation requirements even if some sites may diverge.
+* For the single view use case, our original conclusion that Lantern's inaccuracy is roughly equal to the inaccuracy introduced by expected variance seems to hold. The standard deviation of single observations from DevTools throttling is ~9-13%, and given Lantern's much lower variance, single observations from Lantern are not significantly more inaccurate on average than single observations from DevTools throttling.
+* For the repeat view use case, we can conclude that Lantern is systematically off by ~6-13 percentage points more than DevTools throttling.
 
 ### Metric Variability Conclusions
 The reference stats demonstrate that there is high degree of variability with the user-centric metrics and strengthens the position that every load is just an observation of a point drawn from a distribution and to understand the entire experience, multiple draws must be taken, i.e. multiple runs are needed to have sufficiently small error bounds on the median load experience.
 
-## Future Work
-Conducting this same analysis with a 3/5/9/21 run dataset blocks much of the future work here. Future investments in Lantern accuracy would be ill-spent without this larger dataset to validate their efficacy.
+The current sizes of the confidence intervals for DevTools throttled performance scores are as follows.
+
+* 95% confidence interval for **1 run** of a site at the median: 50 **+/- 15** = 35-65
+* 95% confidence interval for **3 runs** of a site at the median: 50 **+/- 11** = 39-61
+* 95% confidence interval for **5 runs** of a site at the median: 50 **+/- 8** = 42-58
 
 ## Links
 
+* [Lighthouse Variability and Accuracy Analysis](https://docs.google.com/document/d/1BqtL-nG53rxWOI5RO0pItSRPowZVnYJ_gBEQCJ5EeUE/edit?usp=sharing)
 * [Lantern Deck](https://docs.google.com/presentation/d/1EsuNICCm6uhrR2PLNaI5hNkJ-q-8Mv592kwHmnf4c6U/edit?usp=sharing)
 * [Lantern Design Doc](https://docs.google.com/a/chromium.org/document/d/1pHEjtQjeycMoFOtheLfFjqzggY8VvNaIRfjC7IgNLq0/edit?usp=sharing)
 * [WPT Trace Data Set Half 1](https://drive.google.com/open?id=1Y_duiiJVljzIEaYWEmiTqKQFUBFWbKVZ) (access on request)