Internal Dogfooding: Experimentation Feedback #7766
I think this is a pretty neat idea - the time to run the experiment is so much more tangible.
This sounds cool - but we should be super conscious that this is likely to drive peeking, which is a bit of a risk: https://gopractice.io/blog/peeking-problem/
Interesting! I would've thought the reverse: if you don't know how many people have come through already (i.e. you don't know how far into the experiment you are), you might be more tempted to look at current results, vs. us telling you how much longer to wait. Anywho, something to keep in mind, thanks! (Similar reasoning to why Uber > taxi on hire: Uber tells you how long you need to wait, reducing the uncertainty people hate, vs. calling a cab that may or may not come.)

Aside: the peeking problem isn't that big of a problem as long as you use Bayesian statistics: the results are always valid given the information you have so far. But it looks like I might be oversimplifying: http://varianceexplained.org/r/bayesian-ab-testing/ - I need to wrap my head a bit more around this. I've so far glossed over the expected loss calculations 👀.
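For reference, a minimal sketch of the kind of expected-loss calculation the linked post talks about - illustrative only, not PostHog's actual implementation; the Beta(1, 1) priors and function name are assumptions:

```python
# Illustrative sketch of a Bayesian A/B comparison with expected loss.
# Assumes Beta(1, 1) priors on each variant's conversion rate and uses
# Monte Carlo draws from the posteriors.
import numpy as np

def expected_loss(successes_a, total_a, successes_b, total_b, draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Posterior over each variant's conversion rate: Beta(1 + successes, 1 + failures)
    post_a = rng.beta(1 + successes_a, 1 + total_a - successes_a, draws)
    post_b = rng.beta(1 + successes_b, 1 + total_b - successes_b, draws)
    prob_b_beats_a = (post_b > post_a).mean()
    # Expected loss of shipping B: how much conversion we give up, on average,
    # in the scenarios where A was actually better.
    loss_if_ship_b = np.maximum(post_a - post_b, 0).mean()
    return prob_b_beats_a, loss_if_ship_b

print(expected_loss(100, 1000, 120, 1000))
```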
After a chat with Chris, we're going forward with idea #1! With one minor caveat: instead of choosing the running time, they choose the sample size. This is because running time is a guess based on existing trends: if you get featured on Hacker News, that sends more people your way, which can reduce the running time. We'll show an expected running time on the results page, but during creation they select a sample size, which in turn determines the sensitivity of the experiment. (We're calling it sensitivity now, since that makes more sense than "threshold of caring": it's the minimum % change we can confidently detect.)

re: idea #2 - I'm yet to dive deeper into peeking problems with Bayesian testing. One big thing to keep in mind: peeking is a problem only when people act on the information they've peeked at. We can, and should, leverage our designs so that if we think users are making a mistake by ending an experiment early, we can tell them!
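For concreteness, a rough sketch of how a chosen sample size could map to sensitivity, using a standard two-proportion approximation (~95% confidence, ~80% power) - this is illustrative and not necessarily the exact formula the product uses:

```python
# Rough sketch: the minimum absolute / relative change a given sample size
# can detect, using a standard two-proportion approximation. Not necessarily
# the exact formula used in the product.
from math import sqrt

def minimum_detectable_change(sample_size_per_variant, baseline_rate,
                              z_alpha=1.96, z_beta=0.84):
    # Smallest absolute lift in conversion rate detectable at ~95% confidence
    # with ~80% power, assuming both variants start near the baseline rate.
    absolute = (z_alpha + z_beta) * sqrt(
        2 * baseline_rate * (1 - baseline_rate) / sample_size_per_variant
    )
    relative = absolute / baseline_rate
    return absolute, relative

# e.g. 5,000 users per variant with a 10% baseline conversion rate
print(minimum_detectable_change(5_000, 0.10))
```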
Did some more research into the peeking problem, and have some fun results to share.
The big thing is that this is calculable. (We're just doing the sample size determination retrospectively. This isn't a problem, because there's no notion of a priori decisions with Bayesian testing: it's the same formula, and whether you calculate it at the beginning of an experiment to get an estimate, or in the middle to judge progress, it's okay, because it doesn't depend on any parameters that change during the experiment.)

What's usually a problem is calculating p-values / expected loss in the middle, since these depend on the current state of the experiment. So, we can always show experiment progress, and whenever users want to end the experiment early, we can warn them: either we're confident the test is good to go, or we're not sure and you should wait longer. Would appreciate pushback on this, if any.
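A simplified sketch of that "warn before ending early" check - since the recommended sample size is fixed at creation, progress is just a ratio; the function name and wording here are illustrative:

```python
# Simplified sketch of the "warn before ending early" idea. The recommended
# sample size is fixed when the experiment is created, so progress never
# depends on the results observed so far. Thresholds and copy are illustrative.
def early_stop_message(enrolled: int, recommended_sample_size: int) -> str:
    progress = enrolled / recommended_sample_size
    if progress >= 1.0:
        return "Recommended sample size reached; results should be reliable."
    return (f"Only {progress:.0%} of the recommended sample size has been reached. "
            "Ending now risks acting on an underpowered result; consider waiting.")

print(early_stop_message(256, 1023))
```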
Not related to peeking, but I have to admit I really enjoy reading your updates @neilkakkar. Every time I see a notification on this thread (and many others) I know I'm guaranteed to learn something new and interesting.
Based on your latest update @neilkakkar, I guess we'll have a clear indicator in the UI to tell users whether it's safe to end the experiment, right? I think with clear UI around this we can make sure users don't take premature actions on inconclusive data.
Correct, makes sense.
Was playing around with Experiments, and had a couple of new ideas to throw out. (Things really change when you become a user rather than an implementer, haha.) Would've mentioned this in #7462, but that issue is getting too big, and these ideas can stand alone.
Threshold of Caring
I think I might be approaching the "threshold of caring" the wrong way; I've been biased by all the existing tools out there. (Thanks @clarkus for asking why sliders over input boxes - it helped me clarify what the core issue is.)
Hypothesis: most users care about how long the experiment should run, not about the % change they expect to see. A priori, fewer users have an idea in mind of how much of a change they want to see. Put another way, they're running the experiment to find the new conversion rate, not going in expecting, say, a 5% change.
That's why I wanted the slider: the precise value doesn't matter; what matters is the resulting number of days to run the experiment. Quickly changing the slider let me find a reasonable number of days, see the % change it implied, and confirm whether it was good enough.
How about we flip this? (I think this will make explaining the numbers easier).
You tell us how long you want to run the experiment, and we'll tell you the minimum % change this experiment can detect. If you want to detect a smaller change in conversion, run it for longer, and vice versa.
Unsure if this is the way to go, but I'll bring it up in user interviews to see how they feel about it.
Experiment Results Page
It would be cool to see the progress of an experiment: you set up parameters earlier, and we recommend a running time / sample size. But things change - maybe your website suddenly got popular. Instead of sticking to the old recommendations, as more users enter the experiment we should update them: "We recommend continuing your experiment for X more days", or something like "256 / 1023 people enrolled in the experiment. Waiting for 767 more."
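One possible way to compute the "X more days" recommendation from the observed enrollment rate - the function name and the 4-day figure are illustrative assumptions, not how the product actually does it:

```python
# Illustrative only: re-estimate how many more days are needed by projecting
# the current enrollment rate forward, rather than sticking to the traffic
# estimate made at creation time.
import math

def days_remaining(enrolled: int, recommended_sample_size: int, days_elapsed: float) -> int:
    still_needed = max(recommended_sample_size - enrolled, 0)
    if still_needed == 0:
        return 0
    observed_rate = enrolled / max(days_elapsed, 1e-9)  # participants per day so far
    return math.ceil(still_needed / observed_rate)

# "256 / 1023 people enrolled" after 4 days -> roughly 12 more days at this pace
print(days_remaining(256, 1023, days_elapsed=4))
```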
This ties into what @marcushyett-ph wanted on the list view. At minimum, we ought to show this on the experiment detail view.
All of this is possible given the data we have.
cc: @paolodamico @liyiy @clarkus for thoughts :)