
Internal Dogfooding: Experimentation Feedback #7766

Closed
neilkakkar opened this issue Dec 17, 2021 · 8 comments
Labels
feature/experimentation Feature Tag: Experimentation

Comments

@neilkakkar
Contributor

Was playing around with Experiments, and had a couple of new ideas to throw out. (Things really change when you become a user rather than an implementer, haha.) Would've mentioned this in #7462, but that issue is getting too big, and these ideas can stand alone.

Threshold of Caring

I think I might be approaching the "threshold of caring" the wrong way. I've been biased by all the existing tools out there. (Thanks @clarkus for asking why sliders over input boxes - it helped me clarify what the core issue is.)

Hypothesis: Most users care about how long the experiment should run, and not about the % change they expect to see. A priori, fewer users have an idea in mind of how much of a change they want to see. Put another way, they're running the experiment to find the new conversion rate, not going in expecting, say, a 5% change.

That's why I wanted the slider: the precise value doesn't matter; what matters is the resulting number of days to run the experiment. Quickly changing the slider let me find a reasonable number of days, see the % change it implied, and confirm whether it's good enough.

How about we flip this? (I think this will make explaining the numbers easier).

You tell us how long you want to run the experiment, and we'll tell you the minimum % change it can detect. If you want to detect a smaller change in conversion, run it for longer. And vice versa.


Unsure if this is the way to go, but I'll bring it up in user interviews to see how they feel about it.
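
For illustration, a minimal sketch of the flipped calculation, assuming a two-sided two-proportion z-test power approximation. `daily_traffic`, the baseline conversion rate, and the numbers below are made up for the example, not product parameters:

```python
from scipy.stats import norm

def minimum_detectable_effect(days, daily_traffic, baseline_rate,
                              alpha=0.05, power=0.8):
    """Smallest absolute lift in conversion rate we could reliably detect
    if the experiment runs for `days` with a 50/50 control/test split.
    Uses the standard two-proportion z-test approximation."""
    n_per_variant = days * daily_traffic / 2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    p = baseline_rate
    return z * (2 * p * (1 - p) / n_per_variant) ** 0.5

# e.g. 14 days at ~500 visitors/day with a 10% baseline conversion
# -> roughly a 2 percentage point minimum detectable lift
print(minimum_detectable_effect(days=14, daily_traffic=500, baseline_rate=0.10))
```

More days means a smaller detectable lift, which is the "vice versa" above.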

Experiment Results Page

It would be cool to see the progress of an experiment: you set up parameters earlier, and we recommend a running time / sample size. But things change. Maybe your website suddenly got popular. Instead of sticking to the old recommendations, we should update them as more users enter the experiment: "We recommend continuing your experiment for X more days." Or something like: "256 / 1023 people enrolled in the experiment. Waiting for 767 more."

This ties into what @marcushyett-ph wanted on the list view. At minimum, we ought to show this on the experiment detail view.

All of this is possible given the data we have.
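
As a rough sketch of what that updated recommendation could look like (everything here is hypothetical: the function name, the linear extrapolation from the enrolment rate seen so far, and the numbers):

```python
from datetime import date

def experiment_progress(enrolled, target_sample, start_date, today):
    """Summarise progress and extrapolate how many more days are needed,
    assuming enrolment continues at the rate observed so far."""
    days_elapsed = max((today - start_date).days, 1)
    enrolment_rate = enrolled / days_elapsed          # people per day so far
    remaining = max(target_sample - enrolled, 0)
    days_left = remaining / enrolment_rate if enrolment_rate else float("inf")
    return {
        "enrolled": enrolled,
        "remaining": remaining,
        "recommended_days_left": round(days_left),
    }

# "256 / 1023 people enrolled" after one week -> roughly three more weeks
print(experiment_progress(256, 1023, start_date=date(2021, 12, 10), today=date(2021, 12, 17)))
```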


cc: @paolodamico @liyiy @clarkus for thoughts :)

neilkakkar added the discussion and feature/experimentation labels on Dec 17, 2021
@marcushyett-ph
Contributor

> You tell us how long you want to run the experiment, and we'll tell you the minimum % change it can detect. If you want to detect a smaller change in conversion, run it for longer. And vice versa.

I think this is a pretty neat idea - the time to run the experiment is so much more tangible.

@marcushyett-ph
Contributor

> It would be cool to see the progress of an experiment

This sounds cool - but we should be super conscious that this is likely to drive peeking, which is a bit of a risk https://gopractice.io/blog/peeking-problem/

@neilkakkar
Contributor Author

> this is likely to drive peeking

Interesting! I would've thought the reverse: If you don't know how many people have come already (i.e. you don't know how far into the experiment you are), you might be more tempted to look at current results.

vs. us telling you how much longer to wait. Anywho, something to keep in mind, thanks!

(Similar reasoning to how Uber > Taxi on hire: Uber tells you how long you need to wait, reducing uncertainty which people hate, vs. calling a cab that may or may not come)


Aside: The peeking problem isn't that big of a problem as long as you use Bayesian statistics: the results are always valid given the information you have so far.

But it looks like I might be oversimplifying: http://varianceexplained.org/r/bayesian-ab-testing/ - I need to wrap my head around this a bit more. I've so far glossed over the expected loss calculations 👀.
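
For reference, a minimal sketch of the Beta-Binomial approach discussed in that post: uniform Beta(1, 1) priors, with P(test beats control) and the expected loss of shipping the test variant estimated by Monte Carlo. The conversion counts are made up:

```python
import numpy as np

def bayesian_ab(conv_a, n_a, conv_b, n_b, samples=100_000, seed=0):
    """Posterior P(B beats A) and expected loss of shipping B,
    using Beta(1, 1) priors on each variant's conversion rate."""
    rng = np.random.default_rng(seed)
    p_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    p_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    prob_b_wins = (p_b > p_a).mean()
    # Expected loss: how much conversion we give up, on average,
    # if we ship B but A was actually better.
    expected_loss_b = np.maximum(p_a - p_b, 0).mean()
    return prob_b_wins, expected_loss_b

# made-up counts: 120/1000 conversions on control vs 138/1000 on test
print(bayesian_ab(conv_a=120, n_a=1000, conv_b=138, n_b=1000))
```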

@neilkakkar
Contributor Author

After a chat with Chris, we're going forward with idea #1!

With one minor caveat: instead of choosing the running time, users choose the sample size. This is because running time is a guess based on existing trends: if you get featured on HackerNews, that sends more people your way, which can reduce the running time.

We'll show an expected running time on the results page, but during creation, users select a sample size, which in turn determines the sensitivity of the experiment. (We're calling it sensitivity now, since that makes more sense than "threshold of caring": it's the minimum % change we can detect confidently.)


re: idea #2 - I've yet to dive deeper into peeking problems with Bayesian testing.

One big thing to keep in mind: peeking is a problem only when people act on the information they've peeked at. We can, and should, leverage our designs so that if we think users are making a mistake by ending an experiment early, we can tell them!

@neilkakkar
Contributor Author

Did some more research into the peeking problem, and have some fun results to share.

  1. Taking action on incomplete experiments (a.k.a. the peeking problem) is indeed a problem, but only in specific cases.
  2. These specific cases are: when the sample size is small and the test & control conversion rates are close together. Basically, we can retrospectively calculate whether it's okay to end the experiment right now.

The big thing is that this is calculable. (We're just doing the sample size determination retrospectively. That isn't a problem, because there's no notion of a priori decisions with Bayesian testing: it's the same formula whether you calculate it at the beginning of an experiment to get an estimate or in the middle to judge progress, because it doesn't depend on any parameters that change during the experiment.) What's usually a problem is calculating p-values / expected loss mid-experiment: those depend on the current state of the experiment.
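
A sketch of what that retrospective check could look like: plug the conversion rates observed so far into the usual two-proportion sample-size formula and compare against the sample collected. The function name and threshold logic are illustrative, not the shipped implementation:

```python
from scipy.stats import norm

def safe_to_stop(control_rate, test_rate, n_per_variant, alpha=0.05, power=0.8):
    """Retrospective sample-size check: is the current sample already big
    enough to reliably detect a difference of the size observed so far?"""
    effect = abs(test_rate - control_rate)
    if effect == 0:
        return False  # no observed difference yet; keep collecting data
    p_bar = (control_rate + test_rate) / 2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    required_n = 2 * p_bar * (1 - p_bar) * (z / effect) ** 2
    return n_per_variant >= required_n

# small sample + conversion rates close together -> not safe to end yet
print(safe_to_stop(control_rate=0.10, test_rate=0.11, n_per_variant=800))  # False
```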

So, we can always show experiment progress, and whenever users want to end the experiment early, we can warn them: either we're confident the test is good to go, or we're not sure and they should wait longer.


Would appreciate pushback on this, if any.

@marcushyett-ph
Contributor

Not related to peeking, but I have to admit I really enjoy reading your updates @neilkakkar. Every time I see a notification on this thread (and many others), I know I'm guaranteed to learn something new and interesting.

@paolodamico
Contributor

Based on your latest update @neilkakkar, I guess we'll have a clear indicator in the UI to tell users whether it's safe to end the experiment, right? I think with clear UI around this we can make sure users don't take premature action on inconclusive data.

@neilkakkar
Contributor Author

Correct, makes sense.
