Internal Dogfooding: Experimentation Feedback #7766
I think this is a pretty neat idea - the time to run the experiment is so much more tangible.
This sounds cool - but we should be super conscious that this is likely to drive peeking, which is a bit of a risk: https://gopractice.io/blog/peeking-problem/
Interesting! I would've thought the reverse: if you don't know how many people have come through already (i.e. you don't know how far into the experiment you are), you might be more tempted to look at current results, vs. us telling you how much longer to wait. Anywho, something to keep in mind, thanks! (Similar reasoning to why Uber > taxi on hire: Uber tells you how long you need to wait, reducing the uncertainty people hate, vs. calling a cab that may or may not come.)

Aside: the peeking problem isn't that big of a problem as long as you use Bayesian statistics: the results are always valid given the information you have so far. But it looks like I might be oversimplifying: http://varianceexplained.org/r/bayesian-ab-testing/ - I need to wrap my head a bit more around this. I've so far glossed over the expected loss calculations 👀.
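For reference, a minimal sketch of the kind of expected-loss calculation the linked post talks about - illustrative only, not PostHog's actual implementation; the Beta(1, 1) priors and function name are assumptions:

```python
# Illustrative sketch of a Bayesian A/B comparison with expected loss.
# Assumes Beta(1, 1) priors on each variant's conversion rate and uses
# Monte Carlo draws from the posteriors.
import numpy as np

def expected_loss(successes_a, total_a, successes_b, total_b, draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Posterior over each variant's conversion rate: Beta(1 + successes, 1 + failures)
    post_a = rng.beta(1 + successes_a, 1 + total_a - successes_a, draws)
    post_b = rng.beta(1 + successes_b, 1 + total_b - successes_b, draws)
    prob_b_beats_a = (post_b > post_a).mean()
    # Expected loss of shipping B: how much conversion we give up, on average,
    # in the scenarios where A was actually better.
    loss_if_ship_b = np.maximum(post_a - post_b, 0).mean()
    return prob_b_beats_a, loss_if_ship_b

print(expected_loss(100, 1000, 120, 1000))
```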
After a chat with Chris, we're going forward with idea #1! With one minor caveat: instead of choosing the running time, they choose the sample size. This is because running time is a guess based on existing trends: if you get featured on Hacker News, that sends more people your way, which can reduce the running time. We'll show an expected running time on the results page, but during creation they select a sample size, which in turn determines the sensitivity of the experiment. (We're calling it sensitivity now, since that makes more sense than "threshold of caring": it's the minimum % change we can confidently detect.)

re: idea #2 - I'm yet to dive deeper into peeking problems with Bayesian testing. One big thing to keep in mind: peeking is a problem only when people act on the information they've peeked at. We can, and should, leverage our designs so that if we think users are making a mistake by ending an experiment early, we can tell them!
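For concreteness, a rough sketch of how a chosen sample size could map to sensitivity, using a standard two-proportion approximation (~95% confidence, ~80% power) - this is illustrative and not necessarily the exact formula the product uses:

```python
# Rough sketch: the minimum absolute / relative change a given sample size
# can detect, using a standard two-proportion approximation. Not necessarily
# the exact formula used in the product.
from math import sqrt

def minimum_detectable_change(sample_size_per_variant, baseline_rate,
                              z_alpha=1.96, z_beta=0.84):
    # Smallest absolute lift in conversion rate detectable at ~95% confidence
    # with ~80% power, assuming both variants start near the baseline rate.
    absolute = (z_alpha + z_beta) * sqrt(
        2 * baseline_rate * (1 - baseline_rate) / sample_size_per_variant
    )
    relative = absolute / baseline_rate
    return absolute, relative

# e.g. 5,000 users per variant with a 10% baseline conversion rate
print(minimum_detectable_change(5_000, 0.10))
```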
Did some more research into the peeking problem, and have some fun results to share.
The big thing is that this is calculable. (We're just doing the sample size determination retrospectively. This isn't a problem, because there's no notion of a priori decisions with Bayesian testing: it's the same formula, and whether you calculate it at the beginning of an experiment to get an estimate, or in the middle to judge progress, it's okay, because it doesn't depend on any parameters that change during the experiment.)

What's usually a problem is calculating p-values / expected loss in the middle, since these depend on the current state of the experiment. So, we can always show experiment progress, and whenever users want to end the experiment early, we can warn them: either we're confident the test is good to go, or we're not sure and you should wait longer. Would appreciate pushback on this, if any.
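A simplified sketch of that "warn before ending early" check - since the recommended sample size is fixed at creation, progress is just a ratio; the function name and wording here are illustrative:

```python
# Simplified sketch of the "warn before ending early" idea. The recommended
# sample size is fixed when the experiment is created, so progress never
# depends on the results observed so far. Thresholds and copy are illustrative.
def early_stop_message(enrolled: int, recommended_sample_size: int) -> str:
    progress = enrolled / recommended_sample_size
    if progress >= 1.0:
        return "Recommended sample size reached; results should be reliable."
    return (f"Only {progress:.0%} of the recommended sample size has been reached. "
            "Ending now risks acting on an underpowered result; consider waiting.")

print(early_stop_message(256, 1023))
```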
Not related to peeking, but I have to admit I really enjoy reading your updates @neilkakkar. Every time I see a notification on this thread (and many others) I know I'm guaranteed to learn something new and interesting.
Based on your latest update @neilkakkar, I guess we'll have a clear indicator in the UI to tell users whether it's safe to end the experiment, right? I think with clear UI around this we can make sure users don't take premature actions on inconclusive data.
Correct, makes sense.
Was playing around with Experiments, and had a couple of new ideas to throw out. (Things really change when you become a user rather than an implementer, haha.) Would've mentioned this in #7462, but that issue is getting too big, and these ideas can stand alone.
Threshold of Caring
I think I might be approaching the "threshold of caring" the wrong way; I've been biased by all the existing tools out there. (Thanks @clarkus for asking why sliders over input boxes - it helped me clarify what the core issue is.)
Hypothesis: most users care about how long the experiment should run, not about the % change they expect to see. A priori, fewer users have an idea in mind of how much of a change they want to see. Put another way, they're running the experiment to find the new conversion rate, not going in expecting, say, a 5% change.
That's why I wanted the slider: the precise value doesn't matter; what matters is the resulting number of days to run the experiment. Quickly changing the slider let me find a reasonable number of days, see the % change it implied, and confirm whether it was good enough.
How about we flip this? (I think this will make explaining the numbers easier).
You tell us how long you want to run the experiment, and we'll tell you the minimum % change this experiment can detect. If you want to detect a smaller change in conversion, run it for longer, and vice versa.
Unsure if this is the way to go, but I'll bring it up in user interviews to see how they feel about it.
Experiment Results Page
It would be cool to see the progress of an experiment: you set up parameters earlier, and we recommend a running time / sample size. But things change - maybe your website suddenly got popular. Instead of sticking to the old recommendations, as more users enter the experiment we should update them: "We recommend continuing your experiment for X more days", or something like "256 / 1023 people enrolled in the experiment. Waiting for 767 more."
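One possible way to compute the "X more days" recommendation from the observed enrollment rate - the function name and the 4-day figure are illustrative assumptions, not how the product actually does it:

```python
# Illustrative only: re-estimate how many more days are needed by projecting
# the current enrollment rate forward, rather than sticking to the traffic
# estimate made at creation time.
import math

def days_remaining(enrolled: int, recommended_sample_size: int, days_elapsed: float) -> int:
    still_needed = max(recommended_sample_size - enrolled, 0)
    if still_needed == 0:
        return 0
    observed_rate = enrolled / max(days_elapsed, 1e-9)  # participants per day so far
    return math.ceil(still_needed / observed_rate)

# "256 / 1023 people enrolled" after 4 days -> roughly 12 more days at this pace
print(days_remaining(256, 1023, days_elapsed=4))
```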
This ties into what @marcushyett-ph wanted on the list view. At minimum, we ought to show this on the experiment detail view.
All of this is possible given the data we have.
cc: @paolodamico @liyiy @clarkus for thoughts :)