A/B testing is sometimes referred to as the “gold standard” for making data-informed decisions because it lets us infer causal relationships (this lift happened because of the thing we did!) with more rigor, thanks to a randomized control group to compare against.
Unfortunately, the rise in popularity of A/B testing has led to a lot of corner cutting. There are a lot of folks out there who just aren't doing it with the specialization and rigor it requires, especially when it comes to their data and analytics practices. This is arguably the place where you should be the MOST rigorous in your A/B testing and personalization programs; otherwise you risk untrustworthy results and serious harm to your KPIs.
This post breaks down a few of the concepts and unpacks surefoot's analytics methodology and framework.
There is always going to be uncertainty; it’s a fact of life (and testing). However, there are certain levers we can pull that reduce (or increase) the amount of uncertainty we bake into our tests. There are trade-offs, though, so we need to be thoughtful about how much uncertainty we’re comfortable with.
For example, you may choose to require a higher confidence level or more statistical power, or to accept a larger minimum detectable effect in exchange for a shorter test.
Confidence and power each guard against a different type of error. Here are the trade-offs to weigh when deciding on an appropriate level of confidence and power for a test.
Increasing the confidence level or power reduces the likelihood of a type 1 error (false positive) or a type 2 error (false negative), respectively. The trade-off is that higher confidence and power require a larger sample size and a longer test duration.
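To make that trade-off concrete, here’s a rough sketch (in Python, standard library only) of how the required sample size per variation grows as you demand more confidence and power. The 3% baseline conversion rate and 10% relative lift are illustrative assumptions, not numbers from a real test or from surefoot’s own calculator.

```python
from statistics import NormalDist

def sample_size_per_variation(baseline_cr, relative_mde, confidence, power):
    """Approximate visitors needed per variation for a two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)          # the lifted rate we want to detect
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for confidence (two-sided)
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2

# Illustrative: 3% baseline conversion rate, hoping to detect a 10% relative lift
for confidence, power in [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90)]:
    n = sample_size_per_variation(0.03, 0.10, confidence, power)
    print(f"confidence={confidence:.0%}, power={power:.0%} -> ~{n:,.0f} visitors per variation")
```

In this sketch, moving from 90% confidence / 80% power to 95% confidence / 90% power adds tens of thousands of visitors per variation, which is exactly the kind of cost you’re weighing against extra certainty.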
Confidence and power are directly related to statistical significance.
If v0 and v1 were the same, how surprised would you be to see this result? If a result is statistically significant, it means a change of that size would be surprising to see if there were truly no difference between variations. Those expectations depend on the sample size, standard deviation, and baseline conversion rate, and what we consider statistically significant depends on the confidence threshold and power we choose (it’s not a given!).
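As a bare-bones illustration of the “how surprised would you be?” question, here’s a two-sided two-proportion z-test. The conversion counts are made up; the p-value it prints gets compared against the threshold implied by your chosen confidence level (e.g., 0.05 for 95% confidence).

```python
from statistics import NormalDist

def p_value(conv_v0, visitors_v0, conv_v1, visitors_v1):
    """Two-sided p-value: how surprising is this gap if v0 and v1 are really the same?"""
    p0 = conv_v0 / visitors_v0
    p1 = conv_v1 / visitors_v1
    # Pool the rates under the assumption of no true difference
    p_pool = (conv_v0 + conv_v1) / (visitors_v0 + visitors_v1)
    se = (p_pool * (1 - p_pool) * (1 / visitors_v0 + 1 / visitors_v1)) ** 0.5
    z = (p1 - p0) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative numbers: v0 converts at 3.00%, v1 at 3.45%
p = p_value(300, 10_000, 345, 10_000)
print(f"p-value = {p:.3f}")  # not significant at 95% confidence if p > 0.05
```

With these made-up numbers the lift isn’t significant at 95% confidence, even though a 0.45-point jump might look exciting in a dashboard.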
Stopping conditions tell us how long we need to run a test in order to observe a statistically significant change (of a certain size). An important note is that we have to decide HOW BIG of a change we want to detect and HOW CONFIDENT we want to be BEFORE we launch the experiment.
Before we decide to launch a test, we estimate the baseline conversion rate and traffic, decide how big of a change we want to be able to detect, set the confidence and power we need, and calculate the sample size and duration required to meet the stopping conditions.
We want to run tests long enough to detect practically significant effects, but we don’t want to run tests so long that we’re wasting your time or ours.
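For example, here’s a simple sketch of turning a required sample size into a planned duration. The ~53,000 visitors per variation comes from the sample-size sketch above (3% baseline, 10% relative lift, 95% confidence, 80% power), and the daily traffic figure is an assumption for illustration.

```python
import math

def test_duration_days(n_per_variation, variations, eligible_visitors_per_day):
    """Days needed for every variation to reach its required sample size."""
    return math.ceil(n_per_variation * variations / eligible_visitors_per_day)

# ~53,000 per variation (from the sketch above), 2 variations,
# and an assumed 4,000 eligible visitors per day entering the test
days = test_duration_days(53_000, variations=2, eligible_visitors_per_day=4_000)
print(f"Plan for roughly {days} days; round up to full weeks to cover weekly traffic cycles.")
```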
Peeking refers to looking at results before a test has met the predetermined stopping conditions and making decisions based on those results. We know it’s difficult, but we highly discourage peeking! We monitor test results regularly to ensure quality. If things are really bad (or good), we’ll occasionally call a test early using a sequential testing approach. But it’s usually best to wait until a test has met its stopping conditions to truly make a data-informed decision. We are very thoughtful about how we set stopping conditions, and following the process gives us the greatest chance of observing meaningful and reliable results.
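To show why we discourage peeking, here’s a small A/A simulation (our own illustration, not surefoot’s production methodology): both “variations” share the same true conversion rate, yet declaring a winner the first time any of five interim looks crosses p < 0.05 produces far more false positives than a single look at the planned stopping point.

```python
import random
from statistics import NormalDist

def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test at the given alpha."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    if p_pool in (0, 1):
        return False
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(z)) < alpha

random.seed(7)
cr, n_final, looks, trials = 0.03, 5_000, 5, 500   # illustrative scale
checkpoints = {n_final * (i + 1) // looks for i in range(looks)}
peeking_calls = final_calls = 0

for _ in range(trials):
    conv_a = conv_b = 0
    hits = []
    for n in range(1, n_final + 1):
        conv_a += random.random() < cr   # both arms have the same true rate
        conv_b += random.random() < cr
        if n in checkpoints:
            hits.append(is_significant(conv_a, n, conv_b, n))
    peeking_calls += any(hits)   # stop at the first "significant" interim look
    final_calls += hits[-1]      # only evaluate at the planned stopping point

print(f"False positives with peeking:  {peeking_calls / trials:.1%}")
print(f"False positives waiting:       {final_calls / trials:.1%}")
```

In runs like this, the peeking strategy’s false-positive rate typically lands at roughly two to three times the nominal 5%, which is exactly the inflation that sequential testing methods are designed to keep under control.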
If you don’t expect to see a lift and just want to make sure a change isn’t hurting anything, we can use different statistical methods and run a shorter test.
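As one hedged example of what “different statistical methods” can look like, here’s a one-sided non-inferiority check: instead of asking whether the variation wins, it asks whether we can rule out the variation being worse than control by more than an acceptable margin. The margin and conversion counts below are illustrative assumptions.

```python
from statistics import NormalDist

def non_inferior(conv_v0, n_v0, conv_v1, n_v1, margin, confidence=0.95):
    """True if we can conclude v1 is not worse than v0 by more than `margin`."""
    p0, p1 = conv_v0 / n_v0, conv_v1 / n_v1
    se = (p0 * (1 - p0) / n_v0 + p1 * (1 - p1) / n_v1) ** 0.5
    z = (p1 - p0 + margin) / se                  # shift the null hypothesis by the margin
    return z > NormalDist().inv_cdf(confidence)  # one-sided test

# Illustrative: identical 3.0% conversion rates, margin of 0.3 percentage points
print(non_inferior(conv_v0=600, n_v0=20_000, conv_v1=600, n_v1=20_000, margin=0.003))  # True
```

Because the question is one-sided and anchored to a margin rather than a hoped-for lift, the required sample size is typically smaller than for a superiority test against the same effect size, which is what makes the shorter test possible.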
We generally don’t recommend using your testing tool for results analysis. The reason is that its data is often less granular and harder to dissect than the data available in Google Analytics 4 or other analytics platforms. We still check testing-tool reports and think they make a great secondary data source, but we generally use Google Analytics 4 as our source of truth. That said, we are happy to discuss how we can best align with your business needs.