Our A/B testing analytics framework

Why A/B test?

A/B testing is sometimes referred to as the “gold standard” for making data-informed decisions because it allows us to infer causal relationships (this lift happened because of the thing we did!) with more rigor: a randomized control group gives us a clean baseline to compare against.

Unfortunately, the rise in popularity of A/B testing has led to a lot of corner cutting. There are a lot of folks out there who just aren’t doing it with the amount of specialization and rigor it requires, specifically when it comes to their data and analytics practices. This is arguably the place where you should be MOST rigorous in your A/B testing and personalization programs; otherwise you risk untrustworthy results and serious harm to your KPIs.

This post breaks down a few of the concepts and unpacks surefoot's analytics methodology and framework.

Uncertainty

There is always going to be uncertainty; it’s a fact of life (and testing). However, there are certain levers we can pull that reduce (or increase) the amount of uncertainty we bake into our tests. There are tradeoffs, though, so we need to be thoughtful about how much uncertainty we are comfortable with.

For example, you may choose to:

  • Reduce the confidence threshold from the (arbitrary) industry standard of 95% to 90% or 85%
  • The tradeoff here is that you increase the risk of thinking the observed difference is real when it isn’t, but it may also allow you to accelerate testing by decreasing the runtime/sample needed (the sketch just below illustrates this tradeoff).
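
To make that risk concrete, here is a minimal simulation sketch. It is illustration only, not part of our methodology, and the baseline conversion rate, traffic, and simulation count are assumed numbers. It runs many A/A tests, where both variations are identical, so every “significant” result is a false winner, and the share of false winners tracks the threshold we choose.

```python
# Illustrative A/A simulation: both "variations" share the same true conversion
# rate, so any statistically significant result is a false winner.
# Baseline rate, sample size, and simulation count are assumed numbers.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
baseline_rate = 0.03       # hypothetical baseline conversion rate
visitors_per_arm = 20_000  # hypothetical sample per variation
n_simulations = 2_000

p_values = []
for _ in range(n_simulations):
    conversions = rng.binomial(visitors_per_arm, baseline_rate, size=2)
    _, p = proportions_ztest(conversions, [visitors_per_arm] * 2)
    p_values.append(p)
p_values = np.array(p_values)

for confidence in (0.95, 0.90, 0.85):
    alpha = 1 - confidence
    print(f"{confidence:.0%} confidence threshold -> "
          f"~{(p_values < alpha).mean():.1%} false winners")
```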

Confidence, power, & statistical significance

Confidence and power each guard against a different type of error. Here are some questions to ask when deciding on an appropriate level of confidence and power for a test.

  • Confidence: How important is it that you do not erroneously report a difference when, in reality, the variations perform the same? (type 1 error)
  • Power: How important is it that you do not erroneously report NO difference when, in reality, there is a difference between the variations? (type 2 error)

Increasing the confidence level or power reduces the likelihood of observing type 1 or type 2 error, respectively. The tradeoff is that higher confidence/power will require a larger sample size and longer test duration.
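
As a rough illustration of that tradeoff, the sketch below uses statsmodels’ power calculation to estimate the required sample size per variation at a few confidence/power combinations. The baseline conversion rate and the 10% relative lift are made-up assumptions, not recommendations.

```python
# Rough sketch: how required sample size per variation grows with
# higher confidence and power. Baseline rate and target lift are assumed.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.03           # hypothetical baseline conversion rate
target_rate = 0.03 * 1.10      # hypothetical 10% relative lift
effect_size = proportion_effectsize(target_rate, baseline_rate)
analysis = NormalIndPower()

for confidence in (0.90, 0.95):
    for power in (0.80, 0.90):
        n_per_arm = analysis.solve_power(
            effect_size=effect_size,
            alpha=1 - confidence,
            power=power,
            alternative="two-sided",
        )
        print(f"confidence={confidence:.0%}, power={power:.0%} "
              f"-> ~{int(round(n_per_arm)):,} visitors per variation")
```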

Confidence and power are directly related to statistical significance.

If v0 and v1 were the same, how surprised would you be to see this result? If a result is statistically significant, it means it would be surprising to see a change of that size if there were truly no difference between variations. What counts as “surprising” depends on the sample size, standard deviation, and baseline conversion rate, and what we consider statistically significant depends on the confidence threshold and power we choose (it’s not a given!).
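
To put a number on “how surprised would you be,” here is a minimal sketch using a two-proportion z-test; the visitor and conversion counts are hypothetical. The p-value is the probability of seeing a difference at least this large if v0 and v1 truly performed the same, and we call the result statistically significant when it falls below the alpha implied by our chosen confidence threshold.

```python
# Minimal sketch: "how surprised would we be?" as a two-proportion z-test.
# All counts are hypothetical, for illustration only.
from statsmodels.stats.proportion import proportions_ztest

visitors    = [20_000, 20_000]  # v0, v1
conversions = [600, 660]        # 3.0% vs 3.3% observed conversion rates

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")

confidence_threshold = 0.95
alpha = 1 - confidence_threshold
if p_value < alpha:
    print("Surprising under 'no difference' -> statistically significant at 95%")
else:
    print("Not surprising enough to call significant at 95%")
```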

Stopping Conditions

Stopping conditions tell us how long we need to run a test in order to observe a statistically significant change (of a certain size). An important note is that we have to decide HOW BIG of a change we want to detect and HOW CONFIDENT we want to be BEFORE we launch the experiment.

Before we decide to launch a test, we:

  1. Determine our primary metric. This will not always be “conversion rate”. Instead, it’s the thing closest to the change we’re making. In other words, if we’re running a test on a PDP, “adds to cart” is most likely going to be the primary metric since it’s the closest indicator of success to the change we’re making. The primary metric tells us whether or not our test was successful and is most closely tied to decision-making.
  2. Look at historical traffic to the page or website we plan to test, and the baseline conversion rate for that metric. More traffic and a higher baseline conversion rate generally mean a shorter test duration.
  3. Determine what size change (in that metric) would be meaningful to the business. What size change (% increase or decrease) would be meaningful enough to make a decision based on? Ex. if we saw a 5% increase in cart adds, it would be worth implementing a new PDP design.
  4. Calculate stopping conditions. Based on the decisions we made earlier about how much uncertainty we’re willing to accept (What is our confidence threshold? How big of a change do we need to observe in order to make a decision? How long are we willing to run this test?), we calculate how long it would take to observe a change of that size (see the sketch below).
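
Here is a minimal sketch of that stopping-condition calculation, tying the steps above together. Every input (baseline add-to-cart rate, the 5% relative lift, daily traffic, confidence, power) is a hypothetical placeholder; in practice we plug in your historical numbers.

```python
# Sketch of a stopping-condition calculation: required sample size per
# variation, then how many days of historical traffic that implies.
# All inputs below are hypothetical placeholders.
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate  = 0.08    # step 2: baseline add-to-cart rate (assumed)
relative_lift  = 0.05    # step 3: smallest meaningful lift, 5% relative
daily_visitors = 6_000   # step 2: historical traffic to the PDP (assumed)
confidence     = 0.95    # chosen before launch
power          = 0.80    # chosen before launch
n_variations   = 2       # control + one variation

target_rate = baseline_rate * (1 + relative_lift)
effect_size = proportion_effectsize(target_rate, baseline_rate)

n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=1 - confidence,
    power=power,
    alternative="two-sided",
)

total_visitors = n_per_variation * n_variations
days = math.ceil(total_visitors / daily_visitors)
print(f"~{int(round(n_per_variation)):,} visitors per variation")
print(f"~{days} days at {daily_visitors:,} visitors/day "
      f"(we'd typically round up to full weeks)")
```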

We want to run tests long enough to detect practically significant effects, but we don’t want to run tests so long that we’re wasting your time or ours.

Peeking early

Peeking refers to looking at results before a test has met the predetermined stopping conditions and making decisions based on those results. We know it’s difficult, but we highly discourage peeking! We monitor test results regularly to ensure quality. If things are really bad (or good), we’ll occasionally call a test early using a sequential testing approach. But it’s usually best to wait until a test has met its stopping conditions to truly make a data-informed decision. We are very thoughtful about how we set stopping conditions, and following the process gives us the greatest chance of observing meaningful and reliable results.
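
To illustrate why, here is a minimal simulation sketch of the peeking problem. The traffic and conversion numbers are assumed, and the naive “stop at the first significant look” rule is deliberately simplistic; it is not how we call tests early. Even with no true difference between variations, repeated interim looks push the false-positive rate well above the nominal 5%, which is why early calls require a proper sequential approach rather than ad hoc peeking.

```python
# Illustration of the peeking problem: A/A tests (no true difference),
# checked at several interim points, stopping at the first p < 0.05.
# All numbers are hypothetical; this is not our production methodology.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
baseline_rate = 0.03
visitors_per_week = 10_000   # per variation, assumed
n_weeks = 4                  # four weekly "peeks"
n_simulations = 1_000
alpha = 0.05

def peeked_test_is_false_positive():
    conversions = np.zeros(2, dtype=int)
    visits = np.zeros(2, dtype=int)
    for _ in range(n_weeks):
        conversions += rng.binomial(visitors_per_week, baseline_rate, size=2)
        visits += visitors_per_week
        _, p = proportions_ztest(conversions, visits)
        if p < alpha:        # "peek" and stop at the first significant look
            return True
    return False

peeking_fpr = np.mean([peeked_test_is_false_positive() for _ in range(n_simulations)])
print(f"False-positive rate with weekly peeking: ~{peeking_fpr:.1%} "
      f"(vs the nominal {alpha:.0%} with a single final look)")
```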

If you don’t expect to see a lift and just want to make sure a change isn’t hurting anything, we can use different statistical methods and run a shorter test.

Using testing tools for results analysis

We generally don’t recommend using your testing tool for results analysis. The reason is that its data is often less granular and harder to dissect than the data available in Google Analytics 4 or other analytics platforms. We still review testing-tool reports and think they make a great secondary data source, but we generally use Google Analytics 4 as our source of truth. That said, we are happy to discuss how we can best align with your business needs.
