
A/B Testing

Motivation for A/B Testing

Suppose you have an idea for a feature for your app. You think it’s a good idea, but you don’t know with 100% confidence that users will like it. Maybe you are 80% confident. It’s worth a try, but how do we know if users will like it? Suppose you make the code change and release it to all of your users, and a few days later the number of daily active users (DAU) on your app increases. Was the increase due to your feature, or was it due to some other reason? There could have been, for example, other features released on the same day. Or maybe it became the weekend and your app just gets more usage on the weekend. This is where A/B testing comes into play.

What is A/B testing and how does it work?

In A/B testing, you do a code change where you introduce a new feature but the availability of this feature is gated behind a flag. The feature could be gated behind a user_id flag or a device_id flag, for example. You then have two randomly selected segments of users of equal size where the only difference between them is that one segment gets the feature (the test group) and the other segment does not get the feature (the control group).
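
To make the gating idea concrete, here is a minimal sketch, assuming a user_id-based flag and made-up names (this is not any particular company’s experimentation SDK), of how users might be deterministically split into control and test:

```python
# A minimal sketch (hypothetical names, not a real feature-flag SDK) of
# gating a feature behind a user_id flag: each user is hashed into one of
# two equally sized, stable segments.
import hashlib

def in_test_group(user_id: str, experiment_name: str) -> bool:
    """Deterministically assign a user to test (True) or control (False)."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2 == 0  # ~50/50 split, stable for a given user

def render_home_screen(user_id: str) -> str:
    if in_test_group(user_id, "new_feed_layout"):
        return "home screen WITH the new feature"    # test group
    return "home screen WITHOUT the new feature"     # control group

print(render_home_screen("user_12345"))
```

Hashing on the experiment name plus the user_id keeps a user’s assignment stable across sessions while keeping assignments independent across different experiments.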

You could then say that any difference in metrics between the two segments is solely due to your feature because your feature is the only difference between the two segments. It’s that simple. Okay, not quite. But, sort of. Let me explain with a dummy example using coins. Suppose there are two quarters, each flipped 5 times. Quarter A gets heads 3 times and Quarter B gets heads 4 times. Does this mean Quarter B is more likely to result in heads than Quarter A? Not necessarily. It could have just been randomness going that way.
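
If you want to convince yourself how often pure randomness produces a gap like that, here is a quick simulation sketch (illustrative only):

```python
# Two fair quarters, flipped 5 times each: purely by chance, they land on a
# different number of heads most of the time.
import random

random.seed(0)
trials = 100_000
different = sum(
    sum(random.random() < 0.5 for _ in range(5))
    != sum(random.random() < 0.5 for _ in range(5))
    for _ in range(trials)
)
print(f"Head counts differed in {different / trials:.0%} of trials")  # roughly 75%
```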

So, suppose you have 20,345,653 DAU in the control group (doesn’t get your feature) and 20,345,875 DAU in the test group (does get your feature). This doesn’t necessarily mean that the gain is due to your feature. It could have been due to randomness, just like the coin flip example. For this reason, A/B tests generate confidence intervals. You then look at the confidence intervals to see whether a metric is statistically significant or neutral. If a metric is statistically significant, then the difference is likely due to your code change. If a metric is neutral, then your experiment wasn’t able to measure a statistically significant difference between the two groups at the allocation size of the experiment. You can calculate the statistical power of the metric to determine whether increasing allocation would likely give you a stat-sig reading. But increasing allocation has tradeoffs. While it does make it easier to measure a stat-sig difference between control and test, it also gives the feature to more users, which is something you may not want to do in the early stages. You usually want to test the feature with a smaller set of users and validate that everything looks good before giving it to more users. That way, if something goes unexpectedly wrong with the feature, you catch it early with only a small number of users impacted.
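
To make the confidence-interval idea concrete, here is a rough sketch using a normal-approximation interval on made-up numbers (this is not how any particular experimentation framework computes it):

```python
# 95% confidence interval for the difference in a binary metric (e.g., "user
# was active today") between control and test. All numbers are hypothetical.
import math

def diff_ci(successes_a, n_a, successes_b, n_b, z=1.96):
    """Normal-approximation confidence interval for p_b - p_a."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical: 40M users assigned to each group, active-user counts as above.
low, high = diff_ci(20_345_653, 40_000_000, 20_345_875, 40_000_000)
print(f"95% CI for the lift: [{low:.6f}, {high:.6f}]")
# If the interval contains 0, the metric reads as neutral at this allocation.
```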

A/B testing will typically collect a variety of metrics, not just DAU. It could, for example, check things like app crashes, ad revenue, power usage, impressions of certain types of media, uploads of certain types of media, etc. So when you have a feature that you are interested in coding up and releasing, you can test it under an A/B test, and these metrics can give you a good idea of whether users like the feature and whether anything is going unexpectedly wrong.

Does this mean that we are testing in production? Yeah. We are testing in production. You may have heard that testing in production is bad. Testing in production is not inherently bad. It’s only bad when production is the first place the code ever gets tested, or when it is tested with all users immediately as opposed to a small subset of them. Testing in production can give you certain information that is just hard to get when testing internally. How would you know, from testing the code with your team, whether it is going to increase the number of DAU?

Let’s go over the phases in which code gets tested with A/B testing. It’s a little different at each company, but some themes are pretty common. You should test thoroughly before the code makes it to production, and you shouldn’t test with all users in production, only a small segment of them. You typically want to test locally before committing the code that is gated behind a flag. You can then have a dogfooding session with teammates if you like. You can then roll out the code to employees only. If everything looks good after that, you can roll out the code to a small number of users in prod. If everything looks good there, then you can roll it out to more users and eventually all users. Usually, companies or teams will have a specific process for A/B testing that outlines the requirements before you can roll out the feature to more people.
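
As one way to picture that progression, here is a hypothetical rollout plan expressed as configuration (the stages and percentages are made up, not a standard):

```python
# A hypothetical staged rollout plan for a flag-gated feature. Each stage
# only proceeds if metrics from the previous stage look healthy.
ROLLOUT_STAGES = [
    {"audience": "local + dogfood", "percent": 100},  # before the code ships
    {"audience": "employees",       "percent": 100},  # internal users only
    {"audience": "production",      "percent": 1},    # small prod slice
    {"audience": "production",      "percent": 10},
    {"audience": "production",      "percent": 50},
    {"audience": "production",      "percent": 100},  # full launch
]

for stage in ROLLOUT_STAGES:
    print(f"Ship to {stage['percent']}% of {stage['audience']}, then review metrics")
```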

Potential pitfalls of A/B testing

Pitfall #1: a stat-sig measurement could be a false positive

This is usually due to randomness (as explained in the coin flip example) or due to pre-experimentation bias. Pre-experimentation bias could be something like the test group already having a stat-sig amount more DAU than the control group the day before the experiment started. For pre-experimentation bias, there are ways to query the data that subtract off that bias. One way to see how often stat-sig measurements are false positives in your framework is to run an A/A test: have two groups that both get the same treatment and see how many metrics come out stat-sig.
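
Here is a small simulation sketch of the A/A idea (illustrative only): both groups get identical treatment, so every stat-sig reading is by definition a false positive, and with 95% confidence intervals you should see roughly 5% of comparisons flagged as significant.

```python
# Simulated A/A test: both groups are drawn from the same distribution, so
# any "significant" difference is a false positive (~5% expected at z=1.96).
import math
import random

def is_stat_sig(successes_a, n_a, successes_b, n_b, z=1.96):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return abs(p_b - p_a) > z * se

random.seed(1)
trials, n, base_rate = 500, 5_000, 0.30
false_positives = sum(
    is_stat_sig(
        sum(random.random() < base_rate for _ in range(n)), n,
        sum(random.random() < base_rate for _ in range(n)), n,
    )
    for _ in range(trials)
)
print(f"False positive rate: {false_positives / trials:.1%}")  # around 5%
```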

Pitfall #2: the novelty effect

Oftentimes when you introduce a new feature that users can see, they click on it a lot out of curiosity and then click on it less often once they know what it does. This often leads to metric readings showing a stat-sig increase in DAU over the first few days that then becomes neutral over time. Or it maintains a stat-sig increase in DAU, but a smaller one over time. Or, even worse, it eventually becomes a stat-sig decline in DAU. So the metrics you see in the first few days or weeks for a feature are not necessarily representative of how the metrics will look long term. Typically, you would want to test the feature for some time to account for the novelty effect before releasing it to all users. One downside of this is feeling like you have to wait a really long time before releasing to all users out of concern for the novelty effect, which makes features take longer to ship. One compromise is to, for example, launch the feature to 97% of users and leave 3% of users excluded from getting it. You can then have this unlucky group of users go without the feature for a few months and compare the metrics between them and the users who got the feature. If the group without the feature eventually has better metrics than the group with the feature, you can unship the feature from everyone. Otherwise, you eventually give these remaining users the feature. This way, you get to keep measuring for months without having to hold the feature back from 97% of users before shipping.

Pitfall #3: dilution

Ideally, everyone in the test group actually saw the feature, and everyone in the control group would have seen the feature but didn’t solely because they were in the control group. It sometimes happens, whether intentionally or unintentionally, that some users in the test group didn’t actually see the feature. This is known as dilution, since the test group is diluted with users who are in the same situation as the control group: they didn’t see the feature. This makes it more likely that a metric comes out neutral (not stat-sig). So you would typically have code that gates on your feature to determine whether the user gets the feature or not. When the gate is run against a user, the user is “exposed” in the experiment, meaning that the user will be included in the metrics. So the users compared between the control and test groups would both be users that were “exposed” in your experiment. For example, if your feature is one that runs when a user uploads a story, you would expose users when they upload a story, and then all users in your metric readout would be ones that uploaded a story.
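
Here is a minimal sketch of that exposure pattern (the function and experiment names are made up for illustration, not a real experimentation SDK):

```python
# Exposure logging sketch: a user only enters the metric readout once the
# feature gate is actually evaluated, i.e., once they hit the gated code path.
import hashlib

exposed_users = set()  # users to include when computing experiment metrics

def check_gate(user_id: str, experiment_name: str) -> bool:
    """Evaluate the gate and log an exposure for this user."""
    exposed_users.add(user_id)  # exposure happens at gate-check time
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2 == 0  # True = test group, False = control group

def upload_story(user_id: str) -> None:
    if check_gate(user_id, "story_upload_redesign"):
        print(f"{user_id}: upload flow WITH the new feature")
    else:
        print(f"{user_id}: upload flow WITHOUT the new feature")

upload_story("user_42")  # this user is now exposed
print(exposed_users)     # only users who actually uploaded a story appear here
```

Users who never hit the upload path never get exposed, so they never dilute the readout on either side.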

Pitfall #4: experiments interacting with each other

Suppose you are running experiment A, where you are changing the font size, and your colleague is running experiment B, where they are changing the font color. There are four possibilities for a user: they are in both experiments, only in experiment A, only in experiment B, or in neither experiment. This makes it harder to analyze the data in each experiment, since each one has users who might be affected by other ongoing experiments, changing the metrics. Maybe even worse, you are doing an experiment changing the background color to red while someone else is doing an experiment changing the background color to blue, and the code is set up such that your teammate’s code overwrites yours, so users in both experiments get blue. To get around this, there is the concept of a universe. Let’s say you have Universe A and it has 100% allocation available. You can start an experiment in Universe A at 2% allocation: of the 100% available universe space, 2% of users are randomly selected to be eligible for your experiment. Your teammate can then start their experiment in Universe A at 2% allocation, and their 2% of the userbase will be selected from the pool of 98% of users not allocated to any experiment. So your two experiments are guaranteed not to share any users.
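
Here is a rough sketch of how a universe might hand out non-overlapping slices of users to experiments (an illustration of the idea, not any specific framework’s implementation):

```python
# A "universe" sketch: experiments claim disjoint slices of a shared 0-99
# bucket space, so no user can land in two experiments in the same universe.
import hashlib
from typing import Optional

class Universe:
    def __init__(self, name: str, buckets: int = 100):
        self.name = name
        self.buckets = buckets
        self.next_free = 0     # first unallocated bucket
        self.experiments = {}  # experiment name -> (start, end) bucket range

    def allocate(self, experiment: str, percent: int) -> None:
        if self.next_free + percent > self.buckets:
            raise ValueError("not enough universe space left")
        self.experiments[experiment] = (self.next_free, self.next_free + percent)
        self.next_free += percent

    def experiment_for(self, user_id: str) -> Optional[str]:
        """Return the experiment (if any) this user is eligible for."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % self.buckets
        for name, (start, end) in self.experiments.items():
            if start <= bucket < end:
                return name
        return None

universe_a = Universe("universe_a")
universe_a.allocate("font_size_test", 2)   # buckets 0-1
universe_a.allocate("font_color_test", 2)  # buckets 2-3
print(universe_a.experiment_for("user_12345"))  # at most one experiment, or None
```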

It may be that two features independently perform well and are thus launched, but then after launching they perform poorly together. Putting the two features in the same experiment in the same universe can help you test how they interact. It might seem like it’s good to have all of your experiments in the same universe so that you get to choose which features interact with which features. The downside is that, for example, if all experiments run at 2% allocation, then you can have at most 50 experiments running; any future experiment that you want to run would have to wait until there is universe space. To get around this, you can create more universes. But keep in mind that users in an experiment of one universe can also be in an experiment of another universe. So the general rule of thumb is: 1) the more you are concerned about experiments interacting with each other, the more likely it is that you want them in the same universe; 2) the more experiments you need to run, the more likely it is that you want more universes.

So take both of those pieces of information and make a judgement call.