We all want to be there; some of us are already doing it, and a few even do it great: A/B testing!

When you say "A/B test" among data practitioners, you usually get one of the following reactions:
Unexplained joy and enthusiasm
Skepticism
A "get out of here" look and feel
In this post I want to shed light on best practices and common mistakes I have seen in a variety of companies (some of them heavily using hypothesis testing).
It starts with why you need it!
Things we would like to test with A/B:
New features
UI changes
Different look of a page with same functionality
Ranking changes
When and why not to use it:
Change aversion & New experience
No benchmark/baseline for comparison
It can take time until users adapt to the new experience
You need lots of returning users to detect a change
Testing can't tell you if you're missing something
When not to use it:
The test will measure the effect of the change, but it will not tell you if there is another huge effect you are missing that is outside the test's scope
Redesigning your whole site
Changing the way users interact
Some tips before/after planning/running a test
Sometimes the A/B test is mandatory - but you will need more tools/inputs to make a decision
For a sequence of tests - we would prefer using the same metric
Plot the data to identify the variance
You want your metric to be sensitive enough to capture the changes, yet robust, with a smooth distribution that covers the majority of the data (use a histogram to find the range of values and their frequencies) - see the sketch after this list
Do A/A tests, to verify that your setup shows no significant difference when there is none
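To make these tips concrete, here is a minimal sketch (made-up data, hypothetical user_id and metric column names) of plotting the metric's distribution and running a simple A/A check with pandas and scipy:

```python
# A minimal sketch: inspect the metric's distribution and run a simple A/A check.
# The DataFrame, column names, and data below are illustrative only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)

df = pd.DataFrame({
    "user_id": np.arange(10_000),
    "metric": rng.lognormal(mean=0.5, sigma=1.0, size=10_000),  # replace with your per-user metric
})

# 1) Histogram to see the range of values, skew, and obvious outliers.
df["metric"].hist(bins=50)
plt.xlabel("metric value")
plt.ylabel("number of users")
plt.title("Metric distribution before the test")
plt.show()

# 2) A/A test: split users randomly into two identical groups and verify
#    the machinery reports no significant difference (most of the time).
assignment = rng.random(len(df)) < 0.5
group_a = df.loc[assignment, "metric"]
group_b = df.loc[~assignment, "metric"]

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(f"A/A p-value: {p_value:.3f}")  # should be > 0.05 in roughly 95% of A/A runs
```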
Not everything is testable - so what else?
-> Other ways of gathering information for a decision, given a hypothesis:
Web-site logs / events
DWH (data warehouse)
Behavioral analysis
User experience research
Focus groups
Surveys
Human evaluation
So far - I have talked about when and why to even run a test...
I'll assume you are all familiar with the methodology and the statistical measures (if not, I'll cover that in another post for "A/B 101").
Now, let's talk about the common mistakes I have seen and how to easily fix them!
The common mistakes and how to avoid them!
#1 - The engineering factor
When your test goes live - it usually goes through the engineering team.
While the test might be straightforward, the backend won't be - and as a data scientist/analyst you need to make sure that your test is well defined, the workflow is aligned with the spec, and the desired result actually happens.
Test validation after going live → AKA the engineering factor:
The feature does not function as expected! You need to validate that the test and control groups behave as designed
Example - if the test population is new users only, we can check whether, during the test time frame, all new users were enrolled into one of the test groups (in the case of a 50-50 test for new users)
Consider whether the observed increase is really the increase brought by the feature
For example - if there is an active campaign talking about the feature, social conversation about it, etc., that might increase the feature's awareness
The feature was enabled for a partial population instead of the desired one
Unreliable A/B test systems - all kinds of cases where users did not get enrolled into any of the groups due to engineering definitions
A widely common example:
Flickers: users who switched between the control and treatment groups.
The existence of such users might contaminate the experiment results, so we exclude these users (flickers) from our analyses - see the sketch below.
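Here is a minimal sketch of one way to detect and exclude flickers, assuming an assignment/exposure log with hypothetical user_id and group columns:

```python
# A minimal sketch of flicker detection, on a toy assignment/exposure log.
import pandas as pd

assignments = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "group":   ["control", "treatment", "control", "treatment", "treatment", "control"],
})

# A flicker is any user seen in more than one group during the experiment.
groups_per_user = assignments.groupby("user_id")["group"].nunique()
flickers = groups_per_user[groups_per_user > 1].index

print(f"Flickers to exclude: {list(flickers)}")  # -> [1]

# Exclude flickers before analyzing the experiment results.
clean = assignments[~assignments["user_id"].isin(flickers)]
```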
#2 - Cherry picking and the human factor
When product managers run a test, they usually have a winning version they would like to see ;-)
In order to run a test with the pure goal of making the right decision - we need to remove the human factor and reduce the chances of choosing the wrong test version.
Here are some examples of how not to make a decision:
Looking at the whole population while only part of it is impacted
Running a test without any business intuition
Choosing specific slices to reach statistical significance
Looking at several metrics to reach statistical significance
Stopping the test as soon as it reaches statistical significance
Keeping the test running until it reaches statistical significance
Ignoring A/B test results when they go against "intuition"
#3 - No or poor statistical understanding
Well, at the end of the day, an A/B test is a statistical method - and as such, you need to know your way around it...
Here are some examples of what not to do:
Running tests without a statistical methodology
Doing nothing, or doing it without a comparison, is not a test
Working with the wrong metric
Running tests without analyzing the behavior first
Bring on the solutions!
As a data scientist/analyst - you have a number of options you can apply to your data in order to make it more accurate.
Here are a few you can use before/during the test.
Pre-stages of an A/B test
Outlier detection
Removes irregularities in the data and improves the robustness of analytic results. We can use a clustering-based algorithm to perform outlier detection and removal.
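As one possible sketch (one option among many), scikit-learn's DBSCAN can act as a clustering-based outlier remover, since it labels points that don't belong to any dense cluster as noise:

```python
# A minimal sketch of clustering-based outlier removal with DBSCAN,
# which labels points that don't fall into any dense cluster as -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative per-user metrics (e.g., sessions and revenue); replace with your data.
X = np.vstack([
    rng.normal(loc=[5, 20], scale=[1, 5], size=(1000, 2)),   # typical users
    rng.normal(loc=[50, 500], scale=[5, 50], size=(10, 2)),  # extreme outliers
])

X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_scaled)

mask = labels != -1          # keep only points assigned to a cluster
X_clean = X[mask]
print(f"Removed {np.sum(~mask)} outliers out of {len(X)} users")
```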
Variance reduction
Helps increase the statistical power of hypothesis testing, which is especially helpful when the experiment has a small user base or when we need to end the experiment prematurely without sacrificing scientific rigor. The CUPED Method leverages extra information we have and reduces the variance in decision metrics.
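A minimal sketch of the CUPED adjustment, assuming we also have each user's pre-experiment value of the metric (illustrative data and names):

```python
# A minimal sketch of CUPED: use the pre-experiment metric as a covariate
# to reduce the variance of the in-experiment metric.
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """Adjust the in-experiment metric using the pre-experiment covariate."""
    theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

rng = np.random.default_rng(1)
pre = rng.normal(10, 3, size=5000)                  # pre-experiment metric per user
post = pre * 0.8 + rng.normal(0, 1, size=5000)      # correlated in-experiment metric

adjusted = cuped_adjust(post, pre)
print(f"variance before: {post.var():.2f}, after CUPED: {adjusted.var():.2f}")
```

The stronger the correlation between the pre-experiment and in-experiment values, the bigger the variance reduction.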
Pre-experiment bias
Is a big challenge because of the diversity of users.
Sometimes, constructing a robust counterfactual via mere randomization just doesn’t cut it.
Difference in differences (diff-in-diff) is a well-accepted method in quantitative research, and we use it to correct pre-experiment bias between groups so as to produce reliable treatment effect estimates.
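A tiny worked sketch of the diff-in-diff correction, with illustrative numbers only:

```python
# A minimal sketch of difference-in-differences, assuming we have
# pre- and post-experiment means for both groups (illustrative values).
pre_control, post_control = 10.0, 10.5   # control group means
pre_treat,   post_treat   = 10.4, 11.6   # treatment group means

# Naive estimate ignores the fact that treatment started higher.
naive_effect = post_treat - post_control                               # 1.1

# Diff-in-diff removes the pre-existing gap between the groups.
did_effect = (post_treat - pre_treat) - (post_control - pre_control)   # 1.2 - 0.5 = 0.7

print(f"naive: {naive_effect:.2f}, diff-in-diff: {did_effect:.2f}")
```

The pre-existing 0.4 gap between the groups is subtracted out, so only the incremental change is attributed to the treatment.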
Calculate it right!
Make sure you are calculating the numbers/metrics right.
Here are various procedures for p-value calculation, including:
Welch’s t-test, the default test used for continuous metrics, e.g., conversions / completed actions.
The Mann-Whitney U test, a nonparametric rank sum test used when the data show severe skewness.
It requires weaker assumptions than the t-test and performs better with skewed data.
The Chi-squared test, used for proportion metrics, e.g., retention rate.
The Delta method (Deng et al. 2011) and bootstrap methods, used for standard error estimation whenever suitable to generate robust results for experiments with ratio metrics or with small sample sizes, e.g., the ratio of plans cancelled by users.
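Here is a minimal sketch of running these tests with scipy on illustrative data (the bootstrap shown for the ratio metric is one alternative to the delta method):

```python
# A minimal sketch of the tests listed above, on illustrative data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control   = rng.lognormal(0.5, 1.0, size=2000)    # skewed continuous metric
treatment = rng.lognormal(0.55, 1.0, size=2000)

# Welch's t-test for continuous metrics (does not assume equal variances).
t_stat, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

# Mann-Whitney U test: nonparametric alternative when the data are heavily skewed.
u_stat, p_mwu = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Chi-squared test for proportion metrics (e.g., retention rate).
#                 retained  churned
table = np.array([[420,     580],     # control
                  [465,     535]])    # treatment
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)

# Bootstrap of a ratio metric's standard error (one alternative to the delta method).
num = rng.poisson(2, 1000).astype(float)    # e.g., cancelled plans per user
den = rng.poisson(10, 1000).astype(float)   # e.g., total plans per user
boot = [num[idx].sum() / den[idx].sum()
        for idx in (rng.integers(0, 1000, 1000) for _ in range(2000))]
ratio_se = np.std(boot)

print(p_welch, p_mwu, p_chi2, ratio_se)
```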
Well, I hope that by now you have found something that needs a bit of an update; if not, you're doing great!
Keep following for future posts about A/B testing.