Data to Action: AB test - Pricing Test


Load Data

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("ticks")
plt.style.use('fivethirtyeight')

#sns.set_style("white")
#plt.style.use('ggplot')
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


plt.rcParams["figure.figsize"] = (12,8)
plt.rcParams['axes.labelsize'] = 16
plt.rcParams['axes.titlesize'] = 20

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
In [3]:
df_test = pd.read_csv("data/Pricing_Test/test_results.csv")
print(df_test.shape)
df_test.head()
(316800, 8)
Out[3]:
user_id timestamp source device operative_system test price converted
0 604839 2015-05-08 03:38:34 ads_facebook mobile iOS 0 39 0
1 624057 2015-05-10 21:08:46 seo-google mobile android 0 39 0
2 317970 2015-04-04 15:01:23 ads-bing mobile android 0 39 0
3 685636 2015-05-07 07:26:01 direct_traffic mobile iOS 1 59 0
4 820854 2015-05-24 11:04:40 ads_facebook web mac 0 39 0
In [6]:
df_user = pd.read_csv("data/Pricing_Test/user_table.csv")
print(df_user.shape)
df_user.head()
(275616, 5)
Out[6]:
user_id city country lat long
0 510335 Peabody USA 42.53 -70.97
1 89568 Reno USA 39.54 -119.82
2 434134 Rialto USA 34.11 -117.39
3 289769 Carson City USA 39.15 -119.74
4 939586 Chicago USA 41.84 -87.68

Data Cleaning

In [7]:
# missing value
df_test.isnull().sum()
Out[7]:
user_id             0
timestamp           0
source              0
device              0
operative_system    0
test                0
price               0
converted           0
dtype: int64
In [8]:
df_user.isnull().sum()
Out[8]:
user_id    0
city       0
country    0
lat        0
long       0
dtype: int64
In [10]:
# merge two dataframes on user_id
df = pd.merge(df_test, df_user, on='user_id', how= 'left')
df.head()
Out[10]:
user_id timestamp source device operative_system test price converted city country lat long
0 604839 2015-05-08 03:38:34 ads_facebook mobile iOS 0 39 0 Buffalo USA 42.89 -78.86
1 624057 2015-05-10 21:08:46 seo-google mobile android 0 39 0 Lakeville USA 44.68 -93.24
2 317970 2015-04-04 15:01:23 ads-bing mobile android 0 39 0 Parma USA 41.38 -81.73
3 685636 2015-05-07 07:26:01 direct_traffic mobile iOS 1 59 0 Fayetteville USA 35.07 -78.90
4 820854 2015-05-24 11:04:40 ads_facebook web mac 0 39 0 Fishers USA 39.95 -86.02
In [11]:
# check missing value after merge
df.isnull().sum()
Out[11]:
user_id                 0
timestamp               0
source                  0
device                  0
operative_system        0
test                    0
price                   0
converted               0
city                41184
country             41184
lat                 41184
long                41184
dtype: int64
In [12]:
# those 41,184 missing values are exactly the row-count gap between the two tables
# (316,800 test rows - 275,616 user rows), i.e. users with no entry in the user table
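A quick way to confirm this is pandas' merge `indicator` flag, which labels rows whose `user_id` has no match in the user table. A minimal sketch with toy frames (the ids and city values below are hypothetical; only the pattern matters):

```python
import pandas as pd

# Toy stand-ins for df_test and df_user (hypothetical ids, for illustration):
df_test_toy = pd.DataFrame({'user_id': [1, 2, 3, 4]})
df_user_toy = pd.DataFrame({'user_id': [1, 2], 'city': ['Boston', 'Reno']})

# indicator=True adds a '_merge' column; 'left_only' marks test rows
# whose user_id never appears in the user table
merged = df_test_toy.merge(df_user_toy, on='user_id', how='left', indicator=True)
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(n_unmatched)  # 2 test rows have no profile in the user table
```

On the real data, the `left_only` count should equal the 41,184 NaNs seen above.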
In [16]:
# check whether users in test always see the $59 price and users in control the $39 price
df.groupby(['test','price'])[['price']].count()
Out[16]:
price
test price
0 39 202517
59 210
1 39 155
59 113918

Issue:

  • A small number of users appear to be getting the wrong price based on their test/control assignment.

Fix:

  • Remove the users with the wrong price and move on with our analysis.
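From the counts above, the mismatch is tiny relative to the dataset, so dropping those rows is unlikely to bias the analysis:

```python
# Counts taken from the crosstab above
mismatched = 210 + 155        # control users shown $59 + test users shown $39
total_rows = 316800           # rows in df before filtering

share = mismatched / total_rows
print(f"{share:.2%} of rows are mismatched")  # 0.12%
```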
In [20]:
df=df[((df['test']==0) & (df['price']==39)) | ((df['test']==1) & (df['price']==59))]
df.head()
Out[20]:
user_id timestamp source device operative_system test price converted city country lat long
0 604839 2015-05-08 03:38:34 ads_facebook mobile iOS 0 39 0 Buffalo USA 42.89 -78.86
1 624057 2015-05-10 21:08:46 seo-google mobile android 0 39 0 Lakeville USA 44.68 -93.24
2 317970 2015-04-04 15:01:23 ads-bing mobile android 0 39 0 Parma USA 41.38 -81.73
3 685636 2015-05-07 07:26:01 direct_traffic mobile iOS 1 59 0 Fayetteville USA 35.07 -78.90
4 820854 2015-05-24 11:04:40 ads_facebook web mac 0 39 0 Fishers USA 39.95 -86.02
In [21]:
# check count again
df.groupby(['test','price'])[['price']].count()
Out[21]:
price
test price
0 39 202517
1 59 113918
In [22]:
# create a new feature: revenue
df['revenue'] = df['converted'] * df['price']
df.head()
Out[22]:
user_id timestamp source device operative_system test price converted city country lat long revenue
0 604839 2015-05-08 03:38:34 ads_facebook mobile iOS 0 39 0 Buffalo USA 42.89 -78.86 0
1 624057 2015-05-10 21:08:46 seo-google mobile android 0 39 0 Lakeville USA 44.68 -93.24 0
2 317970 2015-04-04 15:01:23 ads-bing mobile android 0 39 0 Parma USA 41.38 -81.73 0
3 685636 2015-05-07 07:26:01 direct_traffic mobile iOS 1 59 0 Fayetteville USA 35.07 -78.90 0
4 820854 2015-05-24 11:04:40 ads_facebook web mac 0 39 0 Fishers USA 39.95 -86.02 0
In [23]:
df['revenue'].describe()
Out[23]:
count    316435.000000
mean          0.827083
std           6.179009
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          59.000000
Name: revenue, dtype: float64

Q: Should the company sell its software for 39 or 59?

In [25]:
# compare conversion rate and revenue
df.groupby('test')['converted'].mean()
Out[25]:
test
0    0.019900
1    0.015555
Name: converted, dtype: float64
In [26]:
df.groupby('test')['revenue'].mean()
Out[26]:
test
0    0.776083
1    0.917748
Name: revenue, dtype: float64

Although the conversion rate drops by about 22%, revenue per user goes up.
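The relative changes behind this statement can be computed directly from the group means above; note that revenue per user is just conversion rate times price:

```python
# Conversion rates and prices taken from the groupby results above
conv = {0: 0.019900, 1: 0.015555}   # 0 = control, 1 = test
price = {0: 39, 1: 59}

# revenue per user = conversion rate x price
rev = {g: conv[g] * price[g] for g in (0, 1)}

conv_change = conv[1] / conv[0] - 1
rev_change = rev[1] / rev[0] - 1
print(f"conversion: {conv_change:+.1%}, revenue per user: {rev_change:+.1%}")
# conversion: -21.8%, revenue per user: +18.3%
```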

In [29]:
# Perform a t-test

from scipy import stats
t_test = stats.ttest_ind(df[df['test'] == 1]['revenue'],
                         df[df['test'] == 0]['revenue'],
                         equal_var=False)

print('t statistic is: ',t_test.statistic)
print('p-value is: ',t_test.pvalue)
t statistic is:  5.715224666463108
p-value is:  1.0972577312420781e-08

Revenue per user is up and highly statistically significant.
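As a complement, a two-proportion z-test shows the conversion-rate drop itself is also highly significant. This is a sketch: the conversion counts below are reconstructed approximately from the group means and sizes above, not read from the raw data.

```python
from statsmodels.stats.proportion import proportions_ztest

# Approximate conversion counts, reconstructed from the group means above:
# test: 0.015555 * 113918 ~ 1772, control: 0.019900 * 202517 ~ 4030
count = [1772, 4030]
nobs = [113918, 202517]

z, p = proportions_ztest(count, nobs)
print(f"z = {z:.2f}, p-value = {p:.2e}")
```

A large negative z (test below control) with a tiny p-value: the price increase reduces conversion, it just doesn't reduce revenue.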

Let’s now check how different segments are reacting to the test. This can give us extremely useful information about user price sensitivity by segment, which is really just a fancy way of asking: how much do different users value our product?

In [39]:
# check revenue by source
sns.barplot(x='source', y='revenue', hue='test', data=df);
plt.xticks(rotation=45);
plt.title('Revenue by Source');

Insights:

  • Most segments agree with the overall finding: revenue rises as we increase the price.
  • Unfortunately, the three segments with the highest avg revenue at the new price happen to be the most expensive acquisition channels, i.e. friend referral and Google/FB ads.
In [40]:
# check revenue by device
sns.barplot(x='device', y='revenue', hue='test', data=df);
plt.xticks(rotation=45) ;
plt.title('Revenue by Device');

Insight:

  • It appears that mobile users are less price sensitive.
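To quantify the barplot rather than eyeball it, the same groupby can be tabulated with a per-device lift column. The revenue values below are hypothetical toy numbers, chosen only to illustrate the pattern of a larger mobile lift:

```python
import pandas as pd

# Toy frame standing in for df above (hypothetical revenue values):
toy = pd.DataFrame({
    'device':  ['mobile', 'mobile', 'web', 'web'],
    'test':    [0, 1, 0, 1],
    'revenue': [0.80, 1.00, 0.75, 0.80],
})

# Mean revenue per device/group, plus relative lift of test over control
table = toy.groupby(['device', 'test'])['revenue'].mean().unstack()
table['lift'] = table[1] / table[0] - 1
print(table)
```

On the real `df`, a larger `lift` for mobile than for web is the numerical version of "mobile users are less price sensitive."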
In [42]:
# check revenue by operating system
sns.barplot(x='operative_system', y='revenue', hue='test', data=df);
plt.xticks(rotation=45);

Insights:

  • Mac/iOS react better to the price increase. Most likely, they are acting as a proxy for different user characteristics.
    • -> It would probably be a good idea to figure out how to get more of those users and target them more via, for instance, ads or ad-hoc marketing campaigns.
  • Linux has 0% conversion rate for test! That’s clearly impossible and most likely a bug.
    • -> Try to remove that segment from the dataset and redo the statistical test
In [43]:
# check the ratio of Linux users
df[df['operative_system']=='linux'].shape[0] / df.shape[0]
Out[43]:
0.01305165357814401

Only about 1% of users are on Linux. Let's remove them from our dataset.

In [44]:
df = df.query("operative_system != 'linux'")
df.groupby('test')[['revenue']].mean()
Out[44]:
revenue
test
0 0.778002
1 0.933531
In [45]:
# perform t-test

t_test = stats.ttest_ind(df[df['test'] == 1]['revenue'],
                         df[df['test'] == 0]['revenue'],
                         equal_var=False)

print('t statistic is: ',t_test.statistic)
print('p-value is: ',t_test.pvalue)
t statistic is:  6.184240028400434
p-value is:  6.253458285742887e-10

The result improves slightly.

In [46]:
# check top 10 cities with most users
top_cities=df['city'].value_counts().reset_index().sort_values('city', ascending=False).head(10)
top_cities
Out[46]:
index city
0 New York 25401
1 Chicago 7071
2 Houston 6608
3 San Antonio 4554
4 Los Angeles 4089
5 San Jose 3312
6 Indianapolis 3177
7 Jacksonville 2865
8 Philadelphia 2488
9 Las Vegas 2375
In [47]:
#plot avg revenue for top 10 cities
sns.barplot(x='city', y='revenue', hue='test', data=df[df['city'].isin(top_cities['index'])]);
plt.xticks(rotation=45);
  • Results appear pretty different by city, but it is hard to draw any conclusions.
  • The city variable has so many levels that, by chance only, we are expected to find a lot of different results.
  • However, combining this with domain/product knowledge could lead to create some hypotheses that could be then A/B tested in a more statistically sound way.
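A back-of-the-envelope calculation shows why a many-level factor like city produces spurious "findings" (the city count below is hypothetical):

```python
n_cities = 900    # hypothetical number of distinct city levels in the data
alpha = 0.05      # per-comparison significance level

# Under the null of no city effect, this many cities would still look
# "significant" at alpha purely by chance
expected_false_positives = n_cities * alpha
print(expected_false_positives)  # 45.0
```

This is the multiple-comparisons problem: without a correction (or a pre-registered hypothesis per segment), dozens of cities will appear to react differently even if none actually do.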

Conclusions:

  • If the goal of the test was trying to increase revenue, then the test has won and we should go ahead with the change.
  • The negative side is that we are losing roughly 22% of our converting customers.
  • The positive side is that larger revenue per user typically means that the company can bid higher for ads or, more generally, invest more money in most marketing channels. So, in the long run, this should give the opportunity of increasing the number of customers and, therefore, total revenue.

Other insights:

  • Friend-referral is by far the best channel. Any UI/UX change that emphasizes it even more is likely to lead to revenue gains
  • After referrals, FB/Google ads have higher revenue per user than any other channel. This is kind of expected. However, SEO is not really that much behind. And SEO is free. Focusing on SEO and increasing its relative share of traffic would lead to much larger profits
  • Mobile has larger revenue per user than web. This is excellent news. Mobile share of traffic has been growing constantly over the last few years. Just by taking advantage of this trend, the company will make more money. If they could increase mobile traffic even more, they would make even more money. Bidding higher for mobile ads is probably a good idea
  • iOS is doing much better than Android. Most likely, this is just because iOS users are less price sensitive than Android users. However, it would still be interesting to investigate whether the actual product (i.e. UI/UX) differs between the two platforms
  • Revenue per user varies greatly across segments. This could lead to potentially understanding why and coming up with different products, some cheaper and some more expensive

[Bonus] After how many days you would have stopped the test?

Let’s say significance level and power are the standard 0.05 and 0.8. Let’s also assume we are interested in detecting a difference in conversion rate of about ~33%, i.e. ~1-39/59. That is, if the new conversion rate is about 66% of the old one we are good.

In [53]:
import statsmodels.stats.api as sms

conversion_rate_control = df_test.query('test==0').converted.mean()
p1_and_p2 = sms.proportion_effectsize(conversion_rate_control, conversion_rate_control*39/59)
sample_size = round(sms.NormalIndPower().solve_power(p1_and_p2, power=0.8, alpha=0.05))
print("The required sample size per group is ~", sample_size)
The required sample size per group is ~ 5548
In [55]:
# check how many users we have per week to figure out how long it will take to get enough users
#create a column that's week of the year
df_test['week_year'] = pd.to_datetime(df_test.timestamp.str.split(' ').str[0]).dt.strftime('%W')
#take avg count of users per week. We remove first and last week, just in case they don't include all 7 days
users_week = df_test['week_year'].value_counts().sort_index()
users_week = users_week.drop(users_week.index[[0,(users_week.shape[0]-1)]]).mean()
  
print("Avg number of users per week is:", round(users_week))
Avg number of users per week is: 24356

If we ran the test on 50% of the users, we would be done in less than a week.

However, it is a good practice to run the test for at least two weeks regardless of that, so that we can get weekly patterns.

In terms of percentage split, we can look at the minimum percentage of users that would need to be in the test to get the required sample size in two weeks.

In [58]:
print("The percentage of traffic that needs to be in test is", round(sample_size/(users_week*2)*100),'%')
The percentage of traffic that needs to be in test is 11 %