Data to Action: How to Improve Conversion Rate


Load Data

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

sns.set_style("ticks")
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


plt.rcParams["figure.figsize"] = (8,6)
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 18
In [2]:
# read from Google drive
df=pd.read_csv("https://drive.google.com/uc?export=download&id=1LU5be_H1TD2Pp1OmI202to3YyKo9AzFY")
df.head()
Out[2]:
country age new_user source total_pages_visited converted
0 UK 25 1 Ads 1 0
1 US 23 1 Seo 5 0
2 US 28 1 Seo 4 0
3 China 39 1 Seo 5 0
4 US 30 1 Seo 6 0
In [3]:
df.shape
Out[3]:
(316200, 6)

Features

  • country : user country, inferred from the IP address

  • age : user age, self-reported at the sign-up step

  • new_user : whether the user created the account during this session or already had an account and simply came back to the site

  • source : marketing channel source

    • Ads: came to the site by clicking on an advertisement
    • Seo: came to the site by clicking on search results
    • Direct: came to the site by typing the URL directly into the browser
  • total_pages_visited : total number of pages visited during the session; a proxy for time spent on site and engagement

  • converted : the label. 1 means the user converted within the session; 0 means they left without buying anything.

The company's goal is to increase the conversion rate: # conversions / total sessions.
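
As a baseline, this rate can be computed directly from the label column of the loaded frame; a minimal sketch:

In [ ]:
# baseline conversion rate: # conversions / total sessions
baseline_cr = df['converted'].mean()
print(f"Overall conversion rate: {baseline_cr:.2%}")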

Descriptive Stats

Goal: identifying bad records and dealing with them is a crucial first step.

In [4]:
# numerical features
df.describe()
Out[4]:
age new_user total_pages_visited converted
count 316200.000000 316200.000000 316200.000000 316200.000000
mean 30.569858 0.685465 4.872966 0.032258
std 8.271802 0.464331 3.341104 0.176685
min 17.000000 0.000000 1.000000 0.000000
25% 24.000000 0.000000 2.000000 0.000000
50% 30.000000 1.000000 4.000000 0.000000
75% 36.000000 1.000000 7.000000 0.000000
max 123.000000 1.000000 29.000000 1.000000
In [5]:
# categorical features
df.country.value_counts()
Out[5]:
US         178092
China       76602
UK          48450
Germany     13056
Name: country, dtype: int64
In [6]:
df.source.value_counts()
Out[6]:
Seo       155040
Ads        88740
Direct     72420
Name: source, dtype: int64

Quick observations:

  • the site is probably a US site, although it also has a large Chinese user base
  • the user base is quite young
  • a conversion rate of around 3% is in line with industry norms, so the label looks plausible

Anomalous data:

  • everything looks reasonable except the maximum age of 123 years!

Remove Outliers

In [7]:
# look into extreme high age
sorted(df.age.unique(), reverse = True)
Out[7]:
[123, 111, 79, 77, 73, 72, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58,
 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17]
In [8]:
df[df['age'] > 100]
Out[8]:
country age new_user source total_pages_visited converted
90928 Germany 123 0 Seo 15 1
295581 UK 111 0 Ads 10 1
In [9]:
# drop the two records with implausible ages (111 and 123)
df = df[df['age'] < 110]

Exploratory Data Analysis

In [10]:
df[['country', 'converted']].groupby('country').mean()
Out[10]:
converted
country
China 0.001332
Germany 0.062428
UK 0.052612
US 0.037801
In [15]:
# conversion rate by country

# set figure size
fig = plt.figure(figsize = (12, 6))

ax1 = fig.add_subplot(121) 
sns.countplot(x = 'country', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Country", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'country', y = 'converted', data = df);
ax2.set(title="Mean Conversion by Country");

Observation:

  • Chinese users convert at a much lower rate than users from other countries.
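
The gap is visible by eye, but a two-proportion z-test makes it concrete; a minimal sketch, assuming statsmodels is installed:

In [ ]:
from statsmodels.stats.proportion import proportions_ztest

# two-proportion z-test: China vs. the rest of the user base
china = df['country'] == 'China'
successes = [df.loc[china, 'converted'].sum(), df.loc[~china, 'converted'].sum()]
totals = [china.sum(), (~china).sum()]
stat, pval = proportions_ztest(successes, totals)
print(f"z = {stat:.1f}, p = {pval:.2e}")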
In [16]:
# plot mean conversion wrt age
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
# distplot is deprecated; histplot with kde is the modern equivalent
sns.histplot(df[df['converted']==0]['age'], kde=True, stat='density', label='converted 0', ax=ax1);
sns.histplot(df[df['converted']==1]['age'], kde=True, stat='density', label='converted 1', ax=ax1);
ax1.set(title="Distribution of Conversion by User Age",xlim=(10,80));
plt.legend()

ax2 = fig.add_subplot(122) 
grouped_data = df[['age','converted']].groupby('age').mean().reset_index()
sns.lineplot(x = 'age', y='converted', data = grouped_data,markers=True)
ax2.set(title="Mean Converstion by User Age");

Observation:

  • Conversion is highest for users aged roughly 20-30
  • Conversion declines steadily for older users
In [17]:
# plot mean conversion wrt user type
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'new_user', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by user type", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'new_user', y = 'converted', data = df);
ax2.set(title="Mean Converstion by user type");

Observation:

  • New users convert at a much lower rate than returning users
In [18]:
# plot mean conversion wrt source
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'source', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Source", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'source', y = 'converted', data = df);
ax2.set(title="Mean Converstion by Source");
In [19]:
conversion_by_source = df.groupby(['converted', 'source']).size()
conversion_by_source = conversion_by_source.unstack(level = 0)
conversion_by_source
Out[19]:
converted 0 1
source
Ads 85680 3059
Direct 70380 2040
Seo 149940 5099
In [20]:
conversion_by_source.plot.bar(stacked = True);

Observation:

  • Ads has the highest average conversion rate
  • Most users come from SEO
In [21]:
# plot conversion rate wrt total_page_visited
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'total_pages_visited', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Total_Pages_Visited", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.lineplot(x="total_pages_visited", y="converted",estimator="mean", data=df);
ax2.set(title="Mean Converstion by Total_Pages_Visited");

Observation:

  • Spending more time on the site (more pages visited) is strongly associated with a higher probability of conversion
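
The same relationship can be read off numerically; a one-line sketch of the grouped means behind the plot above:

In [ ]:
# mean conversion rate at each page count (the curve plotted above, as numbers)
df.groupby('total_pages_visited')['converted'].mean()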

Feature Engineering

In [22]:
df.head()
Out[22]:
country age new_user source total_pages_visited converted
0 UK 25 1 Ads 1 0
1 US 23 1 Seo 5 0
2 US 28 1 Seo 4 0
3 China 39 1 Seo 5 0
4 US 30 1 Seo 6 0
In [23]:
# check data type
df.dtypes
Out[23]:
country                object
age                     int64
new_user                int64
source                 object
total_pages_visited     int64
converted               int64
dtype: object
In [24]:
# check missing value
df.isnull().sum()
Out[24]:
country                0
age                    0
new_user               0
source                 0
total_pages_visited    0
converted              0
dtype: int64
In [25]:
# one-hot encoding for categorical variables
df_cleaned = pd.get_dummies(df, drop_first=True)
df_cleaned.head()
Out[25]:
age new_user total_pages_visited converted country_Germany country_UK country_US source_Direct source_Seo
0 25 1 1 0 0 1 0 0 0
1 23 1 5 0 0 0 1 0 1
2 28 1 4 0 0 0 1 0 1
3 39 1 5 0 0 0 0 0 1
4 30 1 6 0 0 0 1 0 1

Random Forest Model

In [26]:
# specify features and target
target = df_cleaned['converted']
features = df_cleaned.drop('converted', axis = 1)
In [27]:
# check target distribution
df_cleaned['converted'].value_counts()
Out[27]:
0    306000
1     10198
Name: converted, dtype: int64
In [28]:
# split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, 
                                                    target, 
                                                    test_size=0.2, 
                                                    random_state=1)
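
Given the ~3% positive rate, an optional variant is to stratify the split so both sets keep the same class balance; a sketch (the results below use the unstratified split above, and the `_s` names are just for illustration):

In [ ]:
# stratified variant: preserves the ~3% positive rate in both train and test
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    features, target, test_size=0.2, random_state=1, stratify=target)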
In [29]:
# train a random forest model
from sklearn.ensemble import RandomForestClassifier

rf= RandomForestClassifier(random_state=0, 
                            oob_score=True, 
                            n_jobs=-1)

# Train model 
model = rf.fit(x_train, y_train)
In [30]:
# predict
train_preds = rf.predict_proba(x_train)[:,1]
test_preds = rf.predict_proba(x_test)[:,1]
In [31]:
# use AUC score as the major metric to evaluate the model
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_train, train_preds)
auc_train = metrics.auc(fpr, tpr)
print("Training Set AUC:",auc_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds)
auc_test = metrics.auc(fpr, tpr)
print("Test Set AUC:",auc_test)
Training Set AUC: 0.9936961573011132
Test Set AUC: 0.9528837330582031
In [32]:
# create ROC curve for the random forest model
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

false_positive_rate, true_positive_rate, threshold = roc_curve(y_train, train_preds)
false_positive_rate_test, true_positive_rate_test, threshold_test = roc_curve(y_test, test_preds)

# plot ROC currve
plt.title("ROC Curve for Random Forest Model")
plt.plot(false_positive_rate, true_positive_rate,label='Train ROC Curve (area = %0.3f)' % roc_auc_score(y_train, train_preds))
plt.plot(false_positive_rate_test, true_positive_rate_test,label='Test ROC Curve (area = %0.3f)' % roc_auc_score(y_test, test_preds))
plt.plot([0,1], ls="--")
plt.plot([0,0], [1,0], c=".7"), plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend()
plt.show()
In [33]:
feat_importances = pd.Series(rf.feature_importances_, index=x_train.columns)

plt.figure(figsize = (12, 6))
feat_importances.sort_values().plot(kind='barh')
# Create plot title 
plt.title("Feature Importance")
plt.show()

Observation:

  • Total pages visited is the most important one. Unfortunately, it is probably the least “actionable”. People visit many pages because they already want to buy. Also, in order to buy, you have to click on multiple pages.
  • The training AUC is noticeably higher than the test AUC, which means the present model is overfitting.

Retrain the model:

  • Rebuild the RF without total_pages_visited.
  • Tune the hyperparameters (a minimal grid-search sketch appears below)
  • Since the classes are heavily imbalanced, we can reweight them
In [34]:
x_train_new = x_train.drop('total_pages_visited', axis = 1)
x_test_new =  x_test.drop('total_pages_visited', axis = 1)
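
For the tuning step mentioned above, a minimal grid search could look like the sketch below; the grid values are illustrative assumptions, not the settings used in the next cell:

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# illustrative grid only; values are assumptions for the sketch
param_grid = {
    'max_depth': [10, 20, None],
    'max_features': [2, 3, 'sqrt'],
    'n_estimators': [100, 300],
}
search = GridSearchCV(RandomForestClassifier(class_weight={0:1, 1:10}, n_jobs=-1),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(x_train_new, y_train)
print(search.best_params_, search.best_score_)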
In [35]:
# train a random forest model
from sklearn.ensemble import RandomForestClassifier
np.random.seed(4684)
rf2= RandomForestClassifier(max_features=3,
                            n_estimators=100, 
                            class_weight={0:1, 1:10},
                            max_depth =20,
                            oob_score=True,
                            n_jobs=-1)

# Train model 
model2 = rf2.fit(x_train_new, y_train)
In [36]:
# predict
train_preds2 = model2.predict_proba(x_train_new)[:,1]
test_preds2 = model2.predict_proba(x_test_new)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y_train, train_preds2)
auc_train = metrics.auc(fpr, tpr)
print("Training Set AUC:",auc_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds2)
auc_test = metrics.auc(fpr, tpr)
print("Test Set AUC:",auc_test)
Training Set AUC: 0.8266106294251536
Test Set AUC: 0.8143804338291096

Observation:

  • Train and test AUC are now close, so the model is neither overfitting nor underfitting.

Performance Metric

In [37]:
# create ROC curve for the random forest model

false_positive_rate, true_positive_rate, threshold = roc_curve(y_train, train_preds2)
false_positive_rate_test, true_positive_rate_test, threshold_test = roc_curve(y_test, test_preds2)

# plot ROC currve
plt.title("ROC Curve for Random Forest Model")
plt.plot(false_positive_rate, true_positive_rate,label='Train ROC Curve (area = %0.3f)' % roc_auc_score(y_train, train_preds2))
plt.plot(false_positive_rate_test, true_positive_rate_test,label='Test ROC Curve (area = %0.3f)' % roc_auc_score(y_test, test_preds2))
plt.plot([0,1], ls="--")
plt.plot([0,0], [1,0], c=".7"), plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend()
plt.show()
In [38]:
## Accuracy
test_preds_outcome = np.where(test_preds2 > 0.5, 1, 0)

from sklearn.metrics import accuracy_score
# note: with only ~3% positives, accuracy is dominated by the majority class,
# so read it alongside the confusion matrix below
accuracy = accuracy_score(y_test, test_preds_outcome)
accuracy
Out[38]:
0.892314990512334
In [39]:
# confusion matrix
from sklearn.metrics import confusion_matrix 

#use pandas 'crosstab' function to produce a more readable confusion matrix
cm = pd.crosstab(y_test, test_preds_outcome,
            rownames=['Actual'], colnames=['Predicted'], margins=True)
cm
Out[39]:
Predicted      0     1    All
Actual
0          55544  5666  61210
1           1144   886   2030
All        56688  6552  63240
In [40]:
## Choose a threshold by maximizing Youden's J statistic (tpr - fpr)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds2)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
optimal_threshold
Out[40]:
0.2438375413016584
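
Re-binarizing the test predictions at this threshold (rather than the default 0.5) shows the trade: more true conversions recovered at the cost of more false positives. A sketch reusing the objects above:

In [ ]:
# re-binarize test predictions at the Youden-optimal threshold
test_preds_opt = np.where(test_preds2 > optimal_threshold, 1, 0)
pd.crosstab(y_test, test_preds_opt, rownames=['Actual'], colnames=['Predicted'], margins=True)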

Model Explanation

In [41]:
# feature importance
feat_importances = pd.Series(rf2.feature_importances_, index=x_train_new.columns)
feat_importances.sort_values().plot(kind='barh')
plt.show()
Observation:

  • New user is the most important feature
  • Continuous variables tend to show up near the top of impurity-based importance rankings, which can overstate them (see the cross-check below)
  • Source-related variables don't seem to matter at all
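
A useful cross-check is permutation importance on the held-out set, which is less biased toward continuous features; a minimal sketch using scikit-learn's permutation_importance:

In [ ]:
from sklearn.inspection import permutation_importance

# shuffle each feature on the test set and measure the AUC drop
perm = permutation_importance(rf2, x_test_new, y_test,
                              scoring='roc_auc', n_repeats=5, random_state=0)
pd.Series(perm.importances_mean, index=x_test_new.columns).sort_values()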
In [42]:
# partial dependence plot
from pdpbox import pdp, info_plots
  
#country
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature=['country_Germany', 'country_UK', 'country_US'], 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='Country')
plt.show()
In [43]:
from pdpbox import pdp, info_plots
  
#source
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature=['source_Direct', 'source_Seo'], 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='source')
plt.show()
In [44]:
from pdpbox import pdp, info_plots
  
# new_user
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature='new_user', 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='New User')
plt.show()
In [45]:
#age
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature='age', 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.feature_grids)
pdp_dataset.plot(title='Age')
plt.show()
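
pdpbox's API differs across versions; as a sanity check, scikit-learn's built-in partial dependence (assuming scikit-learn >= 1.0) gives the same kind of plot without the extra dependency. A minimal sketch:

In [ ]:
from sklearn.inspection import PartialDependenceDisplay

# same idea as the pdpbox plots above, using scikit-learn's built-in PDP
PartialDependenceDisplay.from_estimator(rf2, x_train_new, ['age', 'new_user'])
plt.show()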

Insights:

  • Returning users convert much better than new users
  • Germany, UK, and US are similar, with Germany the best. Most importantly, all three have very high partial dependence values. Read relative to the reference level, which is China, this means that being from any of those three countries rather than China significantly increases the probability of conversion. That is, China is very bad for conversion
  • The site works very well for young people and gets worse for users over 30
  • Source is less relevant
In [47]:
# tree segmentation
# Let’s now build a simple decision tree and check the 2 or 3 most important segments:
import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image
from IPython.display import display
  
tree = DecisionTreeClassifier(max_depth=2,class_weight={0:1, 1:10}, min_impurity_decrease = 0.001)
tree.fit(x_train_new, y_train)
  
#visualize it tree plot
dot_data = export_graphviz(tree, 
                           out_file=None, 
                           feature_names=x_train_new.columns,
                           proportion=True,
                           rotate=True,
                           filled=True)


graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))

A simple, shallow tree confirms the random forest findings exactly.
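
If graphviz or pydotplus is unavailable, the same segments can be read as plain-text rules via scikit-learn's built-in exporter; a minimal sketch:

In [ ]:
from sklearn.tree import export_text

# plain-text version of the tree above; the top splits define the key segments
print(export_text(tree, feature_names=list(x_train_new.columns)))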

Conclusion and Next Step

Which features matter and how to improve:

  1. The site works very well for young users. Marketing should advertise on channels that are more likely to reach young people.

  2. The site converts very well in Germany, but the summary showed few German users coming to the site: far fewer than from the UK, despite Germany's larger population. Marketing should acquire more German users. Big opportunity.

  3. Users with older accounts do much better. Targeted emails with offers to bring them back to the site would be a good idea to try.

  4. Review the UI to figure out why older users convert so poorly: conversion clearly starts dropping from around age 30. A good actionable metric here is the conversion rate for users >= 30 years old (computed below); building a team whose goal is to increase that number would be worthwhile.

  5. Something is wrong with the Chinese version of the site: it may be poorly translated, may not fit the local culture, or may have payment issues. Given how many users are based in China, fixing this should be a top priority. Huge opportunity.
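
For reference, the actionable metric proposed in point 4 is easy to track; a minimal sketch on the cleaned data:

In [ ]:
# actionable metric from point 4: conversion rate among users aged 30 and over
cr_30_plus = df[df['age'] >= 30]['converted'].mean()
print(f"Conversion rate, age >= 30: {cr_30_plus:.2%}")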
