Data to Action: How to Improve Conversion Rate¶

Load Data¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

sns.set_style("ticks")
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


plt.rcParams["figure.figsize"] = (8,6)
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 18

# read from Google drive
df=pd.read_csv("https://drive.google.com/uc?export=download&id=1LU5be_H1TD2Pp1OmI202to3YyKo9AzFY")
df.head()

df.shape

(316200, 6)

Feature

country : user country based on the IP address
age : user age. Self-reported at sign-up step
new_user : whether the user created the account during this session or already had an account and simply came back to the site
source : marketing channel source
- Ads: came to the site by clicking on an advertisement
- Seo: came to the site by clicking on search results
- Direct: came to the site by directly typing the URL on the browser

total_pages_visited: number of total pages visited during the session. This can be seen as a proxy for time spent on site and engagement
converted: this is our label. 1 means they converted within the session, 0 means they left without buying anything.

The company goal is to increase conversion rate: # conversions / total sessions
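
As a minimal sketch of that definition, assuming one row per session and a 0/1 `converted` column, the conversion rate is simply the mean of the label. The toy frame below is made up for illustration and is not the real data set.

```python
import pandas as pd

# Hypothetical toy sessions, not the real data set: one row per session.
sessions = pd.DataFrame({"converted": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

n_conversions = int(sessions["converted"].sum())
n_sessions = len(sessions)

# conversion rate = # conversions / total sessions
conversion_rate = n_conversions / n_sessions

# For a 0/1 label the same number is given by the column mean.
print(f"{n_conversions} conversions / {n_sessions} sessions = {conversion_rate:.1%}")
print("column mean:", sessions["converted"].mean())
```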

Descriptive Stats¶

Goal: identify wrong or implausible data and deal with it before any analysis. This is a crucial step.

# numerical features
df.describe()

# categorical features
df.country.value_counts()

US         178092
China       76602
UK          48450
Germany     13056
Name: country, dtype: int64

df.source.value_counts()

Seo       155040
Ads        88740
Direct     72420
Name: source, dtype: int64

Quick observations:

the site is probably a US site, although it has a large Chinese user base as well
the user base is fairly young (median age 30)
the overall conversion rate of about 3% is in line with industry standards, so it looks plausible

Anomaly data:

everything seems to make sense here except for max age 123 yrs!

Remove Outliers¶

# look into extreme high age
sorted(df.age.unique(), reverse = True)

[123,
 111,
 79,
 77,
 73,
 72,
 70,
 69,
 68,
 67,
 66,
 65,
 64,
 63,
 62,
 61,
 60,
 59,
 58,
 57,
 56,
 55,
 54,
 53,
 52,
 51,
 50,
 49,
 48,
 47,
 46,
 45,
 44,
 43,
 42,
 41,
 40,
 39,
 38,
 37,
 36,
 35,
 34,
 33,
 32,
 31,
 30,
 29,
 28,
 27,
 26,
 25,
 24,
 23,
 22,
 21,
 20,
 19,
 18,
 17]

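# inspect the rows with implausible ages
df[df['age'] > 100]

# keep only plausible ages; this drops the two rows with ages 111 and 123
df = df[df['age'] < 110]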

Exploratory Data Analysis¶

df[['country', 'converted']].groupby('country').mean()

# conversion rate by country
# (matplotlib and seaborn are already imported above)

# set the figure size
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'country', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Country", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'country', y = 'converted', data = df);
ax2.set(title="Mean Conversion by Country");

Observation:

Users from China convert at a much lower rate (about 0.1%) than users from the other countries (roughly 4-6%).

# plot mean conversion wrt age
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.distplot(x = df[df['converted']==0]['age'],kde=True,label = 'converted 0');
sns.distplot(x = df[df['converted']==1]['age'],kde=True,label = 'converted 1');
ax1.set(title="Distribution of Conversion by User Age",xlim=(10,80));
plt.legend()

ax2 = fig.add_subplot(122) 
grouped_data = df[['age','converted']].groupby('age').mean().reset_index()
sns.lineplot(x = 'age', y='converted', data = grouped_data,markers=True)
ax2.set(title="Mean Conversion by User Age");

Observation:

Conversion rate is high for users aged 20-30
Older users have a lower conversion rate

# plot mean conversion wrt user type
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'new_user', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by user type", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'new_user', y = 'converted', data = df);
ax2.set(title="Mean Conversion by user type");

Observation:

New users tend to have a lower conversion rate than returning users

# plot mean conversion wrt source
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'source', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Source", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.barplot(x = 'source', y = 'converted', data = df);
ax2.set(title="Mean Conversion by Source");

# count sessions per (converted, source) pair, then move 'converted' to the columns
conversion_by_source = df.groupby(['converted','source']).size()
conversion_by_source = conversion_by_source.unstack(level = 0)
conversion_by_source

conversion_by_source.plot.bar(stacked = True);

Observation:

Ads has the highest conversion rate on average (about 3.4%, versus 3.3% for Seo and 2.8% for Direct)
Seo brings in the most users

# plot conversion rate wrt total_pages_visited
fig = plt.figure(figsize = (12, 6))

ax1= fig.add_subplot(121) 
sns.countplot(x = 'total_pages_visited', hue = 'converted', data = df);
ax1.set(title="Distribution of Conversion by Total_Pages_Visited", ylabel = 'log count');
ax1.set_yscale('log');

ax2 = fig.add_subplot(122) 
sns.lineplot(x="total_pages_visited", y="converted",estimator="mean", data=df);
ax2.set(title="Mean Conversion by Total_Pages_Visited");

Observation:

Visiting more pages is strongly associated with a higher probability of conversion.

Feature Engineering¶

df.head()

# check data type
df.dtypes

country                object
age                     int64
new_user                int64
source                 object
total_pages_visited     int64
converted               int64
dtype: object

# check missing value
df.isnull().sum()

country                0
age                    0
new_user               0
source                 0
total_pages_visited    0
converted              0
dtype: int64

# one-hot encoding for categorical variables
df_cleaned = pd.get_dummies(df, drop_first=True)
df_cleaned.head()
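
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows that reuse the data set's category names. It shows that the alphabetically first level of each categorical column (`China` and `Ads`) gets no dummy column and so becomes the implicit reference level.

```python
import pandas as pd

# Hypothetical toy rows using the same category names as the data set.
toy = pd.DataFrame({
    "country": ["China", "Germany", "UK", "US"],
    "source": ["Ads", "Direct", "Seo", "Ads"],
    "age": [30, 25, 41, 22],
})

encoded = pd.get_dummies(toy, drop_first=True)

# drop_first=True drops the first (alphabetically sorted) level of each
# categorical column, so 'China' and 'Ads' have no column of their own.
print(sorted(encoded.columns))

# Row 0 is (China, Ads): every dummy column is 0 for the reference levels.
print(encoded.iloc[0])
```

This matters later when reading the model: any effect attributed to `country_US`, for example, is measured relative to China.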

Random Forest Model¶

# specify features and target
target = df_cleaned['converted']
features =df_cleaned.drop('converted', axis = 1)

# check target distribution
df_cleaned['converted'].value_counts()

0    306000
1     10198
Name: converted, dtype: int64

# split training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, 
                                                    target, 
                                                    test_size=0.2, 
                                                    random_state=1)

# train a random forest model
from sklearn.ensemble import RandomForestClassifier

rf= RandomForestClassifier(random_state=0, 
                            oob_score=True, 
                            n_jobs=-1)

# Train model 
model = rf.fit(x_train, y_train)

# predict
train_preds = rf.predict_proba(x_train)[:,1]
test_preds = rf.predict_proba(x_test)[:,1]

# use AUC score as the major metric to evaluate the model
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_train, train_preds)
auc_train = metrics.auc(fpr, tpr)
print("Training Set AUC:",auc_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds)
auc_test = metrics.auc(fpr, tpr)
print("Test Set AUC:",auc_test)

Training Set AUC: 0.9936961573011132
Test Set AUC: 0.9528837330582031

# create ROC curve for the random forest model
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

false_positive_rate, true_positive_rate, threshold = roc_curve(y_train, train_preds)
false_positive_rate_test, true_positive_rate_test, threshold_test = roc_curve(y_test, test_preds)

# plot ROC curve
plt.title("ROC Curve for Random Forest Model")
plt.plot(false_positive_rate, true_positive_rate,label='Train ROC Curve (area = %0.3f)' % roc_auc_score(y_train, train_preds))
plt.plot(false_positive_rate_test, true_positive_rate_test,label='Test ROC Curve (area = %0.3f)' % roc_auc_score(y_test, test_preds))
# diagonal: a classifier that guesses at random
plt.plot([0,1], ls="--")
# left and top edges: a perfect classifier
plt.plot([0,0], [1,0], c=".7")
plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend()
plt.show()

feat_importances = pd.Series(rf.feature_importances_, index=x_train.columns)

plt.figure(figsize = (12, 6))
feat_importances.sort_values().plot(kind='barh')
# Create plot title 
plt.title("Feature Importance")
plt.show()

Observation:

Total pages visited is the most important one. Unfortunately, it is probably the least “actionable”. People visit many pages because they already want to buy. Also, in order to buy, you have to click on multiple pages.
The training AUC (0.994) is noticeably higher than the test AUC (0.953), which suggests that the current model is overfitting.
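
The forest above is created with `oob_score=True`, which gives a second way to look at overfitting without touching the test set. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the session data) and compares the training AUC with the out-of-bag AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for the session data (illustration only).
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Each row is scored only by the trees that did not see it during training,
# so the out-of-bag numbers behave like a built-in validation estimate.
oob_accuracy = forest.oob_score_   # accuracy is the default OOB metric
oob_auc = roc_auc_score(y, forest.oob_decision_function_[:, 1])
train_auc = roc_auc_score(y, forest.predict_proba(X)[:, 1])

print(f"train AUC: {train_auc:.3f}  OOB AUC: {oob_auc:.3f}  OOB accuracy: {oob_accuracy:.3f}")
```

A large gap between the training AUC and the out-of-bag AUC points in the same direction as a gap between training and test AUC.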

Retrain the model:

Rebuild the RF without total_pages_visited.
Tune the hyperparameters
Since classes are heavily unbalanced, we can reweigh the classes
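
To put the imbalance in numbers, the sketch below takes the class counts from the `value_counts()` output above and computes scikit-learn's "balanced" class weights, which are defined as n_samples / (n_classes * count_per_class). It is only a reference point for judging a hand-picked weight such as `{0:1, 1:10}`, not a recommendation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the value_counts() output above.
counts = {0: 306000, 1: 10198}
y = np.repeat(list(counts.keys()), list(counts.values()))

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
balanced = {cls: float(w) for cls, w in zip([0, 1], weights)}

imbalance_ratio = counts[0] / counts[1]
relative_weight = balanced[1] / balanced[0]
print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"balanced weights: {balanced}")
print(f"weight of class 1 relative to class 0: {relative_weight:.1f}")
```

There are about 30 non-converting sessions for every converting one, so a 1:10 weighting is a milder correction than full balancing.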

x_train_new = x_train.drop('total_pages_visited', axis = 1)
x_test_new =  x_test.drop('total_pages_visited', axis = 1)

# train a random forest model
from sklearn.ensemble import RandomForestClassifier
np.random.seed(4684)
rf2= RandomForestClassifier(max_features=3,
                            n_estimators=100, 
                            class_weight={0:1, 1:10},
                            max_depth =20,
                            oob_score=True,
                            n_jobs=-1)

# Train model 
model2 = rf2.fit(x_train_new, y_train)

# predict
train_preds2 = model2.predict_proba(x_train_new)[:,1]
test_preds2 = model2.predict_proba(x_test_new)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y_train, train_preds2)
auc_train = metrics.auc(fpr, tpr)
print("Training Set AUC:",auc_train)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds2)
auc_test = metrics.auc(fpr, tpr)
print("Test Set AUC:",auc_test)

Training Set AUC: 0.8266106294251536
Test Set AUC: 0.8143804338291096

Observation:

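The gap between training AUC (0.827) and test AUC (0.814) is now small, so the model is no longer overfitting noticeably. The AUC is lower than before because the strongest predictor, total_pages_visited, has been removed.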

Performance Metric¶

# create ROC curve for the random forest model

false_positive_rate, true_positive_rate, threshold = roc_curve(y_train, train_preds2)
false_positive_rate_test, true_positive_rate_test, threshold_test = roc_curve(y_test, test_preds2)

# plot ROC curve
plt.title("ROC Curve for Random Forest Model")
plt.plot(false_positive_rate, true_positive_rate,label='Train ROC Curve (area = %0.3f)' % roc_auc_score(y_train, train_preds2))
plt.plot(false_positive_rate_test, true_positive_rate_test,label='Test ROC Curve (area = %0.3f)' % roc_auc_score(y_test, test_preds2))
# diagonal: a classifier that guesses at random
plt.plot([0,1], ls="--")
# left and top edges: a perfect classifier
plt.plot([0,0], [1,0], c=".7")
plt.plot([1, 1], c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.legend()
plt.show()

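## Accuracy
# classify a session as converted when the predicted probability exceeds 0.5
test_preds_outcome = np.where(test_preds2 > 0.5, 1, 0)

from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, test_preds_outcome)
accuracy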

0.892314990512334

# confusion matrix
from sklearn.metrics import confusion_matrix 

#use pandas 'crosstab' function to produce a more readable confusion matrix
cm = pd.crosstab(y_test, test_preds_outcome,
            rownames=['Actual'], colnames=['Predicted'], margins=True)
cm
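
Accuracy alone is hard to interpret with classes this unbalanced. As a worked check, the sketch below plugs the counts from the confusion matrix reported for the test set into the usual formulas; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix reported for the test set.
tn, fp = 55544, 5666   # actual 0: predicted 0 / predicted 1
fn, tp = 1144, 886     # actual 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total   # accuracy of always predicting 0
precision = tp / (tp + fp)              # share of predicted converters that convert
recall = tp / (tp + fn)                 # share of actual converters that are caught

print(f"accuracy          {accuracy:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
print(f"precision         {precision:.4f}")
print(f"recall            {recall:.4f}")
```

The accuracy of about 0.89 is lower than the roughly 0.97 that always predicting "not converted" would score, so precision and recall (or the AUC used above) say more about this model than accuracy does.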

## Choose a threshold
# pick the ROC point that maximises Youden's J statistic (TPR - FPR)
fpr, tpr, thresholds = metrics.roc_curve(y_test, test_preds2)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
optimal_threshold

0.2438375413016584
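
The rule used here, maximising TPR - FPR over the ROC thresholds, is known as Youden's J statistic. The sketch below wraps it in a helper and runs it on made-up labels and scores small enough to verify by hand; `youden_threshold` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Return the ROC threshold that maximises J = TPR - FPR, and J itself."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    j = tpr - fpr
    best = int(np.argmax(j))
    return thresholds[best], j[best]

# Hypothetical toy labels and scores, chosen to keep the arithmetic easy to follow.
y_true = np.array([0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.25, 0.3, 0.4, 0.7, 0.8])

threshold, j_stat = youden_threshold(y_true, scores)

# roc_curve treats a sample as positive when score >= threshold.
predicted = (scores >= threshold).astype(int)
print(threshold, j_stat, predicted)
```

At a threshold of 0.3 all three positives are caught (TPR = 1) at the cost of one false positive out of four negatives (FPR = 0.25), giving J = 0.75. In practice it is safer to choose the threshold on a validation split rather than on the test set, so that the test metrics stay unbiased.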

Model Explanation¶

# feature importance
feat_importances = pd.Series(rf2.feature_importances_, index=x_train_new.columns)
feat_importances.sort_values().plot(kind='barh')
plt.show()

New user is the most important feature
Continuous variables tend to always show up at the top
Source-related variables don't seem to matter at all
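
The importances plotted here are impurity-based, which are known to favour features with many distinct values. One possible cross-check, which is not part of the original analysis, is permutation importance on held-out data. The sketch below uses synthetic data (a binary flag that drives the label and a many-valued noise column, both invented for illustration) to show how the two measures are obtained side by side.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Synthetic data (illustration only): a binary flag that drives the label
# and a many-valued numeric column that is pure noise.
flag = rng.integers(0, 2, size=n)
noise = rng.integers(17, 80, size=n)
y = (rng.random(n) < np.where(flag == 1, 0.6, 0.1)).astype(int)
X = np.column_stack([flag, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance comes from the training data; permutation
# importance measures the drop in held-out AUC when a column is shuffled.
perm = permutation_importance(forest, X_te, y_te, scoring="roc_auc",
                              n_repeats=10, random_state=0)

for name, imp, p in zip(["flag", "noise"], forest.feature_importances_, perm.importances_mean):
    print(f"{name:>5}  impurity={imp:.3f}  permutation={p:.3f}")
```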

# partial dependence plot
from pdpbox import pdp, info_plots
  
#country
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature=['country_Germany', 'country_UK', 'country_US'], 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='Country')
plt.show()

from pdpbox import pdp, info_plots
  
#source
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature=['source_Direct', 'source_Seo'], 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='source')
plt.show()

from pdpbox import pdp, info_plots
  
#new_user
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature='new_user', 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='New User')
plt.show()

#age
pdp_iso = pdp.pdp_isolate(model=rf2, 
                          dataset=x_train_new,      
                          model_features=list(x_train_new), 
                          feature='age', 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.feature_grids)
pdp_dataset.plot(title='Age')
plt.show()

Insights:

Returning users (existing accounts) convert much better than new users
Germany, UK, and US are similar, with Germany being the best. Most importantly, all three have very high values. These should be read relative to the reference level, which is China: being from any of those 3 countries rather than from China significantly increases the probability of conversion. That is, China is very bad for conversion
The site works very well for young people and gets worse for users over 30
Source is less relevant
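
The partial dependence values behind these insights can also be computed by hand: force one feature to a fixed value for every row, average the predicted probability, and repeat for each value of interest. The sketch below assumes synthetic data and a helper named `partial_dependence_1d`, both invented for illustration; it is not pdpbox's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def partial_dependence_1d(model, X, feature, grid):
    """Average predicted P(class 1) when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_forced = X.copy()
        X_forced[feature] = value          # overwrite the column for every row
        averages.append(model.predict_proba(X_forced)[:, 1].mean())
    return pd.Series(averages, index=grid, name=feature)

# Synthetic stand-in data (illustration only), with column names borrowed
# from the notebook.
rng = np.random.default_rng(0)
n = 3000
X = pd.DataFrame({
    "age": rng.integers(17, 70, size=n),
    "new_user": rng.integers(0, 2, size=n),
})
p = np.where(X["new_user"] == 1, 0.05, 0.30)   # returning users convert more often
y = (rng.random(n) < p).astype(int)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X, y)
pd_new_user = partial_dependence_1d(model, X, "new_user", grid=[0, 1])
print(pd_new_user)
```

Because every other column keeps its observed values, the difference between the two averages isolates what the model attributes to `new_user` alone.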

# tree segmentation
# Let’s now build a simple decision tree and check the 2 or 3 most important segments:
import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image
from IPython.display import display
  
tree = DecisionTreeClassifier(max_depth=2,class_weight={0:1, 1:10}, min_impurity_decrease = 0.001)
tree.fit(x_train_new, y_train)
  
# visualize the tree
dot_data = export_graphviz(tree, 
                           out_file=None, 
                           feature_names=x_train_new.columns,
                           proportion=True,
                           rotate=True,
                           filled=True)


graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))

A simple small tree confirms exactly the random forest findings.
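
If Graphviz or pydotplus is not available, the same kind of tree can be inspected as plain text with scikit-learn's `export_text`. The sketch below fits a depth-2 tree on synthetic data (an assumption for illustration, with column names borrowed from the notebook) and prints its split rules.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (illustration only).
rng = np.random.default_rng(1)
n = 5000
X = pd.DataFrame({
    "new_user": rng.integers(0, 2, size=n),
    "country_US": rng.integers(0, 2, size=n),
})
y = (rng.random(n) < np.where(X["new_user"] == 1, 0.02, 0.25)).astype(int)

small_tree = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 10},
                                    random_state=0).fit(X, y)

# Plain-text rendering of the splits; no Graphviz binary or pydotplus required.
rules = export_text(small_tree, feature_names=list(X.columns))
print(rules)
```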

Conclusion and Next Step¶

What feature matters and how to improve:

The site is working very well for young users. Definitely let’s tell marketing to advertise and use channels which are more likely to reach young people.
The site is working very well for Germany in terms of conversion. But the summary showed that there are few Germans coming to the site: way less than UK, despite a larger population. Again, marketing should get more Germans. Big opportunity.
Users with existing accounts do much better. Targeted emails with offers to bring them back to the site could be a good idea to try.
Maybe go through the UI and figure out why users over 30 convert so poorly? From ~30 y/o conversion clearly starts dropping. A good actionable metric here is the conversion rate for people >=30 yr old. Building a team whose goal is to increase that number would be interesting.
Something is wrong with the Chinese version of the site. It is either poorly translated, doesn’t fit the local culture, or maybe some payment issue. Given how many users are based in China, fixing this should be a top priority. Huge opportunity.
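
The metric suggested above, conversion rate for people >=30 yr old, is easy to track once sessions are split at the cutoff. The sketch below is one way to do it, on made-up sessions; `conversion_by_age_band` is a hypothetical helper and the default cutoff of 30 is taken from the text.

```python
import pandas as pd

def conversion_by_age_band(df, cutoff=30):
    """Conversion rate and session count for users below vs. at/above `cutoff`."""
    band = (df["age"] >= cutoff).map({False: f"<{cutoff}", True: f">={cutoff}"})
    return df.groupby(band)["converted"].agg(sessions="size", conversion_rate="mean")

# Hypothetical toy sessions, only to show the shape of the output.
toy = pd.DataFrame({
    "age":       [22, 25, 28, 29, 31, 35, 40, 52],
    "converted": [1,  0,  1,  0,  0,  0,  1,  0],
})
summary = conversion_by_age_band(toy)
print(summary)
```

Reporting the session count next to the rate helps avoid over-reading a rate that rests on very few sessions.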

Output of df.describe():

                 age       new_user  total_pages_visited      converted
count  316200.000000  316200.000000        316200.000000  316200.000000
mean       30.569858       0.685465             4.872966       0.032258
std         8.271802       0.464331             3.341104       0.176685
min        17.000000       0.000000             1.000000       0.000000
25%        24.000000       0.000000             2.000000       0.000000
50%        30.000000       1.000000             4.000000       0.000000
75%        36.000000       1.000000             7.000000       0.000000
max       123.000000       1.000000            29.000000       1.000000

Output of conversion_by_source (session counts by source and converted):

converted       0     1
source
Ads         85680  3059
Direct      70380  2040
Seo        149940  5099

Confusion matrix on the test set (threshold 0.5):

Predicted      0     1    All
Actual
0          55544  5666  61210
1           1144   886   2030
All        56688  6552  63240

Mean conversion by country:

         converted
country
China     0.001332
Germany   0.062428
UK        0.052612
US        0.037801

First rows of the one-hot encoded data (only some of the columns are shown):

   age  new_user  total_pages_visited  country_UK  country_US  source_Seo
0   25         1                    1           1           0           0
1   23         1                    5           0           1           1
2   28         1                    4           0           1           1
3   39         1                    5           0           0           1
4   30         1                    6           0           1           1