Predicting Bank Customer Churn using AWS SageMaker and XGBoost in Local RStudio
- 🥅 Project Goal
- 🗂 Obtain Data
- 🛁 Clean Data
- 🔭 Explore Data
- 🧮 AWS Setup
- 🏗 Train XGBoost
- 🔧 Tune XGBoost
- 🏆 Select Best Model
- 🎯 Predict Holdout
- 📬 Submission
library(tidymodels)
library(tidyverse)
library(corrmorant) # correlation matrix
library(patchwork) # multiple plots to 1 plot
library(reticulate) # calling the SageMaker Python SDK from R
library(pROC) # ROC curves
library(viridis) # color palettes
library(caret) # confusion matrix
🥅 Goal of this Project
Predict whether a bank customer will churn using AWS SageMaker and RStudio
🗂 Obtain Data
The data comes from the Kaggle competition SLICED.
churn_data <- read_csv("~/Documents/R/data_warz/content/post/2021-08-01-sagemaker-r-xgb-churn/train.csv")
holdout <- read_csv("~/Documents/R/data_warz/content/post/2021-08-01-sagemaker-r-xgb-churn/test.csv")
holdout does not contain attrition_flag. Use holdout to run inference after you’ve selected the best tuned model and built an endpoint to host the model. To submit predicted customer churn to the SLICED competition, we’ll need a .csv file that contains only 2 columns: “id” (those found in holdout) and the inferred attrition_flag. That will be our final step.
🛁 Clean Data
glimpse(churn_data)
## Rows: 7,088
## Columns: 15
## $ id <dbl> 8805, 4231, 5263, 2072, 7412, 6142, 558, 1046, 576, 2942, 8801,…
## $ attrition_flag <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_age <dbl> 27, 42, 47, 44, 54, 51, 37, 38, 45, 63, 44, 47, 43, 42, 44, 52,…
## $ gender <chr> "F", "F", "F", "M", "M", "M", "F", "F", "F", "M", "F", "M", "M"…
## $ education_level <chr> "Post-Graduate", "College", "Unknown", "Uneducated", "Graduate"…
## $ income_category <chr> "Less than $40K", "Less than $40K", "Less than $40K", "$80K - $…
## $ total_relationship_count <dbl> 3, 6, 3, 1, 3, 3, 2, 2, 3, 6, 1, 3, 2, 4, 5, 3, 4, 1, 2, 4, 2, …
## $ months_inactive_12_mon <dbl> 2, 4, 3, 3, 3, 2, 1, 3, 4, 2, 1, 3, 3, 3, 3, 2, 3, 2, 4, 3, 1, …
## $ credit_limit <dbl> 1438.3, 3050.0, 1561.0, 25428.0, 2947.0, 34516.0, 3680.0, 9830.…
## $ total_revolving_bal <dbl> 990, 1824, 0, 1528, 2216, 1763, 683, 2055, 1519, 0, 1995, 1340,…
## $ total_amt_chng_q4_q1 <dbl> 0.715, 0.771, 0.502, 0.725, 0.760, 1.266, 0.850, 0.977, 0.887, …
## $ total_trans_amt <dbl> 3855, 1973, 1947, 13360, 1744, 1550, 4835, 1042, 3434, 2399, 75…
## $ total_trans_ct <dbl> 73, 50, 28, 97, 53, 41, 77, 23, 49, 43, 87, 39, 65, 43, 60, 79,…
## $ total_ct_chng_q4_q1 <dbl> 1.147, 1.381, 0.556, 0.796, 0.606, 1.050, 0.750, 0.917, 0.690, …
## $ avg_utilization_ratio <dbl> 0.688, 0.598, 0.000, 0.060, 0.752, 0.051, 0.186, 0.209, 0.291, …
Let’s look at the data dictionary to check we understand what each variable represents.
[Data Dictionary]
Outcome Variable
- attrition_flag: whether the customer churned (0 = no; 1 = yes)
ID
- id: unique identifier for the customer
Categorical Features
- gender
- education_level
- income_category: income range of the customer
Numeric Features
- customer_age
- total_relationship_count: number of relationships the customer holds with the bank
- months_inactive_12_mon: number of months the customer was inactive in the past 12 months
- credit_limit
- total_revolving_bal: customer’s total revolving balance
- total_amt_chng_q4_q1: change in transaction amount from Q4 to Q1
- total_trans_amt: value of all the customer’s transactions in the period
- total_trans_ct: count of all the customer’s transactions
- total_ct_chng_q4_q1: change in the number of the customer’s transactions from Q4 to Q1
- avg_utilization_ratio: customer’s average utilization ratio during the period
I like using the {skimr}
package to quickly check for missing values
and get variable information.
skimr::skim_without_charts(churn_data)
Name | churn_data |
Number of rows | 7088 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 12 |
________________________ | |
Group variables | None |
Data summary
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
gender | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
education_level | 0 | 1 | 7 | 13 | 0 | 7 | 0 |
income_category | 0 | 1 | 7 | 14 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 5055.37 | 2932.28 | 1.0 | 2514.75 | 5039.50 | 7613.25 | 10126.00 |
attrition_flag | 0 | 1 | 0.16 | 0.37 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 |
customer_age | 0 | 1 | 46.31 | 7.99 | 26.0 | 41.00 | 46.00 | 52.00 | 73.00 |
total_relationship_count | 0 | 1 | 3.82 | 1.55 | 1.0 | 3.00 | 4.00 | 5.00 | 6.00 |
months_inactive_12_mon | 0 | 1 | 2.33 | 1.01 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 |
credit_limit | 0 | 1 | 8703.87 | 9152.31 | 1438.3 | 2560.75 | 4588.50 | 11154.25 | 34516.00 |
total_revolving_bal | 0 | 1 | 1169.67 | 815.47 | 0.0 | 461.75 | 1285.00 | 1789.25 | 2517.00 |
total_amt_chng_q4_q1 | 0 | 1 | 0.76 | 0.22 | 0.0 | 0.63 | 0.74 | 0.86 | 3.36 |
total_trans_amt | 0 | 1 | 4360.55 | 3339.10 | 563.0 | 2147.75 | 3896.00 | 4731.75 | 18484.00 |
total_trans_ct | 0 | 1 | 64.65 | 23.34 | 10.0 | 45.00 | 67.00 | 80.00 | 139.00 |
total_ct_chng_q4_q1 | 0 | 1 | 0.71 | 0.24 | 0.0 | 0.58 | 0.70 | 0.82 | 3.50 |
avg_utilization_ratio | 0 | 1 | 0.28 | 0.28 | 0.0 | 0.03 | 0.18 | 0.50 | 1.00 |
Not much cleaning to do, and no missing data. For EDA I want the attrition_flag variable to make it obvious whether the customer churned. I’ll factor the categorical variables and experiment with log transforming some numeric variables.
🔭 Explore Data
churn_data2 <- churn_data %>%
mutate(churned = factor(ifelse(attrition_flag == 1, "yes", "no"),
levels = c("no", "yes"))) %>%
mutate(education_level = fct_relevel(education_level, c("Unknown",
"Uneducated",
"High School",
"College",
"Graduate",
"Doctorate",
"Post-Graduate"))) %>%
mutate(income_category = fct_relevel(income_category, c("Unknown",
"Less than $40K",
"$40K - $60K",
"$60K - $80K",
"$80K - $120K",
"$120K +")))
churn_data3 <- churn_data2 %>%
mutate(across(.cols = c(credit_limit, total_trans_amt),
.fns = log1p)) %>%
mutate(across(.cols = c(credit_limit, total_trans_amt),
.fns = timetk::standardize_vec)) %>%
select(-c(id, attrition_flag))
Let’s look at a correlation matrix of all the numeric variables. We can group by our new churned variable to see features that correlate differently based on the churn outcome.
corfun <- function(x, y) {
round(cor(x, y, method = "pearson", use = "pairwise.complete.obs"), 2)
}
ggcorrm(churn_data3, aes(col = churned, fill = churned), bg_dia = "grey30") +
lotri(geom_point(alpha = 0.1)) +
lotri(geom_smooth(se = FALSE, method = "loess")) +
utri_funtext(fun = corfun, size = 4) +
dia_names(y_pos = 0.15, size = 2) +
dia_density(lower = 0.3, fill = "grey60", color = 1) +
theme_dark() +
scale_color_viridis_d() +
scale_fill_viridis_d() +
labs(title = "Correlation Plot")
It looks to me like total_ct_chng_q4_q1, total_trans_amt, total_trans_ct, credit_limit, and avg_utilization_ratio might be important to churn. I want to see how the numeric feature density plots look and whether any distributions are clearly separated by churn.
dense <- churn_data3 %>%
select(-c(gender, education_level, income_category)) %>%
pivot_longer(cols = c(customer_age:avg_utilization_ratio),
names_to = "feature",
values_to = "value"
ggplot(dense, aes(value, fill = churned)) +
geom_density(alpha = .5) +
facet_wrap(~ feature, scales = "free") +
labs(title = "Numeric features impacting churn?")
We see some clues in the distributions of total_trans_amt, total_trans_ct, total_relationship_count, and total_revolving_bal. months_inactive_12_mon might be important, but mostly when the variable = 1 month.
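To put a rough number on that last observation, here’s a quick check of churn rate by inactivity (my own addition; output not shown):
churn_data %>%
  group_by(months_inactive_12_mon) %>%
  summarise(n = n(), churn_rate = mean(attrition_flag)) %>% # attrition_flag is 0/1, so the mean is the churn rate
  arrange(months_inactive_12_mon)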
Let’s try to find some combinations of these features that lead to churn more often. This is the last data viz for the numeric features.
churn_data %>%
ggplot(aes(total_trans_amt, total_trans_ct, z = attrition_flag)) +
stat_summary_hex(alpha = 0.8, bins = 50) +
scale_fill_viridis_c() +
labs(fill = "% Churned") +
theme_dark()
churn_data %>%
ggplot(aes(total_ct_chng_q4_q1, total_trans_ct, z = attrition_flag)) +
stat_summary_hex(alpha = 0.8, bins = 50) +
scale_fill_viridis_c() +
labs(fill = "% Churned") +
theme_dark()
churn_data %>%
ggplot(aes(credit_limit, avg_utilization_ratio, z = attrition_flag)) +
stat_summary_hex(alpha = 0.8, bins = 50) +
scale_fill_viridis_c() +
labs(fill = "% Churned") +
theme_dark()
churn_data %>%
ggplot(aes(total_trans_amt, total_amt_chng_q4_q1, z = attrition_flag)) +
stat_summary_hex(alpha = 0.8, bins = 60) +
scale_fill_viridis_c() +
labs(fill = "% Churned") +
theme_dark()
It’s clear that churn is much more likely when a customer’s data contains certain combinations of feature values (the yellow regions).
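To sanity-check one of those yellow regions, we can compute the churn rate inside it versus the rest of the data (a hedged sketch; the cutoffs below are read off the plots by eye, so treat them as illustrative):
# Churn rate inside an eyeballed high-risk region vs. everything else
churn_data %>%
  mutate(high_risk = total_trans_ct < 60 & total_trans_amt > 5000) %>%
  group_by(high_risk) %>%
  summarise(n = n(), churn_rate = mean(attrition_flag))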
Now let’s get a count for each categorical variable and a percentage churn for each category to see how balanced the dataset is.
cat_feat <- churn_data2 %>%
select(c(gender, education_level, income_category, churned))
n_gen <- cat_feat %>%
group_by(gender) %>%
summarise(count = n()) %>%
ggplot(aes(gender, count)) +
geom_col() +
coord_flip() +
geom_text(aes(label = count),
color = "white",
size = 3,
fontface = 'bold',
hjust = 1.2,
vjust = 0.4) +
theme_minimal() +
theme(legend.position = 'none',
axis.title.x = element_blank())
n_edu <- cat_feat %>%
group_by(education_level) %>%
summarise(count = n()) %>%
ggplot(aes(education_level, count)) +
geom_col() +
coord_flip() +
geom_text(aes(label = count),
color = "white",
size = 3,
fontface = 'bold',
hjust = 1.2,
vjust = 0.4) +
theme_minimal() +
theme(legend.position = 'none')
n_inc <- cat_feat %>%
group_by(income_category) %>%
summarise(count = n()) %>%
ggplot(aes(income_category, count)) +
geom_col() +
coord_flip() +
geom_text(aes(label = count),
color = "white",
size = 3,
fontface = 'bold',
hjust = 1.2,
vjust = 0.4) +
theme_minimal() +
theme(legend.position = 'none')
n_chrn <- cat_feat %>%
group_by(churned) %>%
summarise(count = n()) %>%
ggplot(aes(churned, count)) +
geom_col() +
coord_flip() +
geom_text(aes(label = count),
color = "white",
size = 3,
fontface = 'bold',
hjust = 1.2,
vjust = 0.4) +
theme_minimal() +
theme(legend.position = 'none',
axis.title.x = element_blank())
n_chrn + n_gen + n_edu + n_inc +
plot_layout(ncol = 2, heights = c(0.7,2)) +
plot_annotation(title = "Counts of Each Categorical Variable in Training Data")
Our churn variable is imbalanced; it would be better if we had more data points for customers who churned. We could upsample the positive churn outcome or downsample the negative churn outcome, but I’ll let it slide this time (a sketch of what that would look like follows).
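For the record, a rebalancing step with {themis} might look like this (a sketch only, not used in this project; assumes the package is installed):
library(themis) # recipe steps for class imbalance
balanced_rec <- recipe(churned ~ ., data = churn_data2 %>% select(-id, -attrition_flag)) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_upsample(churned, over_ratio = 1) # duplicate minority-class rows until the classes are equal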
How balanced are the categorical predictors?
gen <- ggplot(cat_feat %>%
group_by(gender, churned) %>%
summarise(n = n()) %>%
mutate(percent_churned = prop.table(n)*100) %>%
filter(churned == 'yes'),
aes(gender, percent_churned)) +
geom_col() +
coord_flip() +
geom_text(aes(label = round(percent_churned, digits = 1)),
color = "white",
size = 4,
fontface = 'bold',
hjust = 1.3,
vjust = 0.4) +
theme_minimal() +
theme(legend.position = 'none')
# `summarise()` has grouped output by 'gender'. Override using the `.groups` argument.
ed <- ggplot(cat_feat %>%
group_by(education_level, churned) %>%
summarise(n = n()) %>%
mutate(percent_churned = prop.table(n)*100) %>%
filter(churned == 'yes'),
aes(education_level, percent_churned)) +
geom_col() +
coord_flip() +
geom_text(aes(label = round(percent_churned, digits = 1)),
color = "white",
size = 4,
fontface = 'bold',
hjust = 1.3,
vjust = 0.4) +
theme_minimal() +
theme(axis.title.x = element_blank(),
legend.position = 'none')
# `summarise()` has grouped output by 'education_level'. Override using the `.groups` argument.
inc <- ggplot(cat_feat %>%
group_by(income_category, churned) %>%
summarise(n = n()) %>%
mutate(percent_churned = prop.table(n)*100) %>%
filter(churned == 'yes'),
aes(income_category, percent_churned)) +
geom_col() +
coord_flip() +
geom_text(aes(label = round(percent_churned, digits = 1)),
color = "white",
size = 4,
fontface = 'bold',
hjust = 1.3,
vjust = 0.4) +
theme_minimal() +
theme(axis.title.x = element_blank(),
legend.position = 'none')
# `summarise()` has grouped output by 'income_category'. Override using the `.groups` argument.
ed + inc + gen +
plot_layout(ncol = 2, heights = c(2,0.7)) +
plot_annotation(title = "Percentage of Each Categorical Variable that Churned")
cat_feat %>%
pivot_longer(gender:income_category) %>%
ggplot(aes(y = value, fill = churned)) +
geom_bar(position = "fill") +
facet_wrap(vars(name), scales = "free", ncol = 1) +
theme_minimal() +
labs(x = NULL, y = NULL)
At least the data has consistently low churn (12%-21%) across all categorical values. I’d guess the numeric features are more important based on what we’ve seen.
🧮 AWS Setup
Now it’s time to model. You need to set up your own AWS account and get RStudio connected. Alex Lemm has a great repo that guides you through configuring RStudio to work with AWS tools: github.com/alex23lemm. You’ll find a more detailed explanation of my code chunks within his projects if you want a deeper understanding.
We need to set up our AWS SageMaker session using a conda or miniconda
environment and the {reticulate}
package.
This code works on the AWS ‘Free Tier’ btw.
use_condaenv("sagemaker-r", required = TRUE)
sagemaker <- import("sagemaker")
session <- sagemaker$Session()
Select the S3 bucket you want to upload your data and models to, then create a data and a models folder. The data folder will store the training, validation, test, and holdout data XGBoost will use for this project. The models folder will store every training job we run in this project, with an output subfolder containing the model artifacts. Any model used to make predictions will also have a predictions subfolder.
# bucket <- session$default_bucket()
bucket <- "twarczak-sagemaker3"
project <- "churn" # keep short. 32 character max job_name
data_path <- paste0("s3://", bucket, "/", project, "/", "data")
models_path <- paste0("s3://", bucket, "/", project, "/", "models")
Data splitting
Split churn_data into train/validation/test sets. We will use the SageMaker built-in XGBoost algorithm and Bayesian optimization for hyperparameter tuning. Info about this version of XGBoost, its hyperparameters, metrics, etc. can be found here: AWS_XGBoost. I am also using the original data, not the dataframes I manipulated for our EDA.
# Make 70/15/15 Train/Validate/Test split
set.seed(333)
splits <- initial_split(churn_data, prop = 0.70, strata = attrition_flag)
train <- training(splits)
other <- testing(splits)
splits2 <- initial_split(other, prop = 0.50, strata = attrition_flag)
validation <- training(splits2)
test <- testing(splits2)
print(splits)
print(splits2)
Preprocessing the data
Let’s make a simple recipe for our data. XGBoost doesn’t need numeric variable transformations, but we do need to one-hot encode the categorical features. SageMaker XGBoost doesn’t want id variables, and the target (attrition_flag) must be the first column. We’ll also drop the column headers when we write the csv files later.
churn_rec <- recipe(attrition_flag ~ ., data = train) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_rm(id) %>%
prep()
churn_training <- churn_rec %>%
juice() %>%
select(attrition_flag, everything()) # just to put attrition_flag as 1st column
churn_validation <- bake(churn_rec, new_data = validation) %>%
select(attrition_flag, everything()) # just to put attrition_flag as 1st column
churn_test <- bake(churn_rec, new_data = test) %>%
select(attrition_flag, everything()) # just to put attrition_flag as 1st column
id_holdout <- holdout %>% select(id)
churn_holdout <- bake(churn_rec, new_data = holdout)
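A couple of quick sanity checks (my own addition) that the processed frames line up the way SageMaker XGBoost expects:
# attrition_flag must be the first column of the training data...
names(churn_training)[1] == "attrition_flag"         # expect TRUE
# ...and the holdout must contain everything except the target
setdiff(names(churn_training), names(churn_holdout)) # expect "attrition_flag"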
Great. Now all our data is transformed and ready for SageMaker XGBoost. First, though, let’s run the training data through a basic XGBoost model in R, just to look at feature importance.
bst <- xgboost::xgboost(data = data.matrix(subset(churn_training, select = -attrition_flag)),
label = churn_training$attrition_flag,
max_depth = 2, eta = 0.5, nthread = 2, nrounds = 100,
objective = "binary:logistic", verbose = FALSE)
## [22:07:55] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
imp_bst <- xgboost::xgb.importance(model = bst)
xgboost::xgb.ggplot.importance(imp_bst)
We should have looked harder at total_revolving_bal. It might be more important than we thought, but otherwise this mostly agrees with our EDA assessment. We could go back and remove the low-importance features, like income_category, education_level, and gender, but not this time.
Save the pre-processed data. The test set must lack the target variable. Save churn_holdout_id to join to the predicted values at the very end for our SLICED submission.
dir.create("../2021-08-01-sagemaker-r-xgb-churn/data") # local
write_csv(churn_training,
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_training.csv", col_names = FALSE)
write_csv(churn_validation,
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_validation.csv", col_names = FALSE)
write_csv(churn_test %>% select(-attrition_flag),
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_test.csv", col_names = FALSE)
write_csv(churn_test,
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_test_2.csv", col_names = TRUE)
# Need this later to use with results
write_csv(churn_holdout,
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_holdout.csv", col_names = FALSE)
write_csv(id_holdout,
"../2021-08-01-sagemaker-r-xgb-churn/data/churn_holdout_id.csv",col_names = FALSE)
# Need this later to use with holdout results
upload() from S3Uploader moves the preprocessed data to the S3 bucket.
s3_uploader <- sagemaker$s3$S3Uploader()
s3_train <- s3_uploader$upload(local_path = "../2021-08-01-sagemaker-r-xgb-churn/data/churn_training.csv",
desired_s3_uri = data_path)
s3_validation <- s3_uploader$upload(local_path = "../2021-08-01-sagemaker-r-xgb-churn/data/churn_validation.csv", desired_s3_uri = data_path)
s3_test <- s3_uploader$upload(local_path = "../2021-08-01-sagemaker-r-xgb-churn/data/churn_test.csv",
desired_s3_uri = data_path)
🏗 Train XGBoost model
Step 1 - Create an Estimator object
We need:
- image_uri: location of the built-in XGBoost Docker container image in AWS Elastic Container Registry (ECR)
- role: AWS Identity and Access Management (IAM) role
- instance_count: number of EC2 instances
- instance_type: choice of instance
- volume_size: size in GB of the Amazon Elastic Block Store (EBS) volume
- output_path: path to the S3 bucket for storing training results
- sagemaker_session: session object that manages interactions with SageMaker APIs
Setting max_run, use_spot_instances, and max_wait is optional, but it will save you 70-90% on your compute purchase (in my experience).
Note: with {reticulate}, integers in R need to be converted to “long” objects in Python, hence the “L” after integers in the script.
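A quick illustration of that point (my addition):
reticulate::r_to_py(1L) # converts to a Python int
reticulate::r_to_py(1)  # converts to a Python float, which SageMaker
                        # parameters that expect integers may reject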
region <- session$boto_region_name
container <- sagemaker$image_uris$retrieve(framework = "xgboost",
region = region,
version = "1.3-1" )
role_arn <- Sys.getenv("SAGEMAKER_ROLE_ARN")
xgb_estimator <- sagemaker$estimator$Estimator(image_uri = container,
role = role_arn,
instance_count = 1L,
instance_type = "ml.m5.4xlarge",
volume_size = 10L,
max_run = 300L,
output_path = models_path,
sagemaker_session = session,
use_spot_instances = TRUE,
max_wait = 300L )
Step 2 - Set static hyperparameters
No tuning yet. Full list of SageMaker built-in XGBoost hyperparameters here: XGBoost.
We’ll use AUC as the evaluation metric.
xgb_estimator$set_hyperparameters(objective = "binary:logistic",
eval_metric = "auc",
max_depth = 5L,
eta = 0.1,
num_round = 100L,
colsample_bytree = 0.4,
alpha = 10L,
min_child_weight = 1.1,
subsample = 0.7)
Step 3 - Define S3 location of data sets and training job name
# Create training job name based on project organization principles
algo <- "xgb" # keep short. 32 character max job_name
timestamp <- format(Sys.time(), "%Y-%m-%d-%H-%M-%S") # timestamp
job_name_x <- paste(project, algo, timestamp, sep = "-")
s3_train_input <- sagemaker$inputs$TrainingInput(s3_data = s3_train,
content_type = 'csv')
s3_valid_input <- sagemaker$inputs$TrainingInput(s3_data = s3_validation,
content_type = 'csv')
input_data <- list('train' = s3_train_input,
'validation' = s3_valid_input)
Step 4 - Start training job
xgb_estimator$fit(inputs = input_data,
job_name = job_name_x,
wait = FALSE) # If TRUE, call will wait until job completes
session$describe_training_job(job_name_x)[["TrainingJobStatus"]]
## [1] "Completed"
Step 5 - Evaluate training results
training_job_stats <- session$describe_training_job(job_name = job_name_x)
final_metrics_1 <- map_df(training_job_stats$FinalMetricDataList,
~tibble(metric_name = .x[["MetricName"]],
value = .x[["Value"]]))
final_metrics_1
## # A tibble: 2 × 2
## metric_name value
## <chr> <dbl>
## 1 validation:auc 0.989
## 2 train:auc 0.993
Step 6 - Fit model on test data
Use Batch Transform, a SageMaker feature for generating batch inferences.
Create a Transformer object with:
- strategy: ‘MultiRecord’ or ‘SingleRecord’
- assemble_with: ‘Line’ places each prediction on its own line
- output_path: S3 location for storing prediction results
predictions_path_1 <- paste0(models_path, "/", job_name_x, "/predictions")
xgb_batch_predictor_1 <- xgb_estimator$transformer(instance_count = 1L,
instance_type = "ml.m5.4xlarge",
strategy = "MultiRecord",
assemble_with = "Line",
output_path = predictions_path_1 )
Fit the model to the test data with transform():
- split_type: must be “Line” for csv files
xgb_batch_predictor_1$transform(data = s3_test,
content_type = 'text/csv',
split_type = "Line",
job_name = job_name_x,
wait = FALSE) # If TRUE, call waits until job completes
session$describe_transform_job(job_name_x)[["TransformJobStatus"]]
# [1] "Completed"
Step 7 - Download & evaluate predictions
s3_downloader <- sagemaker$s3$S3Downloader()
s3_test_predictions_path_1 <- s3_downloader$list(predictions_path_1)
dir.create("./predictions")
s3_downloader$download(s3_test_predictions_path_1, "./predictions")
test_predictions_1 <- read_csv("./predictions/churn_test.csv.out",
col_names = FALSE) %>%
pull(X1)
test_results_1 <- tibble(truth = churn_test$attrition_flag,
predictions = test_predictions_1 )
head(test_results_1)
## # A tibble: 6 × 2
## truth predictions
## <dbl> <dbl>
## 1 0 0.00919
## 2 0 0.171
## 3 0 0.00493
## 4 0 0.0146
## 5 1 0.846
## 6 0 0.0360
roc_obj_1 <- pROC::roc(test_results_1$truth,
test_results_1$predictions,
plot = TRUE,
grid = TRUE,
print.auc = TRUE,
legacy.axes = TRUE,
main = "ROC curve for XGBoost classification",
show.thres = TRUE,
col = "red2" )
AUC: 0.989
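The confusion matrix below uses a 0.5 cutoff. If you instead want the threshold that best balances sensitivity and specificity, {pROC} can suggest one from the ROC object (a sketch; output not shown):
# Youden-optimal probability threshold from the ROC curve
pROC::coords(roc_obj_1, "best", best.method = "youden",
             ret = c("threshold", "sensitivity", "specificity"))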
conf_matrix_1 <- caret::confusionMatrix(factor(ifelse(test_results_1$predictions >= 0.5, 1, 0),
levels = c("0", "1"),
labels = c("retained", "churned")),
factor(test_results_1$truth,
levels = c(0, 1),
labels = c("retained", "churned")),
positive = "churned")
conf_matrix_1
## Confusion Matrix and Statistics
##
## Reference
## Prediction retained churned
## retained 888 29
## churned 6 141
##
## Accuracy : 0.9671
## 95% CI : (0.9545, 0.977)
## No Information Rate : 0.8402
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8704
##
## Mcnemar's Test P-Value : 0.0002003
##
## Sensitivity : 0.8294
## Specificity : 0.9933
## Pos Pred Value : 0.9592
## Neg Pred Value : 0.9684
## Prevalence : 0.1598
## Detection Rate : 0.1325
## Detection Prevalence : 0.1382
## Balanced Accuracy : 0.9114
##
## 'Positive' Class : churned
- Accuracy: 0.9671
- Sensitivity: 0.8294
- Specificity: 0.9933
- Kappa: 0.8704
- False Negatives: 29
- False Positives: 6
🔧 Tune XGBoost
Step 8 - Set hyperparameters
Static
xgb_estimator$set_hyperparameters(objective = "binary:logistic",
min_child_weight = 1 )
Tunable
hyperp_ranges <- list(num_round = sagemaker$tuner$IntegerParameter(50L, 500L),
                      max_depth = sagemaker$tuner$IntegerParameter(2L, 15L),
                      eta = sagemaker$tuner$ContinuousParameter(0.01, 0.4, "Logarithmic"),
                      colsample_bytree = sagemaker$tuner$ContinuousParameter(0.3, 0.8),
                      alpha = sagemaker$tuner$IntegerParameter(1L, 15L))
Create a HyperparameterTuner object. SageMaker supports two tuning methods: Bayesian optimization (the default) and random search; there is currently no grid search option. Let’s stick with Bayesian (see the random-search variant sketched after the code).
- max_jobs: maximum number of training jobs executed
- max_parallel_jobs: number of training jobs executed in parallel
tuner <- sagemaker$tuner$HyperparameterTuner(estimator = xgb_estimator,
objective_metric_name = "validation:auc",
objective_type = "Maximize",
hyperparameter_ranges = hyperp_ranges,
strategy = "Bayesian",
max_jobs = 100L,
max_parallel_jobs = 3L )
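If you wanted a random-search baseline to compare against, it’s a one-argument change (hypothetical variant, not run here):
tuner_random <- sagemaker$tuner$HyperparameterTuner(estimator = xgb_estimator,
                                                    objective_metric_name = "validation:auc",
                                                    objective_type = "Maximize",
                                                    hyperparameter_ranges = hyperp_ranges,
                                                    strategy = "Random", # random search instead of Bayesian
                                                    max_jobs = 100L,
                                                    max_parallel_jobs = 3L)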
Define S3 location of data sets and tuning job name.
algo <- "xgb"
timestamp <- format(Sys.time(), "%Y-%m-%d-%H-%M-%S")
job_name_y <- paste(project, algo, timestamp, sep = "-")
s3_train_input <- sagemaker$inputs$TrainingInput(s3_data = s3_train,
content_type = 'csv')
s3_valid_input <- sagemaker$inputs$TrainingInput(s3_data = s3_validation,
content_type = 'csv')
input_data <- list('train' = s3_train_input,
'validation' = s3_valid_input)
Step 9 - Start tuning job
tuner$fit(inputs = input_data,
job_name = job_name_y,
wait = FALSE ) # If TRUE, call will wait until job completes
session$describe_tuning_job(job_name_y)[["HyperParameterTuningJobStatus"]]
# [1] "Completed"
Step 10 - Evaluate tuning job results
tuning_job_results <- sagemaker$HyperparameterTuningJobAnalytics(job_name_y)
tuning_results_df <- tuning_job_results$dataframe()
head(tuning_results_df, n = 3)
# alpha colsample_bytree eta max_depth num_round TrainingJobName
# 1 2 0.6871357 0.13871194 15 241 churn-xgb-2021-08-06-03-56-45-100-03dd1b49
# 2 2 0.5947285 0.09493995 14 500 churn-xgb-2021-08-06-03-56-45-099-29087499
# 3 1 0.5546992 0.04806042 15 498 churn-xgb-2021-08-06-03-56-45-098-f9263ea8
#
# TrainingJobStatus FinalObjectiveValue TrainingStartTime TrainingEndTime TrainElapsedTimeSec
# 1 Completed 0.99394 2021-08-06 05:52:41 2021-08-06 05:53:39 58
# 2 Completed 0.99328 2021-08-06 05:49:44 2021-08-06 05:50:54 70
# 3 Completed 0.99445 2021-08-06 05:49:25 2021-08-06 05:50:55 90
Plot a time series chart to show how AUC developed over the 100 training jobs run by the Bayesian optimizer.
ggplot(tuning_results_df, aes(TrainingEndTime, FinalObjectiveValue)) +
geom_point() +
xlab("Time") + ylab(tuning_job_results$description()$TrainingJobDefinition$StaticHyperParameters$`_tuning_objective_metric`) +
ggtitle("Hyperparameter tuning AUC",
"Progression over the period of all 100 training jobs") +
theme_minimal()
The next plots show how the Bayesian optimizer focused on the region of the search space that produced the best models.
ggplot(tuning_results_df, aes(num_round, eta)) +
geom_point(aes(color = FinalObjectiveValue)) +
scale_color_viridis("AUC", option = "H") +
ggtitle("Hyperparameter tuning AUC",
"Using a Bayesian strategy") +
theme_dark()
ggplot(tuning_results_df, aes(colsample_bytree, alpha)) +
geom_point(aes(color = FinalObjectiveValue)) +
scale_color_viridis("AUC", option = "H") +
ggtitle("Hyperparameter tuning AUC",
"Using a Bayesian strategy") +
theme_dark()
ggplot(tuning_results_df, aes(num_round, max_depth)) +
geom_point(aes(color = FinalObjectiveValue)) +
scale_color_viridis("AUC", option = "H") +
ggtitle("Hyperparameter tuning AUC",
"Using a Bayesian strategy") +
theme_dark()
🏆 Select Best Model
best_tuned_model <- tuning_results_df %>%
  arrange(desc(FinalObjectiveValue)) %>%
  head(n = 1) %>%
  pull(TrainingJobName)
training_job_stats <- session$describe_training_job(job_name = best_tuned_model)
final_metrics_2 <- map_df(training_job_stats$FinalMetricDataList,
~tibble(metric_name = .x[["MetricName"]],
value = .x[["Value"]]))
final_metrics_2
## # A tibble: 3 × 2
## metric_name value
## <chr> <dbl>
## 1 validation:auc 0.995
## 2 train:auc 1.00
## 3 ObjectiveMetric 0.995
Step 11 - Create transformer object
predictions_path_2 <- paste0(models_path, "/", best_tuned_model, "/predictions")
session$create_model_from_job(best_tuned_model)
xgb_batch_predictor_2 <- sagemaker$transformer$Transformer(model_name = best_tuned_model,
instance_count = 1L,
instance_type = "ml.m5.4xlarge",
strategy = "MultiRecord",
assemble_with = "Line",
output_path = predictions_path_2)
best_tuned_model
## [1] "churn-xgb-2021-08-06-03-56-45-081-e16d7d4b"
Step 12 - Start batch prediction job
xgb_batch_predictor_2$transform(data = s3_test,
content_type = 'text/csv',
split_type = "Line",
job_name = best_tuned_model,
wait = FALSE) # If TRUE, call will wait until job completes
session$describe_transform_job(best_tuned_model)[["TransformJobStatus"]]
## [1] "Completed"
Step 13 - Download test set predictions
s3_downloader <- sagemaker$s3$S3Downloader()
s3_test_predictions_path_2 <- s3_downloader$list(predictions_path_2)
dir.create("./predictions")
s3_downloader$download(s3_test_predictions_path_2, "./predictions")
test_predictions_2 <- read_csv("./predictions/churn_test.csv.out",
col_names = FALSE) %>%
pull(X1)
churn_test <- read_csv("./data/churn_test_2.csv")
test_results_2 <- tibble(truth = churn_test$attrition_flag,
predictions = test_predictions_2)
head(test_results_2)
## # A tibble: 6 × 2
## truth predictions
## <dbl> <dbl>
## 1 0 0.000535
## 2 0 0.140
## 3 0 0.000582
## 4 0 0.000341
## 5 1 0.974
## 6 0 0.00112
Step 14 - Evaluate test set predictions
roc_obj_2 <- roc(test_results_2$truth,
test_results_2$predictions,
plot = TRUE,
grid = TRUE,
print.auc = TRUE,
legacy.axes = TRUE,
main = "ROC curve for XGBoost classification using the best model",
show.thres = TRUE,
col = "red2" )
Confusion matrix
conf_matrix_2 <- confusionMatrix(factor(ifelse(test_results_2$predictions >= 0.5, 1, 0),
levels = c("0", "1"),
labels = c("retained", "churned")),
factor(test_results_2$truth,
levels = c(0, 1),
labels = c("retained", "churned")),
positive = "churned")
conf_matrix_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction retained churned
## retained 890 22
## churned 4 148
##
## Accuracy : 0.9756
## 95% CI : (0.9644, 0.984)
## No Information Rate : 0.8402
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9049
##
## Mcnemar's Test P-Value : 0.0008561
##
## Sensitivity : 0.8706
## Specificity : 0.9955
## Pos Pred Value : 0.9737
## Neg Pred Value : 0.9759
## Prevalence : 0.1598
## Detection Rate : 0.1391
## Detection Prevalence : 0.1429
## Balanced Accuracy : 0.9331
##
## 'Positive' Class : churned
- Accuracy: 0.9756
- Sensitivity: 0.8706
- Specificity: 0.9955
- Kappa: 0.9049
- False Negatives: 22
- False Positives: 4
Save this model as our churn_model. By default, boto_client$list_models() returns models sorted by creation time with the newest first, so the model we just created from the best training job comes back as the first object.
boto_client <- session$boto_session$client("sagemaker")
churn_model <- boto_client$list_models()[["Models"]] %>%
map_chr("ModelName") %>%
.[[1]]
churn_model
## [1] "churn-xgb-2021-08-06-03-56-45-081-e16d7d4b"
Step 15 - Create endpoint configuration
We can deploy the model so we (or anyone we give permission to) can use
it to make more bank customer churn predictions. We still need to run
inference on our churn_holdout
data and submit those predictions to
the SLICED competition for scoring.
config_name <- paste0(churn_model, "-config")
session$create_endpoint_config(name = config_name,
model_name = churn_model,
initial_instance_count = 1L,
instance_type = "ml.m5.4xlarge")
Step 16 - Create endpoint
endpoint_name <- "churn-endpoint"
session$create_endpoint(endpoint_name = endpoint_name,
config_name = config_name,
wait = FALSE)
Check via the API whether the endpoint has been successfully deployed and is available. This might take 5-10 minutes.
boto_client$describe_endpoint(EndpointName = endpoint_name)[["EndpointStatus"]]
## [1] "InService"
Step 17 - Make real-time predictions against the endpoint (not batch)
- Predictor object - Makes prediction requests to the inference endpoint
- Serializer object - Encodes data for inference endpoint
- Deserializer object - Decodes data from inference endpoint
csv_serializer <- sagemaker$serializers$CSVSerializer()
csv_deserializer <- sagemaker$deserializers$CSVDeserializer()
churn_predictor <- sagemaker$predictor$Predictor(endpoint_name = endpoint_name,
sagemaker_session = session,
serializer = csv_serializer,
deserializer = csv_deserializer)
Let’s just check the first 5 test_set entries to make sure it worked. We already ran a batch transform job, so the results should be the same. This is just to test the endpoint.
test_set <- read_csv("./data/churn_test.csv", col_names = FALSE, n_max = 5) %>%
as.matrix()
real_time_predictions_1 <- churn_predictor$predict(data = test_set) %>%
.[[1]] %>%
as.numeric()
batch_predictions <- read_csv("./predictions/churn_test.csv.out", col_names = FALSE, n_max = 5) %>%
.[[1]]
data.frame(real_time_predictions_1, batch_predictions)
## real_time_predictions_1 batch_predictions
## 1 0.0005347708 0.0005347708
## 2 0.1397440583 0.1397440583
## 3 0.0005816696 0.0005816696
## 4 0.0003410776 0.0003410776
## 5 0.9742197990 0.9742197990
Perfect!
1 model, 2 prediction pipelines, same results.
🎯 Predict w/ Holdout
Use the endpoint to predict churn from our holdout data.
holdout_set <- read_csv("./data/churn_holdout.csv", col_names = FALSE) %>%
as.matrix()
real_time_predictions_2 <- churn_predictor$predict(data = holdout_set) %>%
.[[1]] %>%
as.numeric()
📬 Submission
Save and submit to SLICED s01e07 on Kaggle.
churn_holdout_id <- read_csv("data/churn_holdout_id.csv", col_names = FALSE)
submission <- data.frame(churn_holdout_id, real_time_predictions_2) %>%
rename(id = X1, attrition_flag = real_time_predictions_2)
write.csv(submission, "sliced_s01_e07_submission.csv", row.names = FALSE)
The evaluation algorithm for this competition was LogLoss, so the
lowest score wins. My submission scored 0.07622, which would have
been 8th place (out of 36 competitors and 130 submissions). The winning score was 0.06800.
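For reference, binary log loss is just the negative mean log-likelihood of the predicted probabilities; here’s a small helper (my addition) to score our own test-set predictions the same way:
# Binary log loss, clipping probabilities away from 0/1 for numerical safety
log_loss <- function(truth, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}
log_loss(test_results_2$truth, test_results_2$predictions)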
Competitors only had 2 hours to submit, so this is not a fair comparison, but it does show how a well-tuned XGBoost model can make great predictions without much feature engineering. To improve the model, we could log transform certain numeric variables, up/down-sample with {themis} to balance the data, remove low-importance features, and/or tune additional hyperparameters (sketched below). There should be room to improve model performance.
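As one concrete follow-up, the tunable set could be widened. A hedged sketch (the parameter names come from the SageMaker XGBoost docs; the ranges are my guesses):
hyperp_ranges_v2 <- list(num_round = sagemaker$tuner$IntegerParameter(50L, 1000L),
                         max_depth = sagemaker$tuner$IntegerParameter(2L, 15L),
                         eta = sagemaker$tuner$ContinuousParameter(0.01, 0.4, "Logarithmic"),
                         colsample_bytree = sagemaker$tuner$ContinuousParameter(0.3, 1.0),
                         subsample = sagemaker$tuner$ContinuousParameter(0.5, 1.0),
                         min_child_weight = sagemaker$tuner$ContinuousParameter(1, 10),
                         gamma = sagemaker$tuner$ContinuousParameter(0, 5),
                         alpha = sagemaker$tuner$IntegerParameter(1L, 15L))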
(Delete the endpoint so you’re not billed more $$$)
session$delete_endpoint(endpoint_name)