Forecasting Daily Sales w/ {modeltime}
Time-Series Modeling in the {tidymodels} Universe
- 🥅 Project Goal
- 🗂 Obtain Data
- 🛁 Clean Data
- 🔭 Explore Data
- 🧮 Prep Data for Forecasting
- 🏗 Build Workflows
- 🔧 Tune Models
- 🏆 Forecast w/ Best Models
- 🏁 Ensemble & Save Work
- 🤔 Decisions
🥅 Goal of this Project
For this blog post, I use the {modeltime}
package to forecast 3 months
of daily sales in Q1 for a young ‘Superstore’ company selling furniture,
technology, and office supplies. The forecast can then be used to make
decisions about supply-chain orders, warehouse inventory, and if/when
new employees are needed to meet predicted sales demand after the
holiday season.
🗂 Obtain Data
The Superstore Sales Dataset on Kaggle is relatively unexplored. There’s a couple python exploratory data analysis projects posted but nothing using R and only one 7-day forecast using a simple SARIMAX model.
library(tidyverse)
library(tidymodels)
library(modeltime)
library(modeltime.ensemble)
library(modeltime.resample)
library(timetk)
library(lubridate)
library(scales)
df <- read_csv("train.csv")
🛁 Clean Data
glimpse(df)
# Rows: 9,800
# Columns: 18
# $ `Row ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, …
# $ `Order ID` <chr> "CA-2017-152156", "CA-2017-152156", "CA-2017-138688", "US-2016-108966", "US…
# $ `Order Date` <chr> "08/11/2017", "08/11/2017", "12/06/2017", "11/10/2016", "11/10/2016", "09/0…
# $ `Ship Date` <chr> "11/11/2017", "11/11/2017", "16/06/2017", "18/10/2016", "18/10/2016", "14/0…
# $ `Ship Mode` <chr> "Second Class", "Second Class", "Second Class", "Standard Class", "Standard…
# $ `Customer ID` <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-20335", "BH-11710", "BH…
# $ `Customer Name` <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O'Donnell", "Sean O'…
# $ Segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consumer", "Consumer", "C…
# $ Country <chr> "United States", "United States", "United States", "United States", "United…
# $ City <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdale", "Fort Lauderdal…
# $ State <chr> "Kentucky", "Kentucky", "California", "Florida", "Florida", "California", "…
# $ `Postal Code` <dbl> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, 90032, 90032, 90032…
# $ Region <chr> "South", "South", "West", "South", "South", "West", "West", "West", "West",…
# $ `Product ID` <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240", "FUR-TA-10000577",…
# $ Category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture", "Office Supplies"…
# $ `Sub-Category` <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "Furnishings", "Art",…
# $ `Product Name` <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric Upholstered Stackin…
# $ Sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, 7.2800, 907.1520, …
After a little digging a few things become obvious. There are two date
variables, but lets only use Order Date
going forward. It seems
Order Date
format is day/month/year. They only ship within the United
States. We don’t need individual order_id
numbers. Let’s clean up
column names, format the date, factor the categorical variables, and
remove what we definitely don’t need.
data_clean_tbl <- janitor::clean_names(df) %>%
as_tibble() %>%
mutate(order_date = lubridate::dmy(order_date),
across(customer_id:product_name, .fns = as.factor)) %>%
select(-c(ship_date, country, order_id, ship_mode))
glimpse(data_clean_tbl)
# Rows: 9,800
# Columns: 14
# $ row_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22…
# $ order_date <date> 2017-11-08, 2017-11-08, 2017-06-12, 2016-10-11, 2016-10-11, 2015-06-09, 2015…
# $ customer_id <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, BH-11710, BH-11710, BH-1171…
# $ customer_name <fct> Claire Gute, Claire Gute, Darrin Van Huff, Sean O'Donnell, Sean O'Donnell, Br…
# $ segment <fct> Consumer, Consumer, Corporate, Consumer, Consumer, Consumer, Consumer, Consum…
# $ city <fct> Henderson, Henderson, Los Angeles, Fort Lauderdale, Fort Lauderdale, Los Ange…
# $ state <fct> Kentucky, Kentucky, California, Florida, Florida, California, California, Cal…
# $ postal_code <fct> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, 90032, 90032, 90032, …
# $ region <fct> South, South, West, South, South, West, West, West, West, West, West, West, S…
# $ product_id <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240, FUR-TA-10000577, OFF-ST-10…
# $ category <fct> Furniture, Furniture, Office Supplies, Furniture, Office Supplies, Furniture,…
# $ sub_category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnishings, Art, Phones, Binders…
# $ product_name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric Upholstered Stacking …
# $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600, 7.2800, 907.1520, 18…
What period of time does this data cover?
t <- summarise_by_time(.data = data_clean_tbl,
.date_var = order_date)
(diff_t <- difftime(last(t$order_date),
first(t$order_date),
units = 'weeks'))
# Time difference of 208.1429 weeks
lubridate::dweeks(208.1429)
# [1] "125884825.92s (~3.99 years)"
This is a daily dataset spanning just under 4 years from 2015-01-03 to 2018-12-30. There are gaps though!
🔭 Exploratory Data Analysis
Let’s group by region
/state
to see where the ‘Superstore’ does
business, looking at total orders
and total sales
.
q <- data_clean_tbl %>%
group_by(region,state) %>%
summarise(orders = n(), sales = sum(sales)) %>%
ungroup() %>%
mutate(state = tidytext::reorder_within(state, orders, region)) %>%
arrange(desc(sales))
ggplot(q, aes(reorder_within(state, sales, region, sep = ""), sales, fill = region)) +
geom_col(show.legend = FALSE, alpha = 0.9) +
facet_grid(region ~., scales = 'free', space = 'free') +
coord_flip() +
tidytext::scale_x_reordered() + # gets rid of the ___region
scale_y_continuous(limits = c(0,500000), expand = c(0,0), labels=dollar_format(prefix="$")) +
theme_dark_grey() +
scale_color_todd_dark_subtle() +
scale_fill_todd_dark() +
theme(panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank()) +
geom_hline(yintercept = c(100000,200000,300000,400000),
color = "#FFCF6A",
alpha = 0.1,
size = 1) +
labs(x = element_blank(), title = 'Sales by State/Region') +
geom_text(aes(label = dollar(round((sales)))),
color="white", size=3.5, fontface = 'bold',
hjust = -0.1, vjust = 0.4)
Looks like the do business throughout the US, with the most business coming from California. They probably do most of their business from online shopping, as these total sale numbers aren’t high enough for multiple warehouses spread across the country.
Lets also look at number of orders
and sales
grouped by
customer_name
, category
, and segment
. Any outliers?
# Want to identify any outlier customers, and see if any category/segment of their business dominates
x <- data_clean_tbl %>%
group_by(customer_name, category, segment) %>%
summarise(sales = sum(sales),
orders = n()) %>%
mutate(log_sales = log(sales))
# outlier_seth_vernon
x %>% filter(segment == 'Consumer',
category == "Furniture") %>%
arrange(desc(orders)) %>% head()
## A tibble: 6 × 6
## Groups: customer_name, category [6]
# customer_name category segment sales orders log_sales
# <fct> <fct> <fct> <dbl> <int> <dbl>
#1 Seth Vernon Furniture Consumer 8332. 15 9.03
#2 Caroline Jumper Furniture Consumer 6267. 9 8.74
#3 Joel Eaton Furniture Consumer 4423. 9 8.39
#4 Lena Creighton Furniture Consumer 2641. 9 7.88
#5 Anna Andreadi Furniture Consumer 3751. 8 8.23
#6 Anne McFarland Furniture Consumer 3557. 8 8.18
# outlier_2
x %>% filter(segment == 'Corporate' | segment == "Home Office",
category == "Technology") %>%
arrange(desc(log_sales)) %>% head()
## A tibble: 6 × 6
## Groups: customer_name, category [6]
# customer_name category segment sales orders log_sales
# <fct> <fct> <fct> <dbl> <int> <dbl>
#1 Sean Miller Technology Home Office 23482. 3 10.1
#2 Tamara Chand Technology Corporate 17998. 2 9.80
#3 Tom Ashbrook Technology Home Office 13710. 4 9.53
#4 Bill Shonely Technology Corporate 9107. 2 9.12
#5 Todd Sumrall Technology Corporate 8804. 4 9.08
#6 Grant Thornton Technology Corporate 8000. 1 8.99
# Make specific tbl for outlier Seth Vernon so label doesn't appear in other facets
seth_vernon <- data.frame(category = factor("Furniture",
levels = c("Furniture","Office Supplies","Technology")),
segment = factor("Consumer",
levels = c("Consumer","Corporate","Home Office")))
ggplot(x, aes(log(sales), orders, color = category)) +
geom_jitter(aes(shape = segment),
alpha = 0.7,
width = 0,
height = 0.3,
show.legend = FALSE,
size = 2) +
theme_dark_grey() +
scale_color_manual(values = c("#fbb4ae", "#b3cde3", "#fed9a6")) +
facet_grid(segment~category, scales = 'free_y') +
labs(title = "Orders & Sales in log($) per Customer by Category/Segment",
x = 'log($)') +
geom_curve(aes(x = 7, xend = 8.9, y = 18, yend = 16),
data = seth_vernon,
curvature = -0.5,
size = 1,
arrow = arrow(length = unit(2, "mm")),
color = 'white',
alpha = 0.7) +
geom_text(aes(x = 6.5, y = 15),
data = seth_vernon ,
label = "Seth Vernon\n15 Orders\nTotal = $8,332",
size = 3.5,
show.legend = F,
color = 'white',
alpha = 0.7)
Their sales are spread pretty evenly between these categories. More orders are for office supplies, but their largest sales are for technology; not surprising. They have a some loyal customers too, like Seth Vernon.
🧮 Prep Data for Forecasting
Now lets focus on sales within this time-series. If we want to forecast
sales, we can’t use any other variable but order_date
. After
2018-12-30 we don’t have any information about the number of orders or
where those orders are coming from. We could build models using , for
example, orders
as an independent variable to predict sales, but only
until 2018-12-30. After that, the models will break down since we don’t
have values for orders
to predict sales
with.
This is when we really start using {timetk}
and {modeltime}
.
Hopefully you can appreciate how well they work alongside {tidyverse}
and {tidymodels}
packages. No more learning unique processes for every
type of model you want to build.
sales_daily_tbl <- data_clean_tbl %>%
summarise_by_time(.date_var = order_date,
.by = 'day',
sales = sum(sales))
Let’s look at the daily total sales for this company over the time-series.
sales_daily_tbl %>%
plot_time_series(.date_var = order_date,
.value = sales,
.interactive = FALSE,
.line_color = "#cccccc",
.smooth_color = "#db0000",
.title = "Daily Total Sales",
.y_lab = "$") +
theme_dark_grey()
There are no days with 0 orders/sales. But in our EDA we noticed gaps in the data. This data spans just under 4 years, but we only have 1,230 rows of daily data. If we have a row for every day, we should have about 365 * 4 or 1,460 rows. We need to pad these missing days so our time-series is complete. We can fill sales with 0 for these missing days.
Pad Missing Dates
sales_daily_pad_tbl <- sales_daily_tbl %>%
pad_by_time(order_date,
.by = 'day',
.pad_value = 0)
glimpse(sales_daily_pad_tbl)
# Rows: 1,458
# Columns: 2
# $ order_date <date> 2015-01-03, 2015-01-04, 2015-01-05, 2015-01-06, 2015-01-07, 2015-01-08, 2015-01…
# $ sales <dbl> 16.448, 288.060, 19.536, 4407.100, 87.158, 0.000, 40.544, 54.830, 9.940, 0.000, …
1,458 rows, great! Now we have a full 4-year daily dataset with no missing dates.
Our sales variable ranges from 0 to ~ 30,000 on any given day, so we’ll need to transform this continuous variable. And since 0’s can mess up our machine learning models, we’ll use a log10 + 1 transformation, followed by standardization that centers our data around 0. We could do this at each recipe stage of our workflows, but we can do it here once and be done with it. Lets also change sales to sales_trans.
Transform sales
sales_daily_pad_trans_tbl <- sales_daily_pad_tbl %>%
mutate(sales = log_interval_vec(sales,
limit_lower = 0,
offset = 1)) %>%
mutate(sales = standardize_vec(sales)) %>%
rename(sales_trans = sales)
We need to keep track of the Standardization Parameters for when we un-transform the data at the end.
limit_lower <- 0
limit_upper <- 30918.3876
offset <- 1
std_mean_sales <- -4.61074359571939
std_sd_sales <- 2.86059320652223
Let’s look at our new padded & transformed time-series.
sales_daily_pad_trans_tbl %>%
plot_time_series(order_date,
sales_trans,
.interactive = FALSE,
.line_color = "#cccccc",
.smooth_color = "#db0000",
.title = "Daily Total Sales, Padded & Transformed",
.y_lab = "Sales") +
theme_dark_grey()
We might want to add lags, rolling lags, and/or fourier-series features
that could help our forecast. This is way to add features to regress on
for the future dates we want to forecast. Currently future forecast
dates have only order_date
to predict with.
Let’s look at ACF/PACF plots to see if any periods look usable.
ACF/PACF
lag_labels <- data.frame(name = c("ACF","ACF","ACF","ACF","ACF","PACF","PACF","PACF","PACF"),
label = c('7','14','21','28','357','7','14','21','28'))
sales_daily_pad_trans_tbl %>%
plot_acf_diagnostics(.date_var = order_date, .value = sales_trans, .lags = 400,
.show_white_noise_bars = TRUE, .interactive = FALSE, .line_color = "#cccccc",
.line_size = 0.5, .white_noise_line_color = "#db0000", .point_color = "#fed9a6",
.point_size = 0.8) +
theme_dark_grey() +
geom_text(aes(label = label),
data = lag_labels,
x = c(7, 14, 21.7, 29.5, 358, 7, 15, 22, 30),
y = c(.53, .52, .5, .5, .38,.5, .35, .27, .25),
size = 4,
color = "white",
angle = 55)
Clearly there is a strong weekly correlation of sales vs sales 7d, 14d, 21d, … later. The Partial Auto Correlation Features (PACF) de-weights the correlation values depending on the values that come before it. Our PACF shows lags 7, 14, 21, and 28 as potentially important lags. All other lags can be considered white noise or close to white noise.
I experimented with multiple lag and Fourier-Series combos. A lag the length of my forecast (84 days) with rolling 30d, 60d, and 90d averages helped. Additional lags and Fourier-Series did not improve model accuracy. It’s possible I didn’t use them correctly, or experiment enough. I’ll include multiple Fourier-Series with 3 orders each to show the process.
Add Forecast/Lag/Rolling Averages
forecast_3_month <- 84
lag_period <- 84 # 84 smallest lag period to get our forecast of 84 into the future
rolling_periods <- c(30, 60, 90) # 1 month, 2 month, and 3 month moving averages as features to catch the trend
Full Data
full_data_prepared_tbl <- sales_daily_pad_trans_tbl %>%
future_frame(.date_var = order_date,
.length_out = forecast_3_month,
.bind_data = TRUE) %>%
tk_augment_fourier(.date_var = order_date,
.periods = c(7,14,21,28,357),
.K = 3) %>%
tk_augment_lags(.value = sales_trans,
.lags = lag_period ) %>%
tk_augment_slidify(.value = sales_trans_lag84,
.f = ~ mean(.x, na.rm = TRUE),
.period = rolling_periods,
.partial = TRUE,
.align = 'center') %>%
rowid_to_column(var = 'rowid') %>%
rename_with(.cols = contains('lag'),
.fn = ~ str_c('lag_', .)) %>%
rename_with(.cols = matches('(_sin)|(_cos)'),
.fn = ~ str_c('fourier_', .))
glimpse(full_data_prepared_tbl)
# Rows: 1,542
# Columns: 37
# $ rowid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
# $ order_date <date> 2015-01-03, 2015-01-04, 2015-01-05, 2015-01-06, 2015-01-07, …
# $ sales_trans <dbl> -1.002789645, -0.018295287, -0.945789456, 0.984639996, -0.435…
# $ fourier_order_date_sin7_K1 <dbl> 9.749279e-01, 4.338837e-01, -4.338837e-01, -9.749279e-01, -7.…
# $ fourier_order_date_cos7_K1 <dbl> -0.2225209, -0.9009689, -0.9009689, -0.2225209, 0.6234898, 1.…
# $ fourier_order_date_sin7_K2 <dbl> -4.338837e-01, -7.818315e-01, 7.818315e-01, 4.338837e-01, -9.…
# $ fourier_order_date_cos7_K2 <dbl> -0.9009689, 0.6234898, 0.6234898, -0.9009689, -0.2225209, 1.0…
# $ fourier_order_date_sin7_K3 <dbl> -7.818315e-01, 9.749279e-01, -9.749279e-01, 7.818315e-01, -4.…
# $ fourier_order_date_cos7_K3 <dbl> 0.6234898, -0.2225209, -0.2225209, 0.6234898, -0.9009689, 1.0…
# $ fourier_order_date_sin14_K1 <dbl> 7.818315e-01, 9.749279e-01, 9.749279e-01, 7.818315e-01, 4.338…
# $ fourier_order_date_cos14_K1 <dbl> 0.6234898, 0.2225209, -0.2225209, -0.6234898, -0.9009689, -1.…
# $ fourier_order_date_sin14_K2 <dbl> 9.749279e-01, 4.338837e-01, -4.338837e-01, -9.749279e-01, -7.…
# $ fourier_order_date_cos14_K2 <dbl> -0.2225209, -0.9009689, -0.9009689, -0.2225209, 0.6234898, 1.…
# $ fourier_order_date_sin14_K3 <dbl> 4.338837e-01, -7.818315e-01, -7.818315e-01, 4.338837e-01, 9.7…
# $ fourier_order_date_cos14_K3 <dbl> -0.9009689, -0.6234898, 0.6234898, 0.9009689, -0.2225209, -1.…
# $ fourier_order_date_sin21_K1 <dbl> -9.972038e-01, -9.308737e-01, -7.818315e-01, -5.633201e-01, -…
# $ fourier_order_date_cos21_K1 <dbl> 0.07473009, 0.36534102, 0.62348980, 0.82623877, 0.95557281, 1…
# $ fourier_order_date_sin21_K2 <dbl> -1.490423e-01, -6.801727e-01, -9.749279e-01, -9.308737e-01, -…
# $ fourier_order_date_cos21_K2 <dbl> -0.98883083, -0.73305187, -0.22252093, 0.36534102, 0.82623877…
# $ fourier_order_date_sin21_K3 <dbl> 9.749279e-01, 4.338837e-01, -4.338837e-01, -9.749279e-01, -7.…
# $ fourier_order_date_cos21_K3 <dbl> -0.2225209, -0.9009689, -0.9009689, -0.2225209, 0.6234898, 1.…
# $ fourier_order_date_sin28_K1 <dbl> 4.338837e-01, 6.234898e-01, 7.818315e-01, 9.009689e-01, 9.749…
# $ fourier_order_date_cos28_K1 <dbl> 9.009689e-01, 7.818315e-01, 6.234898e-01, 4.338837e-01, 2.225…
# $ fourier_order_date_sin28_K2 <dbl> 7.818315e-01, 9.749279e-01, 9.749279e-01, 7.818315e-01, 4.338…
# $ fourier_order_date_cos28_K2 <dbl> 0.6234898, 0.2225209, -0.2225209, -0.6234898, -0.9009689, -1.…
# $ fourier_order_date_sin28_K3 <dbl> 9.749279e-01, 9.009689e-01, 4.338837e-01, -2.225209e-01, -7.8…
# $ fourier_order_date_cos28_K3 <dbl> 2.225209e-01, -4.338837e-01, -9.009689e-01, -9.749279e-01, -6…
# $ fourier_order_date_sin357_K1 <dbl> 0.2778924, 0.2947552, 0.3115267, 0.3282017, 0.3447751, 0.3612…
# $ fourier_order_date_cos357_K1 <dbl> 0.9606122, 0.9555728, 0.9502374, 0.9446077, 0.9386853, 0.9324…
# $ fourier_order_date_sin357_K2 <dbl> 0.5338936, 0.5633201, 0.5920486, 0.6200437, 0.6472706, 0.6736…
# $ fourier_order_date_cos357_K2 <dbl> 0.84555168, 0.82623877, 0.80590223, 0.78456725, 0.76226027, 0…
# $ fourier_order_date_sin357_K3 <dbl> 0.7478370, 0.7818315, 0.8136468, 0.8431944, 0.8703918, 0.8951…
# $ fourier_order_date_cos357_K3 <dbl> 0.66388234, 0.62348980, 0.58135949, 0.53760882, 0.49235973, 0…
# $ lag_sales_trans_lag84 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# $ lag_sales_trans_lag84_roll_30 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, N…
# $ lag_sales_trans_lag84_roll_60 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, N…
# $ lag_sales_trans_lag84_roll_90 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, N…
The time-series now includes lag, rolling_lag and fourier-series data in future forecast dates. If these features help model accuracy on the training/testing sets, we can assume they’ll help the models forecast future sales. They may or may not help though. Gotta experiment. Let’s look at the lag and rolling lags data.
full_data_prepared_tbl %>%
pivot_longer(c(sales_trans,
lag_sales_trans_lag84,
lag_sales_trans_lag84_roll_30,
lag_sales_trans_lag84_roll_60,
lag_sales_trans_lag84_roll_90)) %>%
plot_time_series(order_date, value, name, .interactive = FALSE,
.smooth = FALSE, .title = "30d, 60d, 90d Rolling Lags") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc","#db0000", "#0099ff", "#00ff99", "#ffffcc"))
Notice each of the first 84 days lacks at least some of the lag/rolling lag features. We can’t use these days in our testing/training splits. That’s the downside. We probably could pad those dates somehow, but I won’t try here.
The full_data_prepared_tbl
has 1542 rows (1458 original days + 84
future days). We need to separate rows for future days so we can split
the original data into training/testing sets. Filter using days lacking
sales_trans
values.
Forecast Data
We can use {skimr}
to show us a summary of our data.
forecast_tbl <- full_data_prepared_tbl %>%
filter(is.na(sales_trans))
skimr::skim_without_charts(forecast_tbl)
Name | forecast_tbl |
Number of rows | 84 |
Number of columns | 37 |
_______________________ | |
Column type frequency: | |
Date | 1 |
numeric | 36 |
________________________ | |
Group variables | None |
Data summary
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
order_date | 0 | 1 | 2018-12-31 | 2019-03-24 | 2019-02-10 | 84 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
rowid | 0 | 1 | 1500.50 | 24.39 | 1459.00 | 1479.75 | 1500.50 | 1521.25 | 1542.00 |
sales_trans | 84 | 0 | NaN | NA | NA | NA | NA | NA | NA |
fourier_order_date_sin7_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K1 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin7_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin7_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin14_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
fourier_order_date_sin14_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin14_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
fourier_order_date_sin21_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
fourier_order_date_cos21_K1 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
fourier_order_date_sin21_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
fourier_order_date_cos21_K2 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
fourier_order_date_sin21_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos21_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
fourier_order_date_cos28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
fourier_order_date_sin28_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos28_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
fourier_order_date_sin28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
fourier_order_date_cos28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
fourier_order_date_sin357_K1 | 0 | 1 | 0.91 | 0.08 | 0.72 | 0.85 | 0.93 | 0.98 | 1.00 |
fourier_order_date_cos357_K1 | 0 | 1 | 0.03 | 0.41 | -0.64 | -0.33 | 0.03 | 0.39 | 0.69 |
fourier_order_date_sin357_K2 | 0 | 1 | 0.04 | 0.69 | -0.99 | -0.62 | 0.06 | 0.71 | 1.00 |
fourier_order_date_cos357_K2 | 0 | 1 | -0.67 | 0.28 | -1.00 | -0.93 | -0.74 | -0.45 | -0.05 |
fourier_order_date_sin357_K3 | 0 | 1 | -0.36 | 0.52 | -1.00 | -0.85 | -0.45 | 0.09 | 0.65 |
fourier_order_date_cos357_K3 | 0 | 1 | -0.03 | 0.78 | -1.00 | -0.85 | -0.09 | 0.84 | 1.00 |
lag_sales_trans_lag84 | 0 | 1 | 0.53 | 0.76 | -2.00 | 0.29 | 0.71 | 1.00 | 1.60 |
lag_sales_trans_lag84_roll_30 | 0 | 1 | 0.54 | 0.16 | 0.22 | 0.46 | 0.57 | 0.64 | 0.81 |
lag_sales_trans_lag84_roll_60 | 0 | 1 | 0.56 | 0.07 | 0.43 | 0.51 | 0.56 | 0.61 | 0.66 |
lag_sales_trans_lag84_roll_90 | 0 | 1 | 0.55 | 0.06 | 0.45 | 0.51 | 0.53 | 0.61 | 0.66 |
The forecast_tbl
has 37 columns and 84 rows for future dates starting
2018-12-31 and ending 2019-03-24. Rows are only missing sales_trans
data. Good to go.
Training/Testing Data
The rest of the data we’ll call the data_prepared_tbl
. We need to drop
those rows that lack any lag/rolling lag data.
data_prepared_tbl <- full_data_prepared_tbl %>%
filter(!is.na(sales_trans)) %>%
drop_na()
skimr::skim_without_charts(data_prepared_tbl)
Name | data_prepared_tbl |
Number of rows | 1374 |
Number of columns | 37 |
_______________________ | |
Column type frequency: | |
Date | 1 |
numeric | 36 |
________________________ | |
Group variables | None |
Data summary
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
order_date | 0 | 1 | 2015-03-28 | 2018-12-30 | 2017-02-11 | 1374 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
rowid | 0 | 1 | 771.50 | 396.78 | 85.00 | 428.25 | 771.50 | 1114.75 | 1458.00 |
sales_trans | 0 | 1 | 0.04 | 0.98 | -2.00 | -0.25 | 0.37 | 0.70 | 1.75 |
fourier_order_date_sin7_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K1 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin7_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin7_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos7_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin14_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.22 | 0.62 | 1.00 |
fourier_order_date_sin14_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin14_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos14_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin21_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
fourier_order_date_cos21_K1 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
fourier_order_date_sin21_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
fourier_order_date_cos21_K2 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
fourier_order_date_sin21_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos21_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
fourier_order_date_sin28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
fourier_order_date_cos28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.78 | 1.00 |
fourier_order_date_sin28_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
fourier_order_date_cos28_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.22 | 0.62 | 1.00 |
fourier_order_date_sin28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.78 | 1.00 |
fourier_order_date_cos28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
fourier_order_date_sin357_K1 | 0 | 1 | -0.04 | 0.70 | -1.00 | -0.73 | -0.06 | 0.64 | 1.00 |
fourier_order_date_cos357_K1 | 0 | 1 | -0.01 | 0.72 | -1.00 | -0.73 | -0.05 | 0.73 | 1.00 |
fourier_order_date_sin357_K2 | 0 | 1 | -0.02 | 0.71 | -1.00 | -0.73 | -0.04 | 0.69 | 1.00 |
fourier_order_date_cos357_K2 | 0 | 1 | 0.03 | 0.70 | -1.00 | -0.67 | 0.06 | 0.73 | 1.00 |
fourier_order_date_sin357_K3 | 0 | 1 | 0.02 | 0.71 | -1.00 | -0.69 | 0.03 | 0.73 | 1.00 |
fourier_order_date_cos357_K3 | 0 | 1 | 0.02 | 0.71 | -1.00 | -0.68 | 0.04 | 0.74 | 1.00 |
lag_sales_trans_lag84 | 0 | 1 | -0.03 | 1.00 | -2.00 | -0.39 | 0.33 | 0.67 | 2.42 |
lag_sales_trans_lag84_roll_30 | 0 | 1 | -0.03 | 0.33 | -1.02 | -0.23 | 0.00 | 0.21 | 0.64 |
lag_sales_trans_lag84_roll_60 | 0 | 1 | -0.03 | 0.29 | -0.83 | -0.21 | 0.01 | 0.18 | 0.53 |
lag_sales_trans_lag84_roll_90 | 0 | 1 | -0.03 | 0.26 | -0.83 | -0.18 | 0.00 | 0.16 | 0.46 |
37 columns, 1374 rows, no missing data. Dates now range from 2015-03-28 to 2018-12-30. Good to go.
I decided to split training/testing so we’re testing on the last 84
days, the same amount of days in our forecast_tbl
. Could have split
differently. Cross-Validation for sequential models need to be ordered,
so making the testing set too big limits the benefit of time-series
cross-validation. That’ll make sense soon.
splits <- data_prepared_tbl %>%
time_series_split(date_var = order_date,
assess = 84,
cumulative = TRUE)
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(.date_var = order_date,
.value = sales_trans,
.smooth = FALSE,
.interactive = FALSE,
.title = "Initial Cross-Validation Plan",
.y_lab = "Sales") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
{modeltime}
was built to complement {tidymodels}
, so the recipes,
workflows, hyperparameter tuning, etc. all work the same.
I’m going to make two recipes for modeling workflows. One recipe will include the Fourier-Series, and one will lack Fourier-Series. I could have done this differently, but it works.
The step_timeseries_signature()
function adds many time-series
features based on our date variable, order_date
. We just have to
remove features we can’t use with our daily dataset, like features
involving minutes, hours, etc. I also added steps to normalize numeric
variables and dummy the nominal variables.
recipe_spec <- recipe(sales_trans ~ ., data = training(splits)) %>%
step_timeseries_signature(order_date) %>%
update_role(rowid, new_role = 'indicator') %>%
step_rm(matches("(.iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
step_normalize(matches('(index.num)|(year)|(yday)|(qday)|(mday)|(date_day)')) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_holiday(order_date, holidays = timeDate::listHolidays("US"))
recipe_spec
uses all the Fourier-Series found in our
data_prepared_tbl
.
recipe_spec_no_f <- recipe(sales_trans ~ order_date
+ lag_sales_trans_lag84
+ lag_sales_trans_lag84_roll_30
+ lag_sales_trans_lag84_roll_60
+ lag_sales_trans_lag84_roll_90,
data = training(splits)) %>%
step_timeseries_signature(order_date) %>%
step_rm(matches("(.iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
step_normalize(matches('(index.num)|(year)|(yday)|(qday)|(mday)|(date_day)')) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_holiday(order_date, holidays = timeDate::listHolidays("US"))
recipe_spec_no_f
does NOT use the Fourier-Series found in our
data_prepared_tbl
.
recipe_spec_no_f %>% prep() %>% juice() %>% glimpse()
# Rows: 1,290
# Columns: 58
# $ order_date <date> 2015-03-28, 2015-03-29, 2015-03-30, 2015-03-31, 2015-04…
# $ lag_sales_trans_lag84 <dbl> -1.002789645, -0.018295287, -0.945789456, 0.984639996, -…
# $ lag_sales_trans_lag84_roll_30 <dbl> -0.6456698, -0.6031017, -0.5258254, -0.6035450, -0.67349…
# $ lag_sales_trans_lag84_roll_60 <dbl> -0.6811990, -0.6723552, -0.6606234, -0.7000903, -0.67922…
# $ lag_sales_trans_lag84_roll_90 <dbl> -0.7196151, -0.7196938, -0.7464189, -0.7394241, -0.74869…
# $ sales_trans <dbl> 0.56999672, 0.38250608, 0.48107019, 0.67053534, -0.32496…
# $ order_date_index.num <dbl> -1.730038, -1.727353, -1.724669, -1.721985, -1.719300, -…
# $ order_date_year <dbl> -1.419664, -1.419664, -1.419664, -1.419664, -1.419664, -…
# $ order_date_half <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
# $ order_date_quarter <int> 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
# $ order_date_month <int> 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
# $ order_date_day <dbl> 1.38891404, 1.50196926, 1.61502448, 1.72807970, -1.66357…
# $ order_date_wday <int> 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4,…
# $ order_date_mday <dbl> 1.38891404, 1.50196926, 1.61502448, 1.72807970, -1.66357…
# $ order_date_qday <dbl> 1.5404251, 1.5780733, 1.6157216, 1.6533698, -1.6973219, …
# $ order_date_yday <dbl> -0.9656001, -0.9555644, -0.9455288, -0.9354932, -0.92545…
# $ order_date_mweek <int> 4, 4, 5, 5, 5, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,…
# $ order_date_week <int> 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, …
# $ order_date_week2 <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
# $ order_date_week3 <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_week4 <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,…
# $ order_date_mday7 <dbl> 1.6685116, 1.6685116, 1.6685116, 1.6685116, -1.4074155, …
# $ order_date_month.lbl_01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_03 <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_04 <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
# $ order_date_month.lbl_05 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_wday.lbl_1 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
# $ order_date_wday.lbl_2 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
# $ order_date_wday.lbl_3 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
# $ order_date_wday.lbl_4 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
# $ order_date_wday.lbl_5 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
# $ order_date_wday.lbl_6 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
# $ order_date_wday.lbl_7 <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
# $ order_date_USChristmasDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USColumbusDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USCPulaskisBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USDecorationMemorialDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USElectionDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USGoodFriday <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USInaugurationDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USIndependenceDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USLaborDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USLincolnsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USMemorialDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USMLKingsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USNewYearsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USPresidentsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USThanksgivingDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USVeteransDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# $ order_date_USWashingtonsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Even without Fourier-Series, look at all the features that may help us
model future sales. We started with just order_date
.
🏗 Build Workflows
We don’t know which algorithms will work best with our data to predict sales, so lets train and test a bunch of models. Building workflows makes it so easy to move from one model to the next. Notice how many different models I tested and the code chunks are nearly identical. Experimenting becomes easy and limited only by your CPUs.
Let’s build:
- 3 sequential models:
- ARIMA, ARIMA_boost, NNETAR
- 8 non-sequential models:
- Prophet, XGBoost, Prophet_boost, SVM, RF, NNET, MARS, GLMNET
- 2 Recipes per model:
- 1 w/ Fourier-Seires, 1 w/out Fourier-Series
Certain models use a date object like order_date
as a predictor. Some
models CANNOT and we’ll predict on all variables in our recipes EXCEPT
order_date
.
Models that use a date object predictor: ARIMA, ARIMA_boost, Neral Network Autoregression (NNETAR), Prophet, and Prophet_boost
Models that CANNOT use a date object predictor: XGBoost, Support-Vector Machine (SVM), Random Forest (RF), Neural Network (NNET), Multiple Adaptive Regression Splines (MARS), and Elastic-Net Regularized Generalized Linear Models (GLMNET)
Ok, lets build!
ARIMA
Only model w/out a workflow
set.seed(321)
model_fit_arima <- arima_reg() %>%
set_engine('auto_arima') %>%
fit(sales_trans ~ order_date, data = training(splits))
ARIMA BOOST
Keep order_date
feature
wflw_fit_arimaboost <- workflow() %>%
add_model(spec = arima_boost() %>% set_engine('auto_arima_xgboost')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_arimaboost_no_f <- workflow() %>%
add_model(spec = arima_boost() %>% set_engine('auto_arima_xgboost')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
PROPHET
Keep order_date
feature
wflw_fit_prophet <- workflow() %>%
add_model(spec = prophet_reg() %>% set_engine('prophet')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_prophet_no_f <- workflow() %>%
add_model(spec = prophet_reg() %>% set_engine('prophet')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
XGBOOST
Set order_date
to ‘indicator’
wflw_fit_xgboost <- workflow() %>%
add_model(spec = boost_tree(mode = 'regression') %>% set_engine('xgboost')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_xgboost_no_f <- workflow() %>%
add_model(spec = boost_tree(mode = 'regression') %>% set_engine('xgboost')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
PROPHET BOOST
Keep order_date
feature
wflw_fit_prophet_xgboost <- workflow() %>%
add_model(spec = prophet_boost(seasonality_daily = FALSE,
seasonality_weekly = FALSE,
seasonality_yearly = FALSE) %>% set_engine('prophet_xgboost')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_prophet_xgboost_no_f <- workflow() %>%
add_model(spec = prophet_boost(seasonality_daily = FALSE,
seasonality_weekly = FALSE,
seasonality_yearly = FALSE) %>% set_engine('prophet_xgboost')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
I turned Prophet seasonalities off so prophet is only used for trend. XGBoost will model seasonality with the Prophet model’s residuals using the calandar features from the recipe.
SVM
Set order_date
to ‘indicator’
set.seed(321)
wflw_fit_svm <- workflow() %>%
add_model(spec = svm_rbf(mode = 'regression') %>% set_engine('kernlab')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_svm_no_f <- workflow() %>%
add_model(spec = svm_rbf(mode = 'regression') %>% set_engine('kernlab')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
RANDOM FOREST
Set order_date
to ‘indicator’
set.seed(321)
wflw_fit_rf <- workflow() %>%
add_model(spec = rand_forest(mode = 'regression') %>% set_engine('ranger')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_rf_no_f <- workflow() %>%
add_model(spec = rand_forest(mode = 'regression') %>% set_engine('ranger')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
NNET
Set order_date
to ‘indicator’
set.seed(321)
wflw_fit_nnet <- workflow() %>%
add_model(spec = mlp(mode = 'regression') %>% set_engine('nnet')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_nnet_no_f <- workflow() %>%
add_model(spec = mlp(mode = 'regression') %>% set_engine('nnet')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
NNETAR
Keep order_date
feature
set.seed(321)
wflw_fit_nnetar <- workflow() %>%
add_model(nnetar_reg() %>% set_engine('nnetar')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_nnetar_no_f <- workflow() %>%
add_model(nnetar_reg() %>% set_engine('nnetar')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
MARS
Set order_date
to ‘indicator’
wflw_fit_mars <- workflow() %>%
add_model(spec = mars(mode = 'regression') %>% set_engine('earth')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_mars_no_f <- workflow() %>%
add_model(spec = mars(mode = 'regression') %>% set_engine('earth')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
GLMNET
Set order_date
to ‘indicator’
wflw_fit_glmnet <- workflow() %>%
add_model(spec = linear_reg(mode = 'regression',
penalty = 0.1,
mixture = 0.5) %>% set_engine('glmnet')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_glmnet_no_f <- workflow() %>%
add_model(spec = linear_reg(mode = 'regression',
penalty = 0.1,
mixture = 0.5) %>% set_engine('glmnet')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
That’s a lot of code, but also lots of copy/pasting followed by small changes. It’s really not that bad. Let’s collect each model/workflow and measure performance on the testing data.
Collect and Rename
submodels_tbl <- modeltime_table(model_fit_arima,
wflw_fit_arimaboost,
wflw_fit_arimaboost_no_f,
wflw_fit_prophet,
wflw_fit_prophet_no_f,
wflw_fit_prophet_xgboost,
wflw_fit_prophet_xgboost_no_f,
wflw_fit_svm,
wflw_fit_svm_no_f,
wflw_fit_rf,
wflw_fit_rf_no_f,
wflw_fit_nnet,
wflw_fit_nnet_no_f,
wflw_fit_nnetar,
wflw_fit_nnetar_no_f,
wflw_fit_mars,
wflw_fit_mars_no_f,
wflw_fit_glmnet,
wflw_fit_glmnet_no_f) %>%
update_model_description(1, "ARIMA(0,1,1)(0,0,2)[7]") %>%
update_model_description(2, "ARIMA_boost") %>%
update_model_description(3, "ARIMA_boost-no_fourier") %>%
update_model_description(4, "PROPHET") %>%
update_model_description(5, "PROPHET-no_fourier") %>%
update_model_description(6, "PROPHET_boost") %>%
update_model_description(7, "PROPHET_boost-no_fourier") %>%
update_model_description(8, "SVM") %>%
update_model_description(9, "SVM-no_fourier") %>%
update_model_description(10, "RandomForest") %>%
update_model_description(11, "RandomForest-no_fourier") %>%
update_model_description(12, "NNET") %>%
update_model_description(13, "NNET-no_fourier") %>%
update_model_description(14, "NNETAR") %>%
update_model_description(15, "NNETAR-no_fourier") %>%
update_model_description(16, "MARS") %>%
update_model_description(17, "MARS-no_fourier") %>%
update_model_description(18, "GLMNET") %>%
update_model_description(19, "GLMNET-no_fourier")
Calibrate on Testing Data
calibration_tbl <- submodels_tbl %>% modeltime_calibrate(testing(splits))
Measure Accuracy
calibration_tbl %>% modeltime_accuracy() %>% arrange(rmse) %>% table_modeltime_accuracy(.interactive = FALSE)
Accuracy Table | ||||||||
---|---|---|---|---|---|---|---|---|
.model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
9 | SVM-no_fourier | Test | 0.35 | 314.23 | 0.48 | 58.48 | 0.53 | 0.52 |
14 | NNETAR | Test | 0.38 | 310.80 | 0.52 | 61.80 | 0.54 | 0.50 |
8 | SVM | Test | 0.39 | 355.90 | 0.53 | 63.28 | 0.56 | 0.45 |
15 | NNETAR-no_fourier | Test | 0.41 | 259.44 | 0.56 | 67.29 | 0.57 | 0.46 |
7 | PROPHET_boost-no_fourier | Test | 0.38 | 351.48 | 0.52 | 58.60 | 0.57 | 0.44 |
11 | RandomForest-no_fourier | Test | 0.45 | 253.48 | 0.61 | 71.94 | 0.59 | 0.43 |
10 | RandomForest | Test | 0.48 | 258.79 | 0.65 | 78.93 | 0.62 | 0.39 |
6 | PROPHET_boost | Test | 0.42 | 309.38 | 0.57 | 65.50 | 0.63 | 0.31 |
17 | MARS-no_fourier | Test | 0.44 | 428.06 | 0.60 | 68.69 | 0.63 | 0.32 |
12 | NNET | Test | 0.48 | 260.89 | 0.65 | 75.25 | 0.64 | 0.27 |
5 | PROPHET-no_fourier | Test | 0.46 | 425.59 | 0.63 | 75.72 | 0.65 | 0.29 |
16 | MARS | Test | 0.47 | 243.98 | 0.64 | 81.43 | 0.65 | 0.29 |
19 | GLMNET-no_fourier | Test | 0.49 | 290.47 | 0.67 | 85.03 | 0.66 | 0.26 |
18 | GLMNET | Test | 0.50 | 288.36 | 0.68 | 89.65 | 0.66 | 0.26 |
3 | ARIMA_boost-no_fourier | Test | 0.50 | 291.62 | 0.68 | 79.44 | 0.67 | 0.21 |
4 | PROPHET | Test | 0.49 | 484.70 | 0.67 | 80.61 | 0.68 | 0.25 |
2 | ARIMA_boost | Test | 0.50 | 241.05 | 0.69 | 79.06 | 0.70 | 0.18 |
13 | NNET-no_fourier | Test | 0.54 | 216.05 | 0.74 | 84.72 | 0.70 | 0.24 |
1 | ARIMA(0,1,1)(0,0,2)[7] | Test | 0.61 | 162.54 | 0.83 | 102.57 | 0.76 | 0.13 |
Looking for models with low ‘rmse’ and high ‘rsq’. Those perform the best fit on the testing data. I want to see how only the 6 best models perform.
Collect 6 Best
submodels_tbl_2 <- modeltime_table(wflw_fit_svm_no_f,
wflw_fit_nnetar_no_f,
wflw_fit_svm,
wflw_fit_prophet_xgboost_no_f,
wflw_fit_nnetar,
wflw_fit_rf_no_f) %>%
update_model_description(1, "SVM_no_f") %>%
update_model_description(2, "NNETAR_no_f") %>%
update_model_description(3, "SVM") %>%
update_model_description(4, "PROPHET_boost_no_f") %>%
update_model_description(5, "NNETAR") %>%
update_model_description(6, "RandomForest_no_f")
Visualize Best 6 Models Fit on Testing Data
calibration_tbl <- submodels_tbl_2 %>% modeltime_calibrate(testing(splits))
calibration_tbl %>%
modeltime_forecast(new_data = testing(splits),
actual_data = filter_by_time(data_prepared_tbl, .start_date = "2018-09"),
keep_data = TRUE) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Models fit on testing(splits)",
.y_lab = "sales_trans",
.line_size = 1) +
theme_dark_grey() +
scale_color_todd_dark_bright()
Not bad. But we can tune these models for a better fit before forecasting.
🔧 Hyperparameter Tuning
There are two types of cross-validation we’ll need for resampling. K-Fold CV for models that can train on random data and Time-Series CV for models that need to train on sequential or ordered data. You’ll see.
K-Fold CV
set.seed(321)
resamples_kfold <- training(splits) %>% vfold_cv(v = 6)
resamples_kfold %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(order_date,
sales_trans,
.facet_ncol = 2,
.interactive = FALSE,
.title = "K-fold CV Plan",
.y_lab = "sales_trans") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
Time-Series CV
set.seed(321)
resamples_tscv <- time_series_cv(data = training(splits) %>% drop_na(),
cumulative = TRUE,
assess = "12 weeks",
skip = '6 weeks',
slice_limit = 6)
resamples_tscv %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(order_date,
sales_trans,
.facet_ncol = 2,
.interactive = FALSE,
.title = "Time-Series CV Plan",
.y_lab = "sales_trans") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
I did 2-3 rounds of tuning for all 6 models, I’ll show the first and last round of tuning for just 2 of the 6 models. We’ll still compare performance of all 6 tuned models.
Each of my tuning sets consist of:
- Tunable Model
- Workflow
- specifies model and recipe
- Parameter Grid
- specifies parameters to tune and range of values to test
- Round of Tuning
- uses tunable model, workflow, parameter grid, and appropriate CV plan to train/test
- Results Table
- shows which combinations work best
- Results Plot
- chose values that associate with low rmse and high rsq
After each set I narrow the tuning grid, depending on what the plot shows me. Then I do it again until additional tunes make little impact on the rmse/rsq. Hint - pay attention to the y-axis of the plots. Models tuning with many tuning parameters could take ~8 minutes to run on my 2019 16 GB, 1.6 GHz MacBook Air. I could use an upgrade.
SVM - no Fourier-Series
Tunable Model
model_spec_svm_tune <- svm_rbf(mode = 'regression',
cost = tune(),
rbf_sigma = tune(),
margin = 0.17 ) %>%
set_engine('kernlab')
Workflow
Change order_date
feature to ‘indicator’
wflw_spec_svm_tune <- workflow() %>%
add_model(model_spec_svm_tune) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date,
new_role = 'indicator'))
Tuning Grid 1
grid_spec_svm_1 <- grid_latin_hypercube(parameters(model_spec_svm_tune),
size = 20)
Tuning Round 1
set.seed(321)
tune_results_svm <- wflw_spec_svm_tune %>%
tune_grid(resamples = resamples_kfold,
grid = grid_spec_svm_1,
control = control_grid(verbose = TRUE))
Results Table 1
tune_results_svm %>% show_best('rmse', n = 10)
# # A tibble: 10 × 8
# cost rbf_sigma .metric .estimator mean n std_err .config
# <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 3.09 0.00225 rmse standard 0.840 6 0.0312 Preprocessor1_Model12
# 2 0.0964 0.0173 rmse standard 0.894 6 0.0321 Preprocessor1_Model08
# 3 0.126 0.0396 rmse standard 0.897 6 0.0309 Preprocessor1_Model03
# 4 11.3 0.112 rmse standard 0.898 6 0.0225 Preprocessor1_Model04
# 5 0.265 0.000559 rmse standard 0.958 6 0.0331 Preprocessor1_Model02
# 6 0.595 0.000175 rmse standard 0.970 6 0.0331 Preprocessor1_Model16
# 7 0.00594 0.00372 rmse standard 1.02 6 0.0325 Preprocessor1_Model01
# 8 1.17 0.00000741 rmse standard 1.03 6 0.0327 Preprocessor1_Model06
# 9 0.0588 0.0000220 rmse standard 1.04 6 0.0327 Preprocessor1_Model11
# 10 0.0280 0.0000449 rmse standard 1.04 6 0.0327 Preprocessor1_Model18
Results Plot 1
tune_results_svm %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
The range of the rmse and rsq y-axis is wide-enough to use a round of tuning. Update the grid and do round 2.
Tuning Grid 2
grid_spec_svm_2 <- grid_latin_hypercube(cost(range = c(3.2,3.5),
trans = scales::log2_trans()),
rbf_sigma(range = c(-2.45,-2.35),
trans = scales::log10_trans()),
size = 20)
Tuning Round 2
set.seed(321)
tune_results_svm_2 <- wflw_spec_svm_tune %>%
tune_grid(resamples = resamples_kfold,
grid = grid_spec_svm_2,
control = control_grid(verbose = TRUE))
Results Table 2
tune_results_svm_2 %>% show_best('rmse', n = 10)
# # A tibble: 10 × 8
# cost rbf_sigma .metric .estimator mean n std_err .config
# <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 10.0 0.00361 rmse standard 0.813 6 0.0270 Preprocessor1_Model06
# 2 10.4 0.00364 rmse standard 0.813 6 0.0268 Preprocessor1_Model11
# 3 11.2 0.00357 rmse standard 0.813 6 0.0266 Preprocessor1_Model08
# 4 9.51 0.00380 rmse standard 0.813 6 0.0269 Preprocessor1_Model17
# 5 9.38 0.00383 rmse standard 0.813 6 0.0269 Preprocessor1_Model18
# 6 10.3 0.00371 rmse standard 0.813 6 0.0267 Preprocessor1_Model12
# 7 9.81 0.00388 rmse standard 0.814 6 0.0267 Preprocessor1_Model07
# 8 9.39 0.00395 rmse standard 0.814 6 0.0268 Preprocessor1_Model02
# 9 11.1 0.00373 rmse standard 0.814 6 0.0263 Preprocessor1_Model03
# 10 9.67 0.00400 rmse standard 0.814 6 0.0265 Preprocessor1_Model19
Results Plot 2
tune_results_svm_2 %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Good enough. Look at the y-axis. Another round of tuning won’t make a big difference.
Fit Tuned Workflow on Original Traning Data
wflw_fit_svm_no_f_tuned <- wflw_spec_svm_tune %>%
finalize_workflow(select_best(tune_results_svm_2, 'rmse')) %>%
fit(training(splits))
NNETAR - w/ Fourier Series
Tunable Model
model_spec_nnetar_tune <- nnetar_reg(mode = 'regression',
seasonal_period = 7,
non_seasonal_ar = tune(),
seasonal_ar = tune(),
num_networks = 10,
hidden_units = tune(),
penalty = tune(),
epochs = 50) %>%
set_engine('nnetar')
Workflow
Keep order_date
feature
wflw_spec_nnetar_tune_f <- workflow() %>%
add_model(model_spec_nnetar_tune) %>%
add_recipe(recipe_spec)
Tuning Grid 1
grid_spec_nnetar_f_1 <- grid_latin_hypercube(parameters(model_spec_nnetar_tune),
size = 20)
Tuning Round 1
set.seed(321)
tune_results_nnetar_f <- wflw_spec_nnetar_tune_f %>%
tune_grid(resamples = resamples_tscv,
grid = grid_spec_nnetar_f_1,
control = control_grid(verbose = TRUE))
Results Table 1
tune_results_nnetar_f %>% show_best('rmse', n = 10)
## A tibble: 10 × 10
# non_seasonal_ar seasonal_ar hidden_units penalty .metric .estimator mean n std_err .config
# <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 2 1 10 4.48e-1 rmse standard 0.767 6 0.0467 Preproce…
# 2 2 2 8 1.84e-9 rmse standard 0.775 6 0.0479 Preproce…
# 3 4 0 7 1.67e-3 rmse standard 0.780 6 0.0530 Preproce…
# 4 2 0 8 6.48e-9 rmse standard 0.786 6 0.0511 Preproce…
# 5 4 1 4 8.39e-8 rmse standard 0.787 6 0.0488 Preproce…
# 6 3 1 9 1.92e-4 rmse standard 0.788 6 0.0455 Preproce…
# 7 1 2 5 3.70e-5 rmse standard 0.790 6 0.0484 Preproce…
# 8 3 1 2 2.96e-5 rmse standard 0.790 6 0.0483 Preproce…
# 9 1 2 2 4.68e-7 rmse standard 0.791 6 0.0595 Preproce…
#10 3 1 6 8.49e-4 rmse standard 0.791 6 0.0568 Preproce…
Results Plot 1
tune_results_nnetar_f %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Tuning Grid 2
set.seed(321)
grid_spec_nnetar_f_2 <- grid_latin_hypercube(non_seasonal_ar(range = c(1,1)),
seasonal_ar(range = c(0,2)),
hidden_units(range = c(5,5)),
penalty(range = c(-6,-3),
trans = scales::log10_trans()),
size = 10)
Tuning Round 2
set.seed(321)
tune_results_nnetar_f_2 <- wflw_spec_nnetar_tune_f %>%
tune_grid(resamples = resamples_tscv,
grid = grid_spec_nnetar_f_2,
control = control_grid(verbose = TRUE))
Results Table 2
tune_results_nnetar_f_2 %>% show_best('rmse', n = 10)
## A tibble: 10 × 10
# non_seasonal_ar seasonal_ar hidden_units penalty .metric .estimator mean n std_err .config
# <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 1 0 5 0.0000254 rmse standard 0.778 6 0.0548 Prepro…
# 2 1 2 5 0.000645 rmse standard 0.781 6 0.0503 Prepro…
# 3 1 1 5 0.00000495 rmse standard 0.784 6 0.0506 Prepro…
# 4 1 0 5 0.000358 rmse standard 0.785 6 0.0472 Prepro…
# 5 1 2 5 0.00000230 rmse standard 0.785 6 0.0503 Prepro…
# 6 1 1 5 0.0000129 rmse standard 0.788 6 0.0490 Prepro…
# 7 1 1 5 0.0000943 rmse standard 0.789 6 0.0502 Prepro…
# 8 1 1 5 0.000208 rmse standard 0.793 6 0.0501 Prepro…
# 9 1 1 5 0.0000350 rmse standard 0.795 6 0.0555 Prepro…
#10 1 2 5 0.00000105 rmse standard 0.797 6 0.0505 Prepro…
Results Plot 2
tune_results_nnetar_f_2 %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Fit Tuned Workflow on Original Traning Data
wflw_fit_nnetar_f_tuned <- wflw_spec_nnetar_tune_f %>%
finalize_workflow(select_best(tune_results_nnetar_f_2, 'rmse')) %>%
fit(training(splits))
🏆 Forecast w/ Best Models
After lots of tuning, we have 6 models that have been tuned and fit back onto the training data. Now let’s evaluate their performance on the testing data.
Collect Tuned Models
submodels_tuned_tbl <- modeltime_table(wflw_fit_prophet_boost_tuned,
wflw_fit_svm_f_tuned,
wflw_fit_svm_no_f_tuned,
wflw_fit_rf_tuned,
wflw_fit_nnetar_no_f_tuned,
wflw_fit_nnetar_f_tuned) %>%
update_model_description(1, "PROPHET_XGBOOST-tuned") %>%
update_model_description(2, "SVM-fourier-tuned") %>%
update_model_description(3, "SVM-no_fourier-tuned") %>%
update_model_description(4, "Random Forest-tuned") %>%
update_model_description(5, "NNETAR-no_fourier-tuned") %>%
update_model_description(6, "NNETAR-fourier-tuned")
Calibrate on Test Data
calibration_tuned_tbl <- submodels_tuned_tbl %>% modeltime_calibrate(testing(splits))
Measure Accuracy
calibration_tuned_tbl %>% modeltime_accuracy() %>% arrange(rmse)
## A tibble: 6 × 9
# .model_id .model_desc .type mae mape mase smape rmse rsq
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 3 SVM-no_fourier-tuned Test 0.371 360. 0.506 57.8 0.544 0.482
#2 2 SVM-fourier-tuned Test 0.395 379. 0.539 63.4 0.572 0.426
#3 6 NNETAR-fourier-tuned Test 0.427 266. 0.582 69.7 0.576 0.431
#4 4 Random Forest-tuned Test 0.419 308. 0.572 64.5 0.585 0.403
#5 1 PROPHET_XGBOOST-tuned Test 0.424 374. 0.578 62.9 0.598 0.377
#6 5 NNETAR-no_fourier-tuned Test 0.450 273. 0.613 73.3 0.602 0.391
Visualize Fit on Test Data
calibration_tuned_tbl %>%
modeltime_forecast(new_data = testing(splits),
actual_data = filter_by_time(data_prepared_tbl, .start_date = "2018-09"),
keep_data = TRUE) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models fit on testing(splits)",
.y_lab = "sales_trans",
.line_size = 1) +
theme_dark_grey() +
scale_color_todd_dark_bright()
The hyperparameter tuning helped the performance of some models, at least regarding fit on the testing data. I could do some more work tuning, but lets move forward and forecast future sales with these tuned models. We need to refit the models on the entire dataset now, not just the training or testing data. Once refit, we can forecast daily sales for the next 84 days.
Refit on All Data
refit_tbl <- calibration_tuned_tbl %>%
modeltime_refit(data = data_prepared_tbl)
After refitting, we can un-transform the data using those transformation parameters we saved way up top and visualize our models on a true y-axis of sales.
Visualizing Sales Forecast with 6 Tuned-Models Zoomed In
refit_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl,
keep_data = FALSE) %>%
mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value:.conf_hi, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
filter_by_time(.start_date = '2018-10' ) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models 3-Month Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.8) +
theme_dark_grey() +
scale_color_todd_dark_bright()
Visualizing Sales Forecast with 6 Tuned-Models Zoomed Out
refit_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl,
keep_data = FALSE) %>%
mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value:.conf_hi, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
#filter_by_time(.start_date = '2018-10' ) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models 3-Month Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.5) +
theme_dark_grey() +
scale_color_todd_dark_bright()
🏁 Ensemble & Save Work
There’s plenty more we could do to improve these models, but the last
thing I’ll show here is a quick and easy weighted model. Instead of
using all 6 tuned models individually, I wanted to give the best
performing tuned models a scalable weight and make 1 weighted ensemble
model. My best fitting model was SVM-no_fourier-tuned
, the worst
fitting was PROPHET_XGBOOST-tuned
. My weights ranged from 10 for the
best to 5 for the worst.
Weighted Ensemble Model
calibration_tuned_tbl %>% modeltime_accuracy() %>% arrange(rmse)
## A tibble: 6 × 9
# .model_id .model_desc .type mae mape mase smape rmse rsq
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 3 SVM-no_fourier-tuned Test 0.371 360. 0.506 57.8 0.544 0.482
#2 2 SVM-fourier-tuned Test 0.395 379. 0.539 63.4 0.572 0.426
#3 6 NNETAR-fourier-tuned Test 0.427 266. 0.582 69.7 0.576 0.431
#4 4 Random Forest-tuned Test 0.419 308. 0.572 64.5 0.585 0.403
#5 1 PROPHET_XGBOOST-tuned Test 0.424 374. 0.578 62.9 0.598 0.377
#6 5 NNETAR-no_fourier-tuned Test 0.450 273. 0.613 73.3 0.602 0.391
weight_ensemble_tbl <- calibration_tuned_tbl %>%
ensemble_weighted(loadings = c(5,7,10,6,8,9)) %>%
modeltime_table()
Calibrate Accuracy of Weighted Ensemble Model
weight_ensemble_tbl %>% modeltime_accuracy(testing(splits))
## A tibble: 1 × 9
# .model_id .model_desc .type mae mape mase smape rmse rsq
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 ENSEMBLE (WEIGHTED): 6 MODELS Test 0.387 318. 0.529 60.7 0.553 0.462
Refit Ensemble on All Data
refit_weight_tbl <- weight_ensemble_tbl %>%
modeltime_refit(data = data_prepared_tbl)
Visualizing Sales Forecast with Ensemble Model
refit_weight_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl) %>%
mutate(across(.value, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
#filter_by_time(.start_date = '2018') %>%
plot_modeltime_forecast(.interactive = FALSE,
.title = "Weighted Ensemble Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.6) +
theme_dark_grey() +
scale_color_todd_dark_bright()
If you need the actual forecast values, be sure to un-transform and save any individual model or ensemble data. These files have a lot of information, so they can get quite large.
Collect Un-Transformed Forecast Data
forecast_tbl_final <- refit_weight_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl) %>%
mutate(across(.value, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
filter(.key == 'prediction')
forecast_tbl_final
## A tibble: 84 × 5
# .model_id .model_desc .key .index .value
# <int> <chr> <fct> <date> <dbl>
# 1 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2018-12-31 1651.
# 2 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-01 68.4
# 3 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-02 200.
# 4 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-03 0.992
# 5 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-04 637.
# 6 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-05 846.
# 7 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-06 1030.
# 8 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-07 1030.
# 9 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-08 1153.
#10 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-09 183.
# # … with 74 more rows
The .value
is our daily sales prediction for the corresponding date
(.index
).
🤔 Decisions
Our 3-month forecast predicts a reduction in sales over January and February. The models then predict an increase in March sales. There’s always the possibility for unexpectedly big spikes in single day sales, as demonstrated in our model confidence intervals. However, it is more logical to expect sales to decrease until March, so the company should prepare supply-chains, warehouse inventory, and hire/train new employees accordingly.
For more nuanced decisions, we could run another version of this pipeline to forecast sales or orders for different categories/segments or for individual states/regions. Similar to our EDA towards the beginning of this project.
In a future post I’ll show how to use the {modeltime.ensemble}
package
for creating more sophisticated ensembles. Until then, I hope you’ve
gotten a good sense for what forecasting with {modeltime}
can do. If
you already model with {tidymodels}
, {modeltime}
is a package worth
digging into. Find the complete code @
github.com/TWarczak.