!!Work in progress!!

Introduction

This analysis is based on Instacart’s dataset of 3 million anonymized orders, which is discussed in a blog post here: https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

The goal of the Kaggle competition is to predict which items, from among those previously purchased, will be included in a customer’s next order. https://www.kaggle.com/c/instacart-market-basket-analysis

Exploratory analysis

Data

The data comprise six .csv files, plus a sample submission file.
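Throughout, the code assumes the dplyr and ggplot2 packages are loaded (presumably in a setup chunk not shown in this knitted output):

```r
## Packages used throughout this analysis
library(dplyr)    # filter, count, group_by, joins, %>%
library(ggplot2)  # qplot, ggplot
```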

data <- list.files("../Data")
data
## [1] "aisles.csv"                "departments.csv"          
## [3] "order_products__prior.csv" "order_products__train.csv"
## [5] "orders.csv"                "products.csv"             
## [7] "sample_submission.csv"

Let’s have a look at them. The first (alphabetically) is aisles.csv.

aisles <- read.csv("../Data/aisles.csv")
dim(aisles)
## [1] 134   2
head(aisles)
##   aisle_id                      aisle
## 1        1      prepared soups salads
## 2        2          specialty cheeses
## 3        3        energy granola bars
## 4        4              instant foods
## 5        5 marinades meat preparation
## 6        6                      other

We see there are 134 “aisles”, each of which corresponds to a different collection of groceries.

The next .csv file is departments.csv.

depts <- read.csv("../Data/departments.csv")
dim(depts)
## [1] 21  2
head(depts)
##   department_id    department
## 1             1        frozen
## 2             2         other
## 3             3        bakery
## 4             4       produce
## 5             5       alcohol
## 6             6 international

So there are 21 departments.

Next we have the biggest .csv files. order_products__prior.csv contains previous order contents for all customers.

opp <- read.csv(file = "../Data/order_products__prior.csv")
dim(opp)
## [1] 32434489        4
head(opp)
##   order_id product_id add_to_cart_order reordered
## 1        2      33120                 1         1
## 2        2      28985                 2         1
## 3        2       9327                 3         0
## 4        2      45918                 4         1
## 5        2      30035                 5         0
## 6        2      17794                 6         1

So there are 32,434,489 rows in the prior-order dataset. For each row we get order_id, telling us which order the row belongs to; product_id, telling us which product it is; add_to_cart_order, telling us at what point during the order the item was added to the cart; and reordered, telling us whether or not the customer had previously ordered the item.

Next let’s have a look at the order_products__train.csv file.

opt <- read.csv(file = "../Data/order_products__train.csv")
dim(opt)
## [1] 1384617       4
head(opt)
##   order_id product_id add_to_cart_order reordered
## 1        1      49302                 1         1
## 2        1      11109                 2         1
## 3        1      10246                 3         0
## 4        1      49683                 4         0
## 5        1      43633                 5         1
## 6        1      13176                 6         0

We see it contains 1,384,617 observations. The columns are the same as in the order_products__prior.csv file.

The orders.csv file tells us information about the orders, including which dataset (prior, training or test) each belongs to.

orders <- read.csv(file = "../Data/orders.csv")
dim(orders)
## [1] 3421083       7
head(orders)
##   order_id user_id eval_set order_number order_dow order_hour_of_day
## 1  2539329       1    prior            1         2                 8
## 2  2398795       1    prior            2         3                 7
## 3   473747       1    prior            3         3                12
## 4  2254736       1    prior            4         4                 7
## 5   431534       1    prior            5         4                15
## 6  3367565       1    prior            6         2                 7
##   days_since_prior_order
## 1                     NA
## 2                     15
## 3                     21
## 4                     29
## 5                     28
## 6                     19

We see there are 3,421,083 orders in the dataset. Each one has a user_id, an eval_set label (prior, train or test), an order_number, a weekday order_dow and time order_hour_of_day for the order, and days_since_prior_order, i.e. how long it has been since the user’s previous order (NA for a user’s first order).

The number of unique users is about 200 thousand:

length(
  unique(
    orders$user_id
    )
  )
## [1] 206209

Finally let’s have a look at the products.csv file:

products <- read.csv(file = "../Data/products.csv")
dim(products)
## [1] 49688     4
head(products)
##   product_id
## 1          1
## 2          2
## 3          3
## 4          4
## 5          5
## 6          6
##                                                        product_name
## 1                                        Chocolate Sandwich Cookies
## 2                                                  All-Seasons Salt
## 3                              Robust Golden Unsweetened Oolong Tea
## 4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce
## 5                                         Green Chile Anytime Sauce
## 6                                                      Dry Nose Oil
##   aisle_id department_id
## 1       61            19
## 2      104            13
## 3       94             7
## 4       38             1
## 5        5            13
## 6       11            11

There are 49,688 different products; for each one, as well as a product_id, we get a product_name, an aisle_id and a department_id.

Exploratory graphs

Number of orders

Let’s have a look at how many orders there are per user (e.g. via table(orders$user_id)), and plot a histogram:

qplot(n, data = count(orders, user_id), bins = 97)

We can see that the number of orders per user is between 4 and 100, as stated in the dataset description. The count is strictly decreasing as a function of the number of orders, except at 100, which suggests that all customers with \(\geq 100\) orders have been binned together at 100.
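The capping pattern itself is easy to illustrate with base R’s pmin (a toy sketch; the actual truncation was done by Instacart before release):

```r
## Hypothetical order counts for four users; anything at or above
## 100 is collapsed into the final "100" bin
n_orders <- c(4, 50, 137, 250)
capped <- pmin(n_orders, 100)
capped
## [1]   4  50 100 100
```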

Number of days between orders

qplot(days_since_prior_order, data = orders, xlab = "Days since prior order", bins = 30, na.rm = T)

This is obviously cut off at a maximum of 30 days, with any higher value counting as 30. The histogram seems to peak at 7 days, with further local maxima at other multiples of 7 days, suggesting that people tend to do their shopping at weekly intervals.

Size of orders

count(opp, order_id) %>% select(n) %>% summary()
##        n         
##  Min.   :  1.00  
##  1st Qu.:  5.00  
##  Median :  8.00  
##  Mean   : 10.09  
##  3rd Qu.: 14.00  
##  Max.   :145.00
qplot(count(opp, order_id)$n, geom = "histogram", xlim = c(0,50), bins = 51, na.rm = T)

How consistent are users with their order sizes?

count(opp, order_id) %>% inner_join(orders) %>% select(user_id, n) %>% group_by(user_id) %>% summarize(sd = sd(n)) %>% select(sd) %>% summary()
## Joining, by = "order_id"
##        sd        
##  Min.   : 0.000  
##  1st Qu.: 2.345  
##  Median : 3.782  
##  Mean   : 4.266  
##  3rd Qu.: 5.610  
##  Max.   :44.747

So 50% of users have a standard deviation of their order size between 2.3 and 5.6 items, with a mean user standard deviation of about 4.3.
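The per-user standard deviation computation can be sketched in miniature with base R’s tapply on toy data:

```r
## Toy order sizes: user 1 placed orders of 5, 8 and 11 items;
## user 2 placed two orders of 10 items each
sizes <- data.frame(user_id = c(1, 1, 1, 2, 2),
                    n       = c(5, 8, 11, 10, 10))
## Standard deviation of order size, computed per user
sds <- tapply(sizes$n, sizes$user_id, sd)
## user 1 has sd 3; user 2 has sd 0
sds
```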

Benchmark dataset output

Let’s have a look at the sample submission to see what format our output should be in:

sampsub <- read.csv(file = "../Data/sample_submission.csv", stringsAsFactors = F)
dim(sampsub)
## [1] 75000     2
head(sampsub)
##   order_id    products
## 1       17 39276 29259
## 2       34 39276 29259
## 3      137 39276 29259
## 4      182 39276 29259
## 5      257 39276 29259
## 6      313 39276 29259

We see that there are 75,000 rows with two columns. The first column is the order ID; the idea is that, given the order ID, we should predict which products are included (from among those that the customer has previously ordered). The predicted products should be given as a space-separated list. It’s possible that no repeat products were ordered; in this case the submitted answer for “products” should be “None”. https://www.kaggle.com/c/instacart-market-basket-analysis#evaluation
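As a sketch, here is a small helper for producing that submission string (format_products is a hypothetical name, not part of any competition kit):

```r
## Convert a vector of product IDs into the submission format:
## a space-separated string, or "None" if no reorders are predicted
format_products <- function(ids){
  if (length(ids) == 0) "None" else paste(ids, collapse = " ")
}

format_products(c(39276, 29259))
## [1] "39276 29259"
format_products(integer(0))
## [1] "None"
```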

In the sample submission, every basket is identical.

all(sampsub$products==sampsub$products[1])
## [1] TRUE

Let’s see what products are in the sample basket by subsetting the products dataframe to just the rows with these two product IDs:

sampbasket <- as.numeric(strsplit(sampsub$products[1],"\\ ")[[1]])

products %>% filter(product_id %in% sampbasket)
##   product_id product_name aisle_id department_id
## 1      29259 Baby Bananas       24             4
## 2      39276      Bananas       24             4

This explains why they chose the name “Going Bananas Benchmark”!

Partitioning off a validation set

Let’s split a validation set off of the training set to give us an estimate of our out-of-sample error rate. We should not use the validation set for any training or preliminary investigation.

Let’s get the IDs of the orders that are in the training set, and sample 20% of them as a validation set.

trainorders <- orders %>% filter(eval_set=="train") %>% select(order_id)
set.seed(9383)
intrain <- as.logical(
  rbinom(
    n = nrow(trainorders), 
    size = 1, 
    prob = 0.8
    )
  )
trainingIDs <- trainorders %>% filter(intrain)
validationIDs <- trainorders %>% filter(!intrain)
nrow(trainingIDs)
## [1] 105142
nrow(validationIDs)
## [1] 26067
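The rbinom mask used above is just an independent Bernoulli coin flip per order; on a toy vector the idea looks like this (any seed would do):

```r
set.seed(1)
## One Bernoulli(0.8) draw per element, coerced to a logical mask
mask <- as.logical(rbinom(n = 10000, size = 1, prob = 0.8))
## The fraction of TRUEs, i.e. the training share, should be close to 80%
mean(mask)
```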

Now we can set up a function to evaluate how accurate our attempts are. Let’s assign the “going bananas benchmark” to our validation set.

valsample <- data.frame(order_id = validationIDs, products = sampsub[1,2])
valsample[,2] <- as.character(valsample[,2])
head(valsample)
##   order_id    products
## 1  1854765 39276 29259
## 2  1524161 39276 29259
## 3  1072954 39276 29259
## 4   669729 39276 29259
## 5  2608424 39276 29259
## 6   859654 39276 29259

To evaluate this, we need to find the “ground truth” for our validation set. For each order number, we need to get a list of products which appear in that order and which are repeat orders. We can do this with the following pipeline:

##
## Subset to training orders
##
groundtruth <- filter(orders, eval_set == "train") %>% 
  ##
  ## Join with the training data frame "opt"
  ## to get the product IDs
  ##
  left_join(opt) %>% 
  ##
  ## Only consider reordered items;
  ## only use order_id and product_id columns
  ## 
  filter(reordered == 1) %>% select(order_id, product_id) %>%
  ##
  ## Create a new column with a string listing
  ## the products in each order
  ##
  group_by(order_id) %>% mutate(truth = paste0(product_id, collapse = " ")) %>% 
  ##
  ## Use only order_id and "truth" columns and delete
  ## duplicate rows.
  ##
  select(order_id, truth) %>% unique() %>% data.frame()
## Joining, by = "order_id"
## Let's see what the data frame looks like

tail(groundtruth)
##        order_id
## 122602  3383615
## 122603  2585586
## 122604   943915
## 122605  2371631
## 122606  1716008
## 122607   272231
##                                                                                  truth
## 122602                           24852 28849 18027 39947 44632 26209 35547 40055 12206
## 122603 8898 6701 38855 38341 38930 12127 49198 22242 7702 36944 10333 42352 7103 18952
## 122604               22828 13176 39190 34243 15592 47209 21137 35221 21709 29344 43504
## 122605           15693 37188 21469 41007 2482 14050 26384 3765 36929 23153 31915 13176
## 122606                                             27845 21137 28745 22035 31404 10181
## 122607                                                           6846 9405 24852 39216

\(F_1\) score

Submissions are scored by the mean \(F_1\) score. For a single order this is defined as \[ F_1 = 2 \cdot \frac{p \cdot r }{p + r} \, , \]

where \(p\) is the precision, i.e. the number of correct product IDs divided by the number of product IDs submitted, and \(r\) is the recall, i.e. the number of correct product IDs divided by the number of product IDs there should have been. Both can be computed in the same way. Given a list \(G\) of predicted values and a list \(T\) of true values (with no duplicates in either, which holds for product IDs within an order), we have \(p = |G \cap T| / |G|\) and \(r = |G \cap T| / |T|\). We can compute these using R’s %in% operator: G %in% T gives a logical vector of length \(|G|\) whose entries are TRUE for each element of \(G\) that also appears in \(T\), so sum(G %in% T) equals \(|G \cap T|\) and mean(G %in% T) equals \(p\). Similarly, mean(T %in% G) \(= r\).

Let’s create some functions to compute the F1 score:

string.ratio <- function(s1, s2){
    mean( strsplit(s1, split = " ")[[1]] %in% strsplit(s2, split = " ")[[1]] )
}

f1strings <- function(s1, s2){
  p = string.ratio(s1, s2)
  r = string.ratio(s2, s1)
  f1 = ifelse(p != 0 & r != 0,
              2 * p * r / (p + r),
              0)
  f1
}
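As a quick sanity check, here is the pair of functions applied to a toy example (the definitions are repeated so the snippet stands alone): with prediction “1 2 3” and truth “2 3 4”, two of three IDs match each way, so \(p = r = 2/3\) and \(F_1 = 2/3\).

```r
string.ratio <- function(s1, s2){
    mean( strsplit(s1, split = " ")[[1]] %in% strsplit(s2, split = " ")[[1]] )
}

f1strings <- function(s1, s2){
  p = string.ratio(s1, s2)
  r = string.ratio(s2, s1)
  ifelse(p != 0 & r != 0, 2 * p * r / (p + r), 0)
}

## p = 2/3, r = 2/3, so F1 = 2 * (4/9) / (4/3) = 2/3
f1strings("1 2 3", "2 3 4")
## [1] 0.6666667
```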

Now, with our ground truth data frame and our functions to compute F1, we can write a function that takes a data frame of order_ids and predictions and outputs a data frame with a new column for the \(F_1\) score of that prediction:

pred.to.f1 <- function(df){
  ## f1strings works on a single pair of strings, so apply it
  ## row by row with mapply
  groundtruth %>% inner_join(df) %>% mutate(f1 = mapply(f1strings, products, truth)) %>% select(order_id, f1) 
}

Let’s use this to compute the \(F_1\) score for the validation sample:

mean(pred.to.f1(valsample)$f1)
## Joining, by = "order_id"
## [1] 0.000558

So the benchmark guess for the validation dataset is about 0.000558. On the other hand, the benchmark guess for the test dataset, available on the leaderboard at https://www.kaggle.com/c/instacart-market-basket-analysis/leaderboard , is 0.000545. These two scores are within 3% of each other, which is reassuring because it tells us that we are computing the \(F_1\) score correctly, and that there is some consistent structure in the dataset for us to find as expected (i.e. the orders are not just random).

A better benchmark

A better benchmark (than just giving bananas to everyone) would be to predict that a user buys a product if it appears in more than 50% of their other orders.

First let’s work out how to take a user_id and produce the desired output.

We’ll use the prior orders data frame to compute, for each product, the fraction of the user’s orders that contain it.

First let’s look at a single example so we can work out how to do this. Let’s look at user 178520.

ord1 <- orders %>% filter(user_id == 178520) %>% select(order_id) %>% inner_join(opp) 
## Joining, by = "order_id"
n1 <- ord1 %>% distinct(order_id) %>% nrow()
fracs1 <- ord1 %>% count(product_id) %>% mutate(frac = n/n1)

We see that very few of the products are chosen in over 50% of orders, so this seems to be quite conservative:

qplot(fracs1$frac)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Actually there are six items ordered over half the time:

guess1 <- fracs1 %>% filter(frac >= 0.5)
guess1 %>% inner_join(products) %>% select(frac, product_name)
## Joining, by = "product_id"
## # A tibble: 6 × 2
##        frac                                            product_name
##       <dbl>                                                  <fctr>
## 1 0.5178571             Kellogg's Nutri-Grain Apple Cinnamon Cereal
## 2 0.5892857 Nutri-Grain Soft Baked Strawberry Cereal Breakfast Bars
## 3 0.5178571                                     2% Reduced Fat Milk
## 4 0.8750000                             Oats & Chocolate Chewy Bars
## 5 0.6607143                                 Sugar Free Energy Drink
## 6 0.8214286                                            Energy Drink

So this person mostly buys breakfast supplies and energy drinks.

Let’s put this as our guess for this one user:

ans1 <- guess1$product_id %>% paste(collapse = " ")
val1 <- orders %>% filter(eval_set == "train", user_id == 178520) %>% select(order_id) %>% mutate(products = ans1)
val1
##   order_id                            products
## 1  2331095 10054 21351 23909 27761 31102 41276

We can find the F1 score of this one example by running

f1strings((groundtruth %>% filter(order_id == 2331095) %>% select(truth))[[1,1]], ans1)
## [1] 0.4

That’s not bad: the Kaggle leaderboard is led (at the time of writing) by a score of 0.403.

Now we need to apply the same manipulations automatically to the whole prior dataset.

practice <- data.frame(user_id = c(1,1,1,1,2,2,2,2,2), order_id = c(1,1,2,2,3,3,3,4,4), product_id = c(7,9,7,10,8,11,7,6,11))

ordercount <- practice %>% group_by(user_id) %>% summarize(norders = n())

practice %>% group_by(user_id, product_id) %>% summarize(count = n()) %>% inner_join(ordercount) %>% mutate(frac = count/norders) %>% filter(frac >= 0.5)
## Joining, by = "user_id"
## Source: local data frame [1 x 5]
## Groups: user_id [1]
## 
##   user_id product_id count norders  frac
##     <dbl>      <dbl> <int>   <int> <dbl>
## 1       1          7     2       4   0.5
Now let’s make a data frame listing, for each user, the fraction of that user’s orders in which each product appeared. The computation takes a bit of time, so let’s save the output.

if(!file.exists("orderfracs.rds")){
  ##
  ## Link the prior orders to the order data to get user IDs
  ##
  benchdat <- opp %>% inner_join(orders) %>% 
    ##
    ## Keep only the user_id, order_id and product_id columns
    ##
    select(user_id, order_id, product_id)
  ##
  ## Make a new data frame norders listing the number
  ## of distinct orders for each user
  ##
  norders <- benchdat %>% group_by(user_id) %>% summarize(number_orders = n_distinct(order_id))
  ##
  ## Make a new data frame listing which fraction of a user's
  ## orders include a product.
  ## First count the occurrences of each product
  ##
  orderfracs <- benchdat %>% group_by(user_id, product_id) %>% summarize(count = n()) %>% 
    ##
    ## Then join to the data frame with the number of orders
    ## per user and compute the fraction
    ##
    inner_join(norders) %>% mutate(frac = count/number_orders) %>% 
    ##
    ## Return only the fields we need
    ##
    select(user_id, product_id, frac) 

  saveRDS(orderfracs, file = "orderfracs.rds")
} else {
  orderfracs <- readRDS("orderfracs.rds")
}
bench2 <- orderfracs %>% filter(frac >= 0.5)

Putting the data in the right form:

filtered.to.form <- function(df){
  df %>% group_by(user_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(user_id, products) %>% unique()
}
bench2form <- filtered.to.form(bench2)
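The group-and-paste step inside filtered.to.form can be mimicked in base R, which makes the intent clear (toy data, not the real pipeline):

```r
## Two users with their over-50% products
bench.toy <- data.frame(user_id = c(1, 1, 2), product_id = c(7, 9, 11))
## Collapse each user's products into one space-separated string
toy.form <- tapply(bench.toy$product_id, bench.toy$user_id, paste, collapse = " ")
toy.form[["1"]]
## [1] "7 9"
```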

Now we need to use this to get predictions:

form.to.train <- function(df){
  filter(orders, eval_set == "train") %>% left_join(df) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"",products))
}
guesses1 <- form.to.train(bench2form)
## Joining, by = "user_id"

Now we need to compare this to the “ground truth”.

f1scores <- pred.to.f1(guesses1)
## Joining, by = "order_id"
mean(f1scores$f1)
## [1] 0.279

So the score is about 27.9%, which is OK.

Let’s generate a prediction for the test dataset to submit to Kaggle.

# Let's make the data frame with the same pipeline as above

guess1test <- filter(orders, eval_set == "test") %>% left_join(bench2form) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "user_id"
write.csv(guess1test, file = "guess1.csv", quote = F, row.names = F)

which, when uploaded to Kaggle, gets a score of about 28.7%.

Finding the best cutoff

Maybe it would be better to guess that all items bought more than a fraction \(q\) of the time are in a given basket, where \(q\) is some fraction between 0 and 1. Let’s compute the score for a range of cutoffs and work out which fraction gives the highest.

if(!file.exists("scores.rds")){
  scores <- NULL
  for (i in 0:50){
    print(i)
    p <- 0.1 + i*0.01
    pguess <- filter(orderfracs, frac >= p)
    f1sc <- pguess %>% filtered.to.form() %>% form.to.train() %>% pred.to.f1()
    score <- mean(f1sc$f1)
    scores <- c(scores, score)
  }
  saveRDS(scores, file = "scores.rds")
} else {
  scores <- readRDS("scores.rds")
}
pmax <- 0.1 + 0.01 * (0:50)[which(scores == max(scores))]
plot(0.1 + 0.01 * (0:50), scores, xlab = "Cutoff fraction")
abline(v = pmax)

Uploading the test set with this cutoff

# Let's make the data frame with the same pipeline as above

prods.26 <- orderfracs %>% filter(frac >= 0.26) %>% filtered.to.form()

test.26 <- filter(orders, eval_set == "test") %>% left_join(prods.26) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "user_id"
write.csv(test.26, file = "cutoff26.csv", quote = F, row.names = F)

This gives a score on Kaggle of 0.3326.

A model taking into account time since previous order

Let’s build a logistic regression model with two predictors: the length of time since the previous order and the fraction of previous orders in which the item was bought.

We need to get the data into a particular form to do this: a data frame which, for every order in the training set, has one row per product the user has previously ordered, with columns for the order_id, the days_since_prior_order, the fraction frac of the user’s prior orders containing the product, and a logical inorder recording whether the product actually appeared in the order.

Let’s start by making a data frame which, for every order in the training set, lists all products previously ordered by the user together with their frac values.

word.in.sentence <- function(word, sentence){
    word %in% strsplit(sentence, split = " ")[[1]]
}
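A quick check of word.in.sentence (note that %in% coerces the numeric product_id to character before comparing), repeating the definition so the snippet stands alone:

```r
word.in.sentence <- function(word, sentence){
    word %in% strsplit(sentence, split = " ")[[1]]
}

word.in.sentence(196, "196 25133 38928")
## [1] TRUE
word.in.sentence(12427, "196 25133 38928")
## [1] FALSE
```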
if(!file.exists("prodsinorderdays.rds")){
  trainordersdays <- orders %>% filter(eval_set == "train") %>% select(order_id, user_id, days_since_prior_order)

  trainordersprods <- opt %>% filter(reordered == 1) %>% group_by(order_id) %>% mutate(prods = paste0(product_id, collapse = " "))  %>% select(order_id, prods) %>% unique()

  prodsinorderdays <- trainordersdays %>% inner_join(orderfracs) %>% inner_join(trainordersprods) %>% group_by(order_id) %>% mutate(inorder = word.in.sentence(product_id, prods)) %>% data.frame()

  saveRDS(prodsinorderdays, "prodsinorderdays.rds")
} else {
  prodsinorderdays <- readRDS("prodsinorderdays.rds")
}

head(prodsinorderdays)
##   order_id user_id days_since_prior_order product_id frac
## 1  1187899       1                     14        196  1.0
## 2  1187899       1                     14      10258  0.9
## 3  1187899       1                     14      10326  0.1
## 4  1187899       1                     14      12427  1.0
## 5  1187899       1                     14      13032  0.3
## 6  1187899       1                     14      13176  0.2
##                                                       prods inorder
## 1 196 25133 38928 26405 39657 10258 13032 26088 49235 46149    TRUE
## 2 196 25133 38928 26405 39657 10258 13032 26088 49235 46149    TRUE
## 3 196 25133 38928 26405 39657 10258 13032 26088 49235 46149   FALSE
## 4 196 25133 38928 26405 39657 10258 13032 26088 49235 46149   FALSE
## 5 196 25133 38928 26405 39657 10258 13032 26088 49235 46149    TRUE
## 6 196 25133 38928 26405 39657 10258 13032 26088 49235 46149   FALSE

Now that we have the data frame in the form we need, let’s train a logistic regression model with days_since_prior_order and frac as the regressors. The model object is huge (about 4 GB) and we only need the coefficients, so we’ll save just those and do the predictions manually instead of using R’s predict() function.

if(!file.exists("dayfraccoefs.rds")){
  if(!file.exists("dayfracfit.rds")){
    dayfracfit <- glm(inorder ~ frac + days_since_prior_order, data = prodsinorderdays, family = binomial)
    saveRDS(dayfracfit, "dayfracfit.rds")
  } else {
    dayfracfit <- readRDS("dayfracfit.rds")
  }
  dayfraccoefs <- summary(dayfracfit)$coef[,1]
  saveRDS(dayfraccoefs, "dayfraccoefs.rds")
} else {
  dayfraccoefs <- readRDS("dayfraccoefs.rds")
}
dayfraccoefs
##            (Intercept)                   frac days_since_prior_order 
##           -3.159715623            4.997165461           -0.004554965

Now, this model outputs an updated “probability” of a certain item being in an order which is an increasing function of the fraction of prior orders and a decreasing function of the time since the previous order was made — the longer a user waits between orders, the less likely it is that they will buy something! Is this too simple?

Using these estimates of the regression coefficients, let’s define a function to give the response for a given value of frac and days_since_prior_order.

day.frac.response <- function(frac, days_since_prior_order){
  linear = dayfraccoefs[1] + dayfraccoefs[2] * frac + dayfraccoefs[3] * days_since_prior_order
  1 / (1 + exp(- linear))
}
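As a self-contained check (with the fitted coefficients hard-coded), the response at frac = 0.26 after 15 days comes out around 0.1269, matching the predict() output later in this section:

```r
## Coefficients from the fitted model above, hard-coded so this
## snippet stands alone
dayfraccoefs <- c(-3.159715623, 4.997165461, -0.004554965)

day.frac.response <- function(frac, days_since_prior_order){
  linear = dayfraccoefs[1] + dayfraccoefs[2] * frac + dayfraccoefs[3] * days_since_prior_order
  1 / (1 + exp(- linear))
}

day.frac.response(0.26, 15)
## [1] 0.126886
```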

We can make a plot of the response as a function of frac, showing how it varies with the number of days—the impact of days_since_prior_order is small but not negligible.

df.days <- data.frame(days_since_prior_order = c(rep(1,101),rep(15,101),rep(30,101)), frac = 0.01 * 0:100) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
##
df.days$days_since_prior_order <- as.factor(df.days$days_since_prior_order) 
##
ggplot(df.days, aes(x = frac, y = response, colour = days_since_prior_order)) + geom_line(lwd = 2, alpha = 0.8)

Now, with this model trained, let us make predictions on the test dataset. We need to get a data frame which has the test order numbers with frac and days_since_prior_order as its columns.

predtestfracdays <- orders %>% filter(eval_set == "test") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"

As we saw when using frac alone, it isn’t easy to predict in advance where we should set the cutoff. But we know the best cutoff on frac was about 0.26, which corresponds to a response value of roughly 0.12 to 0.13, depending on the number of days:

if(!file.exists("predday26.rds")){
  predday26 <- predict(dayfracfit, newdata = data.frame(days_since_prior_order = c(1,15,30), frac = 0.26), type = "response")
  saveRDS(predday26, "predday26.rds")
} else {
  predday26 <- readRDS("predday26.rds")
}
predday26
##         1         2         3 
## 0.1341204 0.1268860 0.1195076

So let’s check some values in the region of 0.1 to 0.15. We’ll use the training dataset again. We need to get a data frame with the order number, days since prior order, and frac variable.

odpf <- orders %>% filter(eval_set == "train") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"

Using this, let’s try different values of the cutoff, i.e. let’s vary \(p\), make the prediction that an item is in the cart if the response from the logistic regression is \(\geq p\), and see which value of \(p\) gives the best value of \(F_1\).

if(!file.exists("dayscores.rds")){
  day.scores <- NULL
  for(i in 0:20){
    p <- 0.1 + 0.0025 * i
    print(i)
    prod.pred <- odpf %>% filter(response >= p) %>% group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()
    ##
    ##
    ##
    f1.scores <- pred.to.f1(prod.pred)
    day.scores <- c(day.scores, mean(f1.scores$f1))
  }
  saveRDS(day.scores, "dayscores.rds")
} else {
  day.scores <- readRDS("dayscores.rds")
}

if(!file.exists("dayscoresfine.rds")){
  day.scores.fine <- NULL
  for(i in 0:19){
    p <- 0.125 + 0.00025 * i
    print(i)
    prod.pred <- odpf %>% filter(response >= p) %>% group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()
    ##
    ##
    ##
    f1.scores <- pred.to.f1(prod.pred)
    day.scores.fine <- c(day.scores.fine, mean(f1.scores$f1))
  }
  saveRDS(day.scores.fine,"dayscoresfine.rds")
} else {
  day.scores.fine <- readRDS("dayscoresfine.rds")
}

We can plot our results:

day.scores.df <- data.frame(cutoff = c(0.1 + 0.0025 * (0:20), 0.125 + 0.00025 * (0:19)), F1 = c(day.scores, day.scores.fine))
#
best.cutoff = 0.128125
#
ggplot(day.scores.df, aes(x = cutoff, y = F1)) + geom_line(lwd = 1) + geom_vline(xintercept = best.cutoff)

The optimum cutoff for this response is thus 0.128125.

Using this model and this cutoff, let’s make predictions for the test dataset and upload them to Kaggle.

##
## Make a data frame with test order_id's, frac and days_since_prior_order
##
test.odpf <- orders %>% filter(eval_set == "test") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>% 
  ##
  ## Predict the response
  ##
  mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"
##
## Subset to the products with scores above the cutoff
##
frac.day.pred <- test.odpf %>% filter(response >= best.cutoff) %>% 
  ##
  ## Put output in desired form
  ##
  group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()

##
## Make sure we have all order_id's from test set, setting "None"
## if there are no products predicted
##
frac.day.pred.all <- filter(orders, eval_set == "test") %>% left_join(frac.day.pred) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "order_id"
##
## Output CSV in the required format
##
write.csv(frac.day.pred.all, file = "fracdaypred.csv", quote = F, row.names = F)

This gives a score of 0.3321, slightly worse than the submission that didn’t account for the number of days since the previous order!