!!Work in progress!!
This analysis is based on Instacart’s dataset of 3 million anonymized orders, which is discussed in a blog post here: https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
The goal of the Kaggle competition is to predict which items, from among those previously purchased, will be included in a customer’s next order. https://www.kaggle.com/c/instacart-market-basket-analysis
The data comprise five .csv files.
data <- list.files("../Data")
data
## [1] "aisles.csv" "departments.csv"
## [3] "order_products__prior.csv" "order_products__train.csv"
## [5] "orders.csv" "products.csv"
## [7] "sample_submission.csv"
Let’s have a look at them. The first (alphabetically) is aisles.csv
.
aisles <- read.csv("../Data/aisles.csv")
dim(aisles)
## [1] 134 2
head(aisles)
## aisle_id aisle
## 1 1 prepared soups salads
## 2 2 specialty cheeses
## 3 3 energy granola bars
## 4 4 instant foods
## 5 5 marinades meat preparation
## 6 6 other
We see there are 134 “aisles”, each of which corresponds to a different collection of groceries.
The next .csv file is departments.csv
.
depts <- read.csv("../Data/departments.csv")
dim(depts)
## [1] 21 2
head(depts)
## department_id department
## 1 1 frozen
## 2 2 other
## 3 3 bakery
## 4 4 produce
## 5 5 alcohol
## 6 6 international
So there are 21 departments.
Next we have the biggest .csv files. order_products__prior.csv
contains previous order contents for all customers.
opp <- read.csv(file = "../Data/order_products__prior.csv")
dim(opp)
## [1] 32434489 4
head(opp)
## order_id product_id add_to_cart_order reordered
## 1 2 33120 1 1
## 2 2 28985 2 1
## 3 2 9327 3 0
## 4 2 45918 4 1
## 5 2 30035 5 0
## 6 2 17794 6 1
So there are 32,434,489 data points in the prior order dataset. For each data point, we get the order_ID
telling us which particular order the data point corresponds to; product_ID
telling us which type of product that data point is, add_to_cart_order
telling us at which point in carrying out the order was the item chosen, and reordered
which tells us whether or not the customer had previously ordered the item.
Next let’s have a look at the `order_products__train.csv’ file.
opt <- read.csv(file = "../Data/order_products__train.csv")
dim(opt)
## [1] 1384617 4
head(opt)
## order_id product_id add_to_cart_order reordered
## 1 1 49302 1 1
## 2 1 11109 2 1
## 3 1 10246 3 0
## 4 1 49683 4 0
## 5 1 43633 5 1
## 6 1 13176 6 0
We see it contains 1,384,617 observations. The columns are the same as in the order_products__prior.csv
file.
The `orders.csv’ file tells us information about the orders, including which dataset (prior, training or test) it belongs to.
orders <- read.csv(file = "../Data/orders.csv")
dim(orders)
## [1] 3421083 7
head(orders)
## order_id user_id eval_set order_number order_dow order_hour_of_day
## 1 2539329 1 prior 1 2 8
## 2 2398795 1 prior 2 3 7
## 3 473747 1 prior 3 3 12
## 4 2254736 1 prior 4 4 7
## 5 431534 1 prior 5 4 15
## 6 3367565 1 prior 6 2 7
## days_since_prior_order
## 1 NA
## 2 15
## 3 21
## 4 29
## 5 28
## 6 19
We see there are 3,421,083 orders in the dataset. Each one has a user_id
, an order_number
, a weekday order_dow
and time order_hour_of_day
for the order, and days_since_prior_order
, i.e. how long it has been since the user’s previous order.
The number of unique users is about 200 thousand:
length(
unique(
orders$user_id
)
)
## [1] 206209
Finally let’s have a look at the products.csv
file:
products <- read.csv(file = "../Data/products.csv")
dim(products)
## [1] 49688 4
head(products)
## product_id
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## product_name
## 1 Chocolate Sandwich Cookies
## 2 All-Seasons Salt
## 3 Robust Golden Unsweetened Oolong Tea
## 4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce
## 5 Green Chile Anytime Sauce
## 6 Dry Nose Oil
## aisle_id department_id
## 1 61 19
## 2 104 13
## 3 94 7
## 4 38 1
## 5 5 13
## 6 11 11
There are 49,688 different products; for each one as well as a product_id
we get a product_name
, an aisle_id
and a department_id
.
Let’s have a look at how many orders there are per user, given by table(orders$user_id)
. Let’s plot a histogram:
qplot(n, data = count(orders, user_id), bins = 97)
We can see that the number of orders in the dataset is between 4 and 100, as described. The count is strictly decreasing as a function of the number of orders, except at 100. This suggests that all customers with \(\geq 100\) orders have been added into the 100 box.
We can plot this with
qplot(as.factor(orders$order_dow))
This seems to show that days 0 and 1 are the most popular - suggesting that these two days might be weekend days.
We can see which times of day are popular to order with the following plot:
qplot(as.factor(orders$order_hour_of_day), xlab = "Time of day")
This shows that the purchases peak once at 10a.m. and again at 3p.m.
We’ve seen that whether or not it’s a weekend (presuming this is what days “0” and “1” are) has a significant impact on the number of orders made. Does it also impact what time of days the orders are made? Apparently not:
day.type <- with(orders, as.factor(order_dow == 0 | order_dow == 1))
levels(day.type) <- c("Weekday", "Weekend")
ggplot(orders, aes(order_hour_of_day, fill = day.type)) + labs(x = "Time of day") + geom_density(alpha = 0.2, adjust = 3)
qplot(days_since_prior_order, data = orders, xlab = "Days since prior order", bins = 30, na.rm = T)
This is obviously cut off at a maximum of 30 days, with any higher value counting as 30. The histogram seems to peak at 7 days, with further local maxima at other multiples of 7 days, suggesting that people tend to do their shopping at weekly intervals.
count(opp, order_id) %>% select(n) %>% summary()
## n
## Min. : 1.00
## 1st Qu.: 5.00
## Median : 8.00
## Mean : 10.09
## 3rd Qu.: 14.00
## Max. :145.00
qplot(count(opp, order_id)$n, geom = "histogram", xlim = c(0,50), bins = 51, na.rm = T)
count(opp, order_id) %>% inner_join(orders) %>% select(user_id, n) %>% group_by(user_id) %>% summarize(sd = sd(n)) %>% select(sd) %>% summary()
## Joining, by = "order_id"
## sd
## Min. : 0.000
## 1st Qu.: 2.345
## Median : 3.782
## Mean : 4.266
## 3rd Qu.: 5.610
## Max. :44.747
So 50% of users have a standard deviation of their order size between 2.3 and 5.6, with the mean user standard deviation between 4.3.
Let’s have a look at the sample submission to see what format our output should be in:
sampsub <- read.csv(file = "../Data/sample_submission.csv", stringsAsFactors = F)
dim(sampsub)
## [1] 75000 2
head(sampsub)
## order_id products
## 1 17 39276 29259
## 2 34 39276 29259
## 3 137 39276 29259
## 4 182 39276 29259
## 5 257 39276 29259
## 6 313 39276 29259
We see that there are 75,000 rows with two columns. The first column is the order ID; so the idea is that given the order ID we should predict which products are included (from among those that the customer has previously ordered). The ordered products should be given as a space-separated list. It’s possible that no repeat products were ordered; in this case the submitted answer for “products”" should be “None”. https://www.kaggle.com/c/instacart-market-basket-analysis#evaluation
In the sample submission, every basket is identical.
all(sampsub$products==sampsub$products[1])
## [1] TRUE
Let’s see what products are in the sample basket by subsetting the products
dataframe to just the rows with these two product IDs:
sampbasket <- as.numeric(strsplit(sampsub$products[1],"\\ ")[[1]])
products %>% filter(product_id %in% sampbasket)
## product_id product_name aisle_id department_id
## 1 29259 Baby Bananas 24 4
## 2 39276 Bananas 24 4
This explains why they chose the name “Going Bananas Benchmark”!
Let’s split a dataset off of the training set to give us an estimated out-of-sample error rate of our analysis. We should not use the validation set for any training or preliminary investigation.
Let’s get the IDs of the orders that are in the training set, and sample 20% of them as a validation set.
trainorders <- orders %>% filter(eval_set=="train") %>% select(order_id)
set.seed(9383)
intrain <- as.logical(
rbinom(
n = nrow(trainorders),
size = 1,
prob = 0.8
)
)
trainingIDs <- trainorders %>% filter(intrain)
validationIDs <- trainorders %>% filter(!intrain)
nrow(trainingIDs)
## [1] 105142
nrow(validationIDs)
## [1] 26067
Now we can set up a function to evaluate how accurate our attempts are. Let’s assign the “going bananas benchmark” to our validation set.
valsample <- data.frame(order_id = validationIDs, products = sampsub[1,2])
valsample[,2] <- as.character(valsample[,2])
head(valsample)
## order_id products
## 1 1854765 39276 29259
## 2 1524161 39276 29259
## 3 1072954 39276 29259
## 4 669729 39276 29259
## 5 2608424 39276 29259
## 6 859654 39276 29259
To evaluate this, we need to find the “ground truth” for our validation set. For each order number, we need to get a list of products which appear in that order and which are repeat orders. We can do this with the following pipeline
##
## Subset to training orders
##
groundtruth <- filter(orders, eval_set == "train") %>%
##
## Join with the training data frame "opt"
## to get the product IDs
##
left_join(opt) %>%
##
## Only consider reordered items;
## only use order_id and product_id columns
##
filter(reordered == 1) %>% select(order_id, product_id) %>%
##
## Create a new column with a string listing
## the products in each order
##
group_by(order_id) %>% mutate(truth = paste0(product_id, collapse = " ")) %>%
##
## Use only order_id and "truth" columns and delete
## duplicate rows.
##
select(order_id, truth) %>% unique() %>% data.frame()
## Joining, by = "order_id"
## Let's see what the data frame looks like
tail(groundtruth)
## order_id
## 122602 3383615
## 122603 2585586
## 122604 943915
## 122605 2371631
## 122606 1716008
## 122607 272231
## truth
## 122602 24852 28849 18027 39947 44632 26209 35547 40055 12206
## 122603 8898 6701 38855 38341 38930 12127 49198 22242 7702 36944 10333 42352 7103 18952
## 122604 22828 13176 39190 34243 15592 47209 21137 35221 21709 29344 43504
## 122605 15693 37188 21469 41007 2482 14050 26384 3765 36929 23153 31915 13176
## 122606 27845 21137 28745 22035 31404 10181
## 122607 6846 9405 24852 39216
The scores will be computed from the mean F1 score. This is defined as \[ F_1 = 2 \cdot \frac{p \cdot r }{p + r} \, , \]
where \(p\) is the precision, i.e. the number of correct product IDs divided by the number of product IDs submitted, and \(r\) is the recall, i.e. the number of correct product IDs divided by the number of product IDs that there should have been. To compute these, let us notice that both of them can be obtained in similar ways. Given a list \(G\) of predicted values and a list \(T\) of true values, we have \(p = |G \cap T| / |G|\) and \(r = |G \cap T | / |T|\). We can compute these using R’s %in%
function. Using G %in% T
we get a logical vector of length \(|G|\) whose entries are equal to 1 for every entry in common between \(G\) and \(T\), and 0 otherwise, so sum(G %in% T)
is equal to \(|G \cap T |\) and thus mean(G %in% T)
equals \(p\). Similarly, mean(G %in% T)
\(= r\).
Let’s create some functions to compute the F1 score:
string.ratio <- function(s1, s2){
mean( strsplit(s1, split = " ")[[1]] %in% strsplit(s2, split = " ")[[1]] )
}
f1strings <- function(s1, s2){
p = string.ratio(s1, s2)
r = string.ratio(s2, s1)
f1 = ifelse(p != 0 & r != 0,
2 * p * r / (p + r),
0)
f1
}
Now, with our ground truth data frame and our functions to compute F1, we can write a function that takes a data frame of order_id
s and predictions and outputs a data frame with a new column for the \(F_1\) score of that prediction:
pred.to.f1 <- function(df){
groundtruth %>% inner_join(df) %>% mutate(f1 = f1strings(products, truth)) %>% select(order_id, f1)
}
Let’s use this to compute the \(F_1\) score for the validation sample:
mean(pred.to.f1(valsample)$f1)
## Joining, by = "order_id"
## [1] 0
So the benchmark guess for the validation dataset is about 0.000558. On the other hand, the benchmark guess for the test dataset, available on the leaderboard at https://www.kaggle.com/c/instacart-market-basket-analysis/leaderboard , is 0.000545. These two scores are within 3% of each other, which is reassuring because it tells us that we are computing the \(F_1\) score correctly, and that there is some consistent structure in the dataset for us to find as expected (i.e. the orders are not just random).
A better benchmark (than just giving bananas to everyone) would be to predict that a user buys a product if it appears in more than 50% of their other orders.
First let’s define a function which takes a user_id
and outputs the desired output.
We want to use both the prior and training dataframes to compute the means.
First let’s look at a single example so we can work out how to do this. Let’s look at user 178520
.
ord1 <- orders %>% filter(user_id == 178520) %>% select(order_id) %>% inner_join(opp)
## Joining, by = "order_id"
n1 <- ord1 %>% distinct(order_id) %>% nrow()
fracs1 <- ord1 %>% count(product_id) %>% mutate(frac = n/n1)
We see that very few of the products are chosen in over 50% of orders, so this seems to be quite conservative:
qplot(fracs1$frac)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Actually there are six items orders over half the time:
guess1 <- fracs1 %>% filter(frac >= 0.5)
guess1 %>% inner_join(products) %>% select(frac, product_name)
## Joining, by = "product_id"
## # A tibble: 6 × 2
## frac product_name
## <dbl> <fctr>
## 1 0.5178571 Kellogg's Nutri-Grain Apple Cinnamon Cereal
## 2 0.5892857 Nutri-Grain Soft Baked Strawberry Cereal Breakfast Bars
## 3 0.5178571 2% Reduced Fat Milk
## 4 0.8750000 Oats & Chocolate Chewy Bars
## 5 0.6607143 Sugar Free Energy Drink
## 6 0.8214286 Energy Drink
So this person mostly buys breakfast supplies and energy drinks.
Let’s put this as our guess for this one user:
ans1 <- guess1$product_id %>% paste(collapse = " ")
val1 <- orders %>% filter(eval_set == "train", user_id == 178520) %>% select(order_id) %>% mutate(products = ans1)
val1
## order_id products
## 1 2331095 10054 21351 23909 27761 31102 41276
We can find the F1 score of this one example by running
f1strings((groundtruth %>% filter(order_id == 2331095) %>% select(truth))[[1,1]], ans1)
## [1] 0.4
Which is not that bad; the Kaggle leaderboard is led (at the time of writing) by a score of 0.403.
Now we need to apply the same manipulations automatically to the whole prior dataset.
practice <- data.frame(user_id = c(1,1,1,1,2,2,2,2,2), order_id = c(1,1,2,2,3,3,3,4,4), product_id = c(7,9,7,10,8,11,7,6,11))
ordercount <- practice %>% group_by(user_id) %>% summarize(norders = n())
practice %>% group_by(user_id, product_id) %>% summarize(count = n()) %>% inner_join(ordercount) %>% mutate(frac = count/norders) %>% filter(frac >= 0.5)
## Joining, by = "user_id"
## Source: local data frame [1 x 5]
## Groups: user_id [1]
##
## user_id product_id count norders frac
## <dbl> <dbl> <int> <int> <dbl>
## 1 1 7 2 4 0.5
## Making a data frame listing the fraction of times a product appeared
## in a user's order
## The computation takes a bit of time so let's save the output
if(!file.exists("orderfracs.rds")){
##
## Link the prior orders to the order data to get user IDs
##
benchdat <- opp %>% inner_join(orders) %>%
##
## Keep only the user_id, order_id and product_id columns
##
select(user_id, order_id, product_id)
##
## Make a new data frame norders listing the number
## of distinct orders for each user
##
norders <- benchdat %>% group_by(user_id) %>% summarize(number_orders = n_distinct(order_id))
##
## Make a new data frame listing which fraction of a user's
## orders include a product.
## First count the occurrences of each product
##
orderfracs <- benchdat %>% group_by(user_id, product_id) %>% summarize(count = n()) %>%
##
## Then join to the data frame with the number of orders
## per user and compute the fraction
##
inner_join(norders) %>% mutate(frac = count/number_orders) %>%
##
## Return only the fields we need
##
select(user_id, product_id, frac)
saveRDS(orderfracs, file = "orderfracs.rds")
} else {
orderfracs <- readRDS("orderfracs.rds")
}
bench2 <- orderfracs %>% filter(frac >= 0.5)
Putting the data in the right form:
filtered.to.form <- function(df){
df %>% group_by(user_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(user_id, products) %>% unique()
}
bench2form <- filtered.to.form(bench2)
Now we need to use this to get predictions
form.to.train <- function(df){
filter(orders, eval_set == "train") %>% left_join(df) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"",products))
}
guesses1 <- form.to.train(bench2form)
## Joining, by = "user_id"
Now we need to compare this to the `ground truth’.
f1scores <- pred.to.f1(guesses1)
## Joining, by = "order_id"
mean(f1scores$f1)
## [1] 0.4285714
So the score is about 27.9%, which is OK.
Let’s generate a prediction for the test dataset to submit to Kaggle.
# Let's make the data frame with the same pipeline as above
guess1test <- filter(orders, eval_set == "test") %>% left_join(bench2form) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "user_id"
write.csv(guess1test, file = "guess1.csv", quote = F, row.names = F)
which, when uploaded to Kaggle, gets a score of about 28.7%.
Maybe it would be better to guess all items bought more than \(q\) of the time were in a given basket, where \(q\) is a fraction between 0 and 1. Let’s compute and work out which fraction gives the highest score.
if(!file.exists("scores.rds")){
scores <- NULL
for (i in 0:50){
print(i)
p <- 0.1 + i*0.01
pguess <- filter(orderfracs, frac >= p)
f1sc <- pguess %>% filtered.to.form() %>% form.to.train() %>% pred.to.f1()
score <- mean(f1sc$f1)
scores <- c(scores, score)
}
} else {
scores <- readRDS("scores.rds")
}
pmax <- 0.1 + 0.01 * (0:50)[which(scores == max(scores))]
plot(0.1 + 0.01 * (0:50), scores, xlab = "Cutoff fraction")
abline(v = pmax)
# Let's make the data frame with the same pipeline as above
prods.26 <- orderfracs %>% filter(frac >= 0.26) %>% filtered.to.form()
test.26 <- filter(orders, eval_set == "test") %>% left_join(prods.26) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "user_id"
write.csv(test.26, file = "cutoff26.csv", quote = F, row.names = F)
This gives a score on Kaggle of 0.3326.
Let’s make a model using logistic regression taking account of two predictors: the length of time since the previous order and the fraction of previous orders on which the item was bought.
We need to get the data into a particular form to do this; we need to make a data frame with the following columns:
order_id
Let’s start by making a data frame which, for every order in the training set, lists all previous orders by the user and their probability.
word.in.sentence <- function(word, sentence){
word %in% strsplit(sentence, split = " ")[[1]]
}
if(!file.exists("prodsinorderdays.RDS")){
trainordersdays <- orders %>% filter(eval_set == "train") %>% select(order_id, user_id, days_since_prior_order)
trainordersprods <- opt %>% filter(reordered == 1) %>% group_by(order_id) %>% mutate(prods = paste0(product_id, collapse = " ")) %>% select(order_id, prods) %>% unique()
prodsinorderdays <- trainordersdays %>% inner_join(orderfracs) %>% inner_join(trainordersprods) %>% group_by(order_id) %>% mutate(inorder = word.in.sentence(product_id, prods)) %>% data.frame()
saveRDS(prodsinorderdays, "prodsinorderdays.RDS")
} else {
prodsinorderdays <- readRDS("prodsinorderdays.rds")
}
head(prodsinorderdays)
## order_id user_id days_since_prior_order product_id frac
## 1 1187899 1 14 196 1.0
## 2 1187899 1 14 10258 0.9
## 3 1187899 1 14 10326 0.1
## 4 1187899 1 14 12427 1.0
## 5 1187899 1 14 13032 0.3
## 6 1187899 1 14 13176 0.2
## prods inorder
## 1 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 TRUE
## 2 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 TRUE
## 3 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 FALSE
## 4 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 FALSE
## 5 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 TRUE
## 6 196 25133 38928 26405 39657 10258 13032 26088 49235 46149 FALSE
Now we have the data frame in the form we need, let’s train a logistic regression model with days_since_prior_order
and frac
as the regressors. The model object is huge (4 Gb) and we only need the coefficients so we’ll just save those and do the predictions manually instead of using R’s predict()
function.
if(!file.exists("dayfraccoefs.rds")){
if(!file.exists("dayfracfit.rds")){
dayfracfit <- glm(inorder ~ frac + days_since_prior_order, data = prodsinorderdays, family = binomial)
saveRDS(dayfracfit, "dayfracfit.rds")
} else {
dayfracfit <- readRDS("dayfracfit.rds")
}
dayfraccoefs <- summary(dayfracfit)$coef[,1]
saveRDS(dayfraccoefs, "dayfraccoefs.rds")
} else {
dayfraccoefs <- readRDS("dayfraccoefs.rds")
}
dayfraccoefs
## (Intercept) frac days_since_prior_order
## -3.159715623 4.997165461 -0.004554965
Now, this model outputs an updated “probability” of a certain item being in an order which is an increasing function of the fraction of prior orders and a decreasing function of the time since the previous order was made — the longer a user waits between orders, the less likely it is that they will buy something! Is this too simple?
Using these estimates of the regression coefficients, let’s define a function to give the response for a given value of frac
and days_since_prior_order
.
day.frac.response <- function(frac, days_since_prior_order){
linear = dayfraccoefs[1] + dayfraccoefs[2] * frac + dayfraccoefs[3] * days_since_prior_order
1 / (1 + exp(- linear))
}
We can make a plot of the response as a function of frac
, showing how it varies with the number of days—the impact of days_since_prior_order
is small but not negligible.
df.days <- data.frame(days_since_prior_order = c(rep(1,101),rep(15,101),rep(30,101)), frac = 0.01 * 0:100) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
##
df.days$days_since_prior_order <- as.factor(df.days$days_since_prior_order)
##
ggplot(df.days, aes(x = frac, y = response, colour = days_since_prior_order)) + geom_line(lwd = 2, alpha = 0.8)
Now, with this model trained, let us make predictions on the test dataset. We need to get a data frame which has the test order numbers with frac
and days_since_prior_order
as its columns.
predtestfracdays <- orders %>% filter(eval_set == "test") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"
As we saw in the case of using just frac
, it isn’t easy to predict where we should set the cutoff. But we know it should be for the value of frac equal to about 0.26; this corresponds to a value of the response variable around 0.126, depending on the length of time:
if(!file.exists("predday26.rds")){
predday26 <- predict(dayfracfit, newdata = data.frame(days_since_prior_order = c(1,15,30), frac = 0.26), type = "response")
saveRDS(predday26, "predday26.rds")
} else {
predday26 <- readRDS("predday26.rds")
}
predday26
## 1 2 3
## 0.1341204 0.1268860 0.1195076
So let’s check some values in the region of 0.1 to 0.15. We’ll use the training dataset again. We need to get a data frame with the order number, days since prior order, and frac
variable.
odpf <- orders %>% filter(eval_set == "train") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>% mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"
Using this, let’s try different values of the cutoff, i.e. let’s vary \(p\), make the prediction that an item is in the cart if the response from the logistic regression is \(\geq p\), and see which value of \(p\) gives the best value of \(F_1\).
if(!file.exists("dayscores.rds")){
day.scores <- NULL
for(i in 0:20){
p <- 0.1 + 0.0025 * i
print(i)
prod.pred <- odpf %>% filter(response >= p) %>% group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()
##
##
##
f1.scores <- pred.to.f1(prod.pred)
day.scores <- c(day.scores, mean(f1.scores$f1))
}
saveRDS(day.scores,"dayscores.RDS")
} else {
day.scores <- readRDS("dayscores.rds")
}
if(!file.exists("dayscoresfine.rds")){
day.scores.fine <- NULL
for(i in 0:19){
p <- 0.125 + 0.00025 * i
print(i)
prod.pred <- odpf %>% filter(response >= p) %>% group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()
##
##
##
f1.scores <- pred.to.f1(prod.pred)
day.scores.fine <- c(day.scores.fine, mean(f1.scores$f1))
}
saveRDS(day.scores.fine,"dayscoresfine.rds")
} else {
day.scores.fine <- readRDS("dayscoresfine.rds")
}
We can plot our results:
day.scores.df <- data.frame(cutoff = c(0.1 + 0.0025 * (0:20), 0.125 + 0.00025 * (0:19)), F1 = c(day.scores, day.scores.fine))
#
best.cutoff = 0.128125
#
ggplot(day.scores.df, aes(x = cutoff, y = F1)) + geom_line(lwd = 1) + geom_vline(xintercept = best.cutoff)
The optimum cutoff for this response is thus 0.128125.
Using this model and this cutoff, let’s make predictions for the test dataset and upload them to Kaggle.
##
## Make a data frame with test order_id's, frac and days_since_prior_order
##
test.odpf <- orders %>% filter(eval_set == "test") %>% select(order_id, user_id, days_since_prior_order) %>% inner_join(orderfracs) %>%
##
## Predict the response
##
mutate(response = day.frac.response(frac, days_since_prior_order))
## Joining, by = "user_id"
##
## Subset to the products with scores above the cutoff
##
frac.day.pred <- test.odpf %>% filter(response >= best.cutoff) %>%
##
## Put output in desired form
##
group_by(order_id) %>% mutate(products = paste0(product_id, collapse = " ")) %>% select(order_id, products) %>% unique()
##
## Make sure we have all order_id's from test set, setting "None"
## if there are no products predictes
##
frac.day.pred.all <- filter(orders, eval_set == "test") %>% left_join(frac.day.pred) %>% select(order_id, products) %>% mutate(products = ifelse(is.na(products),"None",products))
## Joining, by = "order_id"
##
## Output CSV in the required format
##
write.csv(frac.day.pred.all, file = "fracdaypred.csv", quote = F, row.names = F)
This gives a score of 0.3321 — slightly worse than the submission which didn’t account for the number of days since the purchase occurred!