Week 5: Advertising and Promotion


The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games. At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014, chap. 2).

include_graphics(c("los_angeles-dodgers-stadium.jpg",
                 "Los-Angeles-Dodgers-Promo.jpg",
                 "adrian_bobble.jpg"))
56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)

Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)

The 2012 season data in the events table of SQLite database data/dodgers.sqlite contain for each of 81 home play the

Prerequisites

We will use R, RStudio, R Markdown for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online

All materials for the next three weeks will be available on Google drive.

March 13: Exploratory data analysis

  1. Connect to data/dodgers.sqlite. Read table events into a variable in R.

    • Read Baumer, Kaplan, and Horton (2017, chaps. 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.

    • Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL for real projects.

    • Open RStudio terminal, connect to database dodgers.sqlite with sqlite3. Explore it (there is only one table, events, at this time) with commands

      • .help
      • .databases
      • .tables
      • .schema <table_name>
      • .headers on
      • .mode column
      • SELECT ...
      • .quit
    • Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.

    • R (likewise, Python) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R or Python to fit models to data.

library(RSQLite)  ## if package is not on the computer, then install it only once using Tools > Install packages...
con <- dbConnect(SQLite(), "../data/dodgers.sqlite") # read Modern Data Science with R for different ways to connect a database.

## dbListTables(con)

tbl(con, "events") %>% 
  collect() %>% 
  mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
         month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>% 
  mutate_if(is.character, factor) %>% 
  mutate(temp = round((temp- 32)*5/9)) -> events
events %>% 
  count(bobblehead, fireworks)
## # A tibble: 3 × 3
##   bobblehead fireworks     n
##   <fct>      <fct>     <int>
## 1 NO         NO           56
## 2 NO         YES          14
## 3 YES        NO           11
  1. What are the number of plays on each week day and in each month of a year?

Table 1 and 2 summarize the number of games played on each weekday and month.

events %>% 
  count(day_of_week, month) %>% 
  pivot_wider(names_from = day_of_week, values_from = n) %>% 
  pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
Table 1: Number of games played in each weekday and month
month Monday Tuesday Wednesday Thursday Friday Saturday Sunday
APR 1 2 2 1 2 2 2
MAY 3 3 2 1 3 3 3
JUN 1 1 1 1 2 2 1
JUL 3 3 2 NA 1 1 2
AUG 2 2 3 1 3 2 2
SEP 1 1 1 1 2 3 3
OCT 1 1 1 NA NA NA NA
events %>% 
  ggplot(aes(day_of_week)) +
  geom_bar(aes(fill=month))
Barplot of counts of games for each weekday and month

Figure 2: Barplot of counts of games for each weekday and month

Figure 3 shows heatmap of total attendance versus weekday and month. The colors change from bright yellow to dark red as attendance increases. Default heatmap shuffles rows and columns so as to bring together weekdays and months with similar attendance. Here we see May, Aug, and Jul together within the months and Saturday, Friday, Sunday within the weekdays. Learn more about xtabs (cross-table) heatmap by typing ?xtabs and ?heatmap in the R console.

xtabs(attend ~ day_of_week + month, events) %>% 
  heatmap()
Heatmap of attendance versus weekday and month.

Figure 3: Heatmap of attendance versus weekday and month.

In Figure 4, I have added one more aes (colour) to capture day_night information. To avoid overplotting, I replaced geom_point() with geom_jitter(). These plots were also illuminating. So let us thank your friend who suggested this one, too.

sum_attend <- events %>% 
  group_by(day_of_week, month, day_night) %>% 
  summarize(mean_attend = mean(attend),
            total_attend = sum(attend), .groups = "drop")

sum_attend %>% 
  ggplot(aes(day_of_week, month)) +
  geom_jitter(aes(size = mean_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
  scale_size(labels = scales::comma) +
  labs(title = "Average attendance", size = "attendance", col = "part of day",
       x = "Weekday", y = "Month")

sum_attend %>% 
  ggplot(aes(day_of_week, month)) +
  geom_jitter(aes(size = total_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
  labs(title = "Total attendance", size = "attendance", col = "part of day",
       x = "Weekday", y = "Month") +
  scale_size(labels = scales::comma) +
  guides(col = guide_legend(order = 1), 
         shape = guide_legend(order = 2))
Average and total attendances on each weekday and month in each part of dayAverage and total attendances on each weekday and month in each part of day

Figure 4: Average and total attendances on each weekday and month in each part of day

  1. Check the orders of the levels of the day_of_week and month factors. If necessary, put them in the logical order.
levels(events$day_of_week)
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
## [7] "Sunday"
levels(events$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
  1. How many times were bobblehead promotions run on each week day?
events %>% 
  count(day_of_week, bobblehead) %>% 
  pivot_wider(names_from = bobblehead, values_from = n) %>% 
  replace_na(list(YES = 0)) %>% 
  mutate(Total = YES + NO) %>% 
  select(-NO) %>% 
  rename(Bobblehead = YES)
## # A tibble: 7 × 3
##   day_of_week Bobblehead Total
##   <fct>            <dbl> <dbl>
## 1 Monday               0    12
## 2 Tuesday              6    13
## 3 Wednesday            0    12
## 4 Thursday             2     5
## 5 Friday               0    13
## 6 Saturday             2    13
## 7 Sunday               1    13
  1. How did the attendance vary across week days? Draw boxplots. On which day of week was the attendance the highest on average?
events %>% 
  ggplot(aes(day_of_week, attend)) +
  geom_boxplot()

events %>% 
  slice_max(order_by = attend, n=5)
## # A tibble: 5 × 12
##   month   day attend day_of_week opponent  temp skies  day_night cap   shirt
##   <fct> <dbl>  <dbl> <fct>       <fct>    <dbl> <fct>  <fct>     <fct> <fct>
## 1 APR      10  56000 Tuesday     Pirates     19 Clear  Day       NO    NO   
## 2 AUG      21  56000 Tuesday     Giants      24 Clear  Night     NO    NO   
## 3 JUL       1  55359 Sunday      Mets        24 Clear  Night     NO    NO   
## 4 JUN      12  55279 Tuesday     Angels      19 Cloudy Night     NO    NO   
## 5 AUG       7  55024 Tuesday     Rockies     27 Clear  Night     NO    NO   
## # … with 2 more variables: fireworks <fct>, bobblehead <fct>
  1. Is there an association between attendance and
    • whether the game is played in day light or night?
    • whether skies are clear or cloudy?
events %>% 
  ggplot(aes(day_night, attend)) +
  geom_boxplot()

t.test(x=events$attend[events$day_night=="Day"],
       y=events$attend[events$day_night=="Night"])
## 
##  Welch Two Sample t-test
## 
## data:  events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
## t = 0.42851, df = 23.62, p-value = 0.6722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3531.652  5380.397
## sample estimates:
## mean of x mean of y 
##  41793.27  40868.89

Since p-value (0.67) is large (greater than 0.05), we cannot reject null, which means there is no statistical difference between average attendance of games played in day and night.

events %>% 
  ggplot(aes(skies, attend)) +
  geom_boxplot()

t.test(x=events$attend[events$skies=="Clear"],
       y=events$attend[events$skies=="Cloudy"])
## 
##  Welch Two Sample t-test
## 
## data:  events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
## t = 1.2868, df = 27.664, p-value = 0.2088
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1741.315  7617.103
## sample estimates:
## mean of x mean of y 
##  41729.21  38791.32

We do not see a statisticall significant difference between the average attendance of the games played under clear and cloudy skies.

  1. Is there an association between attendance and temperature?
    • If yes, is there a positive or negative association?
    • Do the associations differ on clear and cloud days or day or night times?
events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter(aes(col = day_of_week)) +
  geom_text(aes(label = str_sub(day_of_week, 1,3), col = day_of_week)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter(aes(col = month)) +
  geom_text(aes(label = str_sub(month, 1,3), col = month)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

events %>% 
  ggplot(aes(month, temp)) +
  geom_boxplot(aes(fill = day_night))

events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter(aes(col = bobblehead)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter(aes(col = opponent)) +
  geom_text(aes(label = str_sub(opponent, 1,4), col = opponent)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# +
#   geom_smooth(se=FALSE, method = "lm", formula = y ~ x, col = "red", lty = "dashed") +
#   geom_smooth(se=FALSE, method = "lm", formula = y ~ x + pmax(x-24,0), col = "red")

\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 23)^+ + \varepsilon_i \]

lmod0 <- lm(attend ~ temp , data = events) 
lmod0 %>% summary()
## 
## Call:
## lm(formula = attend ~ temp, data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16066.6  -6519.8   -995.6   6353.9  15621.4 
## 
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)    
## (Intercept)  37105.0     4626.5   8.020 0.00000000000797 ***
## temp           172.3      198.5   0.868            0.388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8310 on 79 degrees of freedom
## Multiple R-squared:  0.009447,   Adjusted R-squared:  -0.003091 
## F-statistic: 0.7535 on 1 and 79 DF,  p-value: 0.388

Null hypothesis tested with F-stat at the very bottom of the outplot of lm(): “Null: There is no relation between attendance and temperature” We say the data are consistent with the null hypothesis if p-value is large [always between 0 and 1]. In this case, pvlue is 0.38, which is large [small means < 0.05 in which case we reject the null]. Since p-value is large in this case, We conclude that the null is consistent with the data. We arrived at wrpng conclusion because the model was not checked through. Use diagnostics to make sure the model captures the features of data

plot(lmod0)

lmod0 %>% summary()
## 
## Call:
## lm(formula = attend ~ temp, data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16066.6  -6519.8   -995.6   6353.9  15621.4 
## 
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)    
## (Intercept)  37105.0     4626.5   8.020 0.00000000000797 ***
## temp           172.3      198.5   0.868            0.388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8310 on 79 degrees of freedom
## Multiple R-squared:  0.009447,   Adjusted R-squared:  -0.003091 
## F-statistic: 0.7535 on 1 and 79 DF,  p-value: 0.388
lmod01 <- lm(attend ~ temp + pmax(0, temp - 24), data = events)
lmod01 %>% summary()
## 
## Call:
## lm(formula = attend ~ temp + pmax(0, temp - 24), data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16488.8  -5799.8   -220.5   5079.2  16250.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         18387.9     6645.9   2.767 0.007065 ** 
## temp                 1124.3      317.0   3.547 0.000664 ***
## pmax(0, temp - 24)  -2269.4      614.8  -3.691 0.000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7717 on 78 degrees of freedom
## Multiple R-squared:  0.1567, Adjusted R-squared:  0.1351 
## F-statistic: 7.249 on 2 and 78 DF,  p-value: 0.001295

Here F-stat p-value is small (small if < 0.05). Therefore we reject the null (null is equivalent to saying no relation between attendance and predictors in the right). So find that temp is a useful predictor.

On each row of table in the output, we test “null: corresponding variable has zero effect on attendance.” The null is consistent with data is p-value on the last column is large (> 0.05). In this case both linear and broken line have nozero effects.

plot(lmod0, which = 1)

plot(lmod01, which = 1)

How does the break function help capture the increase/decrease in attendance with temperature?

identity <- function(x) x
breakfun <-  function(x) pmax(x-24,0)
curve(identity(x), xlim = c(13,35), xlab = "temp", ylab = "attend", ylim = c(-10,40))
curve(breakfun(x), xlim = c(13,35), xlab = "temp", ylab = "attend", add = TRUE)
curve(identity(x) - 2* breakfun(x), xlim = c(13,35), xlab = "temp", ylab = "attend", add = TRUE, col ="red")
abline(v=24, lty = "dashed")

x <- c(1, 10, 20, 23, 24, 25,30,35)
breakfun(x) %>% set_names(x)
##  1 10 20 23 24 25 30 35 
##  0  0  0  0  0  1  6 11
breakfun
## function(x) pmax(x-24,0)
## <bytecode: 0x55bbaff6f7e8>

\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp-23)^+ + \varepsilon_i \]

events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter() +
  geom_smooth(se = FALSE) +
  geom_smooth(se = FALSE, method = "lm", 
              formula = y ~ x + pmax(x-24,0), col = "red") 
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is statistically significant relation between attendance and temperature.

March 20: A linear regression model

Regress attendance on month, day of the week, and bobblehead promotion.

lmod1 <- lm(attend ~ month + day_of_week + bobblehead, data = events)

Assuming that model is fine…

summary(lmod1)
## 
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446 < 0.0000000000000002 ***
## monthMAY             -2385.62    2291.22  -1.041              0.30152    
## monthJUN              7163.23    2732.72   2.621              0.01083 *  
## monthJUL              2849.83    2578.60   1.105              0.27303    
## monthAUG              2377.92    2402.91   0.990              0.32593    
## monthSEP                29.03    2521.25   0.012              0.99085    
## monthOCT              -662.67    4046.45  -0.164              0.87041    
## day_of_weekTuesday    7911.49    2702.21   2.928              0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979              0.33134    
## day_of_weekThursday    775.36    3486.15   0.222              0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950              0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497              0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682              0.00920 ** 
## bobbleheadYES        10714.90    2419.52   4.429            0.0000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 0.0000002083

In the summary table, on the lines of bobbleheadYES, we test the null hypothesis: “Giving bobblehead has no effect on attendance”. Null and data are consistent if p-value on the same line is large (> 0.05). If p-value is small, then data contradicts with the null, and therefore we reject the null. Since p-value = 0.0000359 is indeed small we reject the null. So we conclude that bobblehead can affect the attendance (in the positive direction because the estimate has positive value 10714.90). We expect the attendance to increase by 10715 on average.

But first check if the model is satisfactory

plot(lmod1)

plot(lmod1, which = 5)

plot(lmod1, which = 4)

plot(lmod1, which = 3)

car::ncvTest(lmod1) # null says the variance is constant. Since p is small (< 0.05), variance cannot be constant
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.137439, Df = 1, p = 0.041945
plot(lmod1, which = 2)

shapiro.test(rstandard(lmod1)) # null says standardized residuals have normal distribution. Since p is small, the residuals do not seem to have normal distribution.
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(lmod1)
## W = 0.9619, p-value = 0.01658
car::residualPlots(lmod1)

##             Test stat Pr(>|Test stat|)
## month                                 
## day_of_week                           
## bobblehead                            
## Tukey test    -1.1229           0.2615
lmod2 <- lm(attend ~ month*day_of_week + bobblehead, events)
lmod2 %>% plot()
## Warning: not plotting observations with leverage one:
##   3, 7, 30, 31, 32, 33, 36, 37, 44, 45, 65, 69, 70, 71, 72, 79, 80, 81

shapiro.test(rstandard(lmod2)) # residuals seem to have normal distr
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(lmod2)
## W = 0.98944, p-value = 0.8675
car::ncvTest(lmod2) # variance seems to be constant
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.0002343231, Df = 1, p = 0.98779
plot(lmod1, which = 2)

shapiro.test(rstandard(lmod1))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(lmod1)
## W = 0.9619, p-value = 0.01658
car::ncvTest(lmod1) 
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.137439, Df = 1, p = 0.041945
summary(lmod2)
## 
## Call:
## lm(formula = attend ~ month * day_of_week + bobblehead, data = events)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11035  -1686      0   1692  10851 
## 
## Coefficients: (5 not defined because of singularities)
##                               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                      26376       5485   4.809 0.0000269 ***
## monthMAY                          8971       6333   1.417   0.16521    
## monthJUN                         24183       7756   3.118   0.00357 ** 
## monthJUL                          6928       6333   1.094   0.28127    
## monthAUG                          8392       6717   1.249   0.21958    
## monthSEP                          7164       7756   0.924   0.36183    
## monthOCT                          7248       7756   0.934   0.35629    
## day_of_weekTuesday               23631       6717   3.518   0.00120 ** 
## day_of_weekWednesday              1661       6717   0.247   0.80610    
## day_of_weekThursday               1952       7756   0.252   0.80273    
## day_of_weekFriday                11828       6717   1.761   0.08676 .  
## day_of_weekSaturday              17884       6953   2.572   0.01438 *  
## day_of_weekSunday                17180       6717   2.558   0.01490 *  
## bobbleheadYES                    12272       3590   3.418   0.00158 ** 
## monthMAY:day_of_weekTuesday     -23488       8420  -2.789   0.00839 ** 
## monthJUN:day_of_weekTuesday     -31183      10871  -2.869   0.00686 ** 
## monthJUL:day_of_weekTuesday     -14287       8161  -1.751   0.08853 .  
## monthAUG:day_of_weekTuesday     -15159       9386  -1.615   0.11501    
## monthSEP:day_of_weekTuesday     -16552      10261  -1.613   0.11545    
## monthOCT:day_of_weekTuesday     -14782      10261  -1.441   0.15833    
## monthMAY:day_of_weekWednesday    -7257       8378  -0.866   0.39211    
## monthJUN:day_of_weekWednesday    -8726      10261  -0.850   0.40071    
## monthJUL:day_of_weekWednesday    11798       8378   1.408   0.16764    
## monthAUG:day_of_weekWednesday     1522       8378   0.182   0.85691    
## monthSEP:day_of_weekWednesday    15359      10261   1.497   0.14314    
## monthOCT:day_of_weekWednesday    -1271      10261  -0.124   0.90211    
## monthMAY:day_of_weekThursday    -10526      10013  -1.051   0.30018    
## monthJUN:day_of_weekThursday    -15777      11542  -1.367   0.18012    
## monthJUL:day_of_weekThursday        NA         NA      NA        NA    
## monthAUG:day_of_weekThursday      5629      10871   0.518   0.60778    
## monthSEP:day_of_weekThursday      7817      10969   0.713   0.48066    
## monthOCT:day_of_weekThursday        NA         NA      NA        NA    
## monthMAY:day_of_weekFriday       -9582       8073  -1.187   0.24305    
## monthJUN:day_of_weekFriday      -17290       9500  -1.820   0.07708 .  
## monthJUL:day_of_weekFriday       -1259       9232  -0.136   0.89231    
## monthAUG:day_of_weekFriday       -6275       8378  -0.749   0.45871    
## monthSEP:day_of_weekFriday       -6718       9500  -0.707   0.48400    
## monthOCT:day_of_weekFriday          NA         NA      NA        NA    
## monthMAY:day_of_weekSaturday    -16671       8270  -2.016   0.05134 .  
## monthJUN:day_of_weekSaturday    -23729       9668  -2.454   0.01908 *  
## monthJUL:day_of_weekSaturday     -9445       9405  -1.004   0.32194    
## monthAUG:day_of_weekSaturday     -9216       8856  -1.041   0.30496    
## monthSEP:day_of_weekSaturday    -11702       9405  -1.244   0.22145    
## monthOCT:day_of_weekSaturday        NA         NA      NA        NA    
## monthMAY:day_of_weekSunday      -10382       8073  -1.286   0.20665    
## monthJUN:day_of_weekSunday      -14235      10261  -1.387   0.17387    
## monthJUL:day_of_weekSunday       -9083       8568  -1.060   0.29618    
## monthAUG:day_of_weekSunday       -9748       8672  -1.124   0.26844    
## monthSEP:day_of_weekSunday      -16397       9232  -1.776   0.08416 .  
## monthOCT:day_of_weekSunday          NA         NA      NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5485 on 36 degrees of freedom
## Multiple R-squared:  0.8034, Adjusted R-squared:  0.5631 
## F-statistic: 3.343 on 44 and 36 DF,  p-value: 0.0001641

Compare simple and complex models using anova (useful for comparison of nested models with F-test)

anova(lmod1, lmod2)
## Analysis of Variance Table
## 
## Model 1: attend ~ month + day_of_week + bobblehead
## Model 2: attend ~ month * day_of_week + bobblehead
##   Res.Df        RSS Df  Sum of Sq    F Pr(>F)
## 1     67 2509574563                          
## 2     36 1082895916 31 1426678647 1.53 0.1096

As far as we are concerned with the point predictions on the train data, these two models are not different. Null is that small model is correct. The null is consistent with data since p-value is large.

AIC(lmod1, lmod2)
##       df      AIC
## lmod1 15 1657.031
## lmod2 46 1650.953

The lower the AIC, the better is the model. AIC picks complex model. F-test suggests simple model (model1). AIC (measures generalizibility to unseen data) and diagnostic plots recommend complex model. Since we want to predict attendance for future games, we take AIC more seriously. Therefore I will follow AIC + diagnostics recommendation and work with complex model.

p-value is a measure of consistency of null hypothesis (=no relation between variable and attendance) with data.

  1. Is there any evidence for a relationship between attendance and other variables? Why or why not?
lmod2 %>% summary()
## 
## Call:
## lm(formula = attend ~ month * day_of_week + bobblehead, data = events)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11035  -1686      0   1692  10851 
## 
## Coefficients: (5 not defined because of singularities)
##                               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                      26376       5485   4.809 0.0000269 ***
## monthMAY                          8971       6333   1.417   0.16521    
## monthJUN                         24183       7756   3.118   0.00357 ** 
## monthJUL                          6928       6333   1.094   0.28127    
## monthAUG                          8392       6717   1.249   0.21958    
## monthSEP                          7164       7756   0.924   0.36183    
## monthOCT                          7248       7756   0.934   0.35629    
## day_of_weekTuesday               23631       6717   3.518   0.00120 ** 
## day_of_weekWednesday              1661       6717   0.247   0.80610    
## day_of_weekThursday               1952       7756   0.252   0.80273    
## day_of_weekFriday                11828       6717   1.761   0.08676 .  
## day_of_weekSaturday              17884       6953   2.572   0.01438 *  
## day_of_weekSunday                17180       6717   2.558   0.01490 *  
## bobbleheadYES                    12272       3590   3.418   0.00158 ** 
## monthMAY:day_of_weekTuesday     -23488       8420  -2.789   0.00839 ** 
## monthJUN:day_of_weekTuesday     -31183      10871  -2.869   0.00686 ** 
## monthJUL:day_of_weekTuesday     -14287       8161  -1.751   0.08853 .  
## monthAUG:day_of_weekTuesday     -15159       9386  -1.615   0.11501    
## monthSEP:day_of_weekTuesday     -16552      10261  -1.613   0.11545    
## monthOCT:day_of_weekTuesday     -14782      10261  -1.441   0.15833    
## monthMAY:day_of_weekWednesday    -7257       8378  -0.866   0.39211    
## monthJUN:day_of_weekWednesday    -8726      10261  -0.850   0.40071    
## monthJUL:day_of_weekWednesday    11798       8378   1.408   0.16764    
## monthAUG:day_of_weekWednesday     1522       8378   0.182   0.85691    
## monthSEP:day_of_weekWednesday    15359      10261   1.497   0.14314    
## monthOCT:day_of_weekWednesday    -1271      10261  -0.124   0.90211    
## monthMAY:day_of_weekThursday    -10526      10013  -1.051   0.30018    
## monthJUN:day_of_weekThursday    -15777      11542  -1.367   0.18012    
## monthJUL:day_of_weekThursday        NA         NA      NA        NA    
## monthAUG:day_of_weekThursday      5629      10871   0.518   0.60778    
## monthSEP:day_of_weekThursday      7817      10969   0.713   0.48066    
## monthOCT:day_of_weekThursday        NA         NA      NA        NA    
## monthMAY:day_of_weekFriday       -9582       8073  -1.187   0.24305    
## monthJUN:day_of_weekFriday      -17290       9500  -1.820   0.07708 .  
## monthJUL:day_of_weekFriday       -1259       9232  -0.136   0.89231    
## monthAUG:day_of_weekFriday       -6275       8378  -0.749   0.45871    
## monthSEP:day_of_weekFriday       -6718       9500  -0.707   0.48400    
## monthOCT:day_of_weekFriday          NA         NA      NA        NA    
## monthMAY:day_of_weekSaturday    -16671       8270  -2.016   0.05134 .  
## monthJUN:day_of_weekSaturday    -23729       9668  -2.454   0.01908 *  
## monthJUL:day_of_weekSaturday     -9445       9405  -1.004   0.32194    
## monthAUG:day_of_weekSaturday     -9216       8856  -1.041   0.30496    
## monthSEP:day_of_weekSaturday    -11702       9405  -1.244   0.22145    
## monthOCT:day_of_weekSaturday        NA         NA      NA        NA    
## monthMAY:day_of_weekSunday      -10382       8073  -1.286   0.20665    
## monthJUN:day_of_weekSunday      -14235      10261  -1.387   0.17387    
## monthJUL:day_of_weekSunday       -9083       8568  -1.060   0.29618    
## monthAUG:day_of_weekSunday       -9748       8672  -1.124   0.26844    
## monthSEP:day_of_weekSunday      -16397       9232  -1.776   0.08416 .  
## monthOCT:day_of_weekSunday          NA         NA      NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5485 on 36 degrees of freedom
## Multiple R-squared:  0.8034, Adjusted R-squared:  0.5631 
## F-statistic: 3.343 on 44 and 36 DF,  p-value: 0.0001641

Check F-statistic’s p-value. If it is less than 0.05, then there is relation between attendance and predictors.

  1. Does the bobblehead promotion have a statistically significant effect on the attendance?
anova(update(lmod2, . ~ . - bobblehead), lmod2)
## Analysis of Variance Table
## 
## Model 1: attend ~ month + day_of_week + month:day_of_week
## Model 2: attend ~ month * day_of_week + bobblehead
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)   
## 1     37 1434296454                                
## 2     36 1082895916  1 351400539 11.682 0.001582 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova / F-test reject the small model (without bobblehead). Therefore we conclude that bobblehead is important to boost the attendance.

AIC(update(lmod2, . ~ . - bobblehead), lmod2)
##                                   df      AIC
## update(lmod2, . ~ . - bobblehead) 45 1671.717
## lmod2                             46 1650.953

AIC confirms that bobblehead is important.

Check also with cross-validation.

  1. Do month and day of week variables help to explain the number of attendants?
anova(update(lmod2, . ~ . - month - month:day_of_week), lmod2)
## Analysis of Variance Table
## 
## Model 1: attend ~ day_of_week + bobblehead
## Model 2: attend ~ month * day_of_week + bobblehead
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1     73 3129721926                               
## 2     36 1082895916 37 2046826010 1.8391 0.03515 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Month is an important variable because small model’s p-value is small – therefore we reject the small model.

update(lmod2, contrast = list(month = "contr.sum", bobblehead = "contr.sum")) %>% coef() 
##                 (Intercept)                      month1 
##                 41495.70238                 -8983.73810 
##                      month2                      month3 
##                   -12.73810                 15199.26190 
##                      month4                      month5 
##                 -2056.07143                  -591.23810 
##                      month6          day_of_weekTuesday 
##                 -1819.73810                  7137.86395 
##        day_of_weekWednesday         day_of_weekThursday 
##                  3293.04762                  9769.00000 
##           day_of_weekFriday         day_of_weekSaturday 
##                  5110.00000                  6181.66667 
##           day_of_weekSunday                 bobblehead1 
##                   782.66667                 -6135.96429 
##   month1:day_of_weekTuesday   month2:day_of_weekTuesday 
##                 16493.13605                 -6995.14966 
##   month3:day_of_weekTuesday   month4:day_of_weekTuesday 
##                -14689.79252                  2205.82653 
##   month5:day_of_weekTuesday   month6:day_of_weekTuesday 
##                  1333.70748                   -58.86395 
## month1:day_of_weekWednesday month2:day_of_weekWednesday 
##                 -1632.04762                 -8889.04762 
## month3:day_of_weekWednesday month4:day_of_weekWednesday 
##                -10358.04762                 10165.78571 
## month5:day_of_weekWednesday month6:day_of_weekWednesday 
##                  -110.54762                 13726.95238 
##  month1:day_of_weekThursday  month2:day_of_weekThursday 
##                 -7817.00000                -18343.00000 
##  month3:day_of_weekThursday  month4:day_of_weekThursday 
##                -23593.92857                          NA 
##  month5:day_of_weekThursday  month6:day_of_weekThursday 
##                 -2188.42857                          NA 
##    month1:day_of_weekFriday    month2:day_of_weekFriday 
##                  6718.00000                 -2863.66667 
##    month3:day_of_weekFriday    month4:day_of_weekFriday 
##                -10571.50000                  5459.33333 
##    month5:day_of_weekFriday    month6:day_of_weekFriday 
##                   442.83333                          NA 
##  month1:day_of_weekSaturday  month2:day_of_weekSaturday 
##                 11701.86905                 -4969.00000 
##  month3:day_of_weekSaturday  month4:day_of_weekSaturday 
##                -12027.16667                  2256.73810 
##  month5:day_of_weekSaturday  month6:day_of_weekSaturday 
##                  2485.83333                          NA 
##    month1:day_of_weekSunday    month2:day_of_weekSunday 
##                 16397.33333                  6015.33333 
##    month3:day_of_weekSunday    month4:day_of_weekSunday 
##                  2162.33333                  7314.70238 
##    month5:day_of_weekSunday    month6:day_of_weekSunday 
##                  6649.83333                          NA
effects::predictorEffect("month", lmod2) %>% plot()

  1. How many fans are expected to be drawn alone by a bobblehead promotion to a home game? Give a 90% confidence interval.

Expected number of additional fans drawn to a home game with a bobblehead promotion

lmod2 %>% coef %>% .["bobbleheadYES"]
## bobbleheadYES 
##      12271.93
confint(lmod2, parm = "bobbleheadYES", level = 0.95)
##                  2.5 %   97.5 %
## bobbleheadYES 4990.077 19553.78
confint(lmod2, parm = "bobbleheadYES", level = 0.80)
##                   10 %     90 %
## bobbleheadYES 7584.494 16959.36
  1. How good does the model fit to the data? Why? Comment on residual standard error and R\(^2\). Plot observed attendance against predicted attendance.
  1. Predict the number of attendees to a typical home game on a Wednesday in June if a bobblehead promotion is extended. Give a 90% prediction interval.
predict(object = lmod2, newdata = data.frame(month = "JUN", day_of_week = "Wednesday", bobblehead = "YES"), interval = "prediction", level = 0.80)
##        fit     lwr      upr
## 1 55765.93 44607.6 66924.25

Project (will be graded)

Include all variables and conduct a full regression analysis of the problem. Submit your R markdown and html files to course homepage on moodle.

events %>% names()
##  [1] "month"       "day"         "attend"      "day_of_week" "opponent"   
##  [6] "temp"        "skies"       "day_night"   "cap"         "shirt"      
## [11] "fireworks"   "bobblehead"

Bibliography

Baumer, B. S., D. T. Kaplan, and N. J. Horton. 2017. Modern Data Science with R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press. https://books.google.com.tr/books?id=NrddDgAAQBAJ.
Miller, T. W. 2014. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. FT Press Analytics. Pearson Education. https://books.google.com.tr/books?id=PU6nBAAAQBAJ.
Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://books.google.com.tr/books?id=aZRYrgEACAAJ.