Week 5: Advertising and Promotion


The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games. At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014, chap. 2).

include_graphics(c("los_angeles-dodgers-stadium.jpg",
                 "Los-Angeles-Dodgers-Promo.jpg",
                 "adrian_bobble.jpg"))
56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)

Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)

The 2012 season data in the events table of SQLite database data/dodgers.sqlite contain for each of 81 home play the

Prerequisites

We will use R, RStudio, R Markdown for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online

All materials for the next three weeks will be available on Google drive.

February 26: Exploratory data analysis

  1. Connect to data/dodgers.sqlite. Read table events into a variable in R.

    • Read Baumer, Kaplan, and Horton (2017, chaps. 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.

    • Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL for real projects.

    • Open RStudio terminal, connect to database dodgers.sqlite with sqlite3. Explore it (there is only one table, events, at this time) with commands

      • .help
      • .databases
      • .tables
      • .schema <table_name>
      • .headers on
      • .mode column
      • SELECT ...
      • .quit
    • Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.

    • R (likewise, Python) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R or Python to fit models to data.

library(RSQLite)
con <-dbConnect(SQLite(), "../data/dodgers.sqlite")
dbListTables(con)
## [1] "events"
dbListFields(con, "events")
##  [1] "month"       "day"         "attend"      "day_of_week" "opponent"   
##  [6] "temp"        "skies"       "day_night"   "cap"         "shirt"      
## [11] "fireworks"   "bobblehead"
d0 <- dbGetQuery(con, "SELECT * from events;")

Use fetch if query may return large data

res <- dbSendQuery(con, "SELECT * from events;")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, 20)
  cat("The chunk has ", sprintf("%2d", nrow(chunk)), " records. The mean attendance for the current chunk is ", mean(chunk$attend), ". \n",  sep="")
}
## The chunk has 20 records. The mean attendance for the current chunk is 38772.35. 
## The chunk has 20 records. The mean attendance for the current chunk is 42935.1. 
## The chunk has 20 records. The mean attendance for the current chunk is 42905.65. 
## The chunk has 20 records. The mean attendance for the current chunk is 39898.5. 
## The chunk has  1 records. The mean attendance for the current chunk is 34014.
dbClearResult(res)
dbDisconnect(con)
# d0$month %>% unique()
# d0$day_of_week %>% unique()
d <- d0 %>% 
  mutate(month = factor(month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")),
         day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
         temp = (temp-32)*5/9) %>% # in celcius
  mutate(across(where(is.character), factor))

summary(d)
##  month         day            attend         day_of_week      opponent 
##  APR:12   Min.   : 1.00   Min.   :24312   Monday   :12   Giants   : 9  
##  MAY:18   1st Qu.: 8.00   1st Qu.:34493   Tuesday  :13   Padres   : 9  
##  JUN: 9   Median :15.00   Median :40284   Wednesday:12   Rockies  : 9  
##  JUL:12   Mean   :16.14   Mean   :41040   Thursday : 5   Snakes   : 9  
##  AUG:15   3rd Qu.:25.00   3rd Qu.:46588   Friday   :13   Cardinals: 7  
##  SEP:12   Max.   :31.00   Max.   :56000   Saturday :13   Brewers  : 4  
##  OCT: 3                                   Sunday   :13   (Other)  :34  
##       temp          skies    day_night   cap     shirt    fireworks bobblehead
##  Min.   :12.22   Clear :62   Day  :15   NO :79   NO :78   NO :67    NO :70    
##  1st Qu.:19.44   Cloudy:19   Night:66   YES: 2   YES: 3   YES:14    YES:11    
##  Median :22.78                                                                
##  Mean   :22.86                                                                
##  3rd Qu.:26.11                                                                
##  Max.   :35.00                                                                
## 
  1. What are the number of plays on each week day and in each month of a year?
xtabs(~ day_of_week + month, data = d) %>% chisq.test(simulate.p.value = TRUE, B = 10000)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  .
## X-squared = 11.678, df = NA, p-value = 1

Month and day of week of the game are unrelated.

  1. Check the orders of the levels of the day_of_week and month factors. If necessary, put them in the logical order.

Done.

  1. How many times were bobblehead promotions run on each week day?
xtabs(~ bobblehead + day_of_week, d)
##           day_of_week
## bobblehead Monday Tuesday Wednesday Thursday Friday Saturday Sunday
##        NO      12       7        12        3     13       11     12
##        YES      0       6         0        2      0        2      1
xtabs(~ bobblehead + month, d)
##           month
## bobblehead APR MAY JUN JUL AUG SEP OCT
##        NO   11  16   7   9  12  12   3
##        YES   1   2   2   3   3   0   0
  1. How did the attendance vary across week days? Draw boxplots. On which day of week was the attendance the highest on average?
d %>% 
  ggplot(aes(day_of_week, attend)) +
  geom_boxplot()

  1. Is there an association between attendance and
    • whether the game is played in day light or night?
    • whether skies are clear or cloudy?
# xtabs(attend ~ day_night, d)
d %>% 
  group_by(day_night) %>% 
  summarize(avgattend = mean(attend)) %>% 
  pull(avgattend) %>% chisq.test()
## 
##  Chi-squared test for given probabilities
## 
## data:  .
## X-squared = 10.337, df = 1, p-value = 0.001304

Since p-value is less than 0.05, the split is not uniform.

# xtabs(attend ~ day_night, d)
d %>% 
  group_by(skies) %>% 
  summarize(avgattend = mean(attend)) %>% 
  pull(avgattend) %>% chisq.test()
## 
##  Chi-squared test for given probabilities
## 
## data:  .
## X-squared = 107.19, df = 1, p-value < 0.00000000000000022

Skies is also an important predictor since test of uniform dist rejected that the attendance is the same on average on clear or cloudy sky days.

  1. Is there an association between attendance and temperature?
    • If yes, is there a positive or negative association?
    • Do the associations differ on clear and cloud days or day or night times?
d %>% 
  # select(is.numeric)
  select(attend, temp) %>% 
  cor() %>% 
  round(4)
##        attend  temp
## attend  1.000 0.099
## temp    0.099 1.000
d %>% 
  ggplot(aes(temp, attend)) +
  geom_point() +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Next time: A linear regression model

Regress attendance on month, day of the week, and bobblehead promotion.

lmod <- lm(attend ~ day_of_week + month + bobblehead, data = d)
summary(lmod)
## 
## Call:
## lm(formula = attend ~ day_of_week + month + bobblehead, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446 < 0.0000000000000002 ***
## day_of_weekTuesday    7911.49    2702.21   2.928              0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979              0.33134    
## day_of_weekThursday    775.36    3486.15   0.222              0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950              0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497              0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682              0.00920 ** 
## monthMAY             -2385.62    2291.22  -1.041              0.30152    
## monthJUN              7163.23    2732.72   2.621              0.01083 *  
## monthJUL              2849.83    2578.60   1.105              0.27303    
## monthAUG              2377.92    2402.91   0.990              0.32593    
## monthSEP                29.03    2521.25   0.012              0.99085    
## monthOCT              -662.67    4046.45  -0.164              0.87041    
## bobbleheadYES        10714.90    2419.52   4.429            0.0000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 0.0000002083
  1. Is there any evidence for a relationship between attendance and other variables? Why or why not?

Is at least of not the variable month, week day, bobblehead associated with attendance?

Construct two models.

small <- update(lmod, . ~ 1)
anova(small, lmod)
## Analysis of Variance Table
## 
## Model 1: attend ~ 1
## Model 2: attend ~ day_of_week + month + bobblehead
##   Res.Df        RSS Df  Sum of Sq      F       Pr(>F)    
## 1     80 5507932888                                      
## 2     67 2509574563 13 2998358324 6.1576 0.0000002083 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Full model is much better than the small (null) model. This provides us evindence at least one of the predictors (month, weeek day, bobblehead) seems to be related to attendance.

  1. Does the bobblehead promotion have a statistically significant effect on the attendance?
small <- update(lmod, . ~ . -bobblehead)
anova(small, lmod)
## Analysis of Variance Table
## 
## Model 1: attend ~ day_of_week + month
## Model 2: attend ~ day_of_week + month + bobblehead
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1     68 3244161740                                  
## 2     67 2509574563  1 734587177 19.612 0.0000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Small p-value (< 0.05) implies that full model is significantly better than the small model, so bobblehead is statistically significant.

  1. Do month and day of week variables help to explain the number of attendants?
small <- update(lmod, . ~ . - month)
anova(small, lmod)
## Analysis of Variance Table
## 
## Model 1: attend ~ day_of_week + bobblehead
## Model 2: attend ~ day_of_week + month + bobblehead
##   Res.Df        RSS Df Sum of Sq      F  Pr(>F)  
## 1     73 3129721926                              
## 2     67 2509574563  6 620147363 2.7594 0.01858 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car)
Anova(lmod)
## Anova Table (Type II tests)
## 
## Response: attend
##                 Sum Sq Df F value    Pr(>F)    
## day_of_week  575839199  6  2.5623   0.02704 *  
## month        620147363  6  2.7594   0.01858 *  
## bobblehead   734587177  1 19.6118 0.0000359 ***
## Residuals   2509574563 67                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. How many fans are expected to be drawn alone by a bobblehead promotion to a home game? Give a 90% confidence interval.
coef(lmod)["bobbleheadYES"]
## bobbleheadYES 
##       10714.9
confint(lmod, "bobbleheadYES")
##                  2.5 %   97.5 %
## bobbleheadYES 5885.521 15544.29
  1. How good does the model fit to the data? Why? Comment on residual standard error and R\(^2\). Plot observed attendance against predicted attendance.
summary(lmod)
## 
## Call:
## lm(formula = attend ~ day_of_week + month + bobblehead, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446 < 0.0000000000000002 ***
## day_of_weekTuesday    7911.49    2702.21   2.928              0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979              0.33134    
## day_of_weekThursday    775.36    3486.15   0.222              0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950              0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497              0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682              0.00920 ** 
## monthMAY             -2385.62    2291.22  -1.041              0.30152    
## monthJUN              7163.23    2732.72   2.621              0.01083 *  
## monthJUL              2849.83    2578.60   1.105              0.27303    
## monthAUG              2377.92    2402.91   0.990              0.32593    
## monthSEP                29.03    2521.25   0.012              0.99085    
## monthOCT              -662.67    4046.45  -0.164              0.87041    
## bobbleheadYES        10714.90    2419.52   4.429            0.0000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 0.0000002083
d %>% 
  mutate(fitted = fitted(lmod)) %>% 
  ggplot(aes(fitted, attend)) +
  geom_point() +
  geom_abline() +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  1. Predict the number of attendees to a typical home game on a Wednesday in June if a bobblehead promotion is extended. Give a 90% prediction interval.
predict(lmod, newdata = 
          tibble(day_of_week = "Wednesday", 
                 month = "JUN",
                 bobblehead = "YES"),
        interval = "predict", level = 0.90)
##        fit     lwr      upr
## 1 54247.32 42491.1 66003.55

Project (will be graded)

Include all variables and conduct a full regression analysis of the problem. Submit your R markdown and html files to course homepage on moodle.

Bibliography

Baumer, B. S., D. T. Kaplan, and N. J. Horton. 2017. Modern Data Science with R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press. https://books.google.com.tr/books?id=NrddDgAAQBAJ.
Miller, T. W. 2014. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. FT Press Analytics. Pearson Education. https://books.google.com.tr/books?id=PU6nBAAAQBAJ.
Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://books.google.com.tr/books?id=aZRYrgEACAAJ.