Spring 2023
GE 461 Introduction to Data Science
The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games. At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014, chap. 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)
The 2012 season data in the events
table of SQLite database data/dodgers.sqlite
contain for each of 81 home play the
We will use R
, RStudio
, R Markdown
for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online
R
and RStudio
,R Markdown
to interact with R
and conduct various predictive analyses.All materials for the next three weeks will be available on Google drive.
Connect to data/dodgers.sqlite
. Read table events
into a variable in R
.
Read Baumer, Kaplan, and Horton (2017, chaps. 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.
Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL
for real projects.
Open RStudio
terminal, connect to database dodgers.sqlite
with sqlite3
. Explore it (there is only one table, events
, at this time) with commands
.help
.databases
.tables
.schema <table_name>
.headers on
.mode column
SELECT ...
.quit
Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.
R
(likewise, Python
) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R
or Python
to fit models to data.
library(RSQLite) ## if package is not on the computer, then install it only once using Tools > Install packages...
con <- dbConnect(SQLite(), "../data/dodgers.sqlite") # read Modern Data Science with R for different ways to connect a database.
## dbListTables(con)
tbl(con, "events") %>%
collect() %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>%
mutate_if(is.character, factor) %>%
mutate(temp = round((temp- 32)*5/9)) -> events
events %>%
count(bobblehead, fireworks)
## # A tibble: 3 × 3
## bobblehead fireworks n
## <fct> <fct> <int>
## 1 NO NO 56
## 2 NO YES 14
## 3 YES NO 11
Table 1 and 2 summarize the number of games played on each weekday and month.
events %>%
count(day_of_week, month) %>%
pivot_wider(names_from = day_of_week, values_from = n) %>%
pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
month | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
APR | 1 | 2 | 2 | 1 | 2 | 2 | 2 |
MAY | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
JUN | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
JUL | 3 | 3 | 2 | NA | 1 | 1 | 2 |
AUG | 2 | 2 | 3 | 1 | 3 | 2 | 2 |
SEP | 1 | 1 | 1 | 1 | 2 | 3 | 3 |
OCT | 1 | 1 | 1 | NA | NA | NA | NA |
events %>%
ggplot(aes(day_of_week)) +
geom_bar(aes(fill=month))
Figure 2: Barplot of counts of games for each weekday and month
Figure 3 shows heatmap of total attendance versus weekday and month. The colors change from bright yellow to dark red as attendance increases. Default heatmap shuffles rows and columns so as to bring together weekdays and months with similar attendance. Here we see May, Aug, and Jul together within the months and Saturday, Friday, Sunday within the weekdays. Learn more about xtabs (cross-table) heatmap by typing ?xtabs
and ?heatmap
in the R console.
xtabs(attend ~ day_of_week + month, events) %>%
heatmap()
Figure 3: Heatmap of attendance versus weekday and month.
In Figure 4, I have added one more aes (colour) to capture day_night information. To avoid overplotting, I replaced geom_point()
with geom_jitter()
. These plots were also illuminating. So let us thank your friend who suggested this one, too.
sum_attend <- events %>%
group_by(day_of_week, month, day_night) %>%
summarize(mean_attend = mean(attend),
total_attend = sum(attend), .groups = "drop")
sum_attend %>%
ggplot(aes(day_of_week, month)) +
geom_jitter(aes(size = mean_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
scale_size(labels = scales::comma) +
labs(title = "Average attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month")
sum_attend %>%
ggplot(aes(day_of_week, month)) +
geom_jitter(aes(size = total_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
labs(title = "Total attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month") +
scale_size(labels = scales::comma) +
guides(col = guide_legend(order = 1),
shape = guide_legend(order = 2))
Figure 4: Average and total attendances on each weekday and month in each part of day
day_of_week
and month
factors. If necessary, put them in the logical order.levels(events$day_of_week)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
levels(events$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
events %>%
count(day_of_week, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n) %>%
replace_na(list(YES = 0)) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(Bobblehead = YES)
## # A tibble: 7 × 3
## day_of_week Bobblehead Total
## <fct> <dbl> <dbl>
## 1 Monday 0 12
## 2 Tuesday 6 13
## 3 Wednesday 0 12
## 4 Thursday 2 5
## 5 Friday 0 13
## 6 Saturday 2 13
## 7 Sunday 1 13
events %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot()
events %>%
slice_max(order_by = attend, n=5)
## # A tibble: 5 × 12
## month day attend day_of_week opponent temp skies day_night cap shirt
## <fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
## 1 APR 10 56000 Tuesday Pirates 19 Clear Day NO NO
## 2 AUG 21 56000 Tuesday Giants 24 Clear Night NO NO
## 3 JUL 1 55359 Sunday Mets 24 Clear Night NO NO
## 4 JUN 12 55279 Tuesday Angels 19 Cloudy Night NO NO
## 5 AUG 7 55024 Tuesday Rockies 27 Clear Night NO NO
## # … with 2 more variables: fireworks <fct>, bobblehead <fct>
events %>%
ggplot(aes(day_night, attend)) +
geom_boxplot()
t.test(x=events$attend[events$day_night=="Day"],
y=events$attend[events$day_night=="Night"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
## t = 0.42851, df = 23.62, p-value = 0.6722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3531.652 5380.397
## sample estimates:
## mean of x mean of y
## 41793.27 40868.89
Since p-value (0.67) is large (greater than 0.05), we cannot reject null, which means there is no statistical difference between average attendance of games played in day and night.
events %>%
ggplot(aes(skies, attend)) +
geom_boxplot()
t.test(x=events$attend[events$skies=="Clear"],
y=events$attend[events$skies=="Cloudy"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
## t = 1.2868, df = 27.664, p-value = 0.2088
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1741.315 7617.103
## sample estimates:
## mean of x mean of y
## 41729.21 38791.32
We do not see a statisticall significant difference between the average attendance of the games played under clear and cloudy skies.
events %>%
ggplot(aes(temp, attend)) +
geom_jitter(aes(col = day_of_week)) +
geom_text(aes(label = str_sub(day_of_week, 1,3), col = day_of_week)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
events %>%
ggplot(aes(temp, attend)) +
geom_jitter(aes(col = month)) +
geom_text(aes(label = str_sub(month, 1,3), col = month)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
events %>%
ggplot(aes(month, temp)) +
geom_boxplot(aes(fill = day_night))
events %>%
ggplot(aes(temp, attend)) +
geom_jitter(aes(col = bobblehead)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
events %>%
ggplot(aes(temp, attend)) +
geom_jitter(aes(col = opponent)) +
geom_text(aes(label = str_sub(opponent, 1,4), col = opponent)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# +
# geom_smooth(se=FALSE, method = "lm", formula = y ~ x, col = "red", lty = "dashed") +
# geom_smooth(se=FALSE, method = "lm", formula = y ~ x + pmax(x-24,0), col = "red")
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 23)^+ + \varepsilon_i \]
lmod0 <- lm(attend ~ temp , data = events)
lmod0 %>% summary()
##
## Call:
## lm(formula = attend ~ temp, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16066.6 -6519.8 -995.6 6353.9 15621.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37105.0 4626.5 8.020 0.00000000000797 ***
## temp 172.3 198.5 0.868 0.388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8310 on 79 degrees of freedom
## Multiple R-squared: 0.009447, Adjusted R-squared: -0.003091
## F-statistic: 0.7535 on 1 and 79 DF, p-value: 0.388
Null hypothesis tested with F-stat at the very bottom of the outplot of lm(): “Null: There is no relation between attendance and temperature” We say the data are consistent with the null hypothesis if p-value is large [always between 0 and 1]. In this case, pvlue is 0.38, which is large [small means < 0.05 in which case we reject the null]. Since p-value is large in this case, We conclude that the null is consistent with the data. We arrived at wrpng conclusion because the model was not checked through. Use diagnostics to make sure the model captures the features of data
plot(lmod0)
lmod0 %>% summary()
##
## Call:
## lm(formula = attend ~ temp, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16066.6 -6519.8 -995.6 6353.9 15621.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37105.0 4626.5 8.020 0.00000000000797 ***
## temp 172.3 198.5 0.868 0.388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8310 on 79 degrees of freedom
## Multiple R-squared: 0.009447, Adjusted R-squared: -0.003091
## F-statistic: 0.7535 on 1 and 79 DF, p-value: 0.388
lmod01 <- lm(attend ~ temp + pmax(0, temp - 24), data = events)
lmod01 %>% summary()
##
## Call:
## lm(formula = attend ~ temp + pmax(0, temp - 24), data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16488.8 -5799.8 -220.5 5079.2 16250.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18387.9 6645.9 2.767 0.007065 **
## temp 1124.3 317.0 3.547 0.000664 ***
## pmax(0, temp - 24) -2269.4 614.8 -3.691 0.000412 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7717 on 78 degrees of freedom
## Multiple R-squared: 0.1567, Adjusted R-squared: 0.1351
## F-statistic: 7.249 on 2 and 78 DF, p-value: 0.001295
Here F-stat p-value is small (small if < 0.05). Therefore we reject the null (null is equivalent to saying no relation between attendance and predictors in the right). So find that temp is a useful predictor.
On each row of table in the output, we test “null: corresponding variable has zero effect on attendance.” The null is consistent with data is p-value on the last column is large (> 0.05). In this case both linear and broken line have nozero effects.
plot(lmod0, which = 1)
plot(lmod01, which = 1)
How does the break function help capture the increase/decrease in attendance with temperature?
identity <- function(x) x
breakfun <- function(x) pmax(x-24,0)
curve(identity(x), xlim = c(13,35), xlab = "temp", ylab = "attend", ylim = c(-10,40))
curve(breakfun(x), xlim = c(13,35), xlab = "temp", ylab = "attend", add = TRUE)
curve(identity(x) - 2* breakfun(x), xlim = c(13,35), xlab = "temp", ylab = "attend", add = TRUE, col ="red")
abline(v=24, lty = "dashed")
x <- c(1, 10, 20, 23, 24, 25,30,35)
breakfun(x) %>% set_names(x)
## 1 10 20 23 24 25 30 35
## 0 0 0 0 0 1 6 11
breakfun
## function(x) pmax(x-24,0)
## <bytecode: 0x55bbaff6f7e8>
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp-23)^+ + \varepsilon_i \]
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x + pmax(x-24,0), col = "red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is statistically significant relation between attendance and temperature.
Regress attendance on month
, day of the week
, and bobblehead
promotion.
lmod1 <- lm(attend ~ month + day_of_week + bobblehead, data = events)
Assuming that model is fine…
summary(lmod1)
##
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 0.0000000000000002 ***
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## bobbleheadYES 10714.90 2419.52 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
In the summary table, on the lines of bobbleheadYES, we test the null hypothesis: “Giving bobblehead has no effect on attendance”. Null and data are consistent if p-value on the same line is large (> 0.05). If p-value is small, then data contradicts with the null, and therefore we reject the null. Since p-value = 0.0000359 is indeed small we reject the null. So we conclude that bobblehead can affect the attendance (in the positive direction because the estimate has positive value 10714.90). We expect the attendance to increase by 10715 on average.
But first check if the model is satisfactory
plot(lmod1)
plot(lmod1, which = 5)
plot(lmod1, which = 4)
plot(lmod1, which = 3)
car::ncvTest(lmod1) # null says the variance is constant. Since p is small (< 0.05), variance cannot be constant
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.137439, Df = 1, p = 0.041945
plot(lmod1, which = 2)
shapiro.test(rstandard(lmod1)) # null says standardized residuals have normal distribution. Since p is small, the residuals do not seem to have normal distribution.
##
## Shapiro-Wilk normality test
##
## data: rstandard(lmod1)
## W = 0.9619, p-value = 0.01658
car::residualPlots(lmod1)
## Test stat Pr(>|Test stat|)
## month
## day_of_week
## bobblehead
## Tukey test -1.1229 0.2615
lmod2 <- lm(attend ~ month*day_of_week + bobblehead, events)
lmod2 %>% plot()
## Warning: not plotting observations with leverage one:
## 3, 7, 30, 31, 32, 33, 36, 37, 44, 45, 65, 69, 70, 71, 72, 79, 80, 81
shapiro.test(rstandard(lmod2)) # residuals seem to have normal distr
##
## Shapiro-Wilk normality test
##
## data: rstandard(lmod2)
## W = 0.98944, p-value = 0.8675
car::ncvTest(lmod2) # variance seems to be constant
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0002343231, Df = 1, p = 0.98779
plot(lmod1, which = 2)
shapiro.test(rstandard(lmod1))
##
## Shapiro-Wilk normality test
##
## data: rstandard(lmod1)
## W = 0.9619, p-value = 0.01658
car::ncvTest(lmod1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.137439, Df = 1, p = 0.041945
summary(lmod2)
##
## Call:
## lm(formula = attend ~ month * day_of_week + bobblehead, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11035 -1686 0 1692 10851
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26376 5485 4.809 0.0000269 ***
## monthMAY 8971 6333 1.417 0.16521
## monthJUN 24183 7756 3.118 0.00357 **
## monthJUL 6928 6333 1.094 0.28127
## monthAUG 8392 6717 1.249 0.21958
## monthSEP 7164 7756 0.924 0.36183
## monthOCT 7248 7756 0.934 0.35629
## day_of_weekTuesday 23631 6717 3.518 0.00120 **
## day_of_weekWednesday 1661 6717 0.247 0.80610
## day_of_weekThursday 1952 7756 0.252 0.80273
## day_of_weekFriday 11828 6717 1.761 0.08676 .
## day_of_weekSaturday 17884 6953 2.572 0.01438 *
## day_of_weekSunday 17180 6717 2.558 0.01490 *
## bobbleheadYES 12272 3590 3.418 0.00158 **
## monthMAY:day_of_weekTuesday -23488 8420 -2.789 0.00839 **
## monthJUN:day_of_weekTuesday -31183 10871 -2.869 0.00686 **
## monthJUL:day_of_weekTuesday -14287 8161 -1.751 0.08853 .
## monthAUG:day_of_weekTuesday -15159 9386 -1.615 0.11501
## monthSEP:day_of_weekTuesday -16552 10261 -1.613 0.11545
## monthOCT:day_of_weekTuesday -14782 10261 -1.441 0.15833
## monthMAY:day_of_weekWednesday -7257 8378 -0.866 0.39211
## monthJUN:day_of_weekWednesday -8726 10261 -0.850 0.40071
## monthJUL:day_of_weekWednesday 11798 8378 1.408 0.16764
## monthAUG:day_of_weekWednesday 1522 8378 0.182 0.85691
## monthSEP:day_of_weekWednesday 15359 10261 1.497 0.14314
## monthOCT:day_of_weekWednesday -1271 10261 -0.124 0.90211
## monthMAY:day_of_weekThursday -10526 10013 -1.051 0.30018
## monthJUN:day_of_weekThursday -15777 11542 -1.367 0.18012
## monthJUL:day_of_weekThursday NA NA NA NA
## monthAUG:day_of_weekThursday 5629 10871 0.518 0.60778
## monthSEP:day_of_weekThursday 7817 10969 0.713 0.48066
## monthOCT:day_of_weekThursday NA NA NA NA
## monthMAY:day_of_weekFriday -9582 8073 -1.187 0.24305
## monthJUN:day_of_weekFriday -17290 9500 -1.820 0.07708 .
## monthJUL:day_of_weekFriday -1259 9232 -0.136 0.89231
## monthAUG:day_of_weekFriday -6275 8378 -0.749 0.45871
## monthSEP:day_of_weekFriday -6718 9500 -0.707 0.48400
## monthOCT:day_of_weekFriday NA NA NA NA
## monthMAY:day_of_weekSaturday -16671 8270 -2.016 0.05134 .
## monthJUN:day_of_weekSaturday -23729 9668 -2.454 0.01908 *
## monthJUL:day_of_weekSaturday -9445 9405 -1.004 0.32194
## monthAUG:day_of_weekSaturday -9216 8856 -1.041 0.30496
## monthSEP:day_of_weekSaturday -11702 9405 -1.244 0.22145
## monthOCT:day_of_weekSaturday NA NA NA NA
## monthMAY:day_of_weekSunday -10382 8073 -1.286 0.20665
## monthJUN:day_of_weekSunday -14235 10261 -1.387 0.17387
## monthJUL:day_of_weekSunday -9083 8568 -1.060 0.29618
## monthAUG:day_of_weekSunday -9748 8672 -1.124 0.26844
## monthSEP:day_of_weekSunday -16397 9232 -1.776 0.08416 .
## monthOCT:day_of_weekSunday NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5485 on 36 degrees of freedom
## Multiple R-squared: 0.8034, Adjusted R-squared: 0.5631
## F-statistic: 3.343 on 44 and 36 DF, p-value: 0.0001641
Compare simple and complex models using anova (useful for comparison of nested models with F-test)
anova(lmod1, lmod2)
## Analysis of Variance Table
##
## Model 1: attend ~ month + day_of_week + bobblehead
## Model 2: attend ~ month * day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 67 2509574563
## 2 36 1082895916 31 1426678647 1.53 0.1096
As far as we are concerned with the point predictions on the train data, these two models are not different. Null is that small model is correct. The null is consistent with data since p-value is large.
AIC(lmod1, lmod2)
## df AIC
## lmod1 15 1657.031
## lmod2 46 1650.953
The lower the AIC, the better is the model. AIC picks complex model. F-test suggests simple model (model1). AIC (measures generalizibility to unseen data) and diagnostic plots recommend complex model. Since we want to predict attendance for future games, we take AIC more seriously. Therefore I will follow AIC + diagnostics recommendation and work with complex model.
p-value is a measure of consistency of null hypothesis (=no relation between variable and attendance) with data.
lmod2 %>% summary()
##
## Call:
## lm(formula = attend ~ month * day_of_week + bobblehead, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11035 -1686 0 1692 10851
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26376 5485 4.809 0.0000269 ***
## monthMAY 8971 6333 1.417 0.16521
## monthJUN 24183 7756 3.118 0.00357 **
## monthJUL 6928 6333 1.094 0.28127
## monthAUG 8392 6717 1.249 0.21958
## monthSEP 7164 7756 0.924 0.36183
## monthOCT 7248 7756 0.934 0.35629
## day_of_weekTuesday 23631 6717 3.518 0.00120 **
## day_of_weekWednesday 1661 6717 0.247 0.80610
## day_of_weekThursday 1952 7756 0.252 0.80273
## day_of_weekFriday 11828 6717 1.761 0.08676 .
## day_of_weekSaturday 17884 6953 2.572 0.01438 *
## day_of_weekSunday 17180 6717 2.558 0.01490 *
## bobbleheadYES 12272 3590 3.418 0.00158 **
## monthMAY:day_of_weekTuesday -23488 8420 -2.789 0.00839 **
## monthJUN:day_of_weekTuesday -31183 10871 -2.869 0.00686 **
## monthJUL:day_of_weekTuesday -14287 8161 -1.751 0.08853 .
## monthAUG:day_of_weekTuesday -15159 9386 -1.615 0.11501
## monthSEP:day_of_weekTuesday -16552 10261 -1.613 0.11545
## monthOCT:day_of_weekTuesday -14782 10261 -1.441 0.15833
## monthMAY:day_of_weekWednesday -7257 8378 -0.866 0.39211
## monthJUN:day_of_weekWednesday -8726 10261 -0.850 0.40071
## monthJUL:day_of_weekWednesday 11798 8378 1.408 0.16764
## monthAUG:day_of_weekWednesday 1522 8378 0.182 0.85691
## monthSEP:day_of_weekWednesday 15359 10261 1.497 0.14314
## monthOCT:day_of_weekWednesday -1271 10261 -0.124 0.90211
## monthMAY:day_of_weekThursday -10526 10013 -1.051 0.30018
## monthJUN:day_of_weekThursday -15777 11542 -1.367 0.18012
## monthJUL:day_of_weekThursday NA NA NA NA
## monthAUG:day_of_weekThursday 5629 10871 0.518 0.60778
## monthSEP:day_of_weekThursday 7817 10969 0.713 0.48066
## monthOCT:day_of_weekThursday NA NA NA NA
## monthMAY:day_of_weekFriday -9582 8073 -1.187 0.24305
## monthJUN:day_of_weekFriday -17290 9500 -1.820 0.07708 .
## monthJUL:day_of_weekFriday -1259 9232 -0.136 0.89231
## monthAUG:day_of_weekFriday -6275 8378 -0.749 0.45871
## monthSEP:day_of_weekFriday -6718 9500 -0.707 0.48400
## monthOCT:day_of_weekFriday NA NA NA NA
## monthMAY:day_of_weekSaturday -16671 8270 -2.016 0.05134 .
## monthJUN:day_of_weekSaturday -23729 9668 -2.454 0.01908 *
## monthJUL:day_of_weekSaturday -9445 9405 -1.004 0.32194
## monthAUG:day_of_weekSaturday -9216 8856 -1.041 0.30496
## monthSEP:day_of_weekSaturday -11702 9405 -1.244 0.22145
## monthOCT:day_of_weekSaturday NA NA NA NA
## monthMAY:day_of_weekSunday -10382 8073 -1.286 0.20665
## monthJUN:day_of_weekSunday -14235 10261 -1.387 0.17387
## monthJUL:day_of_weekSunday -9083 8568 -1.060 0.29618
## monthAUG:day_of_weekSunday -9748 8672 -1.124 0.26844
## monthSEP:day_of_weekSunday -16397 9232 -1.776 0.08416 .
## monthOCT:day_of_weekSunday NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5485 on 36 degrees of freedom
## Multiple R-squared: 0.8034, Adjusted R-squared: 0.5631
## F-statistic: 3.343 on 44 and 36 DF, p-value: 0.0001641
Check F-statistic’s p-value. If it is less than 0.05, then there is relation between attendance and predictors.
anova(update(lmod2, . ~ . - bobblehead), lmod2)
## Analysis of Variance Table
##
## Model 1: attend ~ month + day_of_week + month:day_of_week
## Model 2: attend ~ month * day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 37 1434296454
## 2 36 1082895916 1 351400539 11.682 0.001582 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova / F-test reject the small model (without bobblehead). Therefore we conclude that bobblehead is important to boost the attendance.
AIC(update(lmod2, . ~ . - bobblehead), lmod2)
## df AIC
## update(lmod2, . ~ . - bobblehead) 45 1671.717
## lmod2 46 1650.953
AIC confirms that bobblehead is important.
Check also with cross-validation.
anova(update(lmod2, . ~ . - month - month:day_of_week), lmod2)
## Analysis of Variance Table
##
## Model 1: attend ~ day_of_week + bobblehead
## Model 2: attend ~ month * day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 73 3129721926
## 2 36 1082895916 37 2046826010 1.8391 0.03515 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Month is an important variable because small model’s p-value is small – therefore we reject the small model.
update(lmod2, contrast = list(month = "contr.sum", bobblehead = "contr.sum")) %>% coef()
## (Intercept) month1
## 41495.70238 -8983.73810
## month2 month3
## -12.73810 15199.26190
## month4 month5
## -2056.07143 -591.23810
## month6 day_of_weekTuesday
## -1819.73810 7137.86395
## day_of_weekWednesday day_of_weekThursday
## 3293.04762 9769.00000
## day_of_weekFriday day_of_weekSaturday
## 5110.00000 6181.66667
## day_of_weekSunday bobblehead1
## 782.66667 -6135.96429
## month1:day_of_weekTuesday month2:day_of_weekTuesday
## 16493.13605 -6995.14966
## month3:day_of_weekTuesday month4:day_of_weekTuesday
## -14689.79252 2205.82653
## month5:day_of_weekTuesday month6:day_of_weekTuesday
## 1333.70748 -58.86395
## month1:day_of_weekWednesday month2:day_of_weekWednesday
## -1632.04762 -8889.04762
## month3:day_of_weekWednesday month4:day_of_weekWednesday
## -10358.04762 10165.78571
## month5:day_of_weekWednesday month6:day_of_weekWednesday
## -110.54762 13726.95238
## month1:day_of_weekThursday month2:day_of_weekThursday
## -7817.00000 -18343.00000
## month3:day_of_weekThursday month4:day_of_weekThursday
## -23593.92857 NA
## month5:day_of_weekThursday month6:day_of_weekThursday
## -2188.42857 NA
## month1:day_of_weekFriday month2:day_of_weekFriday
## 6718.00000 -2863.66667
## month3:day_of_weekFriday month4:day_of_weekFriday
## -10571.50000 5459.33333
## month5:day_of_weekFriday month6:day_of_weekFriday
## 442.83333 NA
## month1:day_of_weekSaturday month2:day_of_weekSaturday
## 11701.86905 -4969.00000
## month3:day_of_weekSaturday month4:day_of_weekSaturday
## -12027.16667 2256.73810
## month5:day_of_weekSaturday month6:day_of_weekSaturday
## 2485.83333 NA
## month1:day_of_weekSunday month2:day_of_weekSunday
## 16397.33333 6015.33333
## month3:day_of_weekSunday month4:day_of_weekSunday
## 2162.33333 7314.70238
## month5:day_of_weekSunday month6:day_of_weekSunday
## 6649.83333 NA
effects::predictorEffect("month", lmod2) %>% plot()
Expected number of additional fans drawn to a home game with a bobblehead promotion
lmod2 %>% coef %>% .["bobbleheadYES"]
## bobbleheadYES
## 12271.93
confint(lmod2, parm = "bobbleheadYES", level = 0.95)
## 2.5 % 97.5 %
## bobbleheadYES 4990.077 19553.78
confint(lmod2, parm = "bobbleheadYES", level = 0.80)
## 10 % 90 %
## bobbleheadYES 7584.494 16959.36
predict(object = lmod2, newdata = data.frame(month = "JUN", day_of_week = "Wednesday", bobblehead = "YES"), interval = "prediction", level = 0.80)
## fit lwr upr
## 1 55765.93 44607.6 66924.25
Include all variables and conduct a full regression analysis of the problem. Submit your R markdown
and html
files to course homepage on moodle.
events %>% names()
## [1] "month" "day" "attend" "day_of_week" "opponent"
## [6] "temp" "skies" "day_night" "cap" "shirt"
## [11] "fireworks" "bobblehead"