Spring 2024
GE 461 Introduction to Data Science
The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games. At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014, chap. 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)
The 2012 season data in the events
table of SQLite database data/dodgers.sqlite
contain for each of 81 home play the
We will use R
, RStudio
, R Markdown
for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online
R
and RStudio
,R Markdown
to interact with R
and conduct various predictive analyses.All materials for the next three weeks will be available on Google drive.
Connect to data/dodgers.sqlite
. Read table events
into a variable in R
.
Read Baumer, Kaplan, and Horton (2017, chaps. 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.
Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL
for real projects.
Open RStudio
terminal, connect to database dodgers.sqlite
with sqlite3
. Explore it (there is only one table, events
, at this time) with commands
.help
.databases
.tables
.schema <table_name>
.headers on
.mode column
SELECT ...
.quit
Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.
R
(likewise, Python
) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R
or Python
to fit models to data.
library(RSQLite)
con <-dbConnect(SQLite(), "../data/dodgers.sqlite")
dbListTables(con)
## [1] "events"
dbListFields(con, "events")
## [1] "month" "day" "attend" "day_of_week" "opponent"
## [6] "temp" "skies" "day_night" "cap" "shirt"
## [11] "fireworks" "bobblehead"
d0 <- dbGetQuery(con, "SELECT * from events;")
Use fetch if query may return large data
res <- dbSendQuery(con, "SELECT * from events;")
while (!dbHasCompleted(res)) {
chunk <- dbFetch(res, 20)
cat("The chunk has ", sprintf("%2d", nrow(chunk)), " records. The mean attendance for the current chunk is ", mean(chunk$attend), ". \n", sep="")
}
## The chunk has 20 records. The mean attendance for the current chunk is 38772.35.
## The chunk has 20 records. The mean attendance for the current chunk is 42935.1.
## The chunk has 20 records. The mean attendance for the current chunk is 42905.65.
## The chunk has 20 records. The mean attendance for the current chunk is 39898.5.
## The chunk has 1 records. The mean attendance for the current chunk is 34014.
dbClearResult(res)
dbDisconnect(con)
# d0$month %>% unique()
# d0$day_of_week %>% unique()
d <- d0 %>%
mutate(month = factor(month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")),
day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
temp = (temp-32)*5/9) %>% # in celcius
mutate(across(where(is.character), factor))
summary(d)
## month day attend day_of_week opponent
## APR:12 Min. : 1.00 Min. :24312 Monday :12 Giants : 9
## MAY:18 1st Qu.: 8.00 1st Qu.:34493 Tuesday :13 Padres : 9
## JUN: 9 Median :15.00 Median :40284 Wednesday:12 Rockies : 9
## JUL:12 Mean :16.14 Mean :41040 Thursday : 5 Snakes : 9
## AUG:15 3rd Qu.:25.00 3rd Qu.:46588 Friday :13 Cardinals: 7
## SEP:12 Max. :31.00 Max. :56000 Saturday :13 Brewers : 4
## OCT: 3 Sunday :13 (Other) :34
## temp skies day_night cap shirt fireworks bobblehead
## Min. :12.22 Clear :62 Day :15 NO :79 NO :78 NO :67 NO :70
## 1st Qu.:19.44 Cloudy:19 Night:66 YES: 2 YES: 3 YES:14 YES:11
## Median :22.78
## Mean :22.86
## 3rd Qu.:26.11
## Max. :35.00
##
xtabs(~ day_of_week + month, data = d) %>% chisq.test(simulate.p.value = TRUE, B = 10000)
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: .
## X-squared = 11.678, df = NA, p-value = 1
Month and day of week of the game are unrelated.
day_of_week
and month
factors. If necessary, put them in the logical order.Done.
xtabs(~ bobblehead + day_of_week, d)
## day_of_week
## bobblehead Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## NO 12 7 12 3 13 11 12
## YES 0 6 0 2 0 2 1
xtabs(~ bobblehead + month, d)
## month
## bobblehead APR MAY JUN JUL AUG SEP OCT
## NO 11 16 7 9 12 12 3
## YES 1 2 2 3 3 0 0
d %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot()
# xtabs(attend ~ day_night, d)
d %>%
group_by(day_night) %>%
summarize(avgattend = mean(attend)) %>%
pull(avgattend) %>% chisq.test()
##
## Chi-squared test for given probabilities
##
## data: .
## X-squared = 10.337, df = 1, p-value = 0.001304
Since p-value is less than 0.05, the split is not uniform.
# xtabs(attend ~ day_night, d)
d %>%
group_by(skies) %>%
summarize(avgattend = mean(attend)) %>%
pull(avgattend) %>% chisq.test()
##
## Chi-squared test for given probabilities
##
## data: .
## X-squared = 107.19, df = 1, p-value < 0.00000000000000022
Skies is also an important predictor since test of uniform dist rejected that the attendance is the same on average on clear or cloudy sky days.
d %>%
# select(is.numeric)
select(attend, temp) %>%
cor() %>%
round(4)
## attend temp
## attend 1.000 0.099
## temp 0.099 1.000
d %>%
ggplot(aes(temp, attend)) +
geom_point() +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Regress attendance on month
, day of the week
, and bobblehead
promotion.
lmod <- lm(attend ~ day_of_week + month + bobblehead, data = d)
summary(lmod)
##
## Call:
## lm(formula = attend ~ day_of_week + month + bobblehead, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 0.0000000000000002 ***
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## bobbleheadYES 10714.90 2419.52 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
Is at least of not the variable month, week day, bobblehead associated with attendance?
Construct two models.
small <- update(lmod, . ~ 1)
anova(small, lmod)
## Analysis of Variance Table
##
## Model 1: attend ~ 1
## Model 2: attend ~ day_of_week + month + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5507932888
## 2 67 2509574563 13 2998358324 6.1576 0.0000002083 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Full model is much better than the small (null) model. This provides us evindence at least one of the predictors (month, weeek day, bobblehead) seems to be related to attendance.
small <- update(lmod, . ~ . -bobblehead)
anova(small, lmod)
## Analysis of Variance Table
##
## Model 1: attend ~ day_of_week + month
## Model 2: attend ~ day_of_week + month + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 68 3244161740
## 2 67 2509574563 1 734587177 19.612 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Small p-value (< 0.05) implies that full model is significantly better than the small model, so bobblehead is statistically significant.
small <- update(lmod, . ~ . - month)
anova(small, lmod)
## Analysis of Variance Table
##
## Model 1: attend ~ day_of_week + bobblehead
## Model 2: attend ~ day_of_week + month + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 73 3129721926
## 2 67 2509574563 6 620147363 2.7594 0.01858 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car)
Anova(lmod)
## Anova Table (Type II tests)
##
## Response: attend
## Sum Sq Df F value Pr(>F)
## day_of_week 575839199 6 2.5623 0.02704 *
## month 620147363 6 2.7594 0.01858 *
## bobblehead 734587177 1 19.6118 0.0000359 ***
## Residuals 2509574563 67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(lmod)["bobbleheadYES"]
## bobbleheadYES
## 10714.9
confint(lmod, "bobbleheadYES")
## 2.5 % 97.5 %
## bobbleheadYES 5885.521 15544.29
summary(lmod)
##
## Call:
## lm(formula = attend ~ day_of_week + month + bobblehead, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 0.0000000000000002 ***
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## bobbleheadYES 10714.90 2419.52 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
d %>%
mutate(fitted = fitted(lmod)) %>%
ggplot(aes(fitted, attend)) +
geom_point() +
geom_abline() +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
predict(lmod, newdata =
tibble(day_of_week = "Wednesday",
month = "JUN",
bobblehead = "YES"),
interval = "predict", level = 0.90)
## fit lwr upr
## 1 54247.32 42491.1 66003.55
Include all variables and conduct a full regression analysis of the problem. Submit your R markdown
and html
files to course homepage on moodle.