TTC delay data and Friday the 13th
Wednesday, June 21, 2017
The TTC releasing their Subway Delay Data was great news. I’m always happy to see more data released to the public. In this case, it also helps us investigate one of the great, enduring mysteries: Is Friday the 13th actually an unlucky day?
As always, we start by downloading and manipulating the data. I’ve added in two steps that aren’t strictly necessary. One is converting the Date, Time, and Day columns into a single Date column. The other is to drop most of the other columns of data, since we aren’t interested in them here.
url <- "http://www1.toronto.ca/City%20Of%20Toronto/Information%20&%20Technology/Open%20Data/Data%20Sets/Assets/Files/Subway%20&%20SRT%20Logs%20(Jan01_14%20to%20April30_17).xlsx"
filename <- basename(url)
download.file(url, destfile = filename, mode = "wb")
delays <- readxl::read_excel(filename, sheet = 2) %>%
dplyr::mutate(date = lubridate::ymd_hm(paste(`Date`,
`Time`,
sep = " ")),
delay = `Min Delay`) %>%
dplyr::select(date, delay)
delays
## # A tibble: 69,043 x 2
## date delay
## <dttm> <dbl>
## 1 2014-01-01 02:06:00 3
## 2 2014-01-01 02:40:00 0
## 3 2014-01-01 03:10:00 3
## 4 2014-01-01 03:20:00 5
## 5 2014-01-01 03:29:00 0
## 6 2014-01-01 07:31:00 0
## 7 2014-01-01 07:32:00 0
## 8 2014-01-01 07:34:00 0
## 9 2014-01-01 07:34:00 0
## 10 2014-01-01 07:53:00 0
## # ... with 69,033 more rows
Now we have a delays
dataframe with 69043 incidents starting from 2014-01-01 00:21:00 and ending at 2017-04-30 22:13:00. Before we get too far, we’ll take a look at the data. A heatmap of delays by day and hour should give us some perspective.
delays %>%
dplyr::mutate(day = lubridate::day(date),
hour = lubridate::hour(date)) %>%
dplyr::group_by(day, hour) %>%
dplyr::summarise(sum_delay = sum(delay)) %>%
ggplot2::ggplot(aes(x = hour, y = day, fill = sum_delay)) +
ggplot2::geom_tile(alpha = 0.8, color = "white") +
ggplot2::scale_fill_gradient2() +
ggplot2::theme(legend.position = "right") +
ggplot2::labs(x = "Hour", y = "Day of the month", fill = "Sum of delays")
Other than a reliable band of calm very early in the morning, no obvious patterns here.
We need to identify any days that are a Friday the 13th. We also might want to compare weekends, regular Fridays, other weekdays, and Friday the 13ths, so we add a type
column that provides these values. Here we use the case_when
function:
delays <- delays %>%
dplyr::mutate(type = case_when( # Partition into Friday the 13ths, Fridays, weekends, and weekdays
lubridate::wday(.$date) %in% c(1, 7) ~ "weekend",
lubridate::wday(.$date) %in% c(6) &
lubridate::day(.$date) == 13 ~ "Friday 13th",
lubridate::wday(.$date) %in% c(6) ~ "Friday",
TRUE ~ "weekday" # Everything else is a weekday
)) %>%
dplyr::mutate(type = factor(type)) %>%
dplyr::group_by(type)
delays
## # A tibble: 69,043 x 3
## # Groups: type [4]
## date delay type
## <dttm> <dbl> <fctr>
## 1 2014-01-01 02:06:00 3 weekday
## 2 2014-01-01 02:40:00 0 weekday
## 3 2014-01-01 03:10:00 3 weekday
## 4 2014-01-01 03:20:00 5 weekday
## 5 2014-01-01 03:29:00 0 weekday
## 6 2014-01-01 07:31:00 0 weekday
## 7 2014-01-01 07:32:00 0 weekday
## 8 2014-01-01 07:34:00 0 weekday
## 9 2014-01-01 07:34:00 0 weekday
## 10 2014-01-01 07:53:00 0 weekday
## # ... with 69,033 more rows
With the data organized, we can start with just a simple box plot of the minutes of delay by type
.
ggplot2::ggplot(delays, aes(type, delay)) +
ggplot2::geom_boxplot() +
ggplot2::labs(x = "Type", y = "Minutes of delay")
Not very compelling. Basically most delays are short (as in zero minutes long) with many outliers.
How about if we summed up the total minutes in delays for each of the types of days?
delays %>%
dplyr::summarise(total_delay = sum(delay))
## # A tibble: 4 x 2
## type total_delay
## <fctr> <dbl>
## 1 Friday 18036
## 2 Friday 13th 619
## 3 weekday 78865
## 4 weekend 28194
Clearly the total minutes of delays are much shorter for Friday the 13ths. But, there aren’t very many such days (only 6 in fact). So, this is a dubious analysis.
Let’s take a step back and calculate the average of the total delay across the entire day for each of the types of days. If Friday the 13ths really are unlucky, we would expect to see longer delays, at least relative to a regular Friday.
daily_delays <- delays %>% # Total delays in a day
dplyr::mutate(year = lubridate::year(date),
day = lubridate::yday(date)) %>%
dplyr::group_by(year, day, type) %>%
dplyr::summarise(total_delay = sum(delay))
mean_daily_delays <- daily_delays %>% # Average delays in each type of day
dplyr::group_by(type) %>%
dplyr::summarise(avg_delay = mean(total_delay))
mean_daily_delays
## # A tibble: 4 x 2
## type avg_delay
## <fctr> <dbl>
## 1 Friday 107.35714
## 2 Friday 13th 103.16667
## 3 weekday 113.63833
## 4 weekend 81.01724
ggplot2::ggplot(daily_delays, aes(type, total_delay)) +
ggplot2::geom_boxplot() +
ggplot2::labs(x = "Type", y = "Total minutes of delay")
On average, Friday the 13ths have shorter total delays (103 minutes) than either regular Fridays (107 minutes) or other weekdays (114 minutes). Overall, weekend days have far shorter total delays (81 minutes).
If Friday the 13ths are unlucky, they certainly aren’t causing longer TTC delays.
For the statisticians among you that still aren’t convinced, we’ll run a basic linear model to compare Friday the 13ths with regular Fridays. This should control for many unmeasured variables.
model <- lm(total_delay ~ type, data = daily_delays,
subset = type %in% c("Friday", "Friday 13th"))
summary(model)
##
## Call:
## lm(formula = total_delay ~ type, data = daily_delays, subset = type %in%
## c("Friday", "Friday 13th"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.357 -30.357 -6.857 18.643 303.643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.357 3.858 27.829 <2e-16 ***
## typeFriday 13th -4.190 20.775 -0.202 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50 on 172 degrees of freedom
## Multiple R-squared: 0.0002365, Adjusted R-squared: -0.005576
## F-statistic: 0.04069 on 1 and 172 DF, p-value: 0.8404
Definitely no statistical support for the idea that Friday the 13ths cause longer TTC delays.
How about time series tests, like anomaly detections? Seems like we’d just be getting carried away. Part of the art of statistics is knowing when to quit.
In conclusion, then, after likely far too much analysis, we find no evidence that Friday the 13ths cause an increase in the length of TTC delays. This certainly suggests that Friday the 13ths are not unlucky in any meaningful way, at least for TTC riders.
Glad we could put this superstition to rest!