My iPhone Home Screen

My goal for the home screen is to stay focused on action by making it easy to quickly capture my intentions and to minimize distractions. With previous setups I often found that I’d unlock the phone, be confronted by a screen full of apps with notification badges, and promptly forget what I had intended to do. So, I’ve reduced my home screen to just two apps.

iPhone home screen

Drafts is on the right and is likely my most frequently used app. As the tag line for the app says, this is where text starts. Rather than searching for a specific app, launching it, and then typing, Drafts always opens up to a blank text field. Then I type whatever is on my mind and send it from Drafts to the appropriate app. So, text messages, emails, todos, meeting notes, and random ideas all start in Drafts. Unfortunately my corporate iPhone blocks iCloud Drive, so I can’t use Drafts to share notes across my devices. Anything that I want to keep gets moved into Apple Notes.

Things is on the left and is currently my favoured todo app. All of my tasks, projects, and areas of focus are in there, tagged by context, and given due dates, if appropriate. If the Things app has a notification badge, then I’ve got work to do today. If you’re keen, The Sweet Setup has a great course on Things.

A few more notes on my setup:

  • If Drafts isn’t the right place to start, I just pull down from the home screen to activate search and find the right app. I’ve found that the Siri Suggestions are often what I’m looking for (based on time of day and other context).
  • Some apps are more important for their output than input. These include calendar, weather, and notes. I’ve set these up as widgets in the Today View. A quick slide to the right reveals these.
  • I interact with several other apps through notifications, particularly for communication with Messages and Mail. But, I’ve set up VIPs in Mail to reduce these notifications to just the really important people.

I’ve been using this setup for a few months now and it certainly works for me. Even if this isn’t quite right for you, I’d encourage you to take a few minutes to really think through how you interact with your phone. I see far too many people with the default settings spending too much time scrolling around on their phones looking for the right app.

Committing to a Fitness Challenge and Then Figuring Out if It Is Achievable

My favourite spin studio has put on a fitness challenge for 2019. It has many components, one of which is improving your performance by 3% over six weeks. I’ve taken on the challenge and am now worried that I don’t know how reasonable this increase actually is. So, a perfect excuse to extract my metrics and perform some excessive analysis.

We start by importing a CSV file of my stats, taken from Torq’s website. We use readr for this and set the col_types in advance to specify the type for each column, as well as skip some extra columns with col_skip. I find doing this at the same time as the import, rather than later in the process, more direct and efficient. I also like the flexibility of the col_date import, where I can convert the source data (e.g., “Mon 11/02/19 - 6:10 AM”) into a format more useful for analysis (e.g., “2019-02-11”).

One last bit of clean up is to specify the instructors that have led 5 or more sessions. Later, we’ll try to identify an instructor-effect on my performance and only 5 of the 10 instructors have sufficient data for such an analysis. The forcats::fct_other function is really handy for this: it collapses several factor levels together into one “Other” level.
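
As a quick, hypothetical illustration of what fct_other does (the extra names here are made up and aren’t my actual instructors):

# Toy example only: levels not listed in `keep` are collapsed into a single "Other" level
x <- factor(c("Tanis", "George", "Sam", "Priya")) # "Sam" and "Priya" are invented names
forcats::fct_other(x, keep = c("Tanis", "George"))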

Lastly, I set the challenge_start_date variable for use later on.

col_types = cols(
  Date = col_date(format = "%a %d/%m/%y - %H:%M %p"),
  Ride = col_skip(),
  Instructor = col_character(),
  Spot = col_skip(),
  `Avg RPM` = col_skip(),
  `Max RPM` = col_skip(),
  `Avg Power` = col_double(),
  `Max Power` = col_double(),
  `Total Energy` = col_double(),
  `Dist. (Miles)` = col_skip(),
  Cal. = col_skip(),
  `Class Rank` = col_skip()
)
main_instructors <- c("Tanis", "George", "Charlotte", "Justine", "Marawan") # 5 or more sessions
data <- readr::read_csv("torq_data.csv", col_types = col_types) %>% 
  tibbletime::as_tbl_time(index = Date) %>% 
  dplyr::rename(avg_power    = `Avg Power`,
                max_power    = `Max Power`,
                total_energy = `Total Energy`) %>% 
  dplyr::mutate(Instructor = 
                  forcats::fct_other(Instructor, keep = main_instructors))
challenge_start_date <- lubridate::as_date("2019-01-01")
head(data)
## # A time tibble: 6 x 5
## # Index: Date
##   Date       Instructor avg_power max_power total_energy
##   <date>     <fct>          <dbl>     <dbl>        <dbl>
## 1 2019-02-11 Justine          234       449          616
## 2 2019-02-08 George           221       707          577
## 3 2019-02-04 Justine          230       720          613
## 4 2019-02-01 George           220       609          566
## 5 2019-01-21 Justine          252       494          623
## 6 2019-01-18 George           227       808          590

To start, we just plot the change in average power over time. Given relatively high variability from class to class, we add a smoothed line to show the overall trend. We also mark the point where the challenge starts with a red line.

ggplot2::ggplot(data, aes(x = Date, y = avg_power)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::geom_smooth(method = "loess") +
  ggplot2::ylab("Average power") +
  ggplot2::geom_vline(xintercept = challenge_start_date, col = "red")

Overall, it looks like I made some steady progress when I started, plateaued in the summer (when I didn’t ride as often), and then began a slow, steady climb in the fall. Unfortunately for me, it also looks like my improvement started to plateau just in time for the challenge.

We’ll start a more formal analysis by just testing to see what my average improvement is over time.

time_model <- lm(avg_power ~ Date, data = data)
summary(time_model)
## 
## Call:
## lm(formula = avg_power ~ Date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.486 -11.356  -1.879  11.501  33.838 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.032e+03  3.285e+02  -6.186 4.85e-08 ***
## Date         1.264e-01  1.847e-02   6.844 3.49e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.65 on 64 degrees of freedom
## Multiple R-squared:  0.4226, Adjusted R-squared:  0.4136 
## F-statistic: 46.85 on 1 and 64 DF,  p-value: 3.485e-09

Based on this, we can see that my performance is improving over time and the R² is decent (about 0.42). But, interpreting the coefficient for time isn’t entirely intuitive and isn’t really the point of the challenge. The question is: how much higher is my average power during the challenge than before? For that, we’ll set up a “dummy variable” based on the date of the class. Then we use dplyr to group by this variable and estimate the mean of the average power in both time periods.

data %<>% 
  dplyr::mutate(in_challenge = ifelse(Date > challenge_start_date, 1, 0)) 
(change_in_power <- data %>% 
  dplyr::group_by(in_challenge) %>% 
  dplyr::summarize(mean_avg_power = mean(avg_power)))
## # A tibble: 2 x 2
##   in_challenge mean_avg_power
##          <dbl>          <dbl>
## 1            0           213.
## 2            1           231.

So, I’ve improved from an average power of 213 to 231, an improvement of about 8%. A great relief: I’ve comfortably exceeded the 3% target!
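
As a quick check on that arithmetic (a small sketch using the change_in_power summary computed above):

# Relative improvement implied by the two group means above (roughly 8%)
with(change_in_power,
     100 * (mean_avg_power[in_challenge == 1] - mean_avg_power[in_challenge == 0]) /
       mean_avg_power[in_challenge == 0])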

Of course, having gone to all of this (excessive?) trouble, now I’m interested in seeing if the instructor leading the class has a significant impact on my results. I certainly feel as if some instructors are harder than others. But, is this supported by my actual results? As always, we start with a plot to get a visual sense of the data.

ggplot2::ggplot(data, aes(x = Date, y = avg_power)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::geom_smooth(method = "lm", se = FALSE) +
  ggplot2::facet_wrap(~Instructor) +
  ggplot2::ylab("Average power")

There’s a mix of confirmation and rejection of my expectations here:

  • Charlotte got me started and there’s a clear trend of increasing power as I figure out the bikes and gain fitness
  • George’s class is always fun, but my progress isn’t as high as I’d like (indicated by the shallower slope). This is likely my fault though: George’s rides are almost always Friday morning and I’m often tired by then and don’t push myself as hard as I could
  • Justine hasn’t led many of my classes, but my two best results are with her. That said, my last two rides with her haven’t been great
  • Marawan’s classes always feel really tough. He’s great at motivation and I always feel like I’ve pushed really hard in his classes. You can sort of see this with the relatively higher position of the best fit line for his classes. But, I also had one really poor class with him. This seems to coincide with some poor classes with both Tanis and George though. So, I was likely a bit sick then. Also, for Marawan, I’m certain I’ve had more classes with him (each one is memorably challenging), as he’s substituted for other instructors several times. Looks like Torq’s tracker doesn’t adjust the name of the instructor to match this substitution though
  • Tanis’ classes are usually tough too. She’s relentless about increasing resistance and looks like I was improving well with her
  • Other isn’t really worth describing in any detail, as it is a mix of 5 other instructors each with different styles.

Having eye-balled the charts, let’s now look at a statistical model that considers just the instructors.

instructor_model <- update(time_model, . ~ Instructor)
summary(instructor_model)
## 
## Call:
## lm(formula = avg_power ~ Instructor, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.167 -10.590   0.663  13.540  32.000 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        194.000      6.123  31.683  < 2e-16 ***
## InstructorGeorge    23.840      7.141   3.339 0.001451 ** 
## InstructorJustine   33.167      9.682   3.426 0.001112 ** 
## InstructorMarawan   28.200     10.246   2.752 0.007817 ** 
## InstructorTanis     31.833      8.100   3.930 0.000222 ***
## InstructorOther     15.667      8.660   1.809 0.075433 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.37 on 60 degrees of freedom
## Multiple R-squared:  0.2542, Adjusted R-squared:  0.1921 
## F-statistic: 4.091 on 5 and 60 DF,  p-value: 0.002916

Each of the named instructors is associated with significantly higher average power than the baseline level (Charlotte, the reference category in this model). However, the overall R² (0.25) is much lower than for the model that considered just time (0.42). So, our next step is to consider both together.

time_instructor_model <- update(time_model, . ~ . + Instructor)
summary(time_instructor_model)
## 
## Call:
## lm(formula = avg_power ~ Date + Instructor, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.689 -11.132   1.039  10.800  26.267 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2332.4255   464.3103  -5.023 5.00e-06 ***
## Date                  0.1431     0.0263   5.442 1.07e-06 ***
## InstructorGeorge     -2.3415     7.5945  -0.308    0.759    
## InstructorJustine    10.7751     8.9667   1.202    0.234    
## InstructorMarawan    12.5927     8.9057   1.414    0.163    
## InstructorTanis       3.3827     8.4714   0.399    0.691    
## InstructorOther      12.3270     7.1521   1.724    0.090 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.12 on 59 degrees of freedom
## Multiple R-squared:  0.5034, Adjusted R-squared:  0.4529 
## F-statistic: 9.969 on 6 and 59 DF,  p-value: 1.396e-07
anova(time_instructor_model, time_model)
## Analysis of Variance Table
## 
## Model 1: avg_power ~ Date + Instructor
## Model 2: avg_power ~ Date
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     59 13481                           
## 2     64 15675 -5   -2193.8 1.9202 0.1045

This model has the best explanatory power so far and increases the coefficient for time slightly, which suggests that controlling for instructor sharpens the time signal. However, once time is in the model, none of the individual instructor coefficients is significantly different from the baseline. My interpretation is that my overall improvement over time matters much more than which particular instructor is leading the class. Nonetheless, the earlier look at the instructors gave me some insights that I can use to get the most out of each of them. And when we compare these two models with an ANOVA, there isn’t a significant difference between them.

The last model to try is to look for a time by instructor interaction.

time_instructor_interaction_model <- update(time_model, . ~ . * Instructor)
summary(time_instructor_interaction_model)
## 
## Call:
## lm(formula = avg_power ~ Date + Instructor + Date:Instructor, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.328  -9.899   0.963  10.018  26.128 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.881e+03  1.598e+03  -3.680 0.000540 ***
## Date                    3.442e-01  9.054e-02   3.801 0.000368 ***
## InstructorGeorge        4.225e+03  1.766e+03   2.393 0.020239 *  
## InstructorJustine       3.704e+03  1.796e+03   2.062 0.044000 *  
## InstructorMarawan       4.177e+03  2.137e+03   1.955 0.055782 .  
## InstructorTanis         1.386e+03  2.652e+03   0.523 0.603313    
## InstructorOther         3.952e+03  2.321e+03   1.703 0.094362 .  
## Date:InstructorGeorge  -2.391e-01  9.986e-02  -2.394 0.020149 *  
## Date:InstructorJustine -2.092e-01  1.016e-01  -2.059 0.044291 *  
## Date:InstructorMarawan -2.357e-01  1.207e-01  -1.953 0.056070 .  
## Date:InstructorTanis   -7.970e-02  1.492e-01  -0.534 0.595316    
## Date:InstructorOther   -2.231e-01  1.314e-01  -1.698 0.095185 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.86 on 54 degrees of freedom
## Multiple R-squared:  0.5609, Adjusted R-squared:  0.4715 
## F-statistic: 6.271 on 11 and 54 DF,  p-value: 1.532e-06
anova(time_model, time_instructor_interaction_model)
## Analysis of Variance Table
## 
## Model 1: avg_power ~ Date
## Model 2: avg_power ~ Date + Instructor + Date:Instructor
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     64 15675                           
## 2     54 11920 10    3754.1 1.7006 0.1044

We can see that there are some significant interactions, meaning that the slope of improvement over time does differ by instructor. Before getting too excited though, an ANOVA shows that this model isn’t any better than the simple model of just time. There’s always a risk with trying to explain main effects (like time) with interactions. The story here is really that we need more data to tease out the impacts of the instructor by time effect.

The last point to make is that we’ve focused so far on the average power, since that’s the metric for this fitness challenge. There could be interesting interactions among average power, maximum power, RPMs, and total energy, each of which is available in these data. I’ll return to that analysis some other time.

In the interests of science, I’ll keep going to these classes, just so we can figure out what impact instructors have on performance. In the meantime, looks like I’ll succeed with at least this component of the fitness challenge and I have the stats to prove it.

Spatial analysis of votes in Toronto

This is a “behind the scenes” elaboration of the geospatial analysis in our recent post on evaluating our predictions for the 2018 mayoral election in Toronto. This was my first serious use of the new sf package for geospatial analysis. I found the package much easier to use than some of my previous workflows for this sort of analysis, especially given its integration with the tidyverse.

We start by downloading the shapefile for voting locations from the City of Toronto’s Open Data portal and reading it with the read_sf function. Then, we pipe it to st_transform to set the appropriate projection for the data. In this case, this isn’t strictly necessary, since the shapefile is already in the right projection. But, I tend to do this for all shapefiles to avoid any oddities later.

download.file("[opendata.toronto.ca/gcc/votin...](http://opendata.toronto.ca/gcc/voting_location_2018_wgs84.zip)",
                destfile = "data-raw/voting_location_2018_wgs84.zip", quiet = TRUE)
unzip("data-raw/voting_location_2018_wgs84.zip", exdir="data-raw/voting_location_2018_wgs84")
toronto_locations <- sf::read_sf("data-raw/voting_location_2018_wgs84", 
                                 layer = "VOTING_LOCATION_2018_WGS84") %>%
  sf::st_transform(crs = "+init=epsg:4326")
toronto_locations
## Simple feature collection with 1700 features and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -79.61937 ymin: 43.59062 xmax: -79.12531 ymax: 43.83052
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,700 x 14
##    POINT_ID FEAT_CD FEAT_C_DSC PT_SHRT_CD PT_LNG_CD POINT_NAME VOTER_CNT
##                                      
##  1    10190 P       Primary    056        10056                   37
##  2    10064 P       Primary    060        10060                  532
##  3    10999 S       Secondary  058        10058     Malibu           661
##  4    11342 P       Primary    052        10052                 1914
##  5    10640 P       Primary    047        10047     The Summit       956
##  6    10487 S       Secondary  061        04061     White Eag…        51
##  7    11004 P       Primary    063        04063     Holy Fami…      1510
##  8    11357 P       Primary    024        11024     Rosedale …      1697
##  9    12044 P       Primary    018        05018     Weston Pu…      1695
## 10    11402 S       Secondary  066        04066     Elm Grove…        93
## # ... with 1,690 more rows, and 7 more variables: OBJECTID ,
## #   ADD_FULL , X , Y , LONGITUDE , LATITUDE ,
## #   geometry 

The file has 1700 rows of data across 14 columns. The first 13 columns are data within the original shapefile. The last column is a list column that is added by sf and contains the geometry of the location. This specific design feature is what makes an sf object work really well with the rest of the tidyverse: the geographical details are just a column in the data frame. This makes the data much easier to work with than in other approaches, where the data is contained within an @data slot of an object.
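
For example, ordinary dplyr verbs work directly on the sf object and the geometry column tags along automatically (a minimal sketch using columns from the print-out above):

toronto_locations %>%
  dplyr::select(PT_LNG_CD, VOTER_CNT) %>% # geometry is "sticky" and stays attached
  head(3)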

Plotting the data is straightforward, since sf objects have a plot function. Here’s an example where we plot the number of voters (VOTER_CNT) at each location. If you squint just right, you can see the general outline of Toronto in these points.
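
A minimal version of that call, subsetting the sf object to the VOTER_CNT column and handing it to plot:

plot(toronto_locations["VOTER_CNT"])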

What we want to do next is use the voting location data to aggregate the votes cast at each location into census tracts. This then allows us to associate census characteristics (like age and income) with the pattern of votes and develop our statistical relationships for predicting voter behaviour.

We’ll split this into several steps. The first is downloading and reading the census tract shapefile.

download.file("[www12.statcan.gc.ca/census-re...](http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/gct_000b11a_e.zip)",
                destfile = "data-raw/gct_000b11a_e.zip", quiet = TRUE)
unzip("data-raw/gct_000b11a_e.zip", exdir="data-raw/gct")
census_tracts <- sf::read_sf("data-raw/gct", layer = "gct_000b11a_e") %>%
  sf::st_transform(crs = "+init=epsg:4326")

Now that we have it, all we really want are the census tracts in Toronto (the shapefile includes census tracts across Canada). We achieve this by intersecting the Toronto voting locations with the census tracts using standard R subsetting notation.

to_census_tracts <- census_tracts[toronto_locations,]

And, we can plot it to see how well the intersection worked. This time we’ll plot the CTUID, which is the unique identifier for each census tract. This doesn’t mean anything in this context, but adds some nice colour to the plot.

plot(to_census_tracts["CTUID"])

Now you can really see the shape of Toronto, as well as the size of each census tract.

Next we need to manipulate the voting data to get the votes received by major candidates in the 2018 election. We take these data from the toVotes package and arbitrarily define major candidates as those receiving more than 100,000 votes across the city. This yields our two main candidates: John Tory and Jennifer Keesmaat.

major_candidates <- toVotes %>% 
  dplyr::filter(year == 2018) %>%
  dplyr::group_by(candidate) %>%
  dplyr::summarise(votes = sum(votes)) %>%
  dplyr::filter(votes > 100000) %>%
  dplyr::select(candidate)
major_candidates
## # A tibble: 2 x 1
##   candidate        
##               
## 1 Keesmaat Jennifer
## 2 Tory John

Given our goal of aggregating the votes received by each candidate into census tracts, we need a data frame that has each candidate in a separate column. We start by joining the major candidates table to the votes table. In this case, we also filter the votes to 2018, since John Tory has been a candidate in more than one election. Then we use the tidyr package to convert the table from long (with one candidate column) to wide (with a column for each candidate).

spread_votes <- toVotes %>%
  tibble::as.tibble() %>% 
  dplyr::filter(year == 2018) %>%
  dplyr::right_join(major_candidates) %>%
  dplyr::select(-type, -year) %>%
  tidyr::spread(candidate, votes)
spread_votes
## # A tibble: 1,800 x 4
##     ward  area `Keesmaat Jennifer` `Tory John`
##                           
##  1     1     1                  66         301
##  2     1     2                  72         376
##  3     1     3                  40         255
##  4     1     4                  24         161
##  5     1     5                  51         211
##  6     1     6                  74         372
##  7     1     7                  70         378
##  8     1     8                  10          77
##  9     1     9                  11          82
## 10     1    10                  12          76
## # ... with 1,790 more rows

Our last step before finally aggregating to census tracts is to join the spread_votes table with the toronto_locations data. This requires pulling the ward and area identifiers from the PT_LNG_CD column of the toronto_locations data frame which we do with some stringr functions. While we’re at it, we also update the candidate names to just surnames.

to_geo_votes <- toronto_locations %>%
  dplyr::mutate(ward = as.integer(stringr::str_sub(PT_LNG_CD, 1, 2)),
                area = as.integer(stringr::str_sub(PT_LNG_CD, -3, -1))) %>%
  dplyr::left_join(spread_votes) %>%
  dplyr::select(`Keesmaat Jennifer`, `Tory John`) %>%
  dplyr::rename(Keesmaat = `Keesmaat Jennifer`, Tory = `Tory John`)
to_geo_votes
## Simple feature collection with 1700 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -79.61937 ymin: 43.59062 xmax: -79.12531 ymax: 43.83052
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,700 x 3
##    Keesmaat  Tory             geometry
##                  
##  1       55   101 (-79.40225 43.63641)
##  2       43    53  (-79.3985 43.63736)
##  3       61    89 (-79.40028 43.63664)
##  4      296   348 (-79.39922 43.63928)
##  5      151   204 (-79.40401 43.64293)
##  6        2    11 (-79.43941 43.63712)
##  7      161   174 (-79.43511 43.63869)
##  8      221   605  (-79.38188 43.6776)
##  9       68   213 (-79.52078 43.70176)
## 10        4    14 (-79.43094 43.64027)
## # ... with 1,690 more rows

Okay, we’re finally there. We have our census tract data in to_census_tracts and our voting data in to_geo_votes. We want to aggregate the votes into each census tract by summing the votes at each voting location within each census tract. We use the aggregate function for this.

ct_votes_wide <- aggregate(x = to_geo_votes, 
                           by = to_census_tracts, 
                           FUN = sum)
ct_votes_wide
## Simple feature collection with 522 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -79.6393 ymin: 43.58107 xmax: -79.11547 ymax: 43.85539
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    Keesmaat Tory                       geometry
## 1       407  459 MULTIPOLYGON (((-79.39659 4...
## 2      1103 2400 MULTIPOLYGON (((-79.38782 4...
## 3       403 1117 MULTIPOLYGON (((-79.27941 4...
## 4        96  570 MULTIPOLYGON (((-79.28156 4...
## 5       169  682 MULTIPOLYGON (((-79.49104 4...
## 6       121  450 MULTIPOLYGON (((-79.54422 4...
## 7        42   95 MULTIPOLYGON (((-79.45379 4...
## 8       412  512 MULTIPOLYGON (((-79.33895 4...
## 9        83  132 MULTIPOLYGON (((-79.35427 4...
## 10       92  629 MULTIPOLYGON (((-79.44554 4...

As a last step, to tidy up, we now convert the wide table with a column for each candidate into a long table that has just one candidate column containing the name of the candidate.

ct_votes <- tidyr::gather(ct_votes_wide, candidate, votes, -geometry)
ct_votes
## Simple feature collection with 1044 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -79.6393 ymin: 43.58107 xmax: -79.11547 ymax: 43.85539
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    candidate votes                       geometry
## 1   Keesmaat   407 MULTIPOLYGON (((-79.39659 4...
## 2   Keesmaat  1103 MULTIPOLYGON (((-79.38782 4...
## 3   Keesmaat   403 MULTIPOLYGON (((-79.27941 4...
## 4   Keesmaat    96 MULTIPOLYGON (((-79.28156 4...
## 5   Keesmaat   169 MULTIPOLYGON (((-79.49104 4...
## 6   Keesmaat   121 MULTIPOLYGON (((-79.54422 4...
## 7   Keesmaat    42 MULTIPOLYGON (((-79.45379 4...
## 8   Keesmaat   412 MULTIPOLYGON (((-79.33895 4...
## 9   Keesmaat    83 MULTIPOLYGON (((-79.35427 4...
## 10  Keesmaat    92 MULTIPOLYGON (((-79.44554 4...

Now that we have votes aggregated by census tract, we can add in many other attributes from the census data. We won’t do that here, since this post is already pretty long. But, we’ll end with a plot to show how easily sf integrates with ggplot2. This is a nice improvement from past workflows, when several steps were required. In the actual code for the retrospective analysis, I added some other plotting techniques, like cutting the response variable (votes) into equally spaced pieces and adding some more refined labels. Here, we’ll just produce a simple plot.

ggplot2::ggplot(data = ct_votes) +
  ggplot2::geom_sf(ggplot2::aes(fill = votes)) +
  ggplot2::facet_wrap(~candidate)
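
For reference, the binned approach mentioned above would look roughly like this (a sketch: ggplot2::cut_interval splits votes into equal-width bins, and the choice of 5 bins here is arbitrary):

ggplot2::ggplot(data = ct_votes) +
  ggplot2::geom_sf(ggplot2::aes(fill = ggplot2::cut_interval(votes, n = 5))) + # discrete fill by vote band
  ggplot2::facet_wrap(~candidate) +
  ggplot2::labs(fill = "Votes")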

Fixing a hack finds a better solution

In my Elections Ontario official results post, I had to use an ugly hack to match Electoral District names and numbers by extracting data from a drop down list on the Find My Electoral District website. Although it was mildly clever, like any hack, I shouldn’t have relied on this one for long, as proven by Elections Ontario shutting down the website.

So, a more robust solution was required, which led to using one of Elections Ontario’s shapefiles. The shapefile contains the data we need; it’s just in a tricky format to deal with. But, the sf package makes this mostly straightforward.

We start by downloading and importing the Elections Ontario shape file. Then, since we’re only interested in the City of Toronto boundaries, we download the city’s shapefile too and intersect it with the provincial one to get a subset:

download.file("[www.elections.on.ca/content/d...](https://www.elections.on.ca/content/dam/NGW/sitecontent/2016/preo/shapefiles/Polling%20Division%20Shapefile%20-%202014%20General%20Election.zip)", 
              destfile = "data-raw/Polling%20Division%20Shapefile%20-%202014%20General%20Election.zip")
unzip("data-raw/Polling%20Division%20Shapefile%20-%202014%20General%20Election.zip", 
      exdir = "data-raw/Polling%20Division%20Shapefile%20-%202014%20General%20Election")

prov_geo <- sf::st_read("data-raw/Polling%20Division%20Shapefile%20-%202014%20General%20Election", 
                        layer = "PDs_Ontario") %>%
  sf::st_transform(crs = "+init=epsg:4326")

download.file("[opendata.toronto.ca/gcc/votin...](http://opendata.toronto.ca/gcc/voting_location_2014_wgs84.zip)",
              destfile = "data-raw/voting_location_2014_wgs84.zip")
unzip("data-raw/voting_location_2014_wgs84.zip", exdir="data-raw/voting_location_2014_wgs84")
toronto_wards <- sf::st_read("data-raw/voting_location_2014_wgs84", layer = "VOTING_LOCATION_2014_WGS84") %>%
  sf::st_transform(crs = "+init=epsg:4326")

to_prov_geo <- prov_geo %>%
  sf::st_intersection(toronto_wards)

Now we just need to extract a couple of columns from the data frame associated with the shapefile. Then we process the values a bit so that they match the format of other data sets. This includes converting them to UTF-8, formatting as title case, and replacing dashes with spaces:

electoral_districts <- to_prov_geo %>%
  dplyr::transmute(electoral_district = as.character(DATA_COMPI),
                   electoral_district_name = stringr::str_to_title(KPI04)) %>%
  dplyr::group_by(electoral_district, electoral_district_name) %>%
  dplyr::count() %>%
  dplyr::ungroup() %>%
  dplyr::mutate(electoral_district_name = stringr::str_replace_all(utf8::as_utf8(electoral_district_name), "\u0097", " ")) %>%
  dplyr::select(electoral_district, electoral_district_name)
electoral_districts
## Simple feature collection with 23 features and 2 fields
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: -79.61919 ymin: 43.59068 xmax: -79.12511 ymax: 43.83057
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 23 x 3
##    electoral_distri… electoral_distric…                           geometry
##                                                 
##  1 005               Beaches East York  (-79.32736 43.69452, -79.32495 43…
##  2 015               Davenport          (-79.4605 43.68283, -79.46003 43.…
##  3 016               Don Valley East    (-79.35985 43.78844, -79.3595 43.…
##  4 017               Don Valley West    (-79.40592 43.75026, -79.40524 43…
##  5 020               Eglinton Lawrence  (-79.46787 43.70595, -79.46376 43…
##  6 023               Etobicoke Centre   (-79.58697 43.6442, -79.58561 43.…
##  7 024               Etobicoke Lakesho… (-79.56213 43.61001, -79.5594 43.…
##  8 025               Etobicoke North    (-79.61919 43.72889, -79.61739 43…
##  9 068               Parkdale High Park (-79.49944 43.66285, -79.4988 43.…
## 10 072               Pickering Scarbor… (-79.18898 43.80374, -79.17927 43…
## # ... with 13 more rows

In the end, this is a much more reliable solution, though it seems a bit extreme to use GIS techniques just to get a listing of Electoral District names and numbers.

The commit with most of these changes in toVotes is here.

Elections Ontario official results

In preparing for some PsephoAnalytics work on the upcoming provincial election, I’ve been wrangling the Elections Ontario data. As provided, the data is really difficult to work with and we’ll walk through some steps to tidy these data for later analysis.

Here’s what the source data looks like:

Screenshot of raw Elections Ontario data

A few problems with this:

  1. The data is scattered across a hundred different Excel files
  2. Candidates are in columns with their last name as the header
  3. Last names are not unique across all Electoral Districts, so can’t be used as a unique identifier
  4. Electoral District names are in a row, followed by a separate row for each poll within the district
  5. The party affiliation for each candidate isn’t included in the data

So, we have a fair bit of work to do to get to something more useful. Ideally something like:

## # A tibble: 9 x 5
##   electoral_district  poll candidate   party votes
##                <chr> <chr>     <chr>   <chr> <int>
## 1                  X     1         A Liberal    37
## 2                  X     2         B     NDP    45
## 3                  X     3         C      PC    33
## 4                  Y     1         A Liberal    71
## 5                  Y     2         B     NDP    37
## 6                  Y     3         C      PC    69
## 7                  Z     1         A Liberal    28
## 8                  Z     2         B     NDP    15
## 9                  Z     3         C      PC    34

This is much easier to work with: we have one row for the votes received by each candidate at each poll, along with the Electoral District name and their party affiliation.

Candidate parties

As a first step, we need the party affiliation for each candidate. I didn’t see this information on the Elections Ontario site. So, we’ll pull the data from Wikipedia. The data on this webpage isn’t too bad. We can just select all of the table nodes on the page and then drop the ones we aren’t interested in.

candidate_webpage <- "https://en.wikipedia.org/wiki/Ontario_general_election,_2014#Candidates_by_region"
candidate_tables <- "table" # Selector for the tables on the page

candidates <- xml2::read_html(candidate_webpage) %>%
  rvest::html_nodes(candidate_tables) %>% # Pull tables from the wikipedia entry
  .[13:25] %>% # Drop unnecessary tables
  rvest::html_table(fill = TRUE)

This gives us a list of 13 data frames, one for each regional table of candidates on the webpage. Now we cycle through each of these and stack them into one data frame. Unfortunately, the tables aren’t consistent in the number of columns, so the approach is a bit messy and we process each one in a loop.

# Setup empty dataframe to store results
candidate_parties <- tibble::tibble(
  electoral_district_name = character(),
  party = character(),
  candidate = character()
)

for(i in seq_along(candidates)) { # Messy, but works
  this_table <- candidates[[i]]
  # The header spans mess up the header row, so renaming
  names(this_table) <- c(this_table[1, -c(3, 4)], "NA", "Incumbent")
  # Get rid of the blank spacer columns
  this_table <- this_table[-1, ]
  # Drop the NA columns by keeping only odd columns
  this_table <- this_table[, seq(from = 1, to = dim(this_table)[2], by = 2)]
  this_table %<>%
    tidyr::gather(party, candidate, -`Electoral District`) %>%
    dplyr::rename(electoral_district_name = `Electoral District`) %>%
    dplyr::filter(party != "Incumbent")
  candidate_parties <- dplyr::bind_rows(candidate_parties, this_table)
}
candidate_parties

## # A tibble: 649 x 3
##       electoral_district_name   party            candidate
##                         <chr>   <chr>                <chr>
##  1 Carleton—Mississippi Mills Liberal      Rosalyn Stevens
##  2            Nepean—Carleton Liberal           Jack Uppal
##  3              Ottawa Centre Liberal          Yasir Naqvi
##  4             Ottawa—Orléans Liberal Marie-France Lalonde
##  5               Ottawa South Liberal          John Fraser
##  6              Ottawa—Vanier Liberal   Madeleine Meilleur
##  7         Ottawa West—Nepean Liberal        Bob Chiarelli
##  8 Carleton—Mississippi Mills      PC        Jack MacLaren
##  9            Nepean—Carleton      PC         Lisa MacLeod
## 10              Ottawa Centre      PC           Rob Dekker
## # ... with 639 more rows

Electoral district names

One issue with pulling party affiliations from Wikipedia is that candidates are organized by Electoral District names. But the voting results are organized by Electoral District number. I couldn’t find an appropriate resource on the Elections Ontario site. Rather, here we pull the names and numbers of the Electoral Districts from the Find My Electoral District website. The xpath selector is a bit tricky for this one. The ed_xpath object below actually pulls content from the drop down list that appears when you choose an Electoral District. One nuisance with these data is that Elections Ontario uses an en dash (–) in the Electoral District names, instead of the em dash (—) used on Wikipedia. We use str_replace_all to fix this below.

ed_webpage <- "https://www3.elections.on.ca/internetapp/FYED_Error.aspx?lang=en-ca"
ed_xpath <- "//*[(@id = \"ddlElectoralDistricts\")]" # Use an xpath selector to get the drop down list by ID

electoral_districts <- xml2::read_html(ed_webpage) %>%
  rvest::html_node(xpath = ed_xpath) %>%
  rvest::html_nodes("option") %>%
  rvest::html_text() %>%
  .[-1] %>% # Drop the first item on the list ("Select...")
  tibble::as.tibble() %>% # Convert to a data frame and split into ID number and name
  tidyr::separate(value, c("electoral_district", "electoral_district_name"),
                  sep = " ", extra = "merge") %>%
  # Clean up district names for later matching and presentation
  dplyr::mutate(electoral_district_name = stringr::str_to_title(
    stringr::str_replace_all(electoral_district_name, "–", "—")))
electoral_districts

```
## # A tibble: 107 x 2
##    electoral_district              electoral_district_name
##    <chr>                           <chr>
##  1                001                       Ajax—Pickering
##  2                002                    Algoma—Manitoulin
##  3                003 Ancaster—Dundas—Flamborough—Westdale
##  4                004                               Barrie
##  5                005                    Beaches—East York
##  6                006                 Bramalea—Gore—Malton
##  7                007                  Brampton—Springdale
##  8                008                        Brampton West
##  9                009                                Brant
## 10                010                Bruce—Grey—Owen Sound
## # ... with 97 more rows
```

Next, we can join the party affiliations to the Electoral District names to join candidates to parties and district numbers.

candidate_parties %<>%
  # These three lines are cleaning up hyphens and dashes, seems overly complicated
  dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name, "—\n", "—")) %>%
  dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name,
                                                                   "Chatham-Kent—Essex",
                                                                   "Chatham—Kent—Essex")) %>%
  dplyr::mutate(electoral_district_name = stringr::str_to_title(electoral_district_name)) %>%
  dplyr::left_join(electoral_districts) %>%
  dplyr::filter(!candidate == "") %>%
  # Since the vote data are identified by last names, we split candidates' names into first and last
  tidyr::separate(candidate, into = c("first", "candidate"), extra = "merge", remove = TRUE) %>%
  dplyr::select(-first)
## Joining, by = "electoral_district_name"
candidate_parties

## # A tibble: 578 x 4
##       electoral_district_name   party      candidate electoral_district
##  *                      <chr>   <chr>          <chr>              <chr>
##  1 Carleton—Mississippi Mills Liberal        Stevens                013
##  2            Nepean—Carleton Liberal          Uppal                052
##  3              Ottawa Centre Liberal          Naqvi                062
##  4             Ottawa—Orléans Liberal France Lalonde                063
##  5               Ottawa South Liberal         Fraser                064
##  6              Ottawa—Vanier Liberal       Meilleur                065
##  7         Ottawa West—Nepean Liberal      Chiarelli                066
##  8 Carleton—Mississippi Mills      PC       MacLaren                013
##  9            Nepean—Carleton      PC        MacLeod                052
## 10              Ottawa Centre      PC         Dekker                062
## # ... with 568 more rows

All that work just to get the name of each candidate for each Electoral District name and number, plus their party affiliation.

Votes

Now we can finally get to the actual voting data. These are made available as a collection of Excel files in a compressed folder. To avoid downloading it more than once, we wrap the call in an if statement that first checks to see if we already have the file. We also rename the file to something more manageable.

raw_results_file <- "www.elections.on.ca/content/d…"

zip_file <- "data-raw/Poll%20by%20Poll%20Results%20-%20Excel.zip"
if(file.exists(zip_file)) { # Only download the data once
  # File exists, so nothing to do
} else {
  download.file(raw_results_file, destfile = zip_file)
  unzip(zip_file, exdir = "data-raw") # Extract the data into data-raw
  file.rename("data-raw/GE Results - 2014 (unconverted)", "data-raw/pollresults")
}

## NULL

Now we need to extract the votes out of 107 Excel files. The combination of the purrr and readxl packages is great for this. In case we want to filter to just a few of the files (perhaps to target a range of Electoral Districts), we declare a file_pattern. For now, we just set it to any xls file that ends with three digits preceded by an underscore.

As we read in the Excel files, we clean up lots of blank columns and headers. Then we convert to a long table and drop total and blank rows. Also, rather than try to align the Electoral District name rows with their polls, we use the name of the Excel file to pull out the Electoral District number. Then we join with the electoral_districts table to pull in the Electoral District names.

file_pattern <- "*_[[:digit:]]{3}.xls" # Can use this to filter down to specific files
poll_data <- list.files(path = "data-raw/pollresults", pattern = file_pattern, full.names = TRUE) %>% # Find all files that match the pattern
  purrr::set_names() %>%
  purrr::map_df(readxl::read_excel, sheet = 1, col_types = "text", .id = "file") %>%   # Import each file and merge into a dataframe
  # Specifying sheet = 1 just to be clear we're ignoring the rest of the sheets
  # Declare col_types since there are duplicate surnames and map_df can't recast column types in the rbind
  # For example, Bell is in both district 014 and 063
  dplyr::select(-starts_with("X__")) %>% # Drop all of the blank columns
  dplyr::select(1:2,4:8,15:dim(.)[2]) %>% # Reorganize a bit and drop unneeded columns
  dplyr::rename(poll_number = `POLL NO.`) %>%
  tidyr::gather(candidate, votes, -file, -poll_number) %>% # Convert to a long table
  dplyr::filter(!is.na(votes),
                poll_number != "Totals") %>%
  dplyr::mutate(electoral_district = stringr::str_extract(file, "[[:digit:]]{3}"),
                votes = as.numeric(votes)) %>%
  dplyr::select(-file) %>%
  dplyr::left_join(electoral_districts)
poll_data

## # A tibble: 143,455 x 5
##    poll_number candidate votes electoral_district electoral_district_name
##          <chr>     <chr> <dbl>              <chr>                   <chr>
##  1         001   DICKSON    73                001          Ajax—Pickering
##  2         002   DICKSON   144                001          Ajax—Pickering
##  3         003   DICKSON    68                001          Ajax—Pickering
##  4         006   DICKSON   120                001          Ajax—Pickering
##  5         007   DICKSON    74                001          Ajax—Pickering
##  6        008A   DICKSON    65                001          Ajax—Pickering
##  7        008B   DICKSON    81                001          Ajax—Pickering
##  8         009   DICKSON   112                001          Ajax—Pickering
##  9         010   DICKSON   115                001          Ajax—Pickering
## 10         011   DICKSON    74                001          Ajax—Pickering
## # ... with 143,445 more rows

The only thing left to do is to join poll_data with candidate_parties to add party affiliation to each candidate. Because the names don’t always exactly match between these two tables, we use the fuzzyjoin package to join by closest spelling.

poll_data_party_match_table <- poll_data %>%
  group_by(candidate, electoral_district_name) %>%
  summarise() %>%
  fuzzyjoin::stringdist_left_join(candidate_parties,
                                  ignore_case = TRUE) %>%
  dplyr::select(candidate = candidate.x,
                party = party,
                electoral_district = electoral_district) %>%
  dplyr::filter(!is.na(party))
poll_data %<>%
  dplyr::left_join(poll_data_party_match_table) %>%
  dplyr::group_by(electoral_district, party)
tibble::glimpse(poll_data)

## Observations: 144,323
## Variables: 6
## $ poll_number             <chr> "001", "002", "003", "006", "007", "00...
## $ candidate               <chr> "DICKSON", "DICKSON", "DICKSON", "DICK...
## $ votes                   <dbl> 73, 144, 68, 120, 74, 65, 81, 112, 115...
## $ electoral_district      <chr> "001", "001", "001", "001", "001", "00...
## $ electoral_district_name <chr> "Ajax—Pickering", "Ajax—Pickering", "A...
## $ party                   <chr> "Liberal", "Liberal", "Liberal", "Libe...

And, there we go. One table with a row for the votes received by each candidate at each poll. It would have been great if Elections Ontario released data in this format and we could have avoided all of this work.

Charity donations by province

This tweet about the charitable donations by Albertans showed up in my timeline and caused a ruckus.

Many people took issue with the fact that these values weren’t adjusted for income. Seems to me that whether this is a good idea or not depends on what kind of question you’re trying to answer. Regardless, the CANSIM table includes this value. So, it is straightforward to calculate. Plus CANSIM tables have a pretty standard structure and showing how to manipulate this one serves as a good template for others.

library(tidyverse)
# Download and extract
url <- "[www20.statcan.gc.ca/tables-ta...](http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/01110001-eng.zip)"
zip_file <- "01110001-eng.zip"
download.file(url,
              destfile = zip_file)
unzip(zip_file) 
# We only want two of the columns. Specifying them here.
keep_data <- c("Median donations (dollars)",
               "Median total income of donors (dollars)")
cansim <- read_csv("01110001-eng.csv") %>% 
  filter(DON %in% keep_data,
         is.na(`Geographical classification`)) %>% # This second filter removes anything that isn't a province or territory
  select(Ref_Date, DON, Value, GEO) %>%
  spread(DON, Value) %>% 
  rename(year = Ref_Date,
         donation = `Median donations (dollars)`,
         income = `Median total income of donors (dollars)`) %>% 
  mutate(donation_per_income = donation / income) %>% 
  filter(year == 2015) %>% 
  select(GEO, donation, donation_per_income)
cansim
## # A tibble: 16 x 3
##                                  GEO donation donation_per_income
##                                                   
##  1                           Alberta      450         0.006378455
##  2                  British Columbia      430         0.007412515
##  3                            Canada      300         0.005119454
##  4                          Manitoba      420         0.008032129
##  5                     New Brunswick      310         0.006187625
##  6         Newfoundland and Labrador      360         0.007001167
##  7 Non CMA-CA, Northwest Territories      480         0.004768528
##  8                 Non CMA-CA, Yukon      310         0.004643499
##  9             Northwest Territories      400         0.003940887
## 10                       Nova Scotia      340         0.006505932
## 11                           Nunavut      570         0.005651398
## 12                           Ontario      360         0.005856515
## 13              Prince Edward Island      400         0.008221994
## 14                            Quebec      130         0.002452830
## 15                      Saskatchewan      410         0.006910501
## 16                             Yukon      420         0.005695688

Curious that they dropped the territories from their chart, given that Nunavut has such a high donation amount.

Now we can plot the normalized data to find how the rank order changes. We’ll add the Canadian average as a blue line for comparison.
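
A minimal version of that plot might look like this (a sketch: provinces reordered by the normalized value, with the Canada-wide figure marked by a blue line):

cansim %>%
  ggplot(aes(x = reorder(GEO, donation_per_income), y = donation_per_income)) +
  geom_col() +
  # Mark the Canada-wide value for comparison
  geom_hline(yintercept = cansim$donation_per_income[cansim$GEO == "Canada"],
             colour = "blue") +
  coord_flip() +
  labs(x = NULL, y = "Median donation / median donor income")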

I’m not comfortable with using median donations (adjusted for income or not) to say anything in particular about the residents of a province. But, I’m always happy to look more closely at data and provide some context for public debates.

One major gap with this type of analysis is that we’re only looking at the median donations of people that donated anything at all. In other words, we aren’t considering anyone who donates nothing. We should really compare these median donations to the total population or the size of the economy. This Stats Can study is a much more thorough look at the issue.

For me the interesting result here is the dramatic difference between Quebec and the rest of the provinces. But, I don’t interpret this to mean that Quebecers are less generous than the rest of Canada. Seems more likely that there are material differences in how the Quebec economy and social safety nets are structured.

TTC delay data and Friday the 13th

The TTC releasing their Subway Delay Data was great news. I’m always happy to see more data released to the public. In this case, it also helps us investigate one of the great, enduring mysteries: Is Friday the 13th actually an unlucky day?

As always, we start by downloading and manipulating the data. I’ve added in two steps that aren’t strictly necessary. One is converting the Date, Time, and Day columns into a single Date column. The other is to drop most of the other columns of data, since we aren’t interested in them here.

url <- "http://www1.toronto.ca/City%20Of%20Toronto/Information%20&%20Technology/Open%20Data/Data%20Sets/Assets/Files/Subway%20&%20SRT%20Logs%20(Jan01_14%20to%20April30_17).xlsx"
filename <- basename(url)
download.file(url, destfile = filename, mode = "wb")
delays <- readxl::read_excel(filename, sheet = 2) %>% 
  dplyr::mutate(date = lubridate::ymd_hm(paste(`Date`, 
                                               `Time`, 
                                               sep = " ")),
                delay = `Min Delay`) %>% 
  dplyr::select(date, delay)
delays
## # A tibble: 69,043 x 2
##                   date delay
##                 <dttm> <dbl>
##  1 2014-01-01 02:06:00     3
##  2 2014-01-01 02:40:00     0
##  3 2014-01-01 03:10:00     3
##  4 2014-01-01 03:20:00     5
##  5 2014-01-01 03:29:00     0
##  6 2014-01-01 07:31:00     0
##  7 2014-01-01 07:32:00     0
##  8 2014-01-01 07:34:00     0
##  9 2014-01-01 07:34:00     0
## 10 2014-01-01 07:53:00     0
## # ... with 69,033 more rows

Now we have a delays dataframe with 69043 incidents starting from 2014-01-01 00:21:00 and ending at 2017-04-30 22:13:00. Before we get too far, we’ll take a look at the data. A heatmap of delays by day and hour should give us some perspective.

delays %>% 
  dplyr::mutate(day = lubridate::day(date),
         hour = lubridate::hour(date)) %>% 
  dplyr::group_by(day, hour) %>% 
  dplyr::summarise(sum_delay = sum(delay)) %>% 
  ggplot2::ggplot(aes(x = hour, y = day, fill = sum_delay)) +
    ggplot2::geom_tile(alpha = 0.8, color = "white") +
    ggplot2::scale_fill_gradient2() + 
    ggplot2::theme(legend.position = "right") +
    ggplot2::labs(x = "Hour", y = "Day of the month", fill = "Sum of delays")

Other than a reliable band of calm very early in the morning, no obvious patterns here.

We need to identify any days that are a Friday the 13th. We also might want to compare weekends, regular Fridays, other weekdays, and Friday the 13ths, so we add a type column that provides these values. Here we use the case_when function:

delays <- delays %>% 
    dplyr::mutate(type = case_when( # Partition into Friday the 13ths, Fridays, weekends, and weekdays
      lubridate::wday(.$date) %in% c(1, 7) ~ "weekend",
      lubridate::wday(.$date) %in% c(6) & 
        lubridate::day(.$date) == 13 ~ "Friday 13th",
      lubridate::wday(.$date) %in% c(6) ~ "Friday",
      TRUE ~ "weekday" # Everything else is a weekday
  )) %>% 
  dplyr::mutate(type = factor(type)) %>% 
  dplyr::group_by(type)
delays
## # A tibble: 69,043 x 3
## # Groups:   type [4]
##                   date delay    type
##                 <dttm> <dbl>   <fct>
##  1 2014-01-01 02:06:00     3 weekday
##  2 2014-01-01 02:40:00     0 weekday
##  3 2014-01-01 03:10:00     3 weekday
##  4 2014-01-01 03:20:00     5 weekday
##  5 2014-01-01 03:29:00     0 weekday
##  6 2014-01-01 07:31:00     0 weekday
##  7 2014-01-01 07:32:00     0 weekday
##  8 2014-01-01 07:34:00     0 weekday
##  9 2014-01-01 07:34:00     0 weekday
## 10 2014-01-01 07:53:00     0 weekday
## # ... with 69,033 more rows

With the data organized, we can start with just a simple box plot of the minutes of delay by type.

ggplot2::ggplot(delays, aes(type, delay)) +
  ggplot2::geom_boxplot() + 
  ggplot2::labs(x = "Type", y = "Minutes of delay")

Not very compelling. Basically most delays are short (as in zero minutes long) with many outliers.

How about if we summed up the total minutes in delays for each of the types of days?

delays %>% 
  dplyr::summarise(total_delay = sum(delay))
## # A tibble: 4 x 2
##          type total_delay
##        <fct>       <dbl>
## 1      Friday       18036
## 2 Friday 13th         619
## 3     weekday       78865
## 4     weekend       28194

Clearly the total minutes of delays are much shorter for Friday the 13ths. But, there aren’t very many such days (only 6 in fact). So, this is a dubious analysis.
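
That count of 6 comes from tallying the distinct calendar days tagged as Friday 13th (a quick sketch):

delays %>% 
  dplyr::filter(type == "Friday 13th") %>% 
  dplyr::mutate(day = lubridate::as_date(date)) %>% # collapse timestamps to calendar days
  dplyr::distinct(day) %>% 
  nrow()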

Let’s take a step back and calculate the average of the total delay across the entire day for each of the types of days. If Friday the 13ths really are unlucky, we would expect to see longer delays, at least relative to a regular Friday.

daily_delays <- delays %>% # Total delays in a day
  dplyr::mutate(year = lubridate::year(date),
                day = lubridate::yday(date)) %>% 
  dplyr::group_by(year, day, type) %>% 
  dplyr::summarise(total_delay = sum(delay))

mean_daily_delays <- daily_delays %>% # Average delays in each type of day
  dplyr::group_by(type) %>% 
  dplyr::summarise(avg_delay = mean(total_delay))
mean_daily_delays
## # A tibble: 4 x 2
##          type avg_delay
##        <fct>     <dbl>
## 1      Friday 107.35714
## 2 Friday 13th 103.16667
## 3     weekday 113.63833
## 4     weekend  81.01724
ggplot2::ggplot(daily_delays, aes(type, total_delay)) +
  ggplot2::geom_boxplot() + 
  ggplot2::labs(x = "Type", y = "Total minutes of delay")

On average, Friday the 13ths have shorter total delays (103 minutes) than either regular Fridays (107 minutes) or other weekdays (114 minutes). Overall, weekend days have far shorter total delays (81 minutes).

If Friday the 13ths are unlucky, they certainly aren’t causing longer TTC delays.

For the statisticians among you that still aren’t convinced, we’ll run a basic linear model to compare Friday the 13ths with regular Fridays. This should control for many unmeasured variables.

model <- lm(total_delay ~ type, data = daily_delays, 
            subset = type %in% c("Friday", "Friday 13th"))
summary(model)
## 
## Call:
## lm(formula = total_delay ~ type, data = daily_delays, subset = type %in% 
##     c("Friday", "Friday 13th"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.357  -30.357   -6.857   18.643  303.643 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      107.357      3.858  27.829   <2e-16 ***
## typeFriday 13th   -4.190     20.775  -0.202     0.84    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50 on 172 degrees of freedom
## Multiple R-squared:  0.0002365,  Adjusted R-squared:  -0.005576 
## F-statistic: 0.04069 on 1 and 172 DF,  p-value: 0.8404

Definitely no statistical support for the idea that Friday the 13ths cause longer TTC delays.

How about time series tests, like anomaly detections? Seems like we’d just be getting carried away. Part of the art of statistics is knowing when to quit.

In conclusion, then, after likely far too much analysis, we find no evidence that Friday the 13ths cause an increase in the length of TTC delays. This certainly suggests that Friday the 13ths are not unlucky in any meaningful way, at least for TTC riders.

Glad we could put this superstition to rest!

Canada LEED projects

The CaGBC maintains a list of all the registered LEED projects in Canada. This is a great resource, but rather awkward for analyses. I’ve copied these data into a DabbleDB application with some of the maps and tabulations that I frequently need to reference.

Here, for example, is a map of the density of LEED projects in each province, while here is a more detailed view of the kinds of projects across provinces. There are several other views available. Are there any others that might be useful?

Instapaper Review

Instapaper is an integral part of my web-reading routine. Typically, I have a few minutes early in the morning and scattered throughout the day for quick scans of my favourite web sites and news feeds. I capture anything worth reading with Instapaper’s bookmarklet to create a reading queue of interesting articles. Then with a quick update to the iPhone app this queue is available whenever I find longer blocks of time for reading, particularly during the morning subway ride to work or late at night.

I also greatly appreciate Instapaper’s text view, which removes all the banners, ads, and link lists from the articles to present a nice and clean text view of the content only. I often find myself saving an article to Instapaper even when I have the time to read it, just so I can use this text-only view.

Instapaper is one of my favourite tools and the first iPhone application I purchased.

Stikkit from the command line

Note – This post has been updated since 2007-03-20 to describe new installation instructions.

Overview

I’ve integrated Stikkit into most of my workflow and am quite happy with the results. However, one missing piece is quick access to Stikkit from the command line. In particular, a quick list of my undone todos, without having to load up a web browser, would be quite useful. To this end, I’ve written a Ruby script for interacting with Stikkit. As I mentioned, my real interest is in listing undone todos, but I decided to make the script more general, so you can ask for specific types of stikkits and restrict them with specific parameters. Also, since the Stikkit API is so easy to use, I added a method for creating new stikkits.

Usage

The general use of the script is to list stikkits of a particular type, filtered by a parameter. For example,

ruby stikkit.rb --list calendar dates=today

will show all of today’s calendar events, while

ruby stikkit.rb -l todos done=0

lists all undone todos. The use of -l instead of --list is simply a standard convenience. Furthermore, since this last example comprises almost all of my use of the script, I added a convenience method to get all undone todos:

ruby stikkit.rb -t

A good way to understand stikkit types and parameters is to keep an eye on the url while you interact with Stikkit in your browser. To create a new stikkit, use the --create flag,

ruby stikkit.rb -c 'Remember me.'

The text you pass to stikkit.rb will be processed as usual by Stikkit.

Installation

Grab the script from the Google Code project and put it somewhere convenient. Making the file executable and adding it to your path will cut down on the typing. The script reads your username and password from a .stikkit file in your home directory. Modify this template and save it as ~/.stikkit:


     ---
     username: me@domain.org 
     password: superSecret 

The script also requires the atom gem, which you can grab with

gem install atom

I’ve tried to include some flexibility in the processing of stikkits. So, if you don’t like using atom, you can switch to a different format provided by Stikkit. The text type requires no gems, but makes picking out pieces of the stikkits challenging.

Feedback

This script serves me well, but I’m interested in making it more useful. Feel free to pass along any comments or feature requests.

Yahoo Pipes and the Globe and Mail

Most of my updates arrive through feeds to NetNewsWire. Since my main source of national news and analysis is the Globe and Mail, I’m quite happy that they provide many feeds for accessing their content. The problem is that many news stories are duplicated across these feeds. Furthermore, tracking all of the feeds of interest is challenging.

The new Yahoo Pipes service offers a solution to these problems. Without providing too much detail, pipes are a way to filter, connect, and generally mash up the web with a straightforward interface. I’ve used this service to collect all of the Globe and Mail feeds of interest, filter out the duplicates, and produce a single feed I can subscribe to. Nothing fancy, but quite useful. The pipe is publicly available, and if you don’t agree with my choice of news feeds, you are free to clone mine and create your own. There are plenty of other pipes available, so take a look to see if anything looks useful to you. Even better, create your own.
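The idea behind the pipe (merge several feeds, then drop duplicate stories) is simple enough to sketch in R as well. This is only an illustration of the logic, not how Pipes implements it, and the feed URLs below are placeholders:

# Merge several RSS feeds and drop stories that share the same link
library(xml2)

feeds <- c("feed-national.xml", "feed-politics.xml")  # placeholders for the real feed URLs

items <- do.call(rbind, lapply(feeds, function(f) {
  doc <- read_xml(f)
  data.frame(title = xml_text(xml_find_all(doc, "//item/title")),
             link  = xml_text(xml_find_all(doc, "//item/link")),
             stringsAsFactors = FALSE)
}))

deduped <- items[!duplicated(items$link), ]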

If you really want those details, Tim O'Reilly has plenty.

Stikkit Todos in GMail

I find it useful to have a list of my unfinished tasks generally, but subtly, available. To this end, I’ve added my unfinished todos from Stikkit to my GMail web clips. These are the small snippets of text that appear just above the message list in GMail.

All you need is the subscribe link from your todo page with the ‘not done’ button toggled. The url should look something like:

http://stikkit.com/todos.atom?api_key={}&done=0

Paste this into the 'Search by topic or URL:' box of the Web Clips tab in GMail settings.

DabbleDB

My experience helping people manage their data has repeatedly shown that databases are poorly understood. This is well illustrated by the rampant abuse of spreadsheets for recording, manipulating, and analysing data.

Most people realise that they should be using a database; the real issue is the difficulty of creating a proper one. This is a legitimate challenge. Typically, you need to carefully consider all of the categories of data and their relationships when creating the database, which makes the upfront costs quite significant. Why not just start throwing data into a spreadsheet and worry about it later?

I think that DabbleDB can solve this problem. A great strength of Dabble, and the source of its name, is that you can start with a simple spreadsheet of data and progressively convert it to a database as you begin to better understand the data and your requirements.

Dabble also has a host of great features for working with data. I’ll illustrate this with a database I created recently when we were looking for a new home. That was a daunting challenge: we looked at dozens of houses, each with unique pros and cons, in different neighbourhoods and price ranges. I certainly couldn’t keep track of them all.

I started with a simple list of addresses under consideration. This was easily imported into Dabble and immediately became useful. Dabble can export to Google Earth, so I could quickly get an overview of the properties and their proximity to amenities like transit stops and parks. Next, I added fields for the asking price and the MLS url, which were also exported to Google Earth. Including the price gave a good sense of how costs varied with location, while the url meant I could quickly view the entire listing for a property.

Next, we started scheduling appointments to view properties. Adding this to Dabble immediately created a calendar view. Better yet, Dabble can export this view as an iCal file to add into a calendaring program.

Once we started viewing homes, we began to understand what we really were looking for in terms of features. So, I added these to Dabble and could then group, search, and sort by these attributes.

All of this would have been incredibly challenging without Dabble. No doubt, I would have simply used a spreadsheet and missed out on the rich functionality of a database.

Dabble really is worth a look. The best way to start is to watch the seven minute demo and then review some of the great screencasts.

Stikkit: Out with the mental clutter

I like to believe that my brain is useful for analysis, synthesis, and creativity. Clearly it is not proficient at storing details like specific dates and looming reminders. Nonetheless, a great deal of my mental energy is devoted to trying to remember such details and fearing the consequences of the inevitable “it slipped my mind”. As counselled by GTD, I need a good and trustworthy system for removing these important, but distracting, details and having them reappear when needed. I’ve finally settled on the new product from values of n called Stikkit.

Stikkit appeals to me for two main reasons: easy data entry and smart text processing. Stikkit uses the metaphor of the yellow sticky note for capturing text. When you create a new note, you are presented with a simple text field — nothing more. However, Stikkit parses your note for some key words and extracts information to make the note more useful. For example, if you type:

Phone call with John Smith on Feb 1 at 1pm

Stikkit realises that you are describing an event scheduled for February 1st at one in the afternoon with a person (“peep” in Stikkit slang) named John Smith. A separate note will be created to track information about John Smith and will be linked to the phone call note. If you add the text “remind me” to the note, Stikkit will send you an email and SMS message prior to the event. You can also include tags to group notes together with the keywords “tag as”.

A recent update to peeps makes them even more useful. Stikkit now collects information about people as you create notes. So, for example, if I later post:

- Send documents to John Smith john@smith.net

Stikkit will recognise John Smith and update my peep for him with the email address provided. In this way, Stikkit becomes more useful as you continue to add information to notes. Also, the prefixed “-” causes Stikkit to recognise this note as a todo. I can then list all of my todos and check them off as they are completed.

This text processing greatly simplifies data entry, since I don’t need to click around to create todos or choose dates from a calendar picker. Just type in the text, hit save, and I’m done. Fortunately, Stikkit has been designed to be smart rather than clever. The distinction here is that Stikkit relies on a few key words (such as at, for, to) to mark up notes consistently and reliably. Clever software is exemplified by Microsoft Word’s autocorrect or clipboard assistant: my first goal when encountering these “features” is to turn them off, since they rarely do the right thing and end up being a hindrance. Stikkit is well worth a look. For a great overview, check out the screencasts in the forum.

Mac vs. PC Remotes

An image of a remote from Apple and a PC

I grabbed this image while preparing a new Windows machine. This seems to be an interesting comparison of the difference in design approaches between Apple and PC remotes. Both provide essentially the same functions. Clearly, however, one is more complex than the other. Which would you rather use?

Plantae's continued development

Prior to general release, Plantae is moving web hosts. This seems like a good time to point out that all of Plantae’s code is hosted at Google Code. The project has great potential and deserves consistent attention. Unfortunately, I can’t continue to develop the code myself. So, if you have an interest in collaborative software, particularly in the scientific context, I encourage you to take a look.

Text processing with Unix

I recently helped someone process a text file with the help of Unix command line tools. The job would have been quite challenging otherwise, and I think this represents a useful demonstration of why I choose to use Unix.

The basic structure of the datafile was:

; A general header file ;
1
sample: 0.183 0.874 0.226 0.214 0.921 0.272 0.117
2
sample: 0.411 0.186 0.956 0.492 0.150 0.278 0.110
3
...

In this case the only important information is the second number of each line that begins with “sample:”. Of course, one option is to manually process the file, but there are thousands of lines, and that’s just silly.

We begin by extracting only the lines that begin with “sample:”. grep will do this job easily:

grep "^sample" input.txt

grep searches through the input.txt file and outputs any matching lines to standard output.

Now, we need the second number. sed can strip the initial text from each line with a find and replace, while tr squeezes any strange runs of whitespace:

sed 's/sample: //g' | tr -s ' '

Notice the use of the pipe (|) here. It sends the output of one command to the input of the next, which allows commands to be strung together and is one of the truly powerful features of Unix.

Now we have a matrix of numbers in rows and columns, which is easily processed with awk.

awk '{print $2;}'

Here we ask awk to print out the second number of each row.

So, if we string all this together with pipes, we can process this file as follows:

grep "^sample" input.txt | sed 's/sample: //g' | tr -s ' ' | awk '{print $2;}' > output.txt

Our numbers of interest are in output.txt.
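For completeness, the same extraction can be done in R, which may be handy if the numbers are headed there anyway. This is just a sketch, assuming input.txt has the structure shown above:

# Keep the "sample:" lines and pull out the second number from each
lines   <- readLines("input.txt")
samples <- grep("^sample:", lines, value = TRUE)
fields  <- strsplit(sub("^sample:\\s*", "", samples), "\\s+")
second  <- vapply(fields, function(x) as.numeric(x[2]), numeric(1))
writeLines(format(second), "output.txt")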

Principles of Technology Adoption

Choosing appropriate software tools can be challenging. Here are the principles I employ when making the decision:

  • Simple: This seems obvious, but many companies fail here. Typically, their downfall is focussing on a perpetual increase in feature quantity. I don’t evaluate software with feature counts. Rather, I value software that performs a few key operations well. Small, focussed tools result in much greater productivity than overly-complex, all-in-one tools. 37 Signals’ Writeboard is a great example of a simple, focussed tool for collaborative writing.
  • Open formats: I will not choose software that uses proprietary or closed data formats. Closed formats cause two main difficulties:

  • I must pay the proprietor of a closed format for the privilege of accessing my data. Furthermore, switching to a new set of software may require translating my data or, even worse, losing access altogether. Open formats allow me to access my data at any time and with any appropriate tool.

  • My tools are limited to the range of innovations that the proprietor deems important. Open formats allow me to take advantage of any great new tools that become available.

  • Flexible: As my requirements ebb and flow, I shouldn’t be locked into the constraints of a particular tool. The best options are internet-based subscription plans. If I need more storage space or more access for collaborators, I simply choose a new subscription plan. Then, if things slow down, I can move back to a basic plan and save money. The online backup service Strongspace, for example, has subscription plans tied to the amount of storage and number of users required.

  • Network: A good tool must be fully integrated into the network. The ability to collaborate with anyone or to access my data from any computer is a great boon to productivity. Many of the best tools are completely internet based; all that is required to use them is a web browser. This also means that the tool is monitored and maintained by a collection of experts and can be upgraded at any time without being locked into version-number updates. Furthermore, with data maintained on a network, many storage and backup problems are addressed. GMail, for example, stores over 2GB of email, free of charge, with an innovative user interface.

Exemplars

These are some of my favourite adherents to the principles outlined above:

RSiteSearch

I’m not sure how this escaped my notice until now, but `RSiteSearch` is a very useful command in R. Passing a string to this function loads up your web browser with search results from the R documentation and mailing list. So, for example:

RSiteSearch("glm")

will show you everything you need to know about using R for generalised linear models.
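If you only want to search the documentation already installed on your machine, the built-in help.search function (or its ?? shorthand) is a handy companion:

help.search("glm")  # search locally installed documentation
??glm               # shorthand for the same search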

R module for ConTeXt

I generally write my documents in Sweave format. This approach allows me to embed the code for analyses directly in the report derived from the analyses, so that all results and figures are generated dynamically with the text of the report. This provides both great documentation of the analyses and the convenience of a single file to keep track of and work with.

Now there is a new contender for integrating analysis code and documentation with the release of an R module for ConTeXt. I prefer the clean implementation and modern features of ConTeXt to the excellent, but aging, LaTeX macro package that Sweave relies on. So, using ConTeXt for my documents is a great improvement.

Here’s a simple example of using this new module. I create two randomly distributed, normal variables, test for a correlation between them, and plot their distribution.

\usemodule[r]

\starttext
Describe the motivation of the analyses first.

Now create some variables.

\startRhidden
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 1)
\stopRhidden

Are they correlated?

\startR
model <- lm(y ~ x)
summary(model)
\stopR

Now we can include a figure.

\startR
pdf("testFigure.pdf")
plot(x, y)
dev.off()
\stopR

\startbuffer
\placefigure{Here it is}{\externalfigure[testFigure]}
\stopbuffer
\getbuffer

\stoptext

Processing this code produces a pdf file with all of the results produced from R, including the figure.

I had some minor difficulties getting this to work on my OS X machine, through no fault of the R module itself. There are two problems. The first is that, by default, write18 is not enabled, so ConTeXt can’t call R directly. Fix this by editing /usr/local/teTeX/texmf.cnf so that “shell_escape = t”. The second is that the R module calls texmfstart, which isn’t directly accessible from a stock installation of TeX. The steps required are described in the “Configuration of texmfstart” section of the ConTeXt wiki. I modified this slightly by placing the script in ~/bin so that I didn’t interfere with the installed teTeX tree. Now everything should work.

expand.grid

Here’s a simple trick for creating experimental designs in R: use the function expand.grid.

A simple example is:

  treatments <- LETTERS[1:4]
  levels <- 1:3
  experiment <- data.frame(expand.grid(treatment=treatments, level=levels))

which produces:

   treatment level
1          A     1
2          B     1
3          C     1
4          D     1
5          A     2
6          B     2
7          C     2
8          D     2
9          A     3
10         B     3
11         C     3
12         D     3

Now, if you want to randomize your experimental treatments, try:

  experiment[sample(dim(experiment)[1]), ]

sample produces a random permutation of the row indices: it samples, without replacement, from a vector as long as the number of rows in the experiment data frame. The square brackets then use this permutation to reorder the rows of experiment.
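If you want the same random ordering every time, say to keep a printed design stable in a document, set the seed first (the seed value here is arbitrary). Note that nrow(experiment) is equivalent to dim(experiment)[1]:

  set.seed(42)  # arbitrary seed, for a reproducible randomization
  experiment[sample(nrow(experiment)), ]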

Heart of the Matter

CBC’s Ideas has been running a series of shows on heart disease called “Heart of the Matter”. Episode 2 is particularly interesting from a statistical perspective, as the episode discusses several difficulties with the analysis of drug efficacy. Some highlights include:

Effect sizes: Some of the best-cited studies of drugs to treat heart disease show a statistically significant effect of only a few percentage points of improvement. Contrast this with the dramatic, vastly superior improvement from diet alone.

Response variables: The focus of many drug studies has been on the reduction of cholesterol, rather than reductions in heart disease. Diet studies, for example, have shown dramatic improvements in reducing heart attacks while having no effect on cholesterol levels. Conversely, drug studies that show a reduction in cholesterol show no change in mortality rates.

Blocking of data: Separate analyses of drug efficacy in female or elderly patients tend to show that drug therapy increases overall mortality. Lumping these data in with the traditional middle-aged male patients removes this effect and, instead, shows a significant decrease in heart disease with drug use.

The point here isn’t to make a comment on the influence of drug companies on medical research. Rather, such statistical concerns are common to all research disciplines. The primary concern of such analyses should be: what is the magnitude of the effect of a specific treatment on my variable of interest? The studies discussed in the Ideas program suggest that much effort has been devoted to detecting significant effects of drugs on surrogate response variables regardless of the size of the effect.

Plantae resurrected

Some technical issues coupled with my road-trip-without-a-laptop conspired to keep Plantae from working correctly. I’ve repaired the damage and isolated Plantae from such problems in the future. My apologies for the downtime.

Analysis of Count Data

When response variables are composed of counts, the standard statistical methods that rely on the normal distribution are no longer applicable. Count data consist of non-negative integers and, often, many zeros. For such data, we need statistics based on Poisson or binomial distributions. I’ve spent the past few weeks analysing counts from hundreds of transects and, as is typical, a particular challenge was determining the appropriate R packages to use. Here’s what I’ve found so far.

The first step is to get an idea of the dispersion of data points:

Means <- tapply(y, list(x1, x2), mean)
Vars <- tapply(y, list(x1, x2), var)
plot(Means, Vars, xlab="Means", ylab="Variances")
abline(a=0, b=1)

For the Poisson distribution, the mean is equal to the variance, so we expect the points to lie along the solid line added to the plot. If the points are overdispersed, a negative binomial distribution may be more appropriate. The pscl library provides a function to test this:

library(MASS)  # provides glm.nb
library(pscl)  # provides odTest
model.nb <- glm.nb(y ~ x, data=data)
odTest(model.nb)
summary(model.nb)

If the odTest function rejects the null model, then the data are overdispersed relative to a Poisson distribution. One particularly useful function is glmmPQL from the MASS library. It allows for random intercepts and, combined with the negative.binomial family function from the same library, lets you fit a negative binomial GLMM:

model.glmm.nb <- glmmPQL(y ~ x1 + x2,
                         random= ~ 1|transect, data=data,
                         family=negative.binomial(model.nb$theta))

In this case, I use the θ estimated by the glm.nb function in the negative.binomial call. Also useful are the zeroinfl function of the pscl library for fitting zero-inflated Poisson or negative binomial models and the geeglm function of geepack for fitting generalized estimating equations for repeated measures. Finally, fitdistr from MASS allows for estimating the parameters of different distributions from empirical data.
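As a rough illustration of the zero-inflated option, here is a sketch of a zero-inflated negative binomial fit; the formula and variable names simply mirror the example above, with a constant-only model for the zero-inflation part:

library(pscl)
model.zinb <- zeroinfl(y ~ x1 + x2 | 1, data=data, dist="negbin")
summary(model.zinb)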

Desktop Manager

I’m convinced that no computer display is large enough. What we need are strategies to better manage our computer workspace and application windows. Exposé and tabbed browsing are great features, but what I really want is the equivalent of a file folder. You put all of the relevant documents in a folder and then put it aside for when you need it. Once you’re ready, you open up the folder and are ready to go.

A feature that comes close to this is virtual desktops. I became enamoured with these while running KDE and have found them again for OS X with Desktop Manager. The idea is to create workspaces associated with specific tasks as a virtual desktop. You can then switch between these desktops as you move from one project to the next. So, for each of the projects I am currently working on, I’ve created a desktop with each application’s windows in the appropriate place. For a consulting project, I likely have Aquamacs running an R simulation with a terminal window open for updating my subversion repository. A project in the writing stage requires TeXShop and BibDesk, while a web-design project needs TextMate and Camino. Each of these workspaces is independent and I can quickly switch to them when needed. Since the applications are running with their windows in the appropriate place, I can quickly get back to work on the project at hand.

Application windows can be split across desktops and specific windows can be present across all desktops. I’ve also found it useful to have one desktop for communication (email, messaging, etc.) and another that has no windows open at all.