For a while now I’ve been meaning to get serious about automation on iOS. The Siri Shortcuts Field Guide from @macsparky is a great, comprehensive introduction. While taking the course, I built a good dozen useful shortcuts and have ideas for many more.
Ed Harcourt’s album Beyond the End has been great headphone music at work 🎶 itunes.apple.com/ca/album/…
Fun to be back ⛷
#CozyLife
Another Christmas dinner
Good advice from a philosopher on how to disagree constructively:
> Many of my best friends think that some of my deeply held beliefs about important issues are obviously false or even nonsense. Sometimes, they tell me so to my face. How can we still be friends? Part of the answer is that these friends and I are philosophers, and philosophers learn how to deal with positions on the edge of sanity.
The Barchef project is delicious
Out for a Christmas stroll
The Kano computer is a great product. Fun and easy to build, with lots of good built-in software.
Merry Christmas
A quiet, snowy run today.
Preparing the tree
Here’s some “behind the scenes” geospatial analysis I used in a recent @PsephoAnalytics post. This was a good excuse to experiment with the sf #rstats 📦, which makes this much easier than my old workflows
This is a “behind the scenes” elaboration of the geospatial analysis in our recent post on evaluating our predictions for the 2018 mayoral election in Toronto. This was my first serious use of the new sf package for geospatial analysis. I found the package much easier to use than some of my previous workflows for this sort of analysis, especially given its integration with the tidyverse.
We start by downloading the shapefile for voting locations from the City of Toronto’s Open Data portal and reading it with the read_sf function. Then, we pipe it to st_transform to set the appropriate projection for the data. In this case, this isn’t strictly necessary, since the shapefile is already in the right projection. But, I tend to do this for all shapefiles to avoid any oddities later.
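A minimal sketch of that step (the file path is an assumption; substitute wherever the unzipped shapefile from the portal lives):

```r
library(sf)
library(dplyr) # for the %>% pipe

# Read the voting location shapefile and set the projection.
# The path below is illustrative, not the portal's actual filename.
toronto_locations <- read_sf("voting_locations/VOTING_LOCATION_2018_WGS84.shp") %>%
  st_transform(crs = 4326) # WGS84 lat/long, matching the output below
```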
```
## Simple feature collection with 1700 features and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -79.61937 ymin: 43.59062 xmax: -79.12531 ymax: 43.83052
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,700 x 14
##    POINT_ID FEAT_CD FEAT_C_DSC PT_SHRT_CD PT_LNG_CD POINT_NAME VOTER_CNT
##       <dbl> <chr>   <chr>      <chr>      <chr>     <chr>          <int>
##  1    10190 P       Primary    056        10056     <NA>              37
##  2    10064 P       Primary    060        10060     <NA>             532
##  3    10999 S       Secondary  058        10058     Malibu           661
##  4    11342 P       Primary    052        10052     <NA>            1914
##  5    10640 P       Primary    047        10047     The Summit       956
##  6    10487 S       Secondary  061        04061     White Eag…        51
##  7    11004 P       Primary    063        04063     Holy Fami…      1510
##  8    11357 P       Primary    024        11024     Rosedale …      1697
##  9    12044 P       Primary    018        05018     Weston Pu…      1695
## 10    11402 S       Secondary  066        04066     Elm Grove…        93
## # ... with 1,690 more rows, and 7 more variables: OBJECTID <dbl>,
## #   ADD_FULL <chr>, X <dbl>, Y <dbl>, LONGITUDE <dbl>, LATITUDE <dbl>,
## #   geometry <POINT>
```
The file has 1,700 rows of data across 14 columns. The first 13 columns are data within the original shapefile. The last column is a list column added by sf that contains the geometry of each location. This design is what makes an sf object work so well with the rest of the tidyverse: the geographical details are just a column in the data frame. This makes the data much easier to work with than in other approaches, where the data is contained within an @data slot of an object.
Plotting the data is straightforward, since sf objects have a plot method. Here’s an example where we plot the number of voters (VOTER_CNT) at each location. If you squint just right, you can see the general outline of Toronto in these points.
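Using the toronto_locations object from the sketch above:

```r
# Selecting a single attribute column and calling plot draws
# that attribute at each point via sf's plot method.
plot(toronto_locations["VOTER_CNT"])
```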
What we want to do next is use the voting location data to aggregate the votes cast at each location into census tracts. This then allows us to associate census characteristics (like age and income) with the pattern of votes and develop our statistical relationships for predicting voter behaviour.
We’ll split this into several steps. The first is downloading and reading the census tract shapefile.
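A sketch of that step, assuming the 2016 census tract boundary file from Statistics Canada has been downloaded and unzipped (the filename is illustrative):

```r
# Read the Canada-wide census tract boundaries and reproject
# them to match the voting locations.
census_tracts <- read_sf("census_tracts/lct_000b16a_e.shp") %>%
  st_transform(crs = 4326)
```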
Now that we have it, all we really want are the census tracts in Toronto (the shapefile includes census tracts across Canada). We achieve this by intersecting the Toronto voting locations with the census tracts using standard R subsetting notation.
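Subsetting one sf object by another uses an intersection predicate by default, so this keeps only the census tracts that contain at least one Toronto voting location:

```r
# Standard R subsetting: rows of census_tracts whose geometry
# intersects the toronto_locations points.
to_census_tracts <- census_tracts[toronto_locations, ]
```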
And, we can plot it to see how well the intersection worked. This time we’ll plot the CTUID, which is the unique identifier for each census tract. The identifier doesn’t mean anything in this context, but adds some nice colour to the plot.
```r
plot(to_census_tracts["CTUID"])
```
Now you can really see the shape of Toronto, as well as the size of each census tract.
Next we need to manipulate the voting data to get the votes received by each major candidate in the 2018 election. We take these data from the toVotes package and arbitrarily set the threshold for a major candidate at 100,000 votes. This yields our two main candidates: John Tory and Jennifer Keesmaat.
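A minimal sketch of that filter, assuming the toVotes data frame has year, candidate, and votes columns:

```r
# Total votes per candidate in 2018, keeping only candidates
# above the 100,000-vote threshold.
major_candidates <- toVotes %>%
  filter(year == 2018) %>%
  group_by(candidate) %>%
  summarise(total_votes = sum(votes)) %>%
  filter(total_votes >= 100000) %>%
  select(candidate)
```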
```
## # A tibble: 2 x 1
##   candidate        
##   <chr>            
## 1 Keesmaat Jennifer
## 2 Tory John
```
Given our goal of aggregating the votes received by each candidate into census tracts, we need a data frame that has each candidate in a separate column. We start by joining the major candidates table to the votes table. In this case, we also filter the votes to 2018, since John Tory has been a candidate in more than one election. Then we use the tidyr package to convert the table from long (with one candidate column) to wide (with a column for each candidate).
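Roughly, this looks like the following (pivot_wider is the current tidyr idiom; spread was its equivalent at the time):

```r
library(tidyr)

# Keep 2018 votes for the major candidates, then widen the table
# so each candidate gets a column of vote counts.
spread_votes <- toVotes %>%
  filter(year == 2018) %>%
  semi_join(major_candidates, by = "candidate") %>%
  pivot_wider(names_from = candidate, values_from = votes)
```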
Our last step before finally aggregating to census tracts is to join the spread_votes table with the toronto_locations data. This requires pulling the ward and area identifiers from the PT_LNG_CD column of the toronto_locations data frame, which we do with some stringr functions. While we’re at it, we also update the candidate names to just surnames.
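A sketch of that step; the positions of the ward and area digits within PT_LNG_CD, the join keys, and their types are assumptions here:

```r
library(stringr)

# Assume PT_LNG_CD encodes a two-digit ward followed by a
# three-digit voting area (e.g. "10056" -> ward 10, area 56).
to_geo_votes <- toronto_locations %>%
  mutate(ward = as.integer(str_sub(PT_LNG_CD, 1, 2)),
         area = as.integer(str_sub(PT_LNG_CD, 3, 5))) %>%
  inner_join(spread_votes, by = c("ward", "area")) %>%
  # Shorten the candidate columns to surnames.
  rename(Keesmaat = `Keesmaat Jennifer`, Tory = `Tory John`) %>%
  select(Keesmaat, Tory) # the geometry column stays attached
```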
Okay, weβre finally there. We have our census tract data in to_census_tracts and our voting data in to_geo_votes. We want to aggregate the votes into each census tract by summing the votes at each voting location within each census tract. We use the aggregate function for this.
```r
ct_votes_wide <- aggregate(x = to_geo_votes,
                           by = to_census_tracts,
                           FUN = sum)
ct_votes_wide
```
As a last tidying step, we convert the wide table (one column per candidate) back into a long table with a single candidate column.
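Something like this, using gather, for which sf provides a method (pivot_longer is the modern equivalent):

```r
# Collapse the candidate columns into candidate/votes pairs;
# the geometry column carries through.
ct_votes <- ct_votes_wide %>%
  gather(key = "candidate", value = "votes", Keesmaat, Tory)
```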
Now that we have votes aggregated by census tract, we can add in many other attributes from the census data. We won’t do that here, since this post is already pretty long. But, we’ll end with a plot to show how easily sf integrates with ggplot2. This is a nice improvement over past workflows, when several steps were required. In the actual code for the retrospective analysis, I added some other plotting techniques, like cutting the response variable (votes) into equally spaced pieces and adding some more refined labels. Here, we’ll just produce a simple plot.
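A simple version of that plot, using the ct_votes object sketched above:

```r
library(ggplot2)

# geom_sf draws the census tract polygons directly from the
# geometry column; faceting gives one panel per candidate.
ggplot(ct_votes) +
  geom_sf(aes(fill = votes)) +
  facet_wrap(~ candidate) +
  scale_fill_viridis_c(name = "Votes") +
  theme_minimal()
```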
Our predictions for the recent election in Toronto held up well. We were within 6%, on average, with a slight bias towards overestimating Keesmaat’s support. Now we’ll add more demographic richness to our agents and work to reduce the geographical spread of errors www.psephoanalytics.ca/2018/11/r…
Our predictions for the 2018 mayoral race in Toronto were generated by our new agent-based model that used demographic characteristics and results of previous elections.
Now that the final results are available, we can see how our predictions performed at the census tract level.
For this analysis, we restrict the comparison to just Tory and Keesmaat, as they were the only two major candidates and the only two for which we estimated vote share. Given this, we start by just plotting the difference between the actual votes and the predicted votes for Keesmaat. The distribution for Tory is simply the mirror image, since their combined share of votes always equals 100%.
Distribution of the difference between the predicted and actual proportion of votes for Keesmaat
The mean difference from the actual results for Keesmaat is -6%, which means that, on average, we slightly overestimated support for Keesmaat. However, as the histogram shows, there is substantial variation in this difference across census tracts, with the differences slightly skewed towards overestimating Keesmaat’s support.
To better understand this variation, we can look at a plot of the geographical distribution of the differences. In this figure, we show both Keesmaat and Tory. Although the plots are just inverted versions of each other (since the proportion of votes always sums to 100%), seeing them side by side helps illuminate the geographical structure of the differences.
The distribution of the difference between the predicted and actual proportion of votes by census tract
The overall distribution of differences doesn’t have a clear geographical bias. In some sense, this is good, as it shows our agent-based model isn’t systematically biased to any particular census tract. Rather, refinements to the model will improve accuracy across all census tracts.
We’ll write up the details of our new agent-based approach soon. In the meantime, these results show that the approach has promise, given that it used only a few demographic characteristics and no polls. Now we’re particularly motivated to gather much more data to enrich our agents’ behaviour and make better predictions.
Interested in analytics and transit planning? I’m looking for an Analyst to help Metrolinx generate and communicate evidence for project evaluation ca.linkedin.com/jobs/view…