Good advice from a philosopher on how to disagree constructively:
Many of my best friends think that some of my deeply held beliefs about important issues are obviously false or even nonsense. Sometimes, they tell me so to my face. How can we still be friends? Part of the answer is that these friends and I are philosophers, and philosophers learn how to deal with positions on the edge of sanity.
The Barchef project is delicious
Out for a Christmas stroll
The Kano computer is a great product. Fun and easy to build, with lots of good built-in software.
Merry Christmas
A quiet, snowy run today.
Preparing the tree
Here’s some “behind the scenes” geospatial analysis I used in a recent @PsephoAnalytics post. This was a good excuse to experiment with the sf #rstats 📦 which makes this much easier than my old workflows
This is a “behind the scenes” elaboration of the geospatial analysis in our recent post on evaluating our predictions for the 2018 mayoral election in Toronto. This was my first serious use of the new sf package for geospatial analysis. I found the package much easier to use than some of my previous workflows for this sort of analysis, especially given its integration with the tidyverse.
We start by downloading the shapefile for voting locations from the City of Toronto’s Open Data portal and reading it with the read_sf function. Then, we pipe it to st_transform to set the appropriate projection for the data. In this case, this isn’t strictly necessary, since the shapefile is already in the right projection. But, I tend to do this for all shapefiles to avoid any oddities later.
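A minimal sketch of this step follows. The file path is a placeholder for wherever the unzipped Open Data shapefile lives, and the EPSG code is an assumption based on the projection shown in the output below:

```r
library(sf)
library(dplyr)

# Hypothetical path to the unzipped Open Data voting-locations shapefile
toronto_locations <- read_sf("data/voting_locations/VOTING_LOCATION.shp") %>%
  st_transform(crs = 4326)  # ensure WGS84 lat/long, even if already set

toronto_locations
```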
## Simple feature collection with 1700 features and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -79.61937 ymin: 43.59062 xmax: -79.12531 ymax: 43.83052
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,700 x 14
##    POINT_ID FEAT_CD FEAT_C_DSC PT_SHRT_CD PT_LNG_CD POINT_NAME VOTER_CNT
##       <dbl> <chr>   <chr>      <chr>      <chr>     <chr>          <int>
##  1    10190 P       Primary    056        10056     <NA>              37
##  2    10064 P       Primary    060        10060     <NA>             532
##  3    10999 S       Secondary  058        10058     Malibu           661
##  4    11342 P       Primary    052        10052     <NA>            1914
##  5    10640 P       Primary    047        10047     The Summit       956
##  6    10487 S       Secondary  061        04061     White Eag…        51
##  7    11004 P       Primary    063        04063     Holy Fami…      1510
##  8    11357 P       Primary    024        11024     Rosedale …      1697
##  9    12044 P       Primary    018        05018     Weston Pu…      1695
## 10    11402 S       Secondary  066        04066     Elm Grove…        93
## # ... with 1,690 more rows, and 7 more variables: OBJECTID <dbl>,
## #   ADD_FULL <chr>, X <dbl>, Y <dbl>, LONGITUDE <dbl>, LATITUDE <dbl>,
## #   geometry <POINT>
The file has 1700 rows of data across 14 columns. The first 13 columns are data within the original shapefile. The last column is a list column that is added by sf and contains the geometry of each location. This specific design feature is what makes an sf object work really well with the rest of the tidyverse: the geographical details are just a column in the data frame. This makes the data much easier to work with than in other approaches, where the geography is tucked away in an @data slot of an object.
Plotting the data is straightforward, since sf objects have a plot function. Here’s an example where we plot the number of voters (VOTER_CNT) at each location. If you squint just right, you can see the general outline of Toronto in these points.
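The plot itself is a one-liner, since subsetting an sf object by a column name keeps the geometry attached:

```r
# Draw each voting location, coloured by its number of voters
plot(toronto_locations["VOTER_CNT"])
```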
What we want to do next is use the voting location data to aggregate the votes cast at each location into census tracts. This then allows us to associate census characteristics (like age and income) with the pattern of votes and develop our statistical relationships for predicting voter behaviour.
We’ll split this into several steps. The first is downloading and reading the census tract shapefile.
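Sketching that first step, with the same caveat that the path is a placeholder (the file name follows the Statistics Canada 2016 boundary-file convention, but verify against whatever was actually downloaded):

```r
# Hypothetical path to the Statistics Canada census tract boundary shapefile
census_tracts <- read_sf("data/census_tracts/lct_000b16a_e.shp") %>%
  st_transform(crs = 4326)  # match the projection of the voting locations
```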
Now that we have it, all we really want are the census tracts in Toronto (the shapefile includes census tracts across Canada). We achieve this by intersecting the Toronto voting locations with the census tracts using standard R subsetting notation.
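With sf, subsetting one spatial object by another acts as a spatial filter. A sketch, assuming the object names used above:

```r
# Keep only the census tracts that intersect a Toronto voting location
to_census_tracts <- census_tracts[toronto_locations, ]
```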
And, we can plot it to see how well the intersection worked. This time we’ll plot the CTUID, which is the unique identifier for each census tract. This doesn’t mean anything in this context, but adds some nice colour to the plot.
plot(to_census_tracts["CTUID"])
Now you can really see the shape of Toronto, as well as the size of each census tract.
Next we need to manipulate the voting data to get votes received by major candidates in the 2018 election. We take these data from the toVotes package and arbitrarily set the threshold for major candidates to receiving at least 100,000 votes. This yields our two main candidates: John Tory and Jennifer Keesmaat.
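A sketch of that filter, assuming the toVotes package exposes a toVotes data frame with candidate and votes columns (column names are assumptions):

```r
library(dplyr)

# Candidates receiving at least 100,000 votes across all locations
major_candidates <- toVotes::toVotes %>%
  group_by(candidate) %>%
  summarise(total_votes = sum(votes)) %>%
  filter(total_votes >= 100000) %>%
  select(candidate)
```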
## # A tibble: 2 x 1
##   candidate        
##   <chr>            
## 1 Keesmaat Jennifer
## 2 Tory John
Given our goal of aggregating the votes received by each candidate into census tracts, we need a data frame that has each candidate in a separate column. We start by joining the major candidates table to the votes table. In this case, we also filter the votes to 2018, since John Tory has been a candidate in more than one election. Then we use the tidyr package to convert the table from long (with one candidate column) to wide (with a column for each candidate).
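A sketch of the join-and-widen step, assuming the same column names as above (spread was the tidyr verb at the time; pivot_wider is its modern equivalent):

```r
library(tidyr)

spread_votes <- toVotes::toVotes %>%
  filter(year == 2018) %>%                       # Tory ran in earlier elections too
  semi_join(major_candidates, by = "candidate") %>%
  spread(key = candidate, value = votes)         # one column per candidate
```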
Our last step before finally aggregating to census tracts is to join the spread_votes table with the toronto_locations data. This requires pulling the ward and area identifiers from the PT_LNG_CD column of the toronto_locations data frame which we do with some stringr functions. While we’re at it, we also update the candidate names to just surnames.
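A sketch of this join. The character positions assume PT_LNG_CD values like "10056", where the first two digits are the ward and the rest the area; the column names and types are assumptions and may need aligning before the join:

```r
library(stringr)

to_geo_votes <- toronto_locations %>%
  mutate(ward = as.integer(str_sub(PT_LNG_CD, 1, 2)),   # leading digits: ward
         area = as.integer(str_sub(PT_LNG_CD, 3, 5))) %>% # trailing digits: area
  left_join(spread_votes, by = c("ward", "area")) %>%
  rename(Keesmaat = `Keesmaat Jennifer`, Tory = `Tory John`)
```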
Okay, we’re finally there. We have our census tract data in to_census_tracts and our voting data in to_geo_votes. We want to aggregate the votes into each census tract by summing the votes at each voting location within each census tract. We use the aggregate function for this.
ct_votes_wide <- aggregate(x = to_geo_votes,
                           by = to_census_tracts,
                           FUN = sum)
ct_votes_wide
As a last step, to tidy up, we now convert the wide table with a column for each candidate into a long table that has just one candidate column containing the name of the candidate.
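A sketch of that reshaping, assuming the surname columns created earlier (gather was the tidyr verb at the time; the geometry column is carried along):

```r
ct_votes <- ct_votes_wide %>%
  gather(key = "candidate", value = "votes", Keesmaat, Tory)
```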
Now that we have votes aggregated by census tract, we can add in many other attributes from the census data. We won’t do that here, since this post is already pretty long. But, we’ll end with a plot to show how easily sf integrates with ggplot2. This is a nice improvement from past workflows, when several steps were required. In the actual code for the retrospective analysis, I added some other plotting techniques, like cutting the response variable (votes) into equally spaced pieces and adding some more refined labels. Here, we’ll just produce a simple plot.
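The simple version of that plot, using geom_sf (the candidate facet is an assumption about how the long table is structured):

```r
library(ggplot2)

# sf objects plot directly with geom_sf; no fortify() or manual joins needed
ggplot(ct_votes) +
  geom_sf(aes(fill = votes)) +
  facet_wrap(~ candidate) +
  theme_minimal()
```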
Our predictions for the recent election in Toronto held up well. We were within 6%, on average, with a slight bias towards overestimating Keesmaat’s support. Now we’ll add more demographic richness to our agents and reduce the geographical distribution of errors www.psephoanalytics.ca/2018/11/r…
Our predictions for the 2018 mayoral race in Toronto were generated by our new agent-based model that used demographic characteristics and results of previous elections.
Now that the final results are available, we can see how our predictions performed at the census tract level.
For this analysis, we restrict the comparison to just Tory and Keesmaat, as they were the only two major candidates and the only two for which we estimated vote share. Given this, we start by just plotting the difference between the actual votes and the predicted votes for Keesmaat. The distribution for Tory is simply the mirror image, since their combined share of votes always equals 100%.
Distribution of the difference between the predicted and actual proportion of votes for Keesmaat
The mean difference from the actual results for Keesmaat is -6%, which means that, on average, we slightly overestimated support for Keesmaat. However, as the histogram shows, there is significant variation in this difference across census tracts with the differences slightly skewed towards overestimating Keesmaat’s support.
To better understand this variation, we can look at a plot of the geographical distribution of the differences. In this figure, we show both Keesmaat and Tory. Although the plots are just inverted versions of each other (since the proportion of votes always sums to 100%), seeing them side by side helps illuminate the geographical structure of the differences.
The distribution of the difference between the predicted and actual proportion of votes by census tract
The overall distribution of differences doesn’t have a clear geographical bias. In some sense, this is good, as it shows our agent-based model isn’t systematically biased to any particular census tract. Rather, refinements to the model will improve accuracy across all census tracts.
We’ll write details about our new agent-based approach soon. In the meantime, these results show that the approach has promise, given that it used only a few demographic characteristics and no polls. Now we’re particularly motivated to gather up much more data to enrich our agents’ behaviour and make better predictions.
Interested in analytics and transit planning? I’m looking for an Analyst to help Metrolinx generate and communicate evidence for project evaluation ca.linkedin.com/jobs/view…
Thanks to generous support, the 4th Axe Pancreatic Cancer fundraiser was a great success. We raised over $32K this year and all funds support the PancOne Network. So far, we’ve raised close to $120K in honour of my Mom. Thanks to everyone that has supported this important cause!
Raising funds for pancreatic cancer research by throwing axes. Great fun!
Fascinating to think of creating cognitive scaffolding for elephants
aeon.co/essays/if…
We’ve completely retooled our approach to predicting elections to use an agent-based model. Looking forward to comparing our predictions to the actual results tonight for the Toronto election!
It’s been a while since we last posted – largely for personal reasons, but also because we wanted to take some time to completely retool our approach to modeling elections.
In the past, we’ve tried a number of statistical approaches. Because every election differs substantially from its predecessors, this proved unsatisfactory – there are simply too many changing factors that can’t be measured effectively from a top-down view. Top-down approaches ultimately treat people as averages. But candidates and voters do not behave like averages; they have different desires and expectations.
We know there are diverse behaviours that need to be modeled at the person-level. We also recognize that an election is a system of diverse agents, whose behaviours affect each other. For example, a candidate can gain or lose support by doing nothing, depending only on what other candidates do. Similarly, a candidate or voter will behave differently simply based on which candidates are in the race, even without changing any beliefs. In the academic world, the aggregated results of such behaviours are called “emergent properties”, and the ability to predict such outcomes is extremely difficult if looking at the system from the top down.
So we needed to move to a bottom-up approach that would allow us to model agents heterogeneously, and that led us to what is known as agent-based modeling.
Agent-based modeling and elections
Agent-based models employ individual heterogeneous “agents” that are interconnected and follow behavioural rules defined by the modeler. Due to their non-linear approach, agent-based models have been used extensively in military games, biology, transportation planning, operational research, ecology, and, more recently, in economics (where huge investments are being made).
While we’ll write more on this in the coming weeks, we define voters’ and candidates’ behaviour using parameters, and “train” them (i.e., setting those parameters) based on how they behaved in previous elections. For our first proof of concept model, we have candidate and voter agents with two-variable issues sets (call the issues “economic” and “social”) – each with a positional score of 0 to 100. Voters have political engagement scores (used to determine whether they cast a ballot), demographic characteristics based on census data, likability scores assigned to each candidate (which include anything that isn’t based on issues, from name recognition to racial or sexual bias), and a weight for how important likability is to that voter. Voters also track, via polls, the likelihood that a candidate can win. This is important for their “utility function” – that is, the calculation that defines which candidate a voter will choose, if they cast a ballot at all. For example, a candidate that a voter may really like, but who has no chance of winning, may not get the voter’s ultimate vote. Instead, the voter may vote strategically.
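To make the utility function concrete, here is a purely hypothetical sketch of how one voter might score one candidate. Every name, weight, and normalisation here is illustrative, not the actual model:

```r
# Hypothetical voter utility: issue agreement blended with likability,
# discounted by the perceived probability that the candidate can win.
voter_utility <- function(voter, candidate, win_probability) {
  # Distance between voter and candidate on the two 0-100 issue scales
  issue_distance <- sqrt((voter$economic - candidate$economic)^2 +
                         (voter$social - candidate$social)^2)
  # Normalise so perfect agreement scores 1 and maximum distance scores 0
  issue_utility <- 1 - issue_distance / sqrt(2 * 100^2)

  # voter$likability is assumed to be a named vector of scores in [0, 1];
  # likability_weight is how much this voter cares about likability
  base_utility <- (1 - voter$likability_weight) * issue_utility +
    voter$likability_weight * voter$likability[candidate$name]

  # Strategic voting: a well-liked but unviable candidate scores low
  base_utility * win_probability
}
```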
On the other hand, candidates simply seek votes. Each candidate looks at the polls and asks 1) am I a viable candidate?; and 2) how do I change my positions to attract more voters? (For now, we don’t provide them a way to change their likability.) Candidates that have a chance of winning move a random angle from their current position, based on how “flexible” they are on their positions. If that move works (i.e., moves them up in the polls), they move randomly in the same general direction. If the move hurt their standings in the polls, they turn around and go randomly in the opposite general direction. At some point, the election is held – that is, the ultimate poll – and we see who wins.
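The movement rule described above can be sketched as follows. Again, this is a hypothetical illustration of the described behaviour, not the actual implementation; the step size and turning angles are assumptions:

```r
# Hypothetical candidate position update on the two issue dimensions.
# poll_change is the change in poll standing since the last move.
update_position <- function(candidate, poll_change) {
  if (poll_change >= 0) {
    # Last move helped: keep heading in the same general direction
    angle <- candidate$direction + runif(1, -pi / 8, pi / 8)
  } else {
    # Last move hurt: turn around and head the opposite general way
    angle <- candidate$direction + pi + runif(1, -pi / 8, pi / 8)
  }
  step <- candidate$flexibility  # how far this candidate will shift per move
  candidate$economic  <- candidate$economic + step * cos(angle)
  candidate$social    <- candidate$social + step * sin(angle)
  candidate$direction <- angle
  candidate
}
```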
This approach allows us to run elections with different candidates, change a candidate’s likability, introduce shocks (e.g., candidates changing positions on an issue) and, eventually, see how different voting systems might impact who gets elected (foreshadowing future work).
We’re not the first to apply agent-based modeling in psephology by any stretch (there are many in the academic world using it to explain observed behaviours), but we haven’t found any attempting to do so to predict actual elections.
Applying this to the Toronto 2018 Mayoral Race
First, Toronto voters have, over the last few elections, voted somewhat more right-wing than might have been predicted. The average positions across the city for the 2003, 2006, 2010, and 2014 elections look like the following:
This doesn’t mean that Toronto voters are themselves more right-wing than might be expected, just that they voted this way. This is in fact the first interesting outcome of our new approach. We find that about 30% of Toronto voters’ choices have been driven by candidate likability, and that for the more right-wing candidates, likability has been a major reason for choosing them. For example, in 2010, Rob Ford’s likability score was significantly higher than those of his major competitors (George Smitherman and Joe Pantalone). This isn’t to say that everyone liked Rob Ford – but those that did vote for him cared more about something other than issues, at least relative to those who voted for his opponents.
For 2018, likability is less of a differentiating factor, with both major candidates (John Tory and Jennifer Keesmaat) scoring about the same. Nor are the issues – Ms. Keesmaat’s positions don’t seem to be hurting her standing in the polls, as she’s staked out a strong position left of centre on both issues. What appears to be the bigger factor this time around is the early probabilities voters assigned to Ms. Keesmaat’s chance of victory, a point that seems to have been part of the actual Tory campaign’s strategy. With Ms. Keesmaat not seen as a major threat to John Tory by much of the city, that narrative became self-reinforcing. Further, John Tory’s positions are relatively more centrist in 2018 than they were in 2014, when he faced a viable right-wing opponent in Doug Ford. (To prove the point of this approach’s value, we could simply introduce a right-wing candidate and see what happens…)
Thus, our predictions don’t appear to be wildly different from current polls (with Tory winning nearly 2-to-1), and map as follows:
There will be much more to say on this, and much more we can do going forward, but for a proof of concept, we think this approach has enormous promise.