This is a “behind the scenes” elaboration of the geospatial analysis in our recent post on evaluating our predictions for the 2018 mayoral election in Toronto. This was my first serious use of the new sf package for geospatial analysis. I found the package much easier to use than some of my previous workflows for this sort of analysis, especially given its integration with the tidyverse.
We start by downloading the shapefile for voting locations from the City of Toronto’s Open Data portal and reading it with the read_sf function. Then, we pipe it to st_transform to set the appropriate projection for the data. In this case, this isn’t strictly necessary, since the shapefile is already in the right projection. But, I tend to do this for all shapefiles to avoid any oddities later.
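The steps above can be sketched as follows; the shapefile path here is a placeholder for wherever the Open Data download lands, not the actual portal path:

```r
library(sf)
library(dplyr)

# Read the voting locations shapefile and set the projection explicitly,
# even though it's already WGS84, to avoid oddities later
toronto_locations <- sf::read_sf("data-raw/VOTING_LOCATION.shp") %>%
  sf::st_transform(crs = 4326)
```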
## Simple feature collection with 1700 features and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -79.61937 ymin: 43.59062 xmax: -79.12531 ymax: 43.83052
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,700 x 14
##    POINT_ID FEAT_CD FEAT_C_DSC PT_SHRT_CD PT_LNG_CD POINT_NAME VOTER_CNT
##       <dbl> <chr>   <chr>      <chr>      <chr>     <chr>          <int>
##  1    10190 P       Primary    056        10056     <NA>              37
##  2    10064 P       Primary    060        10060     <NA>             532
##  3    10999 S       Secondary  058        10058     Malibu           661
##  4    11342 P       Primary    052        10052     <NA>            1914
##  5    10640 P       Primary    047        10047     The Summit       956
##  6    10487 S       Secondary  061        04061     White Eag…        51
##  7    11004 P       Primary    063        04063     Holy Fami…      1510
##  8    11357 P       Primary    024        11024     Rosedale …      1697
##  9    12044 P       Primary    018        05018     Weston Pu…      1695
## 10    11402 S       Secondary  066        04066     Elm Grove…        93
## # ... with 1,690 more rows, and 7 more variables: OBJECTID <dbl>,
## #   ADD_FULL <chr>, X <dbl>, Y <dbl>, LONGITUDE <dbl>, LATITUDE <dbl>,
## #   geometry <POINT [°]>
The file has 1700 rows of data across 14 columns. The first 13 columns are data within the original shapefile. The last column is a list column that is added by sf and contains the geometry of the location. This specific design feature is what makes an sf object work really well with the rest of the tidyverse: the geographical details are just a column in the data frame. This makes the data much easier to work with than in other approaches, where the data is contained within an `@data` slot of an object.
Plotting the data is straightforward, since sf objects have a plot function. Here’s an example where we plot the number of voters (VOTER_CNT) at each location. If you squint just right, you can see the general outline of Toronto in these points.
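The plot call itself is a one-liner: subsetting the sf object to a single attribute and passing it to plot dispatches to plot.sf, which colours each point by that attribute.

```r
# Colour each voting location by its number of voters
plot(toronto_locations["VOTER_CNT"])
```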
What we want to do next is use the voting location data to aggregate the votes cast at each location into census tracts. This then allows us to associate census characteristics (like age and income) with the pattern of votes and develop our statistical relationships for predicting voter behaviour.
We’ll split this into several steps. The first is downloading and reading the census tract shapefile.
Now that we have it, all we really want are the census tracts in Toronto (the shapefile includes census tracts across Canada). We achieve this by intersecting the Toronto voting locations with the census tracts using standard R subsetting notation.
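In sf, standard subsetting doubles as a spatial filter: using one sf object as the row index keeps only the features that intersect it. A sketch, with `census_tracts` as an assumed name for the full national shapefile:

```r
# Keep only the census tracts that contain at least one Toronto voting location
to_census_tracts <- census_tracts[toronto_locations, ]
```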
And, we can plot it to see how well the intersection worked. This time we’ll plot the CTUID, which is the unique identifier for each census tract. This doesn’t mean anything in this context, but adds some nice colour to the plot.
plot(to_census_tracts["CTUID"])
Now you can really see the shape of Toronto, as well as the size of each census tract.
Next we need to manipulate the voting data to get votes received by major candidates in the 2018 election. We take these data from the toVotes package and arbitrarily set the threshold for major candidates to receiving at least 100,000 votes. This yields our two main candidates: John Tory and Jennifer Keesmaat.
## # A tibble: 2 x 1
##   candidate        
##   <chr>            
## 1 Keesmaat Jennifer
## 2 Tory John
Given our goal of aggregating the votes received by each candidate into census tracts, we need a data frame that has each candidate in a separate column. We start by joining the major candidates table to the votes table. In this case, we also filter the votes to 2018, since John Tory has been a candidate in more than one election. Then we use the tidyr package to convert the table from long (with one candidate column) to wide (with a column for each candidate).
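A sketch of the long-to-wide step on toy data; the real pipeline starts from the toVotes tables, but the tidyr::spread call is the same shape:

```r
library(dplyr)
library(tidyr)

# Toy long table: one row per location per candidate
votes_long <- tibble::tribble(
  ~ward, ~area, ~candidate,          ~votes,
  1,     56,    "Tory John",            120,
  1,     56,    "Keesmaat Jennifer",     80
)

# One row per location, one column per candidate
spread_votes <- votes_long %>%
  tidyr::spread(candidate, votes)
```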
Our last step before finally aggregating to census tracts is to join the spread_votes table with the toronto_locations data. This requires pulling the ward and area identifiers from the PT_LNG_CD column of the toronto_locations data frame which we do with some stringr functions. While we’re at it, we also update the candidate names to just surnames.
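A sketch of that join, assuming (hypothetically) that the first two characters of PT_LNG_CD encode the ward and the remaining three the area; the actual split in the original code may differ:

```r
library(dplyr)
library(stringr)

to_geo_votes <- toronto_locations %>%
  dplyr::mutate(ward = stringr::str_sub(PT_LNG_CD, 1, 2), # assumed positions
                area = stringr::str_sub(PT_LNG_CD, 3, 5)) %>%
  dplyr::left_join(spread_votes, by = c("ward", "area"))
```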
Okay, we’re finally there. We have our census tract data in to_census_tracts and our voting data in to_geo_votes. We want to aggregate the votes into each census tract by summing the votes at each voting location within each census tract. We use the aggregate function for this.
ct_votes_wide <- aggregate(x = to_geo_votes,
                           by = to_census_tracts,
                           FUN = sum)
ct_votes_wide
As a last step, to tidy up, we now convert the wide table with a column for each candidate into a long table that has just one candidate column containing the name of the candidate.
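The wide-to-long conversion is a single tidyr::gather; the result name and candidate columns here are assumptions:

```r
library(tidyr)

# One row per census tract per candidate
ct_votes <- ct_votes_wide %>%
  tidyr::gather(candidate, votes, Keesmaat, Tory)
```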
Now that we have votes aggregated by census tract, we can add in many other attributes from the census data. We won’t do that here, since this post is already pretty long. But, we’ll end with a plot to show how easily sf integrates with ggplot2. This is a nice improvement from past workflows, when several steps were required. In the actual code for the retrospective analysis, I added some other plotting techniques, like cutting the response variable (votes) into equally spaced pieces and adding some more refined labels. Here, we’ll just produce a simple plot.
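A minimal version of that plot, assuming the long table described above is called `ct_votes`; geom_sf reads the geometry column directly:

```r
library(ggplot2)

ggplot(ct_votes) +
  geom_sf(aes(fill = votes)) +  # sf geometry handled natively by ggplot2
  facet_wrap(~candidate) +
  scale_fill_viridis_c() +
  theme_minimal()
```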
Our predictions for the 2018 mayoral race in Toronto were generated by our new agent-based model that used demographic characteristics and results of previous elections.
Now that the final results are available, we can see how our predictions performed at the census tract level.
For this analysis, we restrict the comparison to just Tory and Keesmaat, as they were the only two major candidates and the only two for which we estimated vote share. Given this, we start by just plotting the difference between the actual votes and the predicted votes for Keesmaat. The distribution for Tory is simply the mirror image, since their combined share of votes always equals 100%.
Distribution of the difference between the predicted and actual proportion of votes for Keesmaat
The mean difference from the actual results for Keesmaat is -6%, which means that, on average, we slightly overestimated support for Keesmaat. However, as the histogram shows, there is significant variation in this difference across census tracts with the differences slightly skewed towards overestimating Keesmaat’s support.
To better understand this variation, we can look at a plot of the geographical distribution of the differences. In this figure, we show both Keesmaat and Tory. Although the plots are just inverted versions of each other (since the proportion of votes always sums to 100%), seeing them side by side helps illuminate the geographical structure of the differences.
The distribution of the difference between the predicted and actual proportion of votes by census tract
The overall distribution of differences doesn’t have a clear geographical bias. In some sense, this is good, as it shows our agent-based model isn’t systematically biased to any particular census tract. Rather, refinements to the model will improve accuracy across all census tracts.
We’ll write details about our new agent-based approach soon. In the meantime, these results show that the approach has promise, given that it used only a few demographic characteristics and no polls. Now we’re particularly motivated to gather up much more data to enrich our agents’ behaviour and make better predictions.
Thanks to generous support, the 4th Axe Pancreatic Cancer fundraiser was a great success. We raised over $32K this year and all funds support the PancOne Network. So far, we’ve raised close to $120K in honour of my Mom. Thanks to everyone that has supported this important cause!
It’s been a while since we last posted – largely for personal reasons, but also because we wanted to take some time to completely retool our approach to modeling elections.
In the past, we’ve tried a number of statistical approaches. Because every election is quite different to its predecessors, this proved unsatisfactory – there are simply too many things that change which can’t be effectively measured in a top-down view. Top-down approaches ultimately treat people as averages. But candidates and voters do not behave like averages; they have different desires and expectations.
We know there are diverse behaviours that need to be modeled at the person-level. We also recognize that an election is a system of diverse agents, whose behaviours affect each other. For example, a candidate can gain or lose support by doing nothing, depending only on what other candidates do. Similarly, a candidate or voter will behave differently simply based on which candidates are in the race, even without changing any beliefs. In the academic world, the aggregated results of such behaviours are called “emergent properties”, and the ability to predict such outcomes is extremely difficult if looking at the system from the top down.
So we needed to move to a bottom-up approach that would allow us to model agents heterogeneously, and that led us to what is known as agent-based modeling.
Agent-based modeling and elections
Agent-based models employ individual heterogeneous “agents” that are interconnected and follow behavioural rules defined by the modeler. Due to their non-linear approach, agent-based models have been used extensively in military games, biology, transportation planning, operational research, ecology, and, more recently, in economics (where huge investments are being made).
While we’ll write more on this in the coming weeks, we define voters’ and candidates’ behaviour using parameters, and “train” them (i.e., setting those parameters) based on how they behaved in previous elections. For our first proof of concept model, we have candidate and voter agents with two-variable issues sets (call the issues “economic” and “social”) – each with a positional score of 0 to 100. Voters have political engagement scores (used to determine whether they cast a ballot), demographic characteristics based on census data, likability scores assigned to each candidate (which include anything that isn’t based on issues, from name recognition to racial or sexual bias), and a weight for how important likability is to that voter. Voters also track, via polls, the likelihood that a candidate can win. This is important for their “utility function” – that is, the calculation that defines which candidate a voter will choose, if they cast a ballot at all. For example, a candidate that a voter may really like, but who has no chance of winning, may not get the voter’s ultimate vote. Instead, the voter may vote strategically.
On the other hand, candidates simply seek votes. Each candidate looks at the polls and asks 1) am I a viable candidate?; and 2) how do I change my positions to attract more voters? (For now, we don’t provide them a way to change their likability.) Candidates that have a chance of winning move a random angle from their current position, based on how “flexible” they are on their positions. If that move works (i.e., moves them up in the polls), they move randomly in the same general direction. If the move hurt their standings in the polls, they turn around and go randomly in the opposite general direction. At some point, the election is held – that is, the ultimate poll – and we see who wins.
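The movement rule described above can be sketched as a small function; everything here (names, step size, turning angles) is illustrative rather than the actual model code:

```r
# One polling-cycle move for a candidate on the two-issue (0-100) plane
move_candidate <- function(position, heading, improved_in_polls, flexibility = 5) {
  if (!improved_in_polls) {
    heading <- heading + pi                        # last move hurt: turn around
  }
  heading <- heading + runif(1, -pi / 4, pi / 4)   # keep the same general direction
  new_position <- position + flexibility * c(cos(heading), sin(heading))
  pmin(pmax(new_position, 0), 100)                 # stay on the 0-100 issue scales
}
```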
This approach allows us to run elections with different candidates, change a candidate’s likability, introduce shocks (e.g., candidates changing positions on an issue) and, eventually, see how different voting systems might impact who gets elected (foreshadowing future work.)
We’re not the first to apply agent-based modeling in psephology by any stretch (there are many in the academic world using it to explain observed behaviours), but we haven’t found any attempting to do so to predict actual elections.
Applying this to the Toronto 2018 Mayoral Race
First, Toronto voters have, over the last few elections, voted somewhat more right-wing than might have been predicted. The average positions across the city for the 2003, 2006, 2010, and 2014 elections look like this:
This doesn’t mean that Toronto voters are themselves more right-wing than might be expected, just that they voted this way. This is in fact the first interesting outcome of our new approach. We find that about 30% of Toronto voters’ choices have been based on candidate likability, and that for the more right-wing candidates, likability has been a major reason for choosing them. For example, in 2010, Rob Ford’s likability score was significantly higher than those of his major competitors (George Smitherman and Joe Pantalone). This isn’t to say that everyone liked Rob Ford – but those that did vote for him cared more about something other than issues, at least relative to those who voted for his opponents.
For 2018, likability is less of a differentiating factor, with both major candidates (John Tory and Jennifer Keesmaat) scoring about the same on this factor. Nor are the issues – Ms. Keesmaat’s positions don’t seem to be hurting her standing in the polls, as she’s staked out a strong position left of centre on both issues. What appears to be the bigger factor this time around is the early probabilities assigned by voters to Ms. Keesmaat’s chance of victory, a point that seems to have been part of the actual Tory campaign’s strategy. Having not been seen as a major threat to John Tory by much of the city, that narrative became self-reinforcing. Further, John Tory’s positions are relatively more centrist in 2018 than they were in 2014, when he had a markedly viable right-wing opponent in Doug Ford. (To prove the point of this approach’s value, we could simply introduce a right-wing candidate and see what happens…)
Thus, our predictions don’t appear to be wildly different from current polls (with Tory winning nearly 2-to-1), and map as follows:
There will be much more to say on this, and much more we can do going forward, but for a proof of concept, we think this approach has enormous promise.
In my Elections Ontario official results post, I had to use an ugly hack to match Electoral District names and numbers by extracting data from a drop down list on the Find My Electoral District website. Although it was mildly clever, like any hack, I shouldn’t have relied on this one for long, as proven by Elections Ontario shutting down the website.
So, a more robust solution was required, which led to using one of Election Ontario’s shapefiles. The shapefile contains the data we need, it’s just in a tricky format to deal with. But, the sf package makes this mostly straightforward.
We start by downloading and importing the Elections Ontario shape file. Then, since we’re only interested in the City of Toronto boundaries, we download the city’s shapefile too and intersect it with the provincial one to get a subset:
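A sketch of those two steps; the file paths and layer names are assumptions, not the actual Elections Ontario or City of Toronto filenames:

```r
library(sf)
library(dplyr)

# Provincial electoral district boundaries (all of Ontario)
provincial_eds <- sf::read_sf("data-raw/ELECTORAL_DISTRICT.shp") %>%
  sf::st_transform(crs = 4326)

# City of Toronto boundary
toronto_boundary <- sf::read_sf("data-raw/toronto_boundary.shp") %>%
  sf::st_transform(crs = 4326)

# Keep just the portions of districts that fall within Toronto
to_electoral_districts <- sf::st_intersection(provincial_eds, toronto_boundary)
```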
Now we just need to extract a couple of columns from the data frame associated with the shapefile. Then we process the values a bit so that they match the format of other data sets. This includes converting them to UTF-8, formatting as title case, and replacing dashes with spaces:
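A sketch of the cleanup, where the source column names (and the intersected object's name) are assumptions:

```r
library(dplyr)
library(stringr)

electoral_districts <- to_electoral_districts %>%     # the Toronto subset
  dplyr::transmute(
    electoral_district = ED_ID,                       # column names are assumptions
    electoral_district_name = ELECTORAL_DISTRICT_NAME %>%
      iconv(to = "UTF-8") %>%
      stringr::str_to_title() %>%
      stringr::str_replace_all("-", " ")              # replace dashes with spaces
  )
```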
## Simple feature collection with 23 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -79.61919 ymin: 43.59068 xmax: -79.12511 ymax: 43.83057
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 23 x 3
## electoral_distri… electoral_distric… geometry
##
## 1 005 Beaches East York (-79.32736 43.69452, -79.32495 43…
## 2 015 Davenport (-79.4605 43.68283, -79.46003 43.…
## 3 016 Don Valley East (-79.35985 43.78844, -79.3595 43.…
## 4 017 Don Valley West (-79.40592 43.75026, -79.40524 43…
## 5 020 Eglinton Lawrence (-79.46787 43.70595, -79.46376 43…
## 6 023 Etobicoke Centre (-79.58697 43.6442, -79.58561 43.…
## 7 024 Etobicoke Lakesho… (-79.56213 43.61001, -79.5594 43.…
## 8 025 Etobicoke North (-79.61919 43.72889, -79.61739 43…
## 9 068 Parkdale High Park (-79.49944 43.66285, -79.4988 43.…
## 10 072 Pickering Scarbor… (-79.18898 43.80374, -79.17927 43…
## # ... with 13 more rows
In the end, this is a much more reliable solution, though it seems a bit extreme to use GIS techniques just to get a listing of Electoral District names and numbers.
The commit with most of these changes in toVotes is here.
In preparing for some PsephoAnalytics work on the upcoming provincial election, I’ve been wrangling the Elections Ontario data. As provided, the data is really difficult to work with and we’ll walk through some steps to tidy these data for later analysis.
Here’s what the source data looks like:
Screenshot of raw Elections Ontario data
A few problems with this:
The data is scattered across a hundred different Excel files
Candidates are in columns with their last name as the header
Last names are not unique across all Electoral Districts, so can’t be used as a unique identifier
Electoral District names are in a row, followed by a separate row for each poll within the district
The party affiliation for each candidate isn’t included in the data
So, we have a fair bit of work to do to get to something more useful. Ideally something like:
## # A tibble: 9 x 5
## electoral_district poll candidate party votes
## <chr> <chr> <chr> <chr> <int>
## 1 X 1 A Liberal 37
## 2 X 2 B NDP 45
## 3 X 3 C PC 33
## 4 Y 1 A Liberal 71
## 5 Y 2 B NDP 37
## 6 Y 3 C PC 69
## 7 Z 1 A Liberal 28
## 8 Z 2 B NDP 15
## 9 Z 3 C PC 34
This is much easier to work with: we have one row for the votes received by each candidate at each poll, along with the Electoral District name and their party affiliation.
Candidate parties
As a first step, we need the party affiliation for each candidate. I didn’t see this information on the Elections Ontario site. So, we’ll pull the data from Wikipedia. The data on this webpage isn’t too bad. We can just use the table xpath selector to pull out the tables and then drop the ones we aren’t interested in.
```
candidate_webpage <- "https://en.wikipedia.org/wiki/Ontario_general_election,_2014#Candidates_by_region"
candidate_tables <- "table" # Use an xpath selector to get the drop down list by ID
candidates <- xml2::read_html(candidate_webpage) %>%
rvest::html_nodes(candidate_tables) %>% # Pull tables from the wikipedia entry
.[13:25] %>% # Drop unnecessary tables
rvest::html_table(fill = TRUE)
```
<p>This gives us a list of 13 data frames, one for each of the regional candidate tables we kept. Now we cycle through each of these and stack them into one data frame. Unfortunately, the tables aren’t consistent in the number of columns. So, the approach is a bit messy and we process each one in a loop.</p>
<pre class="r"><code># Setup empty dataframe to store results
candidate_parties <- tibble::tibble(
  electoral_district_name = character(),
  party = character(),
  candidate = character()
)
for(i in seq_along(candidates)) { # Messy, but works
this_table <- candidates[[i]]
# The header spans mess up the header row, so renaming
names(this_table) <- c(this_table[1,-c(3,4)], "NA", "Incumbent")
# Get rid of the blank spacer columns
this_table <- this_table[-1, ]
# Drop the NA columns by keeping only odd columns
this_table <- this_table[,seq(from = 1, to = dim(this_table)[2], by = 2)]
this_table %<>%
tidyr::gather(party, candidate, -`Electoral District`) %>%
dplyr::rename(electoral_district_name = `Electoral District`) %>%
dplyr::filter(party != "Incumbent")
candidate_parties <- dplyr::bind_rows(candidate_parties, this_table)
}
candidate_parties</code></pre>
<pre>
# A tibble: 649 x 3
electoral_district_name party candidate
1 Carleton—Mississippi Mills Liberal Rosalyn Stevens
2 Nepean—Carleton Liberal Jack Uppal
3 Ottawa Centre Liberal Yasir Naqvi
4 Ottawa—Orléans Liberal Marie-France Lalonde
5 Ottawa South Liberal John Fraser
6 Ottawa—Vanier Liberal Madeleine Meilleur
7 Ottawa West—Nepean Liberal Bob Chiarelli
8 Carleton—Mississippi Mills PC Jack MacLaren
9 Nepean—Carleton PC Lisa MacLeod
10 Ottawa Centre PC Rob Dekker
# … with 639 more rows
</pre>
</div>
<div id="electoral-district-names" class="section level2">
<h2>Electoral district names</h2>
<p>One issue with pulling party affiliations from Wikipedia is that candidates are organized by Electoral District <em>names</em>. But the voting results are organized by Electoral District <em>number</em>. I couldn’t find an appropriate resource on the Elections Ontario site. Rather, here we pull the names and numbers of the Electoral Districts from the <a href="https://www3.elections.on.ca/internetapp/FYED_Error.aspx?lang=en-ca">Find My Electoral District</a> website. The xpath selector is a bit tricky for this one. The <code>ed_xpath</code> object below actually pulls content from the drop down list that appears when you choose an Electoral District. One nuisance with these data is that Elections Ontario uses <code>--</code> in the Electoral District names, instead of the — used on Wikipedia. We use <code>str_replace_all</code> to fix this below.</p>
<pre class="r"><code>ed_webpage <- "https://www3.elections.on.ca/internetapp/FYED_Error.aspx?lang=en-ca"
ed_xpath <- "//*[(@id = \"ddlElectoralDistricts\")]" # Use an xpath selector to get the drop down list by ID
electoral_districts <- xml2::read_html(ed_webpage) %>%
rvest::html_node(xpath = ed_xpath) %>%
rvest::html_nodes("option") %>%
rvest::html_text() %>%
.[-1] %>% # Drop the first item on the list ("Select...")
tibble::as.tibble() %>% # Convert to a data frame and split into ID number and name
tidyr::separate(value, c("electoral_district", "electoral_district_name"),
sep = " ",
extra = "merge") %>%
# Clean up district names for later matching and presentation
dplyr::mutate(electoral_district_name = stringr::str_to_title(
stringr::str_replace_all(electoral_district_name, "--", "—")))
electoral_districts</code></pre>
<pre>
# A tibble: 107 x 2
electoral_district electoral_district_name
1 001 Ajax—Pickering
2 002 Algoma—Manitoulin
3 003 Ancaster—Dundas—Flamborough—Westdale
4 004 Barrie
5 005 Beaches—East York
6 006 Bramalea—Gore—Malton
7 007 Brampton—Springdale
8 008 Brampton West
9 009 Brant
10 010 Bruce—Grey—Owen Sound
# … with 97 more rows
</pre>
<p>Next, we can join the party affiliations to the Electoral District names to join candidates to parties and district numbers.</p>
<pre class="r"><code>candidate_parties %<>%
# These three lines are cleaning up hyphens and dashes, seems overly complicated
dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name, "—\n", "—")) %>%
dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name,
"Chatham-Kent—Essex",
"Chatham—Kent—Essex")) %>%
dplyr::mutate(electoral_district_name = stringr::str_to_title(electoral_district_name)) %>%
dplyr::left_join(electoral_districts) %>%
dplyr::filter(!candidate == "") %>%
# Since the vote data are identified by last names, we split candidate's names into first and last
tidyr::separate(candidate, into = c("first","candidate"), extra = "merge", remove = TRUE) %>%
dplyr::select(-first)</code></pre>
<pre><code>## Joining, by = "electoral_district_name"</code></pre>
<pre class="r"><code>candidate_parties</code></pre>
<pre>
# A tibble: 578 x 4
electoral_district_name party candidate electoral_district
*
1 Carleton—Mississippi Mills Liberal Stevens 013
2 Nepean—Carleton Liberal Uppal 052
3 Ottawa Centre Liberal Naqvi 062
4 Ottawa—Orléans Liberal France Lalonde 063
5 Ottawa South Liberal Fraser 064
6 Ottawa—Vanier Liberal Meilleur 065
7 Ottawa West—Nepean Liberal Chiarelli 066
8 Carleton—Mississippi Mills PC MacLaren 013
9 Nepean—Carleton PC MacLeod 052
10 Ottawa Centre PC Dekker 062
# … with 568 more rows
</pre>
<p>All that work just to get the name of each candidate for each Electoral District name and number, plus their party affiliation.</p>
</div>
<div id="votes" class="section level2">
<h2>Votes</h2>
<p>Now we can finally get to the actual voting data. These are made available as a collection of Excel files in a compressed folder. To avoid downloading it more than once, we wrap the call in an <code>if</code> statement that first checks to see if we already have the file. We also rename the file to something more manageable.</p>
<pre class="r"><code>raw_results_file <- "http://www.elections.on.ca/content/dam/NGW/sitecontent/2017/results/Poll%20by%20Poll%20Results%20-%20Excel.zip"
zip_file <- "data-raw/Poll%20by%20Poll%20Results%20-%20Excel.zip"
if(file.exists(zip_file)) { # Only download the data once
# File exists, so nothing to do
} else {
download.file(raw_results_file,
destfile = zip_file)
unzip(zip_file, exdir="data-raw") # Extract the data into data-raw
file.rename("data-raw/GE Results - 2014 (unconverted)", "data-raw/pollresults")
}</code></pre>
<pre><code>## NULL</code></pre>
<p>Now we need to extract the votes out of 107 Excel files. The combination of <code>purrr</code> and <code>readxl</code> packages is great for this. In case we want to filter to just a few of the files (perhaps to target a range of Electoral Districts), we declare a <code>file_pattern</code>. For now, we just set it to any xls file that ends with three digits preceded by a "_".</p>
<p>As we read in the Excel files, we clean up lots of blank columns and headers. Then we convert to a long table and drop total and blank rows. Also, rather than try to align the Electoral District name rows with their polls, we use the name of the Excel file to pull out the Electoral District number. Then we join with the <code>electoral_districts</code> table to pull in the Electoral District names.</p>
<pre class="r">
file_pattern <- "*_[[:digit:]]{3}.xls" # Can use this to filter down to specific files
poll_data <- list.files(path = "data-raw/pollresults", pattern = file_pattern, full.names = TRUE) %>% # Find all files that match the pattern
  purrr::set_names() %>%
  purrr::map_df(readxl::read_excel, sheet = 1, col_types = "text", .id = "file") %>% # Import each file and merge into a dataframe
  # Specifying sheet = 1 just to be clear we're ignoring the rest of the sheets
  # Declare col_types since there are duplicate surnames and map_df can't recast column types in the rbind
  # For example, Bell is in both district 014 and 063
  dplyr::select(-starts_with("X__")) %>% # Drop all of the blank columns
  dplyr::select(1:2, 4:8, 15:dim(.)[2]) %>% # Reorganize a bit and drop unneeded columns
  dplyr::rename(poll_number = `POLL NO.`) %>%
  tidyr::gather(candidate, votes, -file, -poll_number) %>% # Convert to a long table
  dplyr::filter(!is.na(votes),
                poll_number != "Totals") %>%
  dplyr::mutate(electoral_district = stringr::str_extract(file, "[[:digit:]]{3}"),
                votes = as.numeric(votes)) %>%
  dplyr::select(-file) %>%
  dplyr::left_join(electoral_districts)
poll_data
</pre>
<p>The only thing left to do is to join <code>poll_data</code> with <code>candidate_parties</code> to add party affiliation to each candidate. Because the names don’t always exactly match between these two tables, we use the <code>fuzzyjoin</code> package to join by closest spelling.</p>
<pre class="r"><code>poll_data_party_match_table <- poll_data %>%
group_by(candidate, electoral_district_name) %>%
summarise() %>%
fuzzyjoin::stringdist_left_join(candidate_parties,
ignore_case = TRUE) %>%
dplyr::select(candidate = candidate.x,
party = party,
electoral_district = electoral_district) %>%
dplyr::filter(!is.na(party))
poll_data %<>%
dplyr::left_join(poll_data_party_match_table) %>%
dplyr::group_by(electoral_district, party)
tibble::glimpse(poll_data)</code></pre>
<pre>
</pre>
<p>And, there we go. One table with a row for the votes received by each candidate at each poll. It would have been great if Elections Ontario released data in this format and we could have avoided all of this work.</p>
</div>
Analyzing these data was a great case study for the typical data management process. The data was structured for presentation, rather than analysis. So, there were several header rows, notes at the base of the table, and the data was spread across many worksheets.
Sometime recently, the ministry released an update that provides the data in a much better format: one sheet with rows for age and columns for years. Although this is a great improvement, I’ve had to update my case study, which makes it actually less useful as a lesson in data manipulation.
Although I’ve updated the main branch of the github repository, I’ve also created a branch that sources the archive.org version of the page from October 2016. Now, depending on the audience, I can choose the case study that has the right level of complexity.
Despite briefly causing me some trouble, I think it is great that these data are closer to a good analytical format. Now, if only the ministry could go one more step towards tidy data and make my case study completely unnecessary.
Sometimes it’s the small things, accumulated over many days, that make a difference. As a simple example, every day when I leave the office, I message my family to let them know I’m leaving and how I’m travelling. Relatively easy: just open the Messages app, find the most recent conversation with them, and type in my message.
Using Workflow I can get this down to just a couple of taps on my watch. By choosing the “Leaving Work” workflow, I get a choice of travelling options:
Choosing one of them creates a text with the right emoticon that is pre-addressed to my family. I hit send and off goes the message.
The workflow itself is straightforward:
Like I said, pretty simple. But saves me close to a minute each and every day.
Many people took issue with the fact that these values weren’t adjusted for income. Seems to me that whether this is a good idea or not depends on what kind of question you’re trying to answer. Regardless, the CANSIM table includes this value. So, it is straightforward to calculate. Plus CANSIM tables have a pretty standard structure and showing how to manipulate this one serves as a good template for others.
library(tidyverse)
# Download and extract
url <- "http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/01110001-eng.zip"
zip_file <- "01110001-eng.zip"
download.file(url,
destfile = zip_file)
unzip(zip_file)
# We only want two of the columns. Specifying them here.
keep_data <- c("Median donations (dollars)",
"Median total income of donors (dollars)")
cansim <- read_csv("01110001-eng.csv") %>%
filter(DON %in% keep_data,
is.na(`Geographical classification`)) %>% # This second filter removes anything that isn't a province or territory
select(Ref_Date, DON, Value, GEO) %>%
spread(DON, Value) %>%
rename(year = Ref_Date,
donation = `Median donations (dollars)`,
income = `Median total income of donors (dollars)`) %>%
mutate(donation_per_income = donation / income) %>%
filter(year == 2015) %>%
select(GEO, donation, donation_per_income)
cansim
## # A tibble: 16 x 3
## GEO donation donation_per_income
##
## 1 Alberta 450 0.006378455
## 2 British Columbia 430 0.007412515
## 3 Canada 300 0.005119454
## 4 Manitoba 420 0.008032129
## 5 New Brunswick 310 0.006187625
## 6 Newfoundland and Labrador 360 0.007001167
## 7 Non CMA-CA, Northwest Territories 480 0.004768528
## 8 Non CMA-CA, Yukon 310 0.004643499
## 9 Northwest Territories 400 0.003940887
## 10 Nova Scotia 340 0.006505932
## 11 Nunavut 570 0.005651398
## 12 Ontario 360 0.005856515
## 13 Prince Edward Island 400 0.008221994
## 14 Quebec 130 0.002452830
## 15 Saskatchewan 410 0.006910501
## 16 Yukon 420 0.005695688
Curious that they dropped the territories from their chart, given that Nunavut has such a high donation amount.
Now we can plot the normalized data to find how the rank order changes. We’ll add the Canadian average as a blue line for comparison.
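The plotting step might look roughly like this (a sketch; the few rows below are copied from the table above as a stand-in for the full `cansim` data frame built in the pipeline):

```r
library(tidyverse)

# Illustrative subset copied from the table above; in practice,
# use the full `cansim` data frame from the pipeline
cansim <- tribble(
  ~GEO,               ~donation, ~donation_per_income,
  "Alberta",                450,          0.006378455,
  "British Columbia",       430,          0.007412515,
  "Canada",                 300,          0.005119454,
  "Quebec",                 130,          0.002452830
)

# Pull out the Canada-wide value to use as the blue reference line
canada_avg <- cansim %>%
  filter(GEO == "Canada") %>%
  pull(donation_per_income)

# Rank provinces by income-normalized donations
p <- cansim %>%
  filter(GEO != "Canada") %>%
  ggplot(aes(x = reorder(GEO, donation_per_income),
             y = donation_per_income)) +
  geom_col() +
  geom_hline(yintercept = canada_avg, colour = "blue") +
  coord_flip() +
  labs(x = NULL, y = "Median donation / median donor income")
p
```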
I’m not comfortable with using median donations (adjusted for income or not) to say anything in particular about the residents of a province. But, I’m always happy to look more closely at data and provide some context for public debates.
One major gap with this type of analysis is that we’re only looking at the median donations of people that donated anything at all. In other words, we aren’t considering anyone who donates nothing. We should really compare these median donations to the total population or the size of the economy. This Stats Can study is a much more thorough look at the issue.
For me the interesting result here is the dramatic difference between Quebec and the rest of the provinces. But, I don’t interpret this to mean that Quebecers are less generous than the rest of Canada. Seems more likely that there are material differences in how the Quebec economy and social safety nets are structured.
The TTC releasing their Subway Delay Data was great news. I’m always happy to see more data released to the public. In this case, it also helps us investigate one of the great, enduring mysteries: Is Friday the 13th actually an unlucky day?
As always, we start by downloading and manipulating the data. I’ve added in two steps that aren’t strictly necessary. One is converting the Date, Time, and Day columns into a single Date column. The other is to drop most of the other columns of data, since we aren’t interested in them here.
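The preparation described above might be sketched like this (the column names, including `Min Delay`, are assumptions based on the description; the toy rows stand in for the downloaded TTC file):

```r
library(tidyverse)
library(lubridate)

# Illustrative rows in the shape of the TTC file; in practice,
# read the downloaded spreadsheet here instead
raw <- tribble(
  ~Date,        ~Time,   ~Day,        ~`Min Delay`,
  "2014-01-01", "00:21", "Wednesday",            5,
  "2014-01-01", "02:06", "Wednesday",            0
)

delays <- raw %>%
  # Combine the separate Date and Time columns into a single date column
  mutate(date = ymd_hm(paste(Date, Time))) %>%
  # Drop everything except the columns the analysis needs
  transmute(date, delay = `Min Delay`)
```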
Now we have a delays dataframe with 69043 incidents starting from 2014-01-01 00:21:00 and ending at 2017-04-30 22:13:00. Before we get too far, we’ll take a look at the data. A heatmap of delays by day and hour should give us some perspective.
delays %>%
  dplyr::mutate(day = lubridate::day(date),
                hour = lubridate::hour(date)) %>%
  dplyr::group_by(day, hour) %>%
  dplyr::summarise(sum_delay = sum(delay)) %>%
  ggplot2::ggplot(aes(x = hour, y = day, fill = sum_delay)) +
  ggplot2::geom_tile(alpha = 0.8, color = "white") +
  ggplot2::scale_fill_gradient2() +
  ggplot2::theme(legend.position = "right") +
  ggplot2::labs(x = "Hour", y = "Day of the month", fill = "Sum of delays")
Other than a reliable band of calm very early in the morning, no obvious patterns here.
We need to identify any days that are a Friday the 13th. We also might want to compare weekends, regular Fridays, other weekdays, and Friday the 13ths, so we add a type column that provides these values. Here we use the case_when function:
delays <- delays %>%
  dplyr::mutate(type = case_when( # Partition into Friday the 13ths, Fridays, weekends, and weekdays
    lubridate::wday(.$date) %in% c(1, 7) ~ "weekend",
    lubridate::wday(.$date) %in% c(6) &
      lubridate::day(.$date) == 13 ~ "Friday 13th",
    lubridate::wday(.$date) %in% c(6) ~ "Friday",
    TRUE ~ "weekday" # Everything else is a weekday
  )) %>%
  dplyr::mutate(type = factor(type)) %>%
  dplyr::group_by(type)
delays %>%
  dplyr::summarise(total_delay = sum(delay))
## # A tibble: 4 x 2
##          type total_delay
##        <fctr>       <dbl>
## 1      Friday       18036
## 2 Friday 13th         619
## 3     weekday       78865
## 4     weekend       28194
Clearly the total minutes of delays are much shorter for Friday the 13ths. But, there aren’t very many such days (only 6 in fact). So, this is a dubious analysis.
Let’s take a step back and calculate the average of the total delay across the entire day for each of the types of days. If Friday the 13ths really are unlucky, we would expect to see longer delays, at least relative to a regular Friday.
daily_delays <- delays %>% # Total delays in a day
  dplyr::mutate(year = lubridate::year(date),
                day = lubridate::yday(date)) %>%
  dplyr::group_by(year, day, type) %>%
  dplyr::summarise(total_delay = sum(delay))
mean_daily_delays <- daily_delays %>% # Average delays in each type of day
  dplyr::group_by(type) %>%
  dplyr::summarise(avg_delay = mean(total_delay))
mean_daily_delays
## # A tibble: 4 x 2
##          type avg_delay
##        <fctr>     <dbl>
## 1      Friday 107.35714
## 2 Friday 13th 103.16667
## 3     weekday 113.63833
## 4     weekend  81.01724
ggplot2::ggplot(daily_delays, aes(type, total_delay)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x = "Type", y = "Total minutes of delay")
On average, Friday the 13ths have shorter total delays (103 minutes) than either regular Fridays (107 minutes) or other weekdays (114 minutes). Overall, weekend days have far shorter total delays (81 minutes).
If Friday the 13ths are unlucky, they certainly aren’t causing longer TTC delays.
For the statisticians among you that still aren’t convinced, we’ll run a basic linear model to compare Friday the 13ths with regular Fridays. This should control for many unmeasured variables.
model <- lm(total_delay ~ type, data = daily_delays,
            subset = type %in% c("Friday", "Friday 13th"))
summary(model)
## 
## Call:
## lm(formula = total_delay ~ type, data = daily_delays, subset = type %in% 
##     c("Friday", "Friday 13th"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.357  -30.357   -6.857   18.643  303.643 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      107.357      3.858  27.829   <2e-16 ***
## typeFriday 13th   -4.190     20.775  -0.202     0.84    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50 on 172 degrees of freedom
## Multiple R-squared: 0.0002365,  Adjusted R-squared: -0.005576 
## F-statistic: 0.04069 on 1 and 172 DF,  p-value: 0.8404
Definitely no statistical support for the idea that Friday the 13ths cause longer TTC delays.
How about time series tests, like anomaly detections? Seems like we’d just be getting carried away. Part of the art of statistics is knowing when to quit.
In conclusion, then, after likely far too much analysis, we find no evidence that Friday the 13ths cause an increase in the length of TTC delays. This certainly suggests that Friday the 13ths are not unlucky in any meaningful way, at least for TTC riders.
Thank you to all the participants, donors, and volunteers for making the third Axe Pancreatic Cancer event such a great success! Together we’re raising awareness and funding to support Pancreatic Cancer Canada.
The day after an historic landslide electoral victory for the Liberal Party of Canada, we’ve compared our predictions (and those of other organizations who provide riding-level predictions) to the actual results in Toronto.
Before getting to the details, we thought it important to highlight that while the methodologies of the other organizations differ, they are all based on tracking sentiments as the campaign unfolds. So, most columns in the table below will differ slightly from the one in our previous post as such sentiments change day to day.
This is fundamentally different from our modelling approach, which utilizes voter and candidate characteristics, and therefore could be applied to predict the results of any campaign before it even begins. (The primary assumption here is that individual voters behave in a consistent way but vote differently from election to election as they are presented with different inputs to their decision-making calculus.) We hope the value of this is obvious.
Now, on to the results! The final predictions of all organizations and the actual results were as follows:
To start with, our predictions included many more close races than the others: while we predicted average margins of victory of about 10 points, the others were predicting averages well above that (ranging from around 25 to 30 points). The actual results fell in between at around 20 points.
Looking at specific races, we did better than the others at predicting close races in York Centre and Parkdale-High Park, where the majority predicted strong Liberal wins. Further, while everyone was wrong in Toronto-Danforth (which went Liberal by only around 1,000 votes), we predicted the smallest margin of victory for the NDP. On top of that, we were as good as the others in six ridings, meaning that we were at least as good as poll tracking in 9 out of 25 ridings (and would have been 79 days ago, before the campaign started, despite the polls changing up until the day before the election).
But that means we did worse in the other ridings, particularly Toronto Centre (where our model was way off), and a handful of races that the model said would be close but ended up being strong Liberal wins. While we need to undertake much more detailed analysis (once Elections Canada releases such details), the “surprise” in many of these cases was the extent to which voters, who might normally vote NDP, chose to vote Liberal this time around (likely a coalescence of “anti-Harper” sentiment).
Overall, we are pleased with how the model stood up, and know that we have more work to do to improve our accuracy. This will include more data and more variables that influence voters’ decisions. Thankfully, we now have a few years before the next election…
Well, it is now only days until the 42nd Canadian election, and we have come a long way since this long campaign started. Based on our analyses to date of voter and candidate characteristics, we can now provide riding-level predictions. As we keep saying, we have avoided the use of polls, so these present more of an experiment than anything else. Nonetheless, we’ve put them beside the predictions of five other organizations (as of the afternoon of 15 October 2015), specifically:
(We’ll note that the last doesn’t provide the likelihood of a win, so isn’t colour-coded below, but does provide additional information for our purposes here.)
You’ll see that we’re predicting more close races than all the others combined, and more “leaning” races. In fact, the average margins of victory from 308, Vox Pop, and Too Close to Call are 23%, 26%, and 23% respectively, which sounds high. Nonetheless, the two truly notable differences we’re predicting are in Eglinton-Lawrence, where the consensus is that finance minister Joe Oliver will lose badly (we predict he might win) and Toronto Centre, where Bill Morneau is predicted to easily beat Linda McQuaig (we predict the opposite).
Anyway, we’re excited to see how these predictions look come Monday, and we’ll come back after the election with an analysis of our performance.
We’ve started looking into what might be a natural cycle between governing parties, which may account for some of our differences to the polls that we’ve seen. The terminology often heard is “time for a change” – and this sentiment, while very difficult to include in voter characteristics, is possible to model as a high level risk to governing parties.
To start, we reran our predictions with an incumbent-year interaction, to see if the incumbency bonus changed over time. Turns out it does – incumbency effect declines over time. But it is difficult to determine, from only a few years of data, whether we’re simply seeing a reversion to the mean. So we need more data – and likely at a higher level.
Let’s start with the proportion of votes received by each of today’s three major parties (or their predecessors – whose names we’ll simply substitute with modern party names), with trend lines, in every federal election since Confederation:
This chart shows that the Liberal & Conservative trend lines are essentially the same, and that the two parties effectively cycle as the governing party around this shared trend line.
Prior to a noticeable 3rd party (i.e., the NDP starting in the 1962 election and its predecessor Co-operative Commonwealth Federation starting in the 1935 election) the Liberals and Conservatives effectively flipped back and forth in terms of governing (6 times over 68 years), averaging around 48% of the vote each. Since then, the flip has continued (10 more times over the following 80 years), and the median proportion of votes for Liberals, Conservatives, and NDP has been 41%/35%/16% respectively.
Further, since 1962, the Liberals have been very slowly losing support (about 0.25 points per election), while the other two parties have been very slowly gaining it (about 0.05 points per election), though there has been considerable variation across each election, making this slightly harder to use in predictions. (We’ll look into including this in our risk modeling).
Next, we looked at some stats about governing:
In the 148.4 years since Sir John A. Macdonald was first sworn in, there have been 27 PM-ships (though only 22 PMs), for an average length of 5.5 years (though 4.3 years for Conservatives and 6.9 years for Liberals).
Parties often string a couple PMs together - so the PM-ship has only switched parties 16 times with an average length of 8.7 years (or 7.2 Cons vs. 10.4 Libs).
Only two PMs have won four consecutive elections (Macdonald and Laurier), with four more having won three (Mackenzie King, Diefenbaker, Trudeau, and Chrétien) prior to Harper.
All of these stats would suggest that Harper is due for a loss: he has been the sole PM for his party for 9.7 years, which is over twice his party’s average length for a PM-ship. He’s also second all-time behind Macdonald in a consecutive Conservative PM role (having passed Mulroney and Borden last year). From a risk-model perspective, Harper is likely about to be hit hard by the “time for a change” narrative.
But how much will this actually affect Conservative results? And how much will their opponents benefit? These are critical questions to our predictions.
In any election where the governing party lost (averaging once every 9 years; though 7 years for Conservatives, and 11 years for Liberals), that party saw a median drop of 6.1 points from the preceding election (average of 8.1 points). Since 1962 (first election with the NDP), that loss has been 5.5 points. But do any of those votes go to the NDP? Turns out, not really: those 5.5 points appear to (at least on average) switch back to the new governing party.
Given the risk to the current governing party, we would forecast a 5.5%-6.1% shift from the Conservatives to the Liberals, on top of all our other estimates (which would not overlap with any of this analysis), assuming that Toronto would feel the same about change as the rest of the country has historically.
That would mean our comparisons to recent Toronto-specific polls would look like this:
Remember – our analysis has avoided the use of polls, so these results (assuming the polls are right) are quite impressive.
Next up (and last before the election on Monday) will be our riding-level predictions.
Political psychologists have long held that over-simplified “rational” models of voters do not help accurately predict their actual behavior. What most behavioural researchers have found is that decision-making (e.g., voting) often boils down to emotional, unconscious factors. So, in attempting to build up our voting agents, we will need to at least:
include multiple issue perspectives, not just a simple evaluation of “left-right”;
include data for non-policy factors that could determine voting; and
not prescribe values to our agents beyond what we can empirically derive.
Given that we are unable to peek into voters’ minds (and remember: we are trying to avoid using polls[1]), we need data for (or proxies for) factors that might influence someone’s vote. So, we gathered (or created) and joined detailed data for the 2006, 2008, and 2011 Canadian federal elections (as well as the 2015 election, which will be used for predictions).
In a new paper, we discuss what influence multiple factors, such as “leader likeability”, incumbency, “star” status, demographics and policy platforms, may have on voting outcomes, and use these results to predict the upcoming federal election in Toronto ridings.
At a high-level, we find that:
Almost all variables are statistically significant.
Being either a star candidate or an incumbent can boost a candidate’s share of the vote by 21%, but being both an incumbent and a star candidate does not give a candidate an incremental increase. The two effects are equivalent to belonging to a party (21%).
Leader likeability is associated with a 0.3% change in the proportion of votes received by a candidate. So, a leader that essentially polls the same as their party yields their Toronto-based candidates about 14 points.
The relationships between age, gender, family income, and the proportion of votes vary widely across the parties (as expected). For example, family income tends to increase support for Conservatives (0.005/$10,000) while decreasing for the other two major parties by roughly the same magnitude.
Policy matters, but only slightly, and only economic and environmental issues overall.
With our empirical results, we can turn to predicting the 2015 federal election in Toronto ridings.
It turns out that our Toronto-wide results are fairly in line with recent Toronto-specific polling results (weighted by age and sample size) – though we’ll see how right we all are come election day – which means that there may be some inherent truth in the coefficients we have found.
Given that we haven’t used polls or included localized details or party platforms, these results are surprisingly good. The seeming shift from Liberal to Conservative is something that we’ll need to look into further. It is likely highlighting an issue with our data: namely, that we only have three years of detailed federal elections data, and these elections have seen some of the best showings for the Conservatives (and their predecessors) in Ontario since the end of the second world war (the exceptions being in the late 1950s with Diefenbaker, 1979 with Joe Clark, and 1984 with Brian Mulroney), with some of the worst for the Liberals over the same time frame. That is, we are not picking up a (cyclical) reversion to the mean in our variables, but might investigate the cycle itself.
Nonetheless, we set out to understand (both theoretically and empirically) how to predict an election while significantly limiting the use of polls, and it appears that we are at least on the right track.
[1] This is true for a number of reasons: first, we want to be able to simulate elections, and therefore would not always have access to polls; second, we are trying to do something fundamentally different by observing behaviour instead of asking people questions, which often leads to lying (e.g., social desirability biases: see the “Bradley effect”); third, while polls in aggregate are generally good at predicting outcomes, individual polls are highly volatile.
Given that we are unable to peek into voters’ minds (remember: we are trying to avoid using polls as much as possible), we need data (or proxies) for factors that might influence someone’s vote. We gathered (or created) and joined data for the 2006, 2008, and 2011 Canadian federal elections (as well as the 2015 election, which will be used for predictions) for Toronto ridings.
We’ll be explaining all this in more detail next week, but for now, here are some basics:
We’ve assigned leader “likeability” scores to the major party leaders in each election, using polls that ask questions about leadership characteristics and formulaically compare them to party-level polls around the same time. This provides a value for (or at least a proxy of) how much influence the party leader was having on their party’s showing in the polls, and should account for much of the party variation that we see from year to year. (We also use party identifiers, to identify a “base”.)
For all 366 candidates across the three elections, we identify two things: are they an incumbent, and are they a “star” candidate, by which we mean would they be generally known outside of their riding? This yields 64 candidate-year incumbents (i.e., an individual could be an incumbent in all three elections) and 29 candidate-year stars.
Regressing these data against the proportion of votes received across ridings yields some interesting results. First: party, leader likeability, star candidate, and incumbency are all statistically significant (as is the interaction of star candidate and incumbency). This isn’t a surprise, given the literature around what it is that drives voters’ decisions. (Note that we haven’t yet included demographics or party platforms.)
Breaking down the results: Being a star candidate or an incumbent (but not both) adds about 20 points right off the top, so name recognition obviously matters a lot. Likeability matters too; a leader that essentially polls the same as their party yields candidates about 14 points. (As an example of what this means, Stéphane Dion lost the average Liberal candidate in Toronto about 9 points relative to Paul Martin. Alternatively, in 2011, Jack Layton added about 16 points more to NDP candidates in Toronto than Michael Ignatieff did for equivalent Liberal candidates.) Finally, party base matters too: for example, being an average Liberal candidate in Toronto adds about 17 points over the equivalent NDP candidate. (We expect some of this will be explained with demographics and party platforms.)
To be clear, these are average results, so we can’t yet use them effectively for predicting individual riding-level races (that will come later). But, if we apply them to all 2015 races in Toronto and aggregate across the city, we would predict voting proportions very similar to the results of a recent poll by Mainstreet (if undecided voters split proportionally):
Given that we haven’t used polls or included localized details or party platforms, these results are amazing, and give us a lot of confidence that we’re making fantastic progress in understanding voter behaviour (at least in Toronto).
Analyzing the upcoming federal election requires collecting and integrating new data. This is often the most challenging part of any analysis and we’ve committed significant efforts to obtaining good data for federal elections in Toronto’s electoral districts.
Clearly, the first place to start was with Elections Canada and the results of previous general elections. These are available for download as collections of Excel files, which aren’t the most convenient format. So, our toVotes package has been updated to include results from the 2006, 2008, and 2011 federal elections for electoral districts in Toronto. The toFederalVotes data frame provides the candidate’s name, party, whether they were an incumbent, and the number of votes they received by electoral district and poll number. Across the three elections, this amounts to 82,314 observations.
Connecting these voting data with other characteristics requires knowing where each electoral district and poll are in Toronto. So, we created spatial joins among datasets to integrate them (e.g., combining demographics from census data with the vote results). Shapefiles for each of the three federal elections are available for download, but the location identifiers aren’t a clean match between the Excel and shapefiles. Thanks to some help from Elections Canada, we were able to translate the location identifiers and join the voting data to the election shapefiles. This gives us close to 4,000 poll locations across 23 electoral districts in each year. We then used the census shapefiles to aggregate these voting data into 579 census tracts. These tracts are relatively stable and give us a common geographical classification for all of our data.
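The core of that integration is a spatial join: find which census tract each poll location falls inside, then aggregate votes up to the tract. A minimal sketch of the logic (the toy tracts and poll points below are stand-ins for the real shapefiles, and `tract_id`/`votes` are illustrative column names):

```r
library(sf)
library(dplyr)

# Two toy census tracts (unit squares) standing in for the census shapefile
tracts <- st_sf(
  tract_id = c("CT1", "CT2"),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
  )
)

# Three toy poll locations with vote counts, standing in for the election shapefile
polls <- st_sf(
  votes = c(100, 250, 80),
  geometry = st_sfc(st_point(c(0.5, 0.5)),
                    st_point(c(0.25, 0.75)),
                    st_point(c(1.5, 0.5)))
)

# Attach the tract each poll falls inside, then aggregate votes by tract
votes_by_tract <- polls %>%
  st_join(tracts, join = st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(tract_id) %>%
  summarise(votes = sum(votes))
```

With real data, the same pipeline runs over roughly 4,000 poll points and 579 tracts per election year.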
This work is currently in the experimental fed-geo branch of the toVotes package and will be pulled into the main branch soon. Now, with votes aggregated into census tracts, we can use the census data for Toronto in our toCensus package to explore how demographics affect voting outcomes.
Getting the data to this point was more work than we expected, but well worth the effort. We’re excited to see what we can learn from these data and look forward to sharing the results with you.
A number of people have been asking whether we are going to analyze the upcoming federal election on October 19, like we did for the Toronto mayoral race last year. The truth is, we never stopped working after the mayoral race, but are back with a vengeance for the next five weeks.
We have gathered tonnes of new data and refined our methodology. We have also established a new domain name: psephoanalytics.ca. You can still subscribe to email updates here, or follow us on twitter @psephoanalytics. Finally, if you’d like to chat directly, please email us psephoanalytics@gmail.com.
Nonetheless, stay tuned for lots of updates over the coming weeks, culminating in some predictions for Toronto ridings prior to October 19.
The first (and long) step in moving towards agent-based modeling is the creation of the agents themselves. While fictional, they must represent reality – meaning they need to behave like actual people. The main issue in voter modeling, however, is that since voting is private we do not know how individuals behave, only collections of voters – and we do not want them all to behave the exact same way. That is why one of the key elements of our work is the ability to create meaningful differences among our agents – particularly when it comes to the likes of issue positions and political engagement.
The obvious difficulty is how to do that. In our model, many of our agents’ characteristics are limited to values between 0 and 1 (e.g., political positions, weights on given issues). Many standard distributions, such as the normal, would be cut off at these extremes, creating unrealistic “spikes” of extreme behaviour. We also cannot use uniform distributions, as the likelihood of individuals in a group looking somewhat the same (i.e., more around an average) seems much more reasonable than them looking uniformly different.
Which brings us to the β distribution. In a new paper, we discuss applying this family of distributions to voter characteristics. While there is great diversity in the potential shapes of these distributions - granting us the flexibility we need - in (likely) very extreme cases, the shape will not “look like” what we would expect. Therefore, one of our goals will be to somewhat constrain our selection of fixed values for α and β, based on as much empirical data as possible, to ensure we get this balance right.
A selection of α-β combinations that generate “useful” distributions:
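As a quick illustration of that flexibility, a few α-β pairs and the agent characteristics they produce (these values are arbitrary examples, not our fitted parameters):

```r
# Draw agent characteristics bounded on [0, 1] from beta distributions.
# The alpha-beta pairs below are arbitrary examples, not fitted values.
set.seed(42)
shapes <- list(
  symmetric   = c(alpha = 5,   beta = 5),    # mass clustered around 0.5
  left_skewed = c(alpha = 2,   beta = 8),    # mass near 0
  polarized   = c(alpha = 0.5, beta = 0.5)   # mass at both extremes
)

samples <- lapply(shapes, function(s) rbeta(10000, s["alpha"], s["beta"]))

# All draws stay strictly inside the unit interval --
# no artificial "spikes" from truncating a normal at 0 or 1
sapply(samples, range)
```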
As the next Federal General Election gets closer, we’re turning our analytical attention to how the election might play out in Toronto. The first step, of course, is to gather data on prior elections. So, we’ve updated our toVotes data package to include the results of the 2008 and 2011 federal elections for electoral districts in Toronto.
This dataset includes the votes received by each candidate in each district and poll in Toronto. We also include the party affiliation of the candidate and whether they are an incumbent. These data are currently stored in a separate dataset from the mayoral results, since the geography of the electoral districts and wards aren’t directly comparable. We’ll work on integrating these datasets more closely and adding in further election results over the coming weeks.
Hopefully the general availability of cleaned and integrated datasets, such as this one, will help generate more analytically-based discussions of the upcoming election.
Turnout is often seen as (at least an easy) metric of the health of a democracy – as voting is a primary activity in civic engagement. However, turnout rates continue to decline across many jurisdictions[i]. This is certainly true in Canada and Ontario.
From the PsephoAnalytics perspective – namely, accurately predicting the results of elections (particularly when using an agent-based model (ABM) approach) – this requires understanding what it is that drives the decision to vote at all, instead of simply staying home.
If this can be done, we would not only improve our estimates in an empirical (or at least heuristic) way, but might also be able to make normative statements about elections. That is, we hope to be able to suggest ways in which turnout could be improved, and whether (or how much) that mattered.
In a new paper we start to investigate the history of turnout in Canada and Ontario, and review what the literature says about the factors associated with turnout, in an effort to help “teach” our agents when and why they “want” to vote. More work will certainly be required here, but this provides a very good start.
We value constructive feedback and continuous improvement, so we’ve taken a careful look at how our predictions held up for the recent mayoral election in Toronto.
The full analysis is here. The summary is that our estimates weren’t too bad on average: the distribution of errors is centered on zero (i.e., not biased) with a small standard error. But, on-average estimates are not sufficient for the types of prediction we would like to make. At a ward-level, we find that we generally overestimated support for Tory, especially in areas where Ford received significant votes.
We understood that our simple agent-based approach wouldn’t be enough. Now we’re particularly motivated to gather up much more data to enrich our agents' behaviour and make better predictions.
The results are in, and our predictions performed reasonably well on average (we averaged 4% off per candidate). Ward by ward predictions were a little more mixed, though, with some wards being bang on (looking at Tory’s results), and some being way off – such as northern Scarborough and Etobicoke. (For what it’s worth, the polls were a ways off in this regard too.) This mostly comes down to our agents not being different enough from one another. We knew building the agents would be the hardest part, and we now have proof!
Regardless, we still think that the agent-based modeling approach is the most appropriate for this kind of work – but we obviously need a lot more data to teach our agents what they believe. So, we’re going to spend the next few months incorporating other datasets (e.g., historical federal and provincial elections, as well as councillor-level data from the 2014 Toronto election). The other piece that we need to focus on is turnout. We knew our turnout predictions were likely the minimum for this election, but aren’t yet able to model a more predictive metric, so we’ll be conducting a study into that as well.
Finally, we’ll provide detailed analysis of our predictions once all the detailed official results become available.
Our final predictions have John Tory winning the 2014 mayoral election in Toronto with a plurality of 46% of the votes, followed by Doug Ford (29%) and Olivia Chow (25%). We also predict turnout of at least 49% across the city, but there are differences in turnout among each candidate’s supporters (with Tory’s supporters being the most likely to vote by a significant margin - which is why our results are more in his favour than recent polls). We predict support for each candidate will come from different pockets of the city, as can be seen on the map below.
These predictions were generated by simulating the election ten times, each time sampling one million of our representative voters (whom we created) for their voting preferences and whether they intend to vote.
Each representative voter has demographic characteristics (e.g., age, sex, income) in accordance with local census data, and lives in a specific ‘neighbourhood’ (i.e., census tract). These attributes helped us assign them political beliefs – and therefore preferences for candidates – as well as political engagement scores that come from various studies of historical turnout (from the likes of Elections Canada). The latter allows us to estimate the likelihood of each specific agent actually casting a ballot.
We’ll shortly also release a ward-by-ward summary of our predictions.
In the end, we hope this proof-of-concept proves to be more refined (and therefore more useful in the long term) than polling data. As the model becomes more sophisticated, we’ll be able to do scenario testing and study other aspects of campaigns.
As promised, here is a ward-by-ward breakdown of our final predictions for the 2014 mayoral election in Toronto. We have Tory garnering the most votes in 33 wards for certain, plus likely another 5 in close races. Six wards are "too close to call", with three barely leaning to Tory (38, 39, and 40) and three barely leaning to Ford (8, 35, and 43). We're not predicting Chow will win in any ward, but we expect her to come second in fourteen.
The first (and long) step in moving towards agent-based modeling is the creation of the agents themselves. While fictional, they must be representative of reality – meaning they need to behave like actual people might.
In developing a proof of concept of our simulation platform (which we'll lay out in some detail soon), we've created 10,000 agents, drawn randomly from the 542 census tracts (CTs) that make up Toronto per the 2011 Census, proportional to the actual population by age and sex. (CTs are roughly "neighbourhoods".) So, for example, if 0.001% of the population of Toronto is male, aged 43, living in a CT on the Danforth, then roughly 0.001% of our agents will have those same characteristics. Once the basic agents are selected, we assign (for now) the median household income of the CT to each agent.
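The agent-creation step above is essentially weighted sampling from a census table. Here is a minimal Python sketch under toy data – the CT identifiers, population counts, and income figures are invented for illustration, standing in for the real 2011 Census counts across all 542 CTs.

```python
import random

random.seed(1)

# Toy census table: (census_tract, sex, age) -> population count.
census = {
    ("CT001", "M", 43): 120,
    ("CT001", "F", 43): 130,
    ("CT002", "M", 30): 200,
    ("CT002", "F", 30): 210,
}
# Illustrative median household income per CT.
median_income = {"CT001": 58_000, "CT002": 72_000}

def draw_agents(n):
    """Draw n agents, each cell weighted by its census population,
    then attach the CT's median income to the agent."""
    cells = list(census.keys())
    weights = [census[c] for c in cells]
    agents = []
    for _ in range(n):
        ct, sex, age = random.choices(cells, weights=weights)[0]
        agents.append({"ct": ct, "sex": sex, "age": age,
                       "income": median_income[ct]})
    return agents

agents = draw_agents(1000)
```

Because the draw is weighted by cell population, the agent pool reproduces the city's age/sex/CT distribution in expectation, which is exactly the "representative of reality" property we need.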
But what do these agents believe, politically? For that we take (again, for now) a weighted compilation of relatively recent polls (10 in total, having polled close to 15,000 people, since Doug Ford entered the race), averaged by age/sex/income group/region combinations (420 in total). These give us average support for each of the three major candidates (plus "other") by agent type, which we then randomly sample (by proportion of support) and assign a Left-Right score (0-100) as we did in our other modeling.
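Mechanically, this assignment is a categorical draw from the poll-averaged support distribution for an agent's demographic cell, followed by a Left-Right score draw. The Python sketch below is hypothetical: the support shares, the two demographic cells (standing in for the real 420), and the Left-Right anchor values are all placeholders.

```python
import random

random.seed(7)

# Hypothetical poll-averaged support by demographic cell (shares sum to 1).
support = {
    ("M", "18-34"): {"Tory": 0.35, "Ford": 0.30, "Chow": 0.30, "Other": 0.05},
    ("F", "55+"):   {"Tory": 0.50, "Ford": 0.25, "Chow": 0.22, "Other": 0.03},
}

# Illustrative Left-Right anchors (0 = left, 100 = right); placeholders only.
LR_ANCHOR = {"Chow": 25, "Tory": 55, "Ford": 75, "Other": 50}

def assign_beliefs(cell):
    """Sample a preferred candidate in proportion to the cell's poll support,
    then draw a Left-Right score jittered around that candidate's anchor."""
    dist = support[cell]
    candidate = random.choices(list(dist), weights=list(dist.values()))[0]
    lr = min(100.0, max(0.0, random.gauss(LR_ANCHOR[candidate], 8)))
    return candidate, lr

candidate, lr = assign_beliefs(("M", "18-34"))
```

Summed over many agents, these individual draws aggregate back to the poll averages by construction.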
This is somewhat akin to polling, except we're (randomly) assigning these agents their beliefs rather than asking them, such that the assignments aggregate back to what the polls are saying, on average.
Next, we take the results of an Elections Canada study on turnout by age/sex that allows us to similarly assign "engagement" scores to the agents. That is, we assign (for now) each agent the average turnout for their age/sex group. This gives us a sense of likely turnout by CT (see map below).
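The engagement assignment and the CT-level roll-up reduce to a lookup and a group average. A minimal Python sketch, where the turnout rates are placeholders in the spirit of (but not taken from) the Elections Canada study:

```python
from collections import defaultdict

# Hypothetical turnout rates by age/sex group (placeholder values).
turnout_rate = {
    ("M", "18-34"): 0.38, ("F", "18-34"): 0.41,
    ("M", "55+"):   0.68, ("F", "55+"):   0.66,
}

# A tiny illustrative agent pool.
agents = [
    {"ct": "CT001", "sex": "M", "age_group": "18-34"},
    {"ct": "CT001", "sex": "F", "age_group": "55+"},
    {"ct": "CT002", "sex": "M", "age_group": "55+"},
]

# Assign each agent the average turnout for its age/sex group...
for a in agents:
    a["engagement"] = turnout_rate[(a["sex"], a["age_group"])]

# ...then average engagement within each CT for expected local turnout.
by_ct = defaultdict(list)
for a in agents:
    by_ct[a["ct"]].append(a["engagement"])
expected_turnout = {ct: sum(v) / len(v) for ct, v in by_ct.items()}
```

Mapping `expected_turnout` back onto CT boundaries is what produces a turnout map like the one referenced above.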
There is much more to go here, but this forms the basis of our “voter” agents. Next, we’ll turn to “candidate” agents, and then on to “media” agents.
Our most recent analysis shows Tory still in the lead with 44% of the votes, followed by Doug Ford at 33% and Olivia Chow at 23%.
Our analytical approach allows us to take a closer, geographical look. Based on this, we see general support for Tory across the city, while Ford and Chow have more distinct areas of support.
This is still based on our original macro-level analysis, but it gives a good sense of where our agents' support would be (on average) at a local level.
Given the caveats we outlined re: macro-level voting modeling, we're moving on to a totally different approach. Using something called agent-based modeling (ABM), we're hoping to reach a point where we can not only predict elections but also use the system to conduct studies on the effectiveness of various election models.
ABM can be defined simply as an individual-centric approach to model design, and has become widespread in multiple fields, from biology to economics. In such models, researchers define agents (e.g., voters, candidates, and media) each with various properties, and an environment in which such agents can behave and interact.
Examining systems through ABM seeks to answer four questions:
Empirical: What are the (causal) explanations for how systems evolve?
Heuristic: What are outcomes of (even simple) shocks to the system?
Method: How can we advance our understanding of the system?
Normative: How can systems be designed better?
We'll start to provide updates on our progress on the development of our system in the coming weeks.