Elections Ontario official results
Sunday, November 19, 2017
In preparing for some PsephoAnalytics work on the upcoming provincial election, I've been wrangling the Elections Ontario data. As provided, the data is really difficult to work with and we'll walk through some steps to tidy these data for later analysis.
Here's what the source data looks like:

Screenshot of raw Elections Ontario data
A few problems with this:
- The data is scattered across 107 different Excel files (one per Electoral District)
- Candidates are in columns with their last name as the header
- Last names are not unique across all Electoral Districts, so can't be used as a unique identifier
- Electoral District names are in a row, followed by a separate row for each poll within the district
- The party affiliation for each candidate isn't included in the data
So, we have a fair bit of work to do to get to something more useful. Ideally something like:
## # A tibble: 9 x 5
## electoral_district poll candidate party votes
## <chr> <chr> <chr> <chr> <int>
## 1 X 1 A Liberal 37
## 2 X 2 B NDP 45
## 3 X 3 C PC 33
## 4 Y 1 A Liberal 71
## 5 Y 2 B NDP 37
## 6 Y 3 C PC 69
## 7 Z 1 A Liberal 28
## 8 Z 2 B NDP 15
## 9 Z 3 C PC 34
This is much easier to work with: we have one row for the votes received by each candidate at each poll, along with the Electoral District name and their party affiliation.
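To see the payoff of this structure, here's a quick sketch (not part of the original pipeline) that rebuilds the small example table above and pulls out the leading candidate in each district:

```r
library(dplyr)
library(tibble)

# Rebuild the small example table shown above
results <- tribble(
  ~electoral_district, ~poll, ~candidate, ~party,    ~votes,
  "X",                 "1",   "A",        "Liberal", 37L,
  "X",                 "2",   "B",        "NDP",     45L,
  "X",                 "3",   "C",        "PC",      33L,
  "Y",                 "1",   "A",        "Liberal", 71L,
  "Y",                 "2",   "B",        "NDP",     37L,
  "Y",                 "3",   "C",        "PC",      69L,
  "Z",                 "1",   "A",        "Liberal", 28L,
  "Z",                 "2",   "B",        "NDP",     15L,
  "Z",                 "3",   "C",        "PC",      34L
)

# With one row per candidate per poll, finding the leading candidate in each
# district is a short pipeline
results %>%
  group_by(electoral_district) %>%
  top_n(1, votes)
```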
Candidate parties
As a first step, we need the party affiliation for each candidate. I didn't see this information on the Elections Ontario site, so we'll pull the data from Wikipedia. The data on this webpage isn't too bad. We can just use the `table` selector to pull out the tables and then drop the ones we aren't interested in.
```r
candidate_webpage <- "https://en.wikipedia.org/wiki/Ontario_general_election,_2014#Candidates_by_region"
candidate_tables <- "table"

candidates <- xml2::read_html(candidate_webpage) %>%
  rvest::html_nodes(candidate_tables) %>% # Pull tables from the Wikipedia entry
  .[13:25] %>% # Drop unnecessary tables
  rvest::html_table(fill = TRUE)
```
This gives us a list of 13 data frames, one for each table on the webpage. Now we cycle through each of these and stack them into one data frame. Unfortunately, the tables aren't consistent in the number of columns, so the approach is a bit messy and we process each one in a loop.

```r
# Setup empty dataframe to store results
candidate_parties <- tibble::as_tibble(
  electoral_district_name = NULL,
  party = NULL,
  candidate = NULL
)

for(i in seq_along(1:length(candidates))) { # Messy, but works
  this_table <- candidates[[i]]
  # The header spans mess up the header row, so renaming
  names(this_table) <- c(this_table[1, -c(3, 4)], "NA", "Incumbent")
  # Get rid of the blank spacer columns
  this_table <- this_table[-1, ]
  # Drop the NA columns by keeping only odd columns
  this_table <- this_table[, seq(from = 1, to = dim(this_table)[2], by = 2)]
  this_table %<>%
    tidyr::gather(party, candidate, -`Electoral District`) %>%
    dplyr::rename(electoral_district_name = `Electoral District`) %>%
    dplyr::filter(party != "Incumbent")
  candidate_parties <- dplyr::bind_rows(candidate_parties, this_table)
}
candidate_parties
```
# A tibble: 649 x 3
electoral_district_name party candidate
1 Carleton—Mississippi Mills Liberal Rosalyn Stevens
2 Nepean—Carleton Liberal Jack Uppal
3 Ottawa Centre Liberal Yasir Naqvi
4 Ottawa—Orléans Liberal Marie-France Lalonde
5 Ottawa South Liberal John Fraser
6 Ottawa—Vanier Liberal Madeleine Meilleur
7 Ottawa West—Nepean Liberal Bob Chiarelli
8 Carleton—Mississippi Mills PC Jack MacLaren
9 Nepean—Carleton PC Lisa MacLeod
10 Ottawa Centre PC Rob Dekker
# … with 639 more rows
</pre> </div> <div id="electoral-district-names" class="section level2"> <h2>Electoral district names</h2> <p>One issue with pulling party affiliations from Wikipedia is that candidates are organized by Electoral District <em>names</em>. But the voting results are organized by Electoral District <em>number</em>. I couldnât find an appropriate resource on the Elections Ontario site. Rather, here we pull the names and numbers of the Electoral Districts from the <a href="https://www3.elections.on.ca/internetapp/FYED_Error.aspx?lang=en-ca">Find My Electoral District</a> website. The xpath selector is a bit tricky for this one. The <code>ed_xpath</code> object below actually pulls content from the drop down list that appears when you choose an Electoral District. One nuisance with these data is that Elections Ontario uses <code>--</code> in the Electoral District names, instead of the â used on Wikipedia. We use <code>str_replace_all</code> to fix this below.</p> <pre class="r"><code>ed_webpage <- "https://www3.elections.on.ca/internetapp/FYED_Error.aspx?lang=en-ca" ed_xpath <- "//*[(@id = \"ddlElectoralDistricts\")]" # Use an xpath selector to get the drop down list by ID electoral_districts <- xml2::read_html(ed_webpage) %>% rvest::html_node(xpath = ed_xpath) %>% rvest::html_nodes("option") %>% rvest::html_text() %>% .[-1] %>% # Drop the first item on the list ("Select...") tibble::as.tibble() %>% # Convert to a data frame and split into ID number and name tidyr::separate(value, c("electoral_district", "electoral_district_name"), sep = " ", extra = "merge") %>% # Clean up district names for later matching and presentation dplyr::mutate(electoral_district_name = stringr::str_to_title( stringr::str_replace_all(electoral_district_name, "--", "â"))) electoral_districts</code></pre> <pre>
# A tibble: 107 x 2
electoral_district electoral_district_name
1 001 Ajax—Pickering
2 002 Algoma—Manitoulin
3 003 Ancaster—Dundas—Flamborough—Westdale
4 004 Barrie
5 005 Beaches—East York
6 006 Bramalea—Gore—Malton
7 007 Brampton—Springdale
8 008 Brampton West
9 009 Brant
10 010 Bruce—Grey—Owen Sound
# … with 97 more rows
Next, we join the party affiliations to the Electoral District names, which links each candidate to both a party and a district number.

```r
candidate_parties %<>%
  # These three lines are cleaning up hyphens and dashes, seems overly complicated
  dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name, "—\n", "—")) %>%
  dplyr::mutate(electoral_district_name = stringr::str_replace_all(electoral_district_name, "Chatham-Kent—Essex", "Chatham—Kent—Essex")) %>%
  dplyr::mutate(electoral_district_name = stringr::str_to_title(electoral_district_name)) %>%
  dplyr::left_join(electoral_districts) %>%
  dplyr::filter(!candidate == "") %>%
  # Since the vote data are identified by last names, we split candidate names into first and last
  tidyr::separate(candidate, into = c("first", "candidate"), extra = "merge", remove = TRUE) %>%
  dplyr::select(-first)
```

## Joining, by = "electoral_district_name"

```r
candidate_parties
```
# A tibble: 578 x 4
electoral_district_name party candidate electoral_district
*
1 Carleton—Mississippi Mills Liberal Stevens 013
2 Nepean—Carleton Liberal Uppal 052
3 Ottawa Centre Liberal Naqvi 062
4 Ottawa—Orléans Liberal France Lalonde 063
5 Ottawa South Liberal Fraser 064
6 Ottawa—Vanier Liberal Meilleur 065
7 Ottawa West—Nepean Liberal Chiarelli 066
8 Carleton—Mississippi Mills PC MacLaren 013
9 Nepean—Carleton PC MacLeod 052
10 Ottawa Centre PC Dekker 062
# … with 568 more rows
All that work just to get the name of each candidate for each Electoral District name and number, plus their party affiliation.
Votes
Now we can finally get to the actual voting data. These are made available as a collection of Excel files in a compressed folder. To avoid downloading it more than once, we wrap the call in an `if` statement that first checks to see if we already have the file. We also rename the file to something more manageable.

```r
raw_results_file <- "http://www.elections.on.ca/content/dam/NGW/sitecontent/2017/results/Poll%20by%20Poll%20Results%20-%20Excel.zip"
zip_file <- "data-raw/Poll%20by%20Poll%20Results%20-%20Excel.zip"

if(file.exists(zip_file)) { # Only download the data once
  # File exists, so nothing to do
} else {
  download.file(raw_results_file, destfile = zip_file)
  unzip(zip_file, exdir = "data-raw") # Extract the data into data-raw
  file.rename("data-raw/GE Results - 2014 (unconverted)", "data-raw/pollresults")
}
```

## NULL

Now we need to extract the votes out of 107 Excel files. The combination of the `purrr` and `readxl` packages is great for this. In case we want to filter to just a few of the files (perhaps to target a range of Electoral Districts), we declare a `file_pattern`. For now, we just set it to any xls file that ends with three digits preceded by `_`.
As we read in the Excel files, we clean up lots of blank columns and headers. Then we convert to a long table and drop total and blank rows. Also, rather than try to align the Electoral District name rows with their polls, we use the name of the Excel file to pull out the Electoral District number. Then we join with the `electoral_districts` table to pull in the Electoral District names.

```r
file_pattern <- "*_[[:digit:]]{3}.xls" # Can use this to filter down to specific files

poll_data <- list.files(path = "data-raw/pollresults",
                        pattern = file_pattern,
                        full.names = TRUE) %>% # Find all files that match the pattern
  purrr::set_names() %>%
  purrr::map_df(readxl::read_excel,
                # Specifying sheet = 1 just to be clear we're ignoring the rest of the sheets
                sheet = 1,
                # Declare col_types since there are duplicate surnames and map_df can't recast
                # column types in the rbind; for example, Bell is in both district 014 and 063
                col_types = "text",
                .id = "file") %>% # Import each file and merge into a dataframe
  dplyr::select(-starts_with("X__")) %>% # Drop all of the blank columns
  dplyr::select(1:2, 4:8, 15:dim(.)[2]) %>% # Reorganize a bit and drop unneeded columns
  dplyr::rename(poll_number = `POLL NO.`) %>%
  tidyr::gather(candidate, votes, -file, -poll_number) %>% # Convert to a long table
  dplyr::filter(!is.na(votes),
                poll_number != "Totals") %>%
  dplyr::mutate(electoral_district = stringr::str_extract(file, "[[:digit:]]{3}"),
                votes = as.numeric(votes)) %>%
  dplyr::select(-file) %>%
  dplyr::left_join(electoral_districts)
poll_data
```
# A tibble: 143,455 x 5
poll_number candidate votes electoral_district electoral_district_name
1 001 DICKSON 73 001 Ajax—Pickering
2 002 DICKSON 144 001 Ajax—Pickering
3 003 DICKSON 68 001 Ajax—Pickering
4 006 DICKSON 120 001 Ajax—Pickering
5 007 DICKSON 74 001 Ajax—Pickering
6 008A DICKSON 65 001 Ajax—Pickering
7 008B DICKSON 81 001 Ajax—Pickering
8 009 DICKSON 112 001 Ajax—Pickering
9 010 DICKSON 115 001 Ajax—Pickering
10 011 DICKSON 74 001 Ajax—Pickering
# … with 143,445 more rows
The only thing left to do is to join `poll_data` with `candidate_parties` to add party affiliation to each candidate. Because the names don't always exactly match between these two tables, we use the `fuzzyjoin` package to join by closest spelling.

```r
poll_data_party_match_table <- poll_data %>%
  group_by(candidate, electoral_district_name) %>%
  summarise() %>%
  fuzzyjoin::stringdist_left_join(candidate_parties,
                                  ignore_case = TRUE) %>%
  dplyr::select(candidate = candidate.x,
                party = party,
                electoral_district = electoral_district) %>%
  dplyr::filter(!is.na(party))

poll_data %<>%
  dplyr::left_join(poll_data_party_match_table) %>%
  dplyr::group_by(electoral_district, party)
tibble::glimpse(poll_data)
```
Observations: 144,323
Variables: 6
$ poll_number             "001", "002", "003", "006", "007", "00…
$ candidate               "DICKSON", "DICKSON", "DICKSON", "DICK…
$ votes                   73, 144, 68, 120, 74, 65, 81, 112, 115…
$ electoral_district      "001", "001", "001", "001", "001", "00…
$ electoral_district_name "Ajax—Pickering", "Ajax—Pickering", "A…
$ party                   "Liberal", "Liberal", "Liberal", "Libe…
And, there we go. One table with a row for the votes received by each candidate at each poll. It would have been great if Elections Ontario released data in this format and we could have avoided all of this work.
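To show the payoff, a quick summary like province-wide vote totals by party is now just a couple of lines (a simple sketch using the tidy table we just built, not part of the original post):

```r
# Province-wide vote totals by party, using the tidy poll_data table from above
poll_data %>%
  dplyr::group_by(party) %>%
  dplyr::summarise(total_votes = sum(votes)) %>%
  dplyr::arrange(dplyr::desc(total_votes))
```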
Finance fixed their data and broke my case study
Sunday, November 5, 2017
The past few years, I’ve delivered an introduction to using R workshop that relied on manipulating Ministry of Finance demographic projections.
Analyzing these data was a great case study for the typical data management process. The data was structured for presentation, rather than analysis. So, there were several header rows, notes at the base of the table, and the data was spread across many worksheets.
Sometime recently, the ministry released an update that provides the data in a much better format: one sheet with rows for age and columns for years. Although this is a great improvement, I've had to update my case study, which makes it actually less useful as a lesson in data manipulation.
Although I’ve updated the main branch of the github repository, I’ve also created a branch that sources the archive.org version of the page from October 2016. Now, depending on the audience, I can choose the case study that has the right level of complexity.
Despite briefly causing me some trouble, I think it is great that these data are closer to a good analytical format. Now, if only the ministry could go one more step towards tidy data and make my case study completely unnecessary.
A workflow for leaving the office
Sunday, October 1, 2017
Sometimes it’s the small things, accumulated over many days, that make a difference. As a simple example, every day when I leave the office, I message my family to let them know I’m leaving and how I’m travelling. Relatively easy: just open the Messages app, find the most recent conversation with them, and type in my message.
Using Workflow I can get this down to just a couple of taps on my watch. By choosing the “Leaving Work” workflow, I get a choice of travelling options:
Choosing one of them creates a text with the right emoticon that is pre-addressed to my family. I hit send and off goes the message.
The workflow itself is straightforward:
Like I said, pretty simple. But saves me close to a minute each and every day.
Charity donations by province
Thursday, August 31, 2017
This tweet about the charitable donations by Albertans showed up in my timeline and caused a ruckus.
Albertans give the most to charity in Canada, 50% more than the national average, even in tough economic times. #CdnPoli pic.twitter.com/keKPzY8brO
— Oil Sands Action (@OilsandsAction) August 31, 2017
Many people took issue with the fact that these values weren't adjusted for income. Seems to me that whether this is a good idea or not depends on what kind of question you're trying to answer. Regardless, the CANSIM table includes this value. So, it is straightforward to calculate. Plus CANSIM tables have a pretty standard structure and showing how to manipulate this one serves as a good template for others.
```r
library(tidyverse)

# Download and extract
url <- "http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/01110001-eng.zip"
zip_file <- "01110001-eng.zip"
download.file(url, destfile = zip_file)
unzip(zip_file)

# We only want two of the columns. Specifying them here.
keep_data <- c("Median donations (dollars)",
               "Median total income of donors (dollars)")

cansim <- read_csv("01110001-eng.csv") %>%
  filter(DON %in% keep_data,
         # This second filter removes anything that isn't a province or territory
         is.na(`Geographical classification`)) %>%
  select(Ref_Date, DON, Value, GEO) %>%
  spread(DON, Value) %>%
  rename(year = Ref_Date,
         donation = `Median donations (dollars)`,
         income = `Median total income of donors (dollars)`) %>%
  mutate(donation_per_income = donation / income) %>%
  filter(year == 2015) %>%
  select(GEO, donation, donation_per_income)
cansim
```
## # A tibble: 16 x 3
##                                  GEO donation donation_per_income
##  1                           Alberta      450         0.006378455
##  2                  British Columbia      430         0.007412515
##  3                            Canada      300         0.005119454
##  4                          Manitoba      420         0.008032129
##  5                     New Brunswick      310         0.006187625
##  6         Newfoundland and Labrador      360         0.007001167
##  7 Non CMA-CA, Northwest Territories      480         0.004768528
##  8                 Non CMA-CA, Yukon      310         0.004643499
##  9             Northwest Territories      400         0.003940887
## 10                       Nova Scotia      340         0.006505932
## 11                           Nunavut      570         0.005651398
## 12                           Ontario      360         0.005856515
## 13              Prince Edward Island      400         0.008221994
## 14                            Quebec      130         0.002452830
## 15                      Saskatchewan      410         0.006910501
## 16                             Yukon      420         0.005695688
Curious that they dropped the territories from their chart, given that Nunavut has such a high donation amount.
Now we can plot the normalized data to find how the rank order changes. We'll add the Canadian average as a blue line for comparison.
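The original post showed this as a chart; here is a rough sketch of how such a plot could be built from the `cansim` table above (my reconstruction, so the styling choices are assumptions, and it relies on the `library(tidyverse)` call from the earlier chunk):

```r
# Pull out the Canada-wide value to draw as a reference line
canada_avg <- cansim %>%
  filter(GEO == "Canada") %>%
  pull(donation_per_income)

cansim %>%
  filter(GEO != "Canada") %>%
  ggplot(aes(x = reorder(GEO, donation_per_income), y = donation_per_income)) +
  geom_col() +
  geom_hline(yintercept = canada_avg, colour = "blue") + # Canadian average for comparison
  coord_flip() +
  labs(x = NULL, y = "Median donation as a share of median donor income")
```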
I'm not comfortable with using median donations (adjusted for income or not) to say anything in particular about the residents of a province. But, I'm always happy to look more closely at data and provide some context for public debates.
One major gap with this type of analysis is that we're only looking at the median donations of people that donated anything at all. In other words, we aren't considering anyone who donates nothing. We should really compare these median donations to the total population or the size of the economy. This Stats Can study is a much more thorough look at the issue.
For me the interesting result here is the dramatic difference between Quebec and the rest of the provinces. But, I don't interpret this to mean that Quebecers are less generous than the rest of Canada. Seems more likely that there are material differences in how the Quebec economy and social safety nets are structured.
TTC delay data and Friday the 13th
Wednesday, June 21, 2017
The TTC releasing their Subway Delay Data was great news. I'm always happy to see more data released to the public. In this case, it also helps us investigate one of the great, enduring mysteries: Is Friday the 13th actually an unlucky day?
As always, we start by downloading and manipulating the data. I've added in two steps that aren't strictly necessary. One is converting the Date, Time, and Day columns into a single Date column. The other is to drop most of the other columns of data, since we aren't interested in them here.
url <- "http://www1.toronto.ca/City%20Of%20Toronto/Information%20&%20Technology/Open%20Data/Data%20Sets/Assets/Files/Subway%20&%20SRT%20Logs%20(Jan01_14%20to%20April30_17).xlsx" filename <- basename(url) download.file(url, destfile = filename, mode = "wb") delays <- readxl::read_excel(filename, sheet = 2) %>% dplyr::mutate(date = lubridate::ymd_hm(paste(`Date`, `Time`, sep = " ")), delay = `Min Delay`) %>% dplyr::select(date, delay) delays
## # A tibble: 69,043 x 2
##                   date delay
##                 <dttm> <dbl>
##  1 2014-01-01 02:06:00     3
##  2 2014-01-01 02:40:00     0
##  3 2014-01-01 03:10:00     3
##  4 2014-01-01 03:20:00     5
##  5 2014-01-01 03:29:00     0
##  6 2014-01-01 07:31:00     0
##  7 2014-01-01 07:32:00     0
##  8 2014-01-01 07:34:00     0
##  9 2014-01-01 07:34:00     0
## 10 2014-01-01 07:53:00     0
## # ... with 69,033 more rows
Now we have a `delays` dataframe with 69,043 incidents starting from 2014-01-01 00:21:00 and ending at 2017-04-30 22:13:00. Before we get too far, we'll take a look at the data. A heatmap of delays by day and hour should give us some perspective.

```r
delays %>%
  dplyr::mutate(day = lubridate::day(date),
                hour = lubridate::hour(date)) %>%
  dplyr::group_by(day, hour) %>%
  dplyr::summarise(sum_delay = sum(delay)) %>%
  ggplot2::ggplot(aes(x = hour, y = day, fill = sum_delay)) +
  ggplot2::geom_tile(alpha = 0.8, color = "white") +
  ggplot2::scale_fill_gradient2() +
  ggplot2::theme(legend.position = "right") +
  ggplot2::labs(x = "Hour", y = "Day of the month", fill = "Sum of delays")
```
Other than a reliable band of calm very early in the morning, no obvious patterns here.
We need to identify any days that are a Friday the 13th. We also might want to compare weekends, regular Fridays, other weekdays, and Friday the 13ths, so we add a `type` column that provides these values. Here we use the `case_when` function:

```r
delays <- delays %>%
  dplyr::mutate(type = case_when( # Partition into Friday the 13ths, Fridays, weekends, and weekdays
    lubridate::wday(.$date) %in% c(1, 7) ~ "weekend",
    lubridate::wday(.$date) %in% c(6) & lubridate::day(.$date) == 13 ~ "Friday 13th",
    lubridate::wday(.$date) %in% c(6) ~ "Friday",
    TRUE ~ "weekday" # Everything else is a weekday
  )) %>%
  dplyr::mutate(type = factor(type)) %>%
  dplyr::group_by(type)
delays
```
## # A tibble: 69,043 x 3
## # Groups:   type [4]
##                   date delay    type
##                 <dttm> <dbl>  <fctr>
##  1 2014-01-01 02:06:00     3 weekday
##  2 2014-01-01 02:40:00     0 weekday
##  3 2014-01-01 03:10:00     3 weekday
##  4 2014-01-01 03:20:00     5 weekday
##  5 2014-01-01 03:29:00     0 weekday
##  6 2014-01-01 07:31:00     0 weekday
##  7 2014-01-01 07:32:00     0 weekday
##  8 2014-01-01 07:34:00     0 weekday
##  9 2014-01-01 07:34:00     0 weekday
## 10 2014-01-01 07:53:00     0 weekday
## # ... with 69,033 more rows
With the data organized, we can start with just a simple box plot of the minutes of delay by `type`.

```r
ggplot2::ggplot(delays, aes(type, delay)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x = "Type", y = "Minutes of delay")
```
Not very compelling. Basically most delays are short (as in zero minutes long) with many outliers.
How about if we summed up the total minutes in delays for each of the types of days?
```r
delays %>%
  dplyr::summarise(total_delay = sum(delay))
```
## # A tibble: 4 x 2
##          type total_delay
##        <fctr>       <dbl>
## 1      Friday       18036
## 2 Friday 13th         619
## 3     weekday       78865
## 4     weekend       28194
Clearly the total minutes of delays are much shorter for Friday the 13ths. But, there aren't very many such days (only 6 in fact). So, this is a dubious analysis.
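Just to make that count explicit, here's a quick check (my addition, not in the original post) using the `delays` table built above:

```r
# Count the distinct calendar dates classified as "Friday 13th" (expecting 6)
delays %>%
  dplyr::filter(type == "Friday 13th") %>%
  dplyr::mutate(day = lubridate::date(date)) %>%
  dplyr::distinct(day) %>%
  nrow()
```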
Let's take a step back and calculate the average of the total delay across the entire day for each of the types of days. If Friday the 13ths really are unlucky, we would expect to see longer delays, at least relative to a regular Friday.
```r
daily_delays <- delays %>% # Total delays in a day
  dplyr::mutate(year = lubridate::year(date),
                day = lubridate::yday(date)) %>%
  dplyr::group_by(year, day, type) %>%
  dplyr::summarise(total_delay = sum(delay))

mean_daily_delays <- daily_delays %>% # Average delays in each type of day
  dplyr::group_by(type) %>%
  dplyr::summarise(avg_delay = mean(total_delay))
mean_daily_delays
```
## # A tibble: 4 x 2
##          type avg_delay
##        <fctr>     <dbl>
## 1      Friday 107.35714
## 2 Friday 13th 103.16667
## 3     weekday 113.63833
## 4     weekend  81.01724
```r
ggplot2::ggplot(daily_delays, aes(type, total_delay)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x = "Type", y = "Total minutes of delay")
```
On average, Friday the 13ths have shorter total delays (103 minutes) than either regular Fridays (107 minutes) or other weekdays (114 minutes). Overall, weekend days have far shorter total delays (81 minutes).
If Friday the 13ths are unlucky, they certainly arenât causing longer TTC delays.
For the statisticians among you that still aren't convinced, we'll run a basic linear model to compare Friday the 13ths with regular Fridays. This should control for many unmeasured variables.
```r
model <- lm(total_delay ~ type,
            data = daily_delays,
            subset = type %in% c("Friday", "Friday 13th"))
summary(model)
```
## 
## Call:
## lm(formula = total_delay ~ type, data = daily_delays, subset = type %in% 
##     c("Friday", "Friday 13th"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.357  -30.357   -6.857   18.643  303.643 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      107.357      3.858  27.829   <2e-16 ***
## typeFriday 13th   -4.190     20.775  -0.202     0.84    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50 on 172 degrees of freedom
## Multiple R-squared:  0.0002365, Adjusted R-squared:  -0.005576 
## F-statistic: 0.04069 on 1 and 172 DF,  p-value: 0.8404
Definitely no statistical support for the idea that Friday the 13ths cause longer TTC delays.
How about time series tests, like anomaly detections? Seems like we'd just be getting carried away. Part of the art of statistics is knowing when to quit.
In conclusion, then, after likely far too much analysis, we find no evidence that Friday the 13ths cause an increase in the length of TTC delays. This certainly suggests that Friday the 13ths are not unlucky in any meaningful way, at least for TTC riders.
Glad we could put this superstition to rest!
Successful AxePC 2016 event
Sunday, November 6, 2016
Thank you to all the participants, donors, and volunteers for making the third Axe Pancreatic Cancer event such a great success! Together we’re raising awareness and funding to support Pancreatic Cancer Canada.
Axe PC 2016
Monday, October 17, 2016
We’re hosting our third-annual Axe Pancreatic Cancer event. Help us kick off Pancreatic Cancer Awareness Month by drinking beer and throwing axes!
Toronto election data
Tuesday, September 16, 2014
As with any analytical project, we invested significant time in obtaining and integrating data for our neighbourhood-level modeling. The Toronto Open Data portal provides detailed election results for the 2003, 2006, and 2010 elections, which is a great resource. But, they are saved as Excel files with a separate worksheet for each ward. This is not an ideal format for working with R.
We’ve taken the Excel files for the mayoral-race results and converted them into a data package for R called toVotes. This package includes the votes received by ward and area for each mayoral candidate in each of the last three elections.
If you’re interested in analyzing Toronto’s elections, we hope you find this package useful. We’re also happy to take suggestions (or code contributions) on the GitHub page.
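For the curious, usage would look something like the sketch below. Note that the data object and column names here are assumptions for illustration, not pulled from the package's documentation:

```r
library(dplyr)
library(toVotes)

# Hypothetical example: assumes the package exposes a toVotes data frame with
# year, ward, candidate, and votes columns
toVotes %>%
  group_by(year, candidate) %>%
  summarise(total_votes = sum(votes)) %>%
  arrange(year, desc(total_votes))
```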
First attempt at predicting the 2014 Toronto mayoral race
Friday, September 12, 2014
In our first paper, we describe the results of some initial modeling - at a neighbourhood level - of which candidates voters are likely to support in the 2014 Toronto mayoral race. All of our data is based upon publicly available sources.
We use a combination of proximity voter theory and statistical techniques (linear regression and principal-component analyses) to undertake two streams of analysis:
- Determining what issues have historically driven votes and what positions neighbourhoods have taken on those issues
- Determining which neighbourhood characteristics might explain why people favour certain candidates
In both cases we use candidates' currently stated positions on issues and assign them scores from 0 ("extreme left") to 100 ("extreme right"). While certainly subjective, there is at least internal consistency to such modeling.
This work demonstrates that significant insights on the upcoming mayoral election in Toronto can be obtained from an analysis of publicly available data. In particular, we find that:
- Voters will change their minds in response to issues. So, “getting out the vote” is not a sufficient strategy. Carefully chosen positions and persuasion are also important.
- Despite this, the ‘voteability’ of candidates is clearly important, which includes voter’s assessments of a candidate’s ability to lead and how well they know the candidate’s positions.
- The airport expansion and transportation have been the dominant issues across the city in the last three elections, though they may not be in 2014.
- A combination of family size, mode of commuting, and home values (at the neighbourhood level) can partially predict voting patterns.
We are now moving on to something completely different, where we use an agent-based approach to simulate entire elections. We are actively working on this now and hope to share our progress soon.
What is PsephoAnalytics?
Wednesday, September 10, 2014
Political campaigns have limited resources, both time and financial, that should be spent on attracting voters that are more likely to support their candidates. Identifying these voters can be critical to the success of a candidate.
Given the privacy of voting and the lack of useful surveys, there are few options for identifying individual voter preferences:
- Polling, which is large-scale, but does not identify individual voters
- Voter databases, which identify individual voters, but are typically very small scale
- In-depth analytical modeling, which is both large-scale and helps to ‘identify’ voters (at least at a neighbourhood level on average)
The goal of PsephoAnalytics* is to model voting behaviour in order to accurately explain campaigns (starting with the 2014 Toronto mayoral race). This means attempting to answer four key questions:
- What are the (causal) explanations for how election campaigns evolve, and how well can we predict their outcomes?
- What are the effects of (even simple) shocks to election campaigns?
- How can we advance our understanding of election campaigns?
- How can elections be better designed?
Psephology (from the Greek psephos, for ‘pebble’, which the ancient Greeks used as ballots) deals with the analysis of elections.
Public service vs. Academics
Thursday, June 3, 2010
I recently participated in a panel discussion at the University of Toronto on the career transition from academic research to public service. I really enjoyed the discussion and there were many great questions from the audience. Here's just a brief summary of some of the main points I tried to make about the differences between academics and public service.
The major difference I've experienced involves a trade-off between control and influence.
As a grad student and post-doctoral researcher I had almost complete control over my work. I could decide what was interesting, how to pursue questions, who to talk to, and when to work on specific components of my research. I believe that I made some important contributions to my field of study. But, to be honest, this work had very little influence beyond a small group of colleagues who are also interested in the evolution of floral form.
Now I want to be clear about this: in no way should this be interpreted to mean that scientific research is not important. This is how scientific progress is made: many scientists working on particular, specific questions that are aggregated into general knowledge. This work is important and deserves support. Plus, it was incredibly interesting and rewarding.
However, the comparison of the influence of my academic research with my work on infrastructure policy is revealing. Roads, bridges, transit, hospitals, schools, courthouses, and jails all have significant impacts on the day-to-day experience of millions of people. Every day I am involved in decisions that determine where, when, and how the government will invest scarce resources into these important services.
Of course, this is where the control-influence trade-off kicks in. As an individual public servant, I have very little control over these decisions or how my work will be used. Almost everything I do involves medium-sized teams with members from many departments and ministries. This requires extensive collaboration, often under very tight time constraints with high profile outcomes.
For example, in my first week as a public servant I started a year-long process to integrate and enhance decision-making processes across 20 ministries and 2 agencies. The project team included engineers, policy analysts, accountants, lawyers, economists, and external consultants from all of the major government sectors. The (rather long) document produced by this process is now used to inform every infrastructure decision made by the province.
Governments contend with really interesting and complicated problems that no one else can or will consider. Businesses generally take on the easy and profitable issues, while NGOs are able to focus on specific aspects of issues. Consequently, working on government policy provides a seemingly endless supply of challenges and puzzles to solve, or at least mitigate. I find this very rewarding.
None of this is to suggest that either option is better than the other. I've been lucky to have had two very interesting careers so far, which have been at the opposite ends of this control-influence trade-off. Nonetheless, my experience suggests that an actual academic career is incredibly challenging to obtain and may require significant compromises. Public service can offer many of the same intellectual challenges with better job prospects and work-life balance. But, you need to be comfortable with the diminished control.
Thanks to my colleague Andrew Miller for creating the panel and inviting me to participate. The experience led me to think more clearly about my career choices and I think the panel was helpful to some University of Toronto grad students.
From brutal brooding to retrofit-chic
Thursday, January 21, 2010
Our offices will be moving to this new space. I’m looking forward to actually working in a green building, in addition to developing green building policies.
The Jarvis Street project will set the benchmark for how the province manages its own building retrofits. The eight-month-old Green Energy Act requires Ontario government and broader public-sector buildings to meet a minimum LEED Silver standard (Leadership in Energy and Environmental Design). Jarvis Street will also be used to promote an internal culture of conservation, and to demonstrate the province's commitment to technologically advanced workspaces that are accessible, flexible and that foster staff collaboration and creativity, Ms. Robinson explains.
Emacs Installation on Windows XP
Wednesday, October 7, 2009
I spend a fair bit of time with a locked-down Windows XP machine. Fortunately, I'm able to install Emacs, which provides capabilities that I find quite helpful. I've had to reinstall Emacs a few times now. So, for my own benefit (and perhaps yours), here are the steps I follow:
Download EmacsW32 patched and install in my user directory under Apps
Available from http://ourcomments.org/Emacs/EmacsW32.html
Set the environment variable for HOME to my user directory
Right click on My Computer, select the Advanced tab, and then Environment Variables.
Add a new variable and set Variable name to HOME and Variable value to C:\Documents and Settings\my_user_directory
Download technomancy’s Emacs Starter Kit
Available from http://github.com/technomancy/emacs-starter-kit
Extract archive into .emacs_d in %HOME%
Copy my specific emacs settings into .emacs_d\my_user_name.el
Canada LEED projects
Sunday, June 21, 2009
The CaGBC maintains a list of all the registered LEED projects in Canada. This is a great resource, but rather awkward for analyses. I've copied these data into a DabbleDB application with some of the maps and tabulations that I frequently need to reference.
Here for example is a map of the density of LEED projects in each province. While here is a rather detailed view of the kinds of projects across provinces. There are several other views available. Are there any others that might be useful?
Every day is 'science day'
Wednesday, June 3, 2009
I was given an opportunity to propose a measure to clarify how and on what basis the federal government allocates funds to STI - a measure that would strengthen relations between the federal government and the STI community by eliminating misunderstandings and suspicions on this point. In short, my proposal was that Ottawa direct its Science, Technology and Innovation Council to do three things:
To provide an up-to-date description of how these allocation decisions have been made in the past;
To identify the principles and sources of advice on which such decisions should be based;
To recommend the most appropriate structure and process - one characterized by transparency and openness - for making these decisions in the future.
These are reasonable suggestions from Preston Manning: be clear about why and how the Federal government funds science and technology.
Of course I may not agree with the actual decisions made through such a process, but at least I would know why the decisions were made. The current process is far too opaque and confused for such critical investment decisions.
Math and the City
Tuesday, May 26, 2009
judson.blogs.nytimes.com/2009/05/1…
A good read on the mathematics of scaling in urban patterns. I had looked into using the Bettencourt paper (cited in this article) for making allocation decisions. The trick is moving from the general patterns observed in urban scaling to specific recommendations for where to invest in new infrastructure. This is particularly challenging in the absence of good, detailed data on the current infrastructure stock. We've made good progress on gathering some of this data, and it might be worth revisiting this scaling relationship.
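For reference, the scaling relationship at the centre of that work takes a simple power-law form (my summary of the Bettencourt result, not a quote from the article):

$$ Y = Y_0 N^{\beta} $$

where $Y$ is some urban indicator (road surface, total wages, and so on), $N$ is the city's population, and $Y_0$ is a constant. Infrastructure measures tend to scale sublinearly (roughly $\beta \approx 0.85$), while socioeconomic outputs scale superlinearly (roughly $\beta \approx 1.15$), which is what makes the relationship tempting for infrastructure planning.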
Mama Earth Organics
Friday, May 22, 2009
I'm certain that paying attention to where my food comes from is important. Food production influences my health, has environmental consequences, and affects both urban and rural design. Ideally, I would develop relationships with local farmers, carefully choose organic produce, and always consider broad environmental impacts. Except, I like to spend time with my young family, try to get some exercise, and have more than enough commitments through work to actually spend this much effort on food choices. So, I've outsourced this process to the excellent Mama Earth Organics.
Every week a basket of fresh organic and/or local fruit and vegetables arrives on our doorstep. Part of the fun of this service is that different items arrive each week, which diversifies our weekly food routine. But, we always know what's coming several days in advance, so we can plan our meals well ahead of time. After over a year of service, we've only had a single complaint about quality and this was handled very quickly by Mama Earth with a full refund plus credit.
We've found the small basket is sufficient for two adults and a picky four-year-old. We've also added in some fresh bread from St. John's Bakery, which has been consistently delicious and lasts through most of the week.
Goodyear's Religious Beliefs vs. Evolution
Thursday, March 19, 2009
Our minister of science continues to argue that his unwillingness to endorse the theory of evolution is not relevant to science policy. As quoted by the Globe and Mail:
My view isn’t important. My personal beliefs are not important.
I find this amazing. How can the minister of science’s views on the fundamental unifying theory of biology not be important?
I don’t expect him to understand the details of evolutionary theory or to have all of his personal beliefs vetted and religious views muted. However, I do expect him – as minister – to champion and support Canadian science, especially basic research. When our minister refuses to acknowledge the fundamental discoveries of science, our reputation is diminished.
There is also a legitimate – though rather exaggerated – concern that the minister’s views on the truth can influence policy and funding decisions. The funding councils are more than sufficiently independent to prevent any undue ministerial influence here. The real problem is an apparent distrust or lack of interest in basic research from the federal government.
Death Sentences Review
Wednesday, February 25, 2009
Death Sentences by Don Watson is a wonderful book (simultaneously funny, scary, and inspiring) that describes how "clichés, weasel words, and management-speak" are infecting public language.
The humour comes from Watson's acerbic commentary and fantastic scorn for phrases like:
Given the within year and budget time flexibility accorded to the science agencies in the determination of resource allocation from within their global budget, a multi-parameter approach to maintaining the agencies budgets in real terms is not appropriate.
The book is scary because it makes a strong argument for the dangers of this type of language. Citizens become confused and disinterested, customers become jaded, and people lose their love for language. Also, as a public servant I see this kind of language every day and often find myself struggling to avoid banality and clichés (not to mention bullet points). We need more forceful advocates like Don Watson to call out politicians and corporations for abusing our language. This book certainly makes me want to try harder. And what's more inspiring than struggling for a good cause against long odds?
The book also has a great glossary of typical weasel words with possible synonyms. So, I'm keeping the book in my office for quick reference.
Omnivore
Monday, February 23, 2009
After seventeen years as a vegetarian, I recently switched back to an omnivore. My motivation for not eating meat was environmental, since, on average, a vegetarian diet requires much less land, water, and energy. This is still the right motivation, but over the last year or so I've been rethinking my decision to not eat meat.
My concern was that I'd stopped paying attention to my food choices and a poorly considered vegetarian diet can easily yield a bad environmental outcome. In particular, modern agriculture now takes 10 calories of fossil fuel energy to produce a single calorie of food. This is clearly unsustainable. We cannot rely on non-renewable, polluting resources for our food, nor can we continue to transport food great distances, even if it is only vegetables. My unexamined commitment to a vegetarian diet was no longer consistent with environmental sustainability.
I think the solution is to eat local, organic food. This also requires eating seasonal food, but Canadian winters are horrible for local vegetables. This left me wanting to support local agriculture, but unable to restrict my diet. Returning to my original motivation to choose environmentally appropriate food convinced me it was time to return to being an omnivore. My new policy is to follow Michael Pollan's advice: "Eat food. Mostly plants. Not too much." In addition, I'll favour locally grown, organic food and include small amounts of meat, which I hope will predominantly come from carefully considered and sustainable sources. I've also decided that when faced with a dilemma of choosing either local or organic, I'll choose local. We need to support local agriculture and I'll trade this for organic if necessary. Of course, in the majority of cases local and organic options are available, and I'll choose them.
This is a big change and I look forward to exploring food again.
Instapaper Review
Monday, January 5, 2009
Instapaper is an integral part of my web-reading routine. Typically, I have a few minutes early in the morning and scattered throughout the day for quick scans of my favourite web sites and news feeds. I capture anything worth reading with Instapaper's bookmarklet to create a reading queue of interesting articles. Then with a quick update to the iPhone app this queue is available whenever I find longer blocks of time for reading, particularly during the morning subway ride to work or late at night.
I also greatly appreciate Instapaper's text view, which removes all the banners, ads, and link lists from the articles to present a nice and clean text view of the content only. I often find myself saving an article to Instapaper even when I have the time to read it, just so I can use this text-only view.
Instapaper is one of my favourite tools and the first iPhone application I purchased.
Election 2008
Tuesday, October 14, 2008
Like most Canadians, I'll be at the polls today for the 2008 Federal Election.
In the past several elections, I've cast my vote for the party with the best climate change plan. The consensus among economists is that any credible plan must set a price on carbon emissions. My personal preference is for a predictable and transparent price to influence consumer spending, so I favour a carbon tax over a cap-and-trade. Enlightening discussions of these issues are available at Worthwhile Canadian Initiative, Jeffrey Simpson's column at the Globe and Mail, or his book Hot Air.
Until now this voting principle has meant a vote for the Green Party, who support a tax shift from income to pollution. My expectation for this vote was not that the Green Party would gain any direct political power, rather that their environmental plan would gain political profile and convince the Liberals and Conservatives to improve their plans. A carbon tax is now a central component of this year's Liberal Platform with the Green Shift. Both the Conservative Party and NDP support a limited cap-and-trade system on portions of the economy, with the Conservatives supporting dubious "intensity-based" targets.
Although I quite like the central components of the Green Shift, I'm not too keen on the distracting social engineering aspects of the plan. Furthermore, the Liberals have certainly failed to implement any of their previous climate change plans while in power. Nonetheless, I do think (hope?) they will follow through this time, and I prefer supporting a well-conceived plan that may not be implemented over a poor plan. Despite my support for this plan, I think the Liberals have done a rather poor job of explaining the Green Shift and have conducted a disappointing campaign.
In the end, my principle will hold. I'm voting for the Green Shift and, reluctantly, the Liberal Party of Canada.
A Map of the Limits of Statistics
Friday, October 10, 2008
In this article Nassim Nicholas Taleb applies his Black Swan idea to the current financial crisis and describes the strengths and weaknesses of econometrics.
For us the world is vastly simpler in some sense than the academy, vastly more complicated in another. So the central lesson from decision-making (as opposed to working with data on a computer or bickering about logical constructions) is the following: it is the exposure (or payoff) that creates the complexity—and the opportunities and dangers—not so much the knowledge (i.e., statistical distribution, model representation, etc.). In some situations, you can be extremely wrong and be fine, in others you can be slightly wrong and explode. If you are leveraged, errors blow you up; if you are not, you can enjoy life.
Globe and Mail: Incremental man
Sunday, October 5, 2008
A detailed and fascinating portrait of Stephen Harper. As the article points out:
The core of any government reflects the personality of the prime minister, because everyone in the system responds to his or her ways of thinking, personality traits, political ambitions and policy preferences. Know the prime minister; know the government.
Harper has been an enigma and learning more about his personal policies and approach to governance is very useful while thinking about the upcoming election.
A general summary of the article comes from near the end:
And the long-distance runner—bright, intense, strategic, cautious and confident in every stride—has certainly got things done, from merging two parties, to winning a minority government, to fulfilling most of his campaign promises.
He also has pursued two broad changes in the nature of the federal government: giving the provinces more running room by keeping Ottawa out of some of their affairs and giving individuals a bit more money in the form of tax reductions, credits and child-care cheques.
And yet, despite these policies that he assumed would be popular, despite all the problems on the Liberal side, despite raising far more money, despite governing in mostly excellent economic times, despite stroking Quebec, despite gearing up for elections, his Conservatives have yet to break through decisively.
Patrick Watson
Sunday, July 27, 2008
Reading up on the upcoming Polaris Music Prize reminded me of Patrick Watson, last year’s winner of the prize. His “Close to Paradise” album is inventive with intriguing lyrics, unique sounds, and an often driving piano track. Particular stand out tracks are Luscious Life, Drifters, and The Great Escape. The album is well worth considering and I’m looking forward to listening to the short-listed artists for this year’s prize.
Stuck in the middle
Saturday, June 23, 2007
A recent press release from the federal government entitled “Making a Strong Canadian Economy Even Stronger” contains a sentence that struck me as odd.
As a result of actions taken in Budget 2007, Canada's marginal effective tax rate (METR) on new business investment improved from third-highest in the G7 to third-lowest by 2011.
Fair enough, tax rates are projected to decline. But notice how they phrase the context of this reduction. Moving from third highest to third lowest is, in a list of seven countries, a change from third to fifth. Not a dramatic change – we were near the middle and we still are.
Creationists and their old tricks
Friday, March 30, 2007
TVO's The Agenda had an interesting show on the debate between evolutionary biology and creationism. Jerry Coyne provided a great overview of evolution and a good defence during the debate.
The debate offered a great illustration of the intellectual vacuity that characterises creationism (aka intelligent design). Paul Nelson offers up an article by Doolittle and Bapteste as proof that Darwinism is unravelling. I suspect he hopes no one will read past the abstract to discover the reasonable debate scientists are having about the universality of a single tree of life. He certainly doesn't want you to notice that the entire article is couched within evolutionary theory and not once does it claim that Darwinism has been falsified.
Here's the hypothesis that Doolittle and Bapteste are evaluating:
"that there should be a universal TOL [tree of life], dichotomously branching all of the way down to a single root." p2045
They then establish that gene transfer often occurs between lineages, particularly among prokaryotes, and consequently this universal tree of life does not exist. Certainly this complicates the construction of molecular trees and shows the importance for pluralism of mechanism in biology. But they write much more about the overall significance of this work.
"To be sure, much of evolution has been tree-like and is captured in hierarchical classifications." p2048
"…it would be perverse to claim that Darwin's TOL hypothesis has been falsified for animals (the taxon to which he primarily addressed himself) or that it is not an appropriate model for many taxa at many levels of analysis" p2048
And the crucial quote in this context:
"Holding onto this ladder of pattern […] should not be an essential element in our struggle against those who doubt the validity of evolutionary theory, who can take comfort from this challenge to the TOL only by a willful misunderstanding of its import." p2048
Stikkit from the command line
Wednesday, March 21, 2007
Note: This post has been updated from 2007-03-20 to describe new installation instructions.
Overview
I've integrated Stikkit into most of my workflow and am quite happy with the results. However, one missing piece is quick access to Stikkit from the command line. In particular, a quick list of my undone todos is quite useful without having to load up a web browser. To this end, I've written a Ruby script for interacting with Stikkit. As I mentioned, my real interest is in listing undone todos. But I decided to make the script more general, so you can ask for specific types of stikkits and restrict the stikkits with specific parameters. Also, since the Stikkit API is so easy to use, I added in a method for creating new stikkits.
Usage
The general use of the script is to list stikkits of a particular type, filtered by a parameter. For example,
ruby stikkit.rb --list calendar dates=today
will show all of today's calendar events. While,
ruby stikkit.rb -l todos done=0
lists all undone todos. The use of `-l` instead of `--list` is simply a standard convenience. Furthermore, since this last example comprises almost all of my use for this script, I added a convenience method to get all undone todos:
ruby stikkit.rb -t
A good way to understand stikkit types and parameters is to keep an eye on the url while you interact with Stikkit in your browser. To create a new stikkit, use the `--create` flag:
ruby stikkit.rb -c 'Remember me.'
The text you pass to stikkit.rb will be processed as usual by Stikkit.
Installation
Grab the script from the Google Code project and put it somewhere convenient. Making the file executable and adding it to your path will cut down on the typing. The script reads from a .stikkit file in your path that contains your username and password. Modify this template and save it as ~/.stikkit:
---
username: me@domain.org
password: superSecret
The script also requires the atom gem, which you can grab with
gem install atom
I've tried to include some flexibility in the processing of stikkits. So, if you don't like using atom, you can switch to a different format provided by Stikkit. The text type requires no gems, but makes picking out pieces of the stikkits challenging.
Feedback
This script serves me well, but I'm interested in making it more useful. Feel free to pass along any comments or feature requests.
Yahoo Pipes and the Globe and Mail
Friday, March 16, 2007
Most of my updates arrive through feeds to NetNewsWire. Since my main source of national news and analysis is the Globe and Mail, I'm quite happy that they provide many feeds for accessing their content. The problem is that many news stories are duplicated across these feeds. Furthermore, tracking all of the feeds of interest is challenging.
The new Yahoo Pipes offer a solution to these problems. Without providing too much detail, pipes are a way to filter, connect, and generally mash-up the web with a straightforward interface. I've used this service to collect all of the Globe and Mail feeds of interest, filter out the duplicates, and produce a feed I can subscribe to. Nothing fancy, but quite useful. The pipe is publicly available and if you don't agree with my choice of news feeds, you are free to clone mine and create your own. There are plenty of other pipes available, so take a look to see if anything looks useful to you. Even better, create your own.
If you really want those details, Tim O'Reilly has plenty.
Stikkit Todos in GMail
Wednesday, March 7, 2007
I find it useful to have a list of my unfinished tasks generally, but subtly, available. To this end, I've added my unfinished todos from Stikkit to my Gmail web clips. These are the small snippets of text that appear just above the message list in GMail.
All you need is the subscribe link from your todo page with the "not done" button toggled. The url should look something like:
http://stikkit.com/todos.atom?api_key={}&done=0
Paste this into the 'Search by topic or URL:' box of the Web Clips tab in GMail settings.