Longform

A Map of the Limits of Statistics

Friday, October 10, 2008

In this article Nassim Nicholas Taleb applies his Black Swan idea to the current financial crisis and describes the strengths and weaknesses of econometrics.

For us the world is vastly simpler in some sense than the academy, vastly more complicated in another. So the central lesson from decision-making (as opposed to working with data on a computer or bickering about logical constructions) is the following: it is the exposure (or payoff) that creates the complexity —and the opportunities and dangers— not so much the knowledge ( i.e., statistical distribution, model representation, etc.). In some situations, you can be extremely wrong and be fine, in others you can be slightly wrong and explode. If you are leveraged, errors blow you up; if you are not, you can enjoy life.

Via Arts and Letters Daily

Globe and Mail: Incremental man

Sunday, October 5, 2008

A detailed and fascinating portrait of Stephen Harper. As the article points out:

The core of any government reflects the personality of the prime minister, because everyone in the system responds to his or her ways of thinking, personality traits, political ambitions and policy preferences. Know the prime minister; know the government.

Harper has been an enigma and learning more about his personal policies and approach to governance is very useful while thinking about the upcoming election.

A general summary of the article comes from near the end:

And the long-distance runner – bright, intense, strategic, cautious and confident in every stride – has certainly got things done, from merging two parties, to winning a minority government, to fulfilling most of his campaign promises.

He also has pursued two broad changes in the nature of the federal government: giving the provinces more running room by keeping Ottawa out of some of their affairs and giving individuals a bit more money in the form of tax reductions, credits and child-care cheques.

And yet, despite these policies that he assumed would be popular, despite all the problems on the Liberal side, despite raising far more money, despite governing in mostly excellent economic times, despite stroking Quebec, despite gearing up for elections, his Conservatives have yet to break through decisively.

Patrick Watson

Sunday, July 27, 2008

Reading up on the upcoming Polaris Music Prize reminded me of Patrick Watson, last year’s winner of the prize. His “Close to Paradise” album is inventive with intriguing lyrics, unique sounds, and an often driving piano track. Particular standout tracks are Luscious Life, Drifters, and The Great Escape. The album is well worth considering and I’m looking forward to listening to the short-listed artists for this year’s prize.

Stuck in the middle

Friday, June 22, 2007

A recent press release from the federal government entitled “Making a Strong Canadian Economy Even Stronger” contains a sentence that struck me as odd.

As a result of actions taken in Budget 2007, Canada’s marginal effective tax rate (METR) on new business investment improved from third-highest in the G7 to third-lowest by 2011.

Fair enough, tax rates are projected to decline. But notice how they phrase the context of this reduction. Moving from third highest to third lowest is, in a list of seven countries, a change from third to fifth. Not a dramatic change – we were near the middle and we still are.
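To make the arithmetic explicit: in a ranked list of n items, the k-th lowest is the (n - k + 1)-th highest. A small shell check, with variable names chosen here purely for illustration:

```shell
n=7   # number of countries in the G7
k=3   # "third-lowest"

# Position counted from the top of the list (highest tax rate first).
rank_from_top=$(( n - k + 1 ))
echo "third-lowest of $n is number $rank_from_top from the top"
```

So the move is from position three to position five out of seven, which is two places.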

Creationists and their old tricks

Friday, March 30, 2007

TVO’s The Agenda had an interesting show on the debate between evolutionary biology and creationism. Jerry Coyne provided a great overview of evolution and a good defence during the debate.

The debate offered a great illustration of the intellectual vacuity that characterises creationism (aka intelligent design). Paul Nelson offers up an article by Doolittle and Bapteste as proof that Darwinism is unravelling. I suspect he hopes no one will read past the abstract to discover the reasonable debate scientists are having about the universality of a single tree of life. He certainly doesn’t want you to notice that the entire article is couched within evolutionary theory and not once does it claim that Darwinism has been falsified.

Here’s the hypothesis that Doolittle and Bapteste are evaluating:

“that there should be a universal TOL [tree of life], dichotomously branching all of the way down to a single root.” p2045

They then establish that gene transfer often occurs between lineages, particularly among prokaryotes, and consequently this universal tree of life does not exist. Certainly this complicates the construction of molecular trees and shows the importance of a pluralism of mechanisms in biology. But they write much more about the overall significance of this work.

“To be sure, much of evolution has been tree-like and is captured in hierarchical classifications.” p2048

“…it would be perverse to claim that Darwin’s TOL hypothesis has been falsified for animals (the taxon to which he primarily addressed himself) or that it is not an appropriate model for many taxa at many levels of analysis” p2048

And the crucial quote in this context:

“Holding onto this ladder of pattern […] should not be an essential element in our struggle against those who doubt the validity of evolutionary theory, who can take comfort from this challenge to the TOL only by a willful misunderstanding of its import.” p2048

Stikkit from the command line

Tuesday, March 20, 2007

Note – This post has been updated from 2007-03-20 to describe new installation instructions.

Overview

I’ve integrated Stikkit into most of my workflow and am quite happy with the results. However, one missing piece is quick access to Stikkit from the command line. In particular, a quick list of my undone todos is quite useful without having to load up a web browser. To this end, I’ve written a Ruby script for interacting with Stikkit. As I mentioned, my real interest is in listing undone todos. But I decided to make the script more general, so you can ask for specific types of stikkits and restrict the stikkits with specific parameters. Also, since the Stikkit API is so easy to use, I added in a method for creating new stikkits.

Usage

The general use of the script is to list stikkits of a particular type, filtered by a parameter. For example,

ruby stikkit.rb --list calendar dates=today

will show all of today’s calendar events. While,

ruby stikkit.rb -l todos done=0

lists all undone todos. The use of -l instead of --list is simply a standard convenience. Furthermore, since this last example comprises almost all of my use for this script, I added a convenience method to get all undone todos

ruby stikkit.rb -t

A good way to understand stikkit types and parameters is to keep an eye on the url while you interact with Stikkit in your browser. To create a new stikkit, use the --create flag, or -c for short,

ruby stikkit.rb -c 'Remember me.'

The text you pass to stikkit.rb will be processed as usual by Stikkit.

Installation

Grab the script from the Google Code project and put it somewhere convenient. Making the file executable and adding it to your path will cut down on the typing. The script reads from a .stikkit file in your home directory that contains your username and password. Modify this template and save it as ~/.stikkit


     ---
     username: me@domain.org 
     password: superSecret
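
Because this file holds a password in plain text, it is worth restricting its permissions. A minimal sketch, assuming the script reads ~/.stikkit as described above; the STIKKIT_CONFIG variable is a hypothetical convenience for this snippet only and is not something the script itself knows about:

```shell
# Hypothetical override used only by this snippet; the script reads ~/.stikkit.
config="${STIKKIT_CONFIG:-$HOME/.stikkit}"

# Write the template (replace the placeholder values with your own).
cat > "$config" <<'EOF'
---
username: me@domain.org
password: superSecret
EOF

# Allow only the file's owner to read or write it.
chmod 600 "$config"
```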

The script also requires the atom gem, which you can grab with

gem install atom

I’ve tried to include some flexibility in the processing of stikkits. So, if you don’t like using atom, you can switch to a different format provided by Stikkit. The text type requires no gems, but makes picking out pieces of the stikkits challenging.

Feedback

This script serves me well, but I’m interested in making it more useful. Feel free to pass along any comments or feature requests.

Yahoo Pipes and the Globe and Mail

Friday, March 16, 2007

Most of my updates arrive through feeds to NetNewsWire. Since my main source of national news and analysis is the Globe and Mail, I’m quite happy that they provide many feeds for accessing their content. The problem is that many news stories are duplicated across these feeds. Furthermore, tracking all of the feeds of interest is challenging.

The new Yahoo Pipes offer a solution to these problems. Without providing too much detail, pipes are a way to filter, connect, and generally mash-up the web with a straightforward interface. I’ve used this service to collect all of the Globe and Mail feeds of interest, filter out the duplicates, and produce a feed I can subscribe to. Nothing fancy, but quite useful. The pipe is publicly available and if you don’t agree with my choice of news feeds, you are free to clone mine and create your own. There are plenty of other pipes available, so take a look to see if anything looks useful to you. Even better, create your own.

If you really want those details, Tim O'Reilly has plenty.

Stikkit Todos in GMail

Wednesday, March 7, 2007

I find it useful to have a list of my unfinished tasks generally, but subtly, available. To this end, I’ve added my unfinished todos from Stikkit to my Gmail web clips. These are the small snippets of text that appear just above the message list in GMail.

All you need is the subscribe link from your todo page with the ‘not done’ button toggled. The url should look something like:

http://stikkit.com/todos.atom?api_key={}&done=0

Paste this into the ‘Search by topic or URL:’ box of the Web Clips tab in GMail settings.

DabbleDB

Monday, February 19, 2007

My experience helping people manage their data has repeatedly shown that databases are poorly understood. This is well illustrated by the rampant abuses of spreadsheets for recording, manipulating, and analysing data.

Most people realise that they should be using a database; the real issue is the difficulty of creating a proper database. This is a legitimate challenge. Typically, you need to carefully consider all of the categories of data and their relationships when creating the database, which makes the upfront costs quite significant. Why not just start throwing data into a spreadsheet and worry about it later?

I think that DabbleDB can solve this problem. A great strength of Dabble, and the source of its name, is that you can start with a simple spreadsheet of data and progressively convert it to a database as you begin to better understand the data and your requirements.

Dabble also has a host of great features for working with data. I’ll illustrate this with a database I created recently when we were looking for a new home. This is a daunting challenge. We looked at dozens of houses each with unique pros and cons in different neighbourhoods and with different price ranges. I certainly couldn’t keep track of them all.

I started with a simple list of addresses for consideration. This was easily imported into Dabble and immediately became useful. Dabble can export to Google Earth, so I could quickly have an overview of the properties and their proximity to amenities like transit stops and parks. Next, I added in a field for asking price and MLS url which were also exported to Google Earth. Including price gave a good sense of how costs varied with location, while the url meant I could quickly view the entire listing for a property.

Next, we started scheduling appointments to view properties. Adding this to Dabble immediately created a calendar view. Better yet, Dabble can export this view as an iCal file to add into a calendaring program.

Once we started viewing homes, we began to understand what we really were looking for in terms of features. So, we added these to Dabble and then started grouping, searching, and sorting by these attributes.

All of this would have been incredibly challenging without Dabble. No doubt, I would have simply used a spreadsheet and missed out on the rich functionality of a database.

Dabble really is worth a look. The best way to start is to watch the seven minute demo and then review some of the great screencasts.

Stikkit-- Out with the mental clutter

Wednesday, January 31, 2007

I like to believe that my brain is useful for analysis, synthesis, and creativity. Clearly it is not proficient at storing details like specific dates and looming reminders. Nonetheless, a great deal of my mental energy is devoted to trying to remember such details and fearing the consequences of the inevitable “it slipped my mind”. As counselled by GTD, I need a good and trustworthy system for removing these important, but distracting, details and having them reappear when needed. I’ve finally settled on the new product from values of n called Stikkit.

Stikkit appeals to me for two main reasons: easy data entry and smart text processing. Stikkit uses the metaphor of the yellow sticky note for capturing text. When you create a new note, you are presented with a simple text field — nothing more. However, Stikkit parses your note for some key words and extracts information to make the note more useful. For example, if you type:

Phone call with John Smith on Feb 1 at 1pm

Stikkit realises that you are describing an event scheduled for February 1st at one in the afternoon with a person (“peep” in Stikkit slang) named John Smith. A separate note will be created to track information about John Smith and will be linked to the phone call note. If you add the text “remind me” to the note, Stikkit will send you an email and SMS message prior to the event. You can also include tags to group notes together with the keywords “tag as”.

A recent update to peeps makes them even more useful. Stikkit now collects information about people as you create notes. So, for example, if I later post:

- Send documents to John Smith john@smith.net

Stikkit will recognise John Smith and update my peep for him with the email address provided. In this way, Stikkit becomes more useful as you continue to add information to notes. Also, the prefixed “-” causes Stikkit to recognise this note as a todo. I can then list all of my todos and check them off as they are completed.

This text processing greatly simplifies data entry, since I don’t need to click around to create todos or choose dates from a calendar picker. Just type in the text, hit save, and I’m done. Fortunately, Stikkit has been designed to be smart rather than clever. The distinction here is that Stikkit relies on some key words (such as at, for, to) to mark up notes consistently and reliably. Clever software is exemplified by Microsoft Word’s autocorrect or clipboard assistant. My first goal when encountering these “features” is to turn them off. I find they rarely do the right thing and end up being a hindrance. Stikkit is well worth a look. For a great overview check out the screencasts in the forum.

Mac vs. PC Remotes

Sunday, January 7, 2007

An image of a remote from Apple and a PC

I grabbed this image while preparing a new Windows machine. This seems to be an interesting comparison of the difference in design approaches between Apple and PC remotes. Both provide essentially the same functions. Clearly, however, one is more complex than the other. Which would you rather use?

Plantae's continued development

Thursday, November 30, 2006

Prior to general release, plantae is moving web hosts. This seems like a good time to point out that all of plantae’s code is hosted at Google Code. The project has great potential and deserves consistent attention. Unfortunately, I can’t continue to develop the code. So, if you have an interest in collaborative software, particularly in the scientific context, I encourage you to take a look.

Text processing with Unix

Wednesday, November 29, 2006

I recently helped someone process a text file with the help of Unix command line tools. The job would have been quite challenging otherwise, and I think this represents a useful demonstration of why I choose to use Unix.

The basic structure of the datafile was:

; A general header file ;
1
sample: 0.183 0.874 0.226 0.214 0.921 0.272 0.117
2
sample: 0.411 0.186 0.956 0.492 0.150 0.278 0.110
3
...

In this case the only important information is the second number of each line that begins with “sample:”. Of course, one option is to manually process the file, but there are thousands of lines, and that’s just silly.

We begin by extracting only the lines that begin with “sample:”. grep will do this job easily:

grep "^sample" input.txt

grep searches through the input.txt file and outputs any matching lines to standard output.

Now, we need the second number. sed can strip out the initial text of each line with a find and replace while tr compresses any strange use of whitespace:

sed 's/sample: //g' | tr -s ' '

Notice the use of the pipe (|) operator here. This sends the output of one command to the input of the next. This allows commands to be strung together and is one of the truly powerful tools in Unix.

Now we have a matrix of numbers in rows and columns, which is easily processed with awk.

awk '{print $2;}'

Here we ask awk to print out the second number of each row.

So, if we string all this together with pipes, we can process this file as follows:

grep "^sample" input.txt | sed 's/sample: //g' | tr -s ' ' | awk '{print $2;}' > output.txt

Our numbers of interest are in output.txt.
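
As a variation on the pipeline above, awk can do both the filtering and the column selection on its own. This is a sketch rather than a replacement for the approach described: it recreates a small sample of the data file from the values shown earlier (the file names sample_input.txt and sample_output.txt are made up for the example) and relies on awk's default behaviour of splitting fields on runs of whitespace, which makes the "sample:" label field 1 and the second number field 3.

```shell
# Recreate a small sample of the data file, using the values from the post.
cat > sample_input.txt <<'EOF'
; A general header file ;
1
sample: 0.183 0.874 0.226 0.214 0.921 0.272 0.117
2
sample: 0.411 0.186 0.956 0.492 0.150 0.278 0.110
EOF

# Keep only lines starting with "sample:" and print the second number.
# Field 1 is the "sample:" label, so the second number is field 3.
awk '/^sample:/ { print $3 }' sample_input.txt > sample_output.txt

cat sample_output.txt
```

The multi-stage pipeline is easier to build up and inspect one step at a time; the single awk call is simply shorter once the pattern is known.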

Images from the Hinode spacecraft

Friday, November 3, 2006

Japan’s Hinode spacecraft has started taking pictures of the Sun. The detail of the shots is amazing and gives a sense of the Sun’s structure.

First light image

Stern Review on the economics of climate change

Tuesday, October 31, 2006

The Stern Review has been in the news recently for predicting that global warming could cost up to $7 trillion if not addressed soon. Of course, this has caused quite a stir as it offsets many of the, likely unfounded, concerns that fixing climate change will cost too much. The full report is available online and should be quite an interesting, if long, read.

Climate change and public relations

Saturday, October 7, 2006

This article in the Guardian explores the use of public relations firms by big oil companies to fight against the science of climate change. Apparently, the same tactics, and even the same people, from the tobacco industry’s fight against the link between smoking and cancer are being employed by the oil industry.

Principles of Technology Adoption

Tuesday, September 19, 2006

Choosing appropriate software tools can be challenging. Here are the principles I employ when making the decision:

Simple: This seems obvious, but many companies fail here. Typically, their downfall is focussing on a perpetual increase in feature quantity. I don’t evaluate software with feature counts. Rather, I value software that performs a few key operations well. Small, focussed tools result in much greater productivity than overly-complex, all-in-one tools. 37 Signals’ Writeboard is a great example of a simple, focussed tool for collaborative writing.
Open formats: I will not choose software that uses proprietary or closed data formats. Closed formats cause two main difficulties:
I must pay the proprietor of a closed format for the privilege of accessing my data. Furthermore, switching to a new set of software may require translating my data or, even worse, losing access altogether. Open formats allow me to access my data at any time and with any appropriate tool.
My tools are limited to the range of innovations that the proprietor deems important. Open formats allow me to take advantage of any great new tools that become available.
Flexible: As my requirements ebb and flow, I shouldn’t be locked into the constraints of a particular tool. The best options are internet-based, subscription plans. If I need more storage space or more access for collaborators, I simply choose a new subscription plan. Then, if things slow down, I can move back down to a basic plan and save money. The online backup service Strongspace, for example, has a subscription plan tied to the amount of storage and number of users required.
Network: A good tool must be fully integrated into the network. The ability to collaborate with anyone or access my data from any computer are great boons to productivity. Many of the best tools are completely internet based; all that is required to use them is a web browser. This also means that the tool is monitored and maintained by a collection of experts and that the tool can be upgraded at any time without being locked into a version-number update. Furthermore, with data maintained on a network, many storage and backup problems are addressed. GMail, for example, stores over 2GB of email, free of charge with an innovative user interface.

Exemplars

These are some of my favourite adherents to the principles outlined above:

TED-- Hans Rosling

Tuesday, September 19, 2006

An excellent presentation regarding the use of country statistics. The visualizations are particularly effective.

Resumes & Spam Filters

Sunday, September 10, 2006

Since I’m looking for work, I found this post rather interesting. They’ve applied a spam filter to resumes to automatically filter through candidates. The output is only as good as the reference resumes used to construct the filter, but still an intriguing idea. My results are below. Most importantly the probability of me not being hired is 1.15e-59, which is a very, very small number. Perhaps I should add this fact to my resume?

I will now tell you what i think about this CV
The CV you entered fits better in the Hired group than the NotHired group.
CLASSIFY fails; success probability: 0.0000  pR: -58.9382
Best match to file #1 (Hired.css) prob: 1.0000  pR: 58.9382 
Total features in input file: 7478
#0 (NotHired.css): features: 61899, hits: 7125, prob: 1.15e-59, pR: -58.94
#1 (Hired.css): features: 794351, hits: 90156, prob: 1.00e+00, pR:  58.94
The CV you entered fits best into the Guru catagory.
CLASSIFY succeeds; success probability: 1.0000  pR: 8.1942
Best match to file #0 (Guru.css) prob: 1.0000  pR: 8.1942 
Total features in input file: 7478
#0 (Guru.css): features: 559355, hits: 66154, prob: 1.00e-00, pR:   8.19
#1 (Intergrator.css): features: 163555, hits: 17093, prob: 2.17e-29, pR: -28.66
#2 (Administrator.css): features: 241282, hits: 24729, prob: 8.45e-25, pR: -24.07
#3 (Developer.css): features: 485579, hits: 54104, prob: 6.39e-09, pR:  -8.19

The Canary Project-- Global Warming Documented in Photos

Wednesday, July 12, 2006

The Canary Project is an intriguing idea. They are documenting the effects of global warming through pictures. Since many people, apparently, don’t believe the abundant scientific evidence, perhaps some startling pictures will be convincing.

RSiteSearch

Tuesday, July 11, 2006

I’m not sure how this escaped my notice until now, but `RSiteSearch` is a very useful command in R. Passing a string to this function loads up your web browser with search results from the R documentation and mailing list. So, for example:

RSiteSearch("glm")

will show you everything you need to know about using R for generalised linear models.

R module for ConTeXt

Monday, July 10, 2006

I generally write my documents in Sweave format. This approach allows me to embed the code for analyses directly in the report derived from the analyses, so that all results and figures are generated dynamically with the text of the report. This provides both great documentation of the analyses and the convenience of a single file to keep track of and work with.

Now there is a new contender for integrating analysis code and documentation with the release of an R module for ConTeXt. I prefer the clean implementation and modern features of ConTeXt to the excellent, but aging, LaTeX macro package that Sweave relies on. So, using ConTeXt for my documents is a great improvement.

Here’s a simple example of using this new module. I create two normally distributed random variables, test for a correlation between them, and plot their joint distribution.

\usemodule[r]

\starttext
Describe the motivation of the analyses first.

Now create some variables.

\startRhidden
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 1)
\stopRhidden

Are they correlated?

\startR
model <- lm(y ~ x)
summary(model)
\stopR

Now we can include a figure.

\startR
pdf("testFigure.pdf")
plot(x, y)
dev.off()
\stopR

\startbuffer
\placefigure{Here it is}{\externalfigure[testFigure]}
\stopbuffer
\getbuffer

\stoptext

Processing this code produces a pdf file with all of the results produced from R, including the figure.

I had some minor difficulties getting this to work on my OS X machine, through no fault of the R module itself. There are two problems. The first is that, by default, write18 is not enabled, so ConTeXt can’t access R directly. Fix this by editing /usr/local/teTeX/texmf.cnf so that “shell_escape = t”. The next is that the R module calls `texmfstart`, which isn’t directly accessible from a stock installation of TeX. The steps required are described in the “Configuration of texmfstart” section of the ConTeXt wiki. I modified this slightly by placing the script in ~/bin so that I didn’t interfere with the installed teTeX tree. Now everything should work.

CBC Radio 3

Friday, July 7, 2006

The CBC Radio 3 podcast is an excellent source for independent, Canadian music. They have recently added a playlist feature that helps you search for your favourite artists and create your own radio station. Definitely worth checking out.

expand.grid

Thursday, July 6, 2006

Here’s a simple trick for creating experimental designs in R: use the function expand.grid.

A simple example is:

  treatments <- LETTERS[1:4]
  levels <- 1:3
  experiment <- data.frame(expand.grid(treatment=treatments, level=levels))

which produces:

   treatment level
1          A     1
2          B     1
3          C     1
4          D     1
5          A     2
6          B     2
7          C     2
8          D     2
9          A     3
10         B     3
11         C     3
12         D     3

Now, if you want to randomize your experimental treatments, try:

  experiment[sample(nrow(experiment)), ]

sample returns the row numbers of the experiment data frame in random order, drawing each one exactly once (that is, without replacement). The square brackets then use this random ordering to rearrange the rows of the experiment data frame.
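
The same idea, crossing every treatment with every level and then shuffling the rows, can also be sketched outside R with nested shell loops. This is an illustrative analogue and not part of R or expand.grid; the file names design.txt and randomized.txt are made up for the example, and the shuffle uses awk's rand(), which is adequate for illustration but is not a substitute for R's random number generator.

```shell
# Full factorial design: treatment varies fastest, as in the table above.
for level in 1 2 3; do
  for treatment in A B C D; do
    printf '%s %s\n' "$treatment" "$level"
  done
done > design.txt

# Randomize the row order: prefix each row with a random key,
# sort on that key, then drop the key again.
awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' design.txt |
  sort -n | cut -f2- > randomized.txt

cat randomized.txt
```

Every row of design.txt appears exactly once in randomized.txt, only in a different order, which mirrors sampling the row numbers without replacement.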

Burning your money

Wednesday, June 7, 2006

Burning our money by Mark Jaccard is a useful overview of some policy options for reducing greenhouse gas emissions. Unfortunately, this article is part of the Globe’s subscribers-only section, but his paper, Burning Our Money to Warm the Planet, is available from the CD Howe Institute.

Heart of the Matter

Wednesday, May 24, 2006

CBC’s Ideas has been running a series of shows on heart disease called “Heart of the Matter”. Episode 2 is particularly interesting from a statistical perspective, as the episode discusses several difficulties with the analysis of drug efficacy. Some highlights include:

Effect sizes: Some of the best cited studies for the use of drugs to treat heart disease show a statistically significant effect of only a few percentage points improvement. Contrast this with a dramatic, vastly superior improvement from diet alone.

Response variables: The focus of many drug studies has been on the reduction of cholesterol, rather than reductions in heart disease. Diet studies, for example, have shown dramatic improvements in reducing heart attacks while having no effect on cholesterol levels. Conversely, drug studies that show a reduction in cholesterol show no change in mortality rates.

Blocking of data: Separate analyses of drug efficacy on female or elderly patients tend to show that drug therapy increases overall mortality. Lumping these data in with the traditional middle-aged male patients removes this effect and, instead, shows a significant decrease in heart disease with drug use.

The point here isn’t to make a comment on the influence of drug companies on medical research. Rather, such statistical concerns are common to all research disciplines. The primary concern of such analyses should be: what is the magnitude of the effect of a specific treatment on my variable of interest? The studies discussed in the Ideas program suggest that much effort has been devoted to detecting significant effects of drugs on surrogate response variables regardless of the size of the effect.

Plantae resurrected

Tuesday, May 23, 2006

Some technical issues coupled with my road-trip-without-a-laptop conspired to keep Plantae from working correctly. I’ve repaired the damage and isolated Plantae from such problems in the future. My apologies for the downtime.

Competitive Enterprise Institute

Monday, May 22, 2006

The Competitive Enterprise Institute has put out some ads that would be quite funny if they weren’t so misleading. I imagine that most viewers can see through the propaganda of the oil industry. Regardless, in the long-term, industries that invest in efficient and low-polluting technology will win and the members of CEI will be out of business.

CO2: They call it pollution. We call it life.

Google Importer

Tuesday, May 16, 2006

Google Importer is a useful Spotlight plugin that includes Google searches in Spotlight searches. This helps integrate your search into one interface, which seems like an obvious progression of Apple’s Spotlight technology.

Google calendar

Thursday, April 20, 2006

Google Calendar has been featured in the news recently, and for good reason. Many of us have wanted access to a good online calendar program. One of my favourite features of Google Calendar is its integration with Gmail. If Gmail detects an event in your email message, a link appears that sends the information to Google Calendar. This is incredibly convenient and, it seems to me, is one of the great promises of computers: reducing the tedious work that occupies much of our day.