Images from the Hinode spacecraft
Japan's Hinode spacecraft has started taking pictures of the Sun. The detail of the shots is amazing and gives a sense of the Sun's structure.

Edward Burtynsky documents our impacts on the landscape through extraordinary photographs. In this presentation at TED he describes his motivations for the work and showcases some of his best images. The descriptions of China are particularly impressive.
The Stern Review has been in the news recently for predicting that global warming could cost up to $7 trillion if not addressed soon. Of course, this has caused quite a stir, as it counters many of the likely unfounded concerns that fixing climate change will cost too much. The full report is available online and should be quite an interesting, if long, read.
This article in the Guardian explores the use of public relations firms by big oil companies to fight against the science of climate change. Apparently, the oil industry is employing the same tactics, and even some of the same people, as the tobacco industry's fight against the link between smoking and cancer.
Choosing appropriate software tools can be challenging. Here are the principles I employ when making the decision:
Open formats: I will not choose software that uses proprietary or closed data formats. Closed formats cause two main difficulties:
I must pay the proprietor of a closed format for the privilege of accessing my data. Furthermore, switching to a new set of software may require translating my data or, even worse, losing access altogether. Open formats allow me to access my data at any time and with any appropriate tool.
My tools are limited to the range of innovations that the proprietor deems important. Open formats allow me to take advantage of any great new tools that become available.
Flexible: As my requirements ebb and flow, I shouldn't be locked into the constraints of a particular tool. The best options are internet-based subscription plans. If I need more storage space or more access for collaborators, I simply choose a new subscription plan. Then, if things slow down, I can move back down to a basic plan and save money. The online backup service Strongspace, for example, has a subscription plan tied to the amount of storage and number of users required.
Exemplars
These are some of my favourite adherents to the principles outlined above:
An excellent presentation regarding the use of country statistics. The visualizations are particularly effective.
Since I'm looking for work, I found this post rather interesting. They've applied a spam filter to resumes to automatically filter through candidates. The output is only as good as the reference resumes used to construct the filter, but it is still an intriguing idea. My results are below. Most importantly, the probability of me not being hired is 1.15e-59, which is a very, very small number. Perhaps I should add this fact to my resume?
I will now tell you what i think about this CV
The CV you entered fits better in the Hired group than the NotHired group.
CLASSIFY fails; success probability: 0.0000 pR: -58.9382
Best match to file #1 (Hired.css) prob: 1.0000 pR: 58.9382
Total features in input file: 7478
#0 (NotHired.css): features: 61899, hits: 7125, prob: 1.15e-59, pR: -58.94
#1 (Hired.css): features: 794351, hits: 90156, prob: 1.00e+00, pR: 58.94
The CV you entered fits best into the Guru catagory.
CLASSIFY succeeds; success probability: 1.0000 pR: 8.1942
Best match to file #0 (Guru.css) prob: 1.0000 pR: 8.1942
Total features in input file: 7478
#0 (Guru.css): features: 559355, hits: 66154, prob: 1.00e-00, pR: 8.19
#1 (Intergrator.css): features: 163555, hits: 17093, prob: 2.17e-29, pR: -28.66
#2 (Administrator.css): features: 241282, hits: 24729, prob: 8.45e-25, pR: -24.07
#3 (Developer.css): features: 485579, hits: 54104, prob: 6.39e-09, pR: -8.19
The Canary Project is an intriguing idea. They are documenting the effects of global warming through pictures. Since many people, apparently, don't believe the abundant scientific evidence, perhaps some startling pictures will be convincing.
I'm not sure how this escaped my notice until now, but `RSiteSearch` is a very useful command in R. Passing a string to this function loads up your web browser with search results from the R documentation and mailing list. So, for example:
RSiteSearch("glm")
will show you everything you need to know about using R for generalised linear models.
I generally write my documents in Sweave format. This approach allows me to embed the code for analyses directly in the report derived from the analyses, so that all results and figures are generated dynamically with the text of the report. This provides both great documentation of the analyses and the convenience of a single file to keep track of and work with.
Now there is a new contender for integrating analysis code and documentation with the release of an R module for ConTeXt. I prefer the clean implementation and modern features of ConTeXt to the excellent, but aging, LaTeX macro package that Sweave relies on. So, using ConTeXt for my documents is a great improvement.
Here's a simple example of using this new module. I create two normally distributed random variables, test for a correlation between them, and plot them against each other.
\usemodule[r]
\starttext
Describe the motivation of the analyses first.
Now create some variables.
\startRhidden
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 1)
\stopRhidden
Are they correlated?
\startR
model <- lm(y ~ x)
summary(model)
\stopR
Now we can include a figure.
\startR
pdf("testFigure.pdf")
plot(x, y)
dev.off()
\stopR
\startbuffer
\placefigure{Here it is}{\externalfigure[testFigure]}
\stopbuffer
\getbuffer
\stoptext
Processing this code produces a pdf file with all of the results produced from R, including the figure.
I had some minor difficulties getting this to work on my OS X machine, through no fault of the R module itself. There are two problems. The first is that, by default, write18 is not enabled, so ConTeXt can't access R directly. Fix this by editing /usr/local/teTeX/texmf.cnf so that "shell_escape = t". The next is that the R module calls `texmfstart`, which isn't directly accessible from a stock installation of TeX. The steps required are described in the "Configuration of texmfstart" section of the ConTeXt wiki. I modified this slightly by placing the script in ~/bin so that I didn't interfere with the installed teTeX tree. Now everything should work.
The CBC Radio 3 podcast is an excellent source for independent, Canadian music. They have recently added a playlist feature that helps you search for your favourite artists and create your own radio station. Definitely worth checking out.
Here's a simple trick for creating experimental designs in R: use the function `expand.grid`.
A simple example is:
treatments <- LETTERS[1:4]
levels <- 1:3
experiment <- data.frame(expand.grid(treatment=treatments, level=levels))
which produces:
treatment level
1 A 1
2 B 1
3 C 1
4 D 1
5 A 2
6 B 2
7 C 2
8 D 2
9 A 3
10 B 3
11 C 3
12 D 3
Now, if you want to randomize your experimental treatments, try:
experiment[sample(dim(experiment)[1]), ]
`sample` randomly permutes the numbers from 1 to the number of rows of the experiment data frame (sampling without replacement). The square brackets then use this permutation to reorder the rows of the experiment data frame.
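If you want the randomization to be reproducible, set a seed first. Here's a minimal sketch; the run column is just something I find handy for field sheets and is not part of expand.grid:
set.seed(42)                                # any fixed seed gives a repeatable shuffle
randomized <- experiment[sample(nrow(experiment)), ]
randomized$run <- seq_len(nrow(randomized)) # record the run order for the field sheet
randomized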
Burning our money by Mark Jaccard is a useful overview of some policy options for reducing greenhouse gas emissions. Unfortunately, this article is part of the Globe's subscribers-only section, but his paper, Burning Our Money to Warm the Planet, is available from the CD Howe Institute.
CBC's Ideas has been running a series of shows on heart disease called "Heart of the Matter". Episode 2 is particularly interesting from a statistical perspective, as the episode discusses several difficulties with the analysis of drug efficacy. Some highlights include:
Effect sizes: Some of the best-cited studies for the use of drugs to treat heart disease show a statistically significant effect of only a few percentage points of improvement. Contrast this with the dramatic, vastly superior improvement from diet alone.
Response variables: The focus of many drug studies has been on the reduction of cholesterol, rather than reductions in heart disease itself. Diet studies, for example, have shown dramatic improvements in reducing heart attacks while having no effect on cholesterol levels. Conversely, drug studies that show a reduction in cholesterol show no change in mortality rates.
Blocking of data: Separate analyses of drug efficacy in female or elderly patients tend to show that drug therapy increases overall mortality. Lumping these data in with the traditional middle-aged male patients removes this effect and, instead, shows a significant decrease in heart disease with drug use.
The point here isn't to make a comment on the influence of drug companies on medical research. Rather, such statistical concerns are common to all research disciplines. The primary concern of such analyses should be: what is the magnitude of the effect of a specific treatment on my variable of interest? The studies discussed in the Ideas program suggest that much effort has been devoted to detecting significant effects of drugs on surrogate response variables regardless of the size of the effect.
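A quick, simulated illustration of the point (these numbers are made up, not taken from the studies discussed): with enough patients, even a negligible difference between groups comes out statistically significant, while the effect itself remains tiny.
set.seed(1)
control   <- rnorm(10000, mean = 0,    sd = 1)
treatment <- rnorm(10000, mean = 0.05, sd = 1)  # a shift of 0.05 sd: trivially small
t.test(treatment, control)$p.value              # almost certainly below 0.05
mean(treatment) - mean(control)                 # but the effect size stays near 0.05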
Some technical issues coupled with my road-trip-without-a-laptop conspired to keep Plantae from working correctly. Iโve repaired the damage and isolated Plantae from such problems in the future. My apologies for the downtime.
The Competitive Enterprise Institute has put out some ads that would be quite funny if they weren't so misleading. I imagine that most viewers can see through the propaganda of the oil industry. Regardless, in the long-term, industries that invest in efficient and low-polluting technology will win and the members of CEI will be out of business.
CO2: They call it pollution. We call it life.
Google Importer is a useful Spotlight plugin that includes Google searches in Spotlight searches. This helps integrate your search into one interface, which seems like an obvious progression of Apple’s Spotlight technology.
Google Calendar has been featured in the news recently, and for good reason. Many of us have wanted access to a good online calendar program. One of my favourite features of Google Calendar is its integration with Gmail. If Gmail detects an event in your email message, a link appears that sends the information to Google Calendar. This is incredibly convenient and, it seems to me, reflects one of the great promises of computers: reducing the tedious work that occupies much of our day.
This looks like an incredibly important film. I hope it breaks all of the box office records.
When response variables are composed of counts, the standard statistical methods that rely on the normal distribution are no longer applicable. Count data consist of non-negative integers and, often, many zeros. For such data, we need statistics based on Poisson or binomial distributions. I've spent the past few weeks analysing counts from hundreds of transects and, as is typical, a particular challenge was determining the appropriate packages to use in R. Here's what I've found so far.
The first step is to get an idea of the dispersion of data points:
Means <- tapply(y, list(x1, x2), mean)
Vars <- tapply(y, list(x1, x2), var)
plot(Means, Vars, xlab="Means", ylab="Variances")
abline(a=0, b=1)
For the Poisson distribution, the mean is equal to the variance. So, we expect the points to lie along the solid line added to the plot. If the points are overdispersed, a negative binomial error distribution may be more appropriate. The pscl library provides a function to test this:
library(pscl)   # provides odTest
library(MASS)   # provides glm.nb
model.nb <- glm.nb(y ~ x, data=data)
odTest(model.nb)
summary(model.nb)
If the odTest function rejects the null model, then the data are overdispersed relative to a Poisson distribution. One particularly useful function is glmmPQL from the MASS library. This function allows for random intercepts and, combined with the negative.binomial function from the same library, lets you fit a negative binomial GLMM:
model.glmm.nb <- glmmPQL(y ~ x1 + x2,
                         random = ~ 1 | transect, data = data,
                         family = negative.binomial(model.nb$theta))
In this case, I use the θ estimated from the glm.nb function in the negative.binomial call. Also useful are the zeroinfl function of the pscl library for fitting zero-inflated Poisson or negative binomial models and the geeglm function of geepack for fitting generalized estimating equations for repeated measures. Finally, fitdistr from MASS allows for estimating the parameters of different distributions from empirical data.
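Here's a rough sketch of how those last few functions might be called, reusing the y, x1, x2, transect, and data names from the examples above; the particular formulas and correlation structure are just placeholders for your own model.
library(pscl)     # zeroinfl
library(geepack)  # geeglm
library(MASS)     # fitdistr

# Zero-inflated negative binomial model for counts with excess zeros
model.zinb <- zeroinfl(y ~ x1 + x2, data = data, dist = "negbin")
summary(model.zinb)

# Generalized estimating equations for repeated measures within transects
model.gee <- geeglm(y ~ x1 + x2, id = transect, data = data,
                    family = poisson, corstr = "exchangeable")
summary(model.gee)

# Estimate negative binomial parameters directly from the raw counts
fitdistr(y, densfun = "negative binomial")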