Longform

Text processing with Unix

Wednesday, November 29, 2006

I recently helped someone process a text file with the help of Unix command line tools. The job would have been quite challenging otherwise, and I think this represents a useful demonstration of why I choose to use Unix.

The basic structure of the datafile was:

; A general header file ;
1
sample: 0.183 0.874 0.226 0.214 0.921 0.272 0.117
2
sample: 0.411 0.186 0.956 0.492 0.150 0.278 0.110
3
...

In this case the only important information is the second number of each line that begins with “sample:”. Of course, one option is to manually process the file, but there are thousands of lines, and that’s just silly.

We begin by extracting only the lines that begin with “sample:”. grep will do this job easily:

grep "^sample" input.txt

grep searches through the input.txt file and outputs any matching lines to standard output.

Now, we need the second number. sed can strip out the initial text of each line with a find and replace while tr compresses any strange use of whitespace:

sed 's/sample: //g' | tr -s ' '

Notice the use of the pipe (|) command here. This sends the output of one command to the input of the next. This allows commands to be strung together and is one of the truly powerful tools in Unix.

Now we have a matrix of numbers in rows and columns, which is easily processed with awk.

awk '{print $2;}'

Here we ask awk to print out the second number of each row.

So, if we string all this together with pipes, we can process this file as follows:

grep "^sample" input.txt | sed 's/sample: //g' | tr -s ' ' | awk '{print $2;}' > output.txt

Our numbers of interest are in output.txt.

Images from the Hinode spacecraft

Friday, November 3, 2006

Japan’s Hinode spacecraft has started taking pictures of the Sun. The detail of the shots is amazing and gives a sense of the Sun’s structure.

First light image

Stern Review on the economics of climate change

Tuesday, October 31, 2006

The Stern Review has been in the news recently for predicting that global warming could cost up to $7 trillion if not addressed soon. Of course, this has caused quite a stir as it offsets many of the, likely unfounded, concerns that fixing climate change will cost too much. The full report is available online and should be a quite interesting, if long, read.

Climate change and public relations

Sunday, October 8, 2006

This article in the Guardian explores the use of public relations firms by big oil companies to fight against the science of climate change. Apparently, the same tactics and people even of the tobacco industry’s fight against the link between smoking and cancer are being employed by the oil industry.

Principles of Technology Adoption

Tuesday, September 19, 2006

Choosing appropriate software tools can be challenging. Here are the principles I employ when making the decision:

Simple: This seems obvious, but many companies fail here. Typically, their downfall is focussing on a perpetual increase in feature quantity. I don’t evaluate software with feature counts. Rather, I value software that performs a few key operations well. Small, focussed tools result in much greater productivity than overly-complex, all-in-one tools. 37 Signals’ Writeboard is a great example of a simple, focussed tool for collaborative writing.
Open formats: I will not choose software that uses proprietary or closed data formats. Closed formats cause two main difficulties:
I must pay the proprietor of a closed format for the privilege of accessing my data. Furthermore, switching to a new set of software may require translating my data or, even worse, losing access altogether. Open formats allow me to access my data at any time and with any appropriate tool.
My tools are limited to the range of innovations that the proprietor deems important. Open formats allow me to take advantage of any great new tools that come available.
Flexible: As my requirements ebb and flow, I shouldn’t be locked into the constraints of a particular tool. The best options are internet-based, subscription plans. If I need more storage space or more access for collaborators, I simply choose a new subscription plan. Then, if things slow down, I can move back down to a basic plan and save money. The online backup service Strongpace, for example, has a subscription plan tied to the amount of storage and number of users required.
Network: A good tool must be fully integrated into the network. The ability to collaborate with anyone or access my data from any computer are great boons to productivity. Many of the best tools are completely internet based; all that is required to use them is a web browser. This also means that the tool is monitored and maintained by a collection of experts and that the tool can be upgraded at any time without being locked into a version-number update. Furthermore, with data maintained on a network, many storage and backup problems are addressed. GMail, for example, stores over 2GB of email, free of charge with an innovative user interface.

Exemplars

These are some of my favourite adherents to the principles outlined above:

TED-- Hans Rosling

Tuesday, September 19, 2006

An excellent presentation regarding the use of country statistics. The visualizations are particularly effective.

Resumes & Spam Filters

Monday, September 11, 2006

Since I’m looking for work, I found this post rather interesting. They’ve applied a spam filter to resumes to automatically filter through candidates. The output is only as good as the reference resumes used to construct the filter, but still an intriguing idea. My results are below. Most importantly the probability of me not being hired is 1.15e-59, which is a very, very small number. Perhaps I should add this fact to my resume?

I will now tell you what i think about this CV
The CV you entered fits better in the Hired group than the NotHired group.
CLASSIFY fails; success probability: 0.0000  pR: -58.9382
Best match to file #1 (Hired.css) prob: 1.0000  pR: 58.9382 
Total features in input file: 7478
#0 (NotHired.css): features: 61899, hits: 7125, prob: 1.15e-59, pR: -58.94
#1 (Hired.css): features: 794351, hits: 90156, prob: 1.00e+00, pR:  58.94
The CV you entered fits best into the Guru catagory.
CLASSIFY succeeds; success probability: 1.0000  pR: 8.1942
Best match to file #0 (Guru.css) prob: 1.0000  pR: 8.1942 
Total features in input file: 7478
#0 (Guru.css): features: 559355, hits: 66154, prob: 1.00e-00, pR:   8.19
#1 (Intergrator.css): features: 163555, hits: 17093, prob: 2.17e-29, pR: -28.66
#2 (Administrator.css): features: 241282, hits: 24729, prob: 8.45e-25, pR: -24.07
#3 (Developer.css): features: 485579, hits: 54104, prob: 6.39e-09, pR:  -8.19

The Canary Project-- Global Warming Documented in Photos

Wednesday, July 12, 2006

The Canary Project is an intriguing idea. They are documenting the effects of global warming through pictures. Since many people, apparently, don’t believe the abundant scientific evidence, perhaps some startling pictures will be convincing.

RSiteSearch

Tuesday, July 11, 2006

I’m not sure how this escaped my notice until now, but `RSiteSearch` is a very useful command in R. Passing a string to this function loads up your web browser with search results from the R documentation and mailing list. So, for example:

RSiteSearch("glm")

will show you everything you need to know about using R for generalised linear models.

R module for ConTeXt

Tuesday, July 11, 2006

I generally write my documents in Sweave format. This approach allows me to embed the code for analyses directly in the report derived from the analyses, so that all results and figures are generated dynamically with the text of the report. This provides both great documentation of the analyses and the convenience of a single file to keep track of and work with.

Now there is a new contender for integrating analysis code and documentation with the release of an R module for ConTeXt. I prefer the clean implementation and modern features of ConTeXt to the excellent, but aging, LaTeX macro package that Sweave relies on. So, using ConTeXt for my documents is a great improvement.

Here’s a simple example of using this new module. I create two randomly distributed, normal variables, test for a correlation between them, and plot their distribution.

\usemodule[r]

\starttext
Describe the motivation of the analyses first.

Now create some variables.

\startRhidden
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 0, 1)
\stopRhidden

Are they correlated?

\startR
model <- lm(y ~ x, data = test)
summary(model)
\stopR

Now we can include a figure.

\startR
pdf("testFigure.pdf")
plot(x, y)
dev.off()
\stopR

\startbuffer
\placefigure{Here it is}{\externalfigure[testFigure]}
\stopbuffer
\getbuffer

\stoptext

Processing this code produces a pdf file with all of the results produced from R, including the figure.

I had some minor difficulties getting this to work on my OS X machine, through no fault of the r module itself. There are two problems. The first is that, by default, write18 is not enabled, so ConTeXt can’t access R directly. Fix this by editing /usr/local/teTeX/texmf.cnf so that “shell_escape = t”. The next is that the R module calls @texmfstart@ which isn’t directly accessible from a stock installation of TeX. The steps required are described in the “Configuration of texmfstart” section of the ConTeXt wiki. I modified this slightly by placing the script in ~/bin so that I didn’t interfere with the installed teTeX tree. Now everything should work.

CBC Radio 3

Friday, July 7, 2006

The CBC Radio 3 podcast is an excellent source for independent, Canadian music. They have recently added a playlist feature that helps you search for your favourite artists and create your own radio station. Definitely worth checking out.

expand.grid

Thursday, July 6, 2006

Here’s a simple trick for creating experimental designs in R: use the function expand.grid.

A simple example is:

  treatments <- LETTERS[1:4]
  levels <- 1:3
  experiment <- data.frame(expand.grid(treatment=treatments, level=levels))

which produces:

   treatment level
1          A     1
2          B     1
3          C     1
4          D     1
5          A     2
6          B     2
7          C     2
8          D     2
9          A     3
10         B     3
11         C     3
12         D     3

Now, if you want to randomize your experimental treatments, try:

  experiment[sample(dim(experiment)[1]), ]

sample randomly chooses numbers from a vector the same length as the experiment data frame without replacement. The square brackets then use this random sample to subsample from the experiment data frame.

Burning your money

Wednesday, June 7, 2006

Burning our money by Marc Jaccard is a useful overview of some policy options for reducing greenhouse gas emissions. Unfortunately, this article is part of the Globe’s subscribers-only section, but his paper, Burning Our Money to Warm the Planet, is available from the CD Howe Institute.

Heart of the Matter

Wednesday, May 24, 2006

CBC’s Ideas has been running a series of shows on heart disease called “Heart of the Matter”. Episode 2 is particularly interesting from a statistical perspective, as the episode discusses several difficulties with the analysis of drug efficacy. Some highlights include:

Effect sizes Some of the best cited studies for the use of drugs to treat heart disease show a statistically significant effect of only a few percentage points improvement. Contrast this with a dramatic, vastly superior improvement from diet alone.

Response variables The focus of many drug studies has been on the reduction of cholesterol, rather than reductions in heart disease. Diet studies, for example, have shown dramatic improvements in reducing heart attacks while having no effect on cholesterol levels. Conversely, drug studies that show a reduction in cholesterol show no change in mortality rates.

Blocking of data Separate analyses of drug efficacy on female or elderly patients tend to show that drug therapy increases overall mortality. Lumping these data in with the traditional middle-aged male patients removes this effect and, instead, shows a significant decrease in heart disease with drug use.

The point here isn’t to make a comment on the influence of drug companies on medical research. Rather, such statistical concerns are common to all research disciplines. The primary concern of such analyses should be: what is the magnitude of the effect of a specific treatment on my variable of interest? The studies discussed in the Ideas program suggest that much effort has been devoted to detecting significant effects of drugs on surrogate response variables regardless of the size of the effect.

Plantae resurrected

Tuesday, May 23, 2006

Some technical issues coupled with my road-trip-without-a-laptop conspired to keep Plantae from working correctly. I’ve repaired the damage and isolated Plantae from such problems in the future. My apologies for the downtime.

Competitive Enterprise Institute

Monday, May 22, 2006

The Competitive Enterprise Institute has put out some ads that would be quite funny if they weren’t so misleading. I imagine that most viewers can see through the propaganda of the oil industry. Regardless, in the long-term, industries that invest in efficient and low-polluting technology will win and the members of CEI will be out of business.

CO2: They call it pollution. We call it life.

Google Importer

Tuesday, May 16, 2006

Google Importer is a useful Spotlight plugin that includes Google searches in Spotlight searches. This helps integrate your search into one interface, which seems like an obvious progression of Apple’s Spotlight technology.

Google calendar

Thursday, April 20, 2006

Google Calendar has been featured in the news recently, and for good reason. Many of us have wanted access to a good online calendar program. One of my favourite features of Google Calendar is its integration with Gmail. If Gmail detects an event in your email message, a link appears that sends the information to Google Calendar. This is incredibly convenient and, seems to me, is one of the great promises of computers: reducing the tedious work that occupies much of our day.

An Inconvenient Truth

Friday, April 14, 2006

This looks like an incredibly important film. I hope it breaks all of the box office records.

Analysis of Count Data

Thursday, April 13, 2006

When response variables are composed of counts, the standard statistical methods that rely on the normal distribution are no longer applicable. Count data are comprised of positive integers and, often, many zeros. For such data, we need statistics based on Poisson or binomial distributions. I’ve spent the past few weeks analysing counts from hundreds of transects and, as is typical, a particular challenge was determining the appropriate packages to use for R. Here’s what I’ve found so far.

The first step is to get an idea of the dispersion of data points:

Means <- tapply(y, list(x1, x2), mean)
Vars <- tapply(y, list(x1, x2), var)
plot(Means, Vars, xlab="Means", ylab="Variances")
abline(a=0, b=1)

For the Poisson distribution, the mean is equal to the variance. So, we expect the points to lie along the solid line added to the plot. If the points are overdispersed, a negative binomial link function may be more appropriate. The pscl library provides a function to test this:

library(pscl)
model.nb <- glm.nb(y ~ x, data=data)
odTest(model.nb)
summary(model.nb)

If the odTest function rejects the null model, then the data are overdispersed relative to a Poisson distribution. One particularly useful function is glmmPQL from the MASS library. This function allows for random intercepts and combined with the negative.binomial function of the same library, you can fit a negative binomial GLMM:

model.glmm.nb <- glmmPQL(y ~ x1 + x2,
                         random= ~ 1|transect, data=data,
                         family=negative.binomial(model.nb$theta))

In this case, I use the Θ estimated from the glm.nb function in the negative.binomial call. Also useful are the zeroinfl function of the pscl library for fitting zero-inflated Poisson or negative binomial models and the geeglm function of geepack for fitting generalized estimating equations for repeated measures. Finally, fitdistr from MASS allows for estimating the parameters of different distributions from empirical data.

Getting Evolution Up to Speed

Monday, April 10, 2006

There’s a common notion that our technology has slowed, or even stopped, human evolution. Evidently, this is not true as researchers have found many locations of strong positive selection in the human genome.

New evidence suggests humans are evolving more rapidly – and more recently – than most people thought possible. But for some radical evolutionists, Homo sapiens isn’t morphing quickly enough.

(Via Wired News.)

SSHRC and the theory of evolution

Sunday, April 9, 2006

This is quite a surprise, McGill University’s Brian Alters had his proposal to study the effects of intelligent design on Canadian education rejected by the Canadian Social Sciences and Humanities Research Council. A stated reason for the rejection was that Alters did not provide “adequate justification for the assumption in the proposal that the theory of evolution, and not intelligent design theory, was correct.”

Granted, funding proposals can be rejected for a variety of reasons and the opinions of the reviewers do not necessarily reflect those of the funding body. However, the international media (Nature, The Guardian) are reporting on this and the suggestion is that the Canadian Government — or, at least, our funding agency for social science research — rejects evolution.

If SSHRC intends to pass judgement on scientific theories, they should review the evidence first. Biological evolution is a fact. Furthermore, the theory of evolution through natural selection has accumulated 150 years of empirical evidence and ranks as one of science’s greatest insights.

I hope that SSHRC clarifies their position on evolution soon.

(Via The Panda’s Thumb.)

Deschooling, Democratic Education, and Social Change

Friday, April 7, 2006

Matt Hern provides an interesting podcast available from Canadian Voices. He considers the 150 year history of compulsory state education and asks what benefits it has provided. The basic question is, Why do we send our kids to school? Although the answer seems obvious, he takes a different approach and argues for alternatives to public education. I’m always fascinated when someone argues against what I believe to be obvious. That’s when I learn the most about my biases.

Desktop Manager

Tuesday, April 4, 2006

I’m convinced that no computer display is large enough. What we need are strategies to better manage our computer workspace and application windows. Exposé and tabbed browsing are great features, but what I really want is the equivalent of a file folder. You put all of the relevant documents in a folder and then put it aside for when you need it. Once you’re ready, you open up the folder and are ready to go.

A feature that comes close to this is virtual desktops. I became enamoured with these while running KDE and have found them again for OS X with Desktop Manager. The idea is to create workspaces associated with specific tasks as a virtual desktop. You can then switch between these desktops as you move from one project to the next. So, for each of the projects I am currently working on, I’ve created a desktop with each application’s windows in the appropriate place. For a consulting project, I likely have Aquamacs running an R simulation with a terminal window open for updating my subversion repository. A project in the writing stage requires TeXShop and BibDesk, while a web-design project needs TextMate and Camino. Each of these workspaces is independent and I can quickly switch to them when needed. Since the applications are running with their windows in the appropriate place, I can quickly get back to work on the project at hand.

Application windows can be split across desktops and specific windows can be present across all desktops. I’ve also found it useful to have one desktop for communication (email, messaging, etc.) and another that has no windows open at all.

Managing project files

Monday, March 27, 2006

As I accumulate projects (both new and completed), the maintenance and storage of the project files becomes increasingly important. There are two important goals for a file structure: find things quickly and don’t lose anything. My current strategy is as follows:

Every project has a consistent folder structure:

|-- analysis
|-- data
|-- db
|-- doc
|-- fig
`-- utils

analysis holds the R source files of the analysis. These, typically, are experiments and snippets of code. The main analyses are in the doc directory.

data contains data files. Generally, these are csv files from clients.

db is for sql dumps of databases and sqlite files. I prefer working with databases over flat text files or Excel spreadsheets. These files are kept in the data folder and converted to sql databases for analyses.

doc holds the analysis and writeup as a Sweave file. This combines R and LaTeX to create a complete document from one source file.

fig is for diagrams and plots. Many of these are generated when processing the Sweave file, but some are constructed from other sources.

utils holds scripts and binaries that are required to run the analysis.

This entire directory structure is maintained with Subversion, so I have a record of changes and can access the project files from any networked computer.

Finally, once a project is complete, I archive the project and construct a sha checksum of the zip file.

openssl dgst -sha1 -out checksums.txt archive.zip

This checksum allows me to verify that the archive remains stable over time. Coupled with a good backup routine, this should keep the project files safe.

This may seem elaborate, but data and their analyses are too important to be left scattered around a laptop’s hard drive.

One other approach I’ve considered is using the R package structure to maintain projects. This is a useful guide, but the process seems too involved for my purposes.

Sun grid

Friday, March 24, 2006

Sun’s new Grid Compute Utility could be a great resource. As I described in an earlier post, running simulations can be a challenge with limited computer resources. Rather than waiting hours for my computers to work through thousands of iterations, I could pay Sun $1, which would likely be sufficient for the scale of my work. This would be well worth the investment! I spend a single dollar and quickly get the results I need for my research or clients.

Apparently, the American government has classified the Sun Grid as a weapon, so we can’t access it here in Canada, yet. I’m sure this will change shortly.

Automator, Transmit, and Backup

Sunday, March 19, 2006

The Strongspace weblog has a useful post about using Transmit and Automator to make backups. One challenge with this approach is backing up files scattered throughout your home folder. The solution is the “Follow symbolic links” option when mirroring. I created a backup folder and populated it with aliases to the files I’m interested in backing up. Mirroring this folder to Strongspace copies the files to the server. The other option is to use the “Skip files with specified names” feature, but this rapidly filled up for me.

Remote data analysis

Wednesday, March 15, 2006

My six-year old laptop is incredibly slow, particularly when analysing data. Unfortunately, analysing data is my job, so this represents a problem. We have a new and fast desktop at home, but I can’t monopolise its use and it would negate the benefits of mobility.

Fortunately, with the help of Emacs and ESS there is a solution. I write my R code on the laptop and evaluate the code on the desktop, which sends the responses and plots back to the laptop. I can do this anywhere I have network access for the laptop and the results are quite quick.

There are a few tricks to the setup, especially if you want the plots sent back to the laptop’s screen, so I’ll document the necessary steps here.

First, you need to enable X11 forwarding on both the desktop and laptop computers (see the Apple Technote). Then start up X11 on the laptop.

Now, on the laptop, start up Emacs, open a file with R code and ssh to the desktop:

opt-x ssh -Y desktopip

Now, run R in the ssh buffer and link it to the R-code buffer

R opt-x ess-remote

That should do it. Now I have the speed of the desktop and the benefits of a laptop.

Rails, sqlite3, and id=0

Friday, February 24, 2006

I’ve spent the last few days struggling with a problem with Plantae’s rails code. I was certain that code like this should work:

class ContentController < ApplicationController
  def new
    @plant = Plant.new(params[:plant])
    if request.post? and @plant.save
      redirect_to :action => 'show', :id => @plant.id
    else
       ...
    end
  end

  def show
    @plant = Plant.find(params[:id])
  end
  ...
end

These statements should create a new plant and then redirect to the show method which sends the newly created plant to the view.

The problem was that rails insisted on asking sqlite3 for the plant with id=0, which cannot exist.

After a post to the rails mailing list and some thorough diagnostics I discovered this and realized I needed swig.

So, if anyone else has this problem:

sudo port install swig
sudo gem install sqlite3-ruby

is the solution.

The Globe and Mail-- Constitutional reform

Monday, January 16, 2006

How about a constitutional right to a healthy environment?

The Constitution of Canada guarantees its people important rights, such as freedom of religion, freedom of expression, fair trials, free elections, and language rights. But the effective exercise of these rights is impossible without safe water to drink, wholesome food to eat, and clean air to breathe. The right to a clean and healthy environment – arguably the most fundamental right of all – is conspicuously absent from our Charter of Rights and Freedoms, despite being a feature of the constitutions of many other nations.