Analysis of Count Data
Thursday, April 13, 2006
When response variables are composed of counts, the standard statistical methods that rely on the normal distribution are no longer applicable. Count data are comprised of positive integers and, often, many zeros. For such data, we need statistics based on Poisson or binomial distributions. I’ve spent the past few weeks analysing counts from hundreds of transects and, as is typical, a particular challenge was determining the appropriate packages to use for R. Here’s what I’ve found so far.
The first step is to get an idea of the dispersion of data points:
Means <- tapply(y, list(x1, x2), mean)
Vars <- tapply(y, list(x1, x2), var)
plot(Means, Vars, xlab="Means", ylab="Variances")
abline(a=0, b=1)
For the Poisson distribution, the mean is equal to the variance. So, we expect the points to lie along the solid line added to the plot. If the points are overdispersed, a negative binomial link function may be more appropriate. The pscl
library provides a function to test this:
library(pscl)
model.nb <- glm.nb(y ~ x, data=data)
odTest(model.nb)
summary(model.nb)
If the odTest
function rejects the null model, then the data are overdispersed relative to a Poisson distribution. One particularly useful function is glmmPQL
from the MASS
library. This function allows for random intercepts and combined with the negative.binomial function of the same library, you can fit a negative binomial GLMM:
model.glmm.nb <- glmmPQL(y ~ x1 + x2,
random= ~ 1|transect, data=data,
family=negative.binomial(model.nb$theta))
In this case, I use the Θ estimated from the glm.nb
function in the negative.binomial
call. Also useful are the zeroinfl
function of the pscl
library for fitting zero-inflated Poisson or negative binomial models and the geeglm
function of geepack
for fitting generalized estimating equations for repeated measures. Finally, fitdistr
from MASS
allows for estimating the parameters of different distributions from empirical data.