11 March 2024

Basics

  • in R: glm(), model specification as before: glm(y~f1+x1+f2+x2, data=..., family=..., ...)
  • definition: family/link function

Family

  • family: what kind of data do I have?
    • from first principles: family specifies the relationship between the mean and variance
    • family=binomial: proportions, out of a total number of counts; includes binary (Bernoulli) (“logistic regression”)
    • family=poisson: Poisson (independent counts, no set maximum, or far from the maximum)
    • other (Normal ("gaussian"), Gamma)
  • default family for glm is Gaussian
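A minimal sketch of how the family choice enters the call, using simulated data (not part of the notes):

set.seed(101)
dat <- data.frame(x = rnorm(100))
dat$y_count <- rpois(100, lambda = exp(1 + 0.5*dat$x))         ## independent counts
dat$y_bin <- rbinom(100, size = 1, prob = plogis(0.5*dat$x))   ## binary outcomes

glm(y_count ~ x, family = poisson, data = dat)    ## count response
glm(y_bin ~ x, family = binomial, data = dat)     ## binary response ("logistic regression")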

Most GLMs are logistic regression

Link functions

Machinery

  • “linear predictor” \(\eta = \beta_0 + \beta_1 x_1 + \ldots\) describes patterns on the link scale
  • the fit doesn’t transform the responses: instead it applies the inverse link function to the linear predictor
    • instead of \(\log(y) \sim x\), we analyze \(y \sim \mathrm{Poisson}(\exp(x))\)
  • this is good, because the observed value of \(y\) might be zero
    • e.g. count (Poisson) phenotype vs. temperature (centered at 20 C)
    • with \(\beta=\{1,1\}\), \(T=15\), \(\textrm{counts} \sim \textrm{Poisson}(\lambda=\exp(-4)=0.018)\)
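A quick check of the arithmetic in this example (coefficient values as on the slide; the centring at 20 C is applied by hand):

beta0 <- 1; beta1 <- 1
eta <- beta0 + beta1 * (15 - 20)   ## linear predictor on the link (log) scale: -4
exp(eta)                           ## inverse link: Poisson mean ≈ 0.018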

Machinery (2)

Model setup is the same as linear models

  • categorical vs. continuous predictors
  • contrasts
  • interactions
  • multivariable regression vs ANOVA vs ANCOVA vs …

but the linear relationship is set up on the link scale

log/exponential function (log/exp): count data

logit/logistic function (qlogis/plogis): proportion/binary (binomial) data
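The link and inverse-link functions are ordinary R functions, and are also stored in the family object; a small illustration:

log(2); exp(log(2))                   ## log link and its inverse
qlogis(0.75); plogis(qlogis(0.75))    ## logit link and its inverse
binomial()$linkfun(0.75)              ## same as qlogis(0.75)
binomial()$linkinv(qlogis(0.75))      ## same as plogis(...): back to 0.75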

diagnostics

  • harder than linear models: plot is still somewhat useful
  • binary data especially hard
  • goodness of fit tests, \(R^2\) etc. hard (can always compute cor(observed,predict(model, type="response")))
  • residuals() returns deviance residuals by default; Pearson residuals [(obs-exp)/sqrt(V(exp))] via residuals(model, type="pearson")
  • predicted values on the effect scale by default: use type="response" to back-transform
  • performance::check_model(), DHARMa package are OK (simulateResiduals(model,plot=TRUE))
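A sketch of these diagnostics applied to a simulated Poisson fit (the performance and DHARMa packages are assumed to be installed):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, exp(1 + 0.3*d$x))
m <- glm(y ~ x, family = poisson, data = d)

plot(m)                                    ## base-R diagnostics (limited use for GLMs)
cor(d$y, predict(m, type = "response"))^2  ## crude goodness-of-fit summary
head(residuals(m, type = "pearson"))       ## (obs - exp)/sqrt(V(exp))
performance::check_model(m)
DHARMa::simulateResiduals(m, plot = TRUE)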

overdispersion

  • too much variance: (residual deviance)/(residual df) should be \(\approx 1\) (ratio > 1.2 worrisome; ratio > 3 very worrisome: check your model and data!); a quick numerical check is sketched after this list
  • quasi-likelihood models (e.g. family=quasipoisson); fit, then adjust CIs/p-values
  • alternatives:
    • Poisson \(\to\) negative binomial (MASS::glm.nb)
    • binomial \(\to\) beta-binomial (glmmTMB package)
  • overdispersion not relevant for
    • binary responses
    • families with estimated scale parameters (Gaussian, Gamma, NB, …)
    • models that already account for it (NB, quasi-likelihood)
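A sketch of checking and handling overdispersion, using simulated counts with extra-Poisson variation (of the alternatives listed above, only the MASS route is shown):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- MASS::rnegbin(200, mu = exp(1 + 0.3*d$x), theta = 1)   ## overdispersed counts
m_pois <- glm(y ~ x, family = poisson, data = d)

deviance(m_pois) / df.residual(m_pois)            ## dispersion ratio; well above 1 here
m_quasi <- update(m_pois, family = quasipoisson)  ## same estimates, inflated SEs/CIs
m_nb <- MASS::glm.nb(y ~ x, data = d)             ## negative binomial alternative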

parameter interpretation

  • as with linear models (change in response per change in input)
  • but on link scale
  • log link: coefficients give approximately proportional changes when \(\beta\) is small; exponentiate for larger values
    • e.g. \(\beta=0.01 \to\) “\(\approx\) 1% change per unit change in input”
    • \(\beta=3 \to\) “\((e^3) \approx\) 20-fold change per change in input”
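The back-transformation is just exponentiation of the coefficient:

exp(0.01)   ## ≈ 1.01: about a 1% increase per unit change in the input
exp(3)      ## ≈ 20: roughly a 20-fold change per unit change in the input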

logit link

interpreting logit scale

  • effect on the prob scale depends on baseline probability
    • low baseline prob: like log link
    • high baseline prob: prop. change in (1-prob)
    • medium prob: absolute change \(\approx \beta/4\)
    • e.g. going from log-odds of 0 to 1; estimated \(\Delta\) prob \(\approx\) 0.25
      • plogis(0)= 0.5
      • plogis(0+1)= 0.73
  • also see UCLA FAQ on odds ratios; read Gelman and Hill’s Applied Regression Modeling book (p. 81; Google books link)
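A small numerical illustration of how the same 1-unit change on the logit scale plays out at different baseline probabilities:

plogis(qlogis(0.01) + 1) / 0.01    ## low baseline: ≈ e ≈ 2.7-fold increase (like a log link)
plogis(qlogis(0.5) + 1) - 0.5      ## medium baseline: absolute change ≈ 1/4
1 - plogis(qlogis(0.99) + 1)       ## high baseline: (1 - prob) shrinks by ≈ a factor of e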

inference

  • Wald \(Z\) tests (i.e., results of summary()), confidence intervals
    • approximate, can be way off if parameters have extreme values (complete separation)
    • asymptotic (finite-size correction/“degrees of freedom” are hard, usually ignored)
  • likelihood ratio tests (analogous to \(F\) tests): drop1(model,test="Chisq"), anova(model1,model2); profile confidence intervals (MASS::confint.glm)
  • AIC etc.
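A sketch of these inference tools on a simulated Poisson fit (confint() on a glm uses the profile method from MASS):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, exp(1 + 0.3*d$x))
m <- glm(y ~ x, family = poisson, data = d)

summary(m)                 ## Wald Z tests
drop1(m, test = "Chisq")   ## likelihood ratio test for each term
confint(m)                 ## profile confidence intervals (link scale)
exp(confint(m))            ## back-transform the CI, not the standard errors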

Model procedures

  • formula like lm
  • specify family (variance-mean), link (nonlinearity)
  • always do Poisson, binomial regression on counts, never proportions (although can specify response as a proportion if you also give \(N\) as the weights argument)
    • Use offsets to address unequal sampling
  • check for overdispersion whenever it is relevant (Poisson; binomial with \(N>1\))
  • if you want to quote values on the original scale, confidence intervals need to be back-transformed; never back-transform standard errors alone

binomial models

  • for Poisson and Bernoulli (0/1) responses, each observation is a single number
  • for binomial responses with \(N>1\), how do we specify the denominator (\(N\) in \(k/N\))?
  • traditional R: response is two-column matrix cbind(successes,failures) [not cbind(successes,total)]
  • also allowed: response is the proportion (\(k/N\)), with weights=N specified as well
    (if \(N\) is the same for every case and given on the fly, it needs to be replicated:
    glm(p~...,data,weights=rep(N,nrow(data))))
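A sketch of the two equivalent specifications, with simulated k-out-of-N data:

set.seed(101)
dd <- data.frame(x = rnorm(50), N = 20)
dd$k <- rbinom(50, size = dd$N, prob = plogis(0.5*dd$x))

glm(cbind(k, N - k) ~ x, family = binomial, data = dd)     ## successes and failures
glm(k/N ~ x, family = binomial, data = dd, weights = N)    ## proportion plus weights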

offsets

  • constant terms added to a model
  • what if we want to model densities rather than counts?
  • log-link (Poisson/NB) models: \(\mu_0 = \exp(\beta_0 + \beta_1 x_1 + ...)\)
  • if we know the area then we want \(\mu = A \cdot \mu_0\)
  • equivalent to adding \(\log(A)\) to the linear predictor (\(\exp(\log(\mu_0) + \log(A)) = \mu_0 \cdot A\))
  • use ... + offset(log(A)) in R formula
  • for survival/event modeling over different periods of time, a similar offset trick works with link="cloglog" (see here)
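A sketch of the density model, with simulated counts and a hypothetical area column:

set.seed(101)
dd <- data.frame(x = rnorm(50), area = runif(50, 1, 10))
dd$count <- rpois(50, lambda = dd$area * exp(0.5*dd$x))

m_dens <- glm(count ~ x + offset(log(area)), family = poisson, data = dd)
## coefficients now describe log density (count per unit area)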

add-on packages

  • ggplot2
    • geom_smooth(method="glm", method.args=list(family=...))
  • dotwhisker, emmeans, effects, sjPlot
    • need to interpret parameters appropriately
    • means may be computed on link or response scale
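A sketch of some of these helpers on a simulated binomial fit (emmeans and dotwhisker assumed installed):

set.seed(101)
d <- data.frame(x = rnorm(200), f = factor(rep(c("a", "b"), 100)))
d$y <- rbinom(200, size = 1, prob = plogis(0.5*d$x + (d$f == "b")))
m <- glm(y ~ x + f, family = binomial, data = d)

emmeans::emmeans(m, ~ f)                     ## means on the link (logit) scale
emmeans::emmeans(m, ~ f, type = "response")  ## back-transformed to probabilities
dotwhisker::dwplot(m)                        ## coefficient plot (link-scale parameters)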

Advanced topics

  • complete separation
  • ordinal data
  • zero-inflation
  • non-standard link functions
  • visualization (hard because of overlaps): try stat_sum, position="jitter", geom_dotplot, or a beeswarm plot (a sketch follows this list)
  • see also: GLM extensions talk (source)
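A sketch of the overplotting workarounds for binary data (simulated; ggplot2 assumed installed):

library(ggplot2)
set.seed(101)
d <- data.frame(x = round(rnorm(300), 1))
d$y <- rbinom(300, size = 1, prob = plogis(d$x))

ggplot(d, aes(x, y)) + stat_sum(alpha = 0.5)                     ## point size ~ number of overlapping points
ggplot(d, aes(x, y)) + geom_point(position = position_jitter(height = 0.05))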

Common(est?) glm() problems

  • neglecting overdispersion
  • binomial/Poisson models with non-integer data
  • equating negative binomial with binomial rather than Poisson
  • failing to specify family (\(\to\) linear model); using glm() for linear models (unnecessary)
  • predictions on effect scale
  • using \((k,N)\) rather than \((k,N-k)\) in binomial models
  • worrying about overdispersion unnecessarily (binary/Gamma)
  • back-transforming SEs rather than CIs
  • Poisson for underdispersed responses
  • ignoring random effects

Example

AIDS in Australia (Dobson and Barnett 2008)

data here

library(ggplot2)
aids <- read.csv("../data/aids.csv")
aids <- transform(aids, date=year+(quarter-1)/4)  ## quarterly time variable
gg0 <- ggplot(aids,aes(date,cases))+geom_point()

Easy GLMs with ggplot

gg1 <- gg0 + geom_smooth(method="glm",colour="red",
                         formula=y~x,
                         method.args=list(family="quasipoisson"))

results

Equivalent code

g1 <- glm(cases~date, data = aids, family=quasipoisson(link="log"))
summary(g1)

Diagnostics (plot(g1))

autocorrelation function

acf(residuals(g1)) ## check autocorrelation

DHARMa

library(DHARMa)
g0 <- update(g1, family=poisson)
plot(simulateResiduals(g0))

ggplot: try quadratic model

print(gg2 <- gg1+geom_smooth(method="glm",formula=y~poly(x,2),
                             method.args=list(family="quasipoisson")))

(see here, here for information on poly())

improved model

g2 <- update(g1,.~poly(date,2))

new diagnostics

autocorrelation function

acf(residuals(g2)) ## check autocorrelation

inference

summary(g2)
## 
## Call:
## glm(formula = cases ~ poly(date, 2), family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.86859    0.05004  77.311  < 2e-16 ***
## poly(date, 2)1  3.82934    0.25162  15.219 2.46e-11 ***
## poly(date, 2)2 -0.68335    0.19716  -3.466  0.00295 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.657309)
## 
##     Null deviance: 677.264  on 19  degrees of freedom
## Residual deviance:  31.992  on 17  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
anova(g1,g2,test="F") ## for quasi-models specifically
## Analysis of Deviance Table
## 
## Model 1: cases ~ date
## Model 2: cases ~ poly(date, 2)
##   Resid. Df Resid. Dev Df Deviance      F   Pr(>F)   
## 1        18     53.020                               
## 2        17     31.992  1   21.028 12.688 0.002399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

References

Dobson, Annette J., and Adrian Barnett. 2008. An Introduction to Generalized Linear Models. 3rd ed. Chapman & Hall/CRC.