11 March 2024

Basics

  • in R: glm(), model specification as before: glm(y~f1+x1+f2+x2, data=..., family=..., ...)
  • definition: family/link function

Family

  • family: what kind of data do I have?
    • from first principles: family specifies the relationship between the mean and variance
    • family=binomial: proportions, out of a total number of counts; includes binary (Bernoulli) (“logistic regression”)
    • family=poisson: Poisson (independent counts, no set maximum, or far from the maximum)
    • other (Normal ("gaussian"), Gamma)
  • default family for glm is Gaussian
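A minimal sketch of how the family choice enters the call, using simulated data (not part of the notes):

set.seed(101)
dat <- data.frame(x = rnorm(100))
dat$y_count <- rpois(100, lambda = exp(1 + 0.5*dat$x))         ## independent counts
dat$y_bin <- rbinom(100, size = 1, prob = plogis(0.5*dat$x))   ## binary outcomes

glm(y_count ~ x, family = poisson, data = dat)    ## count response
glm(y_bin ~ x, family = binomial, data = dat)     ## binary response ("logistic regression")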

Most GLMs are logistic regression

Link functions

Machinery

  • “linear predictor” \(\eta = \beta_0 + \beta_1 x_1 + \ldots\) describes patterns on the link scale
  • the fit doesn’t transform the responses: instead it applies the inverse link function to the linear predictor
    • instead of \(\log(y) \sim x\), we analyze \(y \sim \mathrm{Poisson}(\exp(x))\)
  • this is good, because the observed value of \(y\) might be zero
    • e.g. count (Poisson) phenotype vs. temperature (centered at 20 C)
    • with \(\beta=\{1,1\}\), \(T=15\), \(\textrm{counts} \sim \textrm{Poisson}(\lambda=\exp(-4)=0.018)\)
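A quick check of the arithmetic in this example (coefficient values as on the slide; the centring at 20 C is applied by hand):

beta0 <- 1; beta1 <- 1
eta <- beta0 + beta1 * (15 - 20)   ## linear predictor on the link (log) scale: -4
exp(eta)                           ## inverse link: Poisson mean ≈ 0.018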

Machinery (2)

Model setup is the same as linear models

  • categorical vs. continuous predictors
  • contrasts
  • interactions
  • multivariable regression vs ANOVA vs ANCOVA vs …

but the linear relationship is set up on the link scale

log/exponential function (log/exp): count data

logit/logistic function (qlogis/plogis): proportion/binary (binomial) data
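The link and inverse-link functions are ordinary R functions, and are also stored in the family object; a small illustration:

log(2); exp(log(2))                   ## log link and its inverse
qlogis(0.75); plogis(qlogis(0.75))    ## logit link and its inverse
binomial()$linkfun(0.75)              ## same as qlogis(0.75)
binomial()$linkinv(qlogis(0.75))      ## same as plogis(...): back to 0.75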

diagnostics

  • harder than linear models: plot is still somewhat useful
  • binary data especially hard
  • goodness of fit tests, \(R^2\) etc. hard (can always compute cor(observed,predict(model, type="response")))
  • residuals() returns deviance residuals by default; Pearson residuals [(obs-exp)/sqrt(V(exp))] via residuals(model, type="pearson")
  • predicted values on the effect scale by default: use type="response" to back-transform
  • performance::check_model(), DHARMa package are OK (simulateResiduals(model,plot=TRUE))
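A sketch of these diagnostics applied to a simulated Poisson fit (the performance and DHARMa packages are assumed to be installed):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, exp(1 + 0.3*d$x))
m <- glm(y ~ x, family = poisson, data = d)

plot(m)                                    ## base-R diagnostics (limited use for GLMs)
cor(d$y, predict(m, type = "response"))^2  ## crude goodness-of-fit summary
head(residuals(m, type = "pearson"))       ## (obs - exp)/sqrt(V(exp))
performance::check_model(m)
DHARMa::simulateResiduals(m, plot = TRUE)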

overdispersion

  • too much variance: (residual deviance)/(residual df) should be \(\approx 1\) (ratio > 1.2 worrisome; ratio > 3 very worrisome: check your model and data!); a quick numerical check is sketched after this list
  • quasi-likelihood models (e.g. family=quasipoisson); fit, then adjust CIs/p-values
  • alternatives:
    • Poisson \(\to\) negative binomial (MASS::glm.nb)
    • binomial \(\to\) beta-binomial (glmmTMB package)
  • overdispersion not relevant for
    • binary responses
    • families with estimated scale parameters (Gaussian, Gamma, NB, …)
    • models that already account for it (NB, quasi-likelihood)
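A sketch of checking and handling overdispersion, using simulated counts with extra-Poisson variation (of the alternatives listed above, only the MASS route is shown):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- MASS::rnegbin(200, mu = exp(1 + 0.3*d$x), theta = 1)   ## overdispersed counts
m_pois <- glm(y ~ x, family = poisson, data = d)

deviance(m_pois) / df.residual(m_pois)            ## dispersion ratio; well above 1 here
m_quasi <- update(m_pois, family = quasipoisson)  ## same estimates, inflated SEs/CIs
m_nb <- MASS::glm.nb(y ~ x, data = d)             ## negative binomial alternative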

parameter interpretation

  • as with linear models (change in response per change in input)
  • but on link scale
  • log link: coefficients give approximately proportional changes when \(\beta\) is small; exponentiate for larger values
    • e.g. \(\beta=0.01 \to\) “\(\approx\) 1% change per unit change in input”
    • \(\beta=3 \to\) “\((e^3) \approx\) 20-fold change per change in input”
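The back-transformation is just exponentiation of the coefficient:

exp(0.01)   ## ≈ 1.01: about a 1% increase per unit change in the input
exp(3)      ## ≈ 20: roughly a 20-fold change per unit change in the input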

logit link

interpreting logit scale

  • effect on the prob scale depends on baseline probability
    • low baseline prob: like log link
    • high baseline prob: prop. change in (1-prob)
    • medium prob: absolute change \(\approx \beta/4\)
    • e.g. going from log-odds of 0 to 1; estimated \(\Delta\) prob \(\approx\) 0.25
      • plogis(0)= 0.5
      • plogis(0+1)= 0.73
  • also see UCLA FAQ on odds ratios; read Gelman and Hill’s Applied Regression Modeling book (p. 81; Google books link)
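A small numerical illustration of how the same 1-unit change on the logit scale plays out at different baseline probabilities:

plogis(qlogis(0.01) + 1) / 0.01    ## low baseline: ≈ e ≈ 2.7-fold increase (like a log link)
plogis(qlogis(0.5) + 1) - 0.5      ## medium baseline: absolute change ≈ 1/4
1 - plogis(qlogis(0.99) + 1)       ## high baseline: (1 - prob) shrinks by ≈ a factor of e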

inference

  • Wald \(Z\) tests (i.e., results of summary()), confidence intervals
    • approximate, can be way off if parameters have extreme values (complete separation)
    • asymptotic (finite-size correction/“degrees of freedom” are hard, usually ignored)
  • likelihood ratio tests (analogous to \(F\) tests): drop1(model,test="Chisq"), anova(model1,model2); profile confidence intervals (MASS::confint.glm)
  • AIC etc.
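A sketch of these inference tools on a simulated Poisson fit (confint() on a glm uses the profile method from MASS):

set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, exp(1 + 0.3*d$x))
m <- glm(y ~ x, family = poisson, data = d)

summary(m)                 ## Wald Z tests
drop1(m, test = "Chisq")   ## likelihood ratio test for each term
confint(m)                 ## profile confidence intervals (link scale)
exp(confint(m))            ## back-transform the CI, not the standard errors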

Model procedures

  • formula like lm
  • specify family (variance-mean), link (nonlinearity)
  • always do Poisson, binomial regression on counts, never proportions (although can specify response as a proportion if you also give \(N\) as the weights argument)
    • Use offsets to address unequal sampling
  • check for overdispersion whenever it is relevant (Poisson; binomial with \(N>1\))
  • if you want to quote values on the original scale, confidence intervals need to be back-transformed; never back-transform standard errors alone

binomial models

  • for Poisson and Bernoulli (0/1) responses, each observation is a single number
  • for binomial responses with \(N>1\), how do we specify the denominator (\(N\) in \(k/N\))?
  • traditional R: response is two-column matrix cbind(successes,failures) [not cbind(successes,total)]
  • also allowed: response is the proportion (\(k/N\)), with weights=N specified as well
    (if \(N\) is the same for every case and given on the fly, it needs to be replicated:
    glm(p~...,data,weights=rep(N,nrow(data))))
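A sketch of the two equivalent specifications, with simulated k-out-of-N data:

set.seed(101)
dd <- data.frame(x = rnorm(50), N = 20)
dd$k <- rbinom(50, size = dd$N, prob = plogis(0.5*dd$x))

glm(cbind(k, N - k) ~ x, family = binomial, data = dd)     ## successes and failures
glm(k/N ~ x, family = binomial, data = dd, weights = N)    ## proportion plus weights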

offsets

  • constant terms added to a model
  • what if we want to model densities rather than counts?
  • log-link (Poisson/NB) models: \(\mu_0 = \exp(\beta_0 + \beta_1 x_1 + ...)\)
  • if we know the area then we want \(\mu = A \cdot \mu_0\)
  • equivalent to adding \(\log(A)\) to the linear predictor (\(\exp(\log(\mu_0) + \log(A)) = \mu_0 \cdot A\))
  • use ... + offset(log(A)) in R formula
  • for survival/event modeling over different periods of time, a similar offset trick works with link="cloglog" (see here)
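A sketch of the density model, with simulated counts and a hypothetical area column:

set.seed(101)
dd <- data.frame(x = rnorm(50), area = runif(50, 1, 10))
dd$count <- rpois(50, lambda = dd$area * exp(0.5*dd$x))

m_dens <- glm(count ~ x + offset(log(area)), family = poisson, data = dd)
## coefficients now describe log density (count per unit area)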

add-on packages

  • ggplot2
    • geom_smooth(method="glm", method.args=list(family=...))
  • dotwhisker, emmeans, effects, sjPlot
    • need to interpret parameters appropriately
    • means may be computed on link or response scale
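A sketch of some of these helpers on a simulated binomial fit (emmeans and dotwhisker assumed installed):

set.seed(101)
d <- data.frame(x = rnorm(200), f = factor(rep(c("a", "b"), 100)))
d$y <- rbinom(200, size = 1, prob = plogis(0.5*d$x + (d$f == "b")))
m <- glm(y ~ x + f, family = binomial, data = d)

emmeans::emmeans(m, ~ f)                     ## means on the link (logit) scale
emmeans::emmeans(m, ~ f, type = "response")  ## back-transformed to probabilities
dotwhisker::dwplot(m)                        ## coefficient plot (link-scale parameters)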

Advanced topics

  • complete separation
  • ordinal data
  • zero-inflation
  • non-standard link functions
  • visualization (hard because of overlaps): try stat_sum, position="jitter", geom_dotplot, or a beeswarm plot (a sketch follows this list)
  • see also: GLM extensions talk (source)
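A sketch of the overplotting workarounds for binary data (simulated; ggplot2 assumed installed):

library(ggplot2)
set.seed(101)
d <- data.frame(x = round(rnorm(300), 1))
d$y <- rbinom(300, size = 1, prob = plogis(d$x))

ggplot(d, aes(x, y)) + stat_sum(alpha = 0.5)                     ## point size ~ number of overlapping points
ggplot(d, aes(x, y)) + geom_point(position = position_jitter(height = 0.05))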

Common(est?) glm() problems

  • neglecting overdispersion
  • binomial/Poisson models with non-integer data
  • equating negative binomial with binomial rather than Poisson
  • failing to specify family (\(\to\) linear model); using glm() for linear models (unnecessary)
  • predictions on effect scale
  • using \((k,N)\) rather than \((k,N-k)\) in binomial models
  • worrying about overdispersion unnecessarily (binary/Gamma)
  • back-transforming SEs rather than CIs
  • Poisson for underdispersed responses
  • ignoring random effects

Example

AIDS in Australia (Dobson and Barnett 2008)

data here

library(ggplot2)
aids <- read.csv("../data/aids.csv")
aids <- transform(aids, date=year+(quarter-1)/4)  ## quarterly time variable
gg0 <- ggplot(aids,aes(date,cases))+geom_point()

Easy GLMs with ggplot

gg1 <- gg0 + geom_smooth(method="glm",colour="red",
                         formula=y~x,
                         method.args=list(family="quasipoisson"))

results

Equivalent code

g1 <- glm(cases~date, data = aids, family=quasipoisson(link="log"))
summary(g1)

Diagnostics (plot(g1))

autocorrelation function

acf(residuals(g1)) ## check autocorrelation

DHARMa

library(DHARMa)
g0 <- update(g1, family=poisson)
plot(simulateResiduals(g0))

ggplot: try quadratic model

print(gg2 <- gg1+geom_smooth(method="glm",formula=y~poly(x,2),
                             method.args=list(family="quasipoisson")))

(see here, here for information on poly())

improved model

g2 <- update(g1,.~poly(date,2))

new diagnostics

autocorrelation function

acf(residuals(g2)) ## check autocorrelation

inference

summary(g2)
## 
## Call:
## glm(formula = cases ~ poly(date, 2), family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.86859    0.05004  77.311  < 2e-16 ***
## poly(date, 2)1  3.82934    0.25162  15.219 2.46e-11 ***
## poly(date, 2)2 -0.68335    0.19716  -3.466  0.00295 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.657309)
## 
##     Null deviance: 677.264  on 19  degrees of freedom
## Residual deviance:  31.992  on 17  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
anova(g1,g2,test="F") ## for quasi-models specifically
## Analysis of Deviance Table
## 
## Model 1: cases ~ date
## Model 2: cases ~ poly(date, 2)
##   Resid. Df Resid. Dev Df Deviance      F   Pr(>F)   
## 1        18     53.020                               
## 2        17     31.992  1   21.028 12.688 0.002399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

References

Dobson, Annette J., and Adrian Barnett. 2008. An Introduction to Generalized Linear Models. 3rd ed. Chapman & Hall/CRC.