Introduction

History

  • ANOVA, ANCOVA, regression, and \(t\) tests are all (part of) the same thing: the general linear model (twitter thread, cheat sheet)
  • We say “linear model” (lm in R) to distinguish it from the generalized linear model (glm in R)
  • Unfortunately SAS calls it PROC GLM

Basic theory

Assumptions

  • Response variables are linear functions of predictor (derived) variables, in turn based on input (observed) variables
    • One or more predictor variables per input variable
    • Estimate a parameter for the effect of each predictor variable (more later)
  • Independence
  • Errors or residuals are Normally distributed with constant variance
    • In other words, the difference between our model predictions and our observations is Normal
    • not assuming the marginal distribution is Normal
  • (Predictor variables measured without error)

Machinery

  • Least squares fit - minimize the squared differences between predictions and observations
  • LS fits have lots of nice properties
  • The solution is a simple (!?) matrix equation (see the sketch below)
  • Sensitive to some violations of assumptions, e.g. anomalous observations (outliers) can have a strong effect on the fit
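
A minimal sketch (with simulated data) of that matrix equation, the normal equations \(\hat\beta = (X^\top X)^{-1} X^\top y\), checked against lm():

set.seed(101)
x <- runif(50)
y <- 1 + 2*x + rnorm(50, sd = 0.5)
X <- cbind(1, x)                           ## design matrix: intercept column + predictor
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  ## (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ x)))           ## same estimates; lm() uses a more stable QR decomposition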

One-parameter variables

  • Continuous predictor: straight line, one parameter
    • Also implies one input variable
  • \(Y = a+bX\): \(Y\) is the response, \(X\) is the input variable
    (\(b\) is the slope - expected change in \(Y\) per unit change in \(X\))
  • Categorical predictor with two categories: one parameter
    • The parameter is the difference in predicted value between the two levels; equivalently, code the categories as 0, 1 (dummy variables); see the sketch after this list
  • Parameters are (usually) easy to interpret
  • Confidence intervals for the parameter
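
A minimal sketch with simulated data (the names X, G, Y are made up for illustration): one slope parameter for a continuous predictor, one difference parameter for a two-level factor, and confidence intervals via confint():

set.seed(101)
dd <- data.frame(X = runif(40), G = factor(rep(c("a", "b"), 20)))
dd$Y <- 1 + 2*dd$X + 3*(dd$G == "b") + rnorm(40, sd = 0.5)
m_cont <- lm(Y ~ X, data = dd)   ## continuous predictor: one slope parameter
m_cat  <- lm(Y ~ G, data = dd)   ## two-level factor: one parameter ("Gb"), the b-vs-a difference (0/1 dummy)
confint(m_cat)                   ## confidence intervals for the intercept and the difference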

Multi-parameter variables

  • With more than two categories, there is more than one predictor variable (parameter) associated with a single input variable
    • Why can’t we just code them, for example as 0, 1, 2? (That would force the categories to be ordered and evenly spaced in their effects)
  • Non-linear response to an input variable
    • Linear models can work for non-linear responses! e.g. \(Y = a + bX + cX^2\) (predictors are 1, \(X\), \(X^2\))
      • For complex smooth curves it is better to use splines (splines::ns(), or the mgcv package); see the sketch after this list
  • More than one predictor variable per input variable
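
A sketch with simulated data: a quadratic fit and a natural-spline fit, each using several parameters for a single input variable x:

set.seed(101)
dd <- data.frame(x = seq(0, 10, length.out = 60))
dd$y <- 2 + 0.5*dd$x - 0.05*dd$x^2 + rnorm(60, sd = 0.3)
m_quad   <- lm(y ~ poly(x, 2), data = dd)               ## quadratic: two parameters for one input variable
m_spline <- lm(y ~ splines::ns(x, df = 3), data = dd)   ## natural spline: three parameters, flexible smooth shape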

Multi-parameter variables are hard to interpret

  • If you’re just trying to control for it, include it in the model and then ignore it!
  • Otherwise, you can (and should) get a \(p\) value for the variable as a whole
  • But we can only get CIs on the parameters — and there are different ways to parameterize (contrasts)
    • If you do have (a) clear scientific question(s), you can set up contrasts that allow you to test it/them (we can help); see the sketch after this list
    • If you must, do pairwise comparisons: test each pair of levels for differences and make an “aabbc” (compact letter) list (more later)
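
A hedged sketch of a targeted contrast, using a made-up three-level factor grp; car::Anova() and emmeans appear again later in these notes:

set.seed(101)
dd <- data.frame(grp = factor(rep(c("ctrl", "low", "high"), each = 10),
                              levels = c("ctrl", "low", "high")))
dd$y <- c(0, 1, 2)[dd$grp] + rnorm(30)
fit <- lm(y ~ grp, data = dd)
car::Anova(fit)                       ## p value for grp as a whole
e <- emmeans::emmeans(fit, ~ grp)
emmeans::contrast(e, list("ctrl vs treated" = c(1, -0.5, -0.5)))   ## one targeted scientific question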

Interactions

  • Interactions allow the value of one predictor to affect the relationship between another predictor and the response variable
  • Interpreting main effects in the presence of interactions is tricky (principle of marginality)
  • Your estimate of the effect of one variable now depends on the value of the other
  • You need to pick a fixed point, or average in some way
  • Example: \(Y = a + b_1 X_1 + b_2 X_2 + b_{12} X_1 X_2\)
  • The response to \(X_1\) is \(Y = (a+b_2 X_2) + (b_1+b_{12}X_2) X_1\):
    both the intercept and the slope with respect to \(X_1\) depend on the value of \(X_2\).
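
A sketch of this with simulated data (the names X1, X2, Y are made up):

set.seed(101)
dd <- data.frame(X1 = runif(100), X2 = runif(100))
dd$Y <- 1 + 2*dd$X1 - 1*dd$X2 + 3*dd$X1*dd$X2 + rnorm(100, sd = 0.5)
m_int <- lm(Y ~ X1 * X2, data = dd)   ## X1*X2 expands to X1 + X2 + X1:X2
coef(m_int)                           ## the X1 slope is b1 + b12*X2: it depends on X2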

An experimental example

  • You want to know whether (how!) a drug treatment changes the metabolism of some rabbits
  • You’re using adult, prime-aged rabbits and keeping them under controlled conditions, so you don’t expect their metabolism to change without the drug.
    • Not good enough!
  • You also introduce some control rabbits and treat them exactly the same, including giving them fake pills. You find no significant change in the metabolism of the control rabbits through time
    • Still not good enough! Why?

Testing interactions

  • Use an interaction: \[ M = a + B_x X + B_t t + B_{xt} X t \]
    (\(M\) = metabolism, \(X\) = treatment indicator (0 = control, 1 = drug), \(t\) = time)
  • The interaction term \(B_{xt}\) represents the difference between the two groups in how the response changes over time.
  • It asks: did the treatment group change differently than the control group?
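
A hedged sketch of the rabbit example with made-up data (the rabbits data frame and its columns metab, treat, time are invented for illustration, and the observations are treated as independent):

set.seed(101)
rabbits <- expand.grid(treat = factor(c("control", "drug")),
                       time  = factor(c("before", "after"), levels = c("before", "after")),
                       rep   = 1:10)
rabbits$metab <- 10 + 1*(rabbits$time == "after") +
    2*(rabbits$treat == "drug" & rabbits$time == "after") + rnorm(nrow(rabbits))
m_rabbit <- lm(metab ~ treat * time, data = rabbits)
summary(m_rabbit)$coefficients["treatdrug:timeafter", ]  ## did the drug group change differently?
## (real before/after measurements on the same rabbit would call for a paired or mixed model)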

Interactions and parameters

  • In simple cases, the interaction may use only one parameter
  • We can use CIs, and coefficient plots, and get a pretty good idea what’s going on
  • In complicated cases, interaction terms have many parameters
  • These have all the interpretation problems of other multi-parameter variables, or more
  • Think about “differences in differences”

Interactions: example

  • Bear road-crossing
  • Predictor variables: sex (categorical), road type (categorical: major/minor), road length (continuous)
  • Two-way interactions
    • sex \(\times\) road length: “are females more sensitive to amount of road than males?”
    • sex \(\times\) road type: “do females vary behaviour between road types more than males?”
    • road type \(\times\) road length: “does amount of road affect crossings differently for different road types?”

Statistical philosophy

  • Don’t accept the null hypothesis
    • Don’t throw out predictors you wanted to test because they’re not significant
    • Don’t throw out interactions you wanted to test because they’re not significant
  • This may make your life harder, but it’s worth it for the karma
    • There are techniques to deal with multiple predictors (e.g., ridge or lasso regression)
    • There are good ways to estimate sensible main effects in the presence of interactions: centering variables (Schielzeth 2010) and “sum-to-zero contrasts”

Diagnostics

  • Because the linear model is (sometimes!) sensitive to violations of its assumptions, it is good to evaluate them
  • Concerns (in order):
    • Bias/linearity: did you miss a nonlinear relationship?
    • Heteroscedasticity (does variance change across the data set, e.g. increasing variance with increasing mean?)
    • Normality (assuming no overall problems, do your residuals look Normal?)
    • Independence is hard to test
  • Normality is the least important of these assumptions

Default plots in R

skewdata <- read.csv("../data/skewdat.csv")
m1 <- lm(skew~Size, data=skewdata)
plot(m1, id.n=4)   ## standard lm diagnostic plots; id.n labels the 4 most extreme points

performance::check_model()

performance::check_model(m1)

Transformations

  • One way to deal with problems in model assumptions is by transforming one or more of your variables
  • Transformations are not cheating: a transformed scale may be as natural (or more natural) a way to think about your data as your original scale
  • The linear scale (no transformation) often has direct meaning, if you are adding things up or scaling them (as in our ant example)
  • The log scale is often the best scale for thinking about physical quantities: 1 is to 10 as 10 is to?
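
A small sketch with simulated data from a multiplicative process, where the log scale is the natural one:

set.seed(101)
dd <- data.frame(x = runif(80, 0, 3))
dd$y <- exp(1 + 0.8*dd$x + rnorm(80, sd = 0.3))   ## multiplicative (lognormal) errors
m_raw <- lm(y ~ x, data = dd)        ## curved and heteroscedastic on the raw scale
m_log <- lm(log(y) ~ x, data = dd)   ## linear, roughly constant variance on the log scale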

Transformation tradeoffs

  • A transformation may help you meet model assumptions
    • Homoscedasticity
    • Linearity
    • Normality
  • But there is no guarantee that you can fix them all
  • Piles of zeros are hard too (consider GLMs)

Transformations to consider

  • log-lin, lin-log and log-log for various sorts of exponential and power relationships
  • Box-Cox and related transformations
  • Avoid classical ‘transform then linear model’ recommendations for
    • probability data (logistic, arcsin or arcsin-square root) (Warton and Hui 2011)
    • or count data (log, log(1+x) or square root) (O’Hara and Kotze 2010; Ives 2015; Warton et al. 2016; Morrissey and Ruxton 2020)
    • Generally better to respect the structure of these data with a GLM

Deciding whether to transform

  • It’s not OK to pick transformations based on trying different ones and looking at \(p\) values
  • Box-Cox transformation tries out transformations of the form \((y^\lambda-1)/\lambda\) (\(\lambda=0\) corresponds to log-transformation) (positive data only)
  • see MASS::boxcox() or car::powerTransform()
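
A sketch with a simulated, strictly positive response (the data and m_pos are made up for illustration):

set.seed(101)
dd <- data.frame(x = runif(80))
dd$y <- exp(1 + 2*dd$x + rnorm(80, sd = 0.3))   ## strictly positive response
m_pos <- lm(y ~ x, data = dd)
MASS::boxcox(m_pos)          ## profile log-likelihood over lambda; a peak near 0 suggests a log transform
car::powerTransform(m_pos)   ## maximum-likelihood estimate of lambda (see summary() for tests)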

Collinearity

  • ignore common advice about “variance inflation factors”
  • simply throwing away correlated predictors is snoop-y and unstable
  • other solutions: PCA, a priori choice
  • Dormann et al. (2012), Graham (2003), Morrissey and Ruxton (2018), Vanhove (2021)
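
A hedged sketch of the PCA option, with simulated, strongly correlated predictors (all names are made up):

set.seed(101)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)                 ## strongly correlated with x1
y  <- 1 + x1 + x2 + rnorm(100)
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]
m_pc <- lm(y ~ pc1)                             ## one combined axis instead of two collinear predictors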

Tools for fitting and inference

Basic tools

  • lm fits a linear model
  • summary prints statistics associated with the parameters that were fitted
  • dotwhisker::dwplot() plots a coefficient table (Gelman, Pasarica, and Dodhia 2002); by_2sd=TRUE automatically scales continuous predictors
  • car::Anova() and drop1() will find \(p\) values that test the effect of variables that have more than one parameter. anova(model1, model2) tests the difference between two specific models (only needed for specialized comparisons)
  • broom::tidy() gives you tidy coefficient tables; use conf.int = TRUE to get confidence intervals (see the sketch below)
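
A sketch applying these tools to a small simulated fit (the data and the fit object are made up for illustration):

set.seed(101)
dd <- data.frame(x = rnorm(60), g = factor(rep(c("a", "b", "c"), 20)))
dd$y <- 1 + 0.5*dd$x + c(0, 1, 2)[dd$g] + rnorm(60)
fit <- lm(y ~ x + g, data = dd)
summary(fit)                              ## parameter-level statistics
broom::tidy(fit, conf.int = TRUE)         ## the same table, tidy, with confidence intervals
dotwhisker::dwplot(fit, by_2sd = TRUE)    ## coefficient plot, continuous predictors scaled by 2 SD
car::Anova(fit)                           ## variable-level p value for the 3-level factor g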

Plotting

  • plot applied to an lm object gives you diagnostic plots
  • In ggplot, geom_smooth(method="lm") fits a linear model to each group of data (i.e. each group that you have identified by plotting it in a different colour, within a different facet, etc.); see the sketch after this list
  • effects, emmeans, sjPlot packages …
  • tools:
    • predict gives predicted values and (with se.fit=TRUE) standard errors
    • simulate simulates values from the fitted model
    • methods(class="lm") shows lots of possibilities
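
A sketch with simulated data, combining geom_smooth() for per-group fits and predict() for model-based predictions:

set.seed(101)
dd <- data.frame(x = runif(50), g = factor(rep(c("a", "b"), 25)))
dd$y <- 1 + 2*dd$x + (dd$g == "b") + rnorm(50, sd = 0.3)
fit <- lm(y ~ x + g, data = dd)
library(ggplot2)
ggplot(dd, aes(x, y, colour = g)) + geom_point() +
    geom_smooth(method = "lm")                      ## a separate fitted line for each colour group
newdat <- data.frame(x = seq(0, 1, by = 0.1), g = "a")
predict(fit, newdata = newdat, se.fit = TRUE)       ## predicted values and standard errors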

Examples

Simple model

data

lizards <- read.csv("../data/lizards.csv", stringsAsFactors=TRUE)
lizards$time <- factor(lizards$time,levels=c("early","midday","late")) ## put time levels in their natural order
lmint <- lm(grahami~light*time, data=lizards)    ## interaction model
lmboth <- lm(grahami~light + time, data=lizards) ## additive model
lmlight <- lm(grahami~light, data=lizards)       ## light only
lmtime <- lm(grahami~time, data=lizards)         ## time only

Summary

summary(lmboth)
## 
## Call:
## lm(formula = grahami ~ light + time, data = lizards)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4250 -11.6937   0.2125   6.6250  28.5750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   27.288      5.631   4.846 0.000112 ***
## lightsunny   -19.325      5.734  -3.370 0.003213 ** 
## timemidday    13.137      7.106   1.849 0.080099 .  
## timelate      -9.500      6.853  -1.386 0.181743    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.71 on 19 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.475 
## F-statistic: 7.634 on 3 and 19 DF,  p-value: 0.001513

Test time by comparing with light model (dropping time)

anova(lmboth, lmlight)
## Analysis of Variance Table
## 
## Model 1: grahami ~ light + time
## Model 2: grahami ~ light
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     19 3569.6                              
## 2     21 5481.9 -2   -1912.3 5.0895 0.01698 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test light (identical to the test in summary())

anova(lmboth, lmtime)
## Analysis of Variance Table
## 
## Model 1: grahami ~ light + time
## Model 2: grahami ~ time
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     19 3569.6                                
## 2     20 5703.6 -1     -2134 11.359 0.003213 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## In this simple case, drop1 and car::Anova
##  will do exactly the same thing we did above

Test both

drop1(lmboth, test="F")
## Single term deletions
## 
## Model:
## grahami ~ light + time
##        Df Sum of Sq    RSS    AIC F value   Pr(>F)   
## <none>              3569.6 124.03                    
## light   1    2134.0 5703.6 132.81 11.3589 0.003213 **
## time    2    1912.3 5481.9 129.90  5.0895 0.016983 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Better anova table

car::Anova(lmboth)
## Anova Table (Type II tests)
## 
## Response: grahami
##           Sum Sq Df F value   Pr(>F)   
## light     2134.0  1 11.3589 0.003213 **
## time      1912.3  2  5.0895 0.016983 * 
## Residuals 3569.6 19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction model

print(summary(lmint))
## 
## Call:
## lm(formula = grahami ~ light * time, data = lizards)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.250  -4.125   1.250   6.875  17.750 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             23.500      5.375   4.372 0.000416 ***
## lightsunny             -11.750      7.602  -1.546 0.140606    
## timemidday              27.750      7.602   3.650 0.001980 ** 
## timelate               -12.750      7.602  -1.677 0.111798    
## lightsunny:timemidday  -32.833     11.190  -2.934 0.009266 ** 
## lightsunny:timelate      6.500     10.751   0.605 0.553432    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 17 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.677 
## F-statistic: 10.22 on 5 and 17 DF,  p-value: 0.0001179

Sum-to-zero contrasts

If you want to make comparisons you should centre continuous input variables and use sum-to-zero contrasts for factors:

lmint_c <- update(lmint, contrasts = list(time = contr.sum, light = contr.sum))
print(summary(lmint_c))
## 
## Call:
## lm(formula = grahami ~ light * time, data = lizards, contrasts = list(time = contr.sum, 
##     light = contr.sum))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.250  -4.125   1.250   6.875  17.750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.2361     2.2547   8.088 3.14e-07 ***
## light1        10.2639     2.2547   4.552 0.000282 ***
## time1         -0.6111     3.1463  -0.194 0.848299    
## time2         10.7222     3.2714   3.278 0.004440 ** 
## light1:time1  -4.3889     3.1463  -1.395 0.181001    
## light1:time2  12.0278     3.2714   3.677 0.001870 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 17 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.677 
## F-statistic: 10.22 on 5 and 17 DF,  p-value: 0.0001179

Multiple comparisons


  • Multiple comparisons are tricky
  • We can often get away without them for a linear model (use variable-level \(p\) values), unless you want to do pairwise comparisons
  • Unlike variable-level tests, multiple comparisons depend on how you set up your model (i.e. centering of continuous variables, contrasts of categorical variables)

compact letter displays

There is a tradition of figuring out which pairs of treatments are significantly different and assigning a letter to each such “not-clearly-not-equivalent” group. If you need to do this to make your supervisor happy, you can use ggpubr; the multcomp package can do the basic computations too (see the sketch below).
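
A sketch of the multcomp route, applied to the lmboth model fitted in the Examples above (requires the multcomp package):

library(multcomp)
g1 <- glht(lmboth, linfct = mcp(time = "Tukey"))   ## all pairwise comparisons among time levels
summary(g1)                                        ## Tukey-adjusted p values
cld(g1)                                            ## compact letter display ("a", "b", ...)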

Pairwise comparisons with emmeans

See emmeans vignette on comparisons

library(emmeans)
library(ggplot2); theme_set(theme_bw())
e1 <- emmeans(lmboth, spec = ~ time)                    ## estimated marginal means for each time level
plot(pairs(e1)) + geom_vline(xintercept = 0, lty = 2)   ## pairwise differences, with a reference line at zero

Expected marginal means plot

Or you could just look at the picture:

plot(e1)

Generalizations

(part of) the statistical universe

Extensions (Faraway 2016)

  • Generalized linear models
    • (Some) non-linear relationships
    • Non-normal response families (binomial, Poisson, …); see the sketch after this list
  • Mixed models incorporate random effects
    • Categories in data represent samples from a population
    • e.g. species, sites, genes … experimental blocks
  • Generalized additive models (GAMs)
    • much broader range of non-linear relationships
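
Since grahami is a count, a Poisson GLM may respect the data structure better than the Gaussian models above; a minimal sketch (not a full analysis, and overdispersion should be checked):

glm_both <- glm(grahami ~ light + time, family = poisson, data = lizards)
summary(glm_both)   ## coefficients are now on the log (link) scale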

References

Dormann, Carsten F., Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R. García Marquéz, et al. 2012. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography. https://doi.org/10.1111/j.1600-0587.2012.07348.x.
Faraway, Julian J. 2016. Extending Linear Models with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. 2nd ed. Chapman & Hall/CRC.
Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2): 121–30. http://www.jstor.org/stable/3087382.
Graham, Michael H. 2003. “Confronting Multicollinearity in Ecological Multiple Regression.” Ecology 84 (11): 2809–15. https://doi.org/10.1890/02-3114.
Ives, Anthony R. 2015. “For Testing the Significance of Regression Coefficients, Go Ahead and Log-Transform Count Data.” Methods in Ecology and Evolution 6 (7): 828–35.
Morrissey, Michael B., and Graeme D. Ruxton. 2018. “Multiple Regression Is Not Multiple Regressions: The Meaning of Multiple Regression and the Non-Problem of Collinearity.” Philosophy, Theory, and Practice in Biology 10 (3).
———. 2020. “Revisiting Advice on the Analysis of Count Data.” Methods in Ecology and Evolution 11 (9): 1133–40.
O’Hara, Robert, and Johan Kotze. 2010. “Do Not Log-Transform Count Data.” Nature Precedings, 1–1.
Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients.” Methods in Ecology and Evolution 1: 103–13. https://doi.org/10.1111/j.2041-210X.2010.00012.x.
Vanhove, Jan. 2021. “Collinearity Isn’t a Disease That Needs Curing.” Meta-Psychology 5 (April). https://doi.org/10.15626/MP.2021.2548.
Warton, David I., and Francis K. C. Hui. 2011. “The Arcsine Is Asinine: The Analysis of Proportions in Ecology.” Ecology 92 (January): 3–10. https://doi.org/10.1890/10-0340.1.
Warton, David I., Mitchell Lyons, Jakub Stoklosa, and Anthony R. Ives. 2016. “Three Points to Consider When Choosing a LM or GLM Test for Count Data.” Methods in Ecology and Evolution 7 (8): 882–90.