Introduction

History

  • ANOVA, ANCOVA, regression, and \(t\) tests are all (part of) the same thing: the general linear model (twitter thread, cheat sheet)
  • We say “linear model” (lm in R) to distinguish it from the generalized linear model (glm in R)
  • Unfortunately SAS calls it PROC GLM

Basic theory

Assumptions

  • Response variables are linear functions of predictor (derived) variables, in turn based on input (observed) variables
    • One or more predictor variables per input variable
    • Estimate a parameter for the effect of each predictor variable (more later)
  • Independence
  • Errors or residuals are Normally distributed with constant variance
    • In other words, the difference between our model predictions and our observations is Normal
    • not assuming the marginal distribution is Normal
  • (Predictor variables measured without error)

Machinery

  • Least squares fit - minimize the squared differences between predictions and observations
  • LS fits have lots of nice properties
  • The solution is a simple (!?) matrix equation (see the sketch below)
  • Sensitive to some violations of assumptions, e.g. anomalous observations (outliers) can have a strong effect on the fit
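
A minimal sketch (with simulated data) of that matrix equation, the normal equations \(\hat\beta = (X^\top X)^{-1} X^\top y\), checked against lm():

set.seed(101)
x <- runif(50)
y <- 1 + 2*x + rnorm(50, sd = 0.5)
X <- cbind(1, x)                           ## design matrix: intercept column + predictor
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  ## (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ x)))           ## same estimates; lm() uses a more stable QR decomposition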

One-parameter variables

  • Continuous predictor: straight line, one parameter
    • Also implies one input variable
  • \(Y = a+bX\): \(Y\) is the response, \(X\) is the input variable
    (\(b\) is the slope - expected change in \(Y\) per unit change in \(X\))
  • Categorical predictor with two categories: one parameter
    • The parameter is the difference in predicted value between the two levels; equivalently, code the categories as 0, 1 (dummy variables); see the sketch after this list
  • Parameters are (usually) easy to interpret
  • Confidence intervals for the parameter
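
A minimal sketch with simulated data (the names X, G, Y are made up for illustration): one slope parameter for a continuous predictor, one difference parameter for a two-level factor, and confidence intervals via confint():

set.seed(101)
dd <- data.frame(X = runif(40), G = factor(rep(c("a", "b"), 20)))
dd$Y <- 1 + 2*dd$X + 3*(dd$G == "b") + rnorm(40, sd = 0.5)
m_cont <- lm(Y ~ X, data = dd)   ## continuous predictor: one slope parameter
m_cat  <- lm(Y ~ G, data = dd)   ## two-level factor: one parameter ("Gb"), the b-vs-a difference (0/1 dummy)
confint(m_cat)                   ## confidence intervals for the intercept and the difference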

Multi-parameter variables

  • With more than two categories, there is more than one predictor variable (parameter) associated with a single input variable
    • Why can’t we just code them, for example as 0, 1, 2? (That would force the categories to be ordered and evenly spaced in their effects)
  • Non-linear response to an input variable
    • Linear models can work for non-linear responses! e.g. \(Y = a + bX + cX^2\) (predictors are 1, \(X\), \(X^2\))
      • For complex smooth curves it is better to use splines (splines::ns(), or the mgcv package); see the sketch after this list
  • More than one predictor variable per input variable
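
A sketch with simulated data: a quadratic fit and a natural-spline fit, each using several parameters for a single input variable x:

set.seed(101)
dd <- data.frame(x = seq(0, 10, length.out = 60))
dd$y <- 2 + 0.5*dd$x - 0.05*dd$x^2 + rnorm(60, sd = 0.3)
m_quad   <- lm(y ~ poly(x, 2), data = dd)               ## quadratic: two parameters for one input variable
m_spline <- lm(y ~ splines::ns(x, df = 3), data = dd)   ## natural spline: three parameters, flexible smooth shape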

Multi-parameter variables are hard to interpret

  • If you’re just trying to control for it, include it in the model and then ignore it!
  • Otherwise, you can (and should) get a \(p\) value for the variable as a whole
  • But we can only get CIs on the parameters — and there are different ways to parameterize (contrasts)
    • If you do have (a) clear scientific question(s), you can set up contrasts that allow you to test it/them (we can help); see the sketch after this list
    • If you must, do pairwise comparisons: test each pair of levels for differences and make an “aabbc” (compact letter) list (more later)
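
A hedged sketch of a targeted contrast, using a made-up three-level factor grp; car::Anova() and emmeans appear again later in these notes:

set.seed(101)
dd <- data.frame(grp = factor(rep(c("ctrl", "low", "high"), each = 10),
                              levels = c("ctrl", "low", "high")))
dd$y <- c(0, 1, 2)[dd$grp] + rnorm(30)
fit <- lm(y ~ grp, data = dd)
car::Anova(fit)                       ## p value for grp as a whole
e <- emmeans::emmeans(fit, ~ grp)
emmeans::contrast(e, list("ctrl vs treated" = c(1, -0.5, -0.5)))   ## one targeted scientific question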

Interactions

  • Interactions allow the value of one predictor to affect the relationship between another predictor and the response variable
  • Interpreting main effects in the presence of interactions is tricky (principle of marginality)
  • Your estimate of the effect of one variable now depends on the value of the other
  • You need to pick a fixed point, or average in some way
  • Example: \(Y = a + b_1 X_1 + b_2 X_2 + b_{12} X_1 X_2\)
  • The response to \(X_1\) is \(Y = (a+b_2 X_2) + (b_1+b_{12}X_2) X_1\):
    both the intercept and the slope with respect to \(X_1\) depend on the value of \(X_2\).
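
A sketch of this with simulated data (the names X1, X2, Y are made up):

set.seed(101)
dd <- data.frame(X1 = runif(100), X2 = runif(100))
dd$Y <- 1 + 2*dd$X1 - 1*dd$X2 + 3*dd$X1*dd$X2 + rnorm(100, sd = 0.5)
m_int <- lm(Y ~ X1 * X2, data = dd)   ## X1*X2 expands to X1 + X2 + X1:X2
coef(m_int)                           ## the X1 slope is b1 + b12*X2: it depends on X2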

An experimental example

  • You want to know whether (how!) a drug treatment changes the metabolism of some rabbits
  • You’re using adult, prime-aged rabbits and keeping them under controlled conditions, so you don’t expect their metabolism to change without the drug.
    • Not good enough!
  • You also introduce some control rabbits and treat them exactly the same, including giving them fake pills. You find no significant change in the metabolism of the control rabbits through time
    • Still not good enough! Why?

Testing interactions

  • Use an interaction: \[ M = a + B_x X + B_t t + B_{xt} X t \]
    (\(M\) = metabolism, \(X\) = treatment indicator (0 = control, 1 = drug), \(t\) = time)
  • The interaction term \(B_{xt}\) represents the difference between the two groups in how the response changes over time.
  • It asks: did the treatment group change differently than the control group?
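
A hedged sketch of the rabbit example with made-up data (the rabbits data frame and its columns metab, treat, time are invented for illustration, and the observations are treated as independent):

set.seed(101)
rabbits <- expand.grid(treat = factor(c("control", "drug")),
                       time  = factor(c("before", "after"), levels = c("before", "after")),
                       rep   = 1:10)
rabbits$metab <- 10 + 1*(rabbits$time == "after") +
    2*(rabbits$treat == "drug" & rabbits$time == "after") + rnorm(nrow(rabbits))
m_rabbit <- lm(metab ~ treat * time, data = rabbits)
summary(m_rabbit)$coefficients["treatdrug:timeafter", ]  ## did the drug group change differently?
## (real before/after measurements on the same rabbit would call for a paired or mixed model)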

Interactions and parameters

  • In simple cases, the interaction may use only one parameter
  • We can use CIs, and coefficient plots, and get a pretty good idea what’s going on
  • In complicated cases, interaction terms have many parameters
  • These have all the interpretation problems of other multi-parameter variables, or more
  • Think about “differences in differences”

Interactions: example

  • Bear road-crossing
  • Predictor variables: sex (categorical), road type (categorical: major/minor), road length (continuous)
  • Two-way interactions
    • sex \(\times\) road length: “are females more sensitive to amount of road than males?”
    • sex \(\times\) road type: “do females vary behaviour between road types more than males?”
    • road type \(\times\) road length: “does amount of road affect crossings differently for different road types?”

Statistical philosophy

  • Don’t accept the null hypothesis
    • Don’t throw out predictors you wanted to test because they’re not significant
    • Don’t throw out interactions you wanted to test because they’re not significant
  • This may make your life harder, but it’s worth it for the karma
    • There are techniques to deal with multiple predictors (e.g., ridge or lasso regression)
    • There are good ways to estimate sensible main effects in the presence of interactions: centering variables (Schielzeth 2010) and “sum-to-zero contrasts”

Diagnostics

  • Because the linear model is (sometimes!) sensitive to violations of its assumptions, it is good to evaluate them
  • Concerns (in order):
    • Bias/linearity: did you miss a nonlinear relationship?
    • Heteroscedasticity (does variance change across the data set, e.g. increasing variance with increasing mean?)
    • Normality (assuming no overall problems, do your residuals look Normal?)
    • Independence is hard to test
  • Normality is the least important of these assumptions

Default plots in R

skewdata <- read.csv("../data/skewdat.csv")
m1 <- lm(skew~Size, data=skewdata)
plot(m1, id.n=4)   ## standard lm diagnostic plots; id.n labels the 4 most extreme points

performance::check_model()

performance::check_model(m1)

Transformations

  • One way to deal with problems in model assumptions is by transforming one or more of your variables
  • Transformations are not cheating: a transformed scale may be as natural (or more natural) a way to think about your data as your original scale
  • The linear scale (no transformation) often has direct meaning, if you are adding things up or scaling them (as in our ant example)
  • The log scale is often the best scale for thinking about physical quantities: 1 is to 10 as 10 is to?
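
A small sketch with simulated data from a multiplicative process, where the log scale is the natural one:

set.seed(101)
dd <- data.frame(x = runif(80, 0, 3))
dd$y <- exp(1 + 0.8*dd$x + rnorm(80, sd = 0.3))   ## multiplicative (lognormal) errors
m_raw <- lm(y ~ x, data = dd)        ## curved and heteroscedastic on the raw scale
m_log <- lm(log(y) ~ x, data = dd)   ## linear, roughly constant variance on the log scale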

Transformation tradeoffs

  • A transformation may help you meet model assumptions
    • Homoscedasticity
    • Linearity
    • Normality
  • But there is no guarantee that you can fix them all
  • Piles of zeros are hard too (consider GLMs)

Transformations to consider

  • log-lin, lin-log and log-log for various sorts of exponential and power relationships
  • Box-Cox and related transformations
  • Avoid classical ‘transform then linear model’ recommendations for
    • probability data (logistic, arcsin or arcsin-square root) (Warton and Hui 2011)
    • or count data (log, log(1+x) or square root) (O’Hara and Kotze 2010; Ives 2015; Warton et al. 2016; Morrissey and Ruxton 2020)
    • Generally better to respect the structure of these data with a GLM

Deciding whether to transform

  • It’s not OK to pick transformations based on trying different ones and looking at \(p\) values
  • Box-Cox transformation tries out transformations of the form \((y^\lambda-1)/\lambda\) (\(\lambda=0\) corresponds to log-transformation) (positive data only)
  • see MASS::boxcox() or car::powerTransform()
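
A sketch with a simulated, strictly positive response (the data and m_pos are made up for illustration):

set.seed(101)
dd <- data.frame(x = runif(80))
dd$y <- exp(1 + 2*dd$x + rnorm(80, sd = 0.3))   ## strictly positive response
m_pos <- lm(y ~ x, data = dd)
MASS::boxcox(m_pos)          ## profile log-likelihood over lambda; a peak near 0 suggests a log transform
car::powerTransform(m_pos)   ## maximum-likelihood estimate of lambda (see summary() for tests)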

Collinearity

  • ignore common advice about “variance inflation factors”
  • simply throwing away correlated predictors is snoop-y and unstable
  • other solutions: PCA, a priori choice
  • Dormann et al. (2012), Graham (2003), Morrissey and Ruxton (2018), Vanhove (2021)
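
A hedged sketch of the PCA option, with simulated, strongly correlated predictors (all names are made up):

set.seed(101)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)                 ## strongly correlated with x1
y  <- 1 + x1 + x2 + rnorm(100)
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]
m_pc <- lm(y ~ pc1)                             ## one combined axis instead of two collinear predictors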

Tools for fitting and inference

Basic tools

  • lm fits a linear model
  • summary prints statistics associated with the parameters that were fitted
  • dotwhisker::dwplot() plots a coefficient table (Gelman, Pasarica, and Dodhia 2002); by_2sd=TRUE automatically scales continuous predictors
  • car::Anova() and drop1() will find \(p\) values that test the effect of variables that have more than one parameter. anova(model1, model2) tests the difference between two specific models (only needed for specialized comparisons)
  • broom::tidy() gives you tidy coefficient tables; use conf.int = TRUE to get confidence intervals (see the sketch below)
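
A sketch applying these tools to a small simulated fit (the data and the fit object are made up for illustration):

set.seed(101)
dd <- data.frame(x = rnorm(60), g = factor(rep(c("a", "b", "c"), 20)))
dd$y <- 1 + 0.5*dd$x + c(0, 1, 2)[dd$g] + rnorm(60)
fit <- lm(y ~ x + g, data = dd)
summary(fit)                              ## parameter-level statistics
broom::tidy(fit, conf.int = TRUE)         ## the same table, tidy, with confidence intervals
dotwhisker::dwplot(fit, by_2sd = TRUE)    ## coefficient plot, continuous predictors scaled by 2 SD
car::Anova(fit)                           ## variable-level p value for the 3-level factor g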

Plotting

  • plot applied to an lm object gives you diagnostic plots
  • In ggplot, geom_smooth(method="lm") fits a linear model to each group of data (i.e. each group that you have identified by plotting it in a different colour, within a different facet, etc.); see the sketch after this list
  • effects, emmeans, sjPlot packages …
  • tools:
    • predict gives predicted values and (with se.fit=TRUE) standard errors
    • simulate simulates values from the fitted model
    • methods(class="lm") shows lots of possibilities
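
A sketch with simulated data, combining geom_smooth() for per-group fits and predict() for model-based predictions:

set.seed(101)
dd <- data.frame(x = runif(50), g = factor(rep(c("a", "b"), 25)))
dd$y <- 1 + 2*dd$x + (dd$g == "b") + rnorm(50, sd = 0.3)
fit <- lm(y ~ x + g, data = dd)
library(ggplot2)
ggplot(dd, aes(x, y, colour = g)) + geom_point() +
    geom_smooth(method = "lm")                      ## a separate fitted line for each colour group
newdat <- data.frame(x = seq(0, 1, by = 0.1), g = "a")
predict(fit, newdata = newdat, se.fit = TRUE)       ## predicted values and standard errors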

Examples

Simple model

data

lizards <- read.csv("../data/lizards.csv", stringsAsFactors=TRUE)
lizards$time <- factor(lizards$time,levels=c("early","midday","late")) ## put time levels in their natural order
lmint <- lm(grahami~light*time, data=lizards)    ## interaction model
lmboth <- lm(grahami~light + time, data=lizards) ## additive model
lmlight <- lm(grahami~light, data=lizards)       ## light only
lmtime <- lm(grahami~time, data=lizards)         ## time only

Summary

summary(lmboth)
## 
## Call:
## lm(formula = grahami ~ light + time, data = lizards)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4250 -11.6937   0.2125   6.6250  28.5750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   27.288      5.631   4.846 0.000112 ***
## lightsunny   -19.325      5.734  -3.370 0.003213 ** 
## timemidday    13.137      7.106   1.849 0.080099 .  
## timelate      -9.500      6.853  -1.386 0.181743    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.71 on 19 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.475 
## F-statistic: 7.634 on 3 and 19 DF,  p-value: 0.001513

Test time by comparing with light model (dropping time)

anova(lmboth, lmlight)
## Analysis of Variance Table
## 
## Model 1: grahami ~ light + time
## Model 2: grahami ~ light
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     19 3569.6                              
## 2     21 5481.9 -2   -1912.3 5.0895 0.01698 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test light (identical to the test in summary())

anova(lmboth, lmtime)
## Analysis of Variance Table
## 
## Model 1: grahami ~ light + time
## Model 2: grahami ~ time
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     19 3569.6                                
## 2     20 5703.6 -1     -2134 11.359 0.003213 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## In this simple case, drop1 and car::Anova
##  will do exactly the same thing we did above

Test both

drop1(lmboth, test="F")
## Single term deletions
## 
## Model:
## grahami ~ light + time
##        Df Sum of Sq    RSS    AIC F value   Pr(>F)   
## <none>              3569.6 124.03                    
## light   1    2134.0 5703.6 132.81 11.3589 0.003213 **
## time    2    1912.3 5481.9 129.90  5.0895 0.016983 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Better anova table

car::Anova(lmboth)
## Anova Table (Type II tests)
## 
## Response: grahami
##           Sum Sq Df F value   Pr(>F)   
## light     2134.0  1 11.3589 0.003213 **
## time      1912.3  2  5.0895 0.016983 * 
## Residuals 3569.6 19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction model

print(summary(lmint))
## 
## Call:
## lm(formula = grahami ~ light * time, data = lizards)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.250  -4.125   1.250   6.875  17.750 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             23.500      5.375   4.372 0.000416 ***
## lightsunny             -11.750      7.602  -1.546 0.140606    
## timemidday              27.750      7.602   3.650 0.001980 ** 
## timelate               -12.750      7.602  -1.677 0.111798    
## lightsunny:timemidday  -32.833     11.190  -2.934 0.009266 ** 
## lightsunny:timelate      6.500     10.751   0.605 0.553432    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 17 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.677 
## F-statistic: 10.22 on 5 and 17 DF,  p-value: 0.0001179

Sum-to-zero contrasts

If you want to make comparisons you should centre continuous input variables and use sum-to-zero contrasts for factors:

lmint_c <- update(lmint, contrasts = list(time = contr.sum, light = contr.sum))
print(summary(lmint_c))
## 
## Call:
## lm(formula = grahami ~ light * time, data = lizards, contrasts = list(time = contr.sum, 
##     light = contr.sum))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.250  -4.125   1.250   6.875  17.750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.2361     2.2547   8.088 3.14e-07 ***
## light1        10.2639     2.2547   4.552 0.000282 ***
## time1         -0.6111     3.1463  -0.194 0.848299    
## time2         10.7222     3.2714   3.278 0.004440 ** 
## light1:time1  -4.3889     3.1463  -1.395 0.181001    
## light1:time2  12.0278     3.2714   3.677 0.001870 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 17 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.677 
## F-statistic: 10.22 on 5 and 17 DF,  p-value: 0.0001179

Multiple comparisons


  • Multiple comparisons are tricky
  • We can often get away without them for a linear model (use variable-level \(p\) values), unless you want to do pairwise comparisons
  • Unlike variable-level tests, multiple comparisons depend on how you set up your model (i.e. centering of continuous variables, contrasts of categorical variables)

compact letter displays

There is a tradition of figuring out which pairs of treatments are significantly different and assigning a letter to each such “not-clearly-not-equivalent” group. If you need to do this to make your supervisor happy, you can use ggpubr; the multcomp package can do the basic computations too (see the sketch below).
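
A sketch of the multcomp route, applied to the lmboth model fitted in the Examples above (requires the multcomp package):

library(multcomp)
g1 <- glht(lmboth, linfct = mcp(time = "Tukey"))   ## all pairwise comparisons among time levels
summary(g1)                                        ## Tukey-adjusted p values
cld(g1)                                            ## compact letter display ("a", "b", ...)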

Pairwise comparisons with emmeans

See emmeans vignette on comparisons

library(emmeans)
library(ggplot2); theme_set(theme_bw())
e1 <- emmeans(lmboth, spec = ~ time)                    ## estimated marginal means for each time level
plot(pairs(e1)) + geom_vline(xintercept = 0, lty = 2)   ## pairwise differences, with a reference line at zero

Expected marginal means plot

Or you could just look at the picture:

plot(e1)

Generalizations

(part of) the statistical universe

Extensions (Faraway 2016)

  • Generalized linear models
    • (Some) non-linear relationships
    • Non-normal response families (binomial, Poisson, …); see the sketch after this list
  • Mixed models incorporate random effects
    • Categories in data represent samples from a population
    • e.g. species, sites, genes … experimental blocks
  • Generalized additive models (GAMs)
    • much broader range of non-linear relationships
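
Since grahami is a count, a Poisson GLM may respect the data structure better than the Gaussian models above; a minimal sketch (not a full analysis, and overdispersion should be checked):

glm_both <- glm(grahami ~ light + time, family = poisson, data = lizards)
summary(glm_both)   ## coefficients are now on the log (link) scale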

References

Dormann, Carsten F., Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R. García Marquéz, et al. 2012. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography. https://doi.org/10.1111/j.1600-0587.2012.07348.x.
Faraway, Julian J. 2016. Extending Linear Models with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. 2nd ed. Chapman & Hall/CRC.
Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. 2002. “Let’s Practice What We Preach: Turning Tables into Graphs.” The American Statistician 56 (2): 121–30. http://www.jstor.org/stable/3087382.
Graham, Michael H. 2003. “Confronting Multicollinearity in Ecological Multiple Regression.” Ecology 84 (11): 2809–15. https://doi.org/10.1890/02-3114.
Ives, Anthony R. 2015. “For Testing the Significance of Regression Coefficients, Go Ahead and Log-Transform Count Data.” Methods in Ecology and Evolution 6 (7): 828–35.
Morrissey, Michael B., and Graeme D. Ruxton. 2018. “Multiple Regression Is Not Multiple Regressions: The Meaning of Multiple Regression and the Non-Problem of Collinearity.” Philosophy, Theory, and Practice in Biology 10 (3).
———. 2020. “Revisiting Advice on the Analysis of Count Data.” Methods in Ecology and Evolution 11 (9): 1133–40.
O’Hara, Robert, and Johan Kotze. 2010. “Do Not Log-Transform Count Data.” Nature Precedings, 1–1.
Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients.” Methods in Ecology and Evolution 1: 103–13. https://doi.org/10.1111/j.2041-210X.2010.00012.x.
Vanhove, Jan. 2021. “Collinearity Isn’t a Disease That Needs Curing.” Meta-Psychology 5 (April). https://doi.org/10.15626/MP.2021.2548.
Warton, David I., and Francis K. C. Hui. 2011. “The Arcsine Is Asinine: The Analysis of Proportions in Ecology.” Ecology 92 (January): 3–10. https://doi.org/10.1890/10-0340.1.
Warton, David I., Mitchell Lyons, Jakub Stoklosa, and Anthony R. Ives. 2016. “Three Points to Consider When Choosing a LM or GLM Test for Count Data.” Methods in Ecology and Evolution 7 (8): 882–90.