19 Mar 2026

the big picture

Up to now, univariate models; have to decide which is “the” response variable.

Might measure lots of different characteristics for each individual. Could model characteristics separately, but we don’t get any information about the overall changes among individuals. We might want:

  • a combined test of the effect of a predictor on individuals’ characteristics
  • information on how characteristics change in a correlated way, at the population level or within individuals

Alternatives

  • several (or many) univariate models with multiplicity correction (e.g. differential expression analysis)
  • collapse variables to summary statistics (diversity metrics, principal components)

examples

  • morphometric measurements (e.g. head breadth, head length, snout-vent length)
  • morphometric ‘landmarks’ (e.g. vein intersections on a fly wing)
  • responses to different stimuli
  • suites of behavioural responses
  • omics: gene expression, microbial abundance, etc.

what is a covariance matrix?

\[ \left( \begin{array}{cccc} \sigma^2_1 & \sigma_{12} & \sigma_{13} & \dots \\ . & \sigma^2_{2} & \sigma_{23} & \dots \\ . & . & \sigma^2_{3} & \dots \\ \vdots & \vdots & \vdots & \ddots \end{array} \right) \]

  • symmetric
  • \(\sigma^2_i\) are the variances of each variable
  • \(\sigma_{ij}\) are the covariances between \(i\) and \(j\)
  • scaling the covariance matrix appropriately → correlation matrix
  • the variables may be response variables (leg, arm, …) or effects (slope, contrasts)

Useful packages

library(lme4)
library(glmmTMB)
## Warning in check_dep_version(dep_pkg = "TMB"): package version mismatch: 
## glmmTMB was built with TMB package version 1.9.19
## Current TMB package version is 1.9.20
## Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(tidyverse); theme_set(theme_bw())
library(corrplot)
library(broom.mixed)
library(dotwhisker)
library(car)
## install.packages('gllvm',
##                  repos = c('https://jenniniku.r-universe.dev',
##                            'https://cloud.r-project.org'))
library(gllvm)
## maybe?
library(MCMCglmm)
library(brms)

data

We’ll use a Drosophila melanogaster data set from Ian Dworkin’s PhD work (Dworkin 2005). This was from a study that was meant to test predictions of a model on how mutational and environmental variation can influence the overall structure of phenotypic variation. The data include several traits (lengths) on the first leg as well as the number of sex comb teeth (a structure used to clasp females during copulation) for different wild type strains (line) reared at different developmental temperatures (temp), with and without a mutation that effects proximal-distal axis development in limbs (genotype).

Get data from https://doi.org/10.5061/dryad.8376 or this repo:

dll_data <- read_csv("../data/dll.csv", show_col_types = FALSE) |>
  mutate(across(c(temp, replicate, where(is.character)), factor))
morph_vars <- c("femur","tibia","tarsus","SCT")

dll_data
## # A tibble: 1,973 × 8
##    replicate line   genotype temp  femur tibia tarsus   SCT
##    <fct>     <fct>  <fct>    <fct> <dbl> <dbl>  <dbl> <dbl>
##  1 1         line-1 Dll      25    0.590 0.499  0.219     9
##  2 1         line-1 Dll      25    0.550 0.501  0.214    13
##  3 1         line-1 Dll      25    0.588 0.488  0.211    11
##  4 1         line-1 Dll      25    0.588 0.515  0.211    NA
##  5 1         line-1 Dll      25    0.596 0.502  0.207    12
##  6 1         line-1 Dll      25    0.577 0.499  0.207    14
##  7 1         line-1 Dll      25    0.618 0.494  0.204    11
##  8 1         line-1 Dll      25    0.582 0.528  0.203    12
##  9 1         line-1 Dll      25    0.572 0.479  0.202    10
## 10 1         line-1 Dll      25    0.568 0.479  0.202    12
## # ℹ 1,963 more rows

## tabulate and show only the first five lines
with(dll_data, table(temp, line, genotype))[,1:5,]
## , , genotype = Dll
## 
##     line
## temp line-1 line-11 line-12 line-13 line-15
##   25     20      16      20      17      19
##   30      9      30      20       6      14
## 
## , , genotype = wt
## 
##     line
## temp line-1 line-11 line-12 line-13 line-15
##   25     42      22      22      19      20
##   30     21      32      14      21      16

viz 1

grp <- as.numeric(as.factor(dll_data$temp)) ## numeric line value
pairs(dll_data[morph_vars], gap = 0,
      col = grp)

(there’s a GGally::ggpairs() function but it’s clunky)

viz 2

form <- ~ femur + tibia + tarsus + SCT | interaction(genotype, temp)
car::scatterplotMatrix( form,
                  ellipse = TRUE, data = dll_data, gap = 0,
                  plot.points = TRUE, pch = 20, cex  = 0.5)

Should you scale the response variables?

  • it depends
  • if responses have different units, yes
  • if responses have same units (e.g. all lengths), maybe
    • do you want to weight responses with larger variance more heavily?
  • log transformation is also possible
    • also makes changes unitless (proportional)
    • but makes different assumptions about the shape of the data

scaling

dll_data <- (dll_data |>
               ## scale & center (and convert back from matrix to vector)
               mutate(across(any_of(morph_vars), ~drop(scale(.)),
                             .names = "{.col}_s"))
)
morph_vars_s <- paste0(morph_vars, "_s")

Multivariate linear model

\[ \mathbf{Y} = \mathbf{XB} + \mathbf{E} \]

  • \(\mathbf{Y}\), \(\mathbf{B}\), \(\mathbf{E}\) are all matrices (stacking by column)
  • like doing a bunch of linear regresssions in parallel (efficiently)

fit the model

  • left-hand side of the formula is a numeric matrix
  • right-hand side is … whatever …
mlm_fit1 <- lm(as.matrix(dll_data[morph_vars_s]) ~ genotype,
               data = dll_data)
print(mlm_fit1)
## 
## Call:
## lm(formula = as.matrix(dll_data[morph_vars_s]) ~ genotype, data = dll_data)
## 
## Coefficients:
##              femur_s  tibia_s  tarsus_s  SCT_s  
## (Intercept)   0.0861   0.2204  -0.0799    0.1422
## genotypewt   -0.1514  -0.3931   0.1419   -0.2948

  • summary() gives a list of summaries, one for each response
  • coef() gives the coefficient matrix
  • broom::tidy() works too
coef(mlm_fit1)
##             femur_s tibia_s tarsus_s  SCT_s
## (Intercept)  0.0861   0.220  -0.0799  0.142
## genotypewt  -0.1514  -0.393   0.1419 -0.295
head(broom::tidy(mlm_fit1))
## # A tibble: 6 × 6
##   response term        estimate std.error statistic  p.value
##   <chr>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 femur_s  (Intercept)   0.0861    0.0331      2.60 9.40e- 3
## 2 femur_s  genotypewt   -0.151     0.0442     -3.43 6.22e- 4
## 3 tibia_s  (Intercept)   0.220     0.0339      6.50 1.03e-10
## 4 tibia_s  genotypewt   -0.393     0.0453     -8.69 7.91e-18
## 5 tarsus_s (Intercept)  -0.0799    0.0343     -2.33 2.01e- 2
## 6 tarsus_s genotypewt    0.142     0.0458      3.10 1.97e- 3

car::Anova(mlm_fit1)
## 
## Type II MANOVA Tests: Pillai test statistic
##          Df test stat approx F num Df den Df Pr(>F)    
## genotype  1     0.102       54      4   1913 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a more complicated model

We can include multiple predictors, interactions, etc..

mlm_fit2 <- update(mlm_fit1, . ~ genotype*temp)
car::Anova(mlm_fit2)
## 
## Type II MANOVA Tests: Pillai test statistic
##               Df test stat approx F num Df den Df Pr(>F)    
## genotype       1    0.1042     55.6      4   1911 <2e-16 ***
## temp           1    0.3077    212.3      4   1911 <2e-16 ***
## genotype:temp  1    0.0761     39.3      4   1911 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Effect sizes?

  • Mahalanobis distance: scale distances by within-group covariance matrix
  • multivariate analogue of Cohen’s \(D\)

Mixed models

The previous expression for the multivariate model was

\[ \mathbf{Y} = \mathbf{XB} + \mathbf{E} \]

In contrast, the expression for the mixed model is

\[ y = \mathbf{X\beta} + \mathbf{Zb} + \epsilon \]

where \(\mathbf{b}\) is a set of Gaussian variables with a variance-covariance matrix \(\mathbf{\Sigma}\) which we estimate.

  • Suppose we have observations of \(m\) individuals, with \(p\) different observations (traits, or time points, or types of measurement, or …) for each individual.
  • expand (“melt”, or tidyr::pivot_longer() the data to be \(mp\) observations long
  • then treat each individual (previously a single row but now \(p\) rows) as a group (we’ll call this units)

melting

Generate a single column with the numeric values of the responses, and a second variable that stores the trait type.

melting

dll_melt <- (dll_data
    |> select(-any_of(morph_vars))
    |> mutate(units=factor(1:n()))
    |> pivot_longer(names_to = "trait", cols = morph_vars_s)
    |> drop_na()
    |> arrange(units)
)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(morph_vars_s)
## 
##   # Now:
##   data %>% select(all_of(morph_vars_s))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## saveRDS(dll_melt, file="dll_melt.rds")

Look at how this has changed the structure of the data from a wide format:

head(dll_data) # original wide
## # A tibble: 6 × 12
##   replicate line   genotype temp  femur tibia tarsus   SCT femur_s tibia_s
##   <fct>     <fct>  <fct>    <fct> <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
## 1 1         line-1 Dll      25    0.590 0.499  0.219     9   1.51    0.597
## 2 1         line-1 Dll      25    0.550 0.501  0.214    13   0.138   0.654
## 3 1         line-1 Dll      25    0.588 0.488  0.211    11   1.45    0.194
## 4 1         line-1 Dll      25    0.588 0.515  0.211    NA   1.44    1.16 
## 5 1         line-1 Dll      25    0.596 0.502  0.207    12   1.70    0.712
## 6 1         line-1 Dll      25    0.577 0.499  0.207    14   1.04    0.599
## # ℹ 2 more variables: tarsus_s <dbl>, SCT_s <dbl>

to a long format:

head(dll_melt)
## # A tibble: 6 × 7
##   replicate line   genotype temp  units trait     value
##   <fct>     <fct>  <fct>    <fct> <fct> <chr>     <dbl>
## 1 1         line-1 Dll      25    1     femur_s   1.51 
## 2 1         line-1 Dll      25    1     tibia_s   0.597
## 3 1         line-1 Dll      25    1     tarsus_s  1.75 
## 4 1         line-1 Dll      25    1     SCT_s    -1.22 
## 5 1         line-1 Dll      25    2     femur_s   0.138
## 6 1         line-1 Dll      25    2     tibia_s   0.654

lmer formula

lmer_form <- value ~ 0 + trait:(genotype*temp) +
  (0+trait|line) + (0+trait|units)

Fixed-effect component

  • it doesn’t make sense to consider effects that apply equally to all traits
  • so, let trait interact with all the other variables, but nothing else
  • use +0 (or -1) to drop intercept
  • 0 + trait:(genotype*temp) is equivalent to
    trait:(1+genotype+temp+genotype:temp)
    • or trait + trait:genotype + trait:temp + trait:genotype:temp

Random-effect component

  • (trait+0|line) means “variance/covariances of traits among lines”
  • +0 so we consider traits, not differences among traits
    • rows/cols of covariance matrix are (femur, tibia, tarsus, SCT)
  • (trait+0|units) specifies “var/cov of traits among units (individuals)”

lme4: control specificatons

  • lmer always includes a residual variance term. This is redundant because we have only one data point per individual per trait: lmerControl(...) tells lmer not complain
  • As a result, our within-individual correlation estimates will be slightly overestimated (not so for GLMMs)

Now fit the model

lmer1 <- lmer(lmer_form,
              data=dll_melt,
              control=lmerControl(
                check.nobs.vs.nlev="ignore",
                check.nobs.vs.nRE="ignore"))

Notes on lme4 results

  • the fit is a little slow (about 10 seconds on a fast machine: GLMMs would be even slower)
  • print(lmer1), summary(lmer1) useful but awkward (16 fixed effect coefficients, we’ll look at graphical summaries instead
  • isSingular() tells you if the model is singular
  • fixef() (fixed effects), ranef() (line/indiv deviations from pop mean), coef() (line/indiv-level estimates), VarCorr (random effects var/cov)

lme4: random-effects var/cov

vv1 <- VarCorr(lmer1)
print(vv1)
##  Groups   Name          Std.Dev. Corr           
##  units    traitfemur_s  0.654                   
##           traitSCT_s    0.728    0.05           
##           traittarsus_s 0.565    0.58 0.20      
##           traittibia_s  0.670    0.91 0.12 0.47 
##  line     traitfemur_s  0.489                   
##           traitSCT_s    0.392    0.14           
##           traittarsus_s 0.462    0.81 0.37      
##           traittibia_s  0.472    0.87 0.22 0.83 
##  Residual               0.463

lme4: correlation plot

par(mfrow=c(1,2))
## fix unit variance-covariance by adding residual variance:
diag(vv1$units) <- diag(vv1$units)+sigma(lmer1)^2
corrplot.mixed(cov2cor(vv1$line),upper="ellipse")
corrplot.mixed(cov2cor(vv1$units),upper="ellipse")

Correlations of traits across lines are stronger than correlations within individuals. In both cases correlations are all positive (i.e. first axis of variation is size variation?)

lme4: coefficient plot

cc1 <- tidy(lmer1,effect="fixed") |>
  tidyr::separate(term,into=c("trait","fixeff"),extra="merge",
                  remove=FALSE)
dwplot(cc1)+facet_wrap(~fixeff,scale="free",ncol=2)+
  geom_vline(xintercept=0,lty=2)

These results tell approximately the same story (coefficients are consistent across traits within a fixed effect, e.g. effect of higher temperature is to reduce scores on all traits).

lme4: random effects confidence intervals

This is very slow, you probably don’t want to evaluate it on your computer

t1_CI <- system.time(cc2 <- tidy(lmer1,
            effect = "ran_pars",
            conf.int = TRUE,
            conf.method = "profile",
            parallel="multicore",
            ncpus=8))
saveRDS(cc2, file="../data/cc2.rds")

(about 40 minutes to run on 8 cores on a fast machine)

miscellaneous

  • same tricks work for glmmTMB
  • high-dimensional data (e.g. ’omics) may require other approaches
  • reduced-rank or factor-analytic models (glmmTMB only at present)
    • rr(0+trait|line), d) flattens the 4-D covariance into a \(d\)-D space
    • sort of like doing PCA and keeping only \(d\) components
  • Bayesian tools: MCMCglmm, brms
  • multi-type models (some numeric, some count, some binary traits?): MCMCglmm, gllvm?

what the heck are eigenvalues (eigenvectors)?

  • eigenvalues: variances along the principal axes of the covariance matrix
  • eigenvectors: directions of the principal axes

References