1 Thinking about effect sizes.

1.1 Introduction

For most of the tutorial we are going to stick to the two group type evaluation (i.e. extending on a “t-test” like scenario), but focusing on estimating meaningful effect sizes.

1.2 loading libraries.

1.2.1 PLEASE NOTE the dabestr library seems to have a bug

Temporary fix (thanks Ben)

Install (just once) using the line below (uncomment it)

#remotes::install_github("bbolker/dabestr", ref="unpaired_unbalanced")
library(dabestr)
## 
## Attaching package: 'dabestr'
## The following object is masked from 'package:base':
## 
##     load

If you have not installed these before, you may need to do first (you only need to install them once).

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:dabestr':
## 
##     cliffs_delta, cohens_d, cohens_h, hedges_g
library(emmeans)
library(ggplot2)
library(ggbeeswarm)

We are using some new R libraries today. Both effectsize and dabestr (also see here) that will help with making the estimates and the plotting. As we get to some more advanced models, some of the tools in ggstatsplot will help. Eventually we will also use both the effects library and emmeans library for complex models.

1.3 functions we will use in todays class

A nice set of outputs from a linear model

lm_out <- function(x = modname) {
    cbind(as.matrix(summary(x)$coef[,1:3]),
          as.matrix(confint(x)) )}

1.4 Back to our favourite data

sct_data  <- read.csv("http://beaconcourse.pbworks.com/f/dll.csv",
                       h = T, stringsAsFactors = TRUE)

sct_data <- na.omit(sct_data)

Today we are going to work with a subset of the data so we can focus on some of the questions. Specifically only with the flies reared at 25C from a single strain (line). This will result in a pretty modest sample size within each group, but useful for our purposes.

str(sct_data)
## 'data.frame':    1918 obs. of  8 variables:
##  $ replicate: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ line     : Factor w/ 27 levels "line-1","line-11",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ genotype : Factor w/ 2 levels "Dll","wt": 1 1 1 1 1 1 1 1 1 1 ...
##  $ temp     : int  25 25 25 25 25 25 25 25 25 25 ...
##  $ femur    : num  0.59 0.55 0.588 0.596 0.577 ...
##  $ tibia    : num  0.499 0.501 0.488 0.502 0.499 ...
##  $ tarsus   : num  0.219 0.214 0.211 0.207 0.207 ...
##  $ SCT      : int  9 13 11 12 14 11 12 10 12 13 ...
##  - attr(*, "na.action")= 'omit' Named int [1:55] 4 61 73 92 93 142 207 268 315 319 ...
##   ..- attr(*, "names")= chr [1:55] "4" "61" "73" "92" ...
with(sct_data, table( line, genotype, temp))
## , , temp = 25
## 
##            genotype
## line        Dll wt
##   line-1     19 36
##   line-11    16 22
##   line-12    20 21
##   line-13    15 19
##   line-15    18 20
##   line-16    11 19
##   line-17    10 19
##   line-18    21 22
##   line-19     4 19
##   line-2     20 38
##   line-20    17 20
##   line-21    20 20
##   line-22    14 19
##   line-23     8 20
##   line-24     0 20
##   line-26    17 20
##   line-27    20 20
##   line-3     20 24
##   line-4     32 38
##   line-6      4 17
##   line-7     40 44
##   line-8     30 43
##   line-9     16 15
##   line-CanS   0  0
##   line-OreR  12 20
##   line-Sam    0  0
##   line-w      8 19
## 
## , , temp = 30
## 
##            genotype
## line        Dll wt
##   line-1      8 21
##   line-11    28 31
##   line-12    20 14
##   line-13     6 21
##   line-15    14 16
##   line-16    17 18
##   line-17    20 18
##   line-18    38 38
##   line-19    20 20
##   line-2     20 22
##   line-20    22 19
##   line-21    21 18
##   line-22    20 25
##   line-23    20 20
##   line-24     4 16
##   line-26     0  0
##   line-27    18 22
##   line-3      6  5
##   line-4     21 19
##   line-6     11 14
##   line-7     22 21
##   line-8     17 18
##   line-9      6  3
##   line-CanS   8  4
##   line-OreR  17 20
##   line-Sam   17 21
##   line-w      8 19
sct_dat_subset <- sct_data[sct_data$temp == 25 & sct_data$line == "line-7",]

sct_dat_subset <- droplevels(sct_dat_subset)

dim(sct_dat_subset)
## [1] 84  8
with(sct_dat_subset, 
     table(genotype))
## genotype
## Dll  wt 
##  40  44
sct_dat_subset$genotype <- relevel(sct_dat_subset$genotype, "wt") # explain in class

1.5 quick plot of the data

ggplot(sct_dat_subset, aes(y = SCT, x = genotype, color = genotype)) +
  geom_quasirandom(dodge.width = 0.25, cex = 2, alpha = 0.8)

ggplot(sct_dat_subset, aes(y = tarsus, x = genotype, color = genotype)) +
  geom_quasirandom(dodge.width = 0.25, cex = 2, alpha = 0.8)

1.6 t-tests

  1. Run t-tests to compare differences between genotypes for the two response variables (SCT and tarsus). Anything that seems pretty important that is missing when you print out the results (print)
sct_t_test <- t.test(SCT ~ genotype, 
                    alternative = "two.sided",
                    data = sct_dat_subset)


print(sct_t_test)
## 
##  Welch Two Sample t-test
## 
## data:  SCT by genotype
## t = -7, df = 73, p-value = 3e-09
## alternative hypothesis: true difference in means between group wt and group Dll is not equal to 0
## 95 percent confidence interval:
##  -2.41 -1.31
## sample estimates:
##  mean in group wt mean in group Dll 
##              10.8              12.7
tarsus_t_test <- t.test(tarsus ~ genotype, 
                    alternative = "two.sided",
                    data = sct_dat_subset)

print(tarsus_t_test)
## 
##  Welch Two Sample t-test
## 
## data:  tarsus by genotype
## t = -2, df = 81, p-value = 0.03
## alternative hypothesis: true difference in means between group wt and group Dll is not equal to 0
## 95 percent confidence interval:
##  -0.011081 -0.000593
## sample estimates:
##  mean in group wt mean in group Dll 
##             0.196             0.202
  1. Do you know what “kind” of t-test this is (in terms of the samples)?

  2. Figure out how to extract just the t-statistic and degrees of freedom from this object. Indeed, if you can, write a little function to extract the t-statistic, df, confidence intervals, estimate and standard error only.

t_useful_bits <- function(t_object) {
  return(c( 
           t_object$estimate,
           difference = as.numeric(diff(t_object$estimate)),
           SE = t_object$stderr,
           CI = t_object$conf.int[1:2],
           t_object$statistic, 
           t_object$parameter))}
t_useful_bits(sct_t_test)
##  mean in group wt mean in group Dll        difference                SE 
##            10.818            12.675             1.857             0.276 
##               CI1               CI2                 t                df 
##            -2.406            -1.308            -6.738            73.255
t_useful_bits(tarsus_t_test)
##  mean in group wt mean in group Dll        difference                SE 
##          0.195791          0.201628          0.005837          0.002636 
##               CI1               CI2                 t                df 
##         -0.011081         -0.000593         -2.214473         81.248269
  1. Repeat this, but using the lm function. What do each of the coefficients (coef) mean in each model? How does it compare to the t-tests.
sct_mod1 <- lm(SCT ~ genotype, 
               data = sct_dat_subset)

tarsus_mod1 <- lm(tarsus ~ genotype, 
               data = sct_dat_subset)

coef(sct_mod1)
## (Intercept) genotypeDll 
##       10.82        1.86
coef(tarsus_mod1)
## (Intercept) genotypeDll 
##     0.19579     0.00584
  1. Instead of running summary on the model object, use the lm_out function I wrote above. What is the difference? Why did I do this?
lm_out(sct_mod1)
##             Estimate Std. Error t value 2.5 % 97.5 %
## (Intercept)    10.82      0.188   57.59 10.44   11.2
## genotypeDll     1.86      0.272    6.82  1.32    2.4
lm_out(tarsus_mod1)
##             Estimate Std. Error t value    2.5 % 97.5 %
## (Intercept)  0.19579    0.00182  107.64 0.192172 0.1994
## genotypeDll  0.00584    0.00264    2.21 0.000593 0.0111

1.7 How does this help us?

We have compares the effects of genotype on two traits? But how do we know which (if any) is having a big effect or a small effect? One is in count of bristles, one in length in mm. So it is pretty hard to compare. So perhaps a use of standardized effect sizes may be useful.

1.8 Effect sizes

We will compute a few of the effect sizes we discussed in lecture. We are going to use the functions in the effectsize library for this.

For simple experimental designs the effectsize library is generally a more useful choice than dabestr (but does not use bootstrapped CIs or makes pretty plots). Indeed it has many many options. One somewhat odd thing is that for glass_delta (where we used the standard deviation of the “control” group), it is the second group that is used for the standard deviation so we need to switch the releveling. Slight differences between the two libraries in values is because of some bias correction that I think is applied.

Again, all of this (how we are scaling our differences) would be determined a priori. For this tutorial I am just showing you how to do it.

We will focus on Hedges’ g as it is the “sample size corrected” version of Cohen’s D.

sct_dat_subset$genotype_rev <- relevel(sct_dat_subset$genotype, "Dll")

hedges_g(SCT ~ genotype_rev, data = sct_dat_subset)
## Hedges' g |       95% CI
## ------------------------
## 1.48      | [0.99, 1.95]
## 
## - Estimated using pooled SD.
hedges_g(tarsus ~ genotype_rev, data = sct_dat_subset)
## Hedges' g |       95% CI
## ------------------------
## 0.48      | [0.05, 0.91]
## 
## - Estimated using pooled SD.

We will discuss below how this helps us to compare the influence of genotype on these two very different types of traits.

We can also compute regular Cohen’s d or Glass’s delta (SD of control group is used for scaling)

cohens_d(SCT ~ genotype_rev, data = sct_dat_subset)
## Cohen's d |       95% CI
## ------------------------
## 1.49      | [1.00, 1.97]
## 
## - Estimated using pooled SD.
cohens_d(tarsus ~ genotype_rev, data = sct_dat_subset)
## Cohen's d |       95% CI
## ------------------------
## 0.48      | [0.05, 0.92]
## 
## - Estimated using pooled SD.
glass_delta(SCT ~ genotype_rev, data = sct_dat_subset)
## Glass' delta |       95% CI
## ---------------------------
## 1.71         | [1.09, 2.32]
glass_delta(tarsus ~ genotype_rev, data = sct_dat_subset)
## Glass' delta |       95% CI
## ---------------------------
## 0.48         | [0.04, 0.92]

1.9 Gardner-Altman estimation plots

Let’s first use dabestr to compute the differences and then use a Garnder-Altman estimation plot from Gardner and Altman 1986 (British Medical Journal 292:746).

dim(sct_dat_subset)
## [1] 84  9
# needs a tibble
sct_dat_subset_tb <- tibble::tibble(sct_dat_subset)

diff_SCT <- load(data = sct_dat_subset_tb, 
                               x = genotype, y = SCT,
                             idx = c("wt", "Dll")) 

mean_diff_SCT <- mean_diff(diff_SCT) # get mean difference
## Warning in wilcox.test.default(control, test, alternative = "two.sided"):
## cannot compute exact p-value with ties
## Warning in asin(sqrt(prop_control)): NaNs produced
## Warning in asin(sqrt(prop_test)): NaNs produced
mean_diff_SCT
## DABESTR v2023.9.12
## ==================
## 
## Good afternoon!
## The current time is 13:15 pm on Saturday February 17, 2024.
## 
## The unpaired mean difference between Dll and wt is 1.857 [95%CI 1.325, 2.393].
## The p-value of the two-sided permutation t-test is 0.0000, calculated for legacy purposes only.
## 
## 5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
## Any p-value reported is the probability of observing the effect size (or greater),
## assuming the null hypothesis of zero difference is true.
## For each p-value, 5000 reshuffles of the control and test labels were performed.
dabest_plot(mean_diff_SCT)

diff_tarsus <- load(data = sct_dat_subset_tb, 
                               x = genotype, y = tarsus,
                             idx = c("wt", "Dll")) 

mean_diff_tarsus <- mean_diff(diff_tarsus) # get mean difference
## Warning in wilcox.test.default(control, test, alternative = "two.sided"):
## cannot compute exact p-value with ties
mean_diff_tarsus
## DABESTR v2023.9.12
## ==================
## 
## Good afternoon!
## The current time is 13:15 pm on Saturday February 17, 2024.
## 
## The unpaired mean difference between Dll and wt is 0.006 [95%CI 0, 0.011].
## The p-value of the two-sided permutation t-test is 0.0296, calculated for legacy purposes only.
## 
## 5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
## Any p-value reported is the probability of observing the effect size (or greater),
## assuming the null hypothesis of zero difference is true.
## For each p-value, 5000 reshuffles of the control and test labels were performed.
dabest_plot(mean_diff_tarsus)

1.10 standardized measures of effect

Or we can (using dabest) do Cohen’s d or Hedges’s g. Importantly in dabest the Hedges g is just the bias corrected version of Cohen’s d. For this data set (since sample sizes are not too small), the differences do not matter much. Usually best to use Hedge’s g if you are not sure though.

we have to use dabestr::functionName as both packages have functions with the same name.

G_SCT <- dabestr::hedges_g(diff_SCT)
## Warning in wilcox.test.default(control, test, alternative = "two.sided"):
## cannot compute exact p-value with ties
## Warning in asin(sqrt(prop_control)): NaNs produced
## Warning in asin(sqrt(prop_test)): NaNs produced
G_SCT
## DABESTR v2023.9.12
## ==================
## 
## Good afternoon!
## The current time is 13:15 pm on Saturday February 17, 2024.
## 
## The unpaired Hedges'g between Dll and wt is 1.476 [95%CI 1.003, 1.927].
## The p-value of the two-sided permutation t-test is 0.0000, calculated for legacy purposes only.
## 
## 5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
## Any p-value reported is the probability of observing the effect size (or greater),
## assuming the null hypothesis of zero difference is true.
## For each p-value, 5000 reshuffles of the control and test labels were performed.
dabest_plot(G_SCT)

G_tarsus <- dabestr::hedges_g(diff_tarsus)
## Warning in wilcox.test.default(control, test, alternative = "two.sided"):
## cannot compute exact p-value with ties
G_tarsus
## DABESTR v2023.9.12
## ==================
## 
## Good afternoon!
## The current time is 13:15 pm on Saturday February 17, 2024.
## 
## The unpaired Hedges'g between Dll and wt is 0.479 [95%CI 0.023, 0.942].
## The p-value of the two-sided permutation t-test is 0.0296, calculated for legacy purposes only.
## 
## 5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
## Any p-value reported is the probability of observing the effect size (or greater),
## assuming the null hypothesis of zero difference is true.
## For each p-value, 5000 reshuffles of the control and test labels were performed.
dabest_plot(G_tarsus)

1.10.1 what does this tell us?

Looking at these plots we can say (relative to the variation in the traits) that the effect the mutant allele has on the number of SCT is much greater than on the length of the leg segment (for this strain anyways). That is we can use this to understand something of the pleiotropic effects of the mutation.

How much greater?

1.476/0.479
## [1] 3.08

About 3 times greater effect (relative to the pooled standard deviation of each trait).

1.11 What if we wanted to compare just to the variation in the control group.

Normally we would have decided all of this in advance of collecting data and modeling. But for purposes of getting a sense of how things may (or may not) change, let’s try doing this just using the standard deviation of the control group (the wild type).

SCT_means <- with(sct_dat_subset, 
                  tapply(SCT, INDEX = genotype, mean))

SCT_sd <- with(sct_dat_subset, 
                  tapply(SCT, INDEX = genotype, sd))

tarsus_means <- with(sct_dat_subset, 
                  tapply(tarsus, INDEX = genotype, mean))

tarsus_sd <- with(sct_dat_subset, 
                  tapply(tarsus, INDEX = genotype, sd))

Glass_delta_SCT <- diff(SCT_means)/SCT_sd["wt"]

Glass_delta_tarsus <- diff(tarsus_means)/tarsus_sd["wt"]


Glass_delta_SCT
##  Dll 
## 1.71
Glass_delta_tarsus
##   Dll 
## 0.484
Glass_delta_SCT/Glass_delta_tarsus
##  Dll 
## 3.54

About 3.5 times greater in effect on SCT compared to the length of the tarsus. Take a look and see if you can figure out why this differs from Hedge’s g

1.12 How about scaling by the mean of the control group?

Another common approach would be to scale by the mean of the control group. There may be a pre-built function, but I could not find it, so let’s just hard code it.

diff_mean_scaled_SCT <- diff(SCT_means)/SCT_means["wt"]

diff_mean_scaled_tarsus <- diff(tarsus_means)/tarsus_means["wt"]

diff_mean_scaled_SCT 
##   Dll 
## 0.172
diff_mean_scaled_tarsus
##    Dll 
## 0.0298
diff_mean_scaled_SCT/diff_mean_scaled_tarsus
##  Dll 
## 5.76

Using the mean standardized measure of effect gives us a somewhat different picture, where the effects of the mutant allele on SCT number is approximately 5.8 times greater than the effect on the length of the tarsus. Often the mean scaled measures can be more difficult to interpret (even if they are pretty easy to compare), so use with some degree of caution.

Obviously (and I said it above, but it bears repeating), for a real analysis the appropriate measure of effect sizes would have been determined a priori along with what values of the measure would likely be deemed biologically relevant. In addition to our readings, there is a nice, but brief discussion of some of this here.

1.13 emmeans can do this all as well

For models with complexity of multiple predictors, continuous predictors, interactions etc, emmeans can probably handle what you want. It is very flexivle, but not super-intuitive (for emmeans, which is usually pretty good.)

See ?eff_size

Even though you are scaling by sigma in the function, you could use other measures.

Note that for complex models, a reasonable measure of the pooled sd is just the standard deviation of the residuals (sometimes called the residual standard error)

summary(sct_mod1)
## 
## Call:
## lm(formula = SCT ~ genotype, data = sct_dat_subset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.675 -0.818  0.182  1.182  3.325 
## 
## Coefficients:
##             Estimate Std. Error t value
## (Intercept)   10.818      0.188   57.59
## genotypeDll    1.857      0.272    6.82
## 
## Residual standard error: 1.25 on 82 degrees of freedom
## Multiple R-squared:  0.362,  Adjusted R-squared:  0.354 
## F-statistic: 46.5 on 1 and 82 DF,  p-value: 1.42e-09
emm1 <- emmeans(sct_mod1, ~genotype)
emm1
##  genotype emmean    SE df lower.CL upper.CL
##  wt         10.8 0.188 82     10.4     11.2
##  Dll        12.7 0.197 82     12.3     13.1
## 
## Confidence level used: 0.95
eff1 <- eff_size(emm1, sigma = sd(sct_mod1$residuals), edf = 50)

eff1
##  contrast effect.size    SE df lower.CL upper.CL
##  wt - Dll        -1.5 0.266 82    -2.03    -0.97
## 
## sigma used for effect sizes: 1.239 
## Confidence level used: 0.95
sqrt(2/50) # relative accuracy of sigma
## [1] 0.2