experimental design

the most important thing

  • design your experiment well and execute it well:
    you needn’t worry too much in advance about statistics
  • if you don’t: you’re doomed; statistics can’t save you
  • randomization, control, replication

randomization

  • random assignment to treatments
  • poorer alternative: haphazard assignment
  • stratification
    (i.e., randomize within groups; see the sketch after this list)
  • related: experimental blinding
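
a minimal base-R sketch of both ideas (the units, group sizes, and treatment labels are invented for illustration):

set.seed(101)
trts <- c("control", "treatment")
## complete randomization: 12 hypothetical units, 6 per treatment
sample(rep(trts, 6))
## stratified randomization: balance treatments within each group
## (e.g. females and males), then combine
females <- sample(rep(trts, 3))
males   <- sample(rep(trts, 3))
data.frame(group = rep(c("F", "M"), each = 6), trt = c(females, males))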

control

  • maximize desired variation
    • e.g. large doses (tradeoff with biological realism)
  • minimize undesired variation
    • within-subjects designs
      • e.g. paired, randomized-block, crossover (power payoff sketched after this list)
    • reduce environmental variation
      • tradeoff with generality
      • e.g. environmental chambers, inbred/clones
  • isolate desired effects: positive/negative controls
    (vehicle-only, cage treatments, etc.)
  • measure/control for covariates
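
one way to see the payoff of a within-subjects design, a sketch with invented numbers: for type = "paired", power.t.test()’s sd argument is the standard deviation of the within-pair differences, which shrinks as the within-pair correlation grows.

delta <- 2; sd_y <- 2; r <- 0.8          ## hypothetical effect size, SD, within-pair correlation
sd_diff <- sd_y * sqrt(2 * (1 - r))      ## SD of within-pair differences
power.t.test(n = 10, delta = delta, sd = sd_y)$power                      ## unpaired design
power.t.test(n = 10, delta = delta, sd = sd_diff, type = "paired")$power  ## paired design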

Hurlbert (1984) Table 1

Source of confusion: features of an experimental design that reduce or eliminate confusion

  • Temporal change: Control treatments
  • Procedure effects: Control treatments
  • Experimenter bias: Randomized assignment of experimental units to treatments;
    Randomization in conduct of other procedures; “Blind” procedures
  • Experimenter-generated variability (random error): Replication of treatments
  • Initial or inherent variability among experimental units: Replication of treatments;
    Interspersion of treatments; Concomitant observations
  • Nondemonic intrusion: Replication of treatments; Interspersion of treatments
  • Demonic intrusion: Eternal vigilance, exorcism, human sacrifices, etc.

replication

  • how big does your experiment need to be? (Lakens 2022)
  • power: probability of detecting an effect of a particular size,
    if one exists
  • more generally: how much information? what kinds of mistakes? (Gelman and Carlin 2014)
  • underpowered studies
    • failure is likely
    • cheating is likely
    • significance filter \(\to\) biased estimates
  • overpowered studies waste time, lives, $
  • pseudoreplication (Hurlbert 1984; Davies and Gray 2015): confounding sampling units with treatment units

power analysis

definition

  • power is the probability of (correctly) rejecting the null hypothesis, given a specified true effect
    • i.e., being able to see something clearly
  • depends on (see the sketch after this list)
    • biological effect size
    • noise level
    • experimental design (sample size + …)
  • uses null-effect counterfactual
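
a quick illustration of those dependencies, with arbitrary made-up values passed to base R’s power.t.test():

power.t.test(n = 10, delta = 1, sd = 1)$power   ## baseline
power.t.test(n = 10, delta = 2, sd = 1)$power   ## larger effect: more power
power.t.test(n = 10, delta = 1, sd = 2)$power   ## more noise: less power
power.t.test(n = 20, delta = 1, sd = 1)$power   ## larger sample: more power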

power analysis: a cautionary conversation

an introductory video

where do effect sizes come from?

  • pilot studies? (danger: effect-size estimates from small pilots are very noisy)
  • previous literature
  • minimal interesting effect size (“SESOI”, “MCID”)
  • estimates of uncertainty
    • particularly hard to think about
    • think in terms of coefficient of variation and translate
      (see the sketch below)
    • binomial tests are easier, but watch out:
      they assume no overdispersion

if you can’t guess an effect size, you shouldn’t be doing an experiment
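
a hedged sketch of the “translate” step (all values hypothetical): convert a guessed coefficient of variation into a standard deviation for power.t.test(), and use power.prop.test() for a binomial comparison (which assumes no overdispersion).

mean_y <- 10; cv <- 0.3                 ## guessed mean and CV of the response
sd_y <- cv * mean_y                     ## translate the CV into an SD
power.t.test(delta = 2, sd = sd_y, power = 0.8)    ## n per group for 80% power
## binomial version: detect a drop from 40% to 20% prevalence
power.prop.test(p1 = 0.4, p2 = 0.2, power = 0.8)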

power analysis: methods

  • dedicated software, e.g. G*Power (Faul et al. 2009) or Russ Lenth’s power/sample-size applets (Lenth 2006)
  • R functions and packages (below)
  • simulation (later)

methods, cont.: power analysis in R

apropos("^power\\.")  ## base-R functions
## [1] "power.anova.test" "power.prop.test"  "power.t.test"
a1 <- available.packages(repos="https://cran.rstudio.com")
pow_pkgs <- grepv("power", rownames(a1), ignore.case=TRUE)  ## grepv() needs R >= 4.5; otherwise grep(..., value=TRUE)
length(pow_pkgs)
## [1] 60
head(pow_pkgs, 10)
##  [1] "agpower"          "BayesianPower"    "BayesPower"       "CoRpower"        
##  [5] "crt2power"        "depower"          "easypower"        "ecopower"        
##  [9] "extraSuperpower"  "InteractionPoweR"

the pwr package

library("pwr")
apropos("^pwr")
##  [1] "pwr.2p.test"    "pwr.2p2n.test"  "pwr.anova.test" "pwr.chisq.test"
##  [5] "pwr.f2.test"    "pwr.norm.test"  "pwr.p.test"     "pwr.r.test"    
##  [9] "pwr.t.test"     "pwr.t2n.test"

also: library("sos"); findFn("{power analysis}")

ant example

dd <- read.csv("../data/ants.csv")            ## ant data (not used directly below)
power.t.test(n = 10, delta = 2, sd = 1)       ## power with n = 10 per group
power.t.test(power = 0.8, delta = 2, sd = 1)  ## n per group needed for 80% power

power curve calculation

nvec <- 2:15                     ## candidate per-group sample sizes
powfun <- function(n) {
   power.t.test(n, delta=2, sd=1)$power
}
powvec <- sapply(nvec, powfun)   ## power at each sample size
plot(nvec, powvec, type="b",
     xlab="sample size (each group)",
     ylab="power",
     main = "delta = 2, sd = 1",
     ylim = c(0,1))

power curve

(figure: power vs. per-group sample size, from the code above)

avoid: scaled effect sizes

  • e.g. Cohen’s \(d\)/\(g\)
  • scaled not by the standard deviation of a predictor, but by the residual (noise) standard deviation
  • removes units (and biological meaning)
  • refers only to clarity, not biological impact
  • if you need these for input to a program, start with biological effects and translate (see the sketch after this list)
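
if a program insists on a scaled effect size, one sketch of that translation with invented numbers, using the pwr package loaded above: start from the biological difference and the noise level, then divide to get Cohen’s d.

library("pwr")
delta <- 2; sd_y <- 1.5          ## hypothetical biological difference and residual SD
d <- delta / sd_y                ## Cohen's d derived from the biological quantities
pwr.t.test(d = d, power = 0.8)   ## n per group for 80% power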

avoid: “T-shirt” effect sizes

  • Cohen proposed standardized “small”, “medium”, and “large” scaled effect sizes
  • convenient: but scaled/unitless/non-biological!

From Russ Lenth FAQs:

for a “medium” effect size, you’ll choose the same \(n\) regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects

avoid: retrospective power analysis

  • running a power analysis after the experiment, plugging in the observed effect size
  • tautological: high \(p\)-value \(\leftrightarrow\) low power
  • essentially useless
  • instead:
    • show confidence intervals (see the sketch after this list)
    • (if necessary) pretend you’re doing prospective analysis
  • push back: Thomas (1997), Gerard, Smith, and Weerakkody (1998), Hoenig and Heisey (2001)
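
a minimal illustration of the “show confidence intervals” alternative, with simulated data standing in for a real experiment:

set.seed(101)
x <- rep(c("control", "treatment"), each = 10)   ## invented two-group experiment
y <- rnorm(20, mean = ifelse(x == "treatment", 11, 10), sd = 2)
m <- lm(y ~ x)
confint(m)["xtreatment", ]   ## report the estimate and its CI, not an after-the-fact power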

what to do about bad news?

  • simplify the question
  • use simpler designs (e.g. compare low vs. high levels rather than a continuous range)
  • push treatments harder
  • ask a different question

what if your analysis is more complex?

  • simplify
  • simulate
    • see chap 5 (Bolker 2008)
    • much more flexible
      • e.g. simulate effects of lack of balance
      • endpoints other than power (e.g. CV; see the sketch after the example below)

simulation (linear regression example)

## experimental design
N <- 20; x_min <- 0; x_max <- 2
x <- runif(N, min=x_min, max=x_max)
## model world
a <- 2; b <- 1; sd_y <- 1
## setup
nsim <- 1000; pval <- numeric(nsim); set.seed(101)
for (i in 1:nsim) {
  y_det <- a + b * x  ## deterministic y
  y <- rnorm(N, mean = y_det, sd = sd_y)
  m <- lm(y ~ x)
  pval[i] <- coef(summary(m))["x", "Pr(>|t|)"] ## extract p-value
}
mean(pval < 0.05)  ## estimated power
## [1] 0.688
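
under the same assumptions as the code above, the simulation is easy to extend: e.g. wrap it in a function of sample size to trace a power curve, or track an endpoint other than power, such as the coefficient of variation of the slope estimate.

simfun <- function(N, a = 2, b = 1, sd_y = 1, nsim = 1000) {
  x <- runif(N, min = 0, max = 2)      ## one realized design, as above
  pval <- bhat <- numeric(nsim)
  for (i in 1:nsim) {
    y <- rnorm(N, mean = a + b * x, sd = sd_y)
    m <- lm(y ~ x)
    pval[i] <- coef(summary(m))["x", "Pr(>|t|)"]
    bhat[i] <- coef(m)["x"]
  }
  c(power = mean(pval < 0.05),
    cv_slope = sd(bhat) / mean(bhat))  ## an endpoint other than power
}
set.seed(101)
sapply(c(10, 20, 40), simfun)          ## columns: N = 10, 20, 40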

power of clarity

references

Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Davies, G. Matt, and Alan Gray. 2015. “Don’t Let Spurious Accusations of Pseudoreplication Limit Our Ability to Learn from Natural Experiments (and Other Messy Kinds of Ecological Monitoring).” Ecology and Evolution, October. https://doi.org/10.1002/ece3.1782.
Faul, Franz, Edgar Erdfelder, Axel Buchner, and Albert-Georg Lang. 2009. “Statistical Power Analyses Using G*Power 3.1: Tests for Correlation and Regression Analyses.” Behavior Research Methods 41 (4): 1149–60. https://doi.org/10.3758/BRM.41.4.1149.
Gelman, Andrew, and John Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–51. https://doi.org/10.1177/1745691614551642.
Gerard, Patrick D., David R. Smith, and Govinda Weerakkody. 1998. “Limits of Retrospective Power Analysis.” The Journal of Wildlife Management 62 (2): 801–7. http://www.jstor.org/stable/3802357.
Hoenig, John M., and Dennis M. Heisey. 2001. “The Abuse of Power.” The American Statistician 55 (1): 19–24. https://doi.org/10.1198/000313001300339897.
Hurlbert, Stuart H. 1984. “Pseudoreplication and the Design of Ecological Field Experiments.” Ecological Monographs 54 (2): 187–211. https://doi.org/10.2307/1942661.
Lakens, Daniël. 2022. “Sample Size Justification.” Collabra: Psychology 8 (1): 33267. https://doi.org/10.1525/collabra.33267.
Lenth, R. V. 2006. “Java Applets for Power and Sample Size [Computer Software].” http://www.stat.uiowa.edu/~rlenth/Power.
Thomas, Len. 1997. “Retrospective Power Analysis.” Conservation Biology 11 (1): 276–80. https://doi.org/10.1046/j.1523-1739.1997.96102.x.