Install the lmPerm
, coin
,
gtools
packages if you don’t already have them. Download
the ants and reproductive skew data and put them
somewhere sensible.
library("ggplot2"); theme_set(theme_bw())
library("lmPerm")
library("coin")
library("gtools")
Data originally from Gotelli and Ellison (2004).
ants <- read.csv("../data/ants.csv")
ants$place <- factor(ants$place)
print(ants)
## place colonies
## 1 field 12
## 2 field 9
## 3 field 12
## 4 field 10
## 5 forest 9
## 6 forest 6
## 7 forest 4
## 8 forest 6
## 9 forest 7
## 10 forest 10
Look at the data (with stat_sum()
to visualize
overlapping data points. (Jittering with geom_jitter()
is
also a possibility, but stat_sum()
is prettier; try
geom_beeswarm
from the ggbeeswarm
package for
larger data sets). scale_size()
tells ggplot to scale the
area of the points proportional to the size (breaks=1:2
tells it what values to show in the legend). We don’t really need the
boxplot here, but shown for comparison (and to illustrate that boxplots
are a little silly for tiny data sets; if you must show them, show the
points as well).
print(ggplot(ants,aes(place,colonies))
+ geom_boxplot(fill="lightgray")
+ stat_sum(alpha=0.7)
+ scale_size(breaks=1:2, range=c(3,6))
)
There are always trade-offs between simplicity, transparency, length of code, computational efficiency …
The simplest way to do this would be something like:
set.seed(101) ## for reproducibility
nsim <- 9999
res <- numeric(nsim) ## set aside space for results
for (i in 1:nsim) {
## standard approach: scramble response value
perm <- sample(nrow(ants))
bdat <- transform(ants,colonies=colonies[perm])
## compute & store difference in means; store the value
res[i] <- mean(bdat$colonies[bdat$place=="field"])-
mean(bdat$colonies[bdat$place=="forest"])
}
obs <- mean(ants$colonies[ants$place=="field"])-
mean(ants$colonies[ants$place=="forest"])
## append the observed value to the list of results
res <- c(res,obs)
set.seed(<integer>)
resets the random-number
stream to a specified place. You can use any integer you like. You
should always use set.seed()
before running computations
involving randomizationsfor
loops are a way to repeat computations many times:
e.g. see here
for an introductiontransform()
is a base-R analog of tidyverse
mutate()
sample()
is a general-purpose tool: by default it
samples a specified number of values without replacement, which
means that it generates a re-ordering of the numbers,
e.g. set.seed(101); sample(5)
produces a vector
(2,1,3,5,4)
. colonies[perm]
above scrambles
the response variable with respect to the predictors (in this case, to
the “field” vs. “forest” location)A picture of the results:
hist(res,col="gray",las=1,main="")
abline(v=obs,col="red")
Some alternative recipes for computing the difference in the means:
(1) base R with aggregate()
… either of these could be
substituted for the mean(bdat,...)
line in the code
above.
agg <- aggregate(colonies~place,FUN=mean,data=bdat)
res[i] <- agg$colonies[1]-agg$colonies[2]
or tidyverse:
res[i] <- (bdat
%>% group_by(colonies)
%>% summarise(colonies=mean(colonies))
%>% pull(colonies) ## extract a single column
%>% diff() ## difference between elements
)
If you want to be very fancy/tidyverse-ish, check out
purrr::map_dbl(1:n, ~ ...)
Since there aren’t actually that many possible outcomes, we could plot them this way instead of using a histogram:
par(las=1,bty="l")
plot(prop.table(table(round(res,2))),
ylab="Proportion",axes=FALSE)
axis(side=2)
axis(side=1,at=seq(-4,4,by=2))
points(obs,0,pch=16,cex=1.5,col="red")
If we want a two-tailed test, we have to decide whether we are
doubling the observed value or counting the area in both tails. If
x
is a logical vector (such as res>=obs
),
then mean(x)
first converts FALSE
values to 0
and TRUE
values to 1, then computes the mean; this
calculates the proportion of the values that are
TRUE
. (It’s equivalent to
sum(x==TRUE)/length(x)
.)
2*mean(res>=obs) ## doubling (as suggested by JD)
## [1] 0.0466
mean(abs(res)>=abs(obs)) ## count both tails: matches lmPerm
## [1] 0.0374
Instead of computing the difference between means, we could use the
test statistic from a standard statistical test. Although we’re
using the same test statistic, we’re not assuming that the
values of the test statistic are \(t\)-distributed, which would require the
assumptions we want to avoid [i.e., that the colony counts are Normally
distributed within each habitat]. The standard parametric test to use
here would be a \(t\) test, or
equivalently a 1-way ANOVA (as done by lm()
). For some
reason R’s t-test seems to use opposite signs for the effect size
(i.e. it reports a positive value, “field minus forest”, rather than a
negative value as we did above), but this doesn’t really matter. The
test statistic here is not the difference between the means, as we used
above, but the difference divided by the standard error. In this case
this should give the same answer …
(tt <- t.test(colonies~place,data=ants,var.equal=TRUE))
##
## Two Sample t-test
##
## data: colonies by place
## t = 2.9632, df = 8, p-value = 0.01806
## alternative hypothesis: true difference in means between group field and group forest is not equal to 0
## 95 percent confidence interval:
## 0.8316859 6.6683141
## sample estimates:
## mean in group field mean in group forest
## 10.75 7.00
If you want to use this in your testing code you would use
tt <- t.test(colonies~place,data=bdat,var.equal=TRUE)
res[i] <- tt$statistic
in place of the difference between means computed above.
lmPerm
R has a software package (lmPerm
) that automatically
fits linear models and computes p-values by permutations. Here the
lmp()
function is the permutation analog of the
lm()
(linear model) function in base R.
summary(lmp(colonies~place,data=ants))
## [1] "Settings: unique SS "
##
## Call:
## lmp(formula = colonies ~ place, data = ants)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.000 -1.000 -0.375 1.250 3.000
##
## Coefficients:
## Estimate Pr(Exact)
## place1 1.875 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.961 on 8 degrees of freedom
## Multiple R-Squared: 0.5233, Adjusted R-squared: 0.4637
## F-statistic: 8.78 on 1 and 8 DF, p-value: 0.01806
We’ll talk more about specifying linear models next week, but the
formula is response ~ <predictor variables>
; in this
case place
is the predictor (this is again equivalent to a
t-test or a one-way ANOVA with two groups). lmp()
seems to
automatically change the contrast settings from the default treatment
contrast to sum-to-zero contrasts, so that the reported effect size is
half what it was (3.75/2), because it is computing the difference
between the (unweighted) average of the two groups and the first group
(field).
coin
The coin
package is big and complicated and powerful.
For each of the tests it provides, it allows a choice of whether to use
differences of ranks or raw differences, and whether to use (1)
asymptotic p-values (like the classic nonparametric tests:
Kruskal-Wallis, Mann-Whitney, etc.); (2) approximate p-values
(taking many random samples), or (3) exact p-values
(effectively, generating all possible combinations). The formulas are
interpreted in the same way as above.
## default: asymptotic
oneway_test(colonies~place,data=ants)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: colonies by place (field, forest)
## Z = 2.1701, p-value = 0.03
## alternative hypothesis: true mu is not equal to 0
## exact distribution
oneway_test(colonies~place,data=ants,distribution="exact")
##
## Exact Two-Sample Fisher-Pitman Permutation Test
##
## data: colonies by place (field, forest)
## Z = 2.1701, p-value = 0.0381
## alternative hypothesis: true mu is not equal to 0
## approximate (random-sampling/Monte Carlo)
oneway_test(colonies~place,data=ants,distribution=approximate(nresample=9999))
##
## Approximative Two-Sample Fisher-Pitman Permutation Test
##
## data: colonies by place (field, forest)
## Z = 2.1701, p-value = 0.0372
## alternative hypothesis: true mu is not equal to 0
Suppose we want to be careful as JD suggests and compute only the
values corresponding to the actual permutations. Make sure the
gtools
package is loaded and generate the combinations, as
in the original example:
ind_comb <- combinations(nrow(ants), sum(ants$place=="field"))
nrow(ind_comb) ## count combinations
## [1] 210
head(ind_comb) ## look at the first few
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 1 2 3 5
## [3,] 1 2 3 6
## [4,] 1 2 3 7
## [5,] 1 2 3 8
## [6,] 1 2 3 9
Now write two functions. The first, simfun()
, simulates
a randomized data set given inputs (in this case, the input is a list of
elements to be assigned to the “field” category). We take the
colonies
column from the original ants
data
set and arrange the field-assigned colony counts first, and the
non-field-assigned colony counts second.
simfun <- function(cc) {
transform(ants,colonies=c(colonies[cc],colonies[-cc]))
}
cc
is a set of indices, colonies[-cc]
selects all but those values from the colonies
vector.The second function, sumfun()
, takes a simulated data
set and returns whatever summary statistic we want. In this case I
decided to use the \(t\) statistic as
computed by R. (In many cases simple summary statistics can be computed
more efficiently by doing it by hand, but it’s often conceptually
clearer to run exactly the same test that we would have used in
the non-permutation analysis and extract the test statistic, which is
usually stored as a list element called “statistic”, from it.)
sumfun <- function(dat) {
t.test(colonies~place,data=dat,var.equal=TRUE)$statistic
}
ncomb <- nrow(ind_comb)
permdist <- numeric(ncomb)
## ind_comb[i,] is the ith row of the matrix of combinations
for (i in 1:ncomb) {
permdist[i] <- sumfun(simfun(ind_comb[i,]))
}
(this could also be done using R’s apply()
function).
What do we get, and how does it compare with the distribution we would
expect from classical statistics, which is a \(t\)-distribution with 8 degrees of
freedom?
obs_stat <- tt$statistic
hist(permdist,col="gray",breaks=30,freq=FALSE,main="")
curve(dt(x,df=tt$parameter),add=TRUE,col="red")
abline(v=obs_stat,col="blue")
One way to get the \(p\)-value:
(obs_pval <- 2*mean(permdist>=obs_stat))
## [1] 0.04761905
mean(permutations>=obs))
is a trick
to calculate the proportion: the logical statement returns a logical
(FALSE
/TRUE
) vector, which then gets converted
to a 0/1 vector when you ask R to take the mean, so this is equivalent
to counting the number of true values and dividing by the length …This gives just the single \(p\)-value, which we can compare with the \(p\)-value we got from the classical test (0.018)
Re-doing the ants example as done originally, but now reorganizing it
to use simfun()
and sumfun()
. You can skip
ahead to the reproductive-skew (regression) example if you like.
We’ll use the same sumfun()
as before, but we need to
define a new version of simfun()
. Because we are picking a
different value every time, we don’t need to keep track of which sample
we are on; simfun()
doesn’t need to take any arguments, and
we can use R’s replicate()
function to generate as many
permutation results as we want (this is basically the same as the
for()
loop we started above, just a little more
compact).
simfun_rsamp <- function() {
transform(ants,colonies=sample(colonies))
}
set.seed(101)
permdist_rsamp <- replicate(9999,sumfun(simfun_rsamp()))
The result isn’t quite the same as the exact value derived above, but it’s pretty close to the result we got before:
2*mean(permdist_rsamp>=obs_stat)
## [1] 0.04640464
If we want to switch test statistics, we only need to switch
sumfun()
. (with()
below is a shortcut so we
don’t have to use dat$
as often.)
sumfun_diffmean <- function(dat) {
with(dat,
mean(colonies[place=="field"])-mean(colonies[place=="forest"]))
}
sumfun_diffmean(ants) ## test
## [1] 3.75
permdist_diffmean <- apply(ind_comb,
MARGIN=1,function(x) sumfun_diffmean(simfun(x)))
2*mean(permdist_diffmean>=sumfun_diffmean(ants))
## [1] 0.04761905
This gives exactly the same result as the original approach, because there is a one-to-one relationship between differences between means and \(t\) statistics …
Some data from Holly Kindsvater on reproductive skew in fish (from Paczolt et al. (2015)):
skewdat <- read.csv("../data/skewdat.csv")
(ggplot(skewdat, aes(Size,skew))
+ geom_point()
+geom_smooth(method="lm", formula=y~x)
)
summary(lm(skew~Size,data=skewdat))
##
## Call:
## lm(formula = skew ~ Size, data = skewdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19909 -0.02180 0.01480 0.03348 0.10050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.266029 0.091030 -2.922 0.00667 **
## Size 0.004607 0.001957 2.354 0.02554 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06164 on 29 degrees of freedom
## Multiple R-squared: 0.1605, Adjusted R-squared: 0.1315
## F-statistic: 5.543 on 1 and 29 DF, p-value: 0.02554
Can we trust this regression? Let’s try a permutation test.
Since all the \(x\)
(Size
) values are unique, there are a total of 31!
(factorial) possible permutations, or \(8.2\times 10^{33}\), way too many to do by
brute force (insert calculations here about what fraction of the job we
will have done by the time the sun burns out …)
simfun_rsamp2 <- function(respvar="skew",data=skewdat) {
permdat <- data
permdat[[respvar]] <- sample(permdat[[respvar]])
permdat
}
sumfun_skew <- function(dat) {
coef(lm(skew~Size,data=dat))["Size"]
}
set.seed(101)
permdist_skew <- replicate(8000,sumfun_skew(simfun_rsamp2()))
(skew_pval <- mean(abs(permdist_skew)>=abs(sumfun_skew(skewdat))))
## [1] 0.024
The results are very close to the classical test result (before trying this with 8000 replicates, I tried a few times with 2000 replicates and found that the results varied between about 0.02 and 0.035 – maybe JD was right …)
We could also use lmPerm
for this:
summary(lmp(skew~Size,data=skewdat, maxIter=16000, Ca=1e-3))
## [1] "Settings: unique SS : numeric variables centered"
##
## Call:
## lmp(formula = skew ~ Size, data = skewdat, maxIter = 16000, Ca = 0.001)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19909 -0.02180 0.01479 0.03348 0.10050
##
## Coefficients:
## Estimate Iter Pr(Prob)
## Size 0.004607 16000 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06164 on 29 degrees of freedom
## Multiple R-Squared: 0.1605, Adjusted R-squared: 0.1315
## F-statistic: 5.543 on 1 and 29 DF, p-value: 0.02554
Or coin
:
independence_test(skew~Size,data=skewdat,teststat="scalar",
distribution="asymptotic")
##
## Asymptotic General Independence Test
##
## data: skew by Size
## Z = 2.1941, p-value = 0.02823
## alternative hypothesis: two.sided
independence_test(skew~Size,data=skewdat,teststat="scalar",
distribution=approximate(nresample=16000))
##
## Approximative General Independence Test
##
## data: skew by Size
## Z = 2.1941, p-value = 0.02531
## alternative hypothesis: two.sided
Since the standard error of an estimated proportion is \(\sqrt{p(1-p)/n}\), the coefficient of variation (ratio of the standard error to the mean estimate, \(p\)) is \(\sqrt{(1-p)/(pn)}\). Thus for an observed \(p\)-value, if we want to get the coefficient of variation down to a specified level \(c\) (say 5%, so the confidence intervals are approximately \(\pm\) 10% of the estimated \(p\)-value) then we need to take \(n\) large enough so that \(c = \sqrt{(1-p)/(pn)}\), or \(n \approx (1-p)/(p c^2)\); if \(p\) is small then this is further approximated by \(1/(p c^2)\) (e.g. for a \(p\)-value of 0.05 accurate within \(c=0.05\), we need \(1/(0.5 \cdot 0.5^2) = 1/(0.5^3) = 20^3 = 8000\) samples (slightly fewer since we have neglected the \(1-p\) term). If we wanted a similarly accurate value for our current answer, with a \(p\)-value about half as large, we would need twice as many samples.
independence_test(skew~Size,data=skewdat,teststat="scalar",
distribution=approximate(nresample=16000))
##
## Approximative General Independence Test
##
## data: skew by Size
## Z = 2.1941, p-value = 0.02431
## alternative hypothesis: two.sided