1 The basics of Simulations in R

1.1 What do I mean by simulate? Why Monte Carlo?

1.2 Our first example

Let us simulate 10000 observations from a normally distributed variable with a mean of 5 and a standard deviation of 2. In other words: \(\sim N(\mu = 5, \sigma = 2)\)

We will use the rnorm function, which generates random values from a normal distribution.
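One thing to keep in mind: because rnorm generates (pseudo-)random draws, the numbers you get will differ a little from the ones shown here each time you run the code. If you want your simulation to be exactly reproducible, you can set the random seed first (the seed value below is arbitrary; any integer will do):

set.seed(720) # fixes the sequence of random draws so your results are reproducible from run to run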

norm_simulated <- rnorm(n = 10000, mean  = 5, sd = 2)

par(mfrow=c(1,1))

plot(norm_simulated, 
     pch = 20, col = rgb(0.5,0,0.5,0.25),
     ylab = "simulated values")

hist(norm_simulated, 
     freq = F,
     xlab = "simulated values",
     main = "histogram of simulated values")

curve(dnorm(x, mean = 5, sd = 2), from = 0, to = 10, 
      add = T, col = "red")

This shows the \(\sim N(\mu = 5, \sigma = 2)\) density superimposed (red) over the histogram of simulated values.

1.3 Look at summary statistics (mean and sd) from this

mean(norm_simulated)
## [1] 5.01
sd(norm_simulated)
## [1] 2

1.4 Repeat this a few times. What do you observe?

Let’s repeat a few times, with a more modest sample size.

norm_simulated1 <- rnorm(n = 100, mean  = 5, sd = 2)
norm_simulated2 <- rnorm(n = 100, mean  = 5, sd = 2)
norm_simulated3 <- rnorm(n = 100, mean  = 5, sd = 2)
norm_simulated4 <- rnorm(n = 100, mean  = 5, sd = 2)

par(mfrow = c(2,2))
hist(norm_simulated1)
hist(norm_simulated2)
hist(norm_simulated3)
hist(norm_simulated4)

par(mfrow = c(1,1))

mean(norm_simulated1)
## [1] 5.11
mean(norm_simulated2)
## [1] 4.76
mean(norm_simulated3)
## [1] 5.13
mean(norm_simulated4)
## [1] 5.19

So for each iteration of the simulation, we have sampled 100 observations from a population with a normal distribution, \(\sim N(\mu = 5, \sigma = 2)\). If we wanted to change the number of observations per iteration, we would simply change n. Similarly, we could change the value for the mean (or make it some function, like a line), the standard deviation, or the type of probability distribution we are sampling from.
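For example (just a quick sketch, not part of the exercise below), sampling from a different distribution only means swapping in a different random-generation function; the object names here are purely illustrative:

pois_simulated <- rpois(n = 100, lambda = 5) # 100 draws from a Poisson distribution with mean 5
unif_simulated <- runif(n = 100, min = 0, max = 10) # 100 draws from a uniform distribution on (0, 10)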

1.4.1 Please simulate 30 observations from

\(\sim N(\mu = 10, \sigma = 3)\) 5 times.

1.5 Performing the simulation more efficiently

Of course performing this one iteration at a time is not particularly useful, so instead we find a way to repeat it efficiently.

One way to do this in R is to use the replicate function, which just repeats the function call a specified number of times. Notice that replicate has its own n argument (n = 2000 below), which is how many times to repeat the call to the function (in this case rnorm). The n = 100 inside rnorm is how many samples are drawn from the population (distribution) in each iteration.

Let's do this again, but repeat the sampling process (the number of iterations) 2000 times, sampling from \(\sim N(\mu = 7, \sigma = 1.8)\).

norm_sim_all_3 <- replicate(n=2000, 
                            rnorm(n = 100, mean = 7, sd = 1.8)) 
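Because each call to rnorm returns a vector of 100 values, replicate collects the results into a matrix with one column per iteration, which we can confirm with dim:

dim(norm_sim_all_3) # 100 rows (observations per iteration) by 2000 columns (iterations)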

We can ask about the spread of the means from our simulated trials. In other words, for each of the 2000 iterations of the simulation, what is the mean of the 100 values we simulated? How might you code this?

hist(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean),
     main = "distribution of means from simulation",
     xlab = "mean values")

# standard deviation among sampled means (the standard error of the mean)
sd(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean))
## [1] 0.177
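As an aside, base R also has a convenient shortcut for column means, colMeans, which gives the same result as the apply call above (the object name sim_means_3 here is just for illustration):

sim_means_3 <- colMeans(norm_sim_all_3) # same as apply(norm_sim_all_3, MARGIN = 2, FUN = mean)
sd(sim_means_3) # standard deviation among the sampled means, as above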

Let's repeat this experiment with a lower sample size (say 25), this time drawing from \(\sim N(\mu = 5, \sigma = 2)\).

norm_sim_all_4 <- replicate(n = 2000, 
                            rnorm(n = 25, mean = 5,sd = 2))

# standard deviation among sampled means (the standard error of the mean)
sd(apply(norm_sim_all_4,2, mean))
## [1] 0.397
hist(apply(norm_sim_all_4,2,mean),
     main = "distribution of means from simulation",
     xlab = "mean values")

How about if we have a much larger sample size (say 10000)?

norm_sim_all_5 <- replicate(n = 2000,
                            rnorm(10000, 5, 2))


sd(apply(norm_sim_all_5, 2, mean))
## [1] 0.02
hist(apply(norm_sim_all_5, 2, mean),
     main = "distribution of means from simulation",
     xlab = "mean values")

2 Do you see a pattern for the precision of the mean across repeated simulated samples?

sd(apply(norm_sim_all_4, 2, mean)) # n=25
## [1] 0.397
sd(apply(norm_sim_all_3, 2, mean)) # n=100
## [1] 0.177
sd(apply(norm_sim_all_5, 2, mean)) # n=10000
## [1] 0.02

Notice that as the sample size (for a given simulated sample) gets larger, the variation among the estimated means (the sd among the means) gets smaller!

We have just learned about Monte Carlo methods. When we simulate from a distribution, our estimates get closer to the expected value (which in this case we provided) as our n (sample size) increases. That is the law of large numbers.

This hopefully reinforces why the sample size influences the estimated standard error so clearly!
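For comparison, the analytical standard error of the mean is \(\sigma/\sqrt{n}\), and it lines up closely with the simulated values above (recall that norm_sim_all_3 used \(\sigma = 1.8\), while the other two used \(\sigma = 2\)):

\[
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}: \qquad \frac{2}{\sqrt{25}} = 0.4, \qquad \frac{1.8}{\sqrt{100}} = 0.18, \qquad \frac{2}{\sqrt{10000}} = 0.02
\]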

2.1 Using for loops instead

We can also run such simulations using for loops, which can be more efficient when the number of simulations (or parameters being estimated) is large.

3 How many iterations of the simulation do we want to do? We will call this “N”

N <- 2000

First, we initialize a variable to store the mean values we are going to estimate from each iteration of the simulation.

simulated_means <- rep(NA, N) # Create an empty vector of length N

head(simulated_means)  # Filled with "missing data" until we fill it
## [1] NA NA NA NA NA NA

Now we use the for loop to repeat this from i = 1 until i = N (which is 2000 in this case).

for (i in 1:N) {
    sim_data <- rnorm(n = 100, mean = 5, sd = 2)
    simulated_means[i] <- mean(sim_data)
    rm(sim_data) # sim_data gets over-written each iteration anyway; this just removes it after the final one
}

The vector “simulated_means” now contains all of the means from the N iterations of the simulation.

hist(simulated_means)

sd(simulated_means)
## [1] 0.194
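As a quick check, this again lines up with the analytical standard error for this design: \(2/\sqrt{100} = 0.2\).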

3.1 Simulations for a simple linear regression

Instead of simulating just a mean and sd, let's specify something more interesting… how about a linear regression!

Let's have a regression \(Y \sim N(\mu = a + b*x, \sigma = sd)\).

Specifically with an intercept of 5 (a = 5) and a slope of 0.7 (b = 0.7).

\[Y \sim N(\mu = 5 + 0.7*x, \sigma = sd)\]

a <- 5
b <- 0.7
x <- seq(2, 20) # our predictor variable
y_fixed <- a + b*x  # deterministic part of our model

par(mfrow = c(1, 1))
plot(y_fixed ~ x, pch = 20, cex = 1.5, col = "red")  # The deterministic relationship.

But we want our response to be randomly sampled (conditioned on x), so we need to incorporate this stochastic component. Let's say sd = 2.5.

\[Y \sim N(\mu = 5 + 0.7*x, \sigma = 2.5)\]

y_sim_1 <- rnorm(length(x), mean = y_fixed, sd = 2.5) 

So we have now simulated values of our response variable y given a deterministic and stochastic component to the model!

par(mfrow = c(1, 1))
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")
abline(a = 5, b = 0.7, lwd = 2) # Expected relationship based on the "known" parameters we used.

The line is the relationship based on the “known” parameters we used. The red points are the simulated values of \(y\) we just generated.

But what are the parameter estimates for the regression, based on the simulated data (for y)? We can use the lm function to run a linear regression.

y_sim_1_lm <- lm(y_sim_1 ~ x)

 # notice the parameter estimates (the residual standard error is checked a bit further below)

coef(y_sim_1_lm)
## (Intercept)           x 
##       3.740       0.849
confint(y_sim_1_lm) # does it include our expected (known) parameter values?
##             2.5 % 97.5 %
## (Intercept) 1.089   6.39
## x           0.633   1.06
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")

abline(a = 5, b = 0.7, lwd = 2) 
abline(reg=y_sim_1_lm, lty = 2, col = "purple", lwd = 2) # estimated values based on simulated data.

Not identical, but similar to the “true” relationship (which is known because we chose it). The point is, we have now done a little simulation of our regression model! We have the R tools we need to really begin learning about statistics and uncertainty!
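As a quick additional check on the stochastic part of the model, we can pull the residual standard error out of the fitted model object; it estimates the \(\sigma = 2.5\) we simulated with (your exact value will differ, since the data are random draws).

summary(y_sim_1_lm)$sigma # estimated residual standard deviation; should be in the neighbourhood of 2.5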

Let's repeat this simulation a few times and add the fitted lines to the plot. The simulated fits are in purple; the “true” relationship is in black.

y_sim_2 <- rnorm(length(x), mean = y_fixed, sd = 2.5) 
y_sim_2_lm <- lm(y_sim_2 ~ x)

y_sim_3 <- rnorm(length(x), mean = y_fixed, sd = 2.5) 
y_sim_3_lm <- lm(y_sim_3 ~ x)


plot(y_sim_1 ~ x, 
     col = "red", type = "n",
     ylab = "y (simulated)")

abline(a = 5, b = 0.7, lwd = 2)
abline(reg = y_sim_1_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_2_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_3_lm, lty = 2, col = "purple", lwd = 2)

In future tutorials we will show how to make this more efficient, so we can generate many simulations over a range of plausible values.