Distribution of Sample Means

In this activity, we will explore the distribution of sample means taken from a parent distribution.

Parent Distribution

We are going to draw samples from a normal distribution having a mean of \( \mu=100 \) and a standard deviation \( \sigma=10 \). Let's begin by drawing a picture of this distribution.

Let's first enter \( \mu=100 \) and \( \sigma=10 \).

mu = 100
sigma = 10

If we start at the mean \( \mu=100 \), then move three standard deviations \( \sigma=10 \) to the left, we arrive at the left endpoint 70. If we start at the mean \( \mu=100 \), then move three standard deviations \( \sigma=10 \) to the right, we arrive at the right endpoint 130. We can use R's seq command to generate \( x \)-values ranging from 70 to 130. By using length=200, we generate 200 equally spaced numbers starting at 70 and ending at 130.

x = seq(70, 130, length = 200)

Next, we use the probability density function for the normal distribution to generate the \( y \)-values of our desired normal curve. Recall that this command takes the form dnorm(x,mu,sigma), where we use the \( x \)-values generated above with mu=100 and sigma=10.

y = dnorm(x, mu, sigma)
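
As a quick sanity check, the values returned by dnorm can be compared with the normal density formula \( f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)} \) evaluated by hand. The sketch below repeats the definitions above so that it runs on its own; the name y_formula is ours, introduced only for illustration.

```r
mu = 100
sigma = 10
x = seq(70, 130, length = 200)

# Normal density computed directly from the formula
y_formula = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Compare with R's built-in density function; this should report TRUE
all.equal(dnorm(x, mu, sigma), y_formula)
```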

We can now use R's plot command to generate the normal density curve.

plot(x,y,type="l", col="red",
     main=paste("Mean = ",mu, ", Standard Deviation = ", sigma),
     xlab="X",
     ylab="Density")

Note that the “balance point” of the distribution occurs at 100, which agrees with the fact that the mean of this distribution is \( \mu=100 \). Secondly, note that almost all of the distribution lies within three standard deviations \( \sigma=10 \) of the mean, that is, between 70 and 130.
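
To quantify how much of the distribution lies within three standard deviations of the mean, one could use pnorm, R's cumulative distribution function for the normal distribution, to compute the area under the curve between 70 and 130. This is a small sketch; the variable name area is ours.

```r
mu = 100
sigma = 10

# Area under the normal curve between mu - 3*sigma and mu + 3*sigma
area = pnorm(mu + 3 * sigma, mu, sigma) - pnorm(mu - 3 * sigma, mu, sigma)
round(area, 4)
## [1] 0.9973
```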

Drawing a Huge Sample from the Distribution

Now, how do we generate a random sample from the normal distribution having mean \( \mu=100 \) and standard deviation \( \sigma=10 \)? The answer is, “Use rnorm”. Let's begin by drawing 10,000 numbers from this distribution, then sketching a histogram of the result. To accomplish this task, we use rnorm(n,mu,sigma), where n=10000, mu=100, and sigma=10.

n = 10000
mu = 100
sigma = 10
x = rnorm(n, mu, sigma)

Now we use the hist command to create a histogram of the sample. Setting prob=TRUE creates a density histogram, which allows us to compare it with the density curve shown above.

hist(x, prob = TRUE)

Note that the resulting histogram has a normal shape and a balance point at 100, which agrees nicely with the fact that we are drawing from a normal distribution having a mean \( \mu=100 \).
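
One way to make this comparison explicit is to draw the theoretical density curve on top of a density histogram. The sketch below assumes the same parent distribution and uses curve with add = TRUE to overlay dnorm on the existing plot; the name sample_x is ours, chosen so that the sketch does not overwrite x.

```r
mu = 100
sigma = 10
sample_x = rnorm(10000, mu, sigma)

# Density histogram of the sample
hist(sample_x, prob = TRUE)

# Overlay the parent density; curve() evaluates the expression in terms of x
curve(dnorm(x, mu, sigma), add = TRUE, col = "red")
```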

We can use the xlim argument to force limits on the \( x \)-axis. This will be quite helpful when comparing plots with different standard deviations.

hist(x, prob = TRUE, xlim = c(70, 130))

Drawing Samples of Size \( n=5 \) and Calculating their Mean

The goal in this section is to draw samples of size \( n=5 \) from the parent distribution (normal with mean \( \mu=100 \) and standard deviation \( \sigma=10 \)), calculate the mean of each sample, then determine how these sample means are distributed. Let's begin by using rnorm(n,mu,sigma) to draw a sample of size \( n=5 \) from the parent distribution.

x = rnorm(5, 100, 10)
x
## [1]  82.59 103.47 115.48 102.83  78.04

Note how this command drew five random numbers from our parent distribution. Next, we want to find the mean of these five numbers.

mu5 = mean(x)
mu5
## [1] 96.48

That seems pretty simple. Let's draw another sample of five numbers and calculate its mean.

x = rnorm(5, 100, 10)
mu5 = mean(x)
x
## [1] 113.93  86.55 102.05 113.47  97.93
mu5
## [1] 102.8

We could shorten the process with the following code.

mu5 = mean(rnorm(5, 100, 10))
mu5
## [1] 93.42

Note that because we are drawing random samples, we get a different mean for each new sample.
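
If reproducible results are wanted, for example when sharing the activity with others, one option is to call set.seed before drawing. This sketch uses an arbitrary seed value of 123 (our choice, not part of the activity); resetting the same seed reproduces the same sample and hence the same mean.

```r
set.seed(123)                     # arbitrary seed, chosen for illustration
first = mean(rnorm(5, 100, 10))

set.seed(123)                     # reset to the same seed
second = mean(rnorm(5, 100, 10))

# The two means agree because the same random numbers were drawn
identical(first, second)
## [1] TRUE
```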

Pretty straightforward. But what we really want to do is to repeat this process many times, then use the results to determine the distribution of these sample means.

Replicating the Process

R's replicate command will allow us to simplify the process. For example, suppose that we would like to repeat the above process six times. We would then write the following code.

xbar_5 = replicate(6, mean(rnorm(5, 100, 10)))
xbar_5
## [1] 101.45 100.50  95.27  92.00  95.44 107.17

Note that the replicate command repeated the process six times. That is, it drew six random samples of size \( n=5 \) from the parent normal distribution (\( \mu=100 \) and \( \sigma=10 \)), then calculated the mean of each sample.
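
To see what replicate is doing, it may help to compare it with an explicit for loop. The sketch below fixes a seed (an arbitrary value of our choosing) so that both approaches draw the same random numbers; the names via_replicate and via_loop are ours.

```r
set.seed(42)                      # arbitrary seed, for illustration only
via_replicate = replicate(6, mean(rnorm(5, 100, 10)))

set.seed(42)                      # same seed, so the same draws are made
via_loop = numeric(6)
for (i in 1:6) {
  # Draw a sample of size 5 and store its mean
  via_loop[i] = mean(rnorm(5, 100, 10))
}

# Check that both vectors hold the same six sample means
all.equal(via_replicate, via_loop)
```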

Next, let's repeat the process 500 times, then sketch a histogram of the means of the random samples.

xbar_5 = replicate(500, mean(rnorm(5, 100, 10)))
hist(xbar_5, prob = TRUE)

Let's try that again. Remember, each time we are getting 500 new sample means, so the picture changes each time we run our code. This time, let's calculate the mean and standard deviation of the 500 sample means and add this information to the title of the histogram. We'll also use R's round command to round the results to two decimal places.

xbar_5 = replicate(500, mean(rnorm(5, 100, 10)))
mu_5 = round(mean(xbar_5), 2)
sd_5 = round(sd(xbar_5), 2)
hist(xbar_5, prob = TRUE, main = paste("mean = ", mu_5, ", sd = ", sd_5))

This picture approximates the distribution of \( \overline{X} \), the mean of a sample of size \( n=5 \) drawn from the parent distribution.

Comparing the Parent Distribution with the Distribution of Sample Means

We're now going to repeat the process above, and sketch the parent distribution and the distribution of sample means in a side-by-side image. In order to help with the comparison, we will set breaks=12 and xlim=c(70,130).

par(mfrow=c(1,2))
x=rnorm(10000,100,10)
hist(x,
     prob=TRUE,
     breaks=12,
     xlim=c(70,130),
     main="mean = 100, sd = 10")
xbar_5 <- replicate(500,mean(rnorm(5,100,10)))
mu_5=round(mean(xbar_5),2)
sd_5=round(sd(xbar_5),2)
hist(xbar_5,
     prob=TRUE,
     breaks=12,
     xlim=c(70,130),
     main=paste("mean = ",mu_5,", sd = ",sd_5))

par(mfrow=c(1,1))

Note that the mean of the distribution of sample means closely matches the mean of the parent distribution, but the standard deviation of the distribution of sample means is much smaller than 10, the standard deviation of the parent distribution.

Increasing the Sample Size

We're now going to repeat the above process six times, each time with a different sample size. We will start by drawing samples of size \( n=5 \), then \( n=10 \), 20, 30, 40, and 50.

The same steps are used for each sample size. We arrange the histograms in three rows and two columns for purposes of comparison.

par(mfrow = c(3, 2))

xbar_5 <- replicate(500, mean(rnorm(5, 100, 10)))
mu_5 = round(mean(xbar_5), 2)
sd_5 = round(sd(xbar_5), 2)
hist(xbar_5, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ", 
    mu_5, ", sd = ", sd_5))

xbar_10 <- replicate(500, mean(rnorm(10, 100, 10)))
mu_10 = round(mean(xbar_10), 2)
sd_10 = round(sd(xbar_10), 2)
hist(xbar_10, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ", 
    mu_10, ", sd = ", sd_10))

xbar_20 <- replicate(500, mean(rnorm(20, 100, 10)))
mu_20 = round(mean(xbar_20), 2)
sd_20 = round(sd(xbar_20), 2)
hist(xbar_20, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ", 
    mu_20, ", sd = ", sd_20))

xbar_30 <- replicate(500, mean(rnorm(30, 100, 10)))
mu_30 = round(mean(xbar_30), 2)
sd_30 = round(sd(xbar_30), 2)
hist(xbar_30, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ",
    mu_30, ", sd = ", sd_30))

xbar_40 <- replicate(500, mean(rnorm(40, 100, 10)))
mu_40 = round(mean(xbar_40), 2)
sd_40 = round(sd(xbar_40), 2)
hist(xbar_40, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ", 
    mu_40, ", sd = ", sd_40))

xbar_50 <- replicate(500, mean(rnorm(50, 100, 10)))
mu_50 = round(mean(xbar_50), 2)
sd_50 = round(sd(xbar_50), 2)
hist(xbar_50, prob = TRUE, breaks = 12, xlim = c(70, 130), main = paste("mean = ", 
    mu_50, ", sd = ", sd_50))

par(mfrow = c(1, 1))
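
The six blocks above differ only in the sample size, so the same figure could be produced with a loop. The following is one possible sketch, not the activity's own code; the names sizes, sds, and xbar are ours.

```r
sizes = c(5, 10, 20, 30, 40, 50)
sds = numeric(length(sizes))      # sd of the sample means, one entry per size

par(mfrow = c(3, 2))
for (i in seq_along(sizes)) {
  # 500 sample means, each from a sample of size sizes[i]
  xbar = replicate(500, mean(rnorm(sizes[i], 100, 10)))
  sds[i] = round(sd(xbar), 2)
  hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130),
       main = paste("n = ", sizes[i], ", mean = ", round(mean(xbar), 2),
                    ", sd = ", sds[i]))
}
par(mfrow = c(1, 1))
```

A loop of this kind keeps each title tied to the sample it describes and collects the standard deviations in a single vector.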

Some Intermediate Conclusions

Looking at our last image, two conclusions are drawn:

  1. Each histogram of sample means appears to have the same mean as the parent distribution, namely 100.

  2. As the sample sizes grow larger, that is \( n=5 \), 10, 20, 30, 40, and 50, the distribution of sample means becomes narrower. That is, the standard deviation of the sample means decreases.

Relation Between Sample Size and Standard Deviation of Distribution of Sample Means

In the histogram images above, we see that as we increase the size of samples drawn from the parent distribution, the distribution of sample means narrows in width, that is, the standard deviation of the distribution of sample means gets smaller and smaller. Can we take an even closer look at this relationship?

Let's start by entering the sample sizes, then the standard deviation of the distribution of sample means for each sample size.

n = c(5, 10, 20, 30, 40, 50)
s = c(sd_5, sd_10, sd_20, sd_30, sd_40, sd_50)
n
## [1]  5 10 20 30 40 50
s
## [1] 4.61 3.20 2.22 1.92 1.63 1.42

Next, let's provide a scatterplot of this data.

plot(n,s,
     main="Sample Standard Deviations vs Sample Size",
     xlab="Sample Size",
     ylab="Sample Standard Deviation")

In the scatterplot, we see that as the sample size increases, the standard deviation of the distribution of sample means decreases. This agrees nicely with the histogram images above, where we see that as the sample size increases, the width of the histogram representing the distribution of sample means decreases.

However, the scatterplot does not show a linear association. Indeed, the scatterplot resembles some sort of curve, that is, a nonlinear association.

As a second attempt at a scatterplot, let's try plotting the logarithm of the sample standard deviations versus the logarithm of the sample size.

plot(log(n),log(s),
     main="Logarithm of Standard Deviation vs Logarithm of Sample Size",
     xlab="Logarithm of Sample Size",
     ylab="Logarithm of Sample Standard Deviation")

Aha! Interesting! There now seems to be a very strong negative linear association between the logarithm of the sample standard deviations and the logarithm of the sample sizes. Let's try to fit a line of best fit to these points.

plot(log(n),log(s),
     main="Logarithm of Standard Deviation vs Logarithm of Sample Size",
     xlab="Logarithm of Sample Size",
     ylab="Logarithm of Sample Standard Deviation")
res=lm(log(s)~log(n))
abline(res)

Nice! Now, what is the equation of the line of best fit? The line drawn by abline(res) has the form \( y=a+bx \), where \( a \) is the intercept and \( b \) is the slope of the line of best fit. Let's show these values.

res
## 
## Call:
## lm(formula = log(s) ~ log(n))
## 
## Coefficients:
## (Intercept)       log(n)  
##       2.325       -0.501

To perform some of the following computations, let's store the coefficients in the variables a and b.

a = coef(res)[1]
b = coef(res)[2]
a
## (Intercept) 
##       2.325
b
## log(n) 
## -0.501

Thus, the equation of the line of best fit is:

\[ y=2.3255-0.501x \]

However, we are not using the variables \( x \) and \( y \) on our horizontal and vertical axes. Our horizontal axis is \( \log n \) and our vertical axis is \( \log s \), so we must replace \( x \) and \( y \) with these variables. Hence, the equation of our line really is:

\[ \log s=2.3255 -0.501\log n \]

Next, if we take the exponential of both sides of this last equation, we get:

\[ \begin{aligned} e^{\log s}&=e^{2.3255 -0.501\log n}\\ s&=e^{2.3255}e^{-0.501\log n}\\ s&=10.2314e^{\log n^{-0.501}}\\ s&=10.2314n^{-0.501}\\ s&=\frac{10.2314}{n^{0.501}} \end{aligned} \]

Note that this final result is very nearly the same as

\[ s=\frac{10}{n^{1/2}}, \]

which is the same as:

\[ s=\frac{10}{\sqrt{n}}. \]
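
To see how close the fitted curve is to \( 10/\sqrt{n} \), one could tabulate both at the six sample sizes. The sketch below re-enters the standard deviations printed earlier so that it runs on its own; the names fitted_s and theory_s are ours.

```r
n = c(5, 10, 20, 30, 40, 50)
s = c(4.61, 3.20, 2.22, 1.92, 1.63, 1.42)   # values printed earlier in the activity

res = lm(log(s) ~ log(n))
a = unname(coef(res)[1])          # intercept on the log-log scale
b = unname(coef(res)[2])          # slope on the log-log scale

fitted_s = exp(a) * n^b           # curve recovered from the fitted line
theory_s = 10 / sqrt(n)           # parent standard deviation over sqrt(n)

# Side-by-side comparison of observed, fitted, and theoretical values
round(data.frame(n, s, fitted_s, theory_s), 2)
```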

We've done a lot of calculations, so we may not remember that 10 was the standard deviation of the parent population (check at the beginning of the activity). What this last statement suggests is that the standard deviation of the distribution of sample means equals the standard deviation of the parent population from which the samples are drawn, divided by the square root of the sample size \( n \). In symbols, we would write:

\[ \sigma_{\,\overline{X}}=\frac{\sigma_{\,X}}{\sqrt{n}} \]

Final Conclusions

Let \( X \) represent the parent distribution from which we are drawing samples. Let \( \mu_X \) represent the mean of the parent distribution and let \( \sigma_X \) represent the standard deviation of the parent distribution.

Secondly, let \( n \) represent the size of each sample drawn from the parent distribution and let \( \overline{X} \) represent the mean of the sample. Here are the facts drawn from this activity.

  1. When the parent distribution from which we draw our samples is normal, then the distribution of sample means is also normal.
  2. The mean of the distribution of sample means equals the mean of the parent distribution. In symbols, we write: \[ \mu_{\,\overline{X}}=\mu_{\,X} \]
  3. The standard deviation of the distribution of sample means is found by dividing the standard deviation of the parent distribution by the square root of the sample size. In symbols, we write: \[ \sigma_{\,\overline{X}}=\frac{\sigma_{\,X}}{\sqrt{n}} \]
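
These facts can be spot-checked by simulation for other parent parameters and sample sizes. The helper below, check_sample_means, is a hypothetical function of our own (not part of R), and the values \( \mu=50 \), \( \sigma=8 \), and \( n=16 \) are arbitrary choices for illustration.

```r
# Hypothetical helper: simulate sample means and compare with the stated facts
check_sample_means = function(n, mu, sigma, reps = 5000) {
  xbar = replicate(reps, mean(rnorm(n, mu, sigma)))
  c(simulated_mean = mean(xbar),
    theory_mean    = mu,
    simulated_sd   = sd(xbar),
    theory_sd      = sigma / sqrt(n))
}

# Arbitrary example: normal parent with mean 50 and standard deviation 8
result = check_sample_means(n = 16, mu = 50, sigma = 8)
round(result, 2)
```

The simulated values should land close to the theoretical ones, though they will differ slightly from run to run. A normal quantile plot of the simulated means (qqnorm followed by qqline) is one way to examine the first fact visually.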