When Can We Approximate the Binomial Distribution with a Normal Distribution?

The purpose of this activity is to determine when it is permissible to approximate a binomial distribution with a normal distribution. Aliaga claims that it is permissible to approximate a binomial distribution with a normal distribution if and only if \(np\ge 5\) and \(nq\ge 5\). Let’s begin by reminding our readers of the binomial parameters and their meanings.

\[\begin{align*} n&=\text{Number of trials}\\ p&=\text{Probability of success}\\ q&=\text{Probability of failure} \end{align*}\]

We then let the variable \(X\) represent the number of successes in \(n\) trials (for example, the number of heads in 10 tosses).

Further, if \(X\) is a binomial random variable, where \(n\) is the number of trials, \(p\) is the probability of success on any trial, and \(q\) is the probability of failure on any one trial, then the mean of the binomial distribution is \[\mu=np,\] the variance is \[\sigma^2=npq,\] and the standard deviation is \[\sigma=\sqrt{npq}.\]
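As a quick worked illustration of these formulas (our own, using the coin-toss setting mentioned above and assuming a fair coin, so that \(p=q=0.5\)), ten tosses give

\[\begin{align*} \mu&=np=(10)(0.5)=5\\ \sigma^2&=npq=(10)(0.5)(0.5)=2.5\\ \sigma&=\sqrt{2.5}\approx 1.58 \end{align*}\]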

Now, we’ll examine the Aliaga rule (\(np\ge 5\) and \(nq\ge 5\)) for a number of examples.
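The rule itself is easy to encode in R. The following is a minimal sketch only; the function name passes_aliaga is our own invention and is not part of R or of Aliaga's text.

```r
# Hypothetical helper (not a built-in R function): returns TRUE when
# both np >= 5 and nq >= 5 hold, where q = 1 - p.
passes_aliaga = function(n, p) {
  q = 1 - p
  (n * p >= 5) && (n * q >= 5)
}

passes_aliaga(10, 0.1)   # np = 1, nq = 9
passes_aliaga(10, 0.5)   # np = 5, nq = 5
```

The two calls use the parameter values of the two examples that follow; the first should report a failure of the rule and the second a pass.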

Example: Fails Aliaga Rule

For our first example, we’ll let the number of trials be \(n=10\) and the probability of success on any one trial be \(p=0.1\). Enter \(n=10\), \(p=0.1\), and \(q=1-p=0.9\).

n=10
p=0.1
q=1-p

Because there are \(n=10\) trials, the variable \(X\), which represents the number of successes in \(n=10\) trials, can take on each of the numbers 0, 1, 2, …, 10. We can calculate the probability \(Pr(X=k)\) for \(k=0, 1, 2, \ldots, 10\) using R’s dbinom(k,n,p) command.

x=0:10
y=dbinom(x,n,p)
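As a sanity check (our own addition, not part of the original activity), we can compare the output of dbinom with the standard binomial formula \(\binom{n}{k}p^kq^{n-k}\) computed by hand, and confirm that the probabilities sum to 1. The name y_formula is ours.

```r
n = 10
p = 0.1
q = 1 - p
x = 0:10
y = dbinom(x, n, p)

# The binomial formula written out directly: C(n,k) * p^k * q^(n-k)
y_formula = choose(n, x) * p^x * q^(n - x)

all.equal(y, y_formula)   # TRUE when the two agree up to rounding error
sum(y)                    # the eleven probabilities should add up to 1
```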

We can now create a stickplot.

plot(x,y,type="h",lwd=2,col="red")

Clearly, this distribution is not normal in shape as it is highly skewed to the right. Moreover, when we apply the Aliaga test:

\[ \begin{align*} np&=(10)(0.1)=1\\ nq&=(10)(0.9)=9 \end{align*}\]

Because \(np=1<5\), it is not the case that both \(np\ge 5\) and \(nq\ge 5\), so this example fails the Aliaga rule. Nevertheless, let’s try to add an appropriate normal curve to our plot. We first calculate the mean and standard deviation of the binomial distribution using \(\mu=np\) and \(\sigma=\sqrt{npq}\).

mu=n*p
s=sqrt(n*p*q)
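For reference, with \(n=10\), \(p=0.1\), and \(q=0.9\), these two commands work out to

\[\begin{align*} \mu&=np=(10)(0.1)=1\\ \sigma&=\sqrt{npq}=\sqrt{(10)(0.1)(0.9)}=\sqrt{0.9}\approx 0.949 \end{align*}\]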

We will now redraw our binomial distribution, then add a normal distribution with mean mu and standard deviation s.

plot(x,y,type="h",lwd=2,col="red")
xx=seq(0,10,length=200)
yy=dnorm(xx,mu,s)
lines(xx,yy,lwd=2,col="blue")

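As evidenced in our last plot, the normal curve is not a good fit. Indeed, part of the normal curve is not even present in the picture: its peak rises above the top of the plotting region, and its left tail extends below \(x=0\), where \(X\) cannot take any values.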
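To see how much of the normal curve is missing, here is a rough sketch (our own addition) that computes the area under the curve to the left of 0 and redraws the picture on wider axes. The name below_zero and the particular axis limits are our choices.

```r
n = 10
p = 0.1
q = 1 - p
mu = n*p
s = sqrt(n*p*q)

# Area under the fitted normal curve to the left of 0,
# a region where the binomial variable has no probability at all
below_zero = pnorm(0, mu, s)
below_zero

# Redraw with wider limits so the peak and the left tail are both visible
x = 0:10
y = dbinom(x, n, p)
plot(x, y, type="h", lwd=2, col="red", xlim=c(-3, 10), ylim=c(0, 0.45))
xx = seq(-3, 10, length=200)
lines(xx, dnorm(xx, mu, s), lwd=2, col="blue")
```

Roughly 15% of the area under the normal curve lies to the left of 0, which is one way to quantify how poorly the curve matches this binomial distribution.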

Let’s now find the exact probability \(Pr(X\le1)\) using R’s pbinom(x,n,p) command.

pbinom(1,n,p)
## [1] 0.7360989

Now let’s see what the normal approximation of the probability \(Pr(X\le 1)\) is using R’s pnorm(x,mu,sd) command.

pnorm(1,mu,s)
## [1] 0.5

Note that the approximation (0.5) is not even close to the exact probability (about 0.736). This is consistent with the Aliaga rule: because this example fails the test (\(np\ge 5\) and \(nq\ge 5\)), we should not approximate this binomial distribution with a normal distribution.
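The same comparison can be made at every value of \(k\), not just \(k=1\). The sketch below (our own addition; the names exact, approx_cdf, and err are ours) tabulates the exact probability \(Pr(X\le k)\) next to the normal value computed in the same way as above.

```r
n = 10
p = 0.1
q = 1 - p
mu = n*p
s = sqrt(n*p*q)

k = 0:n
exact = pbinom(k, n, p)        # exact binomial Pr(X <= k)
approx_cdf = pnorm(k, mu, s)   # normal value at the same k
err = abs(exact - approx_cdf)  # size of the disagreement

round(data.frame(k, exact, approx_cdf, err), 4)
max(err)
```

The row for \(k=1\) reproduces the two numbers found above, and the err column shows how far apart the two distributions are elsewhere.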

Example: Barely Passes Aliaga Rule

Let’s look at a second example. This time we’ll let the number of trials equal \(n=10\) again, but raise the probability of success on any individual trial to \(p=0.5\). Note that this time, both portions of Aliaga’s rule (\(np\ge 5\) and \(nq\ge 5\)) are satisfied.

\[\begin{align*} np&=(10)(0.5)=5\\ nq&=(10)(0.5)=5 \end{align*}\]

Now, let’s see how things work this time. Enter \(n=10\), \(p=0.5\), and \(q=1-p=0.5\).

n=10
p=0.5
q=1-p

Because there are \(n=10\) trials, the variable \(X\), which represents the number of successes in \(n=10\) trials, can take on each of the numbers 0, 1, 2, …, 10. We can calculate the probability \(Pr(X=k)\) for \(k=0, 1, 2, \ldots, 10\) using R’s dbinom(k,n,p) command.

x=0:10
y=dbinom(x,n,p)

We can now create a stickplot.

plot(x,y,type="h",lwd=2,col="red")

This time, the binomial distribution is symmetric and roughly bell-shaped, much like a normal curve. We will now redraw our binomial distribution, then add a normal distribution with mean \(\mu=np\) and standard deviation \(\sigma=\sqrt{npq}\), which we represent with the variables mu and s.

plot(x,y,type="h",lwd=2,col="red")
mu=n*p
s=sqrt(n*p*q)
xx=seq(0,10,length=200)
yy=dnorm(xx,mu,s)
lines(xx,yy,lwd=2,col="blue")
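One way to put a number on how well the curve follows the sticks is to compare the height of the normal curve at each integer with the corresponding binomial probability. This is a rough sketch of our own, not part of the original activity; the data frame name fit and its column names are our choices.

```r
n = 10
p = 0.5
q = 1 - p
x = 0:10
y = dbinom(x, n, p)
mu = n*p
s = sqrt(n*p*q)

# Binomial probability and normal curve height at each integer 0, 1, ..., 10
fit = data.frame(x, binom = y, normal = dnorm(x, mu, s))
fit$diff = abs(fit$binom - fit$normal)

round(fit, 4)
max(fit$diff)   # largest gap between a stick and the curve
```

For these parameter values the largest gap is under 0.01, so the curve passes close to the top of every stick.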