From the Poisson Distribution to Stirling's Approximation
The Poisson distribution is the most famous probability model for count data: non-negative integer values. Many real-world phenomena are well approximated by this distribution, including the number of German bombs that landed in 1/4 km² grid squares in south London during WWII. Formally, we say that a discrete random variable \(X\) follows a Poisson distribution with rate parameter \(\mu > 0\), abbreviated \(X \sim \text{Poisson}(\mu)\), if \(X\) has support set \(\{0, 1, 2, ...\}\) and probability mass function \[ p(x) \equiv \mathbb{P}(X=x) = \frac{e^{-\mu}\mu^x}{x!}. \] Using some clever algebra with sums, it’s not too hard to show that the rate parameter, \(\mu\), is both the mean and the variance of \(X\).
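If you’d rather not wrestle with the sums, here’s a quick numerical check of that claim (not a proof): choosing \(\mu = 4\) arbitrarily and truncating the support at 100, the weighted sums below should both come out very close to 4.
# A quick numerical check (not a proof) that the mean and variance of a
# Poisson RV both equal the rate parameter. mu = 4 is an arbitrary choice,
# and truncating the support at 100 discards only a negligible amount of mass.
mu <- 4
x <- 0:100
p <- dpois(x, mu)
c(mean = sum(x * p), variance = sum((x - mu)^2 * p))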
Numerical problems? Try taking logs.
Now, suppose that we wanted to plot the pmf of a Poisson RV with rate \(\mu = 171\). The R function for the pmf of a Poisson RV is dpois(), so we can make our plot as follows (indicating the rate parameter as a vertical line):
library(tidyverse)
tibble(x = 0:300) %>%
  mutate(p = dpois(x, 171)) %>%
  ggplot(aes(x, p)) +
  geom_point() +
  geom_vline(xintercept = 171) +
  ylab('Poisson(171) pmf')
For such a large value of \(\mu\), this distribution looks decidedly bell-shaped.
And indeed, it turns out to be extremely well-approximated by a normal distribution, as we’ll see below.
It’s also clear that \(X\) is most likely to take on a value relatively close to 171.
We can use dpois() to calculate the exact probability that \(X = 171\) as follows: the answer is just over 3%.
dpois(171, 171)
## [1] 0.03049301
Now let’s try to calculate exactly the same probability by hand, that is, by using the formula for the Poisson pmf from above.
my_dpois <- function(x, mu) {
  exp(-mu) * mu^x / factorial(x)
}
my_dpois(171, 171)
## [1] NaN
What gives?!
The abbreviation NaN stands for “not a number.” The problem in this case is that both the numerator and denominator of the fraction inside of my_dpois() evaluate to infinity when mu and x are 171, and the ratio \(\infty/\infty\) is undefined.¹
c(numerator = exp(-171) * 171^171, denominator = factorial(171))
##   numerator denominator
##         Inf         Inf
As I discussed in an earlier post, computers can only store a finite number of distinct numeric values.
It’s not literally true that factorial(171) equals \(\infty\). What’s really going on here is that factorial(171) is such a large number that it can’t be stored as a double-precision floating-point number, the format R uses for numeric values.
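To see exactly where things fall apart, we can compare against the largest finite double that R can represent: 170! still fits, but 171! does not. (The comments below describe standard 64-bit doubles.)
# The largest finite double-precision value, versus 170! and 171!
.Machine$double.xmax  # roughly 1.8e308
factorial(170)        # about 7.3e306: still representable
factorial(171)        # too big for a double, so it overflows to Inf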
In this case there’s a very simple fix. If you haven’t seen this trick before, it’s a helpful one to keep up your sleeve: if you run into numerical problems with very large or very small values, try taking logs.²
The log of the Poisson pmf is simply
\[
\log p(x) = -\mu + x \log(\mu) - \log(x!).
\]
R even has a convenient, built-in function for evaluating the natural log of a factorial: lfactorial().
Now we can compute the log of our desired probability as follows:
-171 + 171 * log(171) - lfactorial(171)
## [1] -3.490258
To obtain the probability, simply exponentiate:
exp(-171 + 171 * log(171) - lfactorial(171))
## [1] 0.03049301
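As an aside, you don’t actually have to assemble the log of the pmf by hand: dpois() has a log argument that returns log-probabilities directly, and it agrees with our manual calculation.
# dpois() can return the log of the pmf directly via its log argument;
# exponentiating recovers the same probability as before
log_p <- dpois(171, 171, log = TRUE)
c(log_p = log_p, p = exp(log_p))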
Of course, our by-hand calculation just passes the buck to lfactorial(). So how does this mysterious function work? The bad news is that I’m not going to tell you; the good news is that I’m going to show you something even better, namely Stirling’s approximation: a way to understand how \(n!\) behaves qualitatively that turns out to give a pretty darned good approximation to lfactorial().
This may seem like an odd topic for a blog devoted to econometrics and statistics, so allow me to offer a few words of justification.
First, computations involving \(n!\) come up all the time in applied work.
Second, it can be extremely helpful for certain theoretical arguments to have good approximations to \(n!\) for large values of \(n\).
Finally, and most importantly from my perspective, the heuristic argument I’ll use below relies on none other than the central limit theorem.
So even if you’ve seen a more traditional proof of Stirling’s approximation, I hope you’ll enjoy this alternative approach.³
Stirling’s Approximation
The key step in our argument is to show that the pmf of a \(\text{Poisson}(\mu)\) random variable is well-approximated by the \(\text{Normal}(\mu, \mu)\) density. This explains the bell-shaped curve that we plotted above. To obtain this result, we’ll use the central limit theorem. But there is one fact that you will need to take on faith if you don’t already know it: if \(X_1 \sim \text{Poisson}(\mu_1)\) is independent of \(X_2 \sim \text{Poisson}(\mu_2)\), then \(X_1 + X_2 \sim \text{Poisson}(\mu_1 + \mu_2)\). Proceeding by induction, we can view a Poisson(171) random variable as the sum of 171 independent Poisson(1) random variables. More generally, we can view a Poisson RV with rate parameter \(n\) as the sum of \(n\) iid Poisson(1) random variables.

By the central limit theorem, it follows that \[ \sqrt{n}(\bar{X}_n - 1) \rightarrow_d \text{N}(0,1) \] since the mean and variance of a Poisson(1) RV are both equal to one. From a practical perspective, this means that for large \(n\), \(\sqrt{n}(\bar{X}_n - 1)\) behaves approximately like \(Z\), a standard normal random variable. Re-arranging, \[ X_1 + X_2 + \cdots + X_n = n\bar{X}_n = n + \sqrt{n} \times [\sqrt{n}(\bar{X}_n - 1)] \approx n + \sqrt{n} Z, \] and \(n + \sqrt{n} Z\) is simply a \(\text{N}(n, n)\) random variable! This is a quick way of seeing why the \(\text{Poisson}(\mu)\) distribution is well-approximated by the \(\text{N}(\mu, \mu)\) distribution when \(\mu\) is large.
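Before moving on, here’s a quick numerical comparison of the two at a handful of (arbitrarily chosen) points; the agreement is already very good at \(\mu = 171\).
# Compare the Poisson(171) pmf to the Normal(171, 171) density at a few points
x <- c(150, 171, 190)
cbind(poisson = dpois(x, 171),
      normal  = dnorm(x, mean = 171, sd = sqrt(171)))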
Now let’s run with this.
As we just saw, for large \(\mu\) the Poisson\((\mu)\) pmf is well-approximated by the Normal\((\mu, \mu)\) density:
\[
\frac{e^{-\mu}\mu^x}{x!} \approx \frac{1}{\sqrt{2\pi \mu}} \exp\left\{ -\frac{1}{2}\left( \frac{x - \mu}{\sqrt{\mu}}\right)^2\right\}
\]
This approximation is particularly accurate for \(x\) near the mean. This is convenient, because substituting \(\mu\) for \(x\) considerably simplifies the right hand side:
\[
\frac{e^{-\mu}\mu^\mu}{\mu!} \approx \frac{1}{\sqrt{2\pi\mu}}
\]
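Before re-arranging, it’s worth plugging in \(\mu = 171\): the right-hand side is about 0.0305, essentially the dpois(171, 171) value we computed above.
# At mu = 171, the right-hand side closely matches the exact pmf value
c(normal_approx = 1 / sqrt(2 * pi * 171), exact = dpois(171, 171))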
Re-arranging, we obtain
\[
\mu! \approx \mu^\mu e^{-\mu} \sqrt{2 \pi \mu}
\]
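For a value of \(\mu\) small enough that factorial() doesn’t overflow, we can check this directly; \(\mu = 10\) is an arbitrary choice, and the two numbers should agree to within about one percent.
# Check the unlogged approximation at mu = 10 (small enough to avoid overflow)
mu <- 10
c(exact = factorial(mu), stirling = mu^mu * exp(-mu) * sqrt(2 * pi * mu))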
Taking logs of both sides gives:
\[
\log(\mu!) \approx \mu \log(\mu) - \mu + \frac{1}{2} \log(2 \pi \mu)
\]
Writing this with \(n\) in place of \(\mu\) gives the following:
\[
\log(n!) \approx n \log(n) - n + \frac{1}{2} \log(2 \pi n)
\]
This is called Stirling’s Approximation. The usual way of writing this excludes the \(\log(2\pi n)/2\) term, yielding \(\log(n!) \approx n\log(n) - n\), which is fairly easy to remember. Including the extra term, however, gives increased accuracy for smaller values of \(n\).
While I haven’t formally proved this, it turns out that
\[
\log(n!) \sim n \log(n) - n + \frac{1}{2} \log(2 \pi n)
\]
as \(n \rightarrow \infty\). In other words, the ratio of the LHS and RHS tends to one in the large \(n\) limit.
Perhaps surprisingly, this approximation is extremely accurate even for fairly small values of \(n\), as we can see by comparing it against lfactorial().
stirling1 <- function(n) n * log(n) - n
stirling2 <- function(n) n * log(n) - n + 0.5 * log(2 * pi * n)
tibble(n = 1:20) %>%
  mutate(Stirling1 = stirling1(n),
         Stirling2 = stirling2(n),
         R = lfactorial(n)) %>%
  knitr::kable(digits = 3)
n | Stirling1 | Stirling2 | R |
---|---|---|---|
1 | -1.000 | -0.081 | 0.000 |
2 | -0.614 | 0.652 | 0.693 |
3 | 0.296 | 1.764 | 1.792 |
4 | 1.545 | 3.157 | 3.178 |
5 | 3.047 | 4.771 | 4.787 |
6 | 4.751 | 6.565 | 6.579 |
7 | 6.621 | 8.513 | 8.525 |
8 | 8.636 | 10.594 | 10.605 |
9 | 10.775 | 12.793 | 12.802 |
10 | 13.026 | 15.096 | 15.104 |
11 | 15.377 | 17.495 | 17.502 |
12 | 17.819 | 19.980 | 19.987 |
13 | 20.344 | 22.546 | 22.552 |
14 | 22.947 | 25.185 | 25.191 |
15 | 25.621 | 27.894 | 27.899 |
16 | 28.361 | 30.667 | 30.672 |
17 | 31.165 | 33.500 | 33.505 |
18 | 34.027 | 36.391 | 36.395 |
19 | 36.944 | 39.335 | 39.340 |
20 | 39.915 | 42.331 | 42.336 |
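For larger values of \(n\) the match is even tighter. Reusing the stirling2() helper defined above, here’s the relative error at a few bigger values of \(n\); each entry should be tiny, and shrinking as \(n\) grows.
# Relative error of the more accurate approximation at larger n
n <- c(50, 500, 5000)
(lfactorial(n) - stirling2(n)) / lfactorial(n)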
Epilogue
I have a bad habit of trying to add a “moral” or “lesson” to the end of my posts, but I suppose there’s no point trying to break the habit today! While there are easier ways to derive Stirling’s approximation, there are two things I enjoy about this one. First, we get a more accurate approximation than \(n \log(n) - n\) with practically no effort. Second, making unexpected connections between facts that we already know both deepens our understanding and helps us “compress” information. If you ever forget Stirling’s approximation, now you know how to very quickly re-derive it on the spot!
1. There’s an important but subtle difference between NA and NaN. The former is synonymous with “missing.” If x equals NA, this means “we don’t know the value of x.” If instead x equals NaN, this means “x isn’t missing, but it’s not a well-defined numeric value either.”

2. Unless otherwise specified, log always means “natural logarithm” on this blog :)
3. I first came across this argument in the late David MacKay’s fantastic book Information Theory, Inference, and Learning Algorithms. His book on sustainable energy, while a bit out-of-date at this point, is also spectacularly good.