Intro
Limit theorems describe universal phenomena exhibited by repeated samples, through which it is possible to recover information about the underlying probability measure. If tossing a coin repeatedly (repeated samples) eventually gives similar proportions of Heads and Tails, then one can deduce that the coin is fair (a statement about the probability measure). This is called the law of large numbers, the first main limit result in this chapter. After some theoretical development, we will turn to a more refined question. Although the proportion of Heads gets closer and closer to the limit value of one half, it is still random. We will discuss its difference from this limit value, which, when viewed up close, exhibits a very interesting structure. This result is known as the central limit theorem.
Here is an example. Not so long ago I visited Shelbourne Farms in Vermont. On one of the walls there was a pushpin visitor map showing where visitors have come from. This is the photo on the left. When I got home, I checked the US population density map from the US Census Bureau (the most recent is from 2010). This is the photo on the right (I inverted the colors: darker is denser).
As you can see, the “randomly” generated map is a pretty accurate approximation of the real one, and with a fairly small number of random points (samples), relative to the entire population.
Pretty remarkable, right?
Tail Estimates: Markov inequality
The Markov inequality allows one to get upper bounds on probabilities in terms of expectations. The expectation is a “brief summary” of the distribution of a random variable (RV).
Theorem 1 [Markov’s Inequality] Let @@@X@@@ be a nonnegative random variable. Then for any @@@\alpha>0@@@,
$$ P(X>\alpha) \le \frac{E [X] }{\alpha}.$$We start with the numerator on the RHS. By the definition of the expectation,
$$ \begin{aligned} E[ X ] &= \int_0^\infty P(X>s) ds\\ &\ge \int_0^\alpha P(X>s) ds \\ &\ge \int_0^\alpha P(X>\alpha) ds \\ &= \alpha P(X>\alpha). \end{aligned} $$ ■The average annual household income in my town is @@@\$100\text{K}@@@. Show that the proportion of households with an annual income larger than @@@\$1\text{M}@@@ does not exceed @@@10\%@@@.
That is, if @@@X@@@ is a random variable distributed according to the household income distribution, then @@@E[X] = 10^5@@@. Since incomes are nonnegative, @@@X\ge 0@@@, and Markov’s inequality gives @@@P(X>10^6)\le 10^5/10^6 = 0.1=10\%@@@.
Note that we did not assume anything on the distribution of the household incomes, and the only information used was its expectation.
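To see the bound in action, here is a minimal simulation sketch (the Exp(1) distribution and the thresholds below are arbitrary choices, not part of the example above): it compares the empirical tail probability @@@P(X>\alpha)@@@ with the Markov bound @@@E[X]/\alpha@@@.

```python
import numpy as np

rng = np.random.default_rng(0)

# A nonnegative RV: Exp(1), so E[X] = 1 (an arbitrary choice for illustration).
samples = rng.exponential(scale=1.0, size=1_000_000)

for alpha in [1, 2, 5, 10]:
    empirical = (samples > alpha).mean()   # Monte Carlo estimate of P(X > alpha)
    bound = samples.mean() / alpha         # Markov bound E[X]/alpha
    print(f"alpha={alpha:>2}: P(X>alpha) ~ {empirical:.4f} <= bound {bound:.4f}")
```

As expected, the bound always holds, but it can be far from tight; Markov’s inequality only uses the expectation.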
We’re going to push this inequality further to get even better results.
Suppose that @@@Y@@@ is an RV (not required to be nonnegative), and that @@@f@@@ is a strictly increasing nonnegative function on @@@[0,\infty)@@@ (say @@@f(x)=x^c@@@ or @@@f(x) =e^{cx}@@@ for some @@@c>0@@@). Then for every @@@\beta>0@@@, the events @@@\{|Y|>\beta \}@@@ and @@@\{f(|Y|)>f(\beta)\}@@@ are equal. Apply Markov’s inequality with @@@X=f(|Y|)@@@ and @@@\alpha = f(\beta)@@@ to obtain
$$ P(|Y|>\beta) \le \frac{E [f (|Y|)]}{f(\beta)}. $$Let’s look at a specific case that is of particular importance. Let @@@X@@@ be any random variable with finite expectation. Set @@@Y= X-E[X]@@@, and let @@@f(y) = y^2@@@. Note that @@@f(|Y|) = (X - E[X])^2@@@, and its expectation is the all-familiar variance of @@@X@@@, @@@\sigma_X^2@@@. In this particular case, the general Markov inequality becomes:
Corollary 2 [Chebyshev’s Inequality] Suppose @@@X@@@ has a finite first moment. Then for any @@@\beta>0@@@,
$$ P(|X-E[X]|>\beta) \le \frac{\sigma_X^2}{\beta^2}.$$Suppose that a very large population has a certain unknown proportion @@@p@@@ of left handed people (or whatever label you choose). Our problem is to try to estimate @@@p@@@, but without asking everyone.
Here is what we do:
- We independently sample @@@n@@@ individuals (say 1024), and let @@@X_n@@@ denote the number of left-handed people in our sample. By construction, @@@X_n \sim \mbox{Bin}(n,p)@@@.
- Let @@@\hat X_n=X_n/n@@@ denote the empirical proportion of left-handed people.
- We find @@@E[\hat X_n ] = p@@@, and @@@\sigma_{\hat X_n}^2 = \frac{p(1-p)}{n}@@@.
- Apply Corollary 2 to @@@X= \hat X_n@@@ to obtain @@@P(|\hat X_n - p|>\beta) \le \frac{p(1-p)}{n\beta^2}@@@.
- Using the fact that @@@p(1-p) \le \frac 14@@@, the bound is at most @@@\frac{1}{4n \beta^2}@@@.
In other words, the probability that our (random) empirical proportion is more than @@@\beta@@@ units away from the real unknown @@@p@@@ is no more than @@@\frac{1}{4n \beta^2}@@@. Or, with probability (confidence) larger than @@@1-\frac{1}{4n \beta^2}@@@, the unknown proportion lies in the (random) interval @@@(\hat X_n -\beta,\hat X_n+\beta)@@@. This type of statement is known as a confidence interval.
Let’s test this with real numbers. Suppose I want to be sure with @@@90\%@@@ confidence (probability) that the empirical proportion is within @@@5\%@@@ of the real unknown proportion.
- @@@\beta =0.05@@@.
- We want the RHS of the inequality to be smaller than @@@1-0.9=0.1@@@: that is, @@@\frac{1}{4n(0.05)^2} \le 0.1@@@, which holds exactly when @@@n \ge \frac{1}{4\cdot 0.1\cdot (0.05)^2} = 1000@@@.
If we sample @@@1000@@@ people independently and calculate the empirical proportion, @@@\hat X_{1000}@@@, then with confidence of @@@90\%@@@ or more, the real unknown population proportion is between @@@\hat X_{1000}-0.05@@@ and @@@\hat X_{1000}+0.05@@@. This works for any @@@p@@@ and any population size, and is why polls work (theoretically).
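Here is a small simulation sketch of this poll (the “true” proportion @@@p=0.1@@@ below is a hypothetical value chosen only for the simulation): it checks how often the interval @@@(\hat X_n-0.05,\hat X_n+0.05)@@@ covers @@@p@@@ when @@@n=1000@@@.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, beta, trials = 0.1, 1000, 0.05, 10_000   # p is a hypothetical true proportion

# Empirical proportion in each of `trials` simulated polls of n people.
X_hat = rng.binomial(n, p, size=trials) / n

coverage = (np.abs(X_hat - p) <= beta).mean()
print(f"Chebyshev guarantee: >= {1 - 1/(4*n*beta**2):.2f}, observed coverage: {coverage:.4f}")
```

The observed coverage is typically much higher than the guaranteed @@@90\%@@@; Chebyshev-based intervals are valid but conservative.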
To “standardize” Corollary 2, let @@@\beta = \sigma_X \alpha@@@, where @@@\alpha@@@ is any positive number and @@@\sigma_X@@@ is the standard deviation. Corollary 2 then becomes:
$$ P(|X-E[X]|>\alpha \sigma_X) \le \frac{1}{\alpha^2}.$$Thus, measuring the difference from the expectation in units of standard deviation gives a “standardized” probability estimate which does not depend on the RV itself.
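A quick numerical sanity check of this distribution-free bound, on two arbitrary distributions (the choices Exp(1) and U[0,1] are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two arbitrary distributions; the bound 1/alpha^2 should hold for both.
samples = {
    "Exp(1)": rng.exponential(1.0, 1_000_000),
    "U[0,1]": rng.uniform(0.0, 1.0, 1_000_000),
}

for name, x in samples.items():
    mu, sigma = x.mean(), x.std()
    for alpha in [2, 3]:
        prob = (np.abs(x - mu) > alpha * sigma).mean()
        print(f"{name}, alpha={alpha}: P(|X-E[X]| > alpha*sigma) ~ {prob:.4f} <= {1/alpha**2:.4f}")
```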
Let @@@X\sim \mbox{Bin}(1000,0.7)@@@. Get an upper bound on the probability that @@@X@@@ is at least @@@3\%@@@ larger than its expectation:
- First use Theorem 1;
- Then use Corollary 2.
- Which one is better? (actual probability is around @@@7.8\%@@@).
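To check your answers numerically, here is a sketch using scipy (reading “at least @@@3\%@@@ larger” as @@@X \ge 721@@@, which is one possible interpretation):

```python
from scipy import stats

n, p = 1000, 0.7
mean, var = n * p, n * p * (1 - p)      # E[X] = 700, Var(X) = 210
t = int(round(1.03 * mean))             # "at least 3% larger": X >= 721
beta = t - mean                         # deviation of 21 from the mean

exact = stats.binom.sf(t - 1, n, p)     # P(X >= 721), about 0.078
markov = mean / t                       # Theorem 1: E[X]/alpha, about 0.97
chebyshev = var / beta**2               # Corollary 2: sigma^2/beta^2, about 0.48

print(f"exact ~ {exact:.4f}, Markov bound {markov:.4f}, Chebyshev bound {chebyshev:.4f}")
```

Both bounds are valid but loose; Chebyshev, which also uses the variance, is considerably tighter than Markov here.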
Convergence in probability
“To converge” means: to tend or move toward one point or one another. (merriam-webster.com)
Let @@@Z_1,Z_2,\dots@@@ and @@@Z@@@ be RVs in the same probability space. We say that the sequence @@@(Z_n:n\in\N)@@@ converges in probability to @@@Z@@@ if for any @@@\epsilon>0@@@,
$$P(|Z_n-Z|>\epsilon) \to 0.$$This can be interpreted as @@@Z_n = Z+ \underbrace{ Z_n - Z}_{=N_n}@@@, where @@@N_n@@@ is some “noise”. Convergence in probability means that, for any fixed level, the probability that the noise exceeds that level tends to zero as @@@n\to\infty@@@.
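A tiny simulation sketch of this definition, with an artificial choice of noise @@@N_n \sim N(0,1/n)@@@ added to a standard normal @@@Z@@@: the probability of a noise excursion above a fixed @@@\epsilon@@@ shrinks with @@@n@@@.

```python
import numpy as np

rng = np.random.default_rng(3)

Z = rng.normal(size=100_000)   # the limit RV (a standard normal, chosen for illustration)
eps = 0.1
for n in [1, 10, 100, 1000]:
    Zn = Z + rng.normal(scale=1/np.sqrt(n), size=Z.shape)   # Z_n = Z + noise of variance 1/n
    print(f"n={n:>4}: P(|Z_n - Z| > {eps}) ~ {(np.abs(Zn - Z) > eps).mean():.4f}")
```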
Weak Law of Large Numbers
The law of large numbers is the statement that, under reasonable assumptions on the sampling, the time averages (for example, the proportion of Heads among the first @@@n@@@ tosses) converge to the sample space average (the expectation, which here is the probability of Heads).
Suppose that @@@X_1,X_2,\dots@@@ are random variables in the same probability space satisfying the following:
- @@@E[X_i] =\mu_i\in \R@@@, and @@@a_n = \sum_{i\le n} E[X_i]\to \infty@@@ as @@@n\to\infty@@@.
- For each @@@i@@@, the variance of @@@X_i@@@, @@@\sigma_{X_i}^2@@@, is finite, and @@@\frac{\sum_{i\le n} \sigma^2_{X_i}}{a_n^2}\to 0@@@ as @@@n\to\infty@@@.
- For each @@@i\ne j@@@, the RVs @@@X_i@@@ and @@@X_j@@@ are uncorrelated, that is, @@@E[X_iX_j]=E[X_i]E[X_j]@@@ (equivalently, @@@\mbox{Cov}(X_i,X_j)=0@@@).
Let @@@S_n = \sum_{i\le n} X_i@@@. Then
$$ \frac{S_n}{a_n}\to 1,$$in probability. Furthermore, for any positive @@@\beta@@@,
$$ P\left( \left|\frac{S_n}{a_n} - 1\right|>\beta\right) \le \frac{\sum_{i=1}^n \sigma^2_{X_i}}{a_n^2 \beta^2}\to 0.$$In the case where @@@\mu_1=\mu_2=\dots=\mu@@@ (with @@@\mu>0@@@, so that @@@a_n=n\mu\to\infty@@@), the theorem states that:
$$ \frac{S_n}{n}\to \mu,\mbox{in probability as }n\to\infty.$$Let @@@X_1,X_2,\dots@@@ be IID (independent and identically distributed) with finite mean @@@\mu@@@ and variance @@@\sigma^2@@@. Let @@@\hat S_n = \frac{ X_1+\dots+X_n}{n}@@@ be the empirical mean of the first @@@n@@@ samples. Then @@@\hat S_n@@@ converges to @@@\mu@@@ in probability, and more specifically, for any @@@\beta>0@@@,
$$ P(|\hat S_n -\mu|>\beta) \le \frac{\sigma^2_X}{n \beta^2}.$$The WLLN tells us that @@@S_n \approx n \mu@@@, meaning the random sum @@@S_n@@@ is “highly concentrated” around its expectation.
The proof relies entirely on Corollary 2.
- The expectation of the empirical mean is @@@E[\hat S_n ] = \mu@@@.
- The variance is @@@\sigma^2_{\hat S_n} = \frac{1}{n^2} \sigma^2_{S_n}@@@.
- Due to the uncorrelated condition (which is satisfied by IID RVs), the variance of the sum is the sum of the variances: @@@\sigma^2_{S_n} = \sum_{i=1}^n \sigma^2_{X_i} = n\sigma^2_X@@@.
- Therefore, @@@\sigma^2_{\hat S_n} = \frac{n \sigma^2_X}{n^2} = \frac{\sigma^2_X}{n}@@@.
- Plugging this into Corollary 2: @@@P(|\hat S_n -\mu|>\beta) \le \frac{\sigma^2_X}{n \beta^2}@@@.
Since @@@\sigma^2_X@@@ is finite, the right-hand side tends to zero as @@@n \to \infty@@@.
■This example calculates the expected payout for a Keno game to illustrate the WLLN, showing that in the long run, the expected earnings per game (@@@E[Y_1] = 0.6487@@@) are less than the ticket price (@@@\$1@@@), guaranteeing a profit for the lottery.
If I play @@@1000@@@ games of Keno as described in Example 3, in how many of them would I win @@@\$100@@@, according to the WLLN? (Just give a ballpark.)
Let @@@Z_1,Z_2,\dots@@@ be IID with distribution function @@@F_Z@@@. The empirical CDF is defined as @@@\hat F_n(x) = \frac1n \sum_{i\le n} X_i(x)@@@, where @@@X_i(x)@@@ is the indicator of @@@\{Z_i \le x\}@@@. By the WLLN, for every fixed @@@x@@@,
$$\hat F_n (x) \to F_Z(x),$$in probability. This shows one can completely recover the CDF from IID samples.
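A simulation sketch of this fact (with an arbitrary choice of Exp(1) samples): the empirical CDF is compared with the true CDF @@@F_Z(x)=1-e^{-x}@@@ at a few points, for increasing sample sizes.

```python
import numpy as np

rng = np.random.default_rng(4)

# IID samples from Exp(1) (an arbitrary choice); compare the empirical CDF with
# the true CDF F(x) = 1 - exp(-x) at a few points.
for n in [100, 10_000, 1_000_000]:
    z = rng.exponential(1.0, size=n)
    for x in [0.5, 1.0, 2.0]:
        F_hat = (z <= x).mean()          # empirical CDF at x
        print(f"n={n:>7}, x={x}: F_hat = {F_hat:.4f}, F = {1 - np.exp(-x):.4f}")
```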
The daily change in the value of a portfolio has @@@E[X]=\$1\text{K}@@@ (@@@\mu=1@@@) and @@@\sigma=\$0.1\text{K}@@@ (@@@\sigma=0.1@@@). Starting at @@@\$100\text{K}@@@, the expected time to reach @@@\$300\text{K}@@@ is about @@@200@@@ days, since @@@100+n\mu \approx 300@@@, so @@@n \approx 200@@@.
Using the WLLN bound with @@@n=200@@@ and @@@\beta=0.1@@@, the probability of the accumulated change @@@S_n@@@ being more than @@@0.1n@@@ away from @@@n\mu@@@ is:
$$ P(|S_{200}- 200|>20) \le \frac{\sigma^2}{n \beta^2} = \frac{0.01}{200 \cdot (0.1)^2} = \frac{1}{200}=0.005.$$Thus, with probability of at least @@@99.5\%@@@, after @@@200@@@ days the portfolio will be within @@@\pm\$20\text{K}@@@ of @@@\$300\text{K}@@@.
A coin is repeatedly tossed, and I record the number of Heads minus the number of Tails. After @@@100@@@ tosses, this difference is about a third of the number of tosses. What is your estimate for the probability of Heads? (Note: we do not assume that the coin is fair)
Moment Generating Functions
Let @@@X@@@ be an RV. The moment generating function (MGF) of @@@X@@@, @@@\varphi_X@@@, is defined as
$$\varphi_X (t) = E[e^{t X}].$$@@@\varphi_X (t)@@@ is the expectation of @@@e^{tX}@@@, a transformation of @@@X@@@.
Let @@@X@@@ be any RV. What is the value of @@@\varphi_X(0)@@@?
Suppose @@@\varphi_X@@@ is finite in some open interval containing @@@0@@@. Then the Taylor expansion of @@@\varphi_X@@@ converges in this interval and is given by
$$\varphi_X (t) = 1+ E[X]t+ \frac{E[X^2]}{2} t^2 + \frac{E[X^3]}{3!}t^3 + \dots.$$In particular, for any positive integer @@@k@@@ the @@@k@@@-th moment of @@@X@@@, @@@E[X^k]@@@, is finite and given by
$$E[X^k]=\varphi_X^{(k)}(0)= \left.\frac{d^k}{dt^k} \varphi_X(t)\right|_{t=0}.$$The moments of @@@X@@@ are the “DNA” of the MGF, encoded through the Taylor expansion.
Let @@@X\sim \mbox{Exp}(\lambda)@@@. The MGF is
$$\varphi_X(t) =\frac{\lambda}{\lambda-t}=\frac{1}{1-\frac{t}{\lambda}}, \quad \text{for } t<\lambda.$$The Taylor expansion is @@@\varphi_X(t) = 1+ \frac{t}{\lambda} + \frac{t^2}{\lambda^2} + \dots@@@. From this, the @@@k@@@-th moment is @@@E[X^k] =k! \lambda^{-k}@@@. The variance is @@@\sigma^2_X = E[X^2] - E[X]^2 = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}@@@.
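As a quick symbolic check (a sketch using sympy; the symbol name lamda is just a spelling that avoids the Python keyword), differentiating the MGF above at @@@t=0@@@ reproduces @@@E[X^k]=k!\,\lambda^{-k}@@@:

```python
import sympy as sp

t = sp.symbols('t')
lam = sp.symbols('lamda', positive=True)

phi = lam / (lam - t)   # MGF of Exp(lambda), valid for t < lambda (from the formula above)

# k-th moment = k-th derivative of the MGF at t = 0; should equal k!/lambda^k.
for k in range(1, 5):
    moment = sp.simplify(sp.diff(phi, t, k).subs(t, 0))
    print(f"E[X^{k}] =", moment)   # 1/lamda, 2/lamda**2, 6/lamda**3, 24/lamda**4
```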
Suppose that @@@X\sim \mbox{Pois}(\lambda)@@@. The MGF is:
$$\varphi_X(t) = \sum_{k=0}^\infty e^{-\lambda} e^{t k} \frac{\lambda^k}{k!} =e^{-\lambda} e^{\lambda e^t }.$$Therefore,
$$ \varphi_X (t) = e^{\lambda (e^t -1)}.$$We can find the moments using derivatives:
- @@@\varphi_X '(t) = \lambda e^t \varphi_X (t) \implies E[X] = \varphi_X '(0) = \lambda@@@.
- @@@\varphi_X"(t) = (\lambda e^t)^2 \varphi_X (t) + \lambda e^t \varphi_X(t) \implies E[X^2] = \varphi_X"(0) = \lambda^2 +\lambda@@@.
- The variance is @@@\sigma^2_X = E[X^2] - E[X]^2 = (\lambda^2 +\lambda) - \lambda^2 = \lambda@@@.
Let @@@X\sim \mbox{U}[a,b]@@@. Show that
$$\varphi_X (t) = \begin{cases} \frac{e^{tb}-e^{ta}}{t(b-a)} & t \ne 0 \\ 1 & t= 0\end{cases}$$Using the Taylor expansion of @@@e^x = \sum_{k=0}^\infty \frac{x^k}{k!}@@@ on the MGF of @@@U\sim U[a,b]@@@, we get
$$ \begin{aligned} \varphi_U(t) &= \frac{1}{t(b-a)}\left( \sum_{k=0}^\infty \frac{(bt)^k}{k!} - \sum_{k=0}^\infty \frac{(at)^k}{k!} \right) \\ &= \frac{1}{t(b-a)} \left( \sum_{k=1}^\infty \frac{b^k-a^k}{k!} t^k \right) \\ &= 1 + \sum_{k=1}^\infty \frac{b^{k+1}-a^{k+1}}{(b-a)(k+1)!} t^k \end{aligned} $$The @@@k@@@-th moment, @@@E[U^k]@@@, is given by @@@k!@@@ times the coefficient of @@@t^k@@@:
$$E[U^k] = \frac{k!}{(k+1)!}\frac{b^{k+1}-a^{k+1}}{b-a} = \frac{1}{k+1}\frac{b^{k+1}-a^{k+1}}{b-a}.$$For @@@k=2@@@, @@@E[U^2]= \frac{b^3-a^3}{3(b-a)} = \frac{a^2+ab+b^2}{3}@@@. The variance is @@@\sigma^2_U = E[U^2] - E[U]^2 = \frac{b^3-a^3}{3(b-a)} - \left(\frac{b+a}{2}\right)^2 = \frac{(b-a)^2}{12}@@@.
Suppose @@@X \sim \Gamma(\alpha,\lambda)@@@. Then
$$ \varphi_X (t) = \begin{cases} \left(\frac{\lambda}{\lambda -t}\right)^\alpha & t < \lambda \\ \infty & \mbox{otherwise.} \end{cases}$$The derivation shows:
$$ \begin{aligned} \varphi_X(t) &= \int_0^\infty e^{tx} \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x} dx\\ &= \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty x^{\alpha-1} e^{-(\lambda-t)x} dx\\ &= \left(\frac{\lambda}{\lambda -t}\right)^\alpha \int_0^\infty \underbrace{\frac{(\lambda - t)^\alpha}{\Gamma(\alpha)}x^{\alpha-1} e^{-(\lambda-t)x}}_{\scriptsize \mbox{density of }\Gamma(\alpha,\lambda - t)}dx\\ &= \left(\frac{\lambda}{\lambda -t}\right)^\alpha. \end{aligned} $$Let @@@X@@@ be an RV and @@@\varphi_X@@@ be its MGF. Then:
- For any constant @@@a\in \R@@@, @@@\varphi_{X+a}(t) =e^{at}\varphi_X (t)@@@.
- For any constant @@@c \in\R@@@, @@@\varphi_{cX}(t) = \varphi_X(ct)@@@.
If @@@X@@@ and @@@Y@@@ are independent and have MGFs @@@\varphi_X@@@ and @@@\varphi_Y@@@ which are both finite on some open interval containing @@@0@@@, then @@@\varphi_{X+Y}@@@ is also finite on this interval and is equal to
$$\varphi_{X+Y} (t) = \varphi_X(t) \varphi_Y(t).$$This extends to any finite number of independent RVs.
Suppose @@@X \sim \mbox{Bin}(n,p)@@@. Since @@@X@@@ is a sum of @@@n@@@ independent @@@\mbox{Ber}(p)@@@ RVs, and @@@\varphi_{X_i}(t) = p e^t+ (1-p)@@@, by Corollary 7,
$$\varphi_X (t) = (pe^t + (1-p))^n.$$Suppose that @@@\varphi_X=\varphi_Y<\infty@@@ on some open interval containing @@@0@@@. Then @@@F_X=F_Y@@@. This means the MGF determines the CDF.
Suppose that @@@X@@@ and @@@Y@@@ are RVs whose joint MGF factorizes, @@@E[e^{sX+tY}]=\varphi_X(s)\varphi_Y(t)@@@, for all @@@(s,t)@@@ in some open rectangle containing the origin. Then @@@X@@@ and @@@Y@@@ are independent. (Note that the factorization @@@\varphi_{X+Y}=\varphi_X \varphi_Y@@@ alone is not enough to conclude independence.)
Suppose that @@@X\sim \mbox{Pois}(\lambda)@@@ and @@@Y \sim \mbox{Pois}(\mu)@@@ are independent.
$$ \varphi_{X+Y} (t) = \varphi_X(t) \varphi_Y(t) = e^{\lambda (e^t -1)} e^{\mu (e^t -1)} = e^{(\mu+\lambda) (e^t -1)}.$$This is the MGF of @@@\mbox{Pois}(\lambda+ \mu)@@@, so @@@X+Y \sim\mbox{Pois}(\lambda+ \mu)@@@.
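A quick simulation sketch of this fact (the rates @@@\lambda=2@@@ and @@@\mu=3@@@ are arbitrary choices): the empirical distribution of @@@X+Y@@@ matches the @@@\mbox{Pois}(\lambda+\mu)@@@ pmf.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, mu, size = 2.0, 3.0, 1_000_000   # arbitrary rates, chosen for illustration

s = rng.poisson(lam, size) + rng.poisson(mu, size)   # X + Y with X, Y independent

# Compare empirical frequencies of X + Y with the Pois(lambda + mu) pmf.
for k in range(8):
    print(k, round((s == k).mean(), 4), round(stats.poisson.pmf(k, lam + mu), 4))
```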
Let @@@X_1,X_2,\dots@@@ be IID, and let @@@R@@@ be an RV with support in the nonnegative integers, independent of the @@@X_i@@@’s. Let @@@S=X_1+\dots +X_R@@@. If @@@\varphi_{X}@@@ is the MGF of @@@X_1@@@, then
$$\varphi_S (t) = \varphi_R (\ln \varphi_{X}(t)).$$The derivation is:
$$ \begin{aligned} \varphi_S(t) &= E[ e^{t S} ]= E[e^{t(X_1+\dots+X_R)}] \\ &= \sum_{r=0}^\infty E[ e^{t (X_1+\dots+X_r)} \mid R=r] p_R(r) \\ &= \sum_{r=0}^\infty E[ e^{t (X_1+\dots+X_r)}] p_R(r) \\ &= \sum_{r=0}^\infty E[ e^{t X_1}]^r p_R(r) \\ &= \sum_{r=0}^\infty \varphi_{X}(t)^r p_R(r) \\ &= E[\varphi_X(t)^R] \\ &= E[e^{\ln \varphi_X(t) \cdot R}] = \varphi_R (\ln \varphi_{X}(t)). \end{aligned} $$
Central Limit Theorem
Let @@@X_1,X_2,\dots@@@ be IID with finite expectation @@@E[X_1]=\mu@@@ and finite, positive variance @@@\sigma^2>0@@@. Let @@@S_n = \sum_{i=1}^n X_i@@@. Then, as @@@n\to\infty@@@,
$$ \frac{S_n - n \mu}{\sqrt{n \sigma^2}} \to Z,$$in distribution, where @@@Z \sim N(0,1)@@@ is a standard normal RV.
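Before turning to applications, here is a minimal simulation sketch of the theorem (the choice of Exp(1) summands and of @@@n=50@@@ is arbitrary): the CDF of the standardized sum is compared with @@@\Phi@@@ at a few points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, trials = 50, 100_000
mu, sigma = 1.0, 1.0   # mean and standard deviation of Exp(1), a skewed example

# Standardized sums (S_n - n*mu)/sqrt(n*sigma^2) for Exp(1) summands.
S = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
Z = (S - n * mu) / np.sqrt(n * sigma**2)

for x in [-1.0, 0.0, 1.0]:
    print(f"x={x:+}: P(Z_n <= x) ~ {(Z <= x).mean():.4f}, Phi(x) = {stats.norm.cdf(x):.4f}")
```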
Applications
The Central Limit Theorem (CLT) allows for a normal approximation of the distribution of a sum of IID RVs, even if the underlying distribution is not normal.
Revisiting the confidence interval example, the CLT suggests that for large @@@n@@@:
$$\hat X_n \approx p + \sqrt{ \frac{p(1-p)}{n}} N(0,1).$$This leads to an approximation for the tail probability:
$$ \begin{aligned} P( |\hat X_n - p|>\beta ) &\approx P( \sqrt {\frac{p(1-p)}{n}}|N(0,1)|>\beta)\\ &= 2 \Phi\left( - \sqrt{ \frac{n}{p(1-p)}} \beta\right). \end{aligned} $$A computer repair service owner reported a total revenue of @@@\$70,200@@@ for @@@900@@@ visits, with an average charge of @@@\$80@@@ and standard deviation of @@@\$20@@@.
- Expected revenue: @@@900 \times \$80 = \$72,000@@@.
- Standard deviation of total revenue: @@@\sqrt{900} \times \$20 = 30 \times \$20 = \$600@@@.
The probability of earning @@@\$70,200@@@ or below is approximated by the CLT:
$$P(\text{Revenue} \le 70,200) \approx P\left(N(0,1) \le \frac{70,200 - 72,000}{600} \right) = \Phi(-3) \approx 0.0013.$$This is highly unlikely. The key is that the reported value is @@@3@@@ standard deviations below the average.
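A one-line check of the normal tail used above (scipy gives @@@\Phi(-3)\approx 0.00135@@@):

```python
from scipy import stats

n, mean, sd, reported = 900, 80.0, 20.0, 70_200

z = (reported - n * mean) / (sd * n**0.5)   # (70200 - 72000)/600 = -3
print(z, stats.norm.cdf(z))                  # -3.0  0.00135...
```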
CLT Approximation Algorithm
To approximate @@@P(S_n \in [a,b])@@@:
- Assume @@@n@@@ is large enough. The normalized sum @@@\frac{S_n - n \mu}{\sqrt{n \sigma^2}}@@@ is nearly @@@N(0,1)@@@.
- The approximation formula is: @@@P(S_n\in[a,b]) \approx \Phi\left(\frac{b-n\mu}{\sqrt{n\sigma^2}}\right) - \Phi\left(\frac{a-n\mu}{\sqrt{n\sigma^2}}\right)@@@.
Continuity Correction
When @@@S_n@@@ takes only integer values (e.g., Binomial or Poisson RVs), a continuity correction is used. This is done by replacing @@@b@@@ with @@@b + 0.5@@@ and @@@a@@@ with @@@a-0.5@@@ in the formula, which bridges the discrete setting to the continuous normal distribution.
The number of students showing up to office hours is @@@X \sim \mbox{Bin}(64, 1/2)@@@, so @@@n=64, n\mu=32@@@. We want to approximate @@@P(29 \le X \le 34)@@@.
- @@@\sqrt{n\sigma^2} = \sqrt{64 \cdot \frac{1}{2} \cdot \frac{1}{2}} = 4@@@.
- CLT with continuity correction: @@@P(29 \le X \le 34) \approx \Phi\left(\frac{34.5-32}{4}\right)-\Phi\left(\frac{28.5-32}{4}\right) = \Phi(0.625)-\Phi(-0.875) \approx 0.5432@@@.
- The exact probability is @@@0.5429792@@@.
- If we ignore the continuity correction, the approximation is @@@0.4648351@@@, which is a bad approximation.
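Here is a short scipy sketch reproducing these three numbers (the exact binomial probability, and the CLT approximation with and without the correction):

```python
from scipy import stats

n, p = 64, 0.5
mu, sd = n * p, (n * p * (1 - p)) ** 0.5   # 32 and 4
a, b = 29, 34

exact = stats.binom.cdf(b, n, p) - stats.binom.cdf(a - 1, n, p)            # P(29 <= X <= 34)
no_cc = stats.norm.cdf((b - mu) / sd) - stats.norm.cdf((a - mu) / sd)      # no correction
with_cc = stats.norm.cdf((b + 0.5 - mu) / sd) - stats.norm.cdf((a - 0.5 - mu) / sd)

print(f"exact: {exact:.7f}, no correction: {no_cc:.7f}, with correction: {with_cc:.7f}")
```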
Proof of the CLT
We did not develop the tools to give a complete proof. Instead, we will prove the following up to a technical detail which requires too much analysis for this course.
Let @@@X_1,X_2,\dots@@@ be IID with finite third moment @@@E[|X_1|^3]<\infty@@@, variance @@@\sigma^2>0@@@ and expectation @@@\mu@@@, and let @@@S_n =\sum_{k=1}^n X_k@@@ be the @@@n@@@-th partial sum. Then
$$\frac{ S_n - n \mu}{\sqrt{n \sigma^2} } \Rightarrow N(0,1).$$Furthermore, for every @@@f:\R\to \R@@@ which is bounded and has bounded derivatives up to the third order,
$$\left | E\left[f\left(\frac{ S_n - n \mu}{\sqrt{n \sigma^2}}\right)\right] - E[f(N(0,1))] \right | \le \frac{ C_f c}{6\sqrt{n}},$$where @@@C_f=\sup_x |f'''(x)|@@@ and @@@c = E[|X_1|^3]+ E[|Z|^3]@@@ with @@@Z\sim N(0,1)@@@.
We can always assume that the RVs @@@X_1,\dots,X_n@@@ have mean zero and variance @@@1@@@. Why? Replace @@@X_k@@@ by @@@\frac{ X_k- \mu}{\sqrt{\sigma^2}}@@@. So let’s assume that. To ease the notation, we put a hat, @@@\hat\cdot@@@, as shorthand for dividing by @@@\sqrt{n}@@@, so @@@\hat X_k=X_k/\sqrt{n}@@@ and @@@\hat S_n= S_n/\sqrt{n}@@@.
By Theorem ??, we need to show that for any @@@f@@@ which is continuous and bounded,
$$\lim_{n\to\infty} E[f(\hat S_n)] = E [f(N(0,1))].$$We are going to do that for a smaller class of functions: functions which are bounded, have derivatives up to the third order, and whose first, second and third derivatives are all bounded. This may seem like a very small class, but with some analysis (not done here), one can show that convergence for this class implies convergence for all continuous and bounded functions.
To begin our proof, let’s take another sequence @@@Z_1,\dots,Z_n@@@ of IID @@@N(0,1)@@@ RVs, independent of the @@@X_1,\dots,X_n@@@. Let @@@T_n = Z_1+\dots+Z_n@@@. We are going to gradually move from @@@S_n@@@ to @@@T_n@@@ by replacing one summand at a time:
$$ \newcommand{\dTo}{\downarrow} \begin{array}{cccccccc} S_n = S_{n,0} = & X_1 & + X_2 & + X_3 & + \dots + & X_{n-1} & +X_n \\ & \dTo \\ S_{n,1} = & {\bf Z_1} & + X_2 & + X_3 & + \dots + & X_{n-1} & +X_n \\ & & \dTo \\ S_{n,2} = & Z_1 & + {\bf Z_2} & + X_3 & + \dots + & X_{n-1} & +X_n \\ & & & \dTo \\ \vdots & & & \ddots \\ \vdots & & & & \ddots \\ \vdots & & & & & \ddots \\ & & & & & \dTo \\ S_{n,n-1} = & Z_1 & + Z_2 & + Z_3 & + \dots + & {\bf Z_{n-1}} & +X_n \\ & & & & & & \dTo \\ T_n= S_{n,n} = & Z_1 & + Z_2 & + Z_3 & + \dots + & Z_{n-1} & +{\bf Z_n} \end{array} \label{eq:change} $$The boldface only highlights what has been changed. We also define, for each @@@k=1,\dots,n@@@, @@@R_{n,k} = S_{n,k} -Z_k@@@. Then
$$S_{n,k-1} = R_{n,k} + X_k\mbox{ and }S_{n,k} = R_{n,k}+Z_k,$$and, by construction, each of the sums on the RHS is a sum of two independent RVs. Recall the @@@\hat \cdot @@@ notation: @@@\hat S_{n,k} = S_{n,k}/\sqrt{n}@@@, @@@\hat Z_k=Z_k/\sqrt{n}@@@, etc.
Next, pick a “test function” @@@f@@@. We will assume that @@@f@@@ is bounded and has continuous and bounded derivatives up to the third order. We are interested in the difference between @@@ E[f (\hat S_n)]@@@ and @@@E[f(N(0,1))]@@@. Since @@@T_n@@@ is a sum of IID @@@N(0,1)@@@, @@@T_n \sim N(0,n)@@@ and therefore @@@\hat T_n \sim N(0,1)@@@, and in particular, @@@E[f(N(0,1))]=E[f(\hat T_n)]@@@. From now on, we will use the expression on the RHS. Why did we define the @@@S_{n,k}@@@’s? To get the following telescopic sum:
$$ \begin{align} \nonumber \left | E[f (\hat S_n)]- E[f(\hat T_n )] \right| & = \left | \sum_{k=1}^n \left( E[f(\hat S_{n,k})] - E[f(\hat S_{n,k-1})] \right)\right| \\ \label{eq:CLT_telescope} & \le \sum_{k=1}^n \left | E[f(\hat S_{n,k})] - E[f(\hat S_{n,k-1})] \right|, \end{align} $$and of course, we are going to estimate each summand separately using Taylor expansion:
$$f (\hat S_{n,k}) = f( \hat R_{n,k}+\hat Z_k) =f(\hat R_{n,k}) + f'(\hat R_{n,k}) \hat Z_k+ \frac{1}{2} f''(\hat R_{n,k}) (\hat Z_k)^2 + \frac{1}{6} f'''(M_{n,k})(\hat Z_k)^3,$$where @@@M_{n,k}@@@ is some number between @@@\hat R_{n,k}@@@ and @@@\hat R_{n,k}+\hat Z_k@@@. Similarly,
$$f (\hat S_{n,k-1}) = f( \hat R_{n,k}+\hat X_k) =f(\hat R_{n,k}) + f'(\hat R_{n,k}) \hat X_k+ \frac{1}{2} f''(\hat R_{n,k}) (\hat X_k)^2 + \frac{1}{6} f'''(\tilde M_{n,k})(\hat X_k)^3,$$where @@@\tilde M_{n,k}@@@ is some number between @@@\hat R_{n,k}@@@ and @@@\hat R_{n,k}+\hat X_k@@@.
Since @@@\hat R_{n,k}@@@, @@@\hat Z_k@@@ and @@@\hat X_k@@@ are independent, it follows that after taking expectation we have
$$ \begin{align*} E [ f (\hat S_{n,k}) ]& =E[ f(\hat R_{n,k}) ]+ E[f'(\hat R_{n,k})]\underset{=0}{E[ \hat Z_k]}+ \frac{1}{2}E[ f''(\hat R_{n,k})]\underset{=\frac{1}{n}}{E[ (\hat Z_k)^2]} + \frac{1}{6} E[ f'''(M_{n,k}) (\hat Z_k)^3]\\ & = E[ f(\hat R_{n,k}) ] + \frac{1}{2n}E[ f''(\hat R_{n,k})]+ \frac{1}{6} E[ f'''(M_{n,k})(\hat Z_k)^3], \end{align*} $$and similarly
$$E [ f (\hat S_{n,k-1}) ]=E[ f(\hat R_{n,k}) ]+\frac{1}{2n} E[ f''(\hat R_{n,k})]+ \frac{1}{6} E[ f'''(\tilde M_{n,k})(\hat X_k)^3].$$Therefore,
$$ \begin{align*} \left | E[f(\hat S_{n,k}) ] -E [f(\hat S_{n,k-1}) ] \right | & =\frac{1}{6} \left | E[ f'''(M_{n,k})(\hat Z_k)^3] - E[ f'''(\tilde M_{n,k})(\hat X_k)^3]\right| \\ & \le \frac{1}{6}\left( \left | E[ f'''(M_{n,k})(\hat Z_k)^3]\right| + \left|E[ f'''(\tilde M_{n,k})(\hat X_k)^3]\right|\right). \end{align*} $$Remember that we assumed that the third derivative of @@@f@@@ is bounded? In particular, there exists some constant @@@C_f\ge0@@@ such that @@@|f'''(x)|\le C_f@@@ for all @@@x@@@. We’re now going to use this to obtain
$$ \begin{align*} |E[f"'(M_{n,k})] (\hat Z_k^3) ] |& \le E[|f"'(M_{n,k})| |\hat Z_k^3| ]\\ & \le C_f E[ |\hat Z_k|^3] \\ & = \frac{C_f}{n^{3/2}} E[|Z_k|^3]\\ & = \frac{C_f}{n^{3/2}} c_1 \end{align*} $$where @@@c_1=E[|Z_1|^3]@@@. Similarly,
$$|E[f"'(M_{n,k})] (\hat Z_k^3) ]\le \frac{C_f}{n^{3/2}}c_2,$$where @@@c_2= E[|X_1|^3]@@@. As a result, it follows that
$$\left | E[f(\hat S_{n,k}) ] -E [f(\hat S_{n,k-1}) ] \right | \le \frac{C_f}{6 n^{3/2}} (c_1+c_2).$$Plugging this inequality into the telescopic sum above, we obtain
$$\left | E[f (\hat S_n)]- E[f(\hat T_n )] \right| \le\frac{C_f}{6 n^{3/2}} \sum_{k=1}^n (c_1+c_2) = \frac{C_f}{6\sqrt{n}}(c_1+c_2),$$and the result follows.
■
Problems
- Suppose that @@@X@@@ is a random variable satisfying @@@E[X]=0@@@ and @@@E[X^2]=\sigma^2@@@. Show that for every @@@t>0@@@, @@@P(X>t) \le \frac{\sigma^2}{\sigma^2+t^2}@@@.
(Hint: for any @@@a>0@@@, @@@\{X>t\} =\{X+a > t+a\} \subset \{ (X+a)^2 > (t+a)^2\}@@@)
- We know that the expected college education cost in the US is $65K per student, with a standard deviation of $15K. Use part a. to find an upper bound on the proportion of students spending $100K or more.
Show that @@@X_n \to X@@@ in probability if and only if @@@E [ \frac{|X_n -X|}{1+|X_n -X|}]\to 0@@@ as @@@n\to\infty@@@.
Hint: if @@@X_n \to X@@@ in probability, then the result follows from the definition and the boundedness of the function @@@\phi(x)= |x|/(1+|x|)@@@. Conversely, observe that
$$E[\phi(X_n-X)] \ge E [ {\bf 1}_{\{|X_n-X|>\epsilon\}} \phi(X_n-X)] \ge \frac{\epsilon}{1+\epsilon} P(|X_n-X|>\epsilon).$$Show that under the assumptions of the WLLN for IID RVs, @@@\frac{S_n - n \mu}{n^\alpha}\to 0@@@ in probability for any @@@\alpha>\frac 12@@@.
Suppose that @@@T_1,T_2,\dots@@@ are nonnegative IID RVs taking values in @@@\Z_+@@@, with finite strictly positive expectation. For each @@@n\in \N@@@, let @@@S_n=\sum_{j=1}^n T_j@@@. For @@@k\in\N@@@, let @@@N_k=\max\{ n: S_n \le k\}@@@. Interpretation: @@@S_1@@@ is the number of days until the first time I get a flat tire, @@@S_2@@@ is the number of days until the second time I get a flat tire, etc. Now @@@N_k@@@ is the number of flats I had up to day @@@k@@@. Show that
$$\frac{N_k }{ k} \to \frac{1}{E[T_1]},$$in probability as @@@k\to\infty@@@.
This is intuitively clear: on average a flat tire occurs every @@@E[T_1]@@@ days, so we have @@@1/E[T_1]@@@ flat tires per day. The mathematical argument is a little more subtle than that.
Let @@@(X_n:n\in\N)@@@ be a sequence of IID nondegenerate RVs (nondegenerate means @@@P(X_1=a)<1@@@ for any @@@a@@@).
- Show that the sequence does not converge in probability.
- Conclude that no subsequence converges in probability.
Suppose that @@@(M_n:n\in\N)@@@ is a sequence of RVs converging to @@@0@@@ in distribution, and let @@@X@@@ be an RV satisfying @@@P(X\le 0)=0@@@. Show that @@@\lim_{n\to\infty} P(M_n <X) = 1@@@.
An urn contains @@@N@@@ balls: @@@K@@@ white and @@@b=N-K@@@ black. A random sample of @@@n\le N@@@ balls is withdrawn without replacement. Let @@@X@@@ denote the number of white balls. Recall that the distribution of @@@X@@@ is called hypergeometric with parameters @@@N@@@, @@@K@@@ and @@@n@@@.
- In Example 8 we calculated the expectation of @@@X@@@ using the additivity of the expectation. Derive the expectation through a direct computation of @@@\sum_{k}k p_X(k)@@@.
- Show that if @@@N \to \infty@@@ and @@@K=K_N@@@ satisfies @@@K_N/N \to p \in (0,1)@@@, then @@@p_X(k) \to \binom{n}{k}p^k (1-p)^{n-k}@@@ for every @@@k@@@, the PMF of the @@@\mbox{Bin}(n,p)@@@ distribution.
Can you explain this result?
We prove that @@@\int_{-\infty}^\infty e^{-x^2/2}dx = \sqrt{2\pi}@@@. Let @@@I@@@ be that integral.
- Show that @@@I^2 = \iint e^{-(x^2+y^2)/2} dx dy@@@.
- Change the last double integral to polar coordinates and calculate it.
Find the limit @@@\lim_{n\to\infty} e^{-n} \sum_{k=0}^n \frac{n^k}{k!}@@@.
(Hint: If @@@X@@@ is @@@\mbox{Pois}(n)@@@, it can be viewed as the sum of @@@n@@@ independent @@@\mbox{Pois}(1)@@@ RVs)
I have three independent RVs @@@X_1,X_2,X_3@@@. Suppose that the MGFs of @@@X_1,X_2,X_3@@@ are @@@(1-3t)^{-1},(1-3t)^{-1.25},(1-3t)^{-3.75}@@@, respectively. Let @@@X=X_1+X_2+X_3@@@. Compute @@@E[X^4]@@@.