Intro
Discrete RVs
[[Wikipedia:Pliny_the_Elder|Pliny the Elder]] is quoted (Book I, Sec. 7) as saying “The only certainty is that nothing is certain”. The statement is usually true for RVs, at least in some sense, because we cannot predict the value of an RV prior to performing the experiment: we can’t tell what number will show when we roll a die. However, for many purposes we need a number, one number, to “summarize” the RV. Here are some concrete examples.
- What’s the fair price of an insurance policy? The probability my house will be damaged in a fire in a one-year period is @@@0.002@@@ (see also fast facts about fire). If that happens, the insurance company will pay its reconstruction cost of @@@\$300\text{K}@@@ (I wish). What is a “fair” price to charge for an annual insurance policy? Well, roughly 1 in 500 houses burn down, and therefore the fair price is @@@\$300\text{K}/500=\$600@@@. Let’s redo this by first calculating the total amount the company expects to pay out in a year. The fair price per customer is then this total divided by the number of policy holders:
$$ \mbox{Fair price} =\$300\text{K} * \%\mbox{damaged} + 0* \%\mbox{undamaged}=\$300\text{K}*0.002 =\$600.$$- How to summarize data in a single number? @@@1/6@@@ of my class got @@@60@@@ in the exam, @@@1/3@@@ got @@@75@@@, and the rest (half the class) got @@@90@@@. How do I summarize these data in a single number? There is no unique answer to that.
- Some may say @@@90@@@, because this is the most common value, also known as the mode.
- And some will say it is the largest value of @@@90@@@ (because we’re optimistic), or the lowest value of @@@60@@@ (because we want to show we’re tough).
- Or some point in between, say @@@85@@@ (or any other number between @@@75@@@ and @@@90@@@: half the class is above it and half is below), known as a median.
Most of us, however, would not choose any of the above, perhaps because each ignores some of the data, and would instead calculate the average, which takes into account all the data provided:
$$ 60* \frac 16 + 75 *\frac 13 + 90* \frac 12 = 80.$$Here’s the definition. We begin with the more intuitive case of discrete RVs.
Suppose that @@@X@@@ is a discrete RV with PMF @@@p_X@@@. The expectation of @@@X@@@, denoted by @@@E[X]@@@, is defined as
$$ E[X]= \sum_{x} xp_X(x),\mbox{ provided }\sum_{x} \ | x | \ p_X(x)<\infty.$$The sum is taken over all @@@x@@@ in the support of @@@X@@@, that is, the set of values @@@x@@@ such that @@@P(X=x)>0@@@. The finiteness condition in the definition guarantees that the sum defining @@@E[X]@@@ is well-defined, and if it holds we say that @@@X@@@ has finite expectation. The condition is only relevant when the support of @@@X@@@ is infinite or contains @@@\pm \infty@@@. Before doing “real” examples, let’s consider some nearly trivial ones. Suppose @@@X@@@ is a constant RV, that is, @@@X@@@ takes only one value @@@c@@@. Then the probability that @@@X=c@@@ is @@@1@@@, and the probability that @@@X@@@ is equal to any other number is zero. So @@@E[X] = c*1=c@@@. Next, what about the indicator of an event @@@A@@@? It is equal to @@@1@@@ on @@@A@@@ and to zero on @@@A^c@@@. Therefore
$$E[X] = 0 P(\underset{=A^c}{\underbrace{X=0}})+ 1 P(\underset{=A}{\underbrace{X=1}})= P(A).$$Now for a real one.
The price of a painting is @@@\$100@@@. In two years it will be worth @@@\$5000@@@ with probability @@@2\%@@@ and nothing with probability @@@98\%@@@. What is the expected value of the painting in two years?
Our RV @@@X@@@, representing the price of the painting in two years, is equal to @@@5000@@@ with probability @@@0.02@@@ and to @@@0@@@ with probability @@@0.98@@@. Therefore the expected value is equal to
$$ E[X] = 5000*0.02 + 0*0.98 = 100.$$Just the same as today. Maybe buy another painting as an investment, or just buy this one because we like it.
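A quick Monte Carlo sketch can confirm this kind of calculation; the following uses Python's standard library with the probabilities from the example (the sample size is an arbitrary illustrative choice):

```python
# Monte Carlo check of the painting example:
# value 5000 with probability 0.02, and 0 otherwise.
import random

random.seed(0)
n = 200_000
total = sum(5000 if random.random() < 0.02 else 0 for _ in range(n))
print(total / n)  # should land near the exact expectation of 100
```

The sample mean fluctuates around @@@100@@@, with fluctuations shrinking as @@@n@@@ grows.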
Here’s another one.
In a jar I have @@@3@@@ white balls and @@@4@@@ black balls. I select two balls. What is the expected number of white balls?
Let @@@X@@@ be the number of white balls. Then @@@X@@@ is hypergeometric. Thus, the PMF @@@p_X@@@ is given by
$$\begin{equation*} \begin{array}{|r|c|c|c|} \hline\hline x = & 0 & 1 & 2\\ \hline p_X(x) = & \frac{ \binom{4}{2}}{\binom{7}{2}} & \frac{\binom{4}{1}\binom{3}{1}}{\binom{7}{2}} & \frac{\binom{3}{2}}{\binom{7}{2}} \\ \hline x* p_X(x) = & 0*\frac{6}{21} & 1*\frac{12}{21} & 2*\frac{3}{21} \\ \hline\hline \end{array} \end{equation*}$$The expectation is the sum of the numbers in the last row, and is therefore equal to @@@\frac{18}{21}=\frac{6}{7}@@@. This is a little short of @@@1@@@, because there are fewer white balls than black balls. But wait! Let @@@X_1@@@ be the indicator that the first ball is white, and let @@@X_2@@@ be the indicator that the second ball is white. Then both are indicators of events, each having probability @@@3/7@@@ (using the formula we derived when discussing independence), therefore @@@X_1,X_2\sim \mbox{Bern}(3/7)@@@ and @@@E[X_1]=E[X_2]=3/7@@@. Also, @@@X=X_1+X_2@@@, and our calculation above shows that @@@E[X]=E[X_1]+E[X_2]@@@. Is this just a coincidence? No. This is an illustration of one of the most important properties of the expectation, see Theorem 3.
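The table above can be recomputed exactly with rational arithmetic; this is just a sanity check of the numbers in the example:

```python
# Exact PMF and expectation for drawing 2 balls from 3 white + 4 black.
from fractions import Fraction
from math import comb

total = comb(7, 2)  # number of ways to pick 2 balls out of 7
pmf = {x: Fraction(comb(3, x) * comb(4, 2 - x), total) for x in range(3)}
expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 6/7
```

Using `Fraction` avoids any floating-point rounding, so the result @@@6/7@@@ is exact.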
General formula
Ok. For discrete RVs the notion of expectation is pretty straightforward. What do we do for a general RV? Well, I will now give you a definition that works for all RVs, whether they are discrete, continuous, or neither. It may seem strange and counter-intuitive, but for discrete RVs it is nothing but a rewriting of Definition 1 through a method known as summation by parts, yes, the discrete analog of something you know well from calculus:
Suppose that @@@X@@@ is an RV. The expectation of @@@X@@@, denoted by @@@E[X]@@@, is defined as
$$ E[X] = \int_0^\infty P(X>t) dt - \int_0^\infty P(X<-t)dt,$$provided at least one of the integrals is finite.
Again, if both integrals are finite, we say that @@@X@@@ has finite expectation. If the first is infinite and the second is finite, we say that @@@E[X]=+\infty@@@, and if the first is finite and the second is infinite, we say that @@@E[X]=-\infty@@@. Otherwise (both integrals are infinite), the expectation of @@@X@@@ is not defined.
Note that the expression above can be also written in terms of the CDF of @@@X@@@,
$$ E[X] =\int_0^\infty (1-F_X(t)) dt - \int_0^\infty F_X(-t) dt.$$Use this definition to compute the expectation of an indicator of an event with probability @@@p@@@.
If you’re interested in how this definition actually generalizes Definition 1, watch this
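As a small numerical sketch (an illustration only, not a practical computation method), one can check that the tail-integral formula recovers the familiar mean of a fair die, where the negative-part integral vanishes:

```python
# Riemann-sum approximation of E[X] = \int_0^\infty P(X > t) dt
# for X a fair six-sided die; the exact answer is 3.5.

def tail(t):
    """P(X > t) for a fair six-sided die."""
    return sum(1 for k in range(1, 7) if k > t) / 6

dt = 1e-3
approx = sum(tail(i * dt) * dt for i in range(int(7 / dt)))
print(approx)  # close to 3.5
```

The approximation error comes only from the Riemann sum, since @@@P(X>t)@@@ is a step function here.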
Here’s a nice result.
Suppose that @@@P(X\ge 0)=1@@@. Then @@@E[X]\ge 0@@@, and equality holds if and only if @@@P(X=0)=1@@@.
In other words: a nonnegative (with probability @@@1@@@) RV has nonnegative expectation, and it is zero if and only if the RV is equal to zero (with probability @@@1@@@).
Continuous RVs with densities
The calculation of the expectation for RVs with densities is pretty much the same as for discrete RVs. Let’s see exactly how. Suppose that @@@X@@@ is an RV with density @@@f@@@. Starting from Definition 2 of the expectation, we have
$$ E[X] = \int_0^\infty \underset{=P(X>t)}{\underbrace{\int_t^\infty f(u) du}}\, dt - \int_0^\infty \underset{=P(X<-t)}{\underbrace{\int_{-\infty}^{-t} f(u) du}}\, dt.$$Changing the order of integration, we have
$$\begin{align*} E[X] &= \int_0^\infty \int_0^u dt f(u) du - \int_{-\infty}^0 \int_{0}^{-u} dt f(u) du \\ & = \int_0^\infty u f(u) du + \int_{-\infty}^0 u f(u) du\\ & = \int_{-\infty}^\infty u f(u) du. \end{align*}$$And, as expected, we recover a continuous analog of the formula in Definition 1:
Suppose that @@@X@@@ has density @@@f_X@@@. Then the expectation of @@@X@@@ is equal to
$$ E[ X] = \int_{-\infty}^\infty u f_X(u) du,$$provided @@@\int_{-\infty}^\infty \ | u | \ f_X(u) du<\infty@@@.
You see? Instead of a summation, we have an integral. Instead of a PMF, we have a density.
Let’s calculate some expectations.
- '''Continuous Uniform''': @@@X\sim \mbox{U}[a,b]@@@. Then @@@E[X] = \frac{a+b}{2}@@@. Indeed,
$$ E[X] = \int_a^b u\,\frac{1}{b-a}\, du = \frac{b^2-a^2}{2(b-a)} = \frac{a+b}{2}.$$
- '''Exponential''': @@@X\sim \mbox{Exp}(\lambda)@@@. Then @@@E[X] = \frac{1}{\lambda}@@@. Indeed,
$$ E[X] = \int_0^\infty u\, \lambda e^{-\lambda u}\, du. \quad (*)$$We can calculate this through integration by parts:
$$ (*) = u (-e^{-\lambda u})|_0^\infty +\int_0^\infty e^{-\lambda u} du = \frac{1}{\lambda}.$$Alternatively, we can use the following differentiation-under-the-integral-sign trick
$$ (*) = \int_0^\infty \lambda \frac{\partial }{\partial \lambda} (-e^{-\lambda u}) d u = -\lambda\frac{\partial }{\partial \lambda} \int_0^\infty e^{-\lambda u}du = -\lambda \frac{\partial }{\partial \lambda} \frac{1}{\lambda}= \frac{1}{\lambda}.$$Or, we can simply use Definition Definition 2:
$$ E[X] = \int_0^\infty P(X>t) dt - 0 = \int_0^\infty (1-F_X(t)) dt = \int_0^\infty e^{-\lambda t} dt = \frac{1}{\lambda}.$$It’s all a matter of taste, but whichever route you take, you need to know your integrals.
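Whichever derivation you prefer, a simulation can confirm the answer; this sketch uses Python's standard library, and the choice @@@\lambda = 2@@@ is just for illustration:

```python
# Monte Carlo check that an Exp(lambda) RV has mean 1/lambda.
import random

random.seed(1)
lam = 2.0
n = 100_000
sample_mean = sum(random.expovariate(lam) for _ in range(n)) / n
print(sample_mean)  # close to 1/lam = 0.5
```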
Linearity of Expectation
The last result we present before turning to calculations is the linearity of the expectation. This is an important and useful result.
Suppose that @@@X@@@ and @@@Y@@@ have finite expectation. Then
- For any @@@c\in \R@@@, @@@E[cX] = c E[X]@@@.
- @@@E[X+Y] = E[X]+E[Y]@@@.
Before we move to examples, I’d like to note that the linearity extends to any finite sum, by iterating. @@@E[X+Y+Z] = E[(X+Y)+Z] = E[X+Y]+E[Z]= E[X]+E[Y]+E[Z]@@@. Remember that. Infinite sums? Not always, but yes if all RVs are nonnegative. Some may also want to ask about products. Is @@@E[XY]@@@ equal to the product of @@@E[X]@@@ and @@@E[Y]@@@? Usually not (think of indicators of disjoint events). We will discuss some cases when we talk about independence of RVs.
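The remark about products can be made concrete with an exact toy computation (the choice of a fair die and the events @@@\{1\}@@@, @@@\{2\}@@@ is just for illustration):

```python
# Counterexample for products: X, Y indicators of the disjoint events
# {die shows 1} and {die shows 2}. Then XY is identically 0, so
# E[XY] = 0 while E[X] E[Y] = 1/36.
from fractions import Fraction

p = Fraction(1, 6)                                   # fair die
E_X = sum(p for w in range(1, 7) if w == 1)          # indicator of {1}
E_Y = sum(p for w in range(1, 7) if w == 2)          # indicator of {2}
E_XY = sum(p for w in range(1, 7) if w == 1 and w == 2)  # empty event
print(E_X * E_Y, E_XY)  # 1/36 0
```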
I open the store at @@@8\text{AM}@@@. The first customer arrives at the store @@@X@@@ hours after its opening, where @@@X\sim \mbox{U}[0,2]@@@, and stays there for @@@Y@@@ hours, where @@@Y\sim \mbox{U}[0,2]@@@. What is the expected time the first customer leaves the store?
We need to find @@@E[ 8+ X+Y]@@@. The Proposition gives
$$ E[8+X+Y] = E[8]+E[X]+E[Y] = 8+ 1+1= 10.$$The expected time the first customer departs is @@@10:00\text{AM}@@@. It is important to note that in this problem it would be wrong to try to calculate the CDF or density of @@@X+Y@@@, because the data do not determine it uniquely. Why? For example, @@@Y@@@ may be equal to @@@X@@@ (a customer who, for whatever reason, chooses to stay for an amount of time equal to the time of arrival after the store opening), or @@@Y@@@ can be @@@2-X\sim \mbox{U}[0,2]@@@ (the customer decides to leave at time @@@10:00@@@). In the first case @@@X+Y=2X\sim\mbox{U}[0,4]@@@. In the latter case, @@@X+Y = 2@@@, a constant RV, hence discrete. The point is that the linearity of expectation allows us to avoid this discussion completely: it’s irrelevant.
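The two couplings just described can be simulated side by side; both marginals are @@@\mbox{U}[0,2]@@@, the laws of @@@X+Y@@@ differ, yet the sample means agree:

```python
# Two different joint distributions of (X, Y), same expectation of X + Y.
import random

random.seed(2)
n = 100_000
xs = [random.uniform(0, 2) for _ in range(n)]

mean_same = sum(2 * x for x in xs) / n         # Y = X, so X + Y ~ U[0,4]
mean_const = sum(x + (2 - x) for x in xs) / n  # Y = 2 - X, so X + Y = 2
print(mean_same, mean_const)  # both close to 2
```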
Here’s another one.
Let’s revisit Example 2. Let @@@X_1@@@ and @@@X_2@@@ be the indicators of the first ball being white and the second being white, respectively. Note that the events are not independent, but they have the same probability, equal to @@@\frac{\binom{3}{1}}{\binom{7}{1}} = \frac{3}{7}@@@, and therefore @@@E[X_1]=E[X_2]= \frac{3}{7}@@@. Now @@@X= X_1+X_2@@@. Therefore @@@E[X]=E[X_1]+E[X_2] = \frac{6}{7}@@@. Suppose now that we select four balls. What, then, is the expectation of the number of white balls? By linearity of expectation it is @@@\frac{4*3}{7}=\frac{12}{7}@@@. Finally, a sanity check: what if we select all balls? Then clearly we have all @@@3@@@ white balls, so the answer is @@@3@@@, and indeed, this is equal to @@@7*\frac{3}{7}@@@.
Here is yet another example.
In a barn, @@@100@@@ chicks sit peacefully in a circle. Suddenly, each chick randomly pecks the chick immediately to its left or right. What is the expected number of unpecked chicks? source
Before answering, let’s generalize to make it more interesting. Suppose each chick chooses a side with probability @@@p@@@ to be clockwise and @@@1-p@@@ for counterclockwise, and then pecks the chick to that side with probability @@@q@@@. In the original problem @@@p=1/2@@@ and @@@q=1@@@.
Label the chicks as @@@1,\dots,100@@@, and let @@@X_i@@@ be the indicator of chick @@@i@@@ not being pecked. Then @@@X_i@@@ is equal to @@@1@@@ if neither the clockwise neighbor nor the counterclockwise neighbor pecked it.
- The probability that the counterclockwise neighbor does not peck it is @@@p(1-q)+(1-p)@@@ (it chooses that side but decides not to peck, or it does not choose that side).
- Similarly, the probability that the clockwise neighbor does not peck is @@@(1-p)(1-q)+p@@@.
Since each of the events is determined by a different chick, they are independent. And so their intersection, the event that our own chick is not pecked, has probability @@@(p(1-q)+(1-p))((1-p)(1-q)+p)=(1-pq)(1-(1-p)q)@@@. Since @@@X_i@@@ is the indicator of this event, this number is also the expectation of @@@X_i@@@. Since the number of chicks not pecked is @@@X_1+\dots+X_{100}@@@, it follows from the additivity of expectation that the expected number of those left unpecked is @@@100 (1-pq)(1-(1-p)q)@@@. Note that this can be as small as @@@0@@@ (@@@p=1,q=1@@@) or as high as @@@100@@@ (@@@q=0@@@). In the case @@@p=1/2@@@ and @@@q=1@@@ we get @@@25@@@.
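A simulation of the original setting (@@@p=1/2@@@, @@@q=1@@@) agrees with the answer @@@25@@@:

```python
# Monte Carlo check of the chick-pecking answer with p = 1/2, q = 1.
import random

random.seed(3)

def unpecked(n_chicks=100):
    pecked = [False] * n_chicks
    for i in range(n_chicks):
        neighbor = (i + random.choice([-1, 1])) % n_chicks
        pecked[neighbor] = True          # every chick pecks (q = 1)
    return pecked.count(False)

trials = 2_000
avg = sum(unpecked() for _ in range(trials)) / trials
print(avg)  # close to 25
```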
I’m shooting my basketball @@@n\ge 6@@@ times. The probability I make a shot is @@@p@@@, and all shots are independent. What is the expected number of streaks of @@@6@@@ consecutive successful shots? source
Let @@@X_j,~j=1,\dots,n-5@@@ be the indicator of the event that a streak of @@@6@@@ successful shots begins at the @@@j@@@-th shot. Clearly @@@E[X_j] = P(X_j=1) =p^6@@@ for all @@@j@@@. The number of streaks is equal to @@@X_1+ \dots + X_{n-5}@@@. Therefore the expected number of streaks is @@@(n-5)p^6@@@.
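A simulation confirms the streak count, here with the illustrative choices @@@n=20@@@ and @@@p=0.7@@@ (note that streaks may overlap, which the indicator argument handles for free):

```python
# Monte Carlo check that the expected number of length-6 streaks
# in n shots with success probability p is (n - 5) p^6.
import random

random.seed(4)
n_shots, p = 20, 0.7

def count_streaks():
    shots = [random.random() < p for _ in range(n_shots)]
    return sum(all(shots[j:j + 6]) for j in range(n_shots - 5))

trials = 50_000
avg = sum(count_streaks() for _ in range(trials)) / trials
print(avg, (n_shots - 5) * p ** 6)  # the two numbers should be close
```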
At a conference each of the @@@100@@@ participants gets a unique name tag. When registering, the participants pick a tag uniformly without looking at it. Let @@@X@@@ be the number of participants who got their own name tag. What is the expectation of @@@X@@@?
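Here is a simulation you can use to check your own answer: shuffle the @@@100@@@ tags uniformly and count how many participants get their own.

```python
# Average number of fixed points of a uniformly random assignment
# of 100 name tags.
import random

random.seed(5)

def matches(n=100):
    tags = list(range(n))
    random.shuffle(tags)
    return sum(owner == tag for owner, tag in enumerate(tags))

trials = 50_000
avg = sum(matches() for _ in range(trials)) / trials
print(avg)
```

(Hint for the pencil-and-paper solution: indicators and linearity, exactly as in the previous examples.)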
We close this section with the calculation of the expectation for the main families of discrete RVs.
- '''Bernoulli''': Suppose @@@X\sim \mbox{Bern}(p)@@@. Then @@@E[X] = 1 *p + 0*(1-p)=p@@@.
- '''Binomial''': Suppose @@@X \sim \mbox{Bin}(n,p)@@@. Then @@@X= \sum_{j=1}^n X_j@@@, where @@@X_1,\dots,X_n@@@ are each @@@\mbox{Bern}(p)@@@. By linearity of expectation
$$ E[X] = \sum_{j=1}^n E[X_j] = np.$$
- '''Geometric''': Suppose @@@X\sim \mbox{Geom}(p)@@@. Then
$$ E[X] = \sum_{k=1}^\infty k(1-p)^{k-1}p = \frac 1p,$$where the sum can be evaluated by differentiating the geometric series @@@\sum_{k=0}^\infty x^k = \frac{1}{1-x}@@@ at @@@x=1-p@@@.
Want another idea? Let’s use Definition 2:
$$\begin{align*} E[X]& = \int_0^\infty P(X>t) dt\\ & = \sum_{j=0}^\infty \int_j^{j+1} P(X>t) dt = \sum_{j=0}^\infty P(X>j)\\ & = \sum_{j=0}^\infty (1-p)^j= \frac{1}{1-(1-p)}=\frac 1p. \end{align*}$$Here’s one more… @@@X@@@ represents the time of the first success:
- With probability @@@p@@@ we’re done at the first trial, and with probability @@@1-p@@@ the first trial fails so we need to continue.
- @@@\Rightarrow@@@ we count at least one trial with probability @@@1@@@, and
- with probability @@@1-p@@@ add the number of trials until the first success, but starting from the second trial. The expected number of trials (counting from the second trial) until the first success is, again, @@@E[X]@@@.
Putting both together, we have
$$E[X] = 1+ (1-p) E[X],$$and the only finite solution is @@@E[X]=\frac{1}{p}@@@.
Okay, but that may not seem rigorous enough, right? We will be able to justify this line of thinking in a nice and clean way when we deal with conditional expectation. In the meantime, let’s try to do this by manipulating power series. Try to identify the ideas described above as steps in the calculation:
$$\begin{align*} E[X] & = \sum_{k=1}^\infty k (1-p)^{k-1}p \\ & = 1*p + \sum_{k=2}^\infty k (1-p)^{k-1}p \\ & = 1*p + (1-p) \sum_{k=2}^\infty k (1-p)^{k-2}p\\ & = 1*p +(1-p) \sum_{j=1}^\infty (j+1) (1-p)^{j-1}p\\ & = 1*p + (1-p) \left ( \underset{=E[X]}{\underbrace{\sum_{j=1}^\infty j (1-p)^{j-1}p}} + \underset{=1}{\underbrace{\sum_{j=1}^\infty (1-p)^{j-1}p}}\right)\\ & = 1+ (1-p) E[X] \end{align*}$$- '''Hypergeometric''': A brute force calculation is not hard, but let’s rely on additivity of expectation. Recall: we’re selecting (without replacement) @@@n@@@ balls from a jar that contains @@@N@@@ balls, @@@K@@@ of which are white and @@@b=N-K@@@ are black, and @@@X@@@ is the number of white balls selected. If @@@X_1,\dots,X_n@@@ are the indicators that the first, second,… @@@n@@@-th ball is white, then these @@@n@@@ RVs are @@@\mbox{Bern}(K/N)@@@ (we showed this when studying independence), and thus each has expectation @@@K/N@@@. It follows from the additivity of the expectation that
$$ E[X] = \sum_{j=1}^n E[X_j] = \frac{nK}{N}.$$
- '''Negative Binomial''': Negative binomial with parameters @@@r@@@ and @@@p@@@, @@@X\sim \mbox{NB}(r,p)@@@. We take the liberty of not doing the direct calculation. Recall that @@@X@@@ counts the number of failures until the @@@r@@@-th success in a sequence of independent trials where the probability of success in each is @@@1-p@@@. If @@@X_r@@@ is the number of trials until the @@@r@@@-th success, then clearly @@@X=X_r-r@@@, while @@@X_r@@@ is a sum of @@@r@@@ @@@\mbox{Geom}(1-p)@@@ RVs (the number of trials between any two successive successes is @@@\mbox{Geom}(1-p)@@@). Therefore,
$$ E[X] = E[X_r] - r = \frac{r}{1-p} - r = \frac{rp}{1-p}.$$
- '''Poisson''': Suppose @@@X\sim \mbox{Pois}(\lambda)@@@. Then
$$ E[X] = \sum_{k=0}^\infty k\, e^{-\lambda}\frac{\lambda^k}{k!} = \lambda e^{-\lambda} \sum_{k=1}^\infty \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
- '''Discrete Uniform''': Suppose that @@@X@@@ is uniform on @@@\{1,\dots,n\}@@@. Then
$$ E[X] = \sum_{j=1}^n j\cdot \frac 1n = \frac 1n \underset{=\star}{\underbrace{\sum_{j=1}^n j}}.$$
Now @@@\sum_{j=1}^n j = \sum_{j=1}^n (n+1-j)@@@. Therefore
$$\star= \frac 12 \left (\sum_{j=1}^n j + \sum_{j=1}^n (n+1-j)\right)= \frac 12 n \times (n+1),$$and so
$$E[X] = \frac{n(n+1)}{2 n} = \frac{n+1}{2}.$$What is the expected number of Tails until I get the third Heads in an infinite sequence of independent coin tosses with probability @@@1/2@@@ for each?
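For the closing exercise, here is a simulation you can use to check your answer once you have it:

```python
# Average number of Tails seen before the third Heads, fair coin.
import random

random.seed(6)

def tails_until_third_heads():
    heads = tails = 0
    while heads < 3:
        if random.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return tails

trials = 100_000
avg = sum(tails_until_third_heads() for _ in range(trials)) / trials
print(avg)
```

(Hint: this is exactly the negative binomial situation discussed above.)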