Intro
Now that we know a lot about RVs and their distributions, we are going to study pairs, triples, or more random variables, objects known as random vectors. Of course, we are going to study the distributions of these objects too, known as joint distributions. This is guaranteed to make your life harder, but it is very important, as almost all applications of probability involve more than one RV.
Let’s consider our upcoming exam. A student’s grade is a RV, say @@@X@@@. The number of hours the student studied is another RV, @@@Y@@@. What is the probability I’ll end up with a grade lower than @@@70\%@@@ if I studied more than @@@4@@@ hours? What is the expected number of study hours among those students who got more than @@@80\%@@@? Dealing with events of the form “@@@X@@@ is in this range, and @@@Y@@@ is in that range” is the main topic of joint distributions.
Another example? The number of claims by customers of an insurance company per year is a random variable @@@N@@@. The amount of the @@@n@@@-th claim is a random variable @@@X_n@@@. The amount paid in claims is therefore the sum @@@X_1+\dots+X_N@@@, a sum of a random number of random variables. How do we calculate the probability that the insurance company will have to pay more than a certain amount in claims? What about the expectation?
I chose these two examples because they are simple and represent two extremes. The first describes a scenario where the randomness has two components (or three, etc), while the second involves an unbounded number of components (at least theoretically). There’s a pretty broad range, and random vectors are all around us. Can you think of an example yourself?
Before we continue, a word on nomenclature: “Joint distributions” = “Multivariate distributions”
Capisce?
Random Vectors and their Distribution
Suppose that @@@X@@@ and @@@Y@@@ are RVs defined on the same probability space. We mathematicians like order and - even more - notation. Instead of referring to the pair of RVs @@@X@@@ and @@@Y@@@, we simply refer to the pair as the ‘'’random vector @@@(X,Y)@@@’’’. The random vector is a single object which reminds us that all is defined within the same probability space. It gives both order — “@@@X@@@ is the first component of the random vector” — and, even better, new notation. Everybody is satisfied. As customary with vectors, we will refer to @@@X@@@ and @@@Y@@@ as the ‘'’first and second components of @@@(X,Y)@@@,’’’ respectively. We will also refer to them as the first, respectively, second marginals of the random vector.
For simplicity of the presentation, we will mostly (but not exclusively) discuss random vectors having two components, that is random vectors taking values in @@@\R^2@@@, although the theory usually extends directly to higher dimensions.
Let’s do our first random vector. We will keep it simple so we can actually calculate some things without knowing much. So let’s get rollin’. A die of course.
We play a game by rolling a die twice. Let @@@X@@@ be the value of the smaller number and let @@@Y@@@ be the value of the larger number. Let’s calculate the probability that the RV @@@(X,Y)@@@ is equal to @@@(i,j)@@@. This event is of course @@@\{(X,Y)=(i,j)\}@@@, but it’s also the intersection @@@\{X=i\}\cap \{Y=j\}@@@. Both mean the same. Let’s calculate: clearly, @@@P((X,Y)=(i,j))@@@ can be only positive if @@@i\le j@@@ and both @@@i,j \in \{1,\dots,6\}@@@. For such pairs of @@@(i,j)@@@ we need to consider two cases.
- @@@i<j@@@ (e.g. @@@X=3@@@ and @@@Y=4@@@). This can be obtained by either rolling @@@i@@@ followed by @@@j@@@, or by rolling @@@j@@@ followed by @@@i@@@. Therefore the probability is @@@2*\frac{1}{36}=\frac{1}{18}@@@.
- @@@i=j@@@ (e.g. @@@X=4@@@ and @@@Y=4@@@). This can be only obtained by rolling @@@i@@@, then again @@@i@@@, so the probability is @@@\frac{1}{36}@@@.
Summarizing,
$$ P((X,Y)=(i,j)) = \begin{cases} \frac{1}{18} & 1\le i < j \le 6 \\ \frac{1}{36} & i=j \in \{1,\dots,6\}\\ 0 & \mbox{otherwise.}\end{cases} $$Let’s continue developing this example. We know the probability that the vector attains any given value, but what is the distribution of, say, @@@X@@@? Clearly, the support of @@@X@@@ is @@@\{1,2,3,4,5,6\}@@@, so @@@X@@@ is a discrete RV, and to find its distribution, it is enough to find its PMF. Since we can split the event @@@\{X=i\}@@@ according to the values of @@@Y@@@, the law of total probability gives us
$$P(X=i) = \sum_{j}P(X=i \mbox{ and }Y=j)=\sum_{j}P((X,Y)=(i,j)).$$Recall that the probabilities in the sum on the right are only positive if @@@j\ge i@@@. The one corresponding to @@@j=i@@@ is @@@\frac{1}{36}@@@, while the remaining ones, if any, are equal to @@@\frac{1}{18}@@@. This gives
$$ P(X= i) = \begin{cases} \frac{1}{36}+(6-i) \frac{1}{18} & i\in \{1,\dots,6\} \\ 0 & \mbox{otherwise.} \end{cases} $$So @@@X@@@ is the most likely to be @@@1@@@ and least likely to be @@@6@@@. The whole procedure makes sense, right? We will make this into a method very soon.
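Before moving on, here is a short Python sketch (my own check, not part of the text) that enumerates the @@@36@@@ equally likely rolls and confirms both the joint PMF and the marginal of @@@X@@@ computed above:

```python
# Enumerate the 36 ordered outcomes of two die rolls and tabulate the joint PMF
# of (X, Y) = (min, max), then recover the marginal of X by summing j out.
from fractions import Fraction
from itertools import product

joint = {}
for a, b in product(range(1, 7), repeat=2):          # the 36 equally likely rolls
    i, j = min(a, b), max(a, b)
    joint[(i, j)] = joint.get((i, j), Fraction(0)) + Fraction(1, 36)

assert joint[(3, 4)] == Fraction(1, 18)              # case i < j
assert joint[(4, 4)] == Fraction(1, 36)              # case i = j

# marginal of X, compared with 1/36 + (6 - i)/18
for i in range(1, 7):
    p_x = sum(p for (a, _), p in joint.items() if a == i)
    assert p_x == Fraction(1, 36) + (6 - i) * Fraction(1, 18)
print("joint PMF and marginal of X match the formulas")
```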
As with RVs, the random vector @@@(X,Y)@@@ induces a probability distribution, known as the ‘'’joint distribution’’’ of @@@X@@@ and @@@Y@@@ and denoted by @@@P_{X,Y}@@@. This is a probability measure describing the probabilities assigned to @@@(X,Y)@@@ being in each Borel subset of @@@\R^2@@@:
$$P_{X,Y}(B) = P((X,Y)\in B),~ B\in {\cal B}(\R^2).$$You don’t need to remember what Borel sets are because I’m going to remind you. Practically, these are all subsets of @@@\R^2@@@ you can describe in words or formulas. Mathematically, the Borel @@@\sigma@@@-algebra, the set of all Borel sets, is the smallest @@@\sigma@@@-algebra containing all rectangles (and then all complements of such rectangles, unions of such rectangles, complements of such unions, unions of all above subsets, complements, etc…).
As with RVs, we also have CDFs. Since it’s more complicated than the RV case, we will try to keep it really short. The ‘'’joint CDF’’’, denoted by @@@F_{X,Y}@@@ is defined as
$$F_{X,Y} (x,y) = P(X\le x, Y\le y)=P_{X,Y}( (-\infty,x]\times (-\infty,y]).$$As in the case of RVs, the CDF determines the distribution of @@@(X,Y)@@@: the distribution can be recovered from the CDF.
The distribution of each of the marginals @@@X@@@ and @@@Y@@@ of @@@(X,Y)@@@ is called ‘'’a marginal distribution’’’. The joint distribution determines the marginal distributions, but the converse is not true in general. The procedure to extract the marginal of @@@X@@@ from that of @@@(X,Y)@@@ is this:
$$P( X\in A) = P(X \in A, Y \in \R) = P_{X,Y}( A \times \R), $$with similar formulas for @@@Y@@@. As for the relation between the joint CDF and the CDF of the marginals, observe first that
$$\{X\le x\} = \{X\le x,Y< \infty\}=\cup_{n=1}^\infty \{X\le x, Y \le n\}.$$Thus,
$$P(X\le x ) = P(\cup_{n=1}^\infty \{X\le x,Y\le n\}) = \lim_{n\to\infty} P(X\le x,Y\le n) = \lim_{n\to\infty} F_{X,Y} (x,n),$$where the second equality follows from continuity of probability measures with respect to monotone sequences of events. By a simple argument based on monotonicity, we can take the limit as @@@y\to\infty@@@ (not just along the integers), to obtain:
$$\begin{equation} \boxed { \label{eq:cdf_marginal} F_X (x) = \lim_{y\to\infty} F_{X,Y}(x,y) }\end{equation}$$Similarly,
$$P(Y\in A) = P_{X,Y}(\R \times A),\mbox{ and } F_Y(y) = \lim_{x\to\infty} F_{X,Y}(x,y).$$Let @@@(X,Y)@@@ be a random vector and suppose @@@F_{X,Y} (x,y) = xy^2@@@ for @@@0\le x,y \le 1@@@.
- Extend @@@F_{X,Y}@@@ to the entire plane.
- What is the CDF of @@@X@@@ and of @@@Y@@@?
- What is the probability of @@@\frac 13< X \le \frac 12, Y > 1/3@@@? Express your answer in terms of the CDF.
- We only defined @@@F_{X,Y}@@@ on @@@[0,1]^2@@@. But @@@F_{X,Y}(1,1) = 1@@@, that is @@@P(X\le 1,Y\le 1)=1@@@. Since this is the probability of the intersection of two events, it is less than or equal to the probability of each one, and since probabilities are never larger than @@@1@@@, we have that @@@P(X\le 1) = P(Y\le 1) =1@@@. Because of this, we conclude that @@@F_{X,Y} (x,1) = P(X\le x, Y\le 1) = P(X\le x)@@@ (the probability of the intersection of an event with another event of probability @@@1@@@ is always equal to the probability of the former), and similarly @@@F_{X,Y} (1,y) = P(Y\le y)@@@. The last two observations identify the CDFs of @@@X@@@ and @@@Y@@@, but they also allow us to conclude that @@@P(X\le 0) =P(Y\le 0)=0@@@. So now we know that the two RVs @@@X@@@ and @@@Y@@@ take values in @@@[0,1]@@@. We will use this to extend @@@F_{X,Y}@@@ to the plane. Clearly it has to be zero on the second, third and fourth quadrants, because evaluating @@@F_{X,Y}@@@ at any point there gives the probability of an event of zero probability (@@@X@@@ or @@@Y@@@ being negative). As for the first quadrant, all we need to know is that @@@P(X\le 1)=1@@@ and @@@P(Y\le 1)=1@@@, so the probability of @@@\{X\le x,Y\le y \}@@@ is the same as that of @@@\{X\le \min(x,1),Y \le \min(y,1)\}@@@, and this allows us to conclude that
$$F_{X,Y}(x,y) = \begin{cases} \min(x,1)\min(y,1)^2 & x\ge 0,~y\ge 0 \\ 0 & \mbox{otherwise.}\end{cases}$$
The first line gives the first quadrant and the second gives all remaining ones.
- If we let @@@y\to\infty@@@ in the above formula (or use the previously derived observation @@@P(X\le x) = F_{X,Y}(x,1)@@@), we conclude that @@@X\sim \mbox{U}[0,1]@@@, and repeating for @@@Y@@@ we have that the CDF of @@@Y@@@ is zero for negative numbers, @@@y^2@@@ for @@@y \in [0,1)@@@ and equal to @@@1@@@ for @@@y\ge 1@@@.
- We show how to manipulate the expression so that it becomes sums and differences of values of the CDFs. We will be able to do it much faster after we learn about independence of RVs. Writing the event as a difference of events of the form @@@\{\frac 13< X \le \frac 12, Y\le y\}@@@, we get
$$P(\tfrac 13< X \le \tfrac 12, Y > \tfrac 13) = F_X(\tfrac 12) - F_X(\tfrac 13) - F_{X,Y}(\tfrac 12,\tfrac 13) + F_{X,Y}(\tfrac 13,\tfrac 13).$$
The joint CDF of the random vector @@@(X,Y)@@@ is
$$\begin{equation} \label{eq:joint_CDF} F_{X,Y}(x,y)= 1-e^{-x}-e^{-2y}+e^{-x-2y},~x,y>0.\end{equation}$$Find the CDF of @@@Y@@@. What is the (name and parameter!) distribution of this RV?
Consider the random vector @@@(X,Y)@@@ with joint CDF \eqref{eq:joint_CDF} from Exercise 1. Express @@@P(2\le X\le 3,Y>1)@@@ in terms of the CDF.
Finally, let’s consider the more general higher dimensional setup. It’s ''mutatis mutandis'' the same, with the appropriate changes.
If @@@X_1,\dots,X_n@@@ are random variables defined on the same probability space (say outcomes of rolling a die repeatedly @@@n@@@ times, listed in chronological order), then the corresponding random vector @@@(X_1,\dots,X_n)@@@ induces a joint distribution @@@P_{X_1,\dots,X_n}@@@ on the Borel subsets of @@@\R^n@@@, given by
$$P_{X_1,\dots,X_n}(B) = P((X_1,\dots,X_n) \in B),~B \in {\cal B}(\R^n),$$and the joint distribution function for the random vector is given by
$$F_{X_1,X_2,\dots,X_n}(x_1,\dots,x_n) = P(X_1\le x_1,\dots,X_n \le x_n).$$If, for example, we want to find the joint CDF of @@@X_1@@@ and @@@X_3@@@, then this is given by
$$F_{X_1,X_3} (x_1,x_3) = \lim_{y\to\infty} F_{X_1,X_2,\dots,X_n}(x_1,y,x_3,y,\dots,y). $$
Independent Random Variables
Definition and Properties
We know what independent events are: knowing whether one has occurred or not does not affect the probability of the other (the probability of one conditioned on the other is the same as the probability without conditioning). Similarly, we want a notion of when knowledge of one RV does not give any information on another RV(s). Why? If we are able to describe complex systems with such basic building blocks, the calculations will be more manageable. This leads to the following definition of independent RVs, which is an extension of the [[Probability spaces#indep_events|definition of independent events]].
Let @@@(X,Y)@@@ be a random vector. Then the RVs @@@X@@@ and @@@Y@@@ are called '''independent''' if for any two intervals @@@I,J@@@, the events @@@\{X\in I\}@@@ and @@@\{Y\in J\}@@@ are independent, that is
$$P(X\in I,Y\in J)= P(X \in I) P(Y\in J). $$What we’re saying is that independence of the RVs @@@X@@@ and @@@Y@@@ is this: Any statement about @@@X@@@ is independent of any statement about @@@Y@@@.
An easy calculation (try it!) would convince you that the following is true:
Let @@@(X,Y)@@@ be a random vector. Then @@@X@@@ and @@@Y@@@ are independent if and only if @@@\{X\le x\}@@@ and @@@\{Y\le y\}@@@ are independent for any @@@x,y \in \R@@@ or, equivalently,
$$F_{X,Y}(x,y) = F_X (x) F_Y(y). $$Show that the events @@@A@@@ and @@@B@@@ are independent if and only if the RVs @@@X={\bf 1}_A@@@ and @@@Y={\bf 1}_B@@@ (the indicators of @@@A@@@ and @@@B@@@, respectively), are independent.
Let @@@(X,Y)@@@ be a random vector with the property that @@@\{X\le x\}@@@ and @@@\{Y>y\}@@@ are independent for any @@@x,y \in\R@@@. Are @@@X@@@ and @@@Y@@@ independent?
Let @@@(X,Y)@@@ be the random vector with CDF as given in \eqref{eq:joint_CDF} from Exercise 1 and let @@@\tilde X= X+3,\tilde Y = 2Y@@@. Find the joint CDF of @@@\tilde X@@@ and @@@\tilde Y@@@ and determine if they are independent or not.
Before moving to more concrete settings, we need to point out the following:
Let @@@(X,Y)@@@ be a random vector. Then its marginals are independent if and only if @@@f(X)@@@ and @@@g(Y)@@@ are independent for any choice of functions @@@f@@@ and @@@g@@@.
Recall that if @@@h:\R\to \R@@@ is a function, then for any set @@@A\subseteq\R@@@, @@@h^{-1}(A) = \{x : h(x) \in A\}@@@. In particular, for a RV @@@X@@@, the events @@@\{h(X) \in I\}@@@ and @@@\{X \in h^{-1}(I)\}@@@ are equal. With this, let’s continue.
- @@@\Rightarrow@@@ If @@@X@@@ and @@@Y@@@ are independent, then for any sets @@@I@@@ and @@@J@@@,
$$P(f(X)\in I, g(Y)\in J) = P(X\in f^{-1}(I), Y\in g^{-1}(J)) = P(X\in f^{-1}(I))\, P(Y\in g^{-1}(J)) = P(f(X)\in I)\, P(g(Y)\in J).$$
- @@@\Leftarrow@@@ Choose any Borel subsets @@@I@@@ and @@@J@@@ of @@@\R@@@, and let @@@f(x) = {\bf 1}_{I} (x), g(y) ={\bf 1}_{J} (y)@@@. Then @@@\{X \in I\} = \{f(X) =1\}@@@ and @@@\{Y\in J\}= \{g(Y) =1\}@@@. Since the two events on the right are independent, so are the two events on the left.
- Suppose that @@@X@@@ and @@@Y@@@ are independent. Explain why @@@X^2@@@ and @@@Y@@@ are independent.
- Is the converse true? That is, if @@@X^2@@@ and @@@Y@@@ are independent then necessarily @@@X@@@ and @@@Y@@@ are independent?
Now let’s see what independence allows us to do.
- Let @@@(X,Y)@@@ be a random vector with independent marginals, @@@X\sim \mbox{Exp}(\lambda)@@@ and @@@Y\sim \mbox{Exp}(\mu)@@@. Show that @@@\min(X,Y)\sim \mbox{Exp}(\mu+\lambda)@@@.
Let @@@Z=\min(X,Y)@@@. Then for @@@z>0@@@:
$$\begin{align*} F_Z(z) &= P(Z\le z)\\ & =P( \min(X,Y) \le z) \\ & = 1-P(\min(X,Y) >z) \\ &= 1- P(X>z,Y>z)=1-P(X>z)P(Y>z) \\ & = 1-e^{-\lambda z} e^{-\mu z}\\ & =1-e^{-(\lambda+ \mu)z} \end{align*}$$Also since @@@X,Y \ge 0@@@, it follows that @@@F_Z(z)=0@@@ for @@@z\le 0@@@. Therefore @@@Z@@@ has CDF of @@@\mbox{Exp}(\lambda+ \mu)@@@, that is @@@Z \sim \mbox{Exp}(\lambda + \mu)@@@.
- What if @@@X@@@ and @@@Y@@@ were not independent? Let’s consider the most extreme case where one is a function of the other. As before take @@@X\sim \mbox{Exp}(\lambda)@@@, and set @@@Y=(\lambda/\mu) X@@@. Observe that for @@@y>0@@@,
$$P(Y>y) = P\big(X> \tfrac{\mu}{\lambda}\, y\big) = e^{-\lambda \frac{\mu}{\lambda} y} = e^{-\mu y}.$$
Therefore @@@Y\sim \mbox{Exp}(\mu)@@@. Clearly, @@@Y\le X@@@ if @@@\lambda \le \mu@@@ and otherwise @@@X<Y@@@. In other words, @@@\min (X,Y)@@@ is @@@Y@@@ when @@@\lambda \le \mu@@@ and @@@X@@@ when @@@\mu <\lambda@@@, and therefore the minimum is not @@@\mbox{Exp}(\lambda+ \mu)@@@-distributed.
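A minimal Monte Carlo sketch (the rates @@@\lambda=1.5@@@, @@@\mu=2.5@@@ and the sample size are my own choices) confirming the first part above, namely that the minimum of independent exponentials is exponential with the summed rate:

```python
# Minimum of independent Exp(lambda) and Exp(mu) should behave like Exp(lambda+mu).
import numpy as np

rng = np.random.default_rng(0)
lam, mu, n = 1.5, 2.5, 10**6                 # hypothetical rates and sample size
x = rng.exponential(1 / lam, n)              # numpy parametrizes by the mean 1/rate
y = rng.exponential(1 / mu, n)
z = np.minimum(x, y)

print(z.mean(), 1 / (lam + mu))              # sample mean vs 1/(lambda + mu)
t = 0.3
print((z > t).mean(), np.exp(-(lam + mu) * t))   # tail P(Z > t) vs e^{-(lambda+mu)t}
```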
Suppose that @@@X@@@ is independent of itself. Show that there exists some constant @@@c@@@ such that @@@P(X=c)=1@@@.
Higher dimensional analogs
Armed with the notion of pairs of independent RVs, let’s make the general definition.
- Let @@@(X_1,\dots,X_n)@@@ be a random vector. The RVs @@@X_1,\dots,X_n@@@ are independent if for any intervals @@@I_1,\dots,I_n@@@ we have
$$P(X_1 \in I_1,\dots, X_n \in I_n) = P(X_1\in I_1)\cdots P(X_n \in I_n).$$
- More generally, if @@@X_1,X_2,\dots@@@ are random variables on the same probability space, we say that they are independent if the components of @@@(X_1,X_2,\dots,X_n)@@@ are independent for all @@@n\ge 1@@@.
Now for an important notion. If the random variables @@@X_1,X_2,\dots@@@ (possibly finite) defined on the same probability space are independent and identically distributed (distribution of all is the same), we simply write that they form an ‘'’IID sequence’’’ or that they are ‘'’IID’’’.
To establish independence it is enough to consider the joint CDF.
The marginals of the random vector @@@(X_1,\dots,X_n)@@@ are independent if and only if
$$F_{X_1,X_2,\dots,X_n} (x_1,\dots,x_n) = F_{X_1}(x_1) F_{X_2} (x_2) \dots F_{X_n}(x_n), $$for all @@@x_1,x_2,\dots,x_n@@@.
Here is a very important note. Clearly, if @@@X_1,\dots,X_n@@@ are independent then any pair @@@X_i@@@, @@@X_j@@@, @@@i\ne j@@@ are independent, but the converse is not true. Want an example? Quite a while back we gave an example of three events @@@A,B,C@@@ which were not independent, yet any two of which were independent. Let @@@X_1={\bf 1}_A@@@, @@@X_2 ={\bf 1}_B@@@ and @@@X_3= {\bf 1}_C@@@. Then @@@X_i@@@ is independent from @@@X_j@@@ if @@@j\ne i@@@, but @@@X_1,X_2@@@ and @@@X_3@@@ are not independent: the probability they are all equal to @@@1@@@ is zero, while (in our example) the probability each one is equal to @@@1@@@ is @@@1/2@@@.
Of course, Proposition 2, stated for two components, has its own higher-dimensional analog.
The marginals of the random vector @@@(X_1,\dots,X_n)@@@ are independent if and only if for any functions @@@f_1,\dots,f_n@@@, the random variables @@@f_1(X_1),\dots,f_n (X_n)@@@ are independent.
Expectation of Products
As you might expect, the multiplicative property of probabilities propagates through expectation in the following way:
Let @@@(X_1,X_2,\dots,X_n)@@@ be a random vector with independent marginals, each having finite expectation. Then
$$E[ X_1X_2\dots X_n ] = E[X_1]E[X_2]\cdots E[X_n]. $$By Proposition 4, if @@@X_1,X_2,\dots,X_n@@@ are independent then @@@f_1(X_1),\dots,f_n (X_n)@@@ are also independent, and therefore the proposition gives us
$$E[f_1(X_1)f_2(X_2)\cdots f_n(X_n)]=E[f_1(X_1)]\cdots E[f_n(X_n)], $$provided @@@E[f_i(X_i)]@@@, @@@i=1,\dots,n@@@ are all finite.
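Here is a quick numerical sketch (distributions chosen arbitrarily by me) of the product rule for expectations of independent RVs and of functions of them:

```python
# For independent X and Y, E[XY] should match E[X]E[Y], and the same holds for
# products of functions of X and Y.
import numpy as np

rng = np.random.default_rng(10)
n = 10**6
x = rng.exponential(2.0, n)          # X ~ Exp(1/2), mean 2 (hypothetical choice)
y = rng.uniform(0, 1, n)             # Y ~ U[0,1], independent of X
print((x * y).mean(), x.mean() * y.mean())                        # about 2 * 0.5 = 1
print((np.sqrt(x) * y**2).mean(), np.sqrt(x).mean() * (y**2).mean())
```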
The return rate for my portfolio for each of the four quarters of the year, expressed in percentage points, is a random variable @@@X_i,~i=1,\dots,4@@@. Assuming that the return rates are independent, find the expected annual return rate.
Well, after the first quarter I have @@@1+X_1/100@@@ of what I started with, and after the second quarter I have @@@(1+X_1/100)(1+X_2/100)@@@ of what I started with, etc. Therefore after @@@4@@@ quarters I have @@@\prod_{i=1}^4 (1+X_i/100)@@@ of what I started with, so that my return rate is @@@R= 100(\prod_{i=1}^4 (1+X_i/100) -1) @@@ (multiplication by @@@100@@@ to express in percentage points). The expected return rate is therefore
$$\begin{align*} E[R ]&= E \left[100 \left(\prod_{i=1}^4 (1+X_i/100) -1\right) \right]\\ & = 100\, E\left[ \prod_{i=1}^4 (1+X_i/100) -1 \right] \\ & = 100\, E\left[ \prod_{i=1}^4 (1+X_i/100) \right]-100 \\ & = 100 \left (\prod_{i=1}^4 E [1+X_i/100] - 1\right)\\ & = 100 \left ( \prod_{i=1}^4 (1+ E[X_i]/100)-1\right) \end{align*}$$(the “@@@-1@@@” in the second line is outside the product)
The cost to the producers of a rock festival is approximately @@@\$ n e^{-n/100}@@@, where @@@n@@@ is the number of tickets sold. Suppose that the number of tickets sold is @@@N\sim \mbox{Bin}(1000,0.3)@@@. Find the expected cost per patron.
The cost per patron is @@@e^{-N/100}@@@. But since @@@N\sim \mbox{Bin}(1000,0.3)@@@, it is a sum of @@@1000@@@ independent @@@\mbox{Bern}(0.3)@@@, @@@X_1,\dots,X_{1000}@@@. Thus,
$$\begin{align*} E[e^{-N/100}]&=E[e^{-\sum_{j=1}^{1000} X_j/100}]= E[\prod_{j=1}^{1000} e^{-X_j/100}]\\ & =\prod_{j=1}^{1000} E[e^{-X_j/100}]\\ & = \prod_{j=1}^{1000} (0.3\, e^{-1/100}+ 0.7)\approx \$ 0.05. \end{align*}$$
Variance of sums of Independent RVs
Suppose that @@@(X,Y)@@@ is a vector with independent marginals. Let’s calculate the variance of @@@X+Y@@@. Intuitively, each variable carries its own variation from its own expectation, and since they are independent one’s variation does not affect the variation of the other, and we can expect the variation to be additive. Let’s derive this:
$$\begin{align*} E[ (X+Y)^2 ] & = E[X^2] + 2E[XY] + E[Y^2] \\ & = \sigma^2_X + E[X]^2 + 2 E[X] E[Y] + \sigma^2_Y + E[Y]^2 \\ & = \sigma^2_X + \sigma^2_Y + E[X]^2 + 2 E[X] E[Y] + E[Y]^2 \\ & = \sigma^2_X + \sigma^2_Y + (E[X+Y])^2. \end{align*}$$Subtracting @@@(E[X+Y])^2@@@ from both sides gives @@@\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y@@@, verifying what we have guessed. We can generalize this to more than two RVs by induction (adding one at a time) to obtain the following proposition
Suppose that @@@(X_1,X_2,\dots,X_n)@@@ is a random vector with independent marginals, each having finite variance. Let @@@S_n = X_1+\dots+ X_n@@@. Then @@@S_n@@@ has finite variance and
$$\sigma^2_{S_n} = \sigma_{X_1}^2+ \dots + \sigma^2_{X_n}.$$Use this to derive the variance for @@@\mbox{Bin}(n,p)@@@.
Remember the negative binomial, @@@X_r\sim \mbox{NB}(r,p)@@@? We calculated its expectation earlier. Let’s redo it.
Recall that if @@@X_r@@@ is the number of trials until the @@@r@@@-th success in a sequence of independent trials with probability @@@1-p@@@ of success in each, then @@@X=X_r-r@@@, that is, @@@X@@@ is the number of failures before the @@@r@@@-th success. Since the variance does not change when adding a constant, @@@\sigma^2_X = \sigma^2_{X_r}@@@. Also, observe that @@@X_r@@@ is a sum of @@@r@@@ independent @@@\mbox{Geom}(1-p)@@@ RVs. Indeed, the number of trials until the first success is @@@\mbox{Geom}(1-p)@@@, and the number of additional trials until the next success is also @@@\mbox{Geom}(1-p)@@@, independent of the former. More generally, the numbers of trials between successive successes are all independent @@@\mbox{Geom}(1-p)@@@, as can be shown by simple induction. The variance of @@@\mbox{Geom}(1-p)@@@ is @@@\frac{p}{(1-p )^2}@@@, as we previously calculated, and therefore,
$$\sigma^2_X =\sigma^2_{X_r} =\frac{rp}{(1-p)^2}. $$
Discrete Random Vectors
The Joint PMF
Just like the case of a single discrete RV, we have the notion of discrete random vectors, and corresponding PMFs which determine the distribution.
If @@@X@@@ and @@@Y@@@ are both discrete, then the random vector @@@(X,Y)@@@ has a ‘'’joint PMF @@@p_{X,Y}@@@’’’ given by
$$\begin{equation} \boxed { \label{eq:ind_pmf} p_{X,Y} (x,y) = P(X=x,Y=y). }\end{equation}$$The connection between the PMF and the CDF is given by
$$\begin{equation} \boxed { \label{eq:disc_joint_cdf} F_{X,Y} (x,y) = \sum_{s\le x,t\le y} p_{X,Y}(s,t). }\end{equation}$$Conversely, if @@@p@@@ is a nonnegative function on @@@\R^2@@@ which is strictly positive only on a finite or countable subset of @@@\R^2@@@ and satisfies
$$\sum_{x,y} p(x,y) =1,$$then @@@p@@@ is the PMF of a random vector with CDF given by \eqref{eq:disc_joint_cdf}.
Toss a fair die repeatedly until you see an even number. Let @@@X@@@ denote the number of tosses, and let @@@Y@@@ denote the even number observed. What is the PMF of the random vector @@@(X,Y)@@@?
The random vector @@@(X,Y)@@@ can take any value of the form @@@(n,e)@@@, where @@@n=1,2,\dots@@@ and @@@e=2,4,6@@@. The event @@@(X,Y)=(n,e)@@@ corresponds to all first @@@n-1@@@ tosses being odd, and @@@n@@@-th toss being @@@e@@@. Since tosses are independent, the probability is then equal to
$$p_{X,Y}(n,e) = (\frac{3}{6})^{n-1} \frac{1}{6} = \frac{1}{2^{n}}\frac{1}{3}.$$Summarizing:
$$\begin{equation} \label{eq:until_even} p_{X,Y} (n,e) = \begin{cases} \frac{1}{3 \cdot 2^n} & n =1,2,\dots,~e=2,4,6 \\ 0 & \mbox{otherwise}\end{cases} \end{equation}$$
Marginals from Joint PMF
So suppose we have a PMF for a random vector. How do we get the PMF of each of its marginals, say @@@X@@@? We simply “integrate @@@y@@@ out”:
$$\begin{equation} \boxed { \label{eq:pmf_marginal} p_X(x) = \sum_{y} P(X=x,Y=y) =\sum_y p_{X,Y}(x,y). }\end{equation}$$Similarly,
$$p_Y(y) = \sum_x p_{X,Y}(x,y). $$Let’s continue Example 7, where we roll a die repeatedly until we see an even number. The number of rolls is @@@X@@@ and the number appearing on the last roll is @@@Y@@@. What are the PMFs of @@@X@@@ and @@@Y@@@?
- Let’s start with @@@X@@@. If @@@n=1,2,\dots@@@, then
$$\begin{equation} \label{eq:wait_marginal} p_X(n) = \sum_{e\in\{2,4,6\}} p_{X,Y}(n,e) = 3\cdot \frac{1}{3\cdot 2^n} = \frac{1}{2^n}=(\frac{1}{2})^{n-1}\frac{1}{2}, \end{equation}$$
and for all other @@@n@@@, @@@p_X(n)=0@@@. This is the PMF of @@@\mbox{Geom}(\frac 12)@@@. Therefore @@@X\sim \mbox{Geom}(1/2)@@@.
- What about @@@Y@@@? For @@@e=2,4,6@@@, we have
$$\begin{equation} \label{eq:even_marginal} p_Y(e) = \sum_{n=1}^\infty p_{X,Y}(n,e) = \sum_{n=1}^\infty \frac{1}{3\cdot 2^n} = \frac{1}{3}, \end{equation}$$
while for all other @@@e@@@, @@@p_Y(e)=0@@@. Therefore @@@Y@@@ is uniform on @@@\{2,4,6\}@@@.
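A simulation sketch (my own, with an arbitrary number of repetitions) of the last two examples: it estimates the joint PMF \eqref{eq:until_even} and the marginals \eqref{eq:wait_marginal} and \eqref{eq:even_marginal}:

```python
# Roll a fair die until an even number appears; compare empirical frequencies
# with p_{X,Y}(n, e) = 1/(3 * 2^n), p_X(n) = (1/2)^n and p_Y(e) = 1/3.
import numpy as np

rng = np.random.default_rng(3)
trials = 200_000
counts = {}
for _ in range(trials):
    n = 0
    while True:
        n += 1
        roll = int(rng.integers(1, 7))
        if roll % 2 == 0:
            counts[(n, roll)] = counts.get((n, roll), 0) + 1
            break

for (n, e) in [(1, 2), (2, 4), (3, 6)]:
    print((n, e), counts.get((n, e), 0) / trials, 1 / (3 * 2**n))
p_x2 = sum(c for (n, _), c in counts.items() if n == 2) / trials
p_y4 = sum(c for (_, e), c in counts.items() if e == 4) / trials
print(p_x2, 0.25, p_y4, 1 / 3)       # p_X(2) = 1/4 and p_Y(4) = 1/3
```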
Independence of Marginals
We already have a criterion for independence, Proposition 1. When dealing with discrete random vectors, all is much simpler if we switch to working with the PMFs instead.
Let @@@(X,Y)@@@ be a random vector with joint PMF @@@p_{X,Y}@@@.
- Let @@@p_X@@@ and @@@p_Y@@@ denote the PMFs of @@@X@@@ and @@@Y@@@. Then @@@X@@@ and @@@Y@@@ are independent if and only if
$$\begin{equation} \label{eq:disc_vec_ind} p_{X,Y}(x,y) = p_X(x)\, p_Y(y) \end{equation}$$
for all @@@x@@@ and @@@y@@@ in the respective supports of @@@X@@@ and @@@Y@@@.
- @@@X@@@ and @@@Y@@@ are independent if and only if @@@p_{X,Y}@@@ is a product of a function of @@@x@@@ and a function of @@@y@@@. That is, there exist functions @@@f@@@ and @@@g@@@ such that \begin{equation} p_{X,Y} (x,y) = f(x) g(y) \end{equation} for '''all''' @@@x@@@ and @@@y@@@.
Note that the two criteria above are equivalent (both are necessary and sufficient). We will mostly work with the first, and please be very careful with the second. Using it improperly leads to a lot of errors, so make sure you understand Example 14.
We have one fair @@@6@@@-face die and toss it twice. Let @@@X@@@ denote the number in the first toss and @@@Y@@@ the number in the second toss. Then @@@X@@@ and @@@Y@@@ are independent uniform on @@@\{1,\dots,6\}@@@.
As usual, without any assumptions, we take @@@P@@@ as the uniform distribution on the @@@36@@@ possible outcomes. Indeed, choose @@@i,j\in \{1,\dots,6\}@@@; then @@@P(X=i,Y=j)=\frac{1}{36}=\frac 16 \cdot \frac 16 = P(X=i)P(Y=j)@@@, so the independence criterion holds.
Let’s repeat the same experiment as in the last example, and let @@@S=X+Y@@@. Then @@@S@@@ and @@@X@@@ are not independent. Why? @@@P(X=6)=\frac 16@@@, but @@@P(X=6|S=12) \ne P(X=6)@@@: conditioning the event @@@\{X=6\}@@@ on the event @@@\{S=12\}@@@ changes the probability of the former, hence the two events are not independent.
- All we need is one case to ruin independence.
- Note, for example, that @@@P(S=7) = \frac{6}{36}@@@, and therefore, for any @@@i\in\{1,\dots,6\}@@@,
$$P(X=i,S=7) = P(X=i,Y=7-i) = \frac{1}{36} = \frac 16 \cdot \frac{6}{36} = P(X=i)\,P(S=7).$$
That is for every @@@i@@@, @@@\{X=i\}@@@ and @@@\{S=7\}@@@ are independent, although the random variables @@@X@@@ and @@@S@@@ are not independent.
Show that the random variables @@@X@@@ and @@@{\bf 1}_{\{7\}}(S)@@@ from the last example are independent.
Now for some unfinished business…
Are the RVs @@@X@@@ and @@@Y@@@ from Example 7 independent? We only need to check for @@@n=1,2,\dots@@@ and @@@e=2,4,6@@@, the respective supports. This can be seen from our formula for the PMF \eqref{eq:until_even}, or from the formula for the marginals we obtained in Example 8, which we will now use to prove the independence. We have
$$p_{X,Y}(n,e)\overset{\eqref{eq:until_even}}{=} \frac{1}{3\cdot 2^n} \overset{\eqref{eq:wait_marginal},\eqref{eq:even_marginal}}{=} p_Y(e) p_X(n) ,$$and therefore @@@X@@@ and @@@Y@@@ are independent. Indeed, waiting until an even number lands does not tell anything about which number it would be.
Let’s consider another example, one where independence fails. Of course, this is “usually” the case.
Toss a fair die repeatedly. Let @@@X@@@ be the number of tosses until the first “@@@1@@@” and let @@@Y@@@ denote the number of tosses until the first “@@@2@@@”. Find the PMF of @@@(X,Y)@@@, the distribution of each marginal, and whether they are independent. To begin, it is extremely easy to identify the distributions of the marginals, and we don’t need to compute the joint PMF first. The tosses are independent. In each toss, the probability it lands @@@1@@@ is @@@1/6@@@. Therefore @@@X@@@, the number of tosses until the first @@@1@@@, is @@@\mbox{Geom}(1/6)@@@. Same with @@@Y@@@. That is,
$$p_Y(y) = p_X(y) = (\frac{5}{6})^{y-1} \frac{1}{6} = \frac{5^{y-1}}{6^y}.$$To find the joint PMF, observe that the vector @@@(X,Y)@@@ only takes values of the form @@@(x,y)@@@ where @@@x\ne y@@@ and @@@x,y =1,2,\dots@@@. Let’s fix such a pair @@@x,y@@@ and assume that @@@x<y@@@. The event @@@\{(X,Y)=(x,y)\}@@@ means that the first @@@x-1@@@ tosses are in @@@\{3,\dots,6\}@@@, the @@@x@@@-th toss is @@@1@@@, the tosses after the @@@x@@@-th and before the @@@y@@@-th are not @@@2@@@ (total of @@@y-x-1@@@ tosses), and the @@@y@@@-th toss is a @@@2@@@. By independence of the tosses, we have \begin{align} p_{X,Y} (x,y) & = (\frac{4}{6})^{x-1} \frac {1}{6} (\frac{5}{6})^{y-x-1} \frac{1}{6}\\ & = \frac{4^{x-1}5^{y-x-1}}{6^y}=\frac {1}{25}(\frac{5}{6})^y(\frac{4}{5})^{x-1}. \end{align} Repeating the computation in the case @@@y<x@@@ gives the same formula, with the roles of @@@x@@@ and @@@y@@@ changed in the RHS. For all other cases, including @@@x=y@@@, @@@p_{X,Y}(x,y)=0@@@. Are @@@X@@@ and @@@Y@@@ independent? Ask yourself the following: can knowing something about @@@X@@@ change the distribution of @@@Y@@@? Yes. If @@@X=1@@@ then @@@Y@@@ cannot be @@@1@@@, although @@@p_Y(1)= \frac{1}{6}@@@. That is,
$$0= p_{X,Y}(1,1) \ne p_X(1) p_Y(1) = \frac {1}{36}.$$Thus \eqref{eq:disc_vec_ind} fails for this pair, and therefore the RVs @@@X@@@ and @@@Y@@@ are not independent.
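A quick simulation (my own sketch; the number of experiments and the cap of @@@100@@@ tosses per experiment are arbitrary) illustrating the failure of independence and the joint PMF formula:

```python
# X is the first toss showing 1 and Y the first toss showing 2.  Empirically
# p_{X,Y}(1,1) is exactly 0 while p_X(1) p_Y(1) is about 1/36, so the product
# criterion for independence indeed fails.
import numpy as np

rng = np.random.default_rng(4)
trials = 200_000
rolls = rng.integers(1, 7, size=(trials, 100), dtype=np.int8)  # 100 tosses suffice
X = np.argmax(rolls == 1, axis=1) + 1          # index of the first 1 (1-based)
Y = np.argmax(rolls == 2, axis=1) + 1          # index of the first 2 (1-based)
print(((X == 1) & (Y == 1)).mean())                    # 0
print((X == 1).mean() * (Y == 1).mean(), 1 / 36)       # about 1/36
# the joint formula for, say, (x, y) = (2, 5): 4^{x-1} 5^{y-x-1} / 6^y
print(((X == 2) & (Y == 5)).mean(), 4**1 * 5**2 / 6**5)
```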
Let’s see independence in practice.
Suppose that @@@X\sim \mbox{Geom}(p_1)@@@ and @@@Y\sim \mbox{Geom}(p_2)@@@ are independent.
- For @@@k\in\Z_+@@@ find @@@P(X=Y+k)@@@.
- Find @@@P(X>Y)@@@. Observe that
$$\begin{align*} P(X=Y+k) & = \sum_{y=1}^\infty P(X=y+k,Y=y) = \sum_{y=1}^\infty P(X=y+k)P(Y=y)\\ & = \sum_{y=1}^\infty (1-p_1)^{y+k-1}p_1 (1-p_2)^{y-1}p_2 \\ & = p_1 p_2 (1-p_1)^{k} \sum_{y=1}^\infty ((1-p_1)(1-p_2))^{y-1}\\ & = \frac{p_1 p_2 (1-p_1)^k}{1-(1-p_1)(1-p_2)}. \end{align*}$$
We have used the formula for the geometric series to get the last line. We turn to b. We can do it by observing that
$$P(X>Y) = \sum_{k=1}^\infty P(X=Y+k),$$and using the result from part a., or by skipping part a and proceeding as follows:
$$\begin{align*} P(X>Y) & = \sum_{y} P(X>Y,Y=y) \\ & =\sum_{y} P(X>y,Y=y)\\ & = \sum_{y} P(X>y) p_Y(y)\\ & = \sum_{y=1}^\infty (1-p_1)^y (1-p_2)^{y-1}p_2\\ & = (1-p_1)p_2 \sum_{y=1}^\infty ((1-p_1)(1-p_2))^{y-1}\\ & = \frac{(1-p_1)p_2}{1-(1-p_1)(1-p_2)}. \end{align*}$$Suppose that @@@p_{X,Y}(x,y)@@@ is equal to a constant @@@c@@@ on the triangular set @@@\{(x,y): x\in \{0,\dots,4\},y \in \{0,\dots,x\}\}@@@ and is zero otherwise. Are @@@X@@@ and @@@Y@@@ independent? Here’s an incorrect argument: the PMF is constant @@@c@@@, and therefore trivially a multiple of a function of @@@x@@@ times a function of @@@y@@@, both of which can be taken as constant functions. From our second criterion for independence, the RVs are then independent. The problem is that the PMF is not constant. It is constant on the prescribed set, but is zero otherwise. This is very important and completely violates the condition of independence. Note, for example, that the probability that @@@X = 4@@@ is
$$\sum_{y=0}^4 c = 5c,$$and the probability that @@@Y=4@@@ is
$$\sum_{x=4}^4 c = c.$$The probability of the intersection @@@\{X=4,Y=4\}@@@ is exactly @@@c@@@ (one point in our triangular set), so independence would imply @@@5c^2 = c@@@, which gives @@@c=0@@@ or @@@c=\frac{1}{5}@@@. The value @@@c=0@@@ is inadmissible, because a PMF has to sum to @@@1@@@. And the value @@@c=\frac 15@@@ has a similar fate, because of the very same reason: choosing it and summing over our triangular shape we get total probability exceeding @@@1@@@ (there are exactly @@@15@@@ points in our triangle).
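A tiny enumeration (not from the text) confirming the discussion of the triangular PMF:

```python
# With c = 1/15 the marginals of the triangular PMF do not factor, so X and Y
# cannot be independent.
from fractions import Fraction

support = [(x, y) for x in range(5) for y in range(x + 1)]
c = Fraction(1, len(support))                     # 15 points, so c = 1/15
p_x4 = sum(c for (x, y) in support if x == 4)     # = 5c
p_y4 = sum(c for (x, y) in support if y == 4)     # = c
print(p_x4 * p_y4, c)                             # 5c^2 = 1/45  vs  p_{X,Y}(4,4) = 1/15
```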
Expectation
How do we compute expectations when we are given a joint PMF? If we’re given both @@@X@@@ and @@@Y@@@, it is reasonable to ask how to compute expectations of functions of @@@X@@@ and @@@Y@@@ (sum, product, whatever). How to compute? Exactly as we did for functions of a single RV, with the appropriate changes. Suppose @@@g:\R^2\to \R@@@. Then the expectation of the RV @@@g(X,Y)@@@ is given by the formula
$$\begin{equation} \boxed { \label{eq:function_discrete_vector} E[g(X,Y)] = \sum_{x,y} g(x,y) p_{X,Y}(x,y), }\end{equation}$$provided @@@\sum_{x,y} |g(x,y)| p_{X,Y}(x,y)<\infty@@@. Of course, sometimes the function is only of one variable. This doesn’t change much; we still use the same formula. Of course, the computation of the expectation of @@@X@@@ is a special case of that type with @@@g(x,y) = x@@@, and then \eqref{eq:function_discrete_vector} becomes
$$\begin{align*} E[ X ] & = \sum_{x,y} x p_{X,Y}(x,y)\\ & = \sum_{x} x \overset{=p_X(x)}{\overbrace{(\sum_{y} p_{X,Y} (x,y))}}\\ & = \sum_{x} x p_X (x) ,\end{align*}$$as expected.
Conditional PMFs
Whenever @@@p_Y(y)>0@@@, we can define the conditional PMF of @@@X@@@ conditioned on @@@Y@@@, and denote it by @@@p_{X|Y}(x|y)@@@:
$$\begin{equation} \boxed { \label{eq:cond_pmf} p_{X|Y}(x | y) = P(X=x | Y=y)= \frac{ p_{X,Y}(x,y)}{p_{Y}(y)}. }\end{equation}$$Of course, when @@@X@@@ and @@@Y@@@ are independent, @@@p_{X|Y}(x|y) = p_{X}(x)@@@, but otherwise this is not the case. Note also that as a function of @@@x@@@, the conditional PMF is by itself a PMF: it is nonnegative and sums to @@@1@@@. It simply gives the “adjusted” probabilities for @@@X@@@ assuming @@@Y=y@@@.
Find @@@p_{X|Y}(x|y)@@@ for @@@X@@@ and @@@Y@@@ as in Example 12. First we need to fix @@@y@@@. All that’s left is to plug into the definition. When @@@x<y@@@ we have
$$\begin{equation} \label{eq:disc_cond} p_{X|Y}(x | y)= \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{ 6^{-y} 4^{x-1}5^{y-x-1}}{6^{-y}5^{y-1}}=\frac{1}{4} (\frac{4}{5})^{x}=\frac{1}{5}(\frac{4}{5})^{x-1}. \end{equation}$$For @@@x=y@@@, we have @@@ p_{X|Y}(x|y) = 0@@@ because although @@@p_Y(y)>0@@@, @@@p_{X,Y}(y,y)=0@@@. Finally, for @@@x>y@@@, we repeat the computation, in \eqref{eq:disc_cond} but switching the roles of @@@x@@@ and @@@y@@@ in the numerator. This gives
$$\begin{align*} p_{X|Y}(x | y) & = \frac{ 6^{-x} 4^{y-1}5^{x-y-1}}{6^{-y}5^{y-1}}\\ & =6^{y-x} 4^{y-1} 5^{x-2y}\\ & =\frac 14 (\frac{4}{5})^y(\frac{5}{6})^{x-y}\\ & = (\frac{4}{5})^{y-1}\frac 16 (\frac{5}{6})^{x-y-1} \end{align*}$$Let’s summarize:
$$\begin{equation} \label{eq:disc_cond_sum} p_{X|Y}(x | y) = \begin{cases} \frac15 (\frac{4}{5})^{x-1} & x<y \\ 0 & x=y \\ (\frac{4}{5})^{y-1} \frac 16(\frac{5}{6})^{x-y-1}& x>y \end{cases} \end{equation}$$Let’s try to intuitively understand this formula. Suppose we know that the first @@@2@@@ occurred at time @@@y@@@. This simply tells us that the first @@@y-1@@@ tosses are not @@@2@@@, the @@@y@@@-th toss is @@@2@@@, and that’s it. No other conditions. Armed with this, the (conditioned) probability that @@@X=x@@@ is obtained from the following argument:
- If @@@x<y@@@, then there are @@@x-1@@@ tosses whose outcomes are each one of the four values @@@\{3,4,5,6\}@@@ out of the possible five @@@\{1,3,4,5,6\}@@@, followed by a toss whose outcome is @@@1@@@, one out of five, giving the first line in \eqref{eq:disc_cond_sum}
- If @@@x=y@@@, since we can’t have two different numbers at the same toss, we obtain that the probability is zero.
- If @@@x>y@@@, the first @@@y-1@@@ tosses have outcomes in @@@\{3,4,5,6\}@@@ (four out of the possible five), followed by the @@@y@@@-th toss equal to @@@2@@@ (assumed, so the probability of that is one), then @@@x-y-1@@@ tosses each in any of the five values @@@\{2,3,4,5,6\}@@@ out of the possible six, followed by a last toss equal to @@@1@@@, one out of six, giving the last line in \eqref{eq:disc_cond_sum}. A short simulation check of \eqref{eq:disc_cond_sum} follows this list.
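Here is a Monte Carlo sketch (my own; the choice @@@y=3@@@, the cap of @@@100@@@ tosses and the sample size are arbitrary) of the conditional PMF \eqref{eq:disc_cond_sum}:

```python
# Keep only experiments where the first 2 shows up at toss y, and compare the
# conditional frequencies of X (the first toss showing 1) with the formula.
import numpy as np

rng = np.random.default_rng(11)
trials, y = 300_000, 3
rolls = rng.integers(1, 7, size=(trials, 100), dtype=np.int8)
X = np.argmax(rolls == 1, axis=1) + 1          # first toss showing 1
Y = np.argmax(rolls == 2, axis=1) + 1          # first toss showing 2
Xc = X[Y == y]                                 # condition on the event {Y = y}

def formula(x, y):
    if x < y:  return (1 / 5) * (4 / 5) ** (x - 1)
    if x == y: return 0.0
    return (4 / 5) ** (y - 1) * (1 / 6) * (5 / 6) ** (x - y - 1)

for x in (1, 2, 3, 4, 6):
    print(x, (Xc == x).mean(), formula(x, y))
```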
Sometimes the joint distribution of a vector is given in terms of conditional PMF, and one needs to recover marginals from this description. Here is how we do it:
$$\begin{equation} \boxed { \label{eq:pmf_from_cond} p_X (x) = \sum_y p_{X,Y}(x,y) = \sum_y p_{X|Y}(x | y) p_{Y}(y). }\end{equation}$$This, of course, is nothing but the total probability formula. Here is a classical example which is a special case of the notion of thinning of Poisson RVs:
The number of visits to an online store in a day is @@@N\sim \mbox{Pois}(\lambda)@@@ for some @@@\lambda@@@. Each visitor makes a purchase with probability @@@p@@@, independently of anything else. Find the distribution of the number of purchases per day, @@@X@@@, and the joint distribution of @@@X@@@ and @@@N-X@@@. Are the latter independent? To find the distribution of @@@X@@@, observe that what we are told is the following: given @@@N=n@@@, the number of purchases @@@X@@@ is Binomial with parameters @@@n@@@ and @@@p@@@. Therefore,
$$p_{X|N}(x|n) = \binom{n}{x} p^x (1-p)^{n-x},~x \in \{0,\dots,n\}.$$Let’s calculate:
$$\begin{align*} p_X(x) & = \sum_{n}p_{X|N}(x | n)p_N(n) \\ & =\sum_{n=x}^\infty \binom{n}{x} p^x (1-p)^{n-x} e^{-\lambda}\frac{ \lambda^n}{n!} \\ & = \frac{e^{-\lambda}(\lambda p )^x }{x!} \sum_{n=x}^\infty \frac{(1-p)^{n-x}\lambda^{n-x} }{(n-x)!} \\ & = \frac{1}{x!} e^{-\lambda} (\lambda p)^x e^{\lambda (1-p)} = e^{-\lambda p} \frac{ (\lambda p )^x}{x!}. \end{align*}$$Hence @@@X \sim \mbox{Pois}(\lambda p)@@@. This is interesting, right? Similarly, @@@N-X@@@ has distribution @@@\mbox{Pois}((1-p)\lambda)@@@ (because we can repeat this calculation but now count those visitors who did not make a purchase, now selecting each with probability @@@1-p@@@). But that’s not all, yet. Let’s look at the joint PMF of @@@X@@@ and @@@N-X@@@.
$$\begin{align*} p_{X,N-X}(x,y) & = p_{X,N}(x,x+y)\\ & = p_{X|N}(x | x+y) p_{N}(x+y)\\ & = \binom{x+y}{x} p^x (1-p)^{y} e^{-\lambda} \frac{\lambda ^{x+y} }{(x+y)!}\\ & = \frac{e^{-\lambda p}(\lambda p)^x }{x!} \times \frac{e^{-\lambda (1-p)} (\lambda (1-p))^y }{y!}\\ & = p_X(x) p_{N-X} (y) \end{align*}$$Wow! The number of customers who made a purchase and the number of customers who did not make a purchase are independent. This is one of the beautiful features of Poisson: if a Poisson RV represents a number of objects, and you mark each object with fixed probability, independently of the others, then the number of marked and number of unmarked are both Poisson, and are independent. The first property…
Explain the comment above on binomial distributions.
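A simulation sketch of the thinning property (the values @@@\lambda=10@@@ and @@@p=0.3@@@ are my own choice):

```python
# The number of purchasers X and the number of non-purchasers N - X should be
# Pois(lambda*p) and Pois(lambda*(1-p)), and essentially uncorrelated.
import numpy as np

lam, p, trials = 10.0, 0.3, 10**6             # hypothetical parameters
rng = np.random.default_rng(5)
n = rng.poisson(lam, trials)                  # daily visitors
x = rng.binomial(n, p)                        # purchasers: Binomial(n, p) given N = n
print(x.mean(), x.var(), lam * p)             # mean = variance = lambda*p for Poisson
print((n - x).mean(), (n - x).var(), lam * (1 - p))
print(np.corrcoef(x, n - x)[0, 1])            # close to 0, consistent with independence
```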
Okay. What about expectations? As noted above, the conditional PMF is a genuine PMF, and this leads to additional notation. The expectation of @@@X@@@ conditioned on @@@Y=y@@@, denoted by @@@E[X | Y=y]@@@, is defined as the expectation of @@@X@@@ computed relative to the conditional PMF @@@p_{X | Y}(x | y)@@@. That is,
$$\begin{equation} \boxed {\label{eq:cond_exp_discrete} E[X | Y=y] = \sum_{x} x p_{X | Y}(x | y), }\end{equation}$$whenever the RHS is defined. Often, conditional expectation is a tool for computing expectation. Indeed,
$$ \begin{align*} E[X] &=\sum_{x} x p_X(x) =\sum_{x}x \left (\sum_{y} p_{X | Y}(x | y) p_Y(y)\right)\\ & =\sum_{y} \left (\sum_{x} x\, p_{X | Y}(x | y) \right) p_Y(y)\\ & = \sum_{y} E[X | Y=y]p_Y(y) \end{align*} $$That is,
$$\begin{equation} \boxed { E[X] =\sum_y E[X | Y=y]p_Y(y) }\end{equation}$$Let’s revisit the derivation of the expectation of a Geometric RV from an earlier example. Let @@@X\sim \mbox{Geom}(p)@@@ for @@@p\in (0,1]@@@. We will argue that @@@E[X]=1/p@@@ by conditioning on @@@Y@@@, where @@@Y@@@ is the indicator that the first trial is successful. We know that @@@E[X | Y=1]@@@ is @@@1@@@. What about @@@E[X | Y=0]@@@? We have a failure in the first trial. Then we wait for the first success, but now starting from the second trial. Therefore @@@E[X | Y=0]=E[1+X]@@@. Putting it all together, we have
$$ \begin{align*} E[X]&= E[X | Y=1]p_Y(1) + E[X | Y=0]p_Y(0)\\ & = 1 *p + E[ X | Y=0](1-p)\\ & = 1*p + E[1+X](1-p) \\ & = 1+(1-p) E[X], \end{align*} $$from which it follows that @@@E[X]=1/p@@@.
Similarly, we can define the conditional expectation of a random variable of the form @@@g(X,Y)@@@
$$\begin{equation} \boxed {\label{eq:cond_exp_fun_disc} E[g(X,Y) | Y=y] = \sum_{x} g(x,y) p_{X | Y}(x | y), }\end{equation}$$whenever the RHS is defined. Note that since we condition on @@@Y=y@@@, we can replace @@@g(X,Y)@@@ with @@@g(X,y)@@@. At this stage, you may ask for the connection between the conditional expectations and the expectation. Let’s get it.
$$ \begin{align*} E[g(X,Y)] & = \sum_{x,y} g(x,y) p_{X,Y}(x,y) \\ & = \sum_{x,y} g(x,y) p_{X | Y} (x | y) p_Y(y) \\ & = \sum_y \left(\sum_x g(x,y) p_{X | Y} (x | y)\right)p_Y(y)\\& = \sum_y E[g(X,Y) | Y=y]p_Y(y). \end{align*} $$Summarizing,
$$\begin{equation} \boxed {\label{eq:disc_noncond_from_cond} E[g(X,Y)] = \sum_{y} E[g(X,Y) | Y=y] p_Y(y). }\end{equation}$$Compute @@@E[X | Y=y]@@@ where @@@X@@@ and @@@Y@@@ are as in Example 12. Writing down the formula is pretty straightforward:
$$ E[X | Y=y] =\underset{(*)}{\underbrace{\sum_{x<y} x \frac{1}{5} (\frac{4}{5})^{x-1}}} + (\frac{4}{5})^{y-1} \underset{(**)}{\underbrace{\sum_{x>y} x \frac{1}{6} (\frac{5}{6})^{x-y-1}.}}$$We need to compute its value. By changing indices in @@@(**)@@@ from @@@x@@@ to @@@k=x-y@@@, we have
$$ (**) = \sum_{k=1}^\infty (k+y) (\frac{1}{6})(\frac{5}{6})^{k-1}=6+y,$$which holds because the expression in the middle is simply the expectation of @@@\mbox{Geom}(1/6)+y@@@. As for @@@(*)@@@, we will use differentiation under the summation sign:
$$ (*) =\frac 15 \frac{d}{d p } \sum_{x=1}^{y-1} p^x, \mbox{ with } p= 4/5.$$Since @@@\sum_{x=1}^{y-1} p^x = \frac{1-p^y}{1-p}-1@@@, and the constant @@@-1@@@ does not affect the derivative, the derivative is equal to @@@-y p^{y-1} (1-p)^{-1} + (1-p^y)(1-p)^{-2} = (1-p)^{-2} (1-p^y - y(1-p)p^{y-1})@@@. In our case, @@@p=\frac{4}{5}@@@ and @@@ 1-p=\frac{1}{5}@@@, so we obtain
$$ \begin{align*} (*) &= 5 - 5 (\frac{4}{5})^y -y (\frac{4}{5})^{y-1}\\ & = 5 - (\frac{4}{5})^{y-1} (4+y) \end{align*} $$All together,
$$ E[ X | Y=y ] = 5 - (\frac{4}{5})^{y-1} (4+y-y-6)= 5 +2 (\frac{4}{5})^{y-1}.$$To finish, let’s compute @@@E[X]@@@. Recall that @@@X@@@ is the number of tosses until the first @@@1@@@. Therefore @@@X\sim \mbox{Geom}(1/6)@@@, and @@@E[X]=6@@@. We compute it again, using the identity \eqref{eq:disc_noncond_from_cond}. Applying the identity with @@@g(x,y) = x@@@ and recalling that @@@Y@@@ is also @@@\mbox{Geom}(1/6)@@@, we obtain
$$ \begin{align*} E[X] &= \sum_{y} E[X | Y=y] p_Y(y) =\sum_{y=1}^\infty (5+ 2 (\frac{4}{5})^{y-1}) p_Y(y)\\ & = 5 + \frac{2}{6} \sum_{y=1}^\infty (\frac{4}{5} \times \frac{5}{6})^{y-1} \\ & = 5 + 1=6. \end{align*} $$This one is a brain teaser. Please think before you read the answer.
You’re tossing a fair die. What is the expected number of tosses until you first see a @@@2@@@, conditioned on all tosses were even?
Easy. If I know all tosses were even, then the probability of seeing a @@@2@@@ each time is @@@1/3@@@. Therefore the number of tosses until the first @@@2@@@ is geometric with parameter @@@1/3@@@ and the answer is @@@3@@@.
'’Wrong!’’ What exactly is wrong? ‘‘Everything’’ (had to say it, but not really). Let me explain: in this problem we are
- not conditioning on “all tosses were even” because this is an event of probability zero. This is not even what we’re saying.
- we are conditioning on “all tosses before the first @@@2@@@ were even” or, equivalently, “@@@2@@@ is the first to appear among the numbers @@@1,2,3,5@@@”. Heuristically, by conditioning, we are restricting the discussion to @@@1/4@@@ of the sequences, and then each sequence attains @@@4@@@ times its original probability. So what is the probability that the first toss is a @@@2@@@? @@@4*1/6@@@. What is the probability that @@@2@@@ appears first in the second toss (remember that we do not allow @@@1,3,5@@@ to appear before @@@2@@@)? @@@4* (2/6)*1/6@@@. More generally, what is the probability that @@@2@@@ will appear for the first time at the @@@n@@@-th toss? @@@4* (2/6)^{n-1}1/6=(1/3)^{n-1}2/3@@@. In other words, the distribution of the number of tosses until (including) the first @@@2@@@ is @@@\mbox{Geom}(2/3)@@@, and the expectation is @@@3/2@@@!
Let’s do it right now. Let @@@B@@@ be the event that all tosses before the first @@@2@@@ were even, and let @@@T@@@ be the number of the toss at which @@@2@@@ is first observed. Then for @@@n=1,2,\dots@@@,
$$ P(T=n,B)= (\frac{2}{6})^{n-1}\times \frac{1}{6},$$because tosses are independent, and the @@@n-1@@@ first tosses are each @@@4@@@ or @@@6@@@, while the @@@n@@@-th toss is @@@2@@@.
By the total probability formula,
$$ P(B) = \sum_{n=1}^\infty P(T=n,B) = \frac{1}{6}\sum_{n=1}^\infty (\frac{2}{6})^{n-1}= \frac{1}{6}\times \frac{1}{1-1/3} =\frac 14.$$Summarizing,
$$ P(T=n | B) = \frac{P(T=n,B) }{P(B)}= \frac 23 (\frac{1}{3})^{n-1}.$$In other words, the distribution of @@@T@@@ conditioned on @@@B@@@ is @@@\mbox{Geom}(2/3)@@@. Its expectation (or conditional expectation) is therefore @@@3/2@@@. That was cool, right?
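If you remain suspicious, here is a simulation sketch (mine, with an arbitrary number of repetitions) of the brain teaser:

```python
# Among sequences in which every toss before the first 2 is even, the average
# waiting time for the first 2 should be close to 3/2, not 3.
import numpy as np

rng = np.random.default_rng(7)
kept_T = []
trials = 200_000
for _ in range(trials):
    n = 0
    while True:
        n += 1
        r = int(rng.integers(1, 7))
        if r == 2:                      # first 2 arrived and B held throughout
            kept_T.append(n)
            break
        if r % 2 == 1:                  # an odd number before the first 2: B fails
            break
print(np.mean(kept_T), 1.5)             # conditional expectation
print(len(kept_T) / trials, 0.25)       # P(B) should be about 1/4
```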
Sums of Independent RVs and Discrete Convolutions
One of the simplest operations when dealing with random vectors is adding the marginals. In this short section, we will see how to get those in the discrete setting.
Suppose that @@@(X,Y)@@@ is a discrete random vector. Set @@@Z=X+Y@@@. Then
$$\{Z=z\}= \{X+Y =z\}=\cup_{x} \{X+Y=z,X=x\} = \cup_{x}\{X=x,Y=z-x\}.$$Therefore,
$$\begin{equation} \boxed {\label{eq:disc_convolution} p_Z(z) = \sum_{x} p_{X,Y}(x,z-x). }\end{equation}$$When @@@X@@@ and @@@Y@@@ are independent, we have the following
$$ p_Z(z) = \sum_{x} p_{X}(x) p_{Y} (z-x).$$Let @@@(X,Y)@@@ be a random vector with independent marginals, @@@X\sim \mbox{Pois}(\lambda)@@@ and @@@Y\sim\mbox{Pois}(\mu)@@@. We will show that
$$ \begin{equation}\label{eq:sum_poisson} Z=X+Y\sim\mbox{Pois}(\lambda+ \mu). \end{equation} $$Before showing the claim, note that by induction \eqref{eq:sum_poisson} shows that if @@@X_1,\dots,X_n@@@ are independent Poisson then their sum is a Poisson RV with parameter equal to the sum of the parameters of @@@X_1,\dots,X_n@@@. Now let’s derive \eqref{eq:sum_poisson}. We use the discrete convolution formula. This gives
$$ \begin{align*} p_Z(z) &= \sum_{x=0}^\infty p_X (x) p_Y(z-x)\\ & = \sum_{x=0}^{z} e^{-\lambda} \frac{\lambda^x}{x!} e^{-\mu} \frac{\mu^{z-x}}{(z-x)!}\\ & = \frac{e^{-(\lambda+ \mu)}}{z!} \sum_{x=0}^z \frac{z!} {x! (z-x)!}\lambda^x \mu^{z-x}\\ & = e^{-(\lambda+ \mu)} \frac{ (\lambda + \mu)^z}{z!}. \end{align*} $$We have used the binomial formula to get the last equality.
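A small numerical check (with parameters I picked arbitrarily) of the discrete convolution computation above:

```python
# sum_x p_X(x) p_Y(z - x) should equal the Pois(lambda + mu) PMF at z.
from math import exp, factorial

def pois_pmf(k, rate):
    return exp(-rate) * rate**k / factorial(k)

lam, mu = 2.0, 3.5                       # hypothetical parameters
for z in (0, 1, 4, 10):
    conv = sum(pois_pmf(x, lam) * pois_pmf(z - x, mu) for x in range(z + 1))
    print(z, conv, pois_pmf(z, lam + mu))
```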
Random Vectors with Densities
Joint Density
Recall that a RV @@@X@@@ has a density if the ratio between the probability that @@@X@@@ lies in a small interval centered at a given point @@@x@@@ and the length of the interval tends to some (finite) constant as the length of the interval tends to zero. This constant is then known as the density of @@@X@@@ at @@@x@@@.
The story of random vectors is the same. Roughly speaking, a random vector @@@(X,Y)@@@ has density at @@@(x,y)@@@ if the ratio between the probability that @@@(X,Y)@@@ lies in a square centered at @@@(x,y)@@@ and the area of the square tends to a (finite) constant as the area of the square tends to zero. That constant is called the density of @@@(X,Y)@@@ at @@@(x,y)@@@. We can take this as a definition, but we prefer a more utilitarian approach that allows us to get quickly into calculations of probability bypassing the notion of a limit of these ratios.
A random vector @@@(X,Y)@@@ has a density if there exists a nonnegative Riemann integrable function @@@f_{X,Y}@@@, the ‘'’joint density function’’’ such that for all @@@x,y \in \R@@@
$$ F_{X,Y}(x,y) = \int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(s,t) dt ds.$$This is the obvious analog of our definition of density for a RV. When using densities, the following result is our main workhorse. I chose the word “nice” to describe any set one can do Riemann integrals over. This includes all rectangles of all forms, all polygons, and more generally, regions bounded between graphs of two continuous functions and their rotations, finite unions of things of this form, etc. Everything you saw in Calculus…
If @@@(X,Y)@@@ has density @@@f_{X,Y}@@@, then for any nice set @@@U@@@,
$$ P((X,Y) \in U) = \int_{U} f_{X,Y}(x,y) dx dy.$$Of course, the distinction between RVs with densities and discrete RVs extends to the case of random vectors. A vector which is discrete cannot have a density. Why? Suppose that for some point @@@(x_0,y_0)@@@, @@@P((X,Y)=(x_0,y_0))>0@@@. If @@@(X,Y)@@@ had density, then we’d have
$$ \int_{x_0}^{x_0} \int_{y_0}^{y_0} f_{X,Y}(x,y)dxdy >0,$$but this is impossible, because the integral of a function over a point is always zero (the lefthand side is equal to zero times @@@f_{X,Y}(x_0,y_0)@@@ which is necessarily zero). As before, the use of “density” is justified through the following:
Suppose @@@(X,Y)@@@ has density @@@f_{X,Y}@@@. Then for every continuity point @@@(x,y)\in \R^2@@@ of @@@f_{X,Y}@@@
$$ f_{X,Y}(x,y) = \lim_{\epsilon \to 0+} \frac{ P((X,Y) \in (x-\epsilon,x+\epsilon)\times (y-\epsilon,y+\epsilon))}{(2\epsilon)^2}.$$Show that if @@@(X,Y)@@@ has a density then @@@P(X=Y)=0@@@.
When is a given function a density and when does a random vector have a density? We’ll answer both simultaneously.
Let @@@f@@@ be a nonnegative Riemann integrable function on @@@\R^2@@@, satisfying @@@\iint f(x,y) dx dy=1@@@. Then
- There exists a random vector @@@(X,Y)@@@ whose density is @@@f@@@. The joint distribution function @@@F_{X,Y}@@@ is given by
$$F_{X,Y}(x,y) = \int_{-\infty}^x \int_{-\infty}^y f(s,t)\, dt\, ds.$$
- Conversely, if @@@F_{X,Y}@@@ is the joint distribution function of the random vector @@@(X,Y)@@@, and @@@\frac{\partial^2}{\partial x \partial y} F_{X,Y}@@@ exists and is equal to @@@f@@@, possibly except on a union of piecewise smooth curves, then @@@(X,Y)@@@ has a density and it is equal to @@@f@@@.
We note that the order of integration in 1. and differentiation in 2. can be changed: we can first integrate @@@dt@@@, then @@@ds@@@, or differentiate first with respect to @@@x@@@, then with respect to @@@y@@@. The statement will remain the same.
Suppose that @@@X\sim \mbox{U}[a,b]@@@ and @@@Y\sim \mbox{U}[c,d]@@@ are independent. Find the joint CDF of the random vector @@@(X,Y)@@@ and determine if it has a density. If it does, find it.
To solve, observe that
$$F_{X,Y} (x,y) = P(X\le x,Y \le y) = P(X\le x) P(Y\le y),$$with last equality due to independence. Thus,
$$F_{X,Y}(x,y) = F_X (x) F_Y(y) = \int_{-\infty}^x f_X(s) ds \int_{-\infty}^y f_Y(t) dt = \int_{-\infty}^y\int_{-\infty}^x f_X(s) f_Y(t) ds dt.$$This proves that @@@(X,Y)@@@ has density @@@f_{X,Y}(x,y)@@@, given by
$$f_{X,Y}(x,y) = f_X(x) f_Y(y) = \begin{cases} \frac{1}{(d-c)(b-a)} & a < x < b, c < y< d \\ 0 & \mbox{otherwise}\end{cases}$$Recall the joint CDF from Exercise 1. Show it has a density and find it.
Here is another example that shows we should take the condition @@@\iint f(x,y) dx dy=1@@@ in the proposition very seriously.
Suppose that the random vector @@@(X,Y)@@@ has joint CDF
$$F_{X,Y}(x,y)=\begin{cases} 1-e^{-x} & 0\le x<y \\ (1-e^{-y})+(1-e^{-y})(e^{-y}-e^{-x}) & 0\le y\le x\\ 0 & \mbox{ otherwise} \end{cases}$$Does @@@(X,Y)@@@ have density?
To answer, let’s differentiate, except on the axes and the ray @@@\{(x,x):x\ge 0\}@@@, where we cannot differentiate. It should not be a problem because the Riemann integral of any integrable function over any line is zero anyway.
- At points @@@(x,y)@@@ outside the first quadrant, the CDF is identically zero, so @@@\frac{\partial^2 F_{X,Y}}{\partial x \partial y}(x,y)=0@@@.
- At all points @@@(x,y)@@@ where @@@0<x<y@@@, @@@\frac{\partial F_{X,Y}}{\partial y}=0@@@, so that the mixed derivative @@@\frac{\partial^2 F_{X,Y}}{\partial x \partial y}@@@ is zero.
- At all points @@@(x,y)@@@ where @@@0<y<x@@@, differentiating first with respect to @@@y@@@ and then with respect to @@@x@@@ gives @@@e^{-x}e^{-y}@@@.
Summarizing, we have
$$ f(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} = e^{-x} e^{-y}$$if @@@0<y<x@@@, and elsewhere, except on the axes and the ray @@@\{(x,x):x\ge 0\}@@@, the mixed derivative is @@@0@@@. But note that if we did have a density, the values of the density along these linear pieces would be irrelevant because they have no effect on the integral (the area of a linear segment is zero). So we’ve found a density, right? No. Let’s look at the integral of the function which is our candidate for density:
$$ \begin{align*} \iint f(x,y) dx dy &= \int_0^\infty \int_0^x f(x,y)dy dx\\ &=\int_0^\infty e^{-x} \int_0^x e^{-y} dy dx \\ &= \int_0^\infty e^{-x} (1-e^{-x}) dx = 1 -\frac 12 = \frac 12. \end{align*} $$Oops! Then there is no density here, and this is no mistake. This example is of a random vector which is an analog of mixed RVs. What is missing? Let’s observe the following (this is a solution to Exercise 11). If we had a density @@@f@@@, then
$$ P(X=Y) = \iint_{\{(s,s):s \in \R \}}f(s,t) dsdt = \int_{-\infty}^\infty \underset{=0}{\underbrace{\int_{t}^t f(s,t) ds}} dt =0.$$In our case, @@@P(X=Y)@@@ is not zero. It is actually @@@1/2@@@. This is where all the missing probability went. We are not going to show it because we will let you show it in Problem 2, where you will see where this joint CDF comes from.
Marginals from Joint Density
How does one recover the marginal distribution from the joint density? The procedure is pretty simple:
$$P_X(A) = P(X \in A) = P(X \in A, Y \in \R) = \int_A (\int_{\R} f_{X,Y}(x,y) dy) dx,$$and this shows that @@@X@@@ has density given by
$$\begin{equation} \boxed {\label{eq:density_marginal} f_X (x) = \int_{\R} f_{X,Y} (x,y) dy. }\end{equation}$$Similarly @@@Y@@@ has a density given by
$$f_Y(y) =\int_{\R} f(x,y) dx.$$Note that the converse is not true: each of the marginals of @@@(X,Y)@@@ may have a density, but @@@(X,Y)@@@ itself may not have a density. Indeed, take @@@X \sim \mbox{U}[0,1]@@@ and @@@Y=X@@@. Then both @@@X@@@ and @@@Y@@@ have densities. Now let @@@D=\{(x,x):x\in \R\}@@@. Then @@@P_{X,Y}(D^c)=0@@@ (why?). Also, @@@D@@@ is a line, and the double integral of any Riemann integrable function on a line is zero. If we did have a density @@@f@@@, we would have @@@\iint f dxdy=0@@@, a contradiction.
Let
$$f(x,y) = \begin{cases} 2e^{-(x+y)} & x>y>0 \\ 0 & \mbox{otherwise}\end{cases} $$Show that @@@f@@@ is a density function and find the densities of each of the marginals.
Clearly @@@f(x,y)\ge0@@@. Also
$$\iint f(x,y) dx dy =\int_0^\infty \int_y^\infty 2e^{-(x+y)} dx\, dy=2\int_0^\infty e^{-y} e^{-y}dy= 1.$$Now
$$f_X(x) = \int f(x,y) dy =2\int_0^x e^{-(x+y)}dy= 2(e^{-x}-e^{-2x}),$$while
$$f_Y(y) = \int f(x,y) dx = 2\int_y^\infty e^{-(x+y)} dx =2e^{-2y}.$$Clearly, @@@Y\sim \mbox{Exp}(2)@@@.
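A numerical sanity check (my own sketch, using SciPy's one-dimensional quadrature) of the marginal densities just computed:

```python
# Integrate y out of f(x, y) = 2 e^{-(x+y)}, x > y > 0, at a few points, and
# check that the resulting marginal density of X integrates to 1.
import numpy as np
from scipy.integrate import quad

f = lambda x, y: 2 * np.exp(-(x + y)) if x > y > 0 else 0.0

for x in (0.5, 1.0, 2.0):
    num, _ = quad(lambda y: f(x, y), 0, x)              # integrate y out
    print(x, num, 2 * (np.exp(-x) - np.exp(-2 * x)))    # compare with the formula
total, _ = quad(lambda x: 2 * (np.exp(-x) - np.exp(-2 * x)), 0, np.inf)
print(total)                                            # f_X integrates to 1
```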
Independence of Marginals
Repeating the argument of Example 21 with any two independent RVs with densities proves the first part of the following proposition:
Let @@@(X,Y)@@@ be a random vector.
- Suppose that @@@X@@@ and @@@Y@@@ are independent. If @@@X@@@ has density @@@f_X@@@ and @@@Y@@@ has density @@@f_Y@@@, then @@@(X,Y)@@@ has a joint density @@@f_{X,Y} (x,y) = f_X(x)f_Y(y)@@@.
- If @@@(X,Y)@@@ has joint density of the form @@@f_{X,Y}(x,y) = f(x) g(y)@@@ for some nonnegative @@@f@@@ and @@@g@@@ with @@@\int f(x) dx=1@@@ (or @@@\int g(x)=1@@@), then @@@X@@@ and @@@Y@@@ are independent, @@@X@@@ has density @@@f@@@ and @@@Y@@@ has density @@@g@@@.
Prove the second part of the proposition.
Let’s see one useful aspect of independence in computation of probabilities.
Disclosure: I love my Subaru.
Suppose that the first time a new Honda breaks is Exponential with expectation one year, and the first time a Subaru breaks is Exponential with expectation six months. I have two new Hondas and one new Subaru. Assuming the times until the cars break are independent, what is the distribution of the time the first car breaks? What is the probability my Subaru will be the first to break? Write @@@H_1,H_2,S@@@ for the times, in years, at which the first Honda, the second Honda and the Subaru break, respectively. Then @@@H_1,H_2 \sim \mbox{Exp}(1)@@@, and @@@S\sim \mbox{Exp}(2)@@@, and the three RVs are independent. We first reduce the problem to two RVs, setting @@@H=\min (H_1,H_2)@@@. Since @@@P(H > t) = P( H_1 > t,H_2>t) =P(H_1>t) P(H_2>t) = (e^{- t})^2=e^{-2t}@@@, we conclude that @@@H\sim\mbox{Exp}(2)@@@. We showed this in Example 3, but it won’t hurt repeating. Also, since @@@H@@@ is a function of @@@H_1@@@ and @@@H_2@@@, it is independent of @@@S@@@. Repeating this argument we can see that the time the first car will break, @@@\min(H,S)@@@, satisfies:
$$ P( \min (H,S)>t) = P(H>t,S>t) = e^{-2t} e^{-2t} = e^{-4t},$$Therefore @@@\min (H,S) \sim\mbox{Exp}(4)@@@, and on the average, the first car will break down in 3 months (@@@1/4@@@ of a year). Now what is the probability it is the Subaru? Or, in other words, what is @@@P(S<H)@@@. Let’s calculate.
$$ \begin{align*} P(S<H) &= \int_0^\infty\int_s^\infty 2 e^{-2s} 2e^{-2t} dtds \\ &= \int_0^\infty 2e^{-4s}ds\\ &=\frac 12 \end{align*} $$Let’s generalize the last computation.
Suppose that @@@(X,Y)@@@ is a random vector with independent marginals, @@@X\sim \mbox{Exp}(\lambda)@@@ and @@@Y\sim \mbox{Exp}(\mu)@@@. Show that
$$ P(X<Y) = \frac{\lambda}{\lambda+\mu}.$$We’ve seen in Example 3 that @@@\min (X,Y)\sim \mbox{Exp}(\lambda+\mu)@@@. What is the probability that @@@X<Y@@@? Remember that a larger parameter means a smaller expectation (a steeper drop of the density as a function of @@@x@@@), so a larger parameter means a higher probability of being the smaller of the two. Enough talking, integral time:
$$ \begin{align*} P(X< Y) &= \iint_{\{(x,y)\in \R^2:x<y\}} f_X(x) f_Y(y) dydx \\ & = \int_0^\infty \int_x^\infty \lambda e^{-\lambda x} \mu e^{-\mu y}dy dx\\ & = \int_0^\infty \lambda e^{-(\lambda + \mu) x} dx \\ & = \frac{\lambda}{\lambda+ \mu}. \end{align*} $$Repeat the last calculation, but changing the order of integration (inner integral is for @@@dx@@@ and exterior is @@@dy@@@). You need to get the same answer, of course. Just a little longer.
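A quick Monte Carlo sanity check of this formula is also easy; here is a Python sketch (the values @@@\lambda=2@@@, @@@\mu=3@@@ and the sample size are arbitrary choices of mine):

```python
import numpy as np

# Monte Carlo check of P(X < Y) = lambda/(lambda+mu) for independent
# X ~ Exp(lam), Y ~ Exp(mu). Parameter values below are arbitrary.
rng = np.random.default_rng(0)
lam, mu, n = 2.0, 3.0, 10**6

X = rng.exponential(scale=1.0 / lam, size=n)   # numpy's scale is the mean, 1/rate
Y = rng.exponential(scale=1.0 / mu, size=n)

print("P(X<Y) empirical  :", np.mean(X < Y))
print("lambda/(lambda+mu):", lam / (lam + mu))
# Bonus: min(X,Y) should be Exp(lambda+mu); compare its mean with 1/(lambda+mu).
print("E[min] empirical  :", np.minimum(X, Y).mean(), "vs", 1.0 / (lam + mu))
```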
Expectation
Suppose @@@(X,Y)@@@ is a random vector with joint density @@@f_{X,Y}@@@, and @@@g:\R^2 \to \R@@@ is a continuous function. The expectation of @@@g(X,Y)@@@ is equal to
$$\begin{equation} \boxed {\label{eq:exp_function_density} E[g(X,Y) ] = \iint g(x,y) f_{X,Y} (x,y)dxdy. }\end{equation}$$provided the RHS is defined. We will not go into details. We will do some examples.
Let @@@(X,Y)@@@ be a random vector with independent @@@\mbox{U}[0,1]@@@ marginals. Compute @@@E[\max (X,Y)]@@@.
Recall that
$$\max(x,y) = \begin{cases} x & x \ge y \\ y & x< y. \end{cases}$$Now
$$ \begin{align*} E[\max(X,Y)] &= \int_0^1 \int_0^1 \max (x,y) dx dy \\ & =\int_0^1 (\int_0^y y dx ) dy + \int_0^1 (\int_{y}^1 x dx) dy\\ & = \int_0^1 y^2 dy + \int_0^1 \frac{ 1- y^2}{2}dy \\ & = \frac 13 + \frac 12 - \frac 16 \\ & = \frac {2}{3} \end{align*} $$In the second line we split the inner integral into @@@x@@@-values in @@@[0,y]@@@, where @@@y@@@ is the maximum, and @@@x@@@-values in @@@[y,1]@@@, where @@@x@@@ is the maximum. Note that there is another way to do this.
$$ E[ \max(X,Y)] = \int_0^\infty P(\max (X,Y)>t) dt=\int_0^1 P(\max(X,Y)>t) dt.$$Now @@@\{\max(X,Y)>t\}@@@ is the complement of @@@\{\max (X,Y)\le t\}=\{X\le t\} \cap \{Y\le t\}@@@. Therefore
$$ P( \max (X,Y) > t) = 1- P(\max (X,Y) \le t) =1 - P(X\le t) P(Y\le t) = 1-t^2,$$where the second equality is due to independence. The integral now becomes
$$ E[\max(X,Y) ] = \int_0^1 (1- t^2) dt = 1 - \frac 13 = \frac 23.$$Note that the CDF of @@@\max(X,Y)@@@ is given by the formula
$$ \begin{equation} \label{eq:cdf_max_unif} F_{\max(X,Y)} (t) = t^2,~ 0\le t \le 1 \end{equation} $$Let @@@(X,Y)@@@ be as in Example 26. Compute @@@E[\min(X,Y)]@@@, the expectation of the minimum of @@@X@@@ and @@@Y@@@.
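If you want a numerical sanity check of the value @@@2/3@@@ and of the CDF \eqref{eq:cdf_max_unif} (and a template for checking your answer for the minimum), here is a Monte Carlo sketch; the sample size and the test point @@@0.7@@@ are arbitrary choices of mine:

```python
import numpy as np

# Monte Carlo check of E[max(X,Y)] = 2/3 and F_max(t) = t^2 for independent
# U[0,1] marginals. Adapt it to check your answer for E[min(X,Y)].
rng = np.random.default_rng(1)
n = 10**6
X, Y = rng.random(n), rng.random(n)
M = np.maximum(X, Y)

print("E[max] empirical:", M.mean(), "vs 2/3 =", 2 / 3)
t = 0.7
print("P(max <= t)     :", np.mean(M <= t), "vs t^2 =", t**2)
```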
Let’s do something more interesting.
Let @@@(X,Y)@@@ be a point selected uniformly at random on the unit disk @@@D= \{(x,y)\in \R^2: x^2+y^2 <1\}@@@. Find the expectation of the random variable @@@R@@@, where @@@R@@@ is the distance of @@@(X,Y)@@@ from the origin. The distance from the origin of a point @@@(x,y)@@@ is equal to @@@\sqrt{ x^2 + y^2}@@@. Therefore @@@R=\sqrt{X^2+Y^2}@@@. The joint density of @@@(X,Y)@@@ is equal to
$$ f_{X,Y} (x,y) = \frac{1}{\pi} {\bf 1}_D (x,y).$$Therefore, the expectation is equal to
$$ E[ R ] = \frac{1}{\pi} \iint \sqrt{x^2+y^2}dxdy$$We change to polar coordinates to calculate this integral. Recall that @@@\sqrt{ x^2+y^2}=r@@@. Therefore the RHS is equal to
$$ \frac{1}{\pi} \int_0^{2\pi} \int_0^1 r\, r\, dr d\theta,$$where the extra @@@r@@@ is the Jacobian for this substitution. It follows that
$$ E[ R ] = \frac{1}{\pi} \int_0^{2\pi} \frac 13 d\theta= \frac 23.$$By the way, what is the CDF of @@@R@@@? Let’s do a quick calculation.
$$ P(R \le t) = P( X^2 +Y^2 \le t^2) = \frac{1}{\pi} \iint_{\{(x,y): x^2 + y^2 \le t^2 \}} dxdy = \frac{1}{\pi} \pi t^2 =t^2.$$That is
$$ F_{R} (t) = t^2,~ t \in (0,1).$$Compare this with the CDF of the maximum of two independent @@@\mbox{U}[0,1]@@@, \eqref{eq:cdf_max_unif}. They are the same! Quite surprising, right? Same distribution obtained from two completely different procedures. Note that @@@\sqrt{\mbox{U}[0,1]}@@@ has the same distribution as well. Can you explain the connections?
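Here is a simulation sketch comparing the three constructions (the rejection-sampling step for the disk, the test points and the sample size are my own choices, not part of the example):

```python
import numpy as np

# Three constructions that should give the same distribution (CDF t^2 on [0,1]):
#   R = distance from the origin of a uniform point in the unit disk,
#   M = max of two independent U[0,1],
#   S = square root of a single U[0,1].
rng = np.random.default_rng(2)
n = 10**6

pts = rng.uniform(-1, 1, size=(3 * n, 2))        # sample in the square [-1,1]^2
pts = pts[(pts**2).sum(axis=1) < 1][:n]          # keep points inside the disk
R = np.sqrt((pts**2).sum(axis=1))

M = np.maximum(rng.random(n), rng.random(n))
S = np.sqrt(rng.random(n))

for t in (0.3, 0.6, 0.9):
    print(t, np.mean(R <= t), np.mean(M <= t), np.mean(S <= t), "vs t^2 =", t * t)
print("E[R] empirical:", R.mean(), "vs 2/3")
```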
Conditional Densities
Suppose that @@@(X,Y)@@@ has density @@@f_{X,Y}@@@. Suppose we know @@@Y=y@@@ (note that the probability of this event is zero!). What is the (conditional) distribution of @@@X@@@? To understand this, let’s recall how we recovered the marginal of @@@X@@@. We saw that
$$ P(X\in A) = \int_A (\int_{\R} f(x,y) dy) dx=\int_{\R} (\int_A f(x,y) dx) dy.$$For each frozen @@@y@@@, the function we integrate inside, @@@f(x,y)@@@ (with @@@x@@@ as the integration variable), is not a probability density: its integral over @@@\R@@@ is:
$$ \int f(x,y) dx = f_Y(y).$$However, a simple algebraic manipulation, dividing by @@@f_Y(y)@@@ (wherever @@@f_Y(y)>0@@@), turns it into a density function. Of course, we then need to multiply back by @@@f_Y(y)@@@ to recover @@@f(x,y)@@@. Let’s write this again:
$$ P(X\in A) = \int \left(\int_A \frac{ f(x,y)}{f_Y(y)} dx\right) f_Y(y)\, dy .$$We denote this newly defined density function by @@@f_{X | Y}(x | y)@@@, that is
$$\begin{equation} \boxed {\label{eq:cond_densities} f_{X | Y}(x | y) = \frac{f(x,y)}{f_Y(y)}, }\end{equation}$$and it represents the density of @@@X@@@, “conditioned” on @@@Y=y@@@, in the following sense:
$$ P(X\in A) = \int \int_A f_{X | Y}(x | y)dx f_Y(y)dy,$$slicing the probability according to the values of @@@Y@@@. More generally, we have
$$\begin{equation} \label{eq:total_expectation} E [ g(X,Y)] = \int \left(\int g(x,y) f_{X | Y} (x | y) dx\right) f_Y(y)\, dy, \end{equation}$$whenever the RHS is defined.
Since the conditional density @@@f_{X | Y}(x | y)@@@ is a bona fide density function, we can define expectation with respect to it: the conditional expectation of @@@X@@@, or more generally of a function of @@@X@@@ and @@@Y@@@, conditioned on @@@Y=y@@@. We denote these expectations by @@@E[\quad \cdot\quad | Y=y]@@@. That is,
$$\begin{equation} \boxed {\label{eq:cond_exp_density}E[g(X,Y) | Y=y] = \int g(x,y) f_{X | Y}(x | y) dx,}\end{equation}$$whenever the RHS is defined. Note that since we assume @@@Y=y@@@, we can write @@@g(X,y)@@@ on the LHS. The conditional expectation amounts to “slicing” the expectation of @@@g(X,Y)@@@ according to the values of @@@Y@@@. That is, with the notation of conditional expectation, \eqref{eq:total_expectation} becomes
$$\begin{equation} \boxed {\label{eq:cond_exp_densities} E [g(X,Y) ] = \int E[g(X,Y) | Y=y]f_Y(y)dy. }\end{equation}$$Consider the random vector from Example 23. Find the density of @@@X@@@ conditioned on @@@Y=y@@@ and the conditional expectation @@@E[X|Y=y]@@@.
Since we have calculated @@@f_Y(y)= 2 e^{-2y}@@@, it follows that
$$f_{X | Y} (x | y) = \frac {f_{X,Y}(x,y)}{f_Y(y)} = \frac{2e^{-x-y}}{2e^{-2y}}=e^{y} e^{-x},~x>y.$$Remember that here @@@y@@@ is frozen. It is easy to see that this is the distribution of @@@\mbox{Exp}(1)@@@, conditioned to be above @@@y@@@. Indeed, let @@@Z\sim\mbox{Exp}(1)@@@. Then for @@@x>y@@@,
$$P(Z>x | Z>y) = P(Z>x)/P(Z>y) = e^{-x} / e^{-y}.$$Therefore
$$ P(Z\le x | Z>y) = 1- e^{-x} / e^{-y},$$and after differentiation we observe that the density of @@@Z@@@ conditioned to be above @@@y@@@ is indeed @@@e^y e^{-x}@@@ for @@@x>y@@@.
Now let’s compute the conditional expectation.
$$\begin{align*} E[X | Y=y] &= \int x f_{X | Y}(x | y) dx\\ & = \int_y^\infty xe^{-(x-y)}dx\\ & \underset{u=x-y} {=} \int_0^\infty (u+y)e^{-u} du\\ & = 1+y. \end{align*}$$Can you give an intuitive explanation for this value?
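If you’d like to see @@@E[X|Y=y]=1+y@@@ emerge from a simulation, here is a sketch. The sampling trick is mine and easy to verify: if @@@A,B@@@ are independent @@@\mbox{Exp}(1)@@@, then @@@(\max(A,B),\min(A,B))@@@ has exactly the joint density @@@2e^{-(x+y)}@@@ on @@@x>y>0@@@ of Example 23. The conditioning window and sample size are arbitrary.

```python
import numpy as np

# Simulation check of E[X | Y=y] = 1 + y for the joint density of Example 23.
# (max(A,B), min(A,B)) of two iid Exp(1) has density 2 e^{-(x+y)} on x > y > 0.
rng = np.random.default_rng(3)
n = 4 * 10**6
A, B = rng.exponential(size=(2, n))
X, Y = np.maximum(A, B), np.minimum(A, B)

y0, delta = 0.8, 0.01
window = (Y >= y0) & (Y < y0 + delta)          # "Y = y0", up to a thin window
print("samples in window     :", window.sum())
print("E[X | Y ~ y0] empirical:", X[window].mean(), "vs 1 + y0 =", 1 + y0)
```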
Sometimes the joint distribution is given in terms of conditional densities. Here is an example.
The distribution of tree height in the Amazon rain forest (this is all made up!) is @@@Y\sim \mbox{Exp}(\lambda)@@@. Conditioned on @@@Y=y@@@, the radius of the root system is @@@X\sim \mbox{Exp}(y)@@@. What is the distribution of the radius of the root system?
Here we are given the fact that @@@f_Y(y) = \lambda e^{-\lambda y},~y>0@@@ and @@@f_{X | Y} (x | y) = y e^{-y x}@@@. Therefore, the joint density function satisfies
$$f_{X,Y} (x,y) = f_{X | Y} (x | y) f_Y(y) = y e^{-xy} \lambda e^{-\lambda y} = y \lambda e^{- (x + \lambda)y }.$$Now
$$\begin{align*} f_X(x) &= \int f(x,y) dy = \lambda \int_0^\infty y e^{- (x+\lambda)y }dy\\ &=\frac{ \lambda}{\lambda+ x} \underset{=E[\mbox{Exp}(x+\lambda)]=\frac{1}{x+\lambda}}{\underbrace{ (\lambda + x)\int_0^\infty y e^{-(x+\lambda)y} dy}}\\ & = \frac{\lambda}{(\lambda+ x)^2} \end{align*}$$Finally, sometimes, as in Bayes’ formula, we want to swap the roles when conditioning. The procedure is identical to Bayes’ formula for events:
$$f_{Y | X} (y | x) = \frac{f_{X,Y} (x,y)}{f_X(x)} = \frac{f_{X | Y}(x | y) f_Y(y) }{f_X(x)},$$or, in a more symmetric form,
$$ f_{X,Y} (x,y)= f_{Y | X} (y | x) f_X(x) = f_{X | Y} (x | y) f_Y(y).$$Let’s see this in an example.
Consider the random vector @@@(X,Y)@@@ from Example 29. Find @@@ f_{Y | X} (y | x)@@@.
We have
$$\begin{align*} f_{Y | X} (y | x) &= \frac{ f_{X | Y} (x | y) f_Y(y)}{f_X(x)} \\ & = \frac{y \lambda e^{- (x+ \lambda)y }} {\lambda / (\lambda+x)^2} \\ & = (\lambda +x)^2 y e^{-(x+\lambda)y}. \end{align*}$$As we will see in the first part of Example 33, this is the density of a sum of two independent @@@\mbox{Exp}(\lambda +x)@@@ RVs.
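By the way, the marginal density of @@@X@@@ in Example 29 is easy to test by simulation: its CDF is @@@F_X(x) = \int_0^x \frac{\lambda}{(\lambda+t)^2} dt = \frac{x}{\lambda+x}@@@. Here is a sketch (the choice @@@\lambda=1@@@ and the test points are arbitrary):

```python
import numpy as np

# Simulate the hierarchical model of Example 29: Y ~ Exp(lam), then X | Y=y ~ Exp(y),
# and compare the empirical CDF of X with F_X(x) = x/(lam + x).
rng = np.random.default_rng(4)
lam, n = 1.0, 10**6
Y = rng.exponential(scale=1.0 / lam, size=n)   # "tree heights"
X = rng.exponential(scale=1.0 / Y)             # "root radii": rate Y, i.e. mean 1/Y

for x in (0.5, 1.0, 3.0):
    print(x, "empirical:", np.mean(X <= x), " formula:", x / (lam + x))
```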
Repeat Example 29, but now taking @@@Y@@@ a RV with density @@@2y@@@ on @@@[0,1]@@@ and zero elsewhere, and @@@X\sim \mbox{U}[0,y]@@@.
Let @@@X@@@ and @@@Y@@@ be independent with expectation @@@0@@@. What is wrong here?
$$E[X^2 |X+Y=0 ] = E[X^2 | X=-Y] = E[X(-Y)] = -E[X]E[Y]=0$$Let’s close this section with a cool example.
We will show that for @@@a,b\in {\mathbb Z}_+@@@,
$$\begin{equation} \label{eq:integral} \int_0^1 x^a (1-x)^b dx = \frac{a!b!}{(a+b+1)!} \end{equation}$$How do we do that? One approach would be repeated integration by parts. We will offer a purely probabilistic method that won’t require calculation of any integrals. Let’s begin. Of course, there’s really no point continuing if @@@a=b=0@@@. It’s a triviality (don’t forget that @@@0!=1!=1@@@). Let’s assume then that @@@a+b=N\ge 1@@@. Now take @@@U,U_1,\dots,U_{N}@@@ IID @@@\mbox{U}[0,1]@@@.
The key to our solution is the answer to the following question:
“What is the probability that @@@U@@@ is the @@@(a+1)@@@-th smallest among all @@@N+1@@@ RVs?”
Call the event @@@A_a@@@ and denote its probability by @@@p_a@@@.
- First note that the question is well-defined. Indeed, the probability that any two of the @@@N+1@@@ RVs are equal is zero, because we have a continuous random vector. Therefore, with probability @@@1@@@ we can rank them with no ties (this would not be the case if the IID RVs were discrete).
- Because the RVs are IID, the probability that @@@U@@@ happens to be ranked @@@(a+1)@@@-th is exactly the same as the probability that any other of the @@@N+1@@@ RVs is ranked @@@(a+1)@@@-th. Since this gives us @@@N+1@@@ disjoint events of equal probability whose union has probability @@@1@@@ (one of the RVs must be ranked @@@(a+1)@@@-th), we conclude that $$p_a = \frac{1}{N+1}=\frac{1}{a+b+1}.$$
- The event @@@A_a@@@ can be decomposed as follows.
- Let @@@{\cal C}@@@ be the set of combinations of @@@a@@@ elements from @@@\{1,\dots,N\}@@@. An element of @@@{\cal C}@@@ is a subset of @@@\{1,\dots,N\}@@@ consisting of exactly @@@a@@@ elements.
- For each combination @@@C\in {\cal C}@@@, let @@@A_C@@@ be the event @@@\{\max_{i\in C} U_i<U\}\cap \{\min_{i\notin C} U_i >U\}@@@. That is, the elements of @@@C@@@ represent the indices of the @@@a@@@ RVs smaller than @@@U@@@, and the remaining indices represent the @@@b@@@ RVs larger than @@@U@@@.
- Then, up to an event with probability zero, we have @@@A_a = \cup_{C\in {\cal C}} A_C@@@, splitting the event @@@A_a@@@ according to the “identity” (= indices) of the RVs that are smaller than @@@U@@@.
- Therefore, since the events @@@A_C@@@ are disjoint (up to probability zero), $$p_a = P(A_a) = \sum_{C\in {\cal C}} P(A_C).$$
- Now calculate @@@P(A_C)@@@ for @@@C\in {\cal C}@@@. To do that, condition on @@@U=x@@@. Because of the independence, for every @@@C\in {\cal C}@@@ we have $$P(A_C | U=x) = \prod_{i\in C} P(U_i<x) \prod_{i\notin C} P(U_i>x) = x^a (1-x)^{b},$$
and since @@@U@@@ is @@@U[0,1]@@@, we have
$$\begin{equation} \label{eq:individual_PAC} P(A_C) = \int_0^1 P(A_C | U=x )dx = \int_0^1 x^a (1-x)^{b}dx. \end{equation}$$- Since the number of elements in @@@{\cal C}@@@ is @@@\binom{a+b}{a}@@@, it follows that $$\frac{1}{a+b+1} = p_a = \binom{a+b}{a}\int_0^1 x^a (1-x)^{b}dx,\quad\mbox{that is}\quad \int_0^1 x^a (1-x)^{b}dx = \frac{a!\,b!}{(a+b)!}\cdot\frac{1}{a+b+1}= \frac{a!\,b!}{(a+b+1)!}.$$
This establishes \eqref{eq:integral}, without calculating a single integral. I’d also like you to take a second look at the integral and note that it is equal to @@@E[U^a (1-U)^b]@@@.
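If you don’t fully trust probabilistic proofs, here is a Monte Carlo sketch checking both the identity and the ranking probability behind it (the values of @@@a@@@, @@@b@@@ and the sample size are arbitrary choices of mine):

```python
import numpy as np
from math import factorial

# Check E[U^a (1-U)^b] = a! b! / (a+b+1)! and the ranking probability 1/(N+1).
rng = np.random.default_rng(5)
a, b, n = 3, 2, 10**6
N = a + b

U = rng.random(n)
print("E[U^a (1-U)^b] empirical:", np.mean(U**a * (1 - U) ** b))
print("a! b! / (a+b+1)!        :", factorial(a) * factorial(b) / factorial(a + b + 1))

others = rng.random((n, N))                      # the U_1, ..., U_N
ranks = (others < U[:, None]).sum(axis=1)        # how many of them fall below U
print("P(U is (a+1)-th smallest):", np.mean(ranks == a), "vs 1/(N+1) =", 1 / (N + 1))
```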
We will leave the proof of the next result to you because everything one needs to prove it is in Example 31.
Suppose that @@@X@@@ is a continuous RV with CDF @@@F@@@. Let @@@a,b\in {\mathbb Z}_+@@@. Then
$$ E [ F^a(X) (1-F(X))^b] = \frac{a!b!}{(a+b+1)!}.$$Though the proof can be repeated verbatim, I’d like to remind you that for @@@X@@@ and @@@F@@@ as in Proposition 12, we showed that @@@F(X)@@@ is uniformly distributed on @@@[0,1]@@@, so the seemingly more general result should not come as a surprise.
Sums of Independent RVs and Convolutions
Suppose that @@@(X,Y)@@@ is a vector with @@@X@@@ and @@@Y@@@ independent. In this section, we will discuss the density of the sum @@@X+Y@@@. This is obtained through the following derivation:
$$\begin{align*} P(X+Y\le z) &= \iint_{\{x+y \le z\}} f_X(x) f_Y(y) dy dx\\ & = \int_{-\infty}^\infty \int_{-\infty}^{z-x} f_X(x) f_Y(y) dydx\\ &\underset{\scriptsize u=y+x}{=} \int_{-\infty}^\infty \left( \int_{-\infty}^{z} f_X(x) f_Y(u-x)du\right) dx \\ & = \int_{-\infty}^z \left ( \int_{-\infty}^\infty f_X(x) f_Y(u-x) dx \right) du \end{align*}$$The inner integral, viewed as a function of @@@u@@@, is called the convolution of @@@f_X@@@ and @@@f_Y@@@. It is a form of product of the two densities, and is denoted by @@@f_X* f_Y (u)@@@. Differentiating the RHS with respect to @@@z@@@, we obtain that @@@Z=X+Y@@@ has density @@@f_Z(z)@@@ given by the formula
$$\begin{equation} \boxed {\label{eq:convolution} f_Z(z) = f_X \star f_Y (z) = \int f_X(x) f_Y (z-x) dx. }\end{equation}$$Since @@@X+Y=Y+X@@@, we conclude that @@@f_X*f_Y = f_Y * f_X@@@. Let’s put this into practice.
Let @@@(X,Y)@@@ be a random vector with independent @@@\mbox{U}[0,1]@@@ marginals. Find the density of @@@Z=X+Y@@@.
By the convolution formula,
$$ f_Z (z) =\int f_X (x) f_Y(z-x) dx.$$I did not write the limits because I want to discuss them. As @@@Z@@@ is the sum of two @@@\mbox{U}[0,1]@@@, its values can range from @@@0@@@ to @@@2@@@. Let’s first calculate the density when @@@z \in [0,1]@@@. In this case, since @@@f_Y@@@ is zero on negative numbers, we must have @@@z-x>0@@@ or @@@x<z@@@. Since @@@f_X@@@ and @@@f_Y@@@ are equal to @@@1@@@ on @@@[0,1]@@@ we have
$$ f_Z (z) = \int_0^z dx =z=1-(1-z).$$Let’s turn to the case @@@z\in [1,2]@@@. Here we need to integrate over a range of values for @@@x@@@ such that @@@z-x <1@@@, or @@@x> z-1@@@. That is
$$ f_Z (z) = \int_{z-1}^1 dx = 1-(z-1).$$Combining both cases, we have just obtained the following formula
$$ f_Z (z) = \begin{cases} 1-|z-1| & z\in [0,2] \\ 0 & \mbox{otherwise}\end{cases}$$Draw this!
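Or let the computer draw it. Here is a sketch (assuming matplotlib is available; the sample size and number of bins are arbitrary) that overlays a histogram of simulated sums on the formula:

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of X + Y for independent U[0,1] RVs against the triangular density.
rng = np.random.default_rng(6)
Z = rng.random(10**6) + rng.random(10**6)

z = np.linspace(0, 2, 400)
plt.hist(Z, bins=100, density=True, alpha=0.5, label="simulated X+Y")
plt.plot(z, 1 - np.abs(z - 1), "r", label="1 - |z - 1|")
plt.legend()
plt.show()
```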
Let @@@(X,Y)@@@ be a random vector with independent components @@@X\sim \mbox{U}[0,1]@@@ and @@@Y\sim\mbox{Exp}(1)@@@. Show that the density of @@@X+Y@@@ is given by the formula
$$f_{X+Y}(z) = e^{-z}\begin{cases} (e^z -1) & 0< z\le 1\\ (e-1) & z> 1\\ 0 & \mbox{otherwise}\end{cases} $$Let’s do another one.
Suppose that @@@(X_1,X_2)@@@ is a random vector with independent marginals @@@X_1\sim \mbox{Exp}(\lambda_1)@@@ and @@@X_2\sim \mbox{Exp}(\lambda_2)@@@. Find the density of @@@Z= X_1+X_2@@@.
Let’s find the convolution of @@@f_{X_1}@@@ and @@@f_{X_2}@@@:
$$\begin{align*} f_Z (z) &= \int f_{X_1} (x) f_{X_2} (z-x)dx \\ & = \lambda_1 \lambda_2 \int_0^z e^{-\lambda_1 x } e^{-\lambda_2(z-x)} dx \\ & = \lambda_1 \lambda_2 e^{-\lambda_2 z} \int_0^z e^{-(\lambda_1 -\lambda_2)x } dx. \end{align*}$$Note that the integral runs from @@@0@@@ to @@@z@@@ because the densities @@@f_{X_1}@@@ and @@@f_{X_2}@@@ vanish on the negative numbers. To conclude we need to consider two cases:
- @@@\lambda_1=\lambda_2@@@. Then $$f_Z(z) = \lambda_1^2 e^{-\lambda_1 z}\int_0^z dx = \lambda_1 (\lambda_1 z) e^{-\lambda_1 z},~z>0.$$
- @@@\lambda_1 \ne \lambda_2@@@. Without loss of generality, we will assume @@@\lambda_1>\lambda_2@@@. Then $$f_Z(z) = \lambda_1\lambda_2 e^{-\lambda_2 z}\,\frac{1- e^{-(\lambda_1-\lambda_2)z}}{\lambda_1-\lambda_2} = \frac{\lambda_1\lambda_2}{\lambda_1-\lambda_2}\left(e^{-\lambda_2 z} - e^{-\lambda_1 z}\right),~z>0.$$
Exponential RVs are important.
Calculate the density of the sum of @@@n@@@ independent @@@\mbox{Exp}(\lambda)@@@ RVs. Observe that for @@@n=1@@@, the density is, of course, @@@\lambda e^{-\lambda x}@@@. As we saw in the first part of Example 33, for @@@n=2@@@ the density is @@@\lambda (\lambda x)e^{-\lambda x}@@@. You probably see where this is going (more or less; there may be a normalizing constant to make the integral @@@1@@@). Let’s assume the density of the sum of @@@n-1@@@ such RVs is @@@f_{n-1}@@@. We need to find its convolution with the density of @@@\mbox{Exp}(\lambda)@@@:
$$f_{n} (x) = \int_0^x f_{n-1} (y) \lambda e^{-\lambda (x-y)} dy.$$Let’s play smart and use some differential equations instead of guessing. Multiply both sides by @@@e^{\lambda x}@@@. What we get is
$$e^{\lambda x} f_{n} (x) = \lambda \int_0^x f_{n-1} (y) e^{\lambda y } dy.$$Or, if we define @@@g_n(x) = e^{\lambda x}f_n (x)@@@ (and same for @@@n-1@@@), we have shown that
$$ g_{n} (x) =\lambda \int_0^x g_{n-1}(y)dy.$$Since @@@g_1(x) = \lambda@@@, we get @@@g_2(x) = \lambda (\lambda x)@@@, @@@g_3(x) = \lambda (\lambda x)^2/2@@@, and, by induction,
$$g_n (x) = \lambda \frac{ (\lambda x)^{n-1}}{(n-1)!}.$$Thus for all @@@n@@@,
$$f_n (x) = \lambda \frac{(\lambda x)^{n-1}}{(n-1)!} e^{-\lambda x} .$$Suppose that @@@X@@@ is the sum of @@@3@@@ independent @@@\mbox{Exp}(1)@@@ RVs. Calculate @@@E[1/X^2]@@@.
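Here is a simulation sketch checking the formula for @@@f_n@@@ (you can also adapt it to sanity-check your answer to the exercise); @@@n=3@@@, @@@\lambda=1@@@ and the evaluation point are arbitrary choices:

```python
import numpy as np
from math import factorial

# Compare the empirical CDF of a sum of n iid Exp(lambda) RVs with the numerical
# integral of f_n(x) = lambda (lambda x)^{n-1} e^{-lambda x} / (n-1)!.
rng = np.random.default_rng(7)
n, lam, m = 3, 1.0, 10**6
S = rng.exponential(scale=1.0 / lam, size=(m, n)).sum(axis=1)

x0 = 2.5
grid = np.linspace(0.0, x0, 100001)
f_n = lam * (lam * grid) ** (n - 1) * np.exp(-lam * grid) / factorial(n - 1)
dx = grid[1] - grid[0]
print("P(S <= x0) empirical :", np.mean(S <= x0))
print("crude integral of f_n:", f_n.sum() * dx)
```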
Transformations
Suppose that @@@(X,Y)@@@ has density @@@f_{X,Y}(x,y)@@@, and suppose that there exists a random vector @@@(U,V)@@@ along with two functions @@@f_1,f_2:\R^2 \to \R@@@ such that @@@X=f_1(U,V)@@@ and @@@Y=f_2(U,V)@@@. In other words, @@@(X,Y)@@@ is a transformation of @@@(U,V)@@@. We write @@@T(u,v) = (f_1(u,v),f_2(u,v))@@@, and we assume that @@@T@@@ is one-to-one and continuously differentiable with a nonvanishing Jacobian, so that the change of variables formula applies. What is the density of @@@(U,V)@@@?
Observe then that by the change of variables formula,
$$ P_{X,Y} ( A) = \iint_A f_{X,Y}(x,y) dx dy= \int_{T^{-1}(A)} f_{X,Y}(f_1(u,v),f_2(u,v)) |J(u,v)| d u dv,$$where @@@J(u,v)@@@ is the Jacobian determinant:
$$ J(u,v) = \det \frac{\partial (x,y)}{\partial (u,v)} = \det \left(\begin{array}{cc} \frac{\partial f_1}{\partial u} & \frac{\partial f_2}{\partial u} \\ \frac{\partial f_1}{\partial v} & \frac{\partial f_2}{\partial v}\end{array}\right).$$The matrix is simply the gradients of @@@f_1@@@ and @@@f_2@@@ written as column vectors.
Set @@@B= T^{-1}(A)@@@. Then @@@(X,Y)\in A@@@ if and only if @@@(U,V)\in B@@@. Therefore,
$$ P((U,V) \in B) = \iint_{B} f_{X,Y}(f_1(u,v),f_2(u,v) |J(u,v)|d u dv.$$We therefore proved that the density of @@@(U,V)@@@ is given by
$$\begin{equation} \boxed {\label{eq:var_change} f_{U,V} (u,v) = f_{X,Y}(f_1(u,v),f_2(u,v))|J(u,v)| }\end{equation}$$How to remember? The density of @@@(U,V)@@@ at the point @@@(u,v)@@@ is given by the density of @@@(X,Y)@@@ at the corresponding @@@(x,y)@@@, times the absolute value of the Jacobian determinant @@@\frac{\partial (x,y)}{\partial (u,v)}@@@ (gradients of @@@x=f_1(u,v),y=f_2(u,v)@@@ as functions of @@@u,v@@@ as column vectors), at that point.
This formula can also be turned around, now assuming that the density of @@@(U,V)@@@ is known and that of @@@(X,Y)@@@ is not. This is done through the inverse of @@@T@@@, @@@T^{-1}@@@. That is, there exist functions @@@g_1(x,y),g_2(x,y)@@@ such that @@@(u,v) = T^{-1} (x,y) = (g_1(x,y),g_2(x,y))@@@, giving
$$\begin{equation} \boxed {\label{eq:jacobian} f_{X,Y} (x,y) = f_{U,V} (g_1(x,y),g_2(x,y)) \left|\det \frac{\partial (u,v)}{\partial(x,y)} \right|. }\end{equation}$$For the pair @@@(x,y)= T(u,v)@@@ (equivalently @@@(u,v)= T^{-1}(x,y)@@@), the determinants in \eqref{eq:var_change} and the equation above are inverses of each other: their product is @@@1@@@. This sometimes simplifies calculations because, for example, it is easier to differentiate @@@(x,y)@@@ as functions of the polar coordinates @@@(r,\theta)@@@ than to do it the other way around.
Let @@@(X,Y)@@@ be uniformly distributed on the unit disk @@@D=\{(x,y)\in \R^2:x^2+y^2<1\}@@@. That is
$$f_{X,Y} (x,y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 <1 \\ 0 & \mbox{otherwise.}\end{cases}$$- Are @@@X@@@ and @@@Y@@@ independent? Careful here. Can we write @@@f_{X,Y}(x,y)@@@ as a product of a function of @@@x@@@ and a function of @@@y@@@? The answer is negative, and the reason is that @@@D@@@ is a disk (not a rectangle with sides parallel to the axes). Let’s give two arguments. The first is based on joint densities. By symmetry, if @@@f_{X,Y} (x,y) = f(x)g(y)@@@, then we may take @@@g=f@@@. In particular, @@@f_{X,Y}(x,x) = f(x)^2@@@, so @@@f(0) = \frac{1}{\sqrt{\pi}}@@@, and then @@@f(x)f(0) = f_{X,Y}(x,0)=\frac 1\pi@@@ implies @@@f(x) = \frac{1}{\sqrt{\pi}}@@@ for all @@@|x|<1@@@. But then @@@\int f(x) dx \ge \frac{2}{\sqrt{\pi}}>1@@@, while @@@(\int f(x)dx)^2 = \iint f_{X,Y}(x,y)dxdy = 1@@@ forces @@@\int f(x)dx=1@@@. So this choice of @@@f@@@ is impossible. Ok. I said I’ll give you another one. This is based on the definition of independent RVs. The probability that @@@X>3/4@@@ is positive and is equal to the probability that @@@Y>3/4@@@ (why?). However, the event @@@\{X>3/4,Y>3/4\}@@@ is contained in the event @@@\{X^2 + Y^2 > 2\cdot \frac{9}{16}\}@@@. But @@@2\cdot \frac{9}{16}>1@@@, and therefore the probability of the intersection is zero, as @@@(X,Y)@@@ is in the unit disk. This contradicts independence, under which the intersection would have probability @@@P(X>3/4)P(Y>3/4)>0@@@.
- Write @@@X@@@ and @@@Y@@@ in polar coordinates: $$X = R\cos\Theta,\qquad Y = R\sin\Theta,\qquad R \in [0,1),~\Theta\in [0,2\pi).$$
Find the joint distribution of @@@(R,\Theta)@@@. To solve we use the last result:
$$f_{R,\Theta}(r,\theta) = f_{X,Y}(r \cos \theta, r\sin \theta) |J(r,\theta)|.$$The Jacobian in this case is equal to @@@r@@@. Therefore the joint density of @@@R@@@ and @@@\Theta@@@ is
$$f_{R,\Theta} (r , \theta) = \begin{cases} \frac{1}{\pi} r & r \in (0,1),~\theta \in [0,2\pi)\\ 0 &\mbox{ otherwise.}\end{cases}$$Thus, @@@R@@@ and @@@\Theta@@@ are independent because the joint density is the product of the densities of the marginals. What are the distributions of each of the marginals?
Now @@@R@@@ has density
$$f_{R}(r) = \int_{0}^{2\pi} f_{R,\Theta} (r,\theta) d \theta = 2 r,~r \in (0,1),$$or
$$F_R(r) = r^2,~ r \in (0,1).$$But this is not news. We’ve already computed it before in Example 27. The marginal @@@\Theta@@@ is @@@\mbox{U}[0,2\pi]@@@ because the joint density is only a function of @@@r@@@ and after integrating @@@r@@@ out, we get a constant (independent of @@@\theta@@@).
The random vector @@@(U,V)@@@ has independent marginals with @@@U\sim \mbox{Exp}(7)@@@ and @@@V \sim \mbox{Exp}(14)@@@. Set @@@X = U+3V@@@ and @@@Y= 2U-V@@@. Find the joint density of @@@(X,Y)@@@.
The first thing we’d like to mention and which is extremely important is the range of @@@(X,Y)@@@. Yes, @@@(U,V)@@@ has a density on the first quadrant, but @@@(X,Y)@@@ is a transformation, so it may take values elsewhere. To deal with this, see where the boundaries are mapped. The half-line @@@v=0,u>0@@@ is mapped to @@@(u,2u),~u>0@@@, and the half-line @@@u=0,v>0@@@ is mapped to @@@(3v,-v),~v>0@@@, that is the half line with slope @@@-1/3@@@. Now look at any arbitrary point in the @@@(U,V)@@@ range, say @@@(1,1)@@@. As this is mapped to @@@(4,1)@@@, it follows that the range of @@@(X,Y)@@@ is the cone
$$C = \{(x,y): 0<x,~ -\frac x3< y<2x\}.$$So our new density will be on this domain. The rest is algebra. Since
$$\begin{cases} x= f_1(u,v) = u +3v \\ y =f_2(u,v) = 2u -v \end{cases}$$Solving, we have
$$ \begin{cases} u = g_1(x,y) = \frac{1}{7} (x+3y)\\ v= g_2(x,y) = \frac{1}{7} (2x- y) \end{cases}$$The Jacobian matrix for @@@T^{-1}@@@ is equal to
$$\left( \begin{array}{cc} \frac 17& \frac 37 \\ \frac{2}{7} & -\frac{1}{7}\end{array}\right)$$Therefore
$$J_{T^{-1}}(x,y) = \frac{1}{49} \left( -1 -6 \right)=-\frac 17.$$Thus,
$$ f_{X,Y}(x,y) = f_{U,V}(\frac{x+3y}{7},\frac{2x-y}{7})\left|\det(J_{T^{-1}})\right|= f_{U,V}(\frac{x+3y}{7},\frac{2x-y}{7})\frac{1}{7}$$Since @@@U@@@ and @@@V@@@ are independent, @@@f_{U,V}(u,v) = f_U(u) f_V(v) = 7e^{-7u} \cdot 14e^{-14v}@@@ for @@@u,v>0@@@.
$$ f_{X,Y}(x,y) = \left(7 e^{-7 \frac{x+3y}{7}} \cdot 14 e^{-14 \frac{2x-y}{7}}\right) \frac{1}{7}= 98\, e^{- (x+3y)} e^{-2(2x- y)}\frac{1}{7}= 14\, e^{-x-3y-4x+2y}= 14\, e^{-5x-y} \mbox{ on }C$$and @@@f_{X,Y}(x,y)=0@@@ otherwise. Want to check?
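Here is one way to check, by simulation (a Python sketch; the test rectangle is an arbitrary choice of mine): simulate @@@(U,V)@@@, form @@@(X,Y)@@@, and compare an empirical probability with the integral of the density over the same set.

```python
import numpy as np

# Simulate (U,V), form (X,Y) = (U+3V, 2U-V), and compare an empirical probability
# with the integral of 14 e^{-5x-y} over the same set (restricted to the cone C).
rng = np.random.default_rng(8)
n = 10**6
U = rng.exponential(scale=1 / 7, size=n)
V = rng.exponential(scale=1 / 14, size=n)
X, Y = U + 3 * V, 2 * U - V

a1, a2, b1, b2 = 0.1, 0.4, 0.0, 0.3                     # arbitrary test rectangle
emp = np.mean((X > a1) & (X < a2) & (Y > b1) & (Y < b2))

h = 0.001
xs = np.arange(a1 + h / 2, a2, h)
ys = np.arange(b1 + h / 2, b2, h)
XX, YY = np.meshgrid(xs, ys, indexing="ij")
inC = (YY > -XX / 3) & (YY < 2 * XX)                    # intersect with the cone C
num = (14 * np.exp(-5 * XX - YY) * inC).sum() * h * h

print("empirical:", emp, " integral of 14 e^{-5x-y} over the set:", num)
```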
Let @@@(X,Y)@@@ be a random vector with density @@@f_{X,Y}@@@. What is the density of the random vector @@@(aX,bY)@@@ where @@@a,b>0@@@?
Finally, here is an example that uses some geometry.
(source) Two points are sampled uniformly and independently from the square @@@[0,1]\times [0,1]@@@. What is the probability that the center of the square is contained in the circle whose diameter is the segment connecting the two points?
It doesn’t really matter where the square is, so to exploit the symmetry of the problem, assume that the square is @@@[-1/2,1/2]\times [-1/2,1/2]@@@, and let @@@Z_1=(X_1,Y_1),Z_2=(X_2,Y_2)@@@ be the two points selected. Then @@@X_1,Y_1,X_2,Y_2@@@ are all independent @@@\mbox{U}[-1/2,1/2]@@@.
Observe that a circle contains the origin (in its interior) if and only if the norm of its center is smaller than its radius.
The norm of the center squared is
$$\left\|\frac{Z_1+Z_2}{2}\right\|^2=\frac 14 \|Z_1\|^2 + \frac 14 \|Z_2\|^2 +\frac 12 Z_1 \cdot Z_2,$$and the norm of the radius squared is
$$\left\|\frac{Z_1-Z_2}{2}\right\|^2=\frac 14 \|Z_1\|^2 + \frac 14 \|Z_2\|^2 - \frac 12 Z_1\cdot Z_2.$$Therefore the origin is in the circle if and only if
$$\left\|\frac{Z_1+Z_2}{2}\right\|^2 < \left\|\frac{Z_1-Z_2}{2}\right\|^2$$if and only if
$$\frac 14 \|Z_1\|^2 + \frac 14 \|Z_2\|^2 +\frac 12 Z_1 \cdot Z_2 < \frac 14 \|Z_1\|^2 + \frac 14 \|Z_2\|^2 - \frac 12 Z_1\cdot Z_2$$if and only if @@@\frac 12 Z_1 \cdot Z_2 < - \frac 12 Z_1 \cdot Z_2@@@ if and only if @@@Z_1 \cdot Z_2<0@@@. But the RV @@@Z_1\cdot Z_2@@@ is continuous and symmetric (it has the same distribution as @@@-Z_1\cdot Z_2@@@, since @@@-Z_1@@@ has the same distribution as @@@Z_1@@@ and is independent of @@@Z_2@@@), therefore the probability is @@@1/2@@@.
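A Monte Carlo sketch of this (sample size arbitrary), checking both the geometric criterion and the final answer:

```python
import numpy as np

# Check both ||center|| < radius and the algebraic criterion Z1 . Z2 < 0,
# and the answer 1/2, for two uniform points in [-1/2, 1/2]^2.
rng = np.random.default_rng(9)
n = 10**6
Z1 = rng.uniform(-0.5, 0.5, size=(n, 2))
Z2 = rng.uniform(-0.5, 0.5, size=(n, 2))

inside = ((Z1 + Z2) ** 2).sum(axis=1) < ((Z1 - Z2) ** 2).sum(axis=1)   # ||c||^2 < r^2
dot_negative = (Z1 * Z2).sum(axis=1) < 0
print("P(origin in circle):", inside.mean(), " P(Z1.Z2 < 0):", dot_negative.mean())
print("fraction where the two criteria agree:", np.mean(inside == dot_negative))
```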
Sometimes you don’t really need fancy formulas to obtain joint distributions of transformations. In the next example, we will explore yet another nice property of the exponential distribution, and of sequences of identically distributed exponential RVs.
Let @@@X_1,\dots,X_n@@@ be independent @@@\mbox{Exp}(\lambda)@@@. We already showed that the probability that any two of the variables are the same is zero. That is, with probability @@@1@@@, there is a unique well defined minimum, second smallest, etc. Write @@@X_{(1)}@@@ for the minimum of @@@X_1,\dots,X_n@@@, @@@X_{(2)}@@@ for the second smallest, etc. More precisely, once @@@X_{(j)}@@@ has been defined we continue inductively letting
$$ X_{(j+1)} = \min\{X_k: k=1,\dots,n \mbox{ and } X_k > X_{(j)}\}.$$The “sorted” sequence @@@(X_{(1)},\dots,X_{(n)})@@@ is often called the order statistics of the sequence @@@(X_1,\dots,X_n)@@@. Note that the vector of order statistics is a (rather complex) transformation of our original vector: sorting is a deterministic function of the input.
What we will do is try to get some understanding of the order statistics. We already showed that @@@X_{(1)}@@@ is @@@\mbox{Exp}(n\lambda)@@@. In order to continue, we will not work directly with the order statistics but instead look at the sequence of induced increments. More precisely, let @@@Y_1 = X_{(1)}@@@ and continue inductively letting
$$ Y_{j+1} = X_{(j+1)} - X_{(j)},~j=1,\dots,n-1.$$That is @@@Y_2@@@ is the difference between the second smallest and the smallest, @@@Y_3@@@ is the difference between the third smallest and the second smallest, etc. Note that the sequence @@@(Y_1,\dots,Y_n)@@@ is a transformation of the order statistics. A rather simple one, right? We will calculate the joint distribution of @@@(Y_1,\dots,Y_n)@@@ directly without any special formulas for transformations, starting from the joint distribution of @@@(X_1,\dots,X_n)@@@. Get ready for a fun ride!
Let’s write the joint density of @@@(X_1,\dots,X_n)@@@, conditioned on the event @@@A = \{X_1< X_2<\dots <X_n\}@@@. Note that the event @@@A@@@ corresponds to one of the @@@n!@@@ possible permutations of the indices @@@\{1,\dots,n\}@@@, each corresponding to a unique ordering of the @@@n@@@ RVs in increasing value. Since the RVs are IID, these @@@n!@@@ events all have the same probability, and their union has probability @@@1@@@ (all values are distinct with probability @@@1@@@, and when they are distinct exactly one of the @@@n!@@@ orderings occurs). It follows that @@@P(A) = 1/n!@@@. We therefore have
$$ f(x_1,\dots,x_n | A) = n! \prod_{j=1}^n (\lambda e^{-\lambda x_j} {\bf 1}_{\{x_j > x_{j-1}\}}),$$with @@@x_0=0@@@ for convenience. The trick is to rewrite this, expressing it through the increments @@@x_j -x_{j-1}@@@. Let’s do it. Write @@@y_j = x_j-x_{j-1}@@@. Then @@@x_j = y_1 + y_2+\dots+y_j@@@. This change of variables is linear and triangular with Jacobian determinant @@@1@@@, so the density transforms with no extra factor. Putting it all together, the joint density, expressed in terms of the variables @@@y_1,\dots,y_n@@@, is the function @@@g(y_1,\dots,y_n|A)@@@, which for @@@y_1,y_2,\dots,y_n>0@@@ is given by
$$\begin{align*} g(y_1,\dots,y_n|A) & = f(x_1,\dots,x_n|A) = n! \lambda^n e^{-\lambda y_1}e^{-\lambda (y_1 + y_2)} e^{-\lambda (y_1 + y_2 + y_3)} \cdots e^{-\lambda (y_1+ \cdots + y_n)} \\ & = n! \lambda^n e^{-\lambda n y_1} e^{-\lambda (n-1) y_2} \cdots e^{-\lambda y_n}\\ & = \prod_{j=1}^n ((n+1-j) \lambda) e^{-(n+1-j) \lambda y_j}. \end{align*}$$From their definitions, the variables @@@y_1,\dots,y_n@@@ represent differences between consecutive order statistics, and so we’re looking at the joint density of @@@Y_1,\dots,Y_n@@@ conditioned on the event @@@A@@@.
Since conditioning on any of the @@@n!@@@ permutation events yields exactly the same density for the increments, the joint density of @@@Y_1,\dots,Y_n@@@ remains the same when we do not condition on any particular permutation (average over the @@@n!@@@ events using the total probability formula). In other words,
$$f_{Y_1,\dots,Y_n} = \prod_{j=1}^n ((n+1-j) \lambda) e^{-(n+1-j) \lambda y_j},~y_1,\dots,y_n >0.$$That is, @@@(Y_1,\dots,Y_n)@@@ are independent exponential RVs with @@@Y_j \sim \mbox{Exp}((n+1-j) \lambda)@@@. Let’s frame this:
$$\begin{equation} \boxed {\label{eq:exp_order} Y_1=X_{(1)} \sim \mbox{Exp}(\lambda n),Y_2=X_{(2)}-X_{(1)} \sim \mbox{Exp}(\lambda (n-1)),\dots, Y_n=X_{(n)}-X_{(n-1)} \sim \mbox{Exp}(\lambda), }\end{equation}$$and all are independent.
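Here is a simulation sketch of \eqref{eq:exp_order}, comparing the empirical means of the gaps with @@@1/((n+1-j)\lambda)@@@ and checking that two of the gaps look uncorrelated (the values of @@@n@@@, @@@\lambda@@@ and the sample size are arbitrary):

```python
import numpy as np

# Gaps of the order statistics of n iid Exp(lambda) RVs: Y_j should be
# independent with Y_j ~ Exp((n+1-j) lambda).
rng = np.random.default_rng(10)
n, lam, m = 5, 2.0, 10**6
X = np.sort(rng.exponential(scale=1.0 / lam, size=(m, n)), axis=1)
Y = np.diff(X, axis=1, prepend=0.0)        # Y_1 = X_(1), Y_j = X_(j) - X_(j-1)

print("empirical means:", Y.mean(axis=0))
print("predicted means:", 1.0 / (lam * np.arange(n, 0, -1)))
print("corr(Y_1, Y_n) :", np.corrcoef(Y[:, 0], Y[:, -1])[0, 1])   # should be ~ 0
```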
@@@5@@@ students are taking an exam, all starting at the same time and there’s no time limit. The time taken by each student to complete the exam is exponentially distributed with expectation of @@@1@@@ hour.
What is the expected time between the first completing the exam and the last completing the exam?
Poisson Distribution
Time for closure. When we first introduced the Poisson distribution it seemed pretty arbitrary. Now we are equipped with the tools to fully explain it.
Let’s start by making some assumptions. Suppose we’re counting visits to a website over some time period, with time being continuous. It is natural to assume the memoryless property for the time between any two consecutive visits: no matter where we are in time or how many visits have occurred so far, the probability that the next visit will occur within the next @@@s@@@ units of time depends only on @@@s@@@. This implies that the times between consecutive visits are IID @@@\mbox{Exp}(\lambda)@@@ for some @@@\lambda>0@@@. That is, setting @@@T_0=0@@@, letting @@@T_1@@@ be the time of the first visit, @@@T_2@@@ the time of the second visit, etc., and defining @@@X_j=T_j- T_{j-1}@@@, the sequence @@@(X_j:j\in\N)@@@ is IID with @@@X_1 \sim \mbox{Exp}(\lambda)@@@.
I’m now counting the number of visits up to time @@@t@@@. Call this @@@N_t@@@. I will have @@@k@@@ visits by time @@@t@@@ if and only if the @@@k@@@‘th visit occurred on or before time @@@t@@@, and the @@@k+1@@@‘th visit occurred after time @@@t@@@. That is, we have the equality of the events
$$ \{N_t =k\} = \{T_k\le t< T_{k+1} \}.$$We will compute the probability of the RHS. The calculation will be easy once we find the density of the vector @@@(T_k,X_{k+1})@@@ because @@@X_{k+1}@@@ is independent of @@@T_k@@@ by construction. To describe the event @@@\{T_k \le t < T_{k+1}\}@@@ in terms of @@@(T_k,X_{k+1})@@@, we need the first component @@@x@@@ (representing @@@T_k@@@) to be in @@@[0,t]@@@ and the second component @@@y@@@ (representing @@@X_{k+1}@@@) to be larger than @@@t-x@@@ (that is because @@@T_{k+1} =x + X_{k+1}@@@ and we want this to be @@@> t@@@). Therefore,
$$ P(N_t =k ) = \int_0^t \int_{t-x}^\infty f_{T_{k},X_{k+1}}(x,y) dy dx.$$By independence of the marginals and the fact that @@@X_{k+1} \sim \mbox{Exp}(\lambda)@@@, we have
$$\begin{align*} P(N_t =k ) &= \int_0^t \int_{t-x}^\infty f_{T_{k}}(x) \lambda e^{-\lambda y} dy dx\\ & =\int_0^t f_{T_{k}}(x) e^{-\lambda (t-x)}dx. \end{align*}$$But wait, the RHS is exactly the convolution of @@@f_{T_{k}}@@@ and @@@f_{X_{k+1}}@@@ evaluated at @@@t@@@, up to a factor of @@@\frac 1\lambda@@@. Let’s see this. Let @@@Z = T_k + X_{k+1} = T_{k+1}@@@. The density of @@@T_{k+1}@@@ is @@@f_{T_{k+1}}(t) = \int_0^t f_{T_k}(x) f_{X_{k+1}}(t-x) dx@@@. Since @@@f_{X_{k+1}}(y) = \lambda e^{-\lambda y}@@@,
$$f_{T_{k+1}}(t) = \int_0^t f_{T_k}(x) \lambda e^{-\lambda(t-x)} dx = \lambda e^{-\lambda t} \int_0^t f_{T_k}(x) e^{\lambda x} dx.$$The probability we calculated is:
$$ P(N_t =k ) = \int_0^t f_{T_{k}}(x) e^{-\lambda (t-x)}dx = e^{-\lambda t} \int_0^t f_{T_{k}}(x) e^{\lambda x} dx.$$Comparing the two, we see that @@@\frac{1}{\lambda} f_{T_{k+1}}(t) = P(N_t =k )@@@. By Example 34, @@@T_{k+1}@@@, being a sum of @@@k+1@@@ independent @@@\mbox{Exp}(\lambda)@@@ RVs, has density @@@f_{T_{k+1}}(t) = \lambda \frac{(\lambda t)^{k}}{k!} e^{-\lambda t}@@@ (this is the @@@\mbox{Gamma}(k+1, \lambda)@@@ distribution). We therefore have:
$$ P(N_t=k)=\frac{1}{\lambda} f_{T_{k+1}}(t)= \frac{1}{\lambda} \left( \lambda \frac{ (\lambda t)^{k}e^{-\lambda t}}{k!} \right) = \frac{ (\lambda t)^{k}e^{-\lambda t}}{k!}.$$What have we just shown? Recall the PMF of @@@\mbox{Pois}(\lambda t)@@@? Not sure? Check here. Yes, we have just shown that
$$N_t \sim \mbox{Pois}(\lambda t).$$Let @@@(N_t:t\ge 0)@@@ be the Poisson process as constructed above. Then
- @@@N_0=0@@@.
- For every @@@t,s\ge 0@@@, @@@N_{t+s}-N_s@@@ and @@@N_s@@@ are independent.
- For every @@@t,s\ge 0@@@, @@@N_{t+s}-N_s \sim \mbox{Pois}(\lambda t)@@@.
Express the event @@@\{N_t \ge k\}@@@ in terms of the sequence @@@T_1,T_2,\dots@@@.
Apply Theorem 13 to show that if @@@(X,Y)@@@ is a random vector with independent marginals @@@X\sim \mbox{Pois}(\lambda_1)@@@ and @@@Y\sim \mbox{Pois}(\lambda_2)@@@, then @@@X+Y\sim \mbox{Pois}(\lambda_1+\lambda_2)@@@.
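To close the loop, here is a simulation sketch of the construction: build @@@N_t@@@ from IID @@@\mbox{Exp}(\lambda)@@@ interarrival times and compare its empirical PMF with that of @@@\mbox{Pois}(\lambda t)@@@ (the values of @@@\lambda@@@, @@@t@@@, the number of repetitions and the cap of 30 arrivals, ample here, are arbitrary choices of mine):

```python
import numpy as np
from math import exp, factorial

# Build N_t from iid Exp(lambda) interarrival times; compare with Pois(lambda*t).
rng = np.random.default_rng(11)
lam, t, m, k_max = 1.5, 2.0, 10**5, 30

gaps = rng.exponential(scale=1.0 / lam, size=(m, k_max))
T = gaps.cumsum(axis=1)                 # arrival times T_1, T_2, ..., T_{k_max}
N_t = (T <= t).sum(axis=1)              # number of arrivals up to time t

for k in range(5):
    pmf = exp(-lam * t) * (lam * t) ** k / factorial(k)
    print(k, "empirical:", np.mean(N_t == k), " Pois(lambda t):", round(pmf, 5))
```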
Problems
Recall that the joint distribution function of the random vector @@@(X,Y)@@@, @@@F_{X,Y}@@@ is given by the formula
$$ F_{X,Y} (x,y) = P(X \le x,Y \le y).$$Express
$$P( X \in (x_1,x_2],Y\in(y_1,y_2])$$in terms of the joint distribution function.
Let @@@(X,Y)@@@ be a random vector with @@@X@@@, @@@Y@@@ independent @@@\mbox{Exp}(1)@@@ RVs. Let @@@Z=\max(X,Y)@@@.
- Find the CDF of @@@(X,Z)@@@.
- What is @@@P(X=Z)@@@?
- Does @@@(X,Z)@@@ have a density?
Two very disorganized people agree to meet between 10:00AM and 11:00AM. Each arrives at a uniformly distributed time within that interval, independently of the other. What is the probability that the first one arriving will have to wait no more than @@@10@@@ minutes for the second?
Let @@@X_1,X_2,\dots, X_n@@@ be independent @@@\mbox{U}[0,1]@@@ RVs. Let @@@X= \min (X_1,\dots,X_n)@@@. You can think of @@@X_1,\dots,X_n@@@ as the times during the semester any one of my @@@n@@@ student advisees will first contact me (scaled to be in @@@[0,1]@@@) and of @@@X@@@ as the time the first among them will contact me.
- Find the distribution function of @@@X@@@ and its expectation.
- Find a function @@@g@@@ such that @@@X@@@ has the same distribution as @@@g(X_1)@@@.
- Repeat the above parts assuming now @@@X_1,\dots,X_n@@@ are independent @@@\mbox{Exp}(\lambda)@@@.
Suppose that @@@X@@@ is a random variable satisfying that @@@X@@@ and @@@X^2@@@ are independent. Show that there exists @@@c\ge 0@@@ such that @@@P(|X|=c)=1@@@ (that is, there exists @@@p\in [0,1]@@@ such that @@@P(X=c)=p,~P(X=-c)=1-p@@@).
(Hint: show that the CDF of @@@|X|@@@ can only take two values, by looking at the probability of events of the type @@@\{|X|\le x,X>x\}@@@, @@@\{|X|\le x,X<-x\}@@@ for @@@x\ge 0@@@).
Suppose that @@@(X,Y)@@@ is a random vector with density function
$$f_{X,Y}(x,y) = \begin{cases} e^{-y}& ~y\ge x\ge 0,\\ 0 & \mbox{otherwise}\end{cases}.$$- Find the joint distribution function.
- Find the density of @@@X@@@.
- Find the density of @@@Y@@@.
- Find the density of @@@X+Y@@@.
The distribution of @@@X@@@ given @@@Y=y@@@ is @@@\mbox{U}[y,y+1]@@@, and @@@Y@@@ has density @@@\frac{1}{\pi}\frac{1}{1+y^2}@@@.
- What is the density of @@@X@@@? (recall that the primitive of @@@\frac{1}{1+y^2}@@@ is @@@\arctan(y)@@@)
- What is the density of @@@Y@@@ conditioned on @@@X=x@@@?
Let @@@X_1,X_2,\dots@@@ be independent and identically distributed random variables, and let @@@N@@@ be a random variable independent of the @@@X_i@@@’s, taking values in @@@\{1,2,\dots\}@@@. Write @@@S_N = X_1 + \dots + X_N@@@ (think of @@@N@@@ as number of customers and @@@X_i@@@ is the purchase amount of the @@@i@@@-th customer).
- Assume that @@@X_1@@@ and @@@N@@@ have finite expectation. Show that @@@E[S_N]=E[N] E[X_1]@@@
- Assume that @@@X_1@@@ and @@@N@@@ have finite variance. Show that the variance of @@@S_N@@@ is equal to @@@E[N] \sigma^2_{X_1} +\sigma^2_N(E[X_1])^2@@@
- A die is rolled repeatedly until we first see a number @@@\ge 4@@@. What is the expected sum of all rolls, including the last one?
Suppose that @@@X@@@ and @@@Y@@@ are independent exponential RVs with parameter @@@1@@@. Find
- The density of @@@\max(X,Y)@@@ conditioned on @@@\min(X,Y)@@@.
- The joint density of @@@\max(X,Y)@@@ and @@@\min(X,Y)@@@.
- Are @@@\min(X,Y)@@@ and @@@\max(X,Y)-\min(X,Y)@@@ independent? (in your answer use your answer to the last part).
Let @@@(X,Y)@@@ be a random vector with independent marginals @@@X\sim \mbox{Pois}(\lambda)@@@ and @@@Y\sim \mbox{Pois}(\mu)@@@. Find the joint distribution of @@@X@@@ and @@@X+Y@@@. Show that the distribution of @@@X@@@ conditioned on @@@X+Y=z@@@ is binomial and find its parameters.
Four points are selected uniformly and independently on the unit sphere. What is the probability that the tetrahedron whose vertices are the points contains the origin?
Let @@@(X,Y)@@@ be uniformly distributed on the diamond shape @@@M=\{(x,y):|x|+|y|<1\}@@@. Show that @@@X+Y@@@ and @@@X-Y@@@ are independent and find their joint density.
The joint CDF of @@@(X,Y)@@@ is given by
$$F_{X,Y} (x,y) =\begin{cases} \frac{xy(2+x+y)}{(1+x)(1+y)(1+x+y)}& x,y\ge 0\\ 0 & \mbox{otherwise}\end{cases}$$Show that @@@Y@@@ has a density. Find it.
The joint CDF of @@@(X,Y)@@@ is given by
$$ F_{X,Y} (x,y) = \begin{cases} 0 & x<0 \mbox{ or } y<0 \\ yx \frac{x+y}{2} & x,y \in [0,1]\\ x \frac{x+1}{2} & x\in [0,1],y>1\\ y \frac{y+1}{2} & y \in [0,1],x>1\\ 1 & \mbox{otherwise} \end{cases}$$Show that @@@Y@@@ has density and find it.
You’re at the supermarket, next in line to be serviced by a cashier. There are @@@10@@@ cashiers, all currently occupied, and the service time for each customer at each cashier is exponentially distributed with expectation @@@1@@@, with all times being independent.
What is the probability that you’ll finish before the nine other customers currently in the other cashiers?