Intro
Random variables are neither random nor variables. They are simply real-valued functions on the sample space, or real-valued measurements: an assignment of a numerical value to each outcome in the sample space. The number of Heads in a sequence of tosses is a random variable, and so (a bit of a stretch) is the price I’ll pay at checkout next time I put my car in the repair shop, the amount you’ll win next time you play the lottery, the time I’ll wait at the station for the bus home, etc. Remember: functions, numerical assignments to outcomes in the sample space.
Random Variables
Definition and First Examples
First an acronym: RV. So why do we even care about RVs? Let’s think of a few reasons. Numbers are much easier to manipulate than abstract objects such as sequences of Heads and Tails, etc. They also allow us to concisely summarize the outcome (e.g. “@@@7@@@ Heads”, rather than listing the entire sequence). Want another one? RVs can be thought of as monetary values assigned to outcomes: if you play a game of chance, your net gain is a random variable.
‘‘Essentially’’, if @@@\Omega@@@ is your sample space and @@@f:\Omega\to [-\infty,\infty]@@@ is a function, then @@@f(\omega)@@@ is a random variable. I write ‘‘essentially’’ because this captures the essence, but ignores an important technical aspect. Here is the precise definition.
Suppose that @@@\Omega@@@ is a sample space equipped with a @@@\sigma@@@-algebra @@@{\cal F}@@@. A random variable is a function @@@X: \Omega \to {\mathbb R} \cup\{\pm \infty\}@@@, satisfying that for each @@@\alpha \in {\mathbb R}@@@, the set @@@\{X\le \alpha\}@@@ is an event, that is an element in @@@{\cal F}@@@.
Remember we worked really hard to understand what events are? The technical aspect about RVs is that we want any (reasonable) statement about them like “@@@X@@@ is less than or equal to @@@100@@@” or “@@@X@@@ is strictly between @@@2@@@ and @@@4@@@” or “@@@X@@@ is even”, etc… to be an event. Why? Because (eventually) we want to be able to associate probabilities to the numerical values: for example, we would like to know what is the probability that our plane will be no more than 20 minutes late (identify the random variable here), and we only assign probabilities to events, right?
Turns out that the condition listed in the definition, requiring only a relatively small collection of statements to be events, is all we need in order to make pretty complex statements. We’ve already been in a very similar situation when discussing @@@\sigma@@@-algebras.
The most trivial RV? A constant. Whatever the @@@\sigma@@@-algebra is, any constant function is a RV. But that’s boring. Let’s turn to more interesting examples.
The following are examples of everyday RVs. In all cases, we assume the @@@\sigma@@@-algebra is rich enough to support them.
- The score of the winning team in the next Superbowl.
- Toss a coin repeatedly. The number of tosses until the first Heads, or @@@\infty@@@ if Heads never appears.
- The price of a Tesla stock when the market opens on the first Thursday after this course ends.
Similarly to events, to define RVs all we need is a sample space and a @@@\sigma@@@-algebra (a pair known as a measurable space), not even a probability distribution. This is because RVs are deterministic real-valued functions on the sample space satisfying a consistency condition with the @@@\sigma@@@-algebra.
Think of the following experiment: your friend tosses a coin twice and tells you only whether a Head has appeared or not. Here the sample space is @@@\Omega= \{HH,HT,TH,TT\}@@@ and the information available to you in this experiment is only subsets determined by whether Heads appeared or not. Therefore the @@@\sigma@@@-algebra associated with this experiment is @@@\{\emptyset, \{HH,HT,TH\},\{TT\},\Omega\}@@@. Now if we let @@@X@@@ denote the number of Heads, then @@@X@@@ is a real-valued function on @@@\Omega@@@, but in this particular context, it is not a random variable: the set @@@\{X\le 1\}=\{HT,TH,TT\}@@@ is not an event. Knowing whether Heads appeared or not does not determine the number of Heads (it does only in one case: none appeared).
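To make this concrete, here is a small Python sketch (our own illustration; the function and variable names are ours, not from the text) that enumerates the @@@\sigma@@@-algebra above and checks whether the preimages @@@\{X\le \alpha\}@@@ belong to it:

```python
# Illustrative check (not from the text): verify that X = "number of Heads"
# is NOT measurable with respect to the coarse sigma-algebra of this experiment.
omega = {"HH", "HT", "TH", "TT"}
sigma_algebra = [set(), {"HH", "HT", "TH"}, {"TT"}, omega]

def num_heads(outcome):
    return outcome.count("H")

def preimage(alpha):
    """The set {X <= alpha} as a subset of the sample space."""
    return {w for w in omega if num_heads(w) <= alpha}

# {X <= 1} = {HT, TH, TT} is not an element of the sigma-algebra,
# so X fails the definition of a random variable here.
print(preimage(1) in sigma_algebra)  # False
print(preimage(0) in sigma_algebra)  # True: {X <= 0} = {TT}
```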
Remember, we only require @@@\{X\le \alpha\}@@@ to be an event, where @@@\alpha@@@ ranges over all real numbers. Let’s see what this gives us, almost for free; it turns out to be much more. Remember: complements, countable unions and countable intersections of events are all events. So:
- @@@\{X>\alpha\}@@@ is an event because it’s the complement of @@@\{X\le \alpha\}@@@, which, by definition, is an event.
- @@@\{\beta < X \le \alpha\}@@@ is an event, because it’s the intersection of the events @@@\{X\le \alpha\}@@@ (definition) and @@@\{X>\beta\}@@@ (what we just showed);
- @@@\{\beta < X < \alpha\}@@@ is an event: it’s the union of @@@\{\beta < X \le \alpha - \frac 1n\}@@@, @@@n=1,2,\dots@@@.
- @@@\{\beta \le X < \alpha\}@@@ is an event: it’s the intersection of @@@\{\beta -\frac 1n< X <\alpha\}@@@ over @@@n=1,2,\dots@@@.
- @@@\{\beta \le X \le \alpha\}@@@ is an event. Note the special case @@@\beta=\alpha@@@: @@@\{X=\alpha\}@@@ is an event too.
- The complement of any of the above is an event.
- Any countable union of any of the above is an event, and so are complements of such unions, unions of those complements and anything listed before, complements of those, etc…
Let @@@X@@@ be a RV. Show that @@@\{X=\infty\}@@@ and @@@\{X=-\infty\}@@@ are both events.
I hope you got the gist of it by now: if @@@I@@@ is any interval (open/closed/half of each/finite/infinite), then @@@\{X\in I\}@@@ is an event, and since the collection of events is a @@@\sigma@@@-algebra, we have that @@@\{X\in A\}@@@ is an event whenever @@@A@@@ is a countable union of intervals, or complements of such sets, or intersections of such sets, or unions of such intersections, etc.
Upon a slightly closer inspection, we see that the requirement that @@@\{X\le \alpha\}@@@ is an event implies that the collection of subsets of @@@\R@@@
$$\{ A\subset \R: \{X \in A\}\in {\cal F}\}$$is itself a @@@\sigma@@@-algebra on @@@\R@@@. In other words, whatever @@@(\Omega,{\cal F})@@@ is, a RV @@@X@@@ induces a @@@\sigma@@@-algebra on @@@\R@@@. This @@@\sigma@@@-algebra includes all intervals as elements, and therefore contains the Borel @@@\sigma@@@-algebra. We state this as a Corollary:
If @@@X@@@ is an RV then for every Borel set @@@B@@@, @@@\{X \in B\}@@@ is an event.
Indicators
Simply put, an indicator is any RV taking only the values @@@0@@@ and @@@1@@@ (possibly only one of them). Indicators are also commonly known as Bernoulli RVs, a well-deserved name as these are the building blocks of RVs, and Mr. Bernoulli did make some significant contributions to the area.
| [[File:Jakob Bernoulli.jpg | thumb | Jakob Bernoulli]] |
How can one obtain a RV taking only the values @@@0@@@ and @@@1@@@? Fix an event @@@A@@@, and let
$$\begin{equation} \label{eq:indi} X(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \mbox{otherwise}\end{cases} \end{equation}$$That is: @@@X@@@ “indicates” whether @@@A@@@ has occurred or not, and is therefore referred to as the '''indicator (function) of @@@A@@@''', denoted for brevity by @@@{\bf 1}_A@@@. Let’s take @@@\alpha \in \R@@@ and consider the event @@@\{X \le \alpha\}@@@. Then
- It is @@@\emptyset@@@ when @@@\alpha <0@@@;
- It is @@@A^c@@@ if @@@\alpha \in [0,1)@@@; and
- It is @@@\Omega@@@ if @@@\alpha \ge 1@@@. As each of these is an event, @@@X@@@ is an RV.
Several examples of indicators:
- Toss a coin once. Set @@@X@@@ to @@@1@@@ if the coin lands Heads and zero otherwise. Thus @@@X@@@ is the indicator of “Heads”.
- Toss a coin twice. Set @@@X@@@ to be @@@1@@@ if at least one Heads was observed and @@@0@@@ otherwise. Thus @@@X@@@ is the indicator of “at least one Heads”. What about @@@1-X@@@? This is equal to @@@1@@@ exactly when @@@X=0@@@, namely on the event “two Tails” and is equal to @@@0@@@ otherwise. Therefore @@@1-X@@@ is the indicator of “two Tails”.
- Let @@@A@@@ and @@@B@@@ be two events (say: the UConn men get into the NCAA tourney and the UConn women do not get into the Final Four, respectively). Let @@@X@@@ be one if both occur and zero otherwise. Then @@@X@@@ is again an indicator: it is the indicator of @@@A\cap B@@@.
If @@@X@@@ is defined as the indicator of an event @@@A@@@, then it only takes the values @@@0@@@ and @@@1@@@. The converse is also true. I’m leaving this as an exercise.
Suppose that @@@X@@@ is a random variable taking only values in @@@\{0,1\}@@@. Show that \eqref{eq:indi} holds for some event @@@A@@@.
Some algebraic operations on indicators correspond to set operations on the respective events. Before going further, let me make sure you are familiar with the maximum and minimum functions; we’ll use them a lot. The maximum (larger) of two numbers @@@a@@@ and @@@b@@@ is denoted by @@@\max(a,b)@@@, and the minimum (smaller) is denoted by @@@\min(a,b)@@@. For example @@@\max (7,10)=10,\min(7,10)=7,\max(5,5)=5@@@, etc… Now suppose @@@X@@@ is the indicator of the event @@@A@@@ and @@@Y@@@ is the indicator of the event @@@B@@@. Then @@@XY@@@ is equal to @@@1@@@ on @@@A \cap B@@@ and is zero otherwise. Therefore, @@@XY@@@ is the indicator of @@@A\cap B@@@. Note that @@@XY@@@ is also equal to the minimum of @@@X@@@ and @@@Y@@@ (look at the two values and take the lower), denoted by @@@\min(X,Y)@@@. Let’s look at other operations.
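These identities are easy to check by brute force. Here is a throwaway Python sketch (the setup and names are ours, not from the text) verifying them on a two-toss sample space:

```python
# Sketch (assumed setup): indicators on a four-outcome sample space,
# showing XY = min(X, Y) = indicator of the intersection, and friends.
omega = ["HH", "HT", "TH", "TT"]
A = {"HH", "HT"}          # first toss is Heads
B = {"HH", "TH"}          # second toss is Heads

def indicator(event):
    return lambda w: 1 if w in event else 0

X, Y = indicator(A), indicator(B)

for w in omega:
    # product of indicators = indicator of the intersection = pointwise min
    assert X(w) * Y(w) == indicator(A & B)(w) == min(X(w), Y(w))
    # 1 - X is the indicator of the complement of A
    assert 1 - X(w) == indicator(set(omega) - A)(w)
    # max(X, Y) is the indicator of the union
    assert max(X(w), Y(w)) == indicator(A | B)(w)
print("all identities verified")
```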
Here’s a video explaining RVs and indicators
Ready?
Suppose that @@@X@@@ and @@@Y@@@ are indicators of @@@A@@@ and @@@B@@@ respectively. Show that @@@1-X, \max(X,Y), XY@@@ are all indicators and identify the corresponding events in terms of @@@A@@@ and @@@B@@@.
One can combine indicators to obtain more elaborate random variables.
- Toss a coin repeatedly. Let @@@X_j@@@ be the indicator that the @@@j@@@-th toss lands Heads. Let @@@S_n = X_1 + \dots + X_n@@@. Then @@@S_n@@@ counts the number of Heads in the first @@@n@@@ tosses. It is a sum of indicators.
- Suppose you and I play rock-paper-scissors. If I win, you give me @@@\$10@@@ and if you win, I give you @@@\$20@@@. Let @@@X@@@ be the indicator of “I win”. Note that @@@1-X@@@ is the indicator of “you win”, and @@@10X - 20(1-X)= 30X - 20@@@ is a random variable describing the net amount I win in the game (equal to @@@10@@@ when I win and to @@@-20@@@ when I lose).
Consider an infinite sequence of independent tosses of a fair coin, with @@@X_k@@@ being the indicator of @@@k@@@-th toss being Heads.
Define a new function @@@Y@@@ on the sample space as follows. @@@Y=k@@@ if @@@X_k=1@@@ and @@@X_j=0@@@ for all @@@j<k@@@. We set @@@Y=\infty@@@ if @@@X_1=X_2=\dots =0@@@.
In words, @@@Y@@@ is equal to the number of tosses until we land Heads. It is equal to @@@1@@@ if the first toss is Heads, @@@2@@@ if the first is Tails and the second is Heads, @@@3@@@ if the first two are Tails and the third is Heads, etc. If you think of Heads as “success”, then @@@Y@@@ is the “time” of the first success.
Of course, @@@Y@@@ is a random variable. Why? For @@@k\in \N@@@, @@@\{Y=k\}=\{X_1=0,\dots,X_{k-1}=0,X_k=1\}@@@ is a finite intersection of events, hence an event, and since any event of the form @@@\{Y\le \alpha\}@@@ is a finite (possibly empty) union of such events, @@@Y@@@ is a RV.
This is an example of a RV that takes infinitely many values, and may also take the value @@@+\infty@@@.
Let’s try to go a little ahead and introduce probability. We already made some assumptions on the probability by saying that all tosses are fair and independent. What is the probability that @@@Y=k@@@? First let’s take @@@k \in \N@@@.
$$P(Y=k) = P( X_1 =0,\dots,X_{k-1}=0,X_k = 1) = P (k-1 \mbox{ Tails, followed by a Heads}) = (\frac{1}{2})^{k-1} \frac 12=2^{-k},$$where the probabilities multiply because the tosses are independent, each landing Heads or Tails with probability @@@1/2@@@.
What about @@@Y=\infty@@@? As you guessed, this has probability zero. However, if we want to stick to our definition of independence, we cannot repeat the argument above directly, because independence (even for an infinite sequence) was expressed in terms of probabilities of finite intersections. We’ll do that indirectly then. Observe that for every @@@k\in\N@@@, @@@\{Y=\infty\}\subset \{Y >k\}@@@. Therefore, @@@P(Y=\infty) \le P(Y>k)@@@. Now @@@\{Y>k\} = \{X_1=0,X_2=0,\dots, X_k=0\}@@@, or, in words, the first @@@k@@@ tosses all land Tails. The probability of this event is @@@(\frac{1}{2})^k@@@. Therefore, for every @@@k\in\N@@@,
$$ P(Y = \infty) \le (\frac{1}{2})^k.$$Since probabilities are nonnegative, we conclude that @@@P(Y=\infty)=0@@@. Therefore, although this RV may be equal to @@@\infty@@@, the probability it is @@@\infty@@@ is zero.
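If you like simulations, here is a Monte Carlo sanity check (our own sketch, not part of the text) comparing the empirical frequency of @@@\{Y=k\}@@@ with the computed value @@@2^{-k}@@@:

```python
# Monte Carlo sanity check (illustrative): simulate fair-coin tosses and
# estimate P(Y = k), which the computation above says equals 2**(-k).
import random

random.seed(0)

def first_heads_time():
    """Toss a fair coin until Heads; return the toss count Y."""
    k = 1
    while random.random() >= 0.5:   # probability 1/2 of Tails on each toss
        k += 1
    return k

n = 100_000
counts = {}
for _ in range(n):
    y = first_heads_time()
    counts[y] = counts.get(y, 0) + 1

for k in (1, 2, 3):
    print(k, counts[k] / n, 2.0 ** (-k))   # empirical frequency vs. 2^{-k}
```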
Feel free to skip this last example at first reading. I know Jakob Bernoulli looks intimidating. I promise I won’t tell him.
In this example we show how one can start with indicators and end up with random variables that can take any value in @@@[0,1]@@@.
Consider an infinite sequence of fair-coin tosses, with @@@X_k@@@ being the indicator of @@@k@@@-th toss being Heads.
Write
$$X = \sum_{k=1}^\infty 2^{-k} X_k.$$(You may ask yourself why this infinite sum is an RV. We will answer in the next section.)
Unlike all previous examples, where the random variables took only finitely many or countably many values, the range of @@@X@@@ as a function is uncountable. It is, in fact, the interval @@@[0,1]@@@, because every number in @@@[0,1]@@@ has a binary representation, that is, a representation as a sum of positive powers of @@@1/2@@@.
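As a quick illustration (our own sketch; we truncate the series at 30 terms, which is accurate to about @@@2^{-30}@@@), we can simulate @@@X@@@ and check that its values fall in @@@[0,1]@@@. A standard fact, not proved in the text, is that @@@X@@@ is in fact uniformly distributed on @@@[0,1]@@@, and we probe that empirically too:

```python
# Illustrative simulation (assumption: truncate the infinite series at 30 terms).
import random

random.seed(1)

def sample_X(terms=30):
    """One draw of X = sum over k of 2^{-k} X_k, with fair independent X_k."""
    return sum(2.0 ** (-k) * (random.random() < 0.5) for k in range(1, terms + 1))

n = 50_000
xs = [sample_X() for _ in range(n)]
assert all(0.0 <= x <= 1.0 for x in xs)   # values land in [0,1]

# Empirical probe of uniformity: P(X <= 1/2) should be close to 1/2.
frac = sum(x <= 0.5 for x in xs) / n
print(frac)
```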
Operations with RVs
Let’s get into the business of generating new RVs from existing ones. We sampled a little of this already; now we’re going much deeper, though it will be short.
Let @@@X,Y@@@ be RVs.
- If @@@f:\R\to\R@@@ is piecewise continuous or monotone, then @@@f(X)@@@ is a RV.
- If @@@g:\R^2 \to \R@@@ is continuous, then @@@g(X,Y)@@@ is a RV.
- If @@@X_1,X_2,\dots@@@ are RVs, then @@@\sup_{j=1,2,\dots} X_j@@@ is a RV.
Let’s see what this tells us.
In a few words: pretty much every (reasonable) algebraic manipulation of random variables yields a random variable.
Let’s be a little more specific.
From the first part we see that things like @@@2X@@@, @@@X+1@@@, @@@e^X@@@ are all RVs. Similarly, if @@@X@@@ is any RV, then
$$Y = \begin{cases} 1 & X\ge 1440\\ 1/2 & X \in [1000,1440) \\ 0 & X<1000\end{cases}$$is also a RV, because it is a piecewise continuous function of @@@X@@@ (you can think of it as acceptance status as a function of SAT score: @@@1@@@ for accepted, @@@1/2@@@ for waitlisted and @@@0@@@ for rejected).
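The piecewise rule above is easy to code. A minimal sketch (the function name is ours; the thresholds come from the text):

```python
# Sketch of the piecewise continuous function above (name is ours).
def acceptance_status(sat_score):
    """1 = accepted, 1/2 = waitlisted, 0 = rejected."""
    if sat_score >= 1440:
        return 1.0
    if sat_score >= 1000:
        return 0.5
    return 0.0

print(acceptance_status(1500), acceptance_status(1200), acceptance_status(900))
# 1.0 0.5 0.0
```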
From the second part, we see that @@@X\pm Y@@@ as well as @@@XY@@@ are RVs. Using this repeatedly with the first part, we see that an object like @@@\sin (X)+\cos (X-2Y)@@@ is a RV. By the way, you can iterate this to show that similar expressions involving three, four or any finite number of RVs are still RVs.
So what about the last part? What is even @@@\sup@@@? Definitely not something you take to the lake. It’s pronounced “supremum” and is the equivalent of the notion of maximum when we have infinitely many numbers. Suppose you’re looking at an infinite sequence (of numbers, of course). The supremum is defined as the smallest among all numbers bigger than or equal to all elements in the sequence, and @@@+\infty@@@ if there is no such number. Let’s clarify:
- When the sequence has a largest element, this is the supremum. In this case it coincides with the maximum.
- When the sequence is something like @@@1-\frac 12,1-\frac 13,1-\frac 14,\dots@@@, then it does not have a largest element (a maximum). There are many numbers bigger than any number in the sequence: @@@2, 1.5, 1.001, 1.0001@@@. The smallest among them is @@@1@@@. That’s the supremum.
- When the sequence does not have any number bigger than all of its elements then the supremum is defined as @@@+\infty@@@. The sequence @@@1,2,3,\dots@@@ is such a sequence, right?
Now that we understand the notion, let’s continue with the main message. It is important if we want to take limits (not to “limit” ourselves to finite operations). If @@@X_1,X_2,\dots@@@ are nonnegative RVs, then by the second part, the partial sums @@@S_1=X_1,S_2=X_1+X_2,S_3= X_1+X_2+X_3,\dots@@@ are all RVs. Observe that the series @@@\sum_{j=1}^\infty X_j@@@ (which can also take the value @@@+\infty@@@) is simply @@@\sup_{n=1,2,\dots} S_n@@@, and is therefore a RV. This explains why the series in Example 8 from the last section is a RV.
Let @@@X@@@ be a RV. Is @@@1/X@@@ a RV?
Well, @@@1/X@@@ is not defined when @@@X=0@@@, so we need to do something about it. Let @@@Z= \frac{1}{X} {\bf 1}_{\{X\ne 0\}}@@@. Then @@@Z@@@ is a piecewise continuous function of @@@X@@@ (namely @@@g(x)=1/x@@@ for @@@x\ne 0@@@ and @@@g(0)=0@@@), and is therefore a RV. Also, @@@X Z = 1@@@, except when @@@X=0@@@, in which case the product is zero. Therefore, we can consider @@@Z@@@ as the reciprocal of @@@X@@@.
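As a plain function, the workaround looks like this (a sketch; the name is ours):

```python
# Sketch of z(x) = (1/x) * 1{x != 0}: the "reciprocal" patched at zero.
def reciprocal(x):
    return 1.0 / x if x != 0 else 0.0

# x * z(x) = 1 except at x = 0, where the product is 0.
assert 4.0 * reciprocal(4.0) == 1.0
assert -0.5 * reciprocal(-0.5) == 1.0
assert 0.0 * reciprocal(0.0) == 0.0
print("reciprocal behaves as described")
```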
Suppose that @@@X@@@ is a RV. Show that @@@Y= \int_0^{|X|} e^{-x^2} dx@@@ is a RV.
Distribution (Functions) of RVs
In this section we finally make the link between RVs and probability measures.
Suppose we’ve got ourselves an RV @@@X@@@ on a probability space @@@(\Omega, {\cal F},P)@@@. Note that we now include a probability measure. Since we’ve already observed that for any Borel set @@@B@@@, @@@\{X\in B\}@@@ is an event, namely an element of @@@{\cal F}@@@, it has an assigned @@@P@@@-probability, @@@P (X \in B)@@@. Thus, letting
$$\begin{equation} \label{eq:Dist_X} P_X(B) = P(X \in B),~B \in{\cal B}, \end{equation}$$we immediately see that @@@P_X@@@ is (drum roll) a probability measure on the measurable space @@@(\R,{\cal B})@@@, the real line equipped with the Borel @@@\sigma@@@-algebra @@@{\cal B}@@@. This probability measure is called '''the distribution of @@@X@@@'''.
Why is this important? No matter what @@@(\Omega,{\cal F})@@@ is, we are now dealing with probabilities on subsets of @@@\R@@@ (representing values @@@X@@@ takes), putting everything in one context of real numbers rather than obscure (or not, but completely general) sample spaces or @@@\sigma@@@-algebras.
We won’t work directly with distributions of RVs, but we need to know that they give us the probabilities of all statements one can make about the RV. Instead of working with these measures, we will often work with something more pleasant:
Suppose that @@@X@@@ is a RV on a probability space @@@(\Omega,{\cal F},P)@@@, satisfying @@@P(-\infty < X < \infty) =1@@@. The '''cumulative distribution function (CDF)''' of @@@X@@@, denoted by @@@F_X@@@, is the function
$$ F_X (x) =P(X\le x).$$That is, @@@F_X (x) = P_X ((-\infty,x])@@@, so at first the CDF may seem like a mere snapshot, or shadow, of the distribution of the RV. This impression is wrong:
The CDF of a RV determines its distribution.
What we mean by this is that knowing the CDF of @@@X@@@ (which describes the values @@@P_X@@@ assigns to intervals of the form @@@(-\infty,x]@@@) allows us to determine @@@P_X(B)@@@ for any @@@B@@@, at least theoretically. This makes life easier, right? Let’s focus on CDFs and their properties.
Suppose that @@@X@@@ is the indicator that a fair coin lands Heads.
Then @@@X@@@ takes exactly two values, @@@0@@@ and @@@1@@@, each with probability @@@\frac 12@@@. For @@@x<0@@@, the event @@@\{X\le x\}@@@ is empty, therefore @@@F_X (x)=0@@@. For @@@x \in [0,1)@@@, we clearly have @@@\{X \le x\} = \{X=0\}@@@, and therefore @@@F_X (x) = \frac 12@@@, while for @@@x\ge 1@@@, the event @@@\{X\le x\} = \{X\in \{0,1\}\}@@@ has probability one, so @@@F_X (x) =1@@@. Let’s write this down:
$$ F_X (x) = \begin{cases} 0 & x < 0 \\ \frac 12 & 0 \le x <1 \\ 1 & 1\le x.\end{cases}$$Here is the graph of @@@F_X@@@.
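The three-case formula can be coded directly (a sketch; names are ours). Note the value at @@@x=0@@@ is @@@1/2@@@, matching right-continuity:

```python
# The three-case CDF above as code (sketch).
def F(x):
    """CDF of the indicator of Heads for a fair coin."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

print(F(-0.2), F(0.0), F(0.7), F(1.0))   # 0.0 0.5 0.5 1.0
```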
Here’s a video explaining the concept of the distribution of a RV, and the CDF.
We now list some important properties of CDFs.
Let @@@F_X@@@ be a CDF of a RV @@@X@@@. Then
- @@@F_X@@@ is nondecreasing and is right-continuous.
- @@@\lim_{x\to-\infty}F_X(x)=0@@@.
- @@@\lim_{x\to\infty} F_X(x) = 1@@@.
- Suppose @@@x\le y@@@. Then @@@\{X\le x\}\subseteq \{X\le y\}@@@, so by monotonicity of probability measures,
$$F_X (x) = P(X\le x) \le P(X\le y) = F_X (y),$$showing that @@@F_X@@@ is nondecreasing. Next,
$$F_X (x) = P(X\le x) = P( \cap_{n=1}^\infty \{X\le x+1/n\})=\lim_{n\to\infty} P(X\le x+1/n) = \lim_{n\to\infty} F_X (x+ 1/n),$$where the third equality is due to continuity of probability measures. By monotonicity of @@@F_X@@@, the limit @@@\lim_{y \searrow x} F_X(y)@@@ exists, and is therefore equal to @@@\lim_{n\to\infty} F_X(x+1/n)=F_X(x)@@@, proving right-continuity.
The converse is also true.
Suppose that @@@F@@@ is a nondecreasing and right-continuous function on @@@\R@@@, satisfying @@@\lim_{x\to-\infty} F(x) = 0@@@ and @@@\lim_{x\to\infty} F(x)=1@@@. Then there exists a probability space and a random variable on it whose distribution function is @@@F@@@.
We now illustrate how to use the CDF to find the probability that @@@X@@@ is in any given interval, or any union of disjoint intervals. Let’s get rolling. For ease of notation, we write @@@F@@@ for the CDF of @@@X@@@, omitting the subscript. We assume @@@-\infty<\beta\le \alpha<\infty@@@.
- @@@\eqto{P(X \le \alpha)}{F(\alpha)} = 1- P(X>\alpha)\quad \Rightarrow \quad P(X>\alpha) = 1-F(\alpha)@@@.
- @@@\eqto{P(X\le \alpha)}{F(\alpha)} = \eqto{P( X\le \beta)}{F(\beta)} + P( \beta <X \le \alpha)@@@. Therefore
$$P(\beta < X \le \alpha) = F(\alpha) - F(\beta).$$
- The CDF gives the probability of @@@X@@@ being less than or equal to @@@\beta@@@. What about the probability of being strictly less than @@@\beta@@@?
Before we answer, some additional notation. Remember that @@@F@@@ is nondecreasing? Then when we look at @@@F(x)@@@ as we increase @@@x@@@ to @@@\beta@@@, the values approach some limit, known as the left limit of @@@F@@@ at @@@\beta@@@ and denoted by @@@F(\beta-)@@@. This is not necessarily equal to @@@F(\beta)@@@. Why? Because we did not require @@@F@@@ to be continuous but only right-continuous. There may be a jump at @@@\beta@@@: look at the graph in Example 8. Well, the event @@@\{X< \beta\}@@@ is the union of the events @@@A_n= \{X \le \beta - 1/n\}@@@ as @@@n@@@ ranges over all @@@n\in\N@@@. Note that @@@A_{n}\subseteq A_{n+1}@@@, and therefore by continuity of probability measures,
$$P(X< \beta) = P(\cup A_n) =\lim_{n\to\infty} P(A_n)=\lim_{n\to\infty} P(X\le \beta - \frac 1n).$$In other words, @@@\eqto{P(X\le \beta -\frac 1n)}{F(\beta- \frac 1n)} \underset{n\to \infty} {\nearrow} P(X <\beta)@@@. Therefore
$$ P(X<\beta) = \lim_{x\nearrow \beta} F(x)=F(\beta-).$$- @@@\eqto{P( X \le \alpha)}{F(\alpha)} =\eqto{ P(X<\beta )}{F(\beta-)} +P(\beta \le X \le \alpha)@@@. Therefore
$$P(\beta \le X \le \alpha) = F(\alpha) - F(\beta-).$$
- By letting @@@\alpha=\beta@@@ in the last case, we have
$$\begin{equation} \label{eq:atom} P(X=\beta) = F(\beta) - F(\beta-). \end{equation}$$
- @@@\eqto{P(\beta < X \le \alpha)}{F(\alpha) - F(\beta)} = P(\beta < X < \alpha) + \eqto{P(X=\alpha)}{F(\alpha)-F(\alpha-)}.@@@ Therefore,
$$P(\beta < X < \alpha) = F(\alpha-) - F(\beta).$$
Here is a summary of everything we derived:
| Name | Interval @@@I@@@ | @@@P(X \in I)@@@ |
|---|---|---|
| unbounded below and closed | @@@(-\infty,\beta]@@@ | @@@F_X (\beta)@@@ |
| unbounded above and open | @@@(\beta,\infty)@@@ | @@@1-F_X (\beta)@@@ |
| bounded open from left and closed from right | @@@(\alpha,\beta]@@@ | @@@F_X (\beta) - F_X(\alpha)@@@ |
| singleton (single point) | @@@[\beta]@@@ | @@@F_X (\beta) - F_X (\beta-)@@@ |
| bounded open | @@@(\alpha,\beta)@@@ | @@@F_X(\beta-) - F_X(\alpha)@@@ |
| CDF and Probabilities of Intervals |
You can get other types of intervals from these above:
- @@@[\alpha,\beta)@@@? You can start from @@@(\alpha,\beta)@@@ and take its union with @@@[\alpha]@@@. This is a disjoint union, so probabilities add up:
$$P(\alpha \le X < \beta) = F_X(\beta-) - F_X(\alpha) + P(X=\alpha) = F_X(\beta-) - F_X(\alpha-).$$
- @@@[\alpha,\beta]@@@? We can repeat the process from the last example and add @@@P(X=\beta)@@@. This gives
$$P(\alpha \le X \le \beta) = F_X(\beta-) - F_X(\alpha-) + P(X=\beta) = F_X(\beta) - F_X(\alpha-).$$
- @@@(-\infty,\beta)@@@? You can start from @@@(-\infty, \beta]@@@ and subtract the probability of @@@X=\beta@@@. This gives you
$$P(X < \beta) = F_X(\beta) - P(X=\beta) = F_X(\beta-).$$
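These formulas can be sketched in code. In the snippet below (our own illustration), we reuse the fair-coin indicator CDF from earlier and approximate the left limit @@@F(\beta-)@@@ by evaluating just below @@@\beta@@@, which is fine for a step function like this one:

```python
# Sketch: recover interval probabilities from a CDF, approximating the left
# limit F(b-) numerically (adequate for step CDFs; names are ours).
def F(x):
    """CDF of the fair-coin indicator: jumps of 1/2 at 0 and at 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

def left_limit(F, b, eps=1e-9):
    return F(b - eps)

def prob_singleton(F, b):
    """P(X = b) = F(b) - F(b-), the jump size at b."""
    return F(b) - left_limit(F, b)

def prob_half_open(F, a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return F(b) - F(a)

print(prob_singleton(F, 0))       # 0.5: atom at 0
print(prob_singleton(F, 0.5))     # 0.0: no atom there
print(prob_half_open(F, -1, 0))   # 0.5
```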
Why did we give \eqref{eq:atom} a number? Since @@@F@@@ is nondecreasing, the only discontinuities it can have are jumps: the limits from the left and right exist and differ (the limit from the right equals the value at the point, due to right-continuity). This equation explains the origin of jumps: @@@F@@@ has a jump at @@@\alpha@@@ if and only if @@@P(X=\alpha)>0@@@, in which case we say that @@@X@@@ has an '''atom''' at @@@\alpha@@@. We’ll get back to atoms soon.
Let @@@X@@@ be a RV with CDF @@@F_X@@@. Express @@@P(1 \le X < 10)@@@ in terms of @@@F_X@@@. Note what type of inequalities are used.
Let @@@X@@@ be a RV with a continuous CDF @@@F_X@@@. If @@@F_X (1) = 0.5@@@ and @@@F_X(3)=0.7@@@, what is @@@P(|X-2|>1)@@@?
A RV @@@X@@@ has a CDF which is flat with the exception of three jumps of equal size, one at @@@-1@@@, one at @@@1@@@ and one at @@@2@@@. What is the probability that @@@X\ge 1@@@?
After all these calculations, let’s talk a little about theory (I know it may sound funny). If @@@X@@@ is an RV on one probability space and @@@Y@@@ an RV on another probability space, it does not make sense to compare the two: they are functions on different domains. However, the introduction of the probability distribution and the CDF for each strips the sample-space dependence from the random variable, keeping only the numerical values attained and the associated probabilities. This leads to the following (almost trivial) definition.
Let @@@X@@@ and @@@Y@@@ be RVs, possibly defined on distinct probability spaces. We say that @@@X@@@ and @@@Y@@@ have the same distribution or are equal in distribution, writing @@@X\sim Y@@@ or @@@X\overset{\mbox{dist}}{=}Y@@@ if @@@F_X=F_Y@@@ (or, equivalently if the distribution of @@@X@@@ is equal to the distribution of @@@Y@@@).
Equality in distribution is an equivalence relation, an important notion in mathematics. If you know what an equivalence relation is, try to show this. Otherwise, think of an equivalence relation as a notion of similarity or as a “cheap equality”: we only consider equality of certain aspects, or when viewed behind some filter. In our case, @@@X@@@ and @@@Y@@@ are equal in distribution if the probability of @@@\{X\in A\}@@@ and of @@@\{Y\in A\}@@@ is the same for every Borel set @@@A@@@.
Let @@@X@@@ be the indicator of Heads in a fair coin toss. Let @@@Y=1-X@@@.
Then @@@X\ne Y@@@. However, @@@X\sim Y@@@ because their CDFs coincide.
Three Distribution Types
CDFs allow us to classify distributions of RVs into three types.
The distribution of an RV is
- '''Continuous''' if its CDF is continuous or, equivalently, @@@P(X=\alpha)=0@@@ for all @@@\alpha@@@ (no atoms);
- '''Discrete''' if there exists a finite or countable set @@@A@@@ such that @@@P(X \in A)=1@@@ or, equivalently, @@@X@@@ is purely atomic;
- '''Mixed''' otherwise.
Though the three types describe the distribution of the RV rather than the RV itself, we will often refer to a RV as “continuous”, “discrete” or “mixed”, meaning that the underlying probability measure is such that the distribution of the RV is of the described type.
Discrete distributions
Basically, a distribution is discrete if the CDF only increases through jumps.
When @@@X@@@ is discrete, the finite or countable set @@@A@@@ satisfying @@@P(X=\alpha)>0@@@ for all @@@\alpha \in A@@@, is called the '''support''' of @@@X@@@ (warning: in advanced math we use the same word for a slightly different object). We also define an associated function, the '''probability mass function (PMF)''' of @@@X@@@, denoted by @@@p_X@@@ and given by
$$\begin{equation} \label{eq:PMF} p_X(x) = P(X=x),~x \in \R. \end{equation}$$Of course, @@@p_X@@@ is a function from @@@\R@@@ to @@@[0,1]@@@ and it is nonzero only on the support of @@@X@@@. It is not hard to see that the PMF determines the CDF:
If @@@X@@@ is a discrete RV, then its distribution is determined by its PMF. More precisely, for every @@@x\in \R@@@,
$$ F_X (x) = \sum_{\{\alpha \le x:p_X(\alpha)>0\}} p_X (\alpha).$$Conversely, if @@@A@@@ is a finite or countable subset of @@@\R@@@, and @@@p:\R \to [0,1]@@@ is a function satisfying
- @@@p(x) >0@@@ if and only if @@@x \in A@@@; and
- @@@\sum_{x\in A} p(x) =1@@@.
Then there exists a RV @@@X@@@ whose PMF is @@@p@@@.
In other words, to recover the CDF from the PMF, just add up all the nonzero values of the PMF at points @@@\alpha@@@ less than or equal to @@@x@@@. If there are none, @@@F_X(x)=0@@@.
The PMF is a very useful object, usually easier to work with than the CDF. We typically describe discrete RVs in terms of their PMFs:
Let @@@X@@@ be the number when rolling a standard and fair @@@6@@@-faced die. Then the PMF is pretty obvious: @@@p_X(1) = p_X(2) = \dots = p_X(6) = \frac 16@@@, while @@@p_X (x)=0@@@ for all other @@@x@@@. To recover the CDF, we need to do some summation, right? Here’s an illustration of the CDF. The default view gives you the CDF, but there are two alternate views that give you the PMF: one showing the summation process, and a second showing the PMF isolated from the CDF.
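The summation recovering the CDF from the PMF can be sketched as follows (our own code; exact rational arithmetic via Python fractions avoids floating-point rounding):

```python
# Sketch of the PMF-to-CDF summation for a fair six-faced die.
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}   # p_X(k) = 1/6, k = 1,...,6

def cdf(x):
    """F_X(x) = sum of p_X(a) over support points a <= x."""
    return sum((p for a, p in pmf.items() if a <= x), Fraction(0))

print(cdf(0.5))   # 0: below the support
print(cdf(3))     # 1/2: three faces at or below 3
print(cdf(3.5))   # 1/2: the CDF is flat between jumps
print(cdf(6))     # 1
```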
Let’s look at some more interesting examples.
Suppose that @@@p(2) =1/36,p(3)=2/36, ... ,p(6)=5/36, p(7)=6/36, p(8)=5/36,\dots, p(12)=1/36@@@ and @@@p(x)=0@@@ for all other @@@x@@@. Show that @@@p@@@ is the PMF of the sum of two independent fair dice tosses.
When tossing two dice, the number of outcomes is @@@36@@@. The possible values for the sum are @@@2,\dots,12@@@. The probability of each value for the sum is the number of ways you can form it divided by the number of outcomes. For values @@@j=2,\dots,7@@@, you can form the sum @@@j@@@ in @@@j-1@@@ ways (@@@(1,j-1)@@@, @@@(2,j-2), \dots, (j-1,1)@@@), so @@@p(j)=\frac{j-1}{36}@@@. Similarly any value @@@k@@@ in @@@\{8,\dots,12\}@@@ can be written as @@@14-j@@@ for @@@j\in \{2,\dots,6\}@@@, and if @@@(k_1,k_2)@@@ sums to @@@j@@@, then @@@(7-k_1,7-k_2)@@@ sums to @@@14-j@@@, and vice versa. Therefore, @@@p(k) = p(14-k)@@@, or @@@p(12)=p(2),p(11)=p(3),\dots, p(8)=p(6)@@@. We can summarize this PMF with a simple expression:
$$p(j) = \begin{cases} \frac{6- |j-7|}{36} & j=2,\dots,12\\ 0 & \mbox{otherwise} \end{cases}$$
Continuous distributions
Recall that the distribution of a RV @@@X@@@ is continuous when the corresponding CDF is continuous, or, equivalently when @@@P(X=x)=0@@@ for all @@@x@@@.
Let’s go one step further with continuity and make a definition.
We say that a CDF @@@F@@@ has a '''density @@@f@@@''' if @@@F@@@ is the primitive of @@@f@@@, that is: there exists a nonnegative (improperly) Riemann integrable function @@@f@@@ such that for all @@@x\in \R@@@
$$\begin{equation} \label{eq:primitive} F (x) =\int_{-\infty}^x f(t) dt. \end{equation}$$If @@@F@@@ is the CDF of the RV @@@X@@@, then @@@f@@@ is called the density of @@@X@@@, usually referred to as @@@f_X@@@.
Here’s an immediate observation.
Let @@@X@@@ be an RV with density @@@f_X@@@. Then for any interval or finite union of intervals @@@I@@@,
$$P(X \in I) = \int_I f_X(t) dt.$$Of course, we bother to introduce CDFs with densities because not every continuous CDF has a density. Ever heard of the Cantor function? Take a look at the figure. This function is continuous, but not an integral of anything. It is differentiable on an open set which is dense in @@@[0,1]@@@, and where it is differentiable, the derivative is zero. All the increase comes from a small but uncountable set, known as the Cantor set. Very roughly, this is a continuous distribution which is “as close as it gets” to being discrete.
| [[File:CantorEscalier-2.svg | thumb | The Cantor function]] |
Lucky for us, almost all continuous distributions we will consider will have a density.
Before more theory, let’s consider a concrete case.
Let @@@F@@@ be the function
$$\begin{equation} \label{eq:cdf_uni} F(x)= \begin{cases} 0 & x< 0; \\ x & 0\le x <1; \\ 1 & x\ge 1,\end{cases} \end{equation}$$Then @@@F@@@ is a continuous CDF. It has a density, given by
$$\begin{equation} \label{eq:density_uni} f(x) = \begin{cases} 1 & x \in (0,1); \\ 0 & \mbox{otherwise}.\end{cases} \end{equation}$$With this extremely simple example in mind, suppose that @@@X@@@ has CDF @@@F_X@@@ and density @@@f_X@@@. Since @@@F_X@@@ is continuous, for every interval @@@I@@@ with left endpoint @@@a@@@ and right endpoint @@@b@@@ (here, of course, @@@a\le b@@@, and the interval can be open, closed or neither), we have the equality
$$ P(X \in I ) \overset{X\mbox{ is continuous}}{=} P( X \in (a,b)) \overset{\eqref{eq:primitive}}{=} \int_a^b f_X(t) dt.$$Now let’s fix some @@@x_0 \in\R@@@. Assume that @@@f_X@@@ is continuous at @@@x_0@@@. Then the continuity of @@@X@@@ implies @@@P(X=x_0)=0@@@. Yet, taking @@@a=x_0-\epsilon,b=x_0+\epsilon@@@ for some small @@@\epsilon>0@@@, the equation above gives
$$ P( x_0 - \epsilon < X < x_0+\epsilon) = \int_{x_0-\epsilon}^{x_0+\epsilon} f_X(t) dt \sim 2\epsilon f_X (x_0),$$because the integral is over an interval so small that its value is essentially @@@f_X(x_0)@@@, the value of the function at the center of the interval, times the length of the interval, plus a smaller “error”. In other words, the probability that @@@X@@@ lies in the interval of length @@@2\epsilon@@@ centered around @@@x_0@@@ is roughly the length of the interval times the density at the center of the interval.
Bottom line(s):
- Since probability of attaining a specific value is zero, we look at intervals instead.
- We use lengths of intervals as reference for computing probabilities in intervals.
- The density at a given point describes the ratio between the probability of being in a small interval around that point and the length of that interval, as the length of the interval tends to zero.
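This local interpretation of the density is easy to see numerically. Here is a minimal Monte Carlo sketch for an RV with the CDF from \eqref{eq:cdf_uni} (the uniform on @@@[0,1]@@@, density @@@1@@@ on @@@(0,1)@@@); the choices of @@@x_0@@@, @@@\epsilon@@@, sample size and seed are arbitrary:

```python
import random

random.seed(0)

# Monte Carlo sketch: for X ~ U[0,1] (density f_X = 1 on (0,1)), the ratio
#   P(x0 - eps < X < x0 + eps) / (2 * eps)
# should be close to f_X(x0).
x0, eps, n = 0.5, 0.01, 200_000          # arbitrary illustration choices
samples = [random.random() for _ in range(n)]
hits = sum(1 for x in samples if x0 - eps < x < x0 + eps)
ratio = hits / n / (2 * eps)
print(round(ratio, 2))                   # close to f_X(0.5) = 1
```

As @@@\epsilon@@@ shrinks (and the sample grows accordingly), the ratio settles on the density value at @@@x_0@@@.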
Before we move on, observe that the notion of density is the continuous analog of the PMF, and in fact, some authors will refer to the PMF as a “density”. To see the analogy, review our definition of density \eqref{eq:primitive} and the connection between PMF and CDF.
Now given an RV, when does it have a density? We will discuss some pretty sophisticated methods of finding out, but at the moment, let’s stick to basic calculus. From the Fundamental Theorem of Calculus, we learn the following:
An RV @@@X@@@ has a density if and only if the CDF of @@@X@@@, @@@F_X@@@, has a derivative @@@f@@@, not necessarily defined everywhere, satisfying
$$ \int_{-\infty}^\infty f(x) dx =1.$$Furthermore, in this case @@@f@@@ is the density of @@@X@@@.
Let’s try to digest this. Gimmie some RV @@@X@@@, let’s say @@@X\sim \mbox{Bern}(1/2)@@@. Remember its CDF we plotted earlier? Well, it is flat, except for the jumps at @@@0@@@ and at @@@1@@@. Therefore the derivative exists except at two points, and is equal to zero wherever it exists. The integral of the zero function is simply zero, so oops, this RV does not have a density.
On the other hand, if we look at the RV with CDF given in \eqref{eq:cdf_uni}, then again, its derivative exists everywhere except for… the same two points @@@0@@@ and @@@1@@@, and this derivative is given by the function \eqref{eq:density_uni}, which, upon simple inspection, does integrate to @@@1@@@. Therefore this RV has a density.
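The two computations above can be mimicked numerically. The sketch below (an illustration, not part of the formal argument) differentiates each CDF by a symmetric difference quotient on a grid that avoids the jump points, then integrates the result:

```python
import numpy as np

# Differentiate each CDF where it is differentiable, then check whether
# the derivative integrates to 1 (the FTC criterion above).
h = 1e-6
xs = np.arange(-1 + 0.0005, 2, 0.001)    # grid points stay away from 0 and 1

def bern_cdf(x, p=0.5):                  # CDF of Bern(1/2): jumps at 0 and 1
    return np.where(x < 0, 0.0, np.where(x < 1, 1 - p, 1.0))

def uni_cdf(x):                          # the uniform CDF from the example
    return np.clip(x, 0.0, 1.0)

totals = {}
for name, F in [("Bern(1/2)", bern_cdf), ("U[0,1]", uni_cdf)]:
    deriv = (F(xs + h) - F(xs - h)) / (2 * h)    # derivative where it exists
    totals[name] = float(np.sum(deriv) * 0.001)  # Riemann sum of the derivative
    print(name, round(totals[name], 2))
# Bern(1/2): the derivative is 0 wherever it exists, so it integrates to 0.
# U[0,1]: the derivative integrates to 1, so it is the density.
```

The grid spacing and @@@h@@@ are arbitrary numerical choices; the point is only that the Bernoulli derivative integrates to @@@0@@@ while the uniform one integrates to @@@1@@@.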
Let @@@X@@@ be an RV with CDF
$$F(x)=\begin{cases} 0 & x<0 \\ x^2 & x \in [0,1) \\ 1 & x\ge 1\end{cases}$$Show that @@@X@@@ has a density and find it.
Here’s an example that will show how far we can stretch this notion.
Let @@@X\sim \mbox{U}[0,1]@@@. What is the probability that the second digit in the decimal expansion of @@@X@@@ is @@@2@@@ (use the non-terminating expansion)?
Well, the second digit is @@@2@@@ if and only if @@@X@@@ is in @@@I_1=[0.02,0.03)@@@ or in @@@I_2=[0.12,0.13)@@@, or in @@@I_3=[0.22,0.23)@@@, and so on, up to @@@I_{10}=[0.92,0.93)@@@. Thus, the answer is equal to
$$ \int_{I_1 \cup I_2 \cup \dots \cup I_{10}} f(t) dt = \sum_{j=1}^{10} \int_{I_j} 1\, dt = \sum_{j=1}^{10} |I_j| = 10\cdot \frac{1}{100} = \frac{1}{10},$$where @@@|I_j|@@@ denotes the length of the interval @@@I_j@@@ (right endpoint minus left endpoint).
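If you want to sanity-check this without the integral, here is a quick Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import random

random.seed(1)

# Monte Carlo sketch: the second decimal digit of U ~ U[0,1].
# int(u * 100) % 10 extracts that digit (e.g. u = 0.123 -> 12 % 10 = 2).
n = 100_000
hits = sum(1 for _ in range(n) if int(random.random() * 100) % 10 == 2)
print(hits / n)    # should be close to 1/10
```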
Mixed distributions
The next one is from Mix and Match Jokes.
Q: What do you get when you cross a karate expert with a pig?
A: A porkchop.
Mixed RVs are, in essence, obtained through a similar procedure. Take a discrete CDF and a continuous one, and combine them to get a mixed RV. We’ll get to the mathematical theorem, but first let’s return to our definition of a mixed CDF: not continuous (so it has jumps), and not discrete (so it does not increase only through jumps). This basically gives us
Let @@@F@@@ be a mixed CDF. Then there exist @@@p\in (0,1)@@@, a discrete CDF @@@F_D@@@ and a continuous CDF @@@F_C@@@ such that
$$\begin{equation} \label{eq:mixed_decomp} F(x) = p F_D (x) + (1-p) F_C(x). \end{equation}$$Conversely if @@@F@@@ is any CDF satisfying \eqref{eq:mixed_decomp}, then @@@F@@@ is a mixed CDF.
Take any mixed CDF @@@F@@@. Then we already know
- It has jumps. Let @@@A=\{x: F \mbox{ is not continuous at }x\}@@@. Then @@@A@@@ is not empty, and necessarily finite or countable.
- It does not only increase through jumps. Therefore @@@p@@@, the probability assigned to the set @@@A@@@, satisfies @@@0< p= \sum_{x \in A} (F(x) - F(x-))<1@@@.
** In light of this, define @@@p_D(x) = \frac{F(x)-F(x-)}{p}@@@ for @@@x\in A@@@. Then @@@p_D@@@ is a genuine PMF.
** Now let @@@F_D (x) = \sum_{y \in A,\,y\le x} p_D (y)@@@. Then @@@F_D@@@ is a discrete CDF, and it is continuous at all points except on @@@A@@@.
- Let @@@F_C (x) = \frac{1}{1-p} (F(x) - p F_D(x))@@@. From the definition of @@@F_D@@@ you should be able to conclude
** @@@F_C@@@ is nonnegative and nondecreasing.
** @@@\lim_{x\to -\infty} F_C (x)=0,\lim_{x\to\infty} F_C (x) = 1@@@.
** @@@F_C@@@ is continuous (all jumps of @@@F@@@ are compensated by those of @@@pF_D@@@. That simple).
- Therefore @@@F_C@@@ is a continuous CDF.
Putting it all together gives us \eqref{eq:mixed_decomp}. Conversely, given a CDF @@@F@@@ of the form \eqref{eq:mixed_decomp} for some @@@p\in (0,1)@@@, discrete @@@F_D@@@ and continuous @@@F_C@@@, then @@@F@@@ is mixed for the same reasons we listed below the joke.
■

How can you produce such mixed RVs? All you need is one example.
I have two RVs: @@@U\sim \mbox{U}[0,1]@@@ and @@@X\sim \mbox{Bern}(1/2)@@@. I’m forming a new RV @@@Y@@@ as follows. I toss a fair coin, independent of everything else. If it lands @@@H@@@, I’m setting @@@Y=U@@@, while if it lands @@@T@@@, then I’m setting @@@Y=X@@@.
Let’s find the CDF of @@@Y@@@.
$$ \begin{align*} F_Y (y) & = P( Y\le y) \overset{\mbox{total probability}}{=} P( Y\le y | H) P(H) + P(Y\le y | T) P(T) \\ & =\frac 12 P( U \le y | H) + \frac 12 P( X \le y | T) \\ & \overset{\mbox{independence}}{=} \frac 12 P(U\le y) + \frac 12 P(X\le y) \\ & = \frac 12 F_U (y) + \frac 12 F_X (y), \end{align*} $$and of course, this is a mixed CDF because it is neither continuous nor discrete.
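Here is a short simulation sketch of this exact construction (the coin, @@@U@@@ and @@@X@@@ drawn independently; the sample size and evaluation points are arbitrary choices), comparing the empirical CDF of @@@Y@@@ with the mixed CDF we just computed:

```python
import random

random.seed(2)

# Simulate the construction: toss a fair coin; on H set Y = U ~ U[0,1],
# on T set Y = X ~ Bern(1/2).  All draws are independent.
def sample_Y():
    if random.random() < 0.5:                  # the coin lands H
        return random.random()                 # Y = U
    return 0 if random.random() < 0.5 else 1   # Y = X

n = 100_000
ys = [sample_Y() for _ in range(n)]

def F_Y(y):
    # the mixed CDF: (1/2) F_U(y) + (1/2) F_X(y)
    F_U = min(max(y, 0.0), 1.0)
    F_X = 0.0 if y < 0 else (0.5 if y < 1 else 1.0)
    return 0.5 * F_U + 0.5 * F_X

for y in [-0.5, 0.25, 0.5, 1.0]:
    emp = sum(1 for v in ys if v <= y) / n     # empirical CDF at y
    print(y, round(emp, 3), F_Y(y))
```

The empirical values match @@@F_Y@@@ up to the usual sampling error.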
The professor who taught me undergrad probability used to say this: “The time you wait at a stoplight is a mixed RV”.
Explain!
Approximation of one type by another
Suppose @@@X@@@ is a continuous or mixed RV. Then it’s pretty easy to approximate it with a discrete RV. Here’s how.
Recall the floor function @@@x\to \lfloor x\rfloor@@@, which gives us the largest integer @@@\le x@@@. For example, @@@\lfloor 0.9\rfloor=0@@@, @@@\lfloor 3\rfloor =3@@@, @@@\lfloor -.01\rfloor =-1@@@. Observe that
$$\begin{equation} \label{eq:floor} x-1 < \lfloor x \rfloor \le x \end{equation}$$Now let’s take an RV @@@X@@@. We will come up with a discrete RV close to @@@X@@@. How close? Fix any @@@\epsilon>0@@@. We will find an RV @@@Y_\epsilon@@@ such that @@@0 \le X - Y_\epsilon <\epsilon@@@. All that’s left is to define our approximation:
$$ Y_\epsilon = \epsilon\lfloor X/\epsilon\rfloor.$$Then from \eqref{eq:floor} we see that @@@X -\epsilon < Y_\epsilon \le X@@@. Also, @@@Y_\epsilon@@@ is discrete: it is always an integer multiple of @@@\epsilon@@@.
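A quick sketch of this approximation (with @@@\epsilon=0.1@@@, and, for lack of a specific RV, samples drawn uniformly from @@@[-5,5]@@@ as a stand-in for @@@X@@@):

```python
import math
import random

random.seed(3)

# Check that Y_eps = eps * floor(X / eps) satisfies 0 <= X - Y_eps < eps.
eps = 0.1
gaps = []
for _ in range(1000):
    x = random.uniform(-5, 5)            # a stand-in for a sample of X
    y = eps * math.floor(x / eps)        # the discrete approximation Y_eps
    gaps.append(x - y)
print(round(min(gaps), 4), round(max(gaps), 4))   # both inside [0, eps)
```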
Now we can also approximate any RV with an RV that has a density. We need more theory to make this equally simple, and we will provide the construction later. In the meantime, we will show that any CDF can be approximated by one with a density.
Pick a CDF @@@F@@@, and fix an @@@\epsilon>0@@@. This, again, will serve as how close the newly minted CDF will be to @@@F@@@. We define
$$F_\epsilon (x) = \frac{1}{\epsilon} \int_{x}^{x+\epsilon} F(t) dt.$$Since @@@F@@@ is nondecreasing, so is @@@F_\epsilon@@@, and since an integral is continuous as a function of the lower and upper limits, @@@F_\epsilon@@@ is continuous. Furthermore,
$$ F(x) \le F_{\epsilon} (x) \le F(x+\epsilon),$$and since @@@F@@@ is right continuous, @@@\lim_{\epsilon\to 0} F_\epsilon (x) = F(x)@@@ for all @@@x@@@.
Finally, observe that
$$\begin{equation} \label{eq:to_density} \begin{split} F_\epsilon (x) & = \frac{1}{\epsilon} \left ( \int_{-\infty}^{x+\epsilon} F(t ) dt - \int_{-\infty}^{x} F(t) dt\right) \\ & = \int_{-\infty}^x \frac{F(t+\epsilon) - F(t)}{\epsilon} dt, \end{split} \end{equation}$$where the second equality follows by substituting @@@t+\epsilon@@@ for @@@t@@@ in the first integral. Therefore, according to our definition, @@@F_\epsilon@@@ has density @@@\frac{F(x+\epsilon) - F(x)}{\epsilon}@@@.
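To see the smoothing in action, here is a numerical sketch using the @@@\mbox{Bern}(1/2)@@@ CDF, our density-less example from earlier (grid size and evaluation points are arbitrary choices); it checks the squeeze @@@F(x) \le F_\epsilon(x) \le F(x+\epsilon)@@@ at a few points:

```python
import numpy as np

eps = 0.1

def F(x):
    # the Bern(1/2) CDF: jumps at 0 and 1, no density
    return np.where(x < 0, 0.0, np.where(x < 1, 0.5, 1.0))

def F_eps(x, m=1000):
    # numerical average of F over [x, x + eps], i.e. the smoothed CDF
    t = np.linspace(x, x + eps, m)
    return float(np.mean(F(t)))

for x in [-0.05, 0.0, 0.5, 0.95, 1.0]:
    lo, hi = float(F(np.array(x))), float(F(np.array(x + eps)))
    mid = F_eps(x)
    assert lo <= mid <= hi      # the squeeze F(x) <= F_eps(x) <= F(x + eps)
    print(x, round(mid, 3))
```

Near the jumps, @@@F_\epsilon@@@ ramps up linearly instead of jumping, which is exactly the continuity claim above.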
Bottom line: we have found a CDF with a density that converges to @@@F@@@. If we repeat the process for @@@F_\epsilon@@@, we obtain an approximation with a continuous density. Of course, we can repeat again to get a continuously differentiable density, etc. But there are simpler ways to achieve this in one operation. Again, when we learn more, it will be much simpler.