Intro

In the 2018-2019 Women’s Final Four, held in Tampa, there were four teams: our very own UConn, Oregon, Notre Dame, and Baylor.

We’re using the uniform probability measure, so the probability of Baylor (or UConn, or any other team) winning is @@@1/4@@@. After the semifinals, UConn and Oregon were eliminated. What, then, is the probability of Baylor winning the title? @@@1/2@@@. What is the probability of UConn winning the title? @@@0@@@. What just happened? We learned something about the outcome after the semifinals. This led to an adjusted probability measure, known in our line of work as conditional probability, derived from the original probability by using what we learned about the outcome.

Conditional probabilities are abundant and very useful tools. Very often, what we are given are conditional probabilities rather than the actual probabilities. Conditional probabilities also serve as a vital tool for breaking down computations into “small” pieces.

Let’s begin with a fun riddle on conditional probability. https://www.youtube.com/watch?v=cpwSGsb-rTs After that, I strongly recommend watching this introductory video https://www.youtube.com/watch?v=JGeTcRfKgBo

Definition of conditional probability

Throughout our discussion, @@@P@@@ is a probability measure on a (@@@\sigma@@@-algebra of subsets of a) sample space @@@\Omega@@@.

Conditional probability is the “adjusted” probability measure on @@@\Omega@@@ obtained by “restricting” the sample space to some event, say @@@B@@@, whose @@@P@@@-probability is not zero. The restriction can be thought of as the new probability measure obtained by assuming that the outcome is in @@@B@@@. Think of it as getting a tip about the outcome from a reliable source. Remember we described the information we will have about the outcome at the conclusion of our observation as the collection of all yes/no questions we will be able to answer? The act of conditioning the probability measure represents the change in the probability measure by having some of these questions answered for us, just as if we were an investigator that was just tipped by some reliable informer.

This is a very natural idea we’re exposed to on a daily basis. For example, according to data from Pew Research Center, @@@42\%@@@ of US households have a gun. The figure for urban households is @@@29\%@@@, and for rural households it is @@@58\%@@@. In this example, the sample space is US households, and @@@P@@@ is the uniform probability measure on it. Let @@@A@@@ be the event “household has gun”. Then the first figure is @@@P(A)@@@, the proportion of households with guns. In mathematical terms, the latter two figures are conditional probabilities: the @@@P@@@-probability of @@@A@@@, conditioned, respectively, on the event “urban household” and “rural household”: each represents the adjusted probability when restricting the sample space to the urban or to the rural setting.

According to Census data, in 2015, @@@15.2\%@@@ of US population was 65 or older, and @@@16.2\%@@@ of the population of CT was 65 or older. The relation between the two figures? Let @@@\Omega@@@ be the US population in 2015 and let @@@P@@@ be the uniform probability measure on it. Let @@@A@@@ be the event “65 years or older”, and let @@@B@@@ be the event “CT resident”. Then the first figure is @@@P(A)@@@, while the second is the @@@P@@@-probability of @@@A@@@, conditioned on @@@B@@@. How did we obtain the conditional probability from @@@P@@@ in this example? There’s a simple procedure: we’re just calculating relative frequencies. Start from your event @@@A@@@, and keep only those elements in @@@A@@@ which are also in @@@B@@@. This leaves you with the intersection of @@@A@@@ and @@@B@@@, @@@A\cap B@@@, or, in words, the part of the population which is both 65 or older and CT residents. We then count the number of elements in @@@A\cap B@@@, and divide it by the number of elements in @@@B@@@ to obtain the relative frequency of “65 and older” in “CT resident”. In symbols, the conditional probability is equal to

$$ \frac{|A\cap B|}{|B|} = \frac{|A\cap B|/|\Omega|}{|B|/|\Omega|} = \frac{P(A\cap B)}{P(B)}.$$

More generally, since probabilities generalize relative frequencies, we use the above calculation as motivation for the definition of conditional probability:

Definition 1.

Suppose that @@@P@@@ is a probability on @@@\Omega@@@, and let @@@B@@@ be an event with @@@P(B)>0@@@. For each event @@@A@@@, we define the probability of @@@A@@@ conditioned on (or given) @@@B@@@ as

$$\frac{P(A \cap B)}{P(B)},$$

and denote this quantity by @@@ P(A|B).@@@

In other words, @@@P(A|B)@@@ is the “relative probability” of @@@A@@@ in @@@B@@@.
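
For finite sample spaces with the uniform measure, the definition boils down to counting. Here is a minimal Python sketch of this counting (the function name and the die example are mine, not part of the notes):

```python
from fractions import Fraction

def cond_prob(A, B, omega):
    """P(A | B) for the uniform probability measure on a finite sample space omega."""
    A, B = set(A) & set(omega), set(B) & set(omega)
    if not B:
        raise ValueError("conditioning event must have positive probability")
    return Fraction(len(A & B), len(B))

# Toy example: roll one die; A = "even", B = "at least 4".
omega = {1, 2, 3, 4, 5, 6}
print(cond_prob({2, 4, 6}, {4, 5, 6}, omega))   # 2/3
```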

How to read conditional probability in word problems? To find the event whose conditional probability we describe always try to answer the question “probability of what?”. To find the event we condition on, try to answer the question “assuming/knowing/restricting to what?”. You’ll have plenty of examples below.

Example 1.

It starts normal. A family has two children.

  1. What is the probability that the youngest is a boy?
  2. How would your answer change if we know that the family has a boy?
  3. How would your answer to part 1. change if we condition on a family that has a boy born on Sunday?

Before answering each, let’s figure out the framework (at least for 1. and 2.). The sample space is all (ordered) sequences of length @@@2@@@ consisting of the symbols @@@b@@@ and @@@g@@@, representing the gender of the children from eldest to youngest. For example, @@@bg@@@ means a first-born boy with a younger sister. By the product rule, the sample space has @@@2\times 2@@@ elements. Since we did not make any assumptions, we assume that the probability on this space is uniform, and so the probability of every event is the number of its elements divided by @@@4@@@.

  1. Let @@@A@@@ be the event “youngest is a boy”, that is @@@\{gb,bb\}@@@. This event has @@@2\times 1@@@ elements, thus its probability is @@@\frac 12@@@. This answers the first question.
  2. This involves conditioning. Let @@@B@@@ be the event “family has a boy”. Then what we’re asking is for the probability of @@@A@@@ conditioned on (or given) @@@B@@@, or, in symbols, @@@P(A|B)@@@. The complement of @@@B@@@ is @@@\{gg\}@@@. Therefore @@@P(B) =1-\frac14 = \frac 34@@@. We need to calculate the ratio @@@P(A\cap B)/ P(B)@@@. Since the statement “youngest is a boy” implies “family has a boy”, it follows that the corresponding events satisfy @@@A\subset B@@@ (if the youngest is a boy we definitely have a boy). Therefore @@@A\cap B = A@@@ and so @@@P(A\cap B) = P(A)= \frac 12@@@. The conditional probability is therefore
$$P(A|B) = \frac{ 1/2}{3/4} = \frac{2}{3}.$$
In other words, conditioning eliminates the outcome {::nomarkdown}@@@\{gg\}@@@{:/nomarkdown} from the sample space, leaving us with {::nomarkdown}@@@3@@@{:/nomarkdown} outcomes {::nomarkdown}@@@\{bb,bg,gb\}@@@{:/nomarkdown}, out of which exactly two correspond to the youngest being a boy.
  3. This one is a little tricky. Not because of the math, but because the question is ambiguous: some may argue that the additional information on the day of birth is completely irrelevant, so the answer is identical to part 2, while some may argue that this information means that the sample space is different and includes the day on which each child is born. With this latter interpretation the sample space is sequences of length two of symbols of the form {::nomarkdown}@@@x_i@@@{:/nomarkdown}, where {::nomarkdown}@@@x@@@{:/nomarkdown} represents the gender, being equal to {::nomarkdown}@@@b@@@{:/nomarkdown} or {::nomarkdown}@@@g@@@{:/nomarkdown}, while {::nomarkdown}@@@i@@@{:/nomarkdown} represents the day of the week, {::nomarkdown}@@@1@@@{:/nomarkdown} for Sunday, {::nomarkdown}@@@2@@@{:/nomarkdown} for Monday, ..., and {::nomarkdown}@@@7@@@{:/nomarkdown} for Saturday. Thus, by the product rule, we have {::nomarkdown}@@@14=2\times 7@@@{:/nomarkdown} symbols, and we have exactly {::nomarkdown}@@@14^2@@@{:/nomarkdown} sequences of length {::nomarkdown}@@@2@@@{:/nomarkdown} obtained from these symbols. Time for calculations.
- The complement of {::nomarkdown}@@@B@@@{:/nomarkdown} is all sequences without the symbol {::nomarkdown}@@@b_1@@@{:/nomarkdown}. There are {::nomarkdown}@@@13^2@@@{:/nomarkdown} such sequences. Therefore {::nomarkdown}@@@|B|= 14^2-13^2 = 27@@@{:/nomarkdown}.
- The number of sequences with the youngest being a boy is {::nomarkdown}@@@14\times 7@@@{:/nomarkdown}. Among those, exactly {::nomarkdown}@@@13\times 6@@@{:/nomarkdown} do not have a boy born on a Sunday. Therefore, the number of elements in {::nomarkdown}@@@A\cap B@@@{:/nomarkdown} is {::nomarkdown}@@@14\times 7-13\times 6 = 98-78= 20@@@{:/nomarkdown}.
- Summarizing, the answer is {::nomarkdown}@@@|A\cap B|/|B| = 20/27@@@{:/nomarkdown} (a brute-force check follows below).
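
If you’d like to double-check parts 2 and 3 by brute force, here is a minimal Python sketch that enumerates the sample space (the variable names are just for illustration):

```python
from itertools import product
from fractions import Fraction

days = range(1, 8)                                     # 1 = Sunday, ..., 7 = Saturday
symbols = [(sex, day) for sex in "bg" for day in days]
outcomes = list(product(symbols, repeat=2))            # (eldest, youngest); 14^2 outcomes

# Part 2: P(youngest is a boy | family has a boy); the days are irrelevant here.
B = [w for w in outcomes if any(c[0] == "b" for c in w)]
print(Fraction(sum(w[1][0] == "b" for w in B), len(B)))          # 2/3

# Part 3: condition on "family has a boy born on a Sunday".
B_sun = [w for w in outcomes if ("b", 1) in w]
print(Fraction(sum(w[1][0] == "b" for w in B_sun), len(B_sun)))  # 20/27
```
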
Exercise 1.

Give your own example of a probability measure on a sample space, an event and a corresponding conditional probability measure from “real life”.

Want to see some conditional probability in action?

Exercise 2.

Some forensics? Consider the following information about a game of chance.

  1. If I cheat, I win the game.
  2. The probability I win the game is @@@1/1,000@@@
  3. The probability I cheat is @@@1/2@@@.

Something is wrong. Very wrong. So wrong it is impossible. Can you find it?

Properties of conditional probability

The definition of conditional probability immediately leads to the following result

Proposition 1.

Suppose @@@P(B)>0@@@. Then

  1. if @@@A\subseteq B@@@, then @@@P(A|B) = P(A)/P(B)@@@.
  2. if @@@B \subseteq A@@@, then @@@P(A|B) =1@@@.
  3. If @@@A_1,A_2,\dots@@@ are pairwise disjoint, then
$$ P(\cup_{j=1}^\infty A_j|B) = \sum_{j=1}^\infty P(A_j|B).$$
Exercise 3.

Prove the proposition.

Observe that as a result, @@@P(B^c|B) =0@@@. Indeed,

$$1\overset{2.}{=} P(\Omega | B) \overset{3.}{=} P(B|B) + P(B^c|B) = \overset{2.}{1} + P(B^c|B).$$

A more theoretical perspective yields the following important observation

Proposition 2.

Suppose @@@P(B)>0@@@. For any event @@@A@@@ let

$$Q(A) = P(A| B).$$

Then @@@Q@@@ is a probability measure, namely it satisfies the definition of a probability measure.

The proof follows directly from the definition of a probability measure and of conditional probability.

Exercise 4.

Prove the proposition.

Exercise 5.

Give an example of a probability measure @@@P@@@, and events @@@A@@@ and @@@B@@@ such that @@@P(B)P(B^c)>0@@@ and such that @@@P(A|B) + P(A|B^c) >1@@@.

Exercise 6.

Suppose that @@@P@@@ is a probability on @@@\Omega= \{H,T\}@@@ satisfying @@@P(H)=P(T)=\frac 12@@@ and @@@Q@@@ is a probability on @@@\Omega@@@ satisfying @@@Q(H)=1/3@@@. Can @@@Q@@@ be obtained from @@@P@@@ through conditioning on some event @@@B@@@?

Exercise 7.

In a presidential election a candidate Water Mallone received 40% of the total votes. At least 80% of the voters are in strong support of domestic growth of lychee fruit (yum!). Show that the proportion of voters from the latter group that voted for Water Mallone does not exceed 50%.

I’d like to give you one more example which I think is important to keep in mind.

Example 2.

Three candidates are running for office. The race is tight, and all outcomes are equally likely. You rank them (winner, second, third).

  1. What is the probability that your guess is correct?
  2. Assuming that your guess has at least one correct ranking what is the probability your guess is correct?
  3. Assuming your guess for the top ranked candidate is correct, what is the probability that your guess is correct?

From the wording of the problem, the underlying probability measure @@@P@@@ is the uniform measure on all different rankings. A ranking is a permutation of the three candidates, therefore @@@P(A)=|A|/3!=|A|/6@@@ for any event @@@A@@@.

Now for the actual probabilities:

  1. Let @@@A@@@ be the event that your ranking (for all candidates) is correct. This contains exactly one outcome so @@@P(A) = 1/6@@@, giving us an answer.
  2. The additional information in the second part amounts to conditioning on the event that your guess has at least one correct candidate. Call this event @@@B_1@@@. Now the complement of @@@B_1@@@ is all outcomes where every ranking is wrong. There are two ways to guess the first wrong, and each of those corresponds to exactly one way of selecting the second and last wrong. Altogether @@@B_1^c@@@ has two elements, so @@@B_1@@@ has @@@4@@@ elements. Note that @@@A\cap B_1@@@ is @@@A@@@ because @@@A\subset B_1@@@ (if you guessed all correctly, you definitely guessed at least one correctly). Therefore the answer, @@@P(A|B_1)=1/4@@@. The additional information increases your chances of guessing correctly.
  3. Finally, let @@@B@@@ be the event that you guessed the winner correctly. @@@B@@@ has two elements, and similarly to the last calculation, @@@A \cap B =A@@@, because @@@A\subset B@@@. Thus, @@@P(A|B)=1/2@@@.

It is clear why @@@P(A)< P(A|B_1)@@@ and @@@P(A)< P(A|B)@@@. But why is there a difference between conditioning on the two events @@@B_1@@@ and @@@B@@@? The reason is that @@@B@@@ actually gives us more information than @@@B_1@@@, in the sense that @@@B@@@ limits the number of possibilities to @@@2@@@, while @@@B_1@@@ limits it to twice as many…
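
Since the sample space has only @@@3!=6@@@ permutations, we can verify all three answers by listing them. A minimal Python sketch (the candidate names are placeholders of mine):

```python
from itertools import permutations
from fractions import Fraction

guess = ("X", "Y", "Z")                              # your ranking: winner, second, third
omega = list(permutations(guess))                    # all 3! equally likely outcomes

A  = [w for w in omega if w == guess]                                # whole guess correct
B1 = [w for w in omega if any(w[i] == guess[i] for i in range(3))]   # at least one correct
B  = [w for w in omega if w[0] == guess[0]]                          # winner correct

print(Fraction(len(A), len(omega)))                           # 1/6
print(Fraction(len([w for w in B1 if w == guess]), len(B1)))  # 1/4
print(Fraction(len([w for w in B  if w == guess]), len(B)))   # 1/2
```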

The Law of Total Probability

This is an unfairly bombastic name, which I am using because it’s the convention. A better fitting name, I believe, is the “total probability formula”. It is just a formula…

So far we discussed deriving conditional probability from the “original” probability. We will now consider the question of recovering probabilities of events under the “original” probability from their conditional probabilities. Why?

  • Very often, some of the data are presented or collected in terms of conditional probabilities.
  • Many times, conditional probabilities are easier to compute than the complete probability, the same way that it is easier to chew smaller pieces.

We will shortly expand on both through examples. Before that, we will need another important formula, the total probability formula (more formally known as the law of total probability). Suppose that @@@B_1,\dots,B_n@@@ are pairwise disjoint, that @@@\cup_{j=1}^n B_j = \Omega@@@, and that @@@P(B_j)>0@@@ for all @@@j@@@. We know that any event @@@A@@@ can be split into its part in @@@B_1@@@, its part in @@@B_2@@@, etc. (like any family can be split according to age groups of its members):

$$A= \cup_{j=1}^n (A\cap B_j),$$

and since the union is disjoint, we obtain the total probability formula: \begin{equation} \label{eq:total_prob} \boxed { P(A) = \sum_{j=1}^n P(A \cap B_j) =\sum_{j=1}^n \frac{P(A\cap B_j)}{P(B_j)}P(B_j)= \sum_{j=1}^n P(A|B_j) P(B_j). } \end{equation}

Here’s an example.

Example 3.

Consider the following table (source 1, source 2):

| | Educational attainment | % Unemployment rate | % of workforce |
| --- | --- | --- | --- |
| 1 | Doctoral degree | 1.6 | 2 |
| 2 | Professional degree | 1.6 | 2 |
| 3 | Master’s degree | 2.4 | 11 |
| 4 | Bachelor’s degree | 2.7 | 24 |
| 5 | Associate’s degree | 3.6 | 11 |
| 6 | Some college, no degree | 4.4 | 16 |
| 7 | High school diploma | 5.2 | 26 |
| 8 | Less than a high school diploma | 7.4 | 8 |

Let @@@\Omega@@@ denote the entire workforce, and let @@@P@@@ be the uniform probability measure: the probability of an event is its relative frequency.

  • Let @@@A@@@ be the part of the workforce that was unemployed at the time the data for the table were collected.
  • For each numbered line @@@j@@@, let @@@B_j@@@ be the corresponding part of the workforce.

The rightmost column of row @@@j@@@ gives @@@P(B_j)@@@, and the unemployment-rate column gives @@@P(A|B_j)@@@, the unemployment rate among members of @@@B_j@@@. If we are asked to find the overall unemployment rate, @@@P(A)@@@, then we can recover it easily from the total probability formula:
\begin{align} P(A) &= \sum_{j=1}^8 P(A|B_j) P(B_j)\\
& = 0.016\times 0.02+0.016\times 0.02+0.024\times 0.11+0.027\times 0.24+0.036\times 0.11+0.044\times 0.16+0.052\times 0.26+0.074\times 0.08\\
&\approx 0.04. \end{align}
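
Here is a quick sanity check of that weighted sum in Python (the two lists simply transcribe the two numerical columns of the table):

```python
# P(A) = sum_j P(A | B_j) P(B_j), with the rates and workforce shares from the table.
unemployment_rate  = [0.016, 0.016, 0.024, 0.027, 0.036, 0.044, 0.052, 0.074]  # P(A | B_j)
share_of_workforce = [0.02,  0.02,  0.11,  0.24,  0.11,  0.16,  0.26,  0.08]   # P(B_j)

P_A = sum(r * s for r, s in zip(unemployment_rate, share_of_workforce))
print(round(P_A, 4))   # 0.0402, i.e. roughly 4%
```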

Let’s do an example where the total probability is given.

Example 4.

Earlier in this chapter we quoted the following data. The probability a household has a gun is @@@42\%@@@, the probability of a gun in rural households is @@@58\%@@@, and the probability of a gun in other households (the quoted research lists urban, but we wrote “other” to ensure we cover everything other than rural) is @@@29\%@@@.

With the given figures, what proportion of households are rural?

To solve, let @@@G@@@ be the event “a gun in household” and @@@R@@@ be the event “rural household”. By total probability formula,

$$ P(G) = P(G|R) P(R) + P(G|R^c) P(R^c).$$

Copying what we are given, we have

$$ 0.42 = 0.58 P(R) + 0.29 *(1-P(R)).$$

Thus, @@@0.13 = 0.29P(R)@@@, or @@@P(R)\approx 44.8\%@@@.

Exercise 8.

The probability I’ll get the grant I applied for is @@@20\%@@@. The probability that the grant will be given if my friend is a reviewer is @@@84\%@@@, and @@@20\%@@@ of the reviewers are my friends. What is the probability that the grant will be given if reviewed by someone who is not my friend?

Exercise 9.
  1. My average success rate in free throws is 40% (it’s a very generous estimate!).
  2. On a “good” day my success rate is 60%.
  3. On a “bad” day my success rate is 30%.

What proportion of the days are “good”?

Here’s an interesting observation which does not require the total probability formula:

Example 5.

Suppose we were not given the workforce-share column, but are given that the unemployment rate is @@@4\%@@@. What can we say about the proportion of the workforce with a high school diploma? Well, it may be as low as zero, that’s clear, but it cannot be too large. Why?

$$0.04 = P(A) \ge P(A\cap B_7) = P(A|B_7)P(B_7) = 0.052*P(B_7).$$

Therefore, @@@P(B_7) \le 0.04/0.052 = 0.77.@@@ Note that the first inequality is due to the monotonicity property of probability measures. This estimate is crude (and very far from the actual figure), but at least tells us something.

Another example for the total probability formula.

Example 6.

You’re stuck in a hall in a mine. There are three exits, and all look the same. The first leads to a trap and will get you killed. The second will take you out. The third will lead you to a maze that will take you back to the hall. Assuming you select each door with equal probability (each time), and you cannot distinguish between the exits, what is the probability you’ll be able to exit?

Let @@@A@@@ be the event that you will exit. Let @@@B_j@@@, @@@j=1,2,3@@@ be the event that the exit selected first is the @@@j@@@-th. What we’re given is the following:

  • @@@P(A|B_1)=0,P(A|B_2)=1,P(A|B_3)=?@@@
  • @@@P(B_1)=P(B_2)=P(B_3)=\frac 13@@@

Let’s resolve the question mark. If you take the third exit then you’re eventually starting afresh, so this conditional probability is actually… @@@P(A)@@@. Now use the total probability formula:

$$P(A) = P(A|B_1) P(B_1) + P(A|B_2)P(B_2) + P(A|B_3) P(B_3) = 0+ \frac 13+ \frac 13 P(A).$$

So, @@@\frac 23 P(A) = \frac 13@@@, that is @@@P(A)= 1/2@@@.

Let’s think heuristically: we’ll eventually choose one of the first two exits, and that would be the end of the game. By the given assumptions, conditioned on eventually choosing the first or the second exit, each is equally likely to be the one chosen, so the probability of exiting is @@@\frac 12@@@.
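
A quick Monte Carlo check of this answer (a sketch only; the door numbering follows the description above):

```python
import random

def exit_the_mine():
    """One play: door 1 is the trap, door 2 leads out, door 3 loops back to the hall."""
    while True:
        door = random.randint(1, 3)
        if door == 1:
            return False   # trap
        if door == 2:
            return True    # out
        # door 3: back to the hall, pick again

random.seed(0)
trials = 100_000
print(sum(exit_the_mine() for _ in range(trials)) / trials)   # ≈ 0.5
```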

Exercise 10.

Discuss the variant of the problem where we have an additional fourth exit that, like the third exit, would eventually lead back to the hall.

The last three examples focused on data given in terms of conditional probability. Let’s look at an example where conditioning greatly simplifies the calculation. It’s a little more advanced. We can prove it by induction, but we’ll use some intuition.

Example 7.

You are about to board a plane. There are @@@n\ge 2@@@ passengers. The first boards the plane but has lost their boarding pass (or phone) and, upon entering the plane, randomly picks a seat. Each of the remaining passengers sits in their designated seat, if available, or randomly picks an available seat otherwise. What is the probability that the last passenger gets to sit in their designated spot?

Let @@@B_j,~j=1,\dots,n@@@ be the event that the first passenger takes the @@@j@@@-th seat. These events are disjoint, their union is @@@\Omega@@@, and @@@P(B_j)=\frac 1n@@@. Let @@@A@@@ be the event that the @@@n@@@-th passenger got their designated seat. Knowing that the first passenger sat where they were supposed to implies everyone else sat in their designated spots. So @@@P(A |B_1)=1@@@. Also, @@@P(A |B_n)=0@@@. Make sure you understand why. Therefore, by the total probability formula, we have

$$ P(A) = \frac 1 n +\sum_{j=2}^{n-1}P(A|B_j)\frac 1n .$$

It remains to figure out @@@P(A|B_j)@@@. Observe the following:

  • When the @@@n@@@-th passenger boards, exactly one seat will be available. That seat can be the @@@n@@@-th seat or the first seat. Why? Every other seat will be taken by one of the previous passengers, either because it was available when they were boarding or because someone else took it.
  • Let’s condition on @@@B_j@@@, and take a look at what happens from the time the @@@j@@@-th passenger boards. Conditioned on @@@B_j@@@, all passengers @@@j,j+1,\dots,n-1@@@ treat the @@@n@@@-th and the @@@1@@@-st seat the same way: neither is theirs, so they will be equally likely to choose either one if their own seat is taken. Why? Every outcome in @@@A\cap B_j@@@ has a unique corresponding outcome in @@@A^c \cap B_j@@@, simply by exchanging the labels between the @@@1@@@-st and @@@n@@@-th seat. Therefore, @@@P(A^c|B_j) = P(A|B_j)@@@. Since the union of the two events @@@A@@@ and @@@A^c@@@ is the entire sample space, we have that @@@P(A|B_j) = \frac 12@@@. Now plug this into our formula to obtain
$$ P(A) = \frac 1n + \frac 12 \frac{n-2}{n}=\frac 12.$$

Quite surprising, right?

I’d like to comment that this solution is not the simplest or most general one, and was given for educational purposes. When the @@@k@@@-th passenger boards, with @@@k\ge 2@@@, the remaining @@@n-k+1@@@ free seats always form a subset of the set of @@@n-k+2@@@ seats consisting of the first passenger’s seat together with the designated seats of the remaining @@@n-k+1@@@ passengers (from the @@@k@@@-th through the last). Each of the @@@\binom{n-k+2}{n-k+1}=n-k+2@@@ possible such subsets is equally likely, since none of the previous passengers distinguished among these seats, and each subset is identified by the unique seat among the @@@n-k+2@@@ that is not in it. Therefore, the probability that the seat of the @@@k@@@-th passenger is not yet taken is @@@1-\frac{1}{n-k+2}@@@: the probability of all subsets except the single one missing the @@@k@@@-th passenger’s seat.
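
If you don’t believe the answer @@@\frac 12@@@, here is a small simulation sketch of the boarding process (the variable names are mine):

```python
import random

def last_gets_own_seat(n):
    """One run of the boarding process; True iff passenger n ends up in seat n."""
    free = list(range(1, n + 1))
    free.remove(random.choice(free))          # passenger 1 picks a uniformly random seat
    for k in range(2, n):                     # passengers 2, ..., n-1
        if k in free:
            free.remove(k)                    # own seat still available
        else:
            free.remove(random.choice(free))  # otherwise pick an available seat at random
    return free == [n]                        # only seat 1 or seat n can remain at the end

random.seed(1)
n, trials = 10, 50_000
print(sum(last_gets_own_seat(n) for _ in range(trials)) / trials)   # ≈ 0.5
```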

I want to close this section with an interesting phenomenon known as Simpson’s paradox, which shows how summarizing data can completely distort the picture.

Although the figures in the example are made up, the inspiration is from a famous case of gender bias in graduate admissions to UC Berkeley in the nineteen seventies, and you can find actual figures in the above Wikipedia link, and other instances in a publication by the Brookings Institution. Of course, there’s also a cool TedEd video: https://www.youtube.com/watch?v=sxYrzzy3cq8&vl=en

Example 9.

The setup is the following. Suppose we are looking at admission rates to graduate school of women and men. Our sample space is the set of applicants. Write @@@W@@@ for the event “women” (all women in the pool), @@@M@@@ for the event “men”, and @@@A@@@ for admitted. We are given the admission rates, which are… conditional probabilities: @@@P(A|W)=0.34@@@ and @@@P(A|M)=0.45@@@. Seems like there’s some sort of bias, right?

Now condition further according to departments applied to. For simplicity, assume that there are just two departments, which we will call @@@I@@@ and @@@II@@@. We will assume that each applicant applies to exactly one department. Let’s assume we also know that the admission rates for each department are the following.

| Department | % Women admitted | % Men admitted |
| --- | --- | --- |
| I | 30 | 20 |
| II | 60 | 50 |

Note that for each department the admission rates for women are higher than those for men, yet still, the overall admission rates for women are lower than those for men. How is that even possible? It is, and this is what we call Simpson’s paradox. Let’s write down the entries in the table in terms of conditional probabilities. Let @@@B_I@@@ be the event “applied to department I” and @@@B_{II}@@@ the event “applied to department II”. By assumption, @@@B_{II}=B_I^c@@@. The numbers in the first row are (from left to right) @@@P(A| W \cap B_I)=0.3@@@ (the rate of admitted students among women who applied to department I) and @@@P(A|M \cap B_I)=0.2@@@ (the rate of admitted students among men who applied to department I). By the total probability formula (this is a slight variation, can you notice the difference?)

$$ P(A\cap W) = P(A|W \cap B_I) P(B_I\cap W) + P(A | W \cap B_{II}) P(B_{II} \cap W).$$

Dividing by @@@P(W)@@@ we discover

$$ P(A|W) = P(A|W \cap B_I) P(B_I| W) + P(A | W \cap B_{II}) P(B_{II} | W),$$

a conditional version of the total probability formula, where @@@P(B_I|W)@@@ is the rate of women who applied to department @@@I@@@. This is a general formula, but with the figures we are given, the following equation needs to be satisfied:

$$ 0.34= 0.3* P(B_I|W) + 0.6 * P(B_{II}|W).$$

Since @@@B_I^c = B_{II}@@@, it follows that @@@P(B_{II}|W)=1-P(B_I|W)@@@. Therefore we have one unknown here to solve for. After doing some algebra, we obtain @@@0.3 P(B_I | W) = 0.26@@@, or @@@P(B_I | W) = 13/15\approx 86.7\%@@@.

Repeating the same for @@@M@@@, we obtain

$$ P(A|M)= P(A|M \cap B_I) P(B_I| M) + P(A | M \cap B_{II}) P(B_{II} | M),$$

and with the data given:

$$ 0.45 = 0.2 *P(B_I|M) + 0.5 *P(B_{II}|M).$$

Again, the only unknown is @@@P(B_I|M)@@@, and after doing the algebra we obtain @@@0.3* P(B_I|M) = 0.05@@@. That is @@@P(B_I|M) = 1/6=16.66..\%@@@.

In summary, although the admission rate was higher for women than for men in every department, the overall admission rate for women was lower than for men. The explanation is simple (a short numerical check follows the list below):

  • It is (much) harder to get into department I, just compare second row to first; and
  • roughly @@@87\%@@@ of women applied to department I, while only @@@17\%@@@ of men applied to department I.
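
To see the paradox numerically, we can recombine the per-department rates from the table with the application proportions we just recovered; a small sketch:

```python
# Per-department admission rates (from the table) and recovered application proportions.
rate_W  = {"I": 0.30, "II": 0.60}          # P(A | W and B_dept)
rate_M  = {"I": 0.20, "II": 0.50}          # P(A | M and B_dept)
apply_W = {"I": 13 / 15, "II": 2 / 15}     # P(B_dept | W)
apply_M = {"I": 1 / 6,  "II": 5 / 6}       # P(B_dept | M)

overall_W = sum(rate_W[d] * apply_W[d] for d in ("I", "II"))
overall_M = sum(rate_M[d] * apply_M[d] for d in ("I", "II"))
print(round(overall_W, 2), round(overall_M, 2))   # 0.34 0.45: lower overall for women,
                                                  # despite higher rates in each department
```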

Iterated Conditioning

When playing the game of Clue, we are essentially trying to compute iterated conditional probabilities. Here’s a scenario. I’m trying to figure out where I left my keys yesterday. They could be in any of the following:

  • My bedroom.
  • My car.
  • A coffee shop I worked at.
  • My dentist’s office (I went there after losing a crown eating some toffee at the coffee shop).

Of course, I’ll treat that as a probabilistic model. Without any additional information, the probability I left them in any of the four locations is @@@1/4@@@, that is @@@P@@@ is uniform on @@@\Omega= \{\mbox{bedroom,car, coffee, dentist's}\}@@@. I checked my bedroom thoroughly, and know that the keys are not there. Therefore I need to adjust my probability measure to the corresponding probability measure, @@@P_1 = P( \cdot | B^c)@@@, where @@@B= \{\mbox{bedroom}\}@@@, because I know that the keys are not in the bedroom. Under @@@P_1@@@, the probability my keys are in the car, the coffee shop or the dentist’s is @@@\frac{1}{3}@@@ each. Now I continue to my car, and check there. The keys are not there either. So… I have to adjust the probability measure accordingly, letting @@@P_2 = P_1 (\cdot | C^c)@@@, where @@@C=\{\mbox{car}\}@@@. Under @@@P_2@@@, the probability my keys are in the coffee shop or the dentist’s is @@@\frac 12@@@ each. I can define @@@P_3=P_2 (\cdot | D^c)@@@ where @@@D=\{\mbox{coffee}\}@@@. The resulting probability measure is a delta measure on “dentist”: all probability goes to a single element. What we’ll discuss here is the connection between these repeated conditional probability measures. As you can guess, @@@P_1@@@ is uniform on @@@B^c=\{\mbox{car,coffee, dentist}\}@@@ and @@@P_2@@@ is uniform on @@@B^c \cap C^c =\{\mbox{coffee,dentist}\}@@@. Of course, what we just described is exactly what pros do when new reliable data pours in: they keep adjusting the probability measure, reducing the uncertainty. It’s sort of like playing 20 questions. We’re here to discuss the mathematical structure associated with this procedure.

Suppose that @@@P@@@ is a probability measure and @@@B_1,B_2,\dots,B_n@@@ are events, with @@@ P(\cap_{i=1}^n B_i) >0@@@. Then we can define a sequence of conditional probability measures as follows:

\begin{equation} \label{eq:repeated_conditioning} P_0 =P, P_{j+1} (A) = P_j ( A | B_{j+1}). \end{equation}

That is

$$ P_1 (A) = P(A | B_1), P_2(A ) = P_1 (A|B_2),\dots, P_n (A) = P_{n-1} (A| B_n).$$

Do you see now why it was important to identify conditional probability as a new probability measure on the same probability space?

How do we express all of these new probability measures just in terms of @@@P@@@? That would save a lot of hassle. Well, the answer is pretty simple. Let’s see.

\begin{equation} \label{eq:P1} P_1 (A) = \frac{ P( A \cap B_1 ) }{P(B_1)}, \end{equation}

and going one step further, just to see how things are going, we have

$$ P_2 (A) = P_1 ( A |B_2) = \frac{ P_1(A \cap B_2 ) }{P_1(B_2)} =\frac{ P( A \cap B_2 \cap B_1 ) }{P(B_1)} \times \frac{P(B_1)}{P(B_2 \cap B_1)},$$

Note that the last equality is obtained because we’re using \eqref{eq:P1} both for the numerator and the denominator for the expression in the middle. Therefore,

$$ P_2(A) = \frac{P(A \cap B_1 \cap B_2)}{ P(B_1 \cap B_2)}= P( A| B_1 \cap B_2).$$

I guess you already know where this is heading, right?

Proposition 3.

Let @@@P@@@ be a probability measure and @@@B_1,B_2,\dots,B_n@@@ be events satisfying @@@P( \cap_{i=1}^n B_i)>0@@@, and define @@@P_0,P_1,P_2,\dots,P_n@@@ through \eqref{eq:repeated_conditioning}. Then

  1. For @@@j=1,\dots,n@@@, @@@\displaystyle P_j (A) = P( A | \cap_{i=1}^j B_i)@@@, and
  2. \begin{align} P( \cap_{i=1}^n B_i) &= \prod_{j=0}^{n-1} P_j (B_{j+1})\\ & = P( B_1) P( B_2 |B_1) P(B_3 |B_2 \cap B_1) \cdots P( B_n | \cap_{i=1}^{n-1} B_i). \end{align}

Before we give a proof, we’d like to see how simple this seemingly scary formula actually is. Since @@@B_1 \cap B_2 \subseteq B_1@@@, we know that the probability of @@@B_1 \cap B_2@@@ is less than or equal to the probability of @@@B_1@@@. Therefore @@@P(B_1 \cap B_2)@@@ can be written as @@@P(B_1)\times c@@@, where @@@c@@@ is a number between @@@0@@@ and @@@1@@@. But what is that number? Look again. It’s, by definition, @@@P(B_2|B_1)@@@. Now repeat: we know that @@@P(B_1\cap B_2\cap B_3) = c P(B_1 \cap B_2)@@@, and again, by definition, @@@c= P(B_3 | B_1 \cap B_2)@@@. You can see where this goes, right?
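
As a tiny numerical illustration of the product formula (my own example, assuming a standard 52-card deck): the probability that the first three cards dealt are all hearts can be computed either by the chain rule or by direct counting.

```python
from fractions import Fraction

# Chain rule: P(B1 ∩ B2 ∩ B3) = P(B1) P(B2 | B1) P(B3 | B1 ∩ B2),
# where B_k = "the k-th card dealt is a heart".
chain = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)
print(chain)                                     # 11/850

# Direct count over ordered triples of distinct cards:
print(Fraction(13 * 12 * 11, 52 * 51 * 50))      # 11/850 again
```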

Proof.

We first prove the first statement by induction on @@@j@@@. The base case @@@j=1@@@ is \eqref{eq:P1}. As for the induction step, assume @@@P_j@@@ is of the given form; then
\begin{align}
P_{j+1} (A) &= P_j (A | B_{j+1})\\
& = \frac{ P_j( A \cap B_{j+1}) }{P_j (B_{j+1})}\\
& \overset{\mbox{induction}}{=}\frac{ P( A \cap B_{j+1} | \cap_{i=1}^j B_i) } {P(B_{j+1}| \cap_{i=1}^j B_i)}\\
& = \frac{ P( A \cap B_{j+1} \cap \bigcap_{i=1}^j B_i) }{P( \cap_{i=1}^j B_i)}\times \frac{P(\cap_{i=1}^j B_i)}{P(B_{j+1} \cap \bigcap_{i=1}^j B_i)}\\
& =\frac{ P( A \cap \bigcap_{i=1}^{j+1} B_i ) }{ P( \cap_{i=1}^{j+1} B_i)}\\
& = P( A | \bigcap_{i=1}^{j+1} B_i).
\end{align}

We turn to the second statement. From the first statement we have that for all @@@j=1,\dots,n-1@@@,

$$ P_j (B_{j+1}) =P (B_{j+1} | \cap_{i=1}^{j} B_i ) = \frac{ P( \cap_{i=1}^{j+1} B_i)}{P(\cap_{i=1}^j B_i)}.$$

Therefore, when taking the product over @@@j=1,\dots,n-1@@@, all terms cancel except for @@@ P( \cap_{i=1}^{n} B_i) / P(B_1)@@@. Multiply this by @@@P(B_1)@@@ and the result follows.

Bayes’ Formula

Bayes’ formula is a simple rule that allows us to reverse conditional probability: swapping between the event we condition on and the event whose (conditional) probability we wish to compute. Some conditional probabilities appear more naturally or are given to us, while we actually need the reverse. Let’s assume every pregnant person is a woman, that is @@@P(\mbox{Woman}|\mbox{Pregnant})=1@@@. However, it would be of more interest to answer the reverse question: what proportion of women are pregnant, that is @@@P(\mbox{Pregnant}|\mbox{Woman})@@@? Bayes’ formula gives the connection between these two conditional probabilities.

When designing an HIV test, the researchers are primarily interested in detecting HIV positives (simply put: not missing any real patients) or, in the language of probability, maximizing the probability that the test is positive, conditioned on the subject being HIV positive, @@@ P( \mbox{Test positive} |\mbox{HIV positive})@@@. However, individual subjects are usually more self-centered, and are interested in the probability of being HIV positive, conditioned on a positive test result, @@@P(\mbox{HIV positive}|\mbox{Test positive})@@@ (also, if you’re like me, the probability of being HIV positive conditioned on a negative test result). The two conditional probabilities are not even in the same order of magnitude! Let’s do a very quick run through the theory before getting into numerical examples.

The basic form

Recall that if @@@A@@@ and @@@B@@@ have positive probability, then

$$P(A|B) = \frac{P(A\cap B)}{P(B)}.$$

Similarly, \begin{equation} \label{eq:bayes1} \boxed { P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A|B)P(B)}{P(A)}. } \end{equation}

This equation is a special case of Bayes’ formula. The next example illustrates the notion of a false positive.

Example 10.

Police use breathalyzers which always display a drunkenness reading if the driver is drunk, and also display one in 5% of the cases when the driver is not drunk. Roughly one in 1000 drivers is drunk. Suppose a random driver takes the test and the test displays a drunkenness reading. What is the probability that the driver is drunk (that is, that the “drunk” reading, AKA “positive” test result, is not false)?

Let @@@B@@@ be the event that the driver selected is drunk, and let @@@A@@@ be the event that the test displays a drunkenness reading. We are asked for @@@P(B|A)@@@. We were given three pieces of information: @@@P(A|B)=1@@@, @@@P(A|B^c)=0.05@@@, and @@@P(B)=0.001@@@. By Bayes’ formula,
\begin{align} P(B|A) &= P(A|B) \frac{P(B)}{P(A)} = 1 \times \frac{P(B)}{P(A|B) P(B) + P(A|B^c) P(B^c)}\\
& = \frac{P(B)}{P(B) + 0.05(1-P(B))}\\
&=\frac{P(B)}{0.95 P(B) + 0.05}\\
&=\frac{0.001}{0.001\times 0.95+0.05}= \frac{1}{50.95}<2\%. \end{align}
@@@P(B|A)@@@ is more than fifty times smaller than @@@P(A|B)@@@. This is quite surprising, right?

Let’s forget about the math for the moment, and try to discuss it heuristically, choosing a “perfect sample” of @@@1000@@@ drivers. By perfect, I mean we’re going to take the proportions literally, namely that exactly one among the @@@1000@@@ is drunk, and exactly @@@5\%@@@ of tests on not-drunk drivers will give a drunk reading. Under these assumptions, we will have exactly @@@1@@@ drunk reading corresponding to an actual drunk driver, and @@@0.05\times 999=49.95@@@ drunk readings coming from non-drunk drivers. Let’s round the last figure to @@@50@@@. In total, out of the @@@51@@@ drunk readings, only @@@1@@@ came from an actually drunk driver, and therefore the proportion of correct drunk readings among all drunk readings is @@@1/51@@@, which is less than @@@2\%@@@.

The test is not accurate enough to compensate for the fact that the drunk drivers are rare.
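
The same computation takes a few lines of Python, if you want to play with the figures (the variable names are mine):

```python
# Bayes' formula for the breathalyzer example, with the figures given above.
p_drunk = 0.001              # P(B)
p_pos_given_drunk = 1.0      # P(A | B)
p_pos_given_sober = 0.05     # P(A | B^c)

p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)  # total probability
print(round(p_pos_given_drunk * p_drunk / p_pos, 4))   # Bayes: ≈ 0.0196, i.e. about 1/51
```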

False positives are very common. https://www.youtube.com/watch?v=1csFTDXXULY Here’s a more serious application https://www.youtube.com/watch?v=M8xlOm2wPAA

Exercise 11.

Women are @@@47\%@@@ of the workforce, and @@@15\%@@@ of the work force have advanced degrees. Among those in the work force with advanced degrees, @@@36\%@@@ are women. What proportion of women in the workforce have advanced degrees? (data based on real figures from 2016)

Exercise 12.
  • The probability team A wins a best-out-of-7 series is 50%.
  • The probability team A wins the first two games in the series is 25%.
  • The probability team A wins the series assuming it won the first two games is @@@13/16@@@.

Assuming team A won the series, what is the probability it won the first two games?

The general form

In the last section, we presented the first and simplest version of Bayes’ formula, but in many cases the given information involves conditioning on the events of a partition into more than two pieces.

Suppose now that @@@B_1,\dots,B_n@@@ are disjoint and that their union is @@@\Omega@@@. Clearly,

$$P(B_1|A) = P(A|B_1)\frac{P(B_1)}{P(A)},$$

By the total probability formula \eqref{eq:total_prob}

$$P(A) = \sum_{j=1}^n P(A \cap B_j) = \sum_{j=1}^n P(A |B_j) P(B_j).$$

Therefore we arrive at the following:

\begin{equation} \label{eq:gen_bayes} \boxed {
P(B_1|A) = \frac{P(A|B_1) P(B_1)}{\sum_{j=1}^n P(A|B_j) P(B_j)}. } \end{equation}

Before the example, I’d like to stress that Bayes’ formula \eqref{eq:bayes1} is merely the special case of \eqref{eq:gen_bayes} with @@@n=2@@@, @@@B_1=B@@@ and @@@B_2 = B^c@@@. Also, in order to apply it, it is easier to work in two stages: first calculate the denominator, which is simply @@@P(A)@@@, and only then plug this into the formula.

Example 11.

The IRS audits (I made up all the figures):

  • @@@20\%@@@ of returns above @@@\$500K@@@;
  • @@@15\%@@@ of returns between @@@\$100K@@@ and @@@\$499K@@@; and
  • @@@10\%@@@ of returns below @@@\$100K@@@.

About @@@5\%@@@ of returns are at the first level and @@@60\%@@@ of returns are at the second level. Assuming I was audited, what is the probability I reported an income below @@@\$100K@@@?

Let’s identify the events. @@@A@@@ is the event “being audited”. @@@B_1@@@ is the event “reported less than @@@\$100K@@@”, @@@B_2@@@ is the event “reported between @@@\$100K@@@ and @@@\$499K@@@”, and @@@B_3@@@ is the event “reported at least @@@\$500K@@@”. We are given the following

  • @@@P(A|B_3) = 0.2,~P(A|B_2) = 0.15,P(A|B_1)=0.1@@@
  • @@@P(B_3) = 0.05,P(B_2)=0.6,~P(B_1)=?.@@@ We are asked to find @@@P(B_1|A)@@@, the proportion of returns at the lowest level among all returns audited.

Clearly, @@@B_1,B_2@@@ and @@@B_3@@@ are disjoint and their union is the entire sample space (all levels of income). Therefore

$$ 1= P(B_3)+P(B_2)+P(B_1),$$

giving us

$$P(B_1)= 1-0.05-0.6 =0.35$$

This settles the question mark. We have all we need to apply Bayes’ formula \eqref{eq:gen_bayes}, but let’s do it in two steps, first identifying the denominator. \begin{align} P(A) &= P(A|B_3)P(B_3) +P(A|B_2)P(B_2)+ P(A|B_1)P(B_1)\\ & = 0.2\times 0.05+ 0.15\times 0.6+0.1\times 0.35 =0.135. \end{align} Therefore,

$$ P(B_1|A) = \frac{0.1*0.35}{0.135} =0.26.$$

In other words, about @@@26\%@@@ of audits are for the lowest income level. This is much more than the proportion of audits among the low income level, @@@P(A|B_1)@@@, which was @@@10\%@@@.

If we’d like to find, say, what proportion of audits are on the highest income level, then we already have all the figures:

$$ P(B_3|A) = P(A|B_3)\frac{P(B_3)}{P(A)} = \frac{0.2 * 0.05}{0.135}=0.07,$$

less than the actual proportion of audits among the high income level. Finally, if we want to find @@@P(B_2|A)@@@, we do NOT repeat the computation, because conditional probability is a probability in its own right, remember? As the union of the disjoint events @@@B_1,B_2,B_3@@@ is the entire sample space, we have

$$ 1= P(B_1|A) + P(B_2|A) +P(B_3|A),$$

therefore @@@P(B_2|A) = 1- P(B_1|A) - P(B_3|A)=0.67@@@.
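
Here is the whole audit example in a short Python sketch using the general form of Bayes’ formula (the bracket labels are mine):

```python
# General Bayes' formula: P(B_j | A) = P(A | B_j) P(B_j) / sum_k P(A | B_k) P(B_k).
p_audit_given = {"low": 0.10, "mid": 0.15, "high": 0.20}   # P(A | B_j)
p_bracket     = {"low": 0.35, "mid": 0.60, "high": 0.05}   # P(B_j)

p_audit = sum(p_audit_given[j] * p_bracket[j] for j in p_bracket)   # denominator P(A) = 0.135
posterior = {j: p_audit_given[j] * p_bracket[j] / p_audit for j in p_bracket}
print({j: round(p, 2) for j, p in posterior.items()})   # {'low': 0.26, 'mid': 0.67, 'high': 0.07}
```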

Let’s close with yet another example. https://youtu.be/gCleWpr__uc

Problems

Problem 1.

Repeat the calculation in Example 1 now assuming that the family has three children. No need to do the third part.

Problem 2.

Let @@@P@@@ be the uniform probability on the integers from @@@1@@@ to @@@99@@@. Let @@@B@@@ be the subset of numbers which have the digit @@@3@@@. Let @@@A@@@ be the subset of even numbers.

  1. What is @@@P(A)@@@, @@@P(B)@@@?
  2. What is @@@P(A|B)@@@? @@@P(B|A)@@@?
Problem 3.

(Total probability for conditional probabilities)
Suppose that @@@A@@@ and @@@S@@@ are events with @@@P(S)>0@@@, and that the events @@@B_1,\dots,B_n@@@ are a partition of the sample space @@@\Omega@@@, that is @@@\cup_{i=1}^n B_i =\Omega@@@ and @@@B_i\cap B_j=\emptyset@@@ for @@@i\ne j@@@. Show the following

$$ P(A|S) = \sum_i P(A|B_i \cap S) P(B_i |S),$$

where the sum is over all @@@i\in \{1,\dots,n\}@@@ such that @@@P(B_i \cap S)>0@@@.

Problem 4.

Determine whether each of the statements is true (=always true) or false (=not always true). In the former case, show why. In the latter case give an example where the statement fails to be true.

  1. If @@@P(B)>0@@@ and @@@P(A|B) \ge P(A)@@@ for all @@@A@@@, then @@@P(B)=1@@@.
  2. If @@@P_1,P_2@@@ are two probability measures on the same sample space are such that @@@P_1(B)>0@@@ and @@@P_2(B) >0@@@ and @@@P_1(A|B)= P_2(A|B)@@@ for all @@@A@@@, then @@@P_1(B)=P_2(B)@@@.
  3. If @@@P(A)>0@@@, @@@P(B)>0@@@, then @@@P(A|B) P(B|A) \le 1@@@.
Problem 5.

Suppose that @@@P@@@ is a probability measure on a sample space @@@\Omega@@@ and that @@@B_1@@@ and @@@B_2@@@ are two events satisfying @@@P(B_1 \cap B_2)>0@@@. For any event @@@A@@@, let @@@Q_1(A) = P(A| B_1)@@@, @@@Q_2(A ) = P(A|B_2)@@@, as well as @@@Q_{1,2}(A) = Q_1(A| B_2)@@@ and @@@Q_{2,1}(A) = Q_2(A| B_1)@@@. Show the following:

  1. @@@Q_{1,2}=Q_{2,1}@@@.
  2. @@@Q_{1,2}(A)= P(A | B_1 \cap B_2)@@@
Problem 6.

The probability that a math student took a course with me is @@@0.06@@@. The probability that a student who took a course with me is a math student is @@@0.75@@@. About @@@5\%@@@ of the students are math students. What proportion of the student body did I teach?

Problem 7.

I’m tossing a dart at a round target whose radius is @@@8@@@ inches, with center at the point @@@(0,0)@@@. The area below the line @@@y=x@@@ is shaded and is not visible. The dart lands at a random position on the target. Assuming the dart landed in the shaded area (I heard it hitting the target, but cannot see it), what is the probability it landed in the right half of the target?

Problem 8.

(Monty Hall Problem) https://www.youtube.com/watch?v=4Lb-6rxZxx0 We have @@@n\ge 2@@@ identical boxes. Exactly @@@m<n-1@@@ of them have prizes in them (let’s say Amazon Gift cards), and the remaining @@@n-m@@@ (which is at least @@@2@@@) have nothing in them. I know which boxes have the prizes, and I’m asking you to select a box.

  1. What is the probability you’ll select a box with a prize?
  2. I’m then opening a box which is not the one you chose and which has no prize. What is the probability you selected a box with a prize now?
  3. After I showed you the empty box, I allow you to select another box. What is the probability of getting a prize if you select another box?
Problem 9: (Inspiration).

(https://math.stackexchange.com/questions/2780138/there-are-4-cups-of-liquid-three-are-water-and-one-is-poison-if-you-were-to-dr) There are @@@6@@@ glasses of water. @@@3@@@ are poisoned.

  1. What is the probability that if you randomly select two cups at least one will be poisoned?
  2. I drink one cup. If I’m not poisoned (duh!), I drink another cup. Assuming I drank two cups, what is the probability I was not poisoned?
  3. Follow the same rule as part 2. Assuming I was not poisoned, what is the probability I drank two cups?
Problem 10.

This problem is from an academic publication on teaching probability: Chad R Bhatti and Jennifer L Wightman, Conditional Probability and HIV Testing, The American Statistician 62:3, 238-241. Consider the following table, describing test results from a controlled group of 2000 individuals, 1000 of which are known to be HIV positive and 1000 of which are known to be HIV negative. From this table one can compute the conditional probabilities @@@P(T^{\pm}|HIV^{\pm})@@@, where @@@T^+@@@ and @@@T^-@@@ are the events “Test positive” and “Test negative” respectively, and @@@HIV^+@@@ and @@@HIV^-@@@ are the events “HIV positive” and “HIV negative”, respectively. Note that the table tells us about the test, but nothing about the general population.

| Result | Disease | No Disease | Total |
| --- | --- | --- | --- |
| Positive | 990 | 15 | 1005 |
| Negative | 10 | 985 | 995 |
| Total | 1000 | 1000 | 2000 |

  1. Use the table to find the probability of a “true positive” @@@P(T^+| HIV^+)@@@ and a “false positive” @@@P(T^+|HIV^-)@@@.
  2. Use the table to find the probability of a “true negative” and a “false negative”.
  3. Data cited in the paper estimates the proportion of HIV-positive in the adult population of East Asia at around 0.1%. Using this, find the (conditional) probability that an individual in East Asia tested positive is actually HIV-positive, that is @@@P(HIV^+|T^+)@@@. How does this compare with your answer to the first part?
  4. The data for Sub-Saharan Africa lists a figure of 5.9% of HIV-positive among the adult population. Repeat the last problem with this figure.
Problem 11.

(source) A detective story. @@@85\%@@@ of accidents involve cars and @@@15\%@@@ of accidents involve trucks. I witnessed an accident at night. It was blurry… Tests done in similar lighting showed that I identify vehicles (trucks/cars) correctly in @@@80\%@@@ of the cases.

Assuming I reported seeing a truck, what is the probability it was a truck involved in the accident? Before answering it using the methods we studied, try to ballpark. Are you surprised?

Problem 12.

The probability I win a card game is @@@75\%@@@. The probability I played at home is 3 times larger when I win than when I lose. What is the probability of winning when I play at home?

Problem 13.

(From Nate Silver’s book The Signal and the Noise (p. 247-248), Source)

Consider a somber example: the September 11 attacks. Most of us would have assigned almost no probability to terrorists crashing planes into buildings in Manhattan when we woke up that morning. But we recognized that a terror attack was an obvious possibility once the first plane hit the World Trade Center. And we had no doubt we were being attacked once the second tower was hit. Bayes’s theorem can replicate this result.

The probability that terrorists would crash a plane into a Manhattan skyscraper is @@@1/20000@@@. The probability a plane will crash into a Manhattan skyscraper if terrorists are not attacking (an accident) is @@@1/12500@@@, and the probability a plane will crash into a Manhattan skyscraper if terrorists are attacking is @@@1@@@.

  1. If you see a plane crashing into a Manhattan skyscraper, what is the probability that it is a terrorist attack?
  2. Now let’s make it a little harder. Suppose we see two planes crashing. We will assume that the probability of two planes crashing into Manhattan skyscrapers if terrorists are not attacking is @@@(1/12500)^2@@@ (in terms of our next topic, independence, the two crashes are conditionally independent: if terrorists are not attacking, crashes are unrelated). The probability of two planes crashing if terrorists are attacking is still @@@1@@@. If two planes crashed into Manhattan skyscrapers, what is the probability that it is a terrorist attack?
Problem 14.

Each of the questions in a certain multiple-choice exam has four answers, exactly one of which is correct. The probability that a test taker knew the answer to a question they answered correctly is @@@\frac 12@@@. What proportion of the questions do test takers answer correctly? Assume that a test taker always answers correctly a question they know, and randomly picks an answer to a question they do not know.