conditional_probability.qmd

# [Conditional probability and learning]{.green} {#sec-learning}
{{< include macros.qmd >}}
{{< include macros_marg_cond.qmd >}}

## The meaning of the term "conditional probability" {#sec-conditional-probs}

When we introduced the notion of degree of belief -- a.k.a. probability -- in [chapter @sec-probability], we emphasized that *every probability is conditional on some state of knowledge or information*. So the term "conditional probability" sounds like a [pleonasm](https://dictionary.cambridge.org/dictionary/english/pleonasm), just like saying "round circle".

This term must be understood in a way analogous to "marginal probability": it applies in situations where we have two or more sentences of interest. We speak of a "conditional probability" when we want to emphasize that additional sentences appear in the conditional (right side of "$\|$") of that probability. For instance, in a scenario with these two probabilities:

$$
\P(\se{A} \| \se{\yellow B} \and \yI)
\qquad
\P(\se{A} \| \yI)
$$

we call the first [**conditional probability**]{.blue} of $\se{A}$ ([**given**]{.blue} $\se{\yellow B}$) to emphasize or point out that its conditional includes the additional sentence $\se{\yellow B}$, whereas the conditional of the second probability doesn't include this sentence.


## The relation between *learning* and conditional probability {#sec-conditional-prob_learning}

Why do we need to emphasize that a particular degree of belief is conditional on an additional sentence? Because the additional sentence usually represents *new information that the agent has learned*.

Remember that the conditional of a probability usually contains all factual information known to the agent^[Exceptions are, for instance, when the agent does *counterfactual* or *hypothetical* reasoning, as we discussed in [§@sec-inference-scenarios].]. Therefore if an agent acquires new data or a new piece of information expressed by a sentence $\yellow\se{D}$, it should draw inferences and make decisions using probabilities that include $\yellow\se{D}$ in their conditional. In other words, the agent before was drawing inferences and making decisions using some probabilities

$$
\P(\dotso \| \yK)
$$

where $\se{K}$ is the agent's knowledge until then. Now that the agent has acquired information or data $\yellow\se{D}$, it will draw inferences and make decisions using probabilities

$$
\P(\dotso \| \se{\yellow D} \and \yK)
$$

Vice versa, if we see that an agent is calculating new probabilities conditional on an additional sentence $\yellow\se{D}$, then it means^[But keep again in mind exceptions like counterfactual reasoning; see the previous side note.] that the agent has acquired that information or data $\yellow\se{D}$.

Therefore [**conditional probabilities represent an agent's learning**]{.blue} and [**should be used when an agent has learned something**]{.blue}.

This learning can be of many different kinds. Let's examine two particular kinds by means of some examples.

\

## Learning about a quantity from a *different* quantity {#sec-conditional-joint-dis}

Consider once more the next-patient arrival scenario of [§@sec-repr-joint-prob], with joint quantity $(U,T)$ and an agent's joint probability distribution as in [table @tbl-urgent-arrival], reproduced here:


+:------------------------------:+:---------------:+:---------------:+:---------------:+:---------------:+
|$\P(U\mo u \and T\mo t\|\yi[H])$|                 |**transportation at arrival** $T$                    |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
|                                |                 |ambulance        |helicopter       |other            |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
|**urgency** $U$                 |urgent           |0.11             |0.04             |0.03             |
+                                +-----------------+-----------------+-----------------+-----------------+
|                                |non-urgent       |0.17             |0.01             |0.64             |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
: Joint probability distribution for transportation and urgency {.sm}

Suppose that the agent must forecast whether the next patient will require $\urge$ or $\nonu$ care, so it needs to calculate the probability distribution for $U$ (that is, the probabilities for $U\mo\urge$ and $U\mo\nonu$). 

In the first exercise of [§@sec-marginal-probs] you found that the marginal probability that the next patient will need urgent care is

$$\P(U\mo\urge \| \yi[H]) = 18\%$$

this is the agent's degree of belief if it has nothing more and nothing less than the knowledge encoded in the sentence $\yi[H]$.

But now let's imagine that the agent *receives a new piece of information*: it is told that the next patient is being transported by helicopter. In other words, **the agent has learned that the sentence\ \ $T\mo\heli$\ \ is true**. The agent's complete knowledge is therefore now encoded in the `and`ed sentence

$$T\mo\heli\ \land\ \yi[H]$$

and this composite sentence should appear in the conditional. The agent's belief that the next patient requires urgent care, given the new information, is therefore

$$\P(U\mo\urge \| T\mo\heli \and \yi[H])$$

Calculation of this probability can be done by just one application of the `and`-rule, leading to a formula connected with Bayes's theorem ([§@sec-bayes-theorem]):

:::{.column-page-inset-right}
$$
\begin{aligned}
&\P(U\mo\urge \and T\mo\heli \| \yi[H]) =
\P(U\mo\urge \| T\mo\heli \and \yi[H]) \cdot
\P(T\mo\heli \| \yi[H])
\\[3ex]
&\quad\implies\quad
\P(U\mo\urge \| T\mo\heli \and \yi[H])
=
\frac{
\P(U\mo\urge \and T\mo\heli \| \yi[H])
}{
\P(T\mo\heli \| \yi[H])
}
\end{aligned}
$$
:::

Let's see how to calculate this. The agent already has the joint probability for $U\mo\urge \land T\mo\heli$ that appears in the numerator of the fraction above. The probability in the denominator is just a marginal probability for $T$, and we know how to calculate that too from [§@sec-marginal-probs]. So we find

$$
\P(U\mo\urge \| T\mo\heli \and \yi[H])
=\frac{
\P(U\mo\urge \and T\mo\heli \| \yi[H])
}{
\sum_u\P(U\mo u \and T\mo\heli \| \yi[H])
}
$$

where it's understood that the sum index $u$ runs over the values $\set{\urge, \nonu}$.

This is called a [**conditional probability**]{.blue}; in this case, the conditional probability of\ \ $U\mo\urge$\ \ [**given**]{.blue}\ \ $T\mo\heli$.

The collection of probabilities for all possible values of the quantity $U$, given a *specific* value of the quantity $T$, say $\heli$:

$$
\P(U\mo\urge \| T\mo\heli \and \yi[H]) \ ,
\qquad
\P(U\mo\nonu \| T\mo\heli \and \yi[H])
$$

is called the [**conditional probability distribution**]{.blue} for $U$\ \ [**given**]{.blue}\ \ $T\mo\heli$. It is indeed a probability distribution because the two probabilities sum up to 1.

:::{.callout-important}
## {{< fa exclamation-triangle >}}
Note that the collection of probabilities for, say, $U\mo\urge$, but for *different* values of the conditional quantity $T$, that is:

$$
\begin{aligned}
&\P(U\mo\urge \| T\mo\ambu \and \yi[H]) \ ,
\\[1ex]
&\P(U\mo\urge \| T\mo\heli \and \yi[H]) \ ,
\\[1ex]
&\P(U\mo\urge \| T\mo\othe \and \yi[H])
\end{aligned}
$$

is **not** a probability distribution. Calculate the three probabilities above and check that in fact they do *not* sum up to 1.
:::

:::{.callout-caution}
## {{< fa user-edit >}} Exercise
- Using the values from [table @tbl-urgent-arrival] and the formula for marginal probabilities, calculate:

    + The conditional probability that the next patient needs urgent care, given that the patient is being transported by helicopter.
    
    + The conditional probability that the next patient is being transported by helicopter, given that the patient needs urgent care.

- Now discuss and find an intuitive explanation for these comparisons:

    + The two probabilities you obtained above. Are they equal? why or why not?
	
	+ The *marginal* probability that the next patient will be transported by helicopter, with the *conditional* probability that the patient will be transported by helicopter *given* that it's urgent. Are they equal? if not, which is higher, and why?
:::

\

## Learning about a quantity from instances of *similar* quantities {#sec-conditional-joint-sim}

In the previous section we examined how learning about one quantity can change an agent's degree of belief about a *different* quantity, for example knowledge about "transportation" affects beliefs about "urgency", or vice versa. The agent's learning and ensuing belief change are reflected in the value of the corresponding conditional probability.

This kind of change can also occur with "similar" quantities, that is, quantities that represent the same kind of phenomenon and have the same domain. The maths and calculations are identical to the ones we explored above, but the interpretation and application can be somewhat different.

As an example, imagine a scenario similar to the next-patient arrival above, but now consider the *next three patients* to arrive and their urgency. Define the following three quantities:

$U_1$ : urgency of the next patient\
$U_2$ : urgency of the second future patient from now\
$U_3$ : urgency of the third future patient from now\

Each of these quantities has the same domain: $\set{\urge,\nonu}$.

The joint quantity $(U_1, U_2, U_3)$ has a domain with $2^3 = 8$ possible values:

- $U_1\mo\urge \and U_2\mo\urge \and U_3\mo\urge$
- $U_1\mo\urge \and U_2\mo\urge \and U_3\mo\nonu$
- . . .
- $U_1\mo\nonu \and U_2\mo\nonu \and U_3\mo\urge$
- $U_1\mo\nonu \and U_2\mo\nonu \and U_3\mo\nonu$

\
Suppose that an agent, with background information $\yI$, has a particular joint belief distribution for the joint quantity $(U_1, U_2, U_3)$. For example consider the joint distribution implicitly given as follows:
<!-- 0 53.6  -->
<!-- 1 11.4  -->
<!-- 2 3.6  -->
<!-- 3 1.4  -->

- If $\urge$ appears in the probability 0 times out of 3:\ \ probability = $53.6\%$
- If $\urge$ appears 1 times out of 3:\ \ probability = $11.4\%$
- If $\urge$ appears 2 times out of 3:\ \ probability = $3.6\%$
- If $\urge$ appears 3 times out of 3:\ \ probability = $1.4\%$

Here are some examples of how the probability values are determined by the description above:

:::{.column-page-inset-right}
$$
\begin{aligned}
&\P(U_1\mo\urge \and U_2\mo\nonu \and U_3\mo\urge \| \yI)
= 0.036 \quad&&\text{\small($\urge$ appears twice)}
\\[1ex]
&\P(U_1\mo\nonu \and U_2\mo\urge \and U_3\mo\nonu \| \yI)
= 0.114 &&\text{\small($\urge$ appears once)}
\\[1ex]
&\P(U_1\mo\urge \and U_2\mo\urge \and U_3\mo\nonu \| \yI)
= 0.036 &&\text{\small($\urge$ appears twice)}
\\[1ex]
&\P(U_1\mo\nonu \and U_2\mo\nonu \and U_3\mo\nonu \| \yI)
= 0.536 &&\text{\small($\urge$ doesn't appear)}
\end{aligned}
$$
:::

:::{.callout-caution}
## {{< fa user-edit >}} Exercise
- Check that the joint probability distribution as defined above indeed sums up to $1$.

- Calculate the marginal probability for $U_1\mo\urge$, that is,\ \ $\P(U_1\mo\urge \|\yI)$.
<!-- 0.2 -->

- Calculate the marginal probability that the second and third patients are non-urgent cases, that is

$$\P(U_2\mo\nonu \and U_3\mo\nonu \|\yI) \ .$$
<!-- 0.65 -->
:::

From this joint probability distribution the agent can calculate, among other things, its degree of belief that the *third* patient will require urgent care, regardless of the urgency of the preceding two patients. It's the marginal probability

$$
\begin{aligned}
\P(U_3\mo\urge \| \yI)  &= 
\sum_{u_1}\sum_{u_2}
\P(U_1\mo u_1 \and U_2\mo u_2 \and U_3\mo \urge \| \yI)
\\[1ex]
&= 0.114 + 0.036 + 0.036 + 0.014
\\[1ex]
&= \boldsymbol{20.0\%}
\end{aligned}
$$

where each index $u_1$ and $u_2$ runs over the values $\set{\urge, \nonu}$. This double sum therefore involves four terms. The first term in the sum corresponds to "$U_1\mo\urge\and U_2\mo\urge\and U_3\mo\urge$" and therefore has probability $0.014$ . The second term corresponds to "$U_1\mo\urge\and U_2\mo\nonu\and U_3\mo\urge$" and therefore has probability $0.036$. And so on.

Therefore the agent, with its current knowledge, has a $20\%$ degree of belief that the third patient will require urgent care.

\

Now fast-forward in time, after *two* patients have arrived and have been taken good care of; or maybe they haven't arrived yet, but their urgency conditions have been ascertained and communicated to the agent. Suppose that *both patients were or are non-urgent cases*. The agent now knows this fact. The agent needs to forecast whether the third patient will require urgent care.

The relevant degree of belief is obviously not\ \ $\P(U_3\mo\urge \|\yI)$,\ \ calculated above, because this belief represents an agent knowing only $\yI$. Now, instead, the agent has additional information about the first two patients, encoded in this `and`ed sentence:

$$
U_1\mo\nonu \and U_2\mo\nonu
$$

The relevant degree of belief is therefore the *conditional* probability


$$
\P(U_3\mo\urge \| U_1\mo\nonu \and U_2\mo\nonu \and \yI)
$$

Which we can calculate with the same procedure as in the previous section:

$$
\begin{aligned}
&\P(U_3\mo\urge \| U_1\mo\nonu \and U_2\mo\nonu \and \yI)
\\[2ex]
&\qquad{}=
\frac{
\P(U_1\mo\nonu \and U_2\mo\nonu \and U_3\mo\urge \| \yI)
}{
\P(U_1\mo\nonu \and U_2\mo\nonu \| \yI)
}
\\[1ex]
&\qquad{}=\frac{0.114}{0.65}
\\[2ex]
&\qquad{}\approx
\boldsymbol{17.5\%}
\end{aligned}
$$

This conditional probability $17.5\%$ for $U_3\mo\nonu$ is *lower* than $20.0\%$ calculated previously, which was based only on knowledge $\yI$. **Learning about the two first patients has thus affected the agent's degree of belief about the third**.

\
Let's also check how the agent's belief changes in the case where the first two patients are both *urgent* instead. The calculation is completely analogous:

$$
\begin{aligned}
&\P(U_3\mo\urge \| U_1\mo\urge \and U_2\mo\urge \and \yI)
\\[2ex]
&\qquad{}=
\frac{
\P(U_1\mo\urge \and U_2\mo\urge \and U_3\mo\urge \| \yI)
}{
\P(U_1\mo\urge \and U_2\mo\urge \| \yI)
}
\\[1ex]
&\qquad{}=\frac{0.030}{0.107}
\\[2ex]
&\qquad{}\approx
\boldsymbol{28.0\%}
\end{aligned}
$$

In this case the conditional probability $28.0\%$ for $U_3\mo\urge$ is *higher* than the $20.0\%$, which was based only on knowledge $\yI$.

One possible intuitive explanation of these probability changes, *in the present scenario*, is that observation of two non-urgent cases makes the agent slightly more confident that "this is a day with few urgent cases". Whereas observation of two urgent cases makes the agent more confident that "this is a day with many urgent cases".


:::{.callout-important}
## {{< fa exclamation-triangle >}} The diversity of inference scenarios
In general we cannot say that the probability of a particular value (such as $\urge$ in the scenario above) will decrease or increase as similar or dissimilar values are observed. Nor can we say how much the increase or decrease will be.

In a different situation the probability of $\urge$ could actually **increase** as more and more $\nonu$ cases are observed. Imagine, for instance, a scenario where the agent initially knows that there are 10 urgent and 90 non-urgent cases ahead (maybe these 100 patients have already been gathered in a room). Having observed 90 non-urgent cases, the agent will give a much higher, in fact 100%, probability that the next case will be an urgent one. Can you see intuitively why this conditional degree of belief must be 100%?

The differences among scenarios are reflected in differences in joint probabilities, from which the conditional probabilities are calculated. One particular joint probability can correspond to a scenario where observation of a value *increases* the degree of belief in subsequent instances of that value. Another particular joint probability can instead correspond to a scenario where observation of a value *decreases* the degree of belief in subsequent instances of that value.


**All** these situations are, in any case, correctly handled with the four fundamental rules of inference and the formula for conditional probability derived from them!
:::


:::{.callout-caution}
## {{< fa user-edit >}} Exercises

a. Using the same joint distribution above, calculate

    $$\P(U_1\mo\urge \| U_2\mo\nonu \and U_3\mo\nonu \and \yI)$$
	
	that is, the probability that the *first* patient will require urgent care *given that the agent knows the second and third patients will not require urgent care*.
	
	- Why is the value you obtained different from\ \ $\P(U_1\mo\urge \|\yI)$ ?
	
	- Describe a scenario in which the conditional probability above makes sense, and patients 2 and 3 still arrive after patient 1. That is, a scenario where the agent learns that patients 2 and 3 are non-urgent, but still doesn't know the condition of patient 1.

\

b. Do an analysis completely analogous to the one above, but with different background information $\yJ$ corresponding to the following joint probability distribution for $(U_1, U_2, U_3)$:

    • If $\urge$ appears 0 times out of 3:\ \ probability = $0\%$\
    • If $\urge$ appears 1 times out of 3:\ \ probability = $24.5\%$\
    • If $\urge$ appears 2 times out of 3:\ \ probability = $7.8\%$\
    • If $\urge$ appears 3 times out of 3:\ \ probability = $3.1\%$

    1. Calculate

        $$\P(U_3\mo\urge \|\yJ)$$
		<!-- 43.2% -->

        and

        $$\P(U_3\mo\urge \| U_1\mo\nonu \and U_2\mo\nonu \and \yJ)$$
        <!-- 100% -->
        and compare them.

    2. Find a scenario for which this particular change in degree of belief makes sense.

<!-- ##      i1 i2 i3        p   num   den -->
<!-- ## [1,]  0  0  1 1.000000 0.245 0.245 -->
<!-- ## [2,]  0  1  1 0.241486 0.078 0.323 -->
<!-- ## [3,]  1  0  1 0.241486 0.078 0.323 -->
<!-- ## [4,]  1  1  1 0.284404 0.031 0.109 -->
:::


## Learning in the general case {#sec-conditional-joint-general}

Take the time to review the two sections above, focusing on the application and meaning of the two scenarios and calculations, and noting the similarities and differences:

- [{{< fa equals >}} The calculations were completely analogous. In particular, the conditional probability was obtained as the quotient of a joint probability and a marginal one.]{.green}

- [{{< fa not-equal >}} In the first (urgency & transportation) scenario, information about one aspect of the situation changed the agent's belief about another aspect. The two aspects were different (transportation and urgency). Whereas in the second (three-patient) scenario, information about analogous occurrences of an aspect of the situation changed the agent's belief about a further occurrence.]{.yellow}

\
A third scenario is also possible, which combines the two above. Consider the case with three patients, where each patient can require $\urge$ care or not, and can be transported by $\ambu$, $\heli$, or $\othe$ means. To describe this situation, introduce three pairs of quantities, which together form the joint quantity

$$
(U_1, T_1, \ U_2, T_2, \ U_3, T_3)
$$

whose symbols should be obvious. This joint quantity has $(2\cdot 3)^3 = 216$ possible values, corresponding to all urgency & transportation combinations for the three patients.

Given the joint probability distribution for this joint quantity, it is possible to calculate all kinds of conditional probabilities, and therefore consider all the possible ways the agent may learn new information. For instance, suppose the agent learns this:

- the first two patients have not required urgent care
- the first patient was transported by ambulance
- the second patient was transported by other means
- the third patient is arriving by ambulance

and with this learned knowledge, the agent needs to infer whether the third patient will require urgent care. The required conditional probability is

:::{.column-page-right}
$$
\begin{aligned}
&\P(U_3\mo\urge \| T_3\mo\ambu \and
U_1\mo\nonu \and T_1\mo\ambu \and
U_2\mo\nonu \and T_2\mo\othe \and
\yI)
\\[2ex]
&\qquad{}=
\frac{
\P(U_3\mo\urge \and T_3\mo\ambu \and
U_1\mo\nonu \and T_1\mo\ambu \and
U_2\mo\nonu \and T_2\mo\othe \|
\yI)
}{
\P(T_3\mo\ambu \and
U_1\mo\nonu \and T_1\mo\ambu \and
U_2\mo\nonu \and T_2\mo\othe \and
\yI)
}
\end{aligned}
$$
:::

and is calculated in a way completely analogous to the ones already seen.

\

All three kinds of inference scenarios that we have discussed occur in data science and engineering. In machine learning, the second scenario is connected to "unsupervised learning"; the third, mixed scenario to "supervised learning". As you just saw, the probability calculus "sees" all of these scenarios as analogous: information about something changes the agent's belief about something else. And the handling of all three cases is perfectly covered by the four fundamental rules of inference.

So let's write down the general formula for all these cases of learning.

Let's consider a more generic case of a joint quantity with component quantities $\green X$ and $\red Y$. Their joint probability distribution is given. Each of these two quantities could be a complicated joint quantity by itself.

The conditional probability for $\red Y\mo y$, given that the agent has learned that $\green X$ has some specific value $\green x^*$, is then

$$
\P({\red Y\mo y}\| {\green X\mo x^*}\and\yI) =
\frac{
\P({\red Y\mo y} \and {\green X\mo x^*}\|\yI)
}{
\sum_{\red \upsilon}\P({\red Y\mo \upsilon}\and {\green X\mo x^*}\|\yI)
}
$$ {#eq-conditional-joint}

where the index $\red \upsilon$ runs over all possible values in the domain of $\red Y$.

\

## Conditional probabilities as initial information {#sec-conditional-conditional}

<!-- As emphasized in [§@sec-inference-origin], probabilities are either obtained from other probabilities, or taken as given probabilities, maybe determined by symmetry requirements. This is also true when we want to calculate conditional probabilities. -->

Up to now we have calculated conditional probabilities, using the derived formula ([@eq-conditional-joint]), starting from the joint probability distribution, which we considered to be given. In some situations, however, an agent may initially possess not a joint probability distribution but  **conditional probabilities** together with **marginal probabilities**.

As an example let's consider a variation of our next-patient scenario one more time. The agent has background information $\yi[S]$ that provides the following set of probabilities:

- Two conditional probability distributions\ \ $\P(T\mo\dotso \| U\mo\dotso \and \yi[S])$ for transportation $T$ given urgency $U$, as reported in the following table:

+:------------------------------:+:---------------:+:---------------:+:---------------:+:---------------:+
|$\P(T\mo t \| U\mo u\and\yi[S])$|                 |**transportation at arrival**\ \ $T\|{}$             |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
|                                |                 |ambulance        |helicopter       |other            |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
|*given* **urgency**\ \ ${}\|U$  |urgent           |0.61             |0.22             |0.17             |
+                                +-----------------+-----------------+-----------------+-----------------+
|                                |non-urgent       |0.21             |0.01             |0.78             |
+--------------------------------+-----------------+-----------------+-----------------+-----------------+
: Probability distributions for transportation given urgency {#tbl-T-given-U .sm}

::::{.column-margin}
:::{.callout-important}
## {{< fa exclamation-triangle >}}
[This table has **two** probability distributions: on the first row, one conditional on $U\mo\urge$; on the second row, one conditional on $U\mo\nonu$.  Check that the probabilities on each row indeed sum up to 1.]{.small}
:::
::::

\

- Marginal probability distribution\ \ $\P(U\mo\dotso \| \yi[S])$ for urgency $U$:

$$
\P(U\mo\urge \| \yi[S]) = 0.18 \ ,
\quad
\P(U\mo\nonu \| \yi[S]) = 0.82
$$ {#eq-U-marg}

\

With this background information, the agent can also compute all joint probabilities simply using the `and`-rule. For instance, the joint probability for $U\mo\urge \and T\mo\heli$ is

$$
\begin{aligned}
&P(U\mo\urge \and T\mo\heli \| \yi[S])
\\[1ex]
&\quad{}= 
P(T\mo\heli \| U\mo\urge \and \yi[S]) \cdot
P(U\mo\urge \| \yi[S])
\\[1ex]
&\quad{}= 0.22 \cdot 0.18 = \boldsymbol{3.96\%}
\end{aligned}
$$

<!-- :::{.column-margin} -->
<!-- Note that the joint probabilities are slightly different compared with those from the previous background information $\yi[H]$. -->
<!-- ::: -->

And from the joint probabilities, the marginal ones for transportation $T$ can also be calculated. For instance

$$
\begin{aligned}
&P(T\mo\heli \| \yi[S])
\\[1ex]
&\quad{}= 
\sum_u P(T\mo\heli \and U\mo u \| \yi[S])
\\[1ex]
&\quad{}= 
\sum_u P(T\mo\heli \| U\mo u \and \yi[S]) \cdot
P(U\mo u \| \yi[S])
\\[1ex]
&\quad{}= 
0.22 \cdot 0.18 +
0.01 \cdot 0.82
\\[1ex]
&\quad{}= \boldsymbol{4.78\%}
\end{aligned}
$$


\
Now suppose that the agent learns that the next patient is being transported by $\heli$, and needs to forecast whether $\urge$ care will be needed. This inference is the conditional probability\ \ $\P(U\mo\urge \| T\mo\heli \and \yi[S])$, which can also be rewritten in terms of the conditional probabilities given initially:

$$
\begin{aligned}
&\P(U\mo\urge \| T\mo\heli \and \yi[H])
\\[2ex]
&\quad{}=\frac{
\P(U\mo\urge \and T\mo\heli \| \yi[H])
}{
\P(T\mo\heli \| \yi[H])
}
\\[1ex]
&\quad{}=\frac{
P(T\mo\heli \| U\mo\urge \and \yi[S]) \cdot
P(U\mo\urge \| \yi[S])
}{
\sum_u P(T\mo\heli \| U\mo u \and \yi[S]) \cdot
P(U\mo u \| \yi[S])
}
\\[1ex]
&\quad{}=\frac{0.0396}{0.0478}
\\[2ex]
&\quad{}=\boldsymbol{82.8\%}
\end{aligned}
$$

This calculation was slightly more involved than the one in [§@sec-conditional-joint-dis], because in the present case the joint probabilities were not directly available. Our calculation involved the steps\ \ $T\|U \enspace\longrightarrow\enspace T\land U \enspace\longrightarrow\enspace U\|T$ .

\
In this same scenario, note that if the agent were instead interested, say, in forecasting the transportation means knowing that the next patient requires urgent care, then the relevant degree of belief\ \ $\P(T\mo\dotso \| U\mo\urge \and \yi[S])$ would be immediately available and no calculations would be needed.


\

Let's find the general formula for this case, where the agent's background information is represented by conditional probabilities instead of joint probabilities.

Consider a joint quantity with component quantities $\green X$ and $\red Y$. The conditional probabilities\ \ $\P({\green X\mo\dotso} \| {\red Y\mo\dotso} \and \yI)$\ \ and\ \ $\P({\red Y\mo\dotso} \| \yI)$\ \ are encoded in the agent from the start.

The conditional probability for $\red Y \mo y$, given that the agent has learned that $\green X \mo x^*$, is then

$$
\P({\red Y\mo y}\| {\green X\mo x^*}\and\yI) =
\frac{
\P( {\green X\mo x^*}\| {\red Y\mo y} \and\yI) \cdot
\P( {\red Y\mo y} \|\yI)
}{
\sum_{\red \upsilon}
\P( {\green X\mo x^*}\| {\red Y\mo \upsilon} \and\yI) \cdot
\P( {\red Y\mo \upsilon} \|\yI)
}
$$ {#eq-conditional-bayes}

In the above formula we recognize [**Bayes's theorem**]{.blue} from [§@sec-bayes-theorem].

This formula is often exaggeratedly emphasized in the literature; some texts even present it as an "axiom" to be used in situations such as the present one. But we see that this formula is simply a by-product of the four fundamental rules of inference in a specific situation. An AI agent who knows the four fundamental inference rules, and doesn't know what "Bayes's theorem" is, will nevertheless arrive at this very formula.


## Conditional densities {#sec-conditional-dens}

The discussion so far about conditional probabilities extends to conditional probability *densities*, in the usual way explained in §§[-@sec-joint-prob-densities] and [-@sec-marginal-dens].

If $\green X$ and $\red Y$ are continuous quantities, the notation

$$
\p({\red Y\mo y} \| {\green X\mo x} \and \yI) = {\blue q}
$$

means that, given background information $\yI$ and given the sentence "$\green X$ has value between $\green x-\delta/2$ and $\green x+\delta/2$", the sentence "$\red Y$ has value between $\red y-\epsilon/2$ and $\red y+\epsilon/2$" has probability ${\blue q}\cdot{\red\epsilon}$, as long as $\green\delta$ and $\red\epsilon$ are small enough. Note that the small interval $\green\delta$ for $\green X$ is *not* multiplied by the density $\blue q$.

The relation between a conditional density and a joint density or a different conditional density is given by

$$
\begin{aligned}
&\p({\red Y\mo y} \| {\green X\mo x} \and \yI)
\\[1ex]
&\quad{}=
\frac{\displaystyle
\p({\red Y\mo y} \and {\green X\mo x} \| \yI)
}{\displaystyle
\int_{\red\varUpsilon}\p({\red Y\mo \upsilon} \and {\green X\mo x} \| \yI) \, \di{\red\upsilon}
}
\\[1ex]
&\quad{}=
\frac{\displaystyle
\p({\green X\mo x} \| {\red Y\mo y} \and \yI) \cdot
\p({\red Y\mo y} \| \yI)
}{\displaystyle
\int_{\red\varUpsilon} \p({\green X\mo x} \| {\red Y\mo \upsilon} \and \yI) \cdot
\p({\red Y\mo \upsilon} \| \yI)\, \di{\red\upsilon}
}
\end{aligned}
$$

where $\red\varUpsilon$ is the domain of $\red Y$.

## Graphical representation of conditional probability distributions and densities {#sec-repr-conditional}

Conditional probability distributions and densities can be plotted in all the ways discussed in chapters [-@sec-prob-joint] and [-@sec-prob-marginal]. If we have two quantities $A$ and $B$, often we want to compare the different conditional probability distributions for $A$, conditional on different values of $B$:

- $\P(A\mo\dotso \| B\mo\cat{one-value} \and \yI)$,
- $\P(A\mo\dotso \| B\mo\cat{another-value} \and \yI)$,
- $\dotsc$

and so on. This can be achieved by representing them by overlapping line plots, or side-by-side scatter plots, or similar ways.

\
In [§@sec-marginal-scatter] we saw that if we have the scatter plot for a joint probability *density*, then from its points we can often obtain a scatter plot for its marginal densities. Unfortunately no similar advantage exists for the conditional densities that can be obtained from a joint density. In theory, a conditional density for $Y$, given that a quantity $X$ has value in some small interval $\delta$ around $x$, could be obtained by only considering scatter-plot points having $X$ coordinate in a small interval between $x-\delta/2$ and $x+\delta/2$. But the number of such points is usually too small and the resulting scatter plot could be very misleading.

\

::: {.callout-warning}
## {{< fa book >}} Study reading
- §5.4 of [*Risk Assessment and Decision Analysis with Bayesian Networks*](https://hvl.instructure.com/courses/28605/modules)

- §§12.2.1, 12.3, and 12.5 of [*Artificial Intelligence*](https://hvl.instructure.com/courses/28605/modules)

- §§4.1--4.3 in [*Medical Decision Making*](https://hvl.instructure.com/courses/28605/modules)

- §§5.1--5.5 of [*Probability*](https://hvl.instructure.com/courses/28605/modules) -- yes, once more!
:::