Joint Continuous Distribution & Expectations (2)
A committee of two is randomly selected from three teachers, two students, and one parent.
Let \(X\) be the number of teachers on the committee, and \(Y\) the number of students.
Find:
- the marginal pmf of \(Y\)
- the conditional pmf of \(Y\) given that \(X = 0\)
- the correlation between \(X\) and \(Y\)
\(X\) and \(Y\) have joint probability mass function \(p(x, y)\) given by
\[ p(x, y) = \frac{\binom{3}{x} \binom{2}{y} \binom{1}{2 - x - y}}{\binom{6}{2}} \]
for \(x \in \{0, 1, 2\}\) and \(y \in \{0, 1, 2\}\) with the constraint \(1 \leq x + y \leq 2\) (the single parent can fill at most one of the two seats).
Hence, we can fill in the following table, remembering that the committee has exactly two people.
X \ Y | 0 | 1 | 2 |
---|---|---|---|
0 | \ | 2/15 ≈ 0.13 | 1/15 ≈ 0.07 |
1 | 3/15 = 0.20 | 6/15 = 0.40 | \ |
2 | 3/15 = 0.20 | \ | \ |
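If it helps to double-check the table, here is a minimal enumeration sketch (Python, standard library only; the variable names are just illustrative) that evaluates the joint pmf formula directly.

```python
# Enumerate p(x, y) = C(3, x) C(2, y) C(1, 2 - x - y) / C(6, 2) over all
# feasible (x, y) pairs for a committee of exactly two people.
from math import comb

total = comb(6, 2)  # 15 equally likely committees
p = {}
for x in range(3):            # number of teachers
    for y in range(3):        # number of students
        parents = 2 - x - y   # remaining seats filled by the single parent
        if 0 <= parents <= 1:
            prob = comb(3, x) * comb(2, y) * comb(1, parents) / total
            if prob > 0:
                p[(x, y)] = prob

print(p)  # matches the table: p[(1, 1)] = 6/15 = 0.4, p[(0, 2)] = 1/15, etc.
```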
Summing each column of the table, the marginal pmf of \(Y\) is
\[ p_Y(y) = \begin{cases} 6/15, & y = 0 \\ 8/15, & y = 1 \\ 1/15, & y = 2 \\ \end{cases} \]
The conditional pmf of \(Y\) given that \(X = 0\) follows directly from the table, since \(P(X = 0) = 3/15\):
\[ p_{Y\mid X}(y|0) = \begin{cases} 0, & y = 0 \\ 2/3, & y = 1 \\ 1/3, & y = 2 \\ \end{cases} \]
- The Correlation between \(X\) and \(Y\)
First, find the expectations of \(X\) and \(Y\):
\[ \begin{align} E(X) &= 1 \times \frac{9}{15} + 2 \times \frac{3}{15} = 1 \\ E(Y) &= 1 \times \frac{8}{15} + 2 \times \frac{1}{15} = \frac{2}{3} \approx 0.67 \\ \end{align} \]
Hence, noting that the only outcome with \(xy \neq 0\) is \((x, y) = (1, 1)\), so \(E(XY) = \frac{6}{15}\),
\[ \begin{align} Cov(X, Y) &= E(XY) - E(X)E(Y) \\ &= \frac{6}{15} - 1 \times \frac{2}{3} \\ &= -\frac{4}{15} \approx -0.27 \end{align} \]
Now, in order to find the correlation, we also need to find the standard deviation of the two random variables.
\[ \begin{align} Var(X) & = E(X^2) - \mu_X^2 = 1^2 \times \frac{9}{15} + 2^2 \times \frac{3}{15} - 1^2 = \frac{9 + 12 - 15}{15} = \frac{6}{15} \\ Var(Y) & = E(Y^2) - \mu_Y^2 = 1^2 \times \frac{8}{15} + 2^2 \times \frac{1}{15} - \frac{4}{9} = \frac{12}{15} - \frac{4}{9} = \frac{16}{45} \approx 0.36 \\ \end{align} \]
Therefore, the correlation is,
\[ Cor(X, Y) = \frac{Cov(X, Y)}{\sqrt{6/15}\sqrt{16/45}} = -\frac{1}{\sqrt{2}} \approx -0.7071 \]
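The moment calculations can be verified with a short continuation of the same idea (again a sketch; the joint pmf is typed in directly from the table above).

```python
# Verify E(X), E(Y), Cov(X, Y), and the correlation from the joint pmf table.
from math import sqrt

p = {(0, 1): 2/15, (0, 2): 1/15, (1, 0): 3/15, (1, 1): 6/15, (2, 0): 3/15}

EX = sum(x * q for (x, y), q in p.items())                 # 1.0
EY = sum(y * q for (x, y), q in p.items())                 # 2/3
EXY = sum(x * y * q for (x, y), q in p.items())            # 6/15
VarX = sum(x * x * q for (x, y), q in p.items()) - EX**2   # 6/15
VarY = sum(y * y * q for (x, y), q in p.items()) - EY**2   # 16/45

cov = EXY - EX * EY                      # -4/15
rho = cov / (sqrt(VarX) * sqrt(VarY))    # -0.7071... = -1/sqrt(2)
print(cov, rho)
```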
1 Laws of Multivariate Expectation
- \(E(c) = c\)
- \(E(cg(X, Y)) = cE(g(X, Y))\)
- \(E(g_1(X, Y) + g_2(X, Y) + \cdots + g_k(X, Y)) = E(g_1(X, Y)) + \cdots + E(g_k(X, Y))\)
- If \(X \perp Y\) then \(E(g(X)h(Y)) = E(g(X))E(h(Y))\)
Proof. All of these follow from the single-variable results by simply replacing the pmf with the joint pmf.
2 Independence
We say that \(Y_1, Y_2, \ldots, Y_n\) are pairwise independent iff
\[ p(y_i, y_j) = p(y_i)p(y_j) \qquad \text{for all $i < j$} \]
We say that \(Y_1, Y_2, \ldots, Y_n\) are totally (i.e., mutually) independent iff
\[ \begin{align} & p(y_i, y_j) = p(y_i)p(y_j) & \text{for all $i < j$} \\ & p(y_i, y_j, y_k) = p(y_i)p(y_j)p(y_k) & \text{for all $i < j < k$} \\ & \dots \\ & p(y_1, y_2, \ldots, y_n) = p(y_1)p(y_2) \ldots p(y_n) & \\ \end{align} \]
3 Multinomial Distribution
This is simply a generalisation of the binomial distribution.
Consider \(n\) independent and identical trials, on each of which there are \(k\) possible outcomes. On each trial let \(p_i\) be the probability of outcome \(i\) and let \(Y_i\) be the total number of trials with outcome \(i\) (\(i = 1, \ldots k\)).
Then \(Y_1, Y_2, \ldots , Y_k\) have a multinomial distribution with joint pmf
\[ p(y_1, y_2, \ldots, y_k) = \frac{n!}{y_1 !y_2 ! \ldots y_k !} p_1^{y_1}p_2^{y_2}\cdots p_k^{y_k}, \qquad y_i \in \{ 0, 1, \ldots, n\}, \sum_{j = 1}^k y_j = n, \]
(with \(p_i \in [0, 1]\) and \(p_1 + p_2 + \cdots + p_k = 1\)).
We write \(Y_1, Y_2, \ldots, Y_k \sim \text{Mult}(n; p_1, p_2, \ldots , p_k)\).
3.1 Multinomial Coefficients
The number of ways of partitioning \(n\) distinct objects into \(k\) distinct groups is
\[ \binom{n}{y_1 \ldots y_k} = \frac{n!}{y_1 ! y_2 ! \ldots y_k !}, \qquad y_i = \text{count in $i$-th group} \]
Trial | 1 | 2 | 3 | 4 | 5 | … | n |
---|---|---|---|---|---|---|---|
Group | 1 | 3 | 5 | 1 | 1 | … | 2 |

Table 1: An example sequence assigning each trial to a group.
Given the sequence in Table 1, we can state the probability as \[ \begin{align} P(\text{sequence}) &= p_1p_3p_5p_1p_1\cdots p_2 \\ &= p_1^{y_1} \ldots p_k^{y_k} \end{align} \]
The probability of observing the counts \(y_1, y_2, \ldots, y_k\) (regardless of the order in which the outcomes occur) is then \[ P(y_1, y_2, \ldots , y_k) = \binom{n}{y_1 \ldots y_k} p_1^{y_1}p_2^{y_2} \ldots p_k^{y_k} \]
3.2 Example
On 10 rolls of a die, what is the probability of obtaining 3 even numbers and 2 ones?
Let \(Y_1\) be the number of even numbers, \(Y_2\) the number of ones, and \(Y_3\) the number of threes and fives (the outcomes that are neither even nor one).
Then \(Y_1, Y_2, Y_3 \sim \text{Mult}(10; 1/2, 1/6, 1/3)\) with pmf \[ p(y_1, y_2, y_3) = \frac{10!}{y_1 ! y_2 ! y_3 !} \left( \frac{1}{2} \right)^{y_1} \left( \frac{1}{6} \right)^{y_2} \left( \frac{1}{3} \right)^{y_3} \] So, \[ p(3, 2, 5) =\frac{10!}{3! 2! 5!} \left( \frac{1}{2} \right)^3 \left( \frac{1}{6} \right)^2 \left( \frac{1}{3} \right)^5 = 0.03601 \]
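A quick sketch of the same calculation (Python standard library; the helper name multinomial_pmf is just an illustration, not a library function):

```python
# Evaluate the multinomial pmf n!/(y_1! ... y_k!) p_1^{y_1} ... p_k^{y_k}.
from math import factorial

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n)
    for y in counts:
        coef //= factorial(y)          # multinomial coefficient
    prob = float(coef)
    for y, p in zip(counts, probs):
        prob *= p ** y
    return prob

print(multinomial_pmf([3, 2, 5], [1/2, 1/6, 1/3]))  # ~0.0360
```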
4 Three Important Theorems Regarding Linear Combinations of Random Variables
Theorem 1
\[ E \left\{ \sum_{i = 1}^n a_i Y_i \right\} = \sum_{i = 1}^n a_i\mu_i, \qquad \text{where } \mu_i = E(Y_i) \]
Theorem 2 \[ Var\left( \sum_{i = 1}^n a_i Y_i \right) = \sum_{i=1}^n a_i^2 \sigma_i^2 + 2 \sum_{i = 1}^{n - 1}\sum_{j = i + 1}^n a_i a_j \sigma_{ij} = \sum_{i = 1}^n \sum_{j = 1}^n a_i a_j \sigma_{ij} \] where \(\sigma_i^2 = Var(Y_i)\) and \(\sigma_{ij} = Cov(Y_i, Y_j)\) (so that \(\sigma_{ii} = \sigma_i^2\)).
Theorem 3 \[ Cov\left( \sum_{i = 1}^n a_i Y_i, \sum_{i = 1}^n b_i Y_i \right) = \sum_{i=1}^n \sum_{j = 1}^n a_i b_j \sigma_{ij} \]
Proof. Theorem 1 follows immediately from the linearity of expectation, and Theorem 2 is the special case of Theorem 3 obtained by taking \(b_i = a_i\). Hence, we only need to prove Theorem 3.
\[ \begin{align} \text{LHS} & = E \left[ \left( \sum_{i = 1}^n a_i Y_i - E\left( \sum_{i = 1}^n a_i Y_i \right) \right) \left( \sum_{j = 1}^n b_j Y_j - E\left( \sum_{j=1}^n b_j Y_j \right) \right) \right] \\ & = E \left[ \left( \sum_{i = 1}^n a_i Y_i - \sum_{i = 1}^n a_i \mu_i \right) \left( \sum_{j = 1}^n b_j Y_j - \sum_{j = 1}^n b_j \mu_j \right) \right] \\ & = E \left[ \left( \sum_{i = 1}^n a_i (Y_i - \mu_i) \right) \left( \sum_{j = 1}^n b_j (Y_j - \mu_j) \right) \right] \\ & = \sum_{i = 1}^n \sum_{j = 1}^n a_i b_j E\left\{ (Y_i - \mu_i) (Y_j - \mu_j) \right\} = RHS \\ \end{align} \]
Suppose that \(Y_1\), \(Y_2\), and \(Y_3\) are three rv’s with means 2, -7, and 5, variances 10, 6, and 9, and covariances \(\sigma_{12} = -1\), \(\sigma_{13} = 3\), and \(\sigma_{23} = 0\).
Find:
- \(E(3Y_1 - 2Y_2 + Y_3)\)
- \(Var(3Y_1 - 2Y_2 + Y_3)\)
- \(Cov(3Y_1 - 2Y_2, Y_2 + 8 Y_3)\)
- \(E(3Y_1 - 2Y_2 + Y_3)\)
\[ \begin{align} E(3Y_1 - 2Y_2 + Y_3) & = 3 \mu_1 - 2\mu_2 + \mu_3 \\ & = 3 \times 2 - 2 \times (-7) + 5 = 25 \\ \end{align} \]
- \(Var(3Y_1 - 2Y_2 + Y_3)\)
\[ \begin{align} Var(3Y_1 - 2Y_2 + Y_3) & = \sum_{i = 1}^n \sum_{j = 1}^n a_i a_j \sigma_{ij} \\ & = 3 \times 3 \times 10 + 3 \times (-2) \times (-1) + 3 \times 1 \times 3 \\ & \qquad + (-2) \times 3 \times (-1) + (-2) \times (-2) \times 6 + (-2) \times 1 \times 0 \\ & \qquad + 1 \times 3 \times 3 + 1 \times (-2) \times 0 + 1 \times 1 \times 9 \\ & = 105 + 30 + 18 = 153 \\ \end{align} \]
- \(Cov(3Y_1 - 2Y_2, Y_2 + 8 Y_3)\)
Now, in order to use Theorem 3, both linear combinations must be expressed in terms of the same list of variables. Therefore, we pad with zero coefficients:
\[ Cov(3 Y_1 - 2 Y_2, Y_2 + 8 Y_3) = Cov(3 Y_1 - 2 Y_2 + 0 Y_3, 0 Y_1 + Y_2 + 8 Y_3) \]
Then, applying Theorem 3,
\[ \begin{align} Cov(3 Y_1 - 2 Y_2, Y_2 + 8 Y_3) &= Cov(3 Y_1 - 2 Y_2 + 0 Y_3, 0 Y_1 + Y_2 + 8 Y_3) \\ & = 3(1)\sigma_{12} + 3(8)\sigma_{13} + (-2)1\sigma_{22} + (-2)8\sigma_{23} \\ & = 3(-1) + 24(3) - 2(6) - 16(0) \\ & = 57 \end{align} \]
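All three answers can be checked numerically by collecting the means in a vector and the variances and covariances in a matrix (a sketch, assuming NumPy is available):

```python
# Mean vector and covariance matrix of (Y1, Y2, Y3) from the problem statement.
import numpy as np

mu = np.array([2.0, -7.0, 5.0])
Sigma = np.array([[10.0, -1.0, 3.0],
                  [-1.0,  6.0, 0.0],
                  [ 3.0,  0.0, 9.0]])

a = np.array([3.0, -2.0, 1.0])    # coefficients of 3Y1 - 2Y2 + Y3
print(a @ mu)                     # E(3Y1 - 2Y2 + Y3)   = 25
print(a @ Sigma @ a)              # Var(3Y1 - 2Y2 + Y3) = 153

b = np.array([3.0, -2.0, 0.0])    # 3Y1 - 2Y2
c = np.array([0.0,  1.0, 8.0])    # Y2 + 8Y3
print(b @ Sigma @ c)              # Cov(3Y1 - 2Y2, Y2 + 8Y3) = 57
```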
As an application of these theorems, let us derive the mean and variance of the hypergeometric distribution. In order to determine the mean with Theorem 1, we first need to decompose the random variable into a linear combination of smaller chunks. In fact, we need to know the marginal distribution of each of these random variables, while noting that the individual trials are clearly not independent. So, let \(X_i\) be the random variable that indicates whether the outcome at trial \(i\) is a success (1) or not (0). Assuming \(Y \sim \text{HyperGeom}(N, r, n)\) and \(X_1 + \cdots + X_n = Y\), the marginal distribution of \(X_i\) is the following: \[ \begin{align} P(X_i = 1) &= \sum_{\text{all configurations with $X_i = 1$}} p(x_1, x_2, \ldots , x_{i-1}, 1, x_{i+1}, \ldots , x_n) \\ & = \sum_{\text{all configurations with $X_i = 1$}} P(X_i = 1 \mid X_1 = x_1, \ldots, X_{i - 1} = x_{i - 1}, X_{i + 1} = x_{i+1}, \ldots, X_{n} = x_n) P(\text{all other configurations}) \\ & = \sum_{y' = 0}^{\min(r, n - 1)}{\left(\frac{r - y'}{N - n + 1}\right) \frac{\binom{r}{y'}\binom{N - r}{n - 1 - y'}}{\binom{N}{n - 1}}} \\ & = \frac{1}{(N - n + 1)\binom{N}{n - 1}} \sum_{y' = 0}^{\min(r, n - 1)}{(r - y') \binom{r}{y'}\binom{N - r}{n - 1 - y'}} \\ & = \frac{1}{(N - n + 1)\binom{N}{n - 1}} \sum_{y' = 0}^{\min(r, n - 1)}{r \binom{r - 1}{y'}\binom{N - r}{n - 1 - y'}} \\ & = \frac{r}{(N - n + 1)\binom{N}{n - 1}} \sum_{y' = 0}^{\min(r, n - 1)}{ \binom{r - 1}{y'}\binom{N - r}{n - 1 - y'}} \\ & = \frac{r}{(N - n + 1)\binom{N}{n - 1}} \binom{N - 1}{n - 1} \\ & = \frac{r \binom{N-1}{n-1}}{(N-n+1)\frac{N}{N-n+1} \binom{N-1}{n-1}} \\ & = \frac{r}{N} \\ \end{align} \tag{1}\]
Note that:
- Going from line 2 to line 3, we have conditioned on the number \(y'\) of successes among the other \(n - 1\) draws \(X_1, \ldots , X_{i - 1}, X_{i + 1}, \ldots, X_n\): given that those draws contain \(y'\) successes, draw \(i\) is made from the remaining \(N - n + 1\) items, of which \(r - y'\) are successes.
- Going from line 3 to line 4 involves only algebraic rearrangement.
- Going from line 4 to line 5 uses the relatively simple combinatorial identity \((r - k)\binom{r}{k} = r \binom{r-1}{k}\).
- Going from line 5 to line 6 requires only moving \(r\) out from the summation.
- Going from line 6 to line 7 uses Vandermonde’s identity, under the assumption that \(r \geq n - 1\) (see footnote 1). The identity states that \(\sum_{k = 0}^{m} \binom{a}{k} \binom{b}{m - k} = \binom{a + b}{m}\); it can be proven by comparing the coefficient of \(x^m\) in the expansion of \((1+x)^a(1+x)^b\) with that in \((1+x)^{a+b}\), since the two are the same polynomial.
- Going from line 7 to line 8 uses the identity \(\binom{N}{n - 1} = \frac{N}{N - n + 1}\binom{N - 1}{n - 1}\), which follows directly from writing both binomial coefficients in factorial form.
- Going from line 8 to line 9 requires only algebraic manipulations.
Then, from this probability, it is clear that each individual trial still follows a Bernoulli distribution, \(X_i \sim \text{Bern}(r/N)\). Therefore, the mean and variance of each \(X_i\) follow from existing results. However, the covariances need to be derived: \[ \begin{align} Cov(X_i, X_j) & = E(X_i X_j) - E(X_i)E(X_j) \\ & = \sum_{x_i = 0}^1 \sum_{x_j = 0}^1 x_i x_j P(X_i = x_i, X_j = x_j) - \left( \frac{r}{N} \right)^2 \\ & = P(X_i = 1, X_j = 1) - \left( \frac{r}{N} \right)^2 \\ & = \frac{r(r-1)}{N(N-1)} - \left( \frac{r}{N} \right)^2 = \frac{r(r-N)}{N^2(N-1)} \\ \end{align} \tag{2}\] Note that:
- Going from the second-last to the last line can be proven with a similar argument to the one above. However, an intuitive way of thinking about it is to use the law of total probability: first find \(P(X_i = 1)\) and then the conditional probability \(P(X_j = 1 \mid X_i = 1)\). This argument works because we can assume, without loss of generality, that the unconditional probability is taken for the earlier of the two draws and the conditional probability for the later one. A more formal approach would use a similar argument to Equation 1 but replace the first factor in the summation in line 3 with \[\frac{\binom{r - y'}{2}\binom{N - (n - 2) - (r - y')}{0}}{\binom{N - n + 2}{2}}\] which is simply a special case of the hypergeometric probability in which both remaining draws must be successes. (A small simulation check of these results is sketched below.)
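Here is the simulation check mentioned above (a sketch assuming NumPy; the values of \(N\), \(r\), and \(n\) are arbitrary). Each replicate draws \(n\) items without replacement and records the success indicators \(X_1, \ldots, X_n\).

```python
# Estimate P(X_i = 1) and Cov(X_i, X_j) for draws without replacement.
import numpy as np

rng = np.random.default_rng(0)
N, r, n = 20, 8, 5
reps = 100_000

population = np.array([1] * r + [0] * (N - r))   # r successes, N - r failures
draws = np.array([rng.permutation(population)[:n] for _ in range(reps)])

print(draws[:, 0].mean())                        # ~ r/N = 0.4
print(np.cov(draws[:, 0], draws[:, 1])[0, 1])    # ~ r(r - N)/(N^2 (N - 1)) = -0.0126
```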
We can now apply Theorem 1 and Theorem 2 to obtain the required mean and variance of the hypergeometric distribution.
\[ \begin{align} E(Y) & = E \left( \sum_{i = 1}^n X_i \right) \\ & = \sum_{i=1}^n E(X_i) \\ & = \frac{nr}{N} \\ Var(Y) & = Var\left( \sum_{i=1}^n X_i \right) \\ & = \sum_{i = 1}^n \sum_{j = 1}^n \sigma_{ij} \\ & = \left(\sum_{i = 1}^n \sigma_i^2\right) + \left( 2 \sum_{i=1}^{n-1} \sum_{j = i + 1}^n \sigma_{ij} \right) \\ & = n \frac{r}{N} \left(1 - \frac{r}{N}\right) + n(n-1) \frac{r(r-N)}{N^2(N-1)} \\ & = \frac{nr}{N} \left( 1 - \frac{r}{N} + \frac{(n - 1)(r - N)}{N(N - 1)} \right) \\ & = \frac{nr}{N} \left( \frac{N^2 - N}{N(N - 1)} - \frac{rN - r}{N(N - 1)} + \frac{(n - 1)(r - N)}{N(N - 1)} \right) \\ & = \frac{nr}{N} \left( \frac{N^2 - rN + rn - nN}{N(N-1)}\right) \\ & = \frac{nr}{N} \left( \frac{N(N - n) - r(N - n)}{N(N-1)} \right) \\ & = \frac{nr(N - r)(N - n)}{N^2(N - 1)} \\ \end{align} \]
This agrees with the formulas given on the formula sheet.
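As an additional check, the following sketch (standard library only; the chosen values of \(N\), \(r\), \(n\) are arbitrary) computes the mean and variance exactly from the hypergeometric pmf and compares them with the formulas just derived.

```python
# Exact mean and variance by summing over the hypergeometric pmf.
from math import comb

def hypergeom_moments(N, r, n):
    support = range(max(0, n - (N - r)), min(r, n) + 1)
    pmf = {y: comb(r, y) * comb(N - r, n - y) / comb(N, n) for y in support}
    mean = sum(y * p for y, p in pmf.items())
    var = sum(y * y * p for y, p in pmf.items()) - mean ** 2
    return mean, var

N, r, n = 20, 8, 5
print(hypergeom_moments(N, r, n))                                  # (2.0, 0.947...)
print(n * r / N, n * r * (N - r) * (N - n) / (N ** 2 * (N - 1)))   # same values
```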
Use the above three theorems to find \(Cov(Y_i, Y_j)\) when \(i \neq j\) and \(Y_1, Y_2, \ldots , Y_k \sim \text{Mult}(n; p_1, p_2, \ldots, p_k)\).
I don’t think there is an easy way to apply the three theorems directly here. Instead, a common approach is to introduce two sets of indicator variables that record whether the outcome of each trial falls in class \(i\) or class \(j\), and then expand the covariance of the two resulting sums (see the sketch below).
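As a sketch of where that argument leads (the closed form \(Cov(Y_i, Y_j) = -np_ip_j\) is the standard result for the multinomial distribution), here is a small simulation check assuming NumPy, using the die example from earlier:

```python
# Estimate Cov(Y_1, Y_2) for multinomial counts and compare with -n p_1 p_2.
import numpy as np

rng = np.random.default_rng(1)
n, probs = 10, [1/2, 1/6, 1/3]
counts = rng.multinomial(n, probs, size=200_000)    # each row is (Y1, Y2, Y3)

print(np.cov(counts[:, 0], counts[:, 1])[0, 1])     # ~ -n p_1 p_2 = -0.833
print(-n * probs[0] * probs[1])
```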
5 Continuous Multivariate Probability Distributions
\(Y_1, Y_2, \ldots , Y_n\) have a continuous multivariate probability distribution if their joint cdf \[ F(y_1, y_2, \ldots , y_n) = P(Y_1 \leq y_1, Y_2 \leq y_2, \ldots , Y_n \leq y_n) \] is continuous everywhere.
The joint pdf of \(Y_1, Y_2, \ldots, Y_n\) is then \[ f(y_1, y_2, \ldots , y_n) = \frac{\partial^n F(y_1, y_2, \ldots, y_n)}{\partial y_1 \partial y_2 \ldots \partial y_n} \]
All the definitions and results made for discrete joint distributions also hold for continuous ones, except that summations must be replaced by integrals, and \(p\)’s need to be replaced by \(f\)’s.
Thus: \[ \idotsint_{\mathbb{R}^n} f(y_1, y_2, \ldots, y_n) \, dy_1 dy_2 \ldots dy_n = 1 \]
Also, we can calculate probabilities by evaluating the integral over the correct region. Similarly to the discrete case, we define the following (assuming two random variables for ease of demonstration):
- \(f_X(x) = \int f(x, y) \, dy\) (marginal pdf of \(X\))
- \(f_{X|Y}(x|y) = \frac{f(x, y)}{f(y)}\) (conditional pdf of \(X\) given \(Y = y\))
- \(E(g(X, Y)) = \iint g(x, y) f(x, y) \, dx dy\)
- \(E(c) = c\), etc.
6 Conditional Expectation
Now, with the definition of conditional probability in hand, we can define the conditional expectation as follows. \[ E(X \mid Y = y) = \begin{cases} \sum_x xp(x \mid y), & \text{if $X$ is discrete} \\ \int x f(x\mid y) \, dx, & \text{if $X$ is continuous} \\ \end{cases} \]
Suppose \(X\) and \(Y\) are two continuous random variables with joint pdf \[ f(x, y) = cxy, \qquad 0 < x < 2y < 4 \] Find:
- \(P(X > 1, Y < 1)\)
- \(E(Y)\)
- \(\rho\)
- \(E(X|Y = 1)\)
- \(P(X > 1, Y < 1)\)
This question is not hard once you recognise that it is simply a multivariate integration problem. However, we first need to determine the constant \(c\).
\[ \begin{align} 1 & = \int_{x=0}^4 \int_{y=\frac{1}{2}x}^2 cxy \, dydx \\ & = c\int_{x=0}^4 x \left[ \frac{1}{2}y^2 \right]_{\frac{1}{2}x}^2 dx \\ & = c \int_{x=0}^4 x \left( 2 - \frac{1}{8} x^2 \right) dx \\ & = c \left[ x^2 - \frac{1}{32}x^4 \right]_{0}^4 \\ & = c \left(16 - \frac{256}{32}\right) = 8c \\ \end{align} \]
Therefore, we know that \(c = \frac{1}{8}\). Now, we can proceed to evaluate the integral.
\[ \begin{align} P(X > 1, Y < 1) & = \int_{x=1}^2 \int_{y=\frac{1}{2}x}^1 \frac{xy}{8} \, dy dx \\ & = \frac{1}{8} \int_{x=1}^2 x \left[ \frac{y^2}{2} \right]_{\frac{1}{2}x}^1 \, dx \\ & = \frac{1}{8} \int_{x=1}^2 x \left( \frac{1}{2} - \frac{x^2}{8} \right) \, dx \\ & = \frac{1}{8} \left[ \frac{x^2}{4} - \frac{x^4}{32} \right]_1^2 \\ & = \frac{9}{256} \approx 0.035 \end{align} \]
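Both the normalising constant and this probability can be cross-checked numerically (a sketch assuming SciPy is available):

```python
# Numerical double integration over the region 0 < x < 2y < 4.
from scipy.integrate import dblquad

f = lambda y, x: x * y / 8   # joint pdf; dblquad passes the inner variable first

# Normalisation: x in (0, 4), y in (x/2, 2) should integrate to 1 when c = 1/8.
total, _ = dblquad(f, 0, 4, lambda x: x / 2, lambda x: 2)
print(total)                 # ~1.0

# P(X > 1, Y < 1): x in (1, 2), y in (x/2, 1).
prob, _ = dblquad(f, 1, 2, lambda x: x / 2, lambda x: 1)
print(prob)                  # 9/256 ~ 0.0352
```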
- \(E(Y)\)
In order to find the expectation, we need the marginal distribution of \(Y\); however, the two steps can be combined into one double integral.
\[ \begin{align} E(Y) & = \int_{y = 0}^2 y \int_{x=0}^{2y} \frac{xy}{8} \, dxdy \\ & = \int_{y = 0}^2 \frac{y^2}{8} \left[ \frac{x^2}{2} \right]_0^{2y} \, dy \\ & = \int_{y=0}^2 \frac{y^2}{8} \left( 2y^2 \right) = \int_{y=0}^2 \frac{y^4}{4} \, dy \\ & = \left[ \frac{y^5}{20} \right]_0^2 = \frac{32}{20} = \frac{8}{5} \end{align} \]
- \(\rho\)
From the inner integral above, the marginal pdf of \(Y\) is \(f_Y(y) = \frac{y^3}{4}\) for \(0 < y < 2\), so
\[ \begin{align} E(Y^2) &= \int_{y = 0}^2 y^2 \frac{y^3}{4} \, dy \\ & = \frac{2^6}{24} = \frac{8}{3} \\ Var(Y) & = \frac{8}{3} - \left( \frac{8}{5} \right)^2 \\ & = \frac{8}{75} \\ \end{align} \]
We also need to determine the same quantities for \(X\).
\[ \begin{align} E(X) & = \int_{y = 0}^2 \int_{x=0}^{2y} x f(x, y) \, dxdy \\ &= \int_{y=0}^2 \frac{y}{8} \int_{x=0}^{2y} x^2 \, dxdy \\ & = \int_{y = 0}^2 \frac{y}{8} \frac{8y^3}{3} \, dy \\ & = \left[ \frac{y^5}{15} \right]_0^2 = \frac{32}{15} \\ E(X^2) & = \int_{y = 0}^2 \int_{x=0}^{2y} x^2 f(x, y) \, dxdy \\ &= \int_{y=0}^2 \frac{y}{8} \int_{x=0}^{2y} x^3 \, dxdy \\ & = \int_{y = 0}^2 \frac{y}{8} \frac{16y^4}{4} \, dy = \int_{y = 0}^2 \frac{y^5}{2} \, dy \\ & = \left[ \frac{y^6}{12} \right]_0^2 = \frac{16}{3} \\ Var(X) & = E(X^2) - (E(X))^2 \\ & = \frac{16}{3} - \left( \frac{32}{15} \right)^2 \\ & = \frac{176}{225} \\ E(XY) & = \int_{y = 0}^2 \int_{x=0}^{2y} xy f(x, y) \, dxdy \\ &= \int_{y=0}^2 \frac{y^2}{8} \int_{x=0}^{2y} x^2 \, dxdy \\ & = \int_{y = 0}^2 \frac{y^2}{8} \frac{8y^3}{3} \, dy = \int_{y = 0}^2 \frac{y^5}{3} \, dy \\ & = \left[ \frac{y^6}{18} \right]_0^2 \\ & = \frac{32}{9} \\ Cov(X, Y) & = E(XY) - E(X)E(Y) \\ & = \frac{32}{9} - \frac{32}{15}\frac{8}{5} = \frac{32}{225} \\ \rho & = \frac{Cov(X, Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} \\ & = \frac{\frac{32}{225}}{\sqrt{\frac{176}{225}}\sqrt{\frac{8}{75}}} = 0.4924 \end{align} \]
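The same numerical-integration idea (again a sketch assuming SciPy; the helper E is just an illustration) can be used to cross-check these moments and the correlation:

```python
# Numerically compute E(g(X, Y)) = iint g(x, y) f(x, y) dx dy over the support.
from math import sqrt
from scipy.integrate import dblquad

f = lambda y, x: x * y / 8   # joint pdf on 0 < x < 2y < 4

def E(g):
    val, _ = dblquad(lambda y, x: g(x, y) * f(y, x), 0, 4,
                     lambda x: x / 2, lambda x: 2)
    return val

EX, EY, EXY = E(lambda x, y: x), E(lambda x, y: y), E(lambda x, y: x * y)
VarX = E(lambda x, y: x * x) - EX ** 2       # 176/225
VarY = E(lambda x, y: y * y) - EY ** 2       # 8/75
rho = (EXY - EX * EY) / (sqrt(VarX) * sqrt(VarY))
print(EX, EY, rho)                           # 32/15, 8/5, ~0.4924
```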
- \(E(X|Y = 1)\)
Let’s first calculate the conditional distribution.
\[ \begin{align} f(x \mid y) &= \frac{f(x, y)}{f(y)} \\ & = \frac{\frac{xy}{8}}{\frac{y^3}{4}} \\ & = \frac{x}{2y^2} \, (0 < x < 2y) \\ \end{align} \]
Then, we can find the expectation as,
\[ \begin{align} E(X \mid Y = y) & = \int_0^{2y} \frac{x^2}{2y^2} dx \\ & = \frac{1}{2y^2} \frac{8y^3}{3} \\ & = \frac{4y}{3} \\ \end{align} \]
Therefore, \(E(X | Y = 1) = \frac{4}{3}\).
6.1 Random Expectations
Definition 1 (Random Expectations) By \(E(X\mid Y)\), we denote the function \(E(X\mid Y = y)\) with \(y\) replaced by \(Y\). This makes \(E(X \mid Y)\) itself a random variable, to which we can again apply expectations.
For example, the example above gives \[ E(X \mid Y) = \frac{4}{3} Y \] Note that \(E(X \mid Y)\) is a random variable that is a function of \(Y\) rather than \(X\), since we have already taken the expectation over \(X\).
6.2 The Law of Iterated Expectation
This theorem is commonly used in Bayesian Inference as we will discuss in CH16.
Theorem 4 (Law of Iterated Expectation) \[ E(X) = E(E(X \mid Y)) \]
Proof. \[ \begin{align} E(E(X\mid Y)) & = \int E(X | Y = y) f(y) \, dy = \int \left( \int x f(x \mid y) \, dx \right) f(y) \, dy \\ & = \iint x f(x, y) \, dxdy = E(X) \end{align} \]
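As a quick sanity check using the earlier example, where \(E(X \mid Y) = \frac{4}{3}Y\), \(E(Y) = \frac{8}{5}\), and \(E(X) = \frac{32}{15}\):
\[ E\{E(X \mid Y)\} = E\left( \frac{4}{3} Y \right) = \frac{4}{3} \times \frac{8}{5} = \frac{32}{15} = E(X) \]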
Footnotes
I believe this is generally a plausible assumption to make; otherwise it would be possible to run out of items while performing the draws.↩︎