Probability/Conditional Distributions

Motivation

Suppose there is an earthquake. Let $X$ be the number of casualties and $Y$ be the magnitude of the earthquake on the Richter scale.

(a) Without any further information, what is the distribution of $X$ ?

(b) Given that $Y=1$ , what is the distribution of $X$ ?

(c) Given that $Y=9$ , what is the distribution of $X$ ?

Remark.

$Y=1$ means the earthquake is micro, and $Y=9$ means the earthquake is great.

Are your answers to (a),(b),(c) different?

In (b) and (c), we have the conditional distribution of $X$ given $Y=1$ , and the conditional distribution of $X$ given $Y=9$ respectively.

In general, we have the conditional distribution of $X$ given $Y$ (before observing the value of $Y$ ), or of $X$ given $Y=y$ (after observing the value of $Y$ ).

Conditional distributions

Recall the definition of conditional probability: $\mathbb {P} (A|B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (B)}},$ in which $A,B$ are events, with $\mathbb {P} (B)>0$ . Applying this definition to discrete random variables $X,Y$ , we have $\mathbb {P} (X=x|Y=y)={\frac {\mathbb {P} (X=x\cap Y=y)}{\mathbb {P} (Y=y)}}={\frac {f(x,y)}{f_{Y}(y)}},$ where $f(x,y)$ is the joint pmf of $X$ and $Y$ , and $f_{Y}(y)$ is the marginal pmf of $Y$ . It is natural to call this conditional probability the conditional pmf, and we denote it by $f_{X|Y}(x|y)$ . In other words, the conditional pmf of $X$ given $Y=y$ is the conditional probability $\mathbb {P} (X=x|Y=y)$ . We would expect the conditional pdf to be defined similarly, and this is indeed the case:

Definition. (Conditional probability function) Let $X,Y$ be random variables that are both discrete or both continuous. The conditional probability (mass or density) function of $X$ given $Y=y$ , in which $y$ is a real number with $f_{Y}(y)>0$ , is $\underbrace {f_{X|Y}({\color {darkgreen}x}|y)} _{{\text{function of }}{\color {darkgreen}x}}={\frac {\overbrace {f({\color {darkgreen}x},y)} ^{\text{joint probability function}}}{\underbrace {f_{Y}(y)} _{\text{marginal probability function}}}}\propto \underbrace {f({\color {darkgreen}x},y)} _{{\text{function of }}{\color {darkgreen}x}}$

Remark.

The marginal pdf can be interpreted as a normalizing constant, which makes the integral $\int _{-\infty }^{\infty }f_{X|Y}({\color {darkgreen}x}|y)\,d{\color {darkgreen}x}=1$ , since $\int _{-\infty }^{\infty }f({\color {darkgreen}x},y)\,d{\color {darkgreen}x}=\underbrace {f_{Y}(y)} _{\text{marginal pdf}}$ . Here we integrate over the region in which $Y$ is fixed at $y$ (the region in which the condition is satisfied), so we only integrate over the corresponding interval of $x$ ( $x$ is still a variable).

This is similar to the denominator in the definition of conditional probability, which makes the conditional probability of the whole sample space equal to one, as the probability axioms require.
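For the discrete case, the normalization in the definition can be sketched in a few lines of Python. The joint pmf below is made up purely for illustration, and the helper names `marginal_y` and `conditional_pmf` are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf f(x, y) of two discrete random variables,
# stored as {(x, y): probability}. The numbers are made up for illustration.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

def marginal_y(joint, y):
    """f_Y(y): sum the joint pmf over all x with y held fixed."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def conditional_pmf(joint, y):
    """f_{X|Y}(. | y) as a dict {x: f(x, y) / f_Y(y)}; requires f_Y(y) > 0."""
    fy = marginal_y(joint, y)
    if fy == 0:
        raise ValueError("f_Y(y) must be positive")
    return {x: p / fy for (x, yy), p in joint.items() if yy == y}

cond = conditional_pmf(joint, 0)
print(cond)                # conditional pmf of X given Y = 0
print(sum(cond.values()))  # the normalized slice sums to 1
```

Dividing by $f_{Y}(0)=1/2$ turns the slice $(1/8,3/8)$ into the pmf $(1/4,3/4)$ , which is exactly the role of the normalizing constant described in the remark.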

To understand the definition more intuitively for the continuous case, consider the following diagram.

Top view:
     
        |
        |
        *---------------* 
        |               |
        |               |
fixed y *===============* <--- corresponding interval
        |               |
        |               |
        *---------------*
        |
        *---------------- x

Side view:

          *  
         / \ 
        *\  *  /                                           
       /|#\   \
   |  / |##\ / *---------*
   | *  |###\            /\
   | |\ |##/#\----------/--\     
   | | \|#/###*--------*   /                             
   | |  \/############/#\ /                              
   | |y *\===========/===*                               
   | | /  *---------*   /                                
   | |/              \ /                                 
   | *----------------*                                  
   |/                                                    
   *------------------------- x                          


Front view:
             
    |
    |
    |               
    *\     
    |#\    
    |##\   
    |###\             
    |####\   <------ Area: f_Y(y)
    |#####*--------*  
    |###############\ 
    *================*-------------- x

*---*
|###| : corresponding cross section from joint pdf
*---*

We can see that when we condition on $Y=y$ , we take a "slice" out of the region under the joint pdf. The area of the whole slice is the area between the $x$ -axis and the joint pdf $f(x,y)$ viewed as a function of $x$ alone, with $y$ fixed. This area is $\int _{-\infty }^{\infty }f({\color {darkgreen}x},y)\,d{\color {darkgreen}x}=f_{Y}(y)$ , whereas the probability axioms require the total area under a pdf to equal 1. Hence, we rescale the slice by dividing $f(x,y)$ by $f_{Y}(y)$ . After that, the curve at the top of the rescaled slice is the graph of the conditional pdf ${\frac {f(x,y)}{f_{Y}(y)}}$ .

So far, we have discussed the case where both random variables are discrete or both are continuous. How about the case where one of them is discrete and the other is continuous? In this case, there is no "joint probability function" of these two random variables, since one is discrete and the other is continuous. But we can still define the conditional probability function in another way. To motivate the following definition, let $F_{X|Y}(x|y)$ be the conditional probability $\mathbb {P} (X\leq x|Y=y)$ , where $X$ is continuous and $Y$ is discrete. Then, differentiating $F_{X|Y}(x|y)$ with respect to $x$ should yield the conditional pdf $f_{X|Y}(x|y)$ . So, we have ${\begin{aligned}f_{X|Y}(x|y)={\frac {d}{dx}}F_{X|Y}(x|y)&=\lim _{h\to 0}{\frac {\mathbb {P} (X\leq x+h|Y=y)-\mathbb {P} (X\leq x|Y=y)}{h}}\\&=\lim _{h\to 0}{\frac {\mathbb {P} (x<X\leq x+h|Y=y)}{h}}\\&=\lim _{h\to 0}{\frac {\mathbb {P} (Y=y|x<X\leq x+h)\mathbb {P} (x<X\leq x+h)}{h\mathbb {P} (Y=y)}}\\&=\lim _{h\to 0}{\frac {\mathbb {P} (Y=y|x<X\leq x+h)}{\mathbb {P} (Y=y)}}\lim _{h\to 0}{\frac {\mathbb {P} (x<X\leq x+h)}{h}}\\&={\frac {\mathbb {P} (Y=y|X=x){\frac {d}{dx}}F_{X}(x)}{\mathbb {P} (Y=y)}}\\&={\frac {\mathbb {P} (Y=y|X=x)f_{X}(x)}{\mathbb {P} (Y=y)}}.\\\end{aligned}}$ Thus, it is natural to have the following definition.

Definition. (Conditional probability density function when $X$ is continuous and $Y$ is discrete) Let $X$ be a continuous random variable and $Y$ be a discrete random variable. The conditional probability density function of $X$ given $Y=y$ , where $y$ is a real number, is $f_{X|Y}(x|y)={\frac {\mathbb {P} (Y=y|X=x)f_{X}(x)}{\mathbb {P} (Y=y)}}.$

Now, how about the case where $X$ is discrete and $Y$ is continuous? We can use the above definition as motivation, interchanging the roles of $X$ and $Y$ so that its assumptions are still satisfied. Then, we get $f_{Y|X}(y|x)={\frac {\mathbb {P} (X=x|Y=y)f_{Y}(y)}{\mathbb {P} (X=x)}}.$ Since $X$ is discrete, it is natural to define the conditional pmf of $X$ given $Y=y$ as the term $\mathbb {P} (X=x|Y=y)$ in this expression. Rearranging the terms, we get $\mathbb {P} (X=x|Y=y)={\frac {f_{Y|X}(y|x)\mathbb {P} (X=x)}{f_{Y}(y)}}.$ Thus, we have the following definition.

Definition. (Conditional probability mass function when $X$ is discrete and $Y$ is continuous) Let $X$ be a discrete random variable and $Y$ be a continuous random variable. The conditional probability mass function of $X$ given $Y=y$ , where $y$ is a real number, is $f_{X|Y}(x|y)={\frac {f_{Y|X}(y|x)\mathbb {P} (X=x)}{f_{Y}(y)}}.$
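As a minimal numerical sketch of the mixed case with $X$ continuous and $Y$ discrete, assume (purely for illustration) that $X\sim {\text{Uniform}}(0,1)$ and that, given $X=x$ , $Y$ is Bernoulli with success probability $x$ . The formula then gives $f_{X|Y}(x|1)=x\cdot 1/(1/2)=2x$ on $[0,1]$ . The function names below are hypothetical, and the integrals are approximated with a simple midpoint rule.

```python
def f_X(x):
    """Density of X ~ Uniform(0, 1) (assumed for this example)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def p_Y_given_X(y, x):
    """P(Y = y | X = x) when Y | X = x ~ Bernoulli(x), for y in {0, 1}."""
    return x if y == 1 else 1.0 - x

def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# P(Y = y) = integral of P(Y = y | X = x) f_X(x) dx, computed once per y.
P_Y = {
    y: midpoint_integral(lambda x, y=y: p_Y_given_X(y, x) * f_X(x), 0.0, 1.0)
    for y in (0, 1)
}

def f_X_given_Y(x, y):
    """Conditional pdf of X given Y = y: P(Y=y | X=x) f_X(x) / P(Y=y)."""
    return p_Y_given_X(y, x) * f_X(x) / P_Y[y]

total = midpoint_integral(lambda x: f_X_given_Y(x, 1), 0.0, 1.0)
print(P_Y[1])               # close to 0.5
print(f_X_given_Y(0.3, 1))  # close to 2 * 0.3 = 0.6
print(total)                # close to 1, as a pdf should integrate to
```

The check that `total` is close to 1 confirms that dividing by $\mathbb {P} (Y=y)$ again plays the role of a normalizing constant.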

Based on the definitions of conditional probability functions, it is natural to define the conditional cdf as follows.

Definition. (Conditional cumulative distribution function) Let $X,Y$ be discrete or continuous random variables. The conditional cumulative distribution function (cdf) of $X$ given $Y=y$ , in which $y$ is a real number, is $F_{X|Y}({\color {darkgreen}x}|y){\overset {\text{ def }}{=}}\mathbb {P} (X\leq {\color {darkgreen}x}|Y=y)={\begin{cases}\displaystyle \sum _{{\color {red}u}:{\color {red}u}\leq {\color {darkgreen}x}}^{}f_{X|Y}({\color {red}u}|y),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\color {darkgreen}x}f_{X|Y}({\color {red}u}|y)\,d{\color {red}u},&X{\text{ is continuous}}.\end{cases}}$

Remark.

We should be aware that when $Y$ is continuous, the event $\{Y=y\}$ has probability zero. So, according to the definition of conditional probability, the conditional cdf in this case should be undefined. However, in this context, we still define it through the conditional probability function above, which is well defined whenever $f_{Y}(y)>0$ .

Graphical illustration of the definition (continuous random variables):

Top view:
     
        |
        |
        *---------------* 
        |               |
        |               |
fixed y *=========@=====* <--- corresponding interval
        |         x     |
        |               |
        *---------------*
        |
        *---------------- 

Side view:

          *  
         / \ 
        *\  *  /                                           
       /|#\   \
   |  / |##\ / *---------*
   | *  |###\            /\
   | |\ |##/#\----------/--\     
   | | \|#/###*--------*   /                             
   | |  \/#########   / \ /                              
   | |y *\========@==/===*                               
   | | /  *-------x-*   /                                
   | |/              \ /                                 
   | *----------------*                                  
   |/                                                    
   *------------------------- x                          


Front view:

    |
    |
    |
    *\      
    |#\    
    |##\              
    |###\             
    |####\   <------------- Area: f_Y(y)         
    |#####*--------*  
    |###########    \ 
    *==========@=====*--------------  
               x
*---*
|###| : the desired region from the cross section from joint pdf, whose area is the probability from the cdf
*---*

If $Y=\mathbf {1} \{A\}$ for some event $A$ , we have some special notations for simplicity:

the conditional probability function of $X$ given $Y=y$ becomes

$f_{X|Y}({\color {darkgreen}x}|y)={\begin{cases}f({\color {darkgreen}x}|A),&y=1;\\f({\color {darkgreen}x}|A^{c}),&y=0.\end{cases}}$

the conditional cdf of $X$ given $Y=y$ becomes

$F_{X|Y}({\color {darkgreen}x}|y)=\mathbb {P} (X\leq {\color {darkgreen}x}|Y=y)={\begin{cases}F({\color {darkgreen}x}|A),&y=1;\\F({\color {darkgreen}x}|A^{c}),&y=0.\end{cases}}$

Proposition. (Determining independence of two random variables) Random variables $X,Y$ are independent if and only if $f_{X|Y}(x|y)=f_{X}(x){\text{ or }}f_{Y|X}(y|x)=f_{Y}(y)$ for each $x,y$ .

Proof. Recall the definition of independence between two random variables: $X,Y$ are independent if $f(x,y)=f_{X}(x)f_{Y}(y)$ for each $x,y$ .

Since $f_{X|Y}({\color {darkgreen}x}|y)={\frac {\overbrace {f({\color {darkgreen}x},y)} ^{f_{X}({\color {darkgreen}x})f_{Y}(y)}}{f_{Y}(y)}}=f_{X}(x){\text{ and }}f_{Y|X}({\color {darkgreen}y}|x)={\frac {\overbrace {f({\color {darkgreen}y},x)} ^{f_{Y}({\color {darkgreen}y})f_{X}(x)}}{f_{X}(x)}}=f_{Y}(y)$ for each $x,y$ , we have the desired result.

$\Box$

Remark.

This is expected, since conditioning on an event should not affect the probability of another event that is independent of it.
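The proposition suggests a direct check for discrete random variables: compare every conditional pmf value with the corresponding marginal. The sketch below does this with exact fractions. Both joint pmfs are made-up examples, and `marginals` and `is_independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal pmfs (f_X, f_Y) of a joint pmf {(x, y): p}."""
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p
        fy[y] = fy.get(y, 0) + p
    return fx, fy

def is_independent(joint):
    """True when f_{X|Y}(x|y) == f_X(x) for every x and every y with f_Y(y) > 0."""
    fx, fy = marginals(joint)
    return all(
        joint.get((x, y), 0) / fy[y] == fx[x]
        for x, y in product(fx, fy)
        if fy[y] > 0
    )

# Built as a product of marginals, so it should pass the check.
px = {0: Fraction(1, 3), 1: Fraction(2, 3)}
py = {0: Fraction(1, 4), 1: Fraction(3, 4)}
indep = {(x, y): px[x] * py[y] for x, y in product(px, py)}

# Here X = Y with probability one, so conditioning on Y changes everything.
dep = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(indep))  # True
print(is_independent(dep))    # False
```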

We can extend the definition of conditional probability function and cdf to groups of random variables, for joint cdf's and joint probability functions, as follows:

Definition. (Conditional joint probability function) Let $\mathbf {X} =(X_{1},\dotsc ,X_{r})^{T}$ and $\mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}$ be two random vectors. The conditional joint probability function of $\mathbf {X} =(x_{1},\dotsc ,x_{r})$ given $\mathbf {Y} =(y_{1},\dotsc ,y_{s})$ is $f_{\mathbf {X} |\mathbf {Y} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}|y_{1},\dotsc ,y_{s}){\overset {\text{ def }}{=}}\mathbb {P} (X_{1}={\color {darkgreen}x_{1}}\cap \dotsb \cap X_{r}={\color {darkgreen}x_{r}}|Y_{1}=y_{1}\cap \dotsb \cap Y_{s}=y_{s})={\frac {f({\color {darkgreen}x_{1},\dotsc ,x_{r}},y_{1},\dotsc ,y_{s})}{f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}$

Then, we also have a similar proposition for determining independence of two random vectors.

Proposition. (Determining independence of two random vectors) Random vectors $\mathbf {X} =(X_{1},\dotsc ,X_{r})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}$ are independent if and only if $f_{\mathbf {X} |\mathbf {Y} }(x_{1},\dotsc ,x_{r}|y_{1},\dotsc ,y_{s})=f_{\mathbf {X} }(x_{1},\dotsc ,x_{r}){\text{ or }}f_{\mathbf {Y} |\mathbf {X} }(y_{1},\dotsc ,y_{s}|x_{1},\dotsc ,x_{r})=f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})$ for each $x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}$ .

Proof. The definition of independence between two random vectors is: $\mathbf {X} =(X_{1},\dotsc ,X_{r})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}$ are independent if $f(x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s})=f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})$ for each $x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}$ .

Since $f_{\mathbf {X} |\mathbf {Y} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}|y_{1},\dotsc ,y_{s})={\frac {\overbrace {f({\color {darkgreen}x_{1},\dotsc ,x_{r}},y_{1},\dotsc ,y_{s})} ^{f_{\mathbf {X} }({\color {darkgreen}x_{1},\dotsc ,x_{r}})f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}{f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}=f_{\mathbf {X} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}){\text{ and }}f_{\mathbf {Y} |\mathbf {X} }({\color {darkgreen}y_{1},\dotsc ,y_{s}}|x_{1},\dotsc ,x_{r})={\frac {\overbrace {f({\color {darkgreen}y_{1},\dotsc ,y_{s}},x_{1},\dotsc ,x_{r})} ^{f_{\mathbf {Y} }({\color {darkgreen}y_{1},\dotsc ,y_{s}})f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})}}{f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})}}=f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})$ for each $x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}$ , we have the desired result.

$\Box$

Conditional distributions of bivariate normal distribution

Recall from the Probability/Important Distributions chapter that the joint pdf of ${\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})$ is $f(x,y)={\frac {1}{2\pi \sigma _{X}\sigma _{Y}{\sqrt {1-\rho ^{2}}}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right),\quad (x,y)\in \mathbb {R} ^{2}$ , in which $\rho =\rho (X,Y)$ and $\sigma _{X},\sigma _{Y}$ are positive. In this case, $X\sim {\mathcal {N}}(\mu _{X},\sigma _{X}^{2})$ and $Y\sim {\mathcal {N}}(\mu _{Y},\sigma _{Y}^{2})$ .

Proposition. (Conditional distributions of bivariate normal distribution) Let $(X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})$ . Then, $X|(Y=y)\sim {\mathcal {N}}\left(\mu _{X}+\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y}),\sigma _{X}^{2}(1-\rho ^{2})\right),{\text{ and }}Y|(X=x)\sim {\mathcal {N}}\left(\mu _{Y}+\rho \cdot {\frac {\sigma _{Y}}{\sigma _{X}}}(x-\mu _{X}),\sigma _{Y}^{2}(1-\rho ^{2})\right)$ (abuse of notations: when we say the distribution of " $X|(Y=y)$ ", we mean the conditional distribution of $X$ given $Y=y$ ).

Proof.

First, the conditional pdf

${\begin{aligned}f_{X|Y}(x|y)&{\overset {\text{ def }}{=}}{\frac {f(x,y)}{f_{Y}(y)}}\\&=\left.{\frac {1}{{\color {darkgreen}2\pi }\sigma _{X}{\cancel {\sigma _{Y}}}{\sqrt {1-\rho ^{2}}}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\right/{\frac {1}{\sqrt {{\color {darkgreen}2\pi }{\cancel {\sigma _{Y}^{2}}}}}}\exp {\big (}{\color {blue}-(y-\mu _{Y})^{2}/2\sigma _{Y}^{2}}{\big )}\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right){\color {blue}+(y-\mu _{Y})^{2}/2\sigma _{Y}^{2}}\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+{\cancel {\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}}}{\color {purple}-({\cancel {1}}-\rho ^{2})}\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2{\color {blue}\sigma _{X}^{2}}(1-\rho ^{2})}}\left(\left(x-\mu _{X}\right)^{2}-2\rho \cdot {\frac {\color {blue}\sigma _{X}}{\sigma _{Y}}}(x-\mu _{X})(y-\mu _{Y})+\left({\color {purple}\rho }\cdot {\frac {\color {blue}\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)^{2}\right)\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2\sigma _{X}^{2}(1-\rho ^{2})}}\left((x-\mu _{X})-\left(\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)\right)^{2}\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2\sigma _{X}^{2}(1-\rho ^{2})}}\left(x-\mu _{X}-\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)^{2}\right)\end{aligned}}$

Then, we can see that $X|(Y=y)\sim {\mathcal {N}}\left(\mu _{X}+\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y}),\sigma _{X}^{2}(1-\rho ^{2})\right)$ ,
and by symmetry (interchanging $X$ and $Y$ , and also interchanging $x$ and $y$ ), $Y|(X=x)\sim {\mathcal {N}}\left(\mu _{Y}+\rho \cdot {\frac {\sigma _{Y}}{\sigma _{X}}}(x-\mu _{X}),\sigma _{Y}^{2}(1-\rho ^{2})\right)$ .

$\Box$
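The result of this proof can be spot-checked numerically: at any point, the ratio $f(x,y)/f_{Y}(y)$ should agree with the normal density having the stated conditional mean and variance. The sketch below does this at a single point. The parameter values and the evaluation point are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bivariate_normal_pdf(x, y, mu_x, mu_y, sd_x, sd_y, rho):
    """Joint pdf of the bivariate normal distribution, as recalled in the text."""
    zx = (x - mu_x) / sd_x
    zy = (y - mu_y) / sd_y
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    norm = 2 * math.pi * sd_x * sd_y * math.sqrt(1 - rho * rho)
    return math.exp(-q / 2) / norm

# Arbitrary illustrative parameters and evaluation point.
mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 0.5, 0.6
x, y = 0.3, -1.7

# Left-hand side: the definition f(x, y) / f_Y(y).
ratio = bivariate_normal_pdf(x, y, mu_x, mu_y, sd_x, sd_y, rho) / normal_pdf(y, mu_y, sd_y ** 2)

# Right-hand side: the normal density claimed by the proposition.
cond_mean = mu_x + rho * sd_x / sd_y * (y - mu_y)
cond_var = sd_x ** 2 * (1 - rho ** 2)
claimed = normal_pdf(x, cond_mean, cond_var)

print(cond_mean, cond_var)
print(ratio, claimed)  # the two values should agree up to rounding error
```

Agreement at one point is only a sanity check, not a proof, but repeating it at several points is a quick way to catch an algebra slip in the completing-the-square step.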

Conditional version of concepts

We can obtain conditional versions of concepts previously established for 'unconditional' distributions by replacing the 'unconditional' cdf, pdf or pmf, i.e. $F(\cdot )$ or $f(\cdot )$ , by their conditional counterparts, i.e. $F(\cdot {\color {darkgreen}|\cdot })$ or $f(\cdot {\color {darkgreen}|\cdot })$ .

Conditional independence

Definition. Random variables $X_{1},X_{2},\dotsc ,X_{n}$ are conditionally independent given $Y=y$ if and only if $F_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}(x_{1},\dotsc ,x_{n}{\color {darkgreen}|y})=F_{X_{1}{\color {darkgreen}|Y}}(x_{1}{\color {darkgreen}|y})\dotsb F_{X_{n}{\color {darkgreen}|Y}}(x_{n}{\color {darkgreen}|y})$ or $f_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}(x_{1},\dotsc ,x_{n}{\color {darkgreen}|y})=f_{X_{1}{\color {darkgreen}|Y}}(x_{1}{\color {darkgreen}|y})\dotsb f_{X_{n}{\color {darkgreen}|Y}}(x_{n}{\color {darkgreen}|y})$ for all real numbers $x_{1},\dotsc ,x_{n},{\color {darkgreen}y}$ and for each positive integer $n$ , in which $F_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}$ and $f_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}$ denote the joint cdf and probability function of $(X_{1},\dotsc ,X_{n})$ conditional on $Y=y$ respectively.

Remark.

For random variables, conditional independence and independence are not related, i.e. neither of them implies the other.

Example. (Conditional independence does not imply independence) TODO

Example. (Independence does not imply conditional independence) TODO
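One standard pair of counterexamples can be verified by exact enumeration. The two constructions below are assumptions introduced for illustration, not examples from the original text: (1) $Y$ picks one of two biased coins with equal probability and $X_{1},X_{2}$ are two tosses of the chosen coin; (2) $X_{1},X_{2}$ are independent fair bits and $Y$ is their exclusive-or. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob(joint, pred):
    """Probability of the event {pred(x1, x2, y) is true} under a joint pmf."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

def independent(joint):
    """Check P(X1=a, X2=b) == P(X1=a) P(X2=b) for all a, b in {0, 1}."""
    for a, b in product((0, 1), repeat=2):
        p_ab = prob(joint, lambda x1, x2, y: x1 == a and x2 == b)
        p_a = prob(joint, lambda x1, x2, y: x1 == a)
        p_b = prob(joint, lambda x1, x2, y: x2 == b)
        if p_ab != p_a * p_b:
            return False
    return True

def conditionally_independent(joint):
    """Check the same factorization after conditioning on Y = c, for c in {0, 1}."""
    for c in (0, 1):
        p_c = prob(joint, lambda x1, x2, y: y == c)
        for a, b in product((0, 1), repeat=2):
            p_ab = prob(joint, lambda x1, x2, y: (x1, x2, y) == (a, b, c)) / p_c
            p_a = prob(joint, lambda x1, x2, y: (x1, y) == (a, c)) / p_c
            p_b = prob(joint, lambda x1, x2, y: (x2, y) == (b, c)) / p_c
            if p_ab != p_a * p_b:
                return False
    return True

# Example 1 (assumed): Y picks one of two coins with equal probability;
# given Y = y, X1 and X2 are independent tosses with head probability bias[y].
bias = {0: Fraction(1, 4), 1: Fraction(3, 4)}
two_coins = {}
for x1, x2, y in product((0, 1), repeat=3):
    p1 = bias[y] if x1 else 1 - bias[y]
    p2 = bias[y] if x2 else 1 - bias[y]
    two_coins[(x1, x2, y)] = Fraction(1, 2) * p1 * p2

# Example 2 (assumed): X1, X2 are independent fair bits and Y = X1 XOR X2.
xor_bits = {(x1, x2, x1 ^ x2): Fraction(1, 4) for x1, x2 in product((0, 1), repeat=2)}

print(conditionally_independent(two_coins), independent(two_coins))  # True False
print(conditionally_independent(xor_bits), independent(xor_bits))    # False True
```

In the first construction, a head on the first toss makes the heavier coin more likely and so raises the chance of a head on the second toss. In the second, knowing $Y$ and $X_{1}$ determines $X_{2}$ completely.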

Conditional expectation

Definition. (Conditional expectation) Let $f_{X|Y}(x|y)$ be the conditional probability function of $X$ given $Y=y$ . Then, $\mathbb {E} [X{\color {darkgreen}|Y=y}]={\begin{cases}\displaystyle \sum _{x}^{}xf_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y}),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }xf_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})\,dx,&X{\text{ is continuous}}.\end{cases}}$

Remark.

$\mathbb {E} [X{\color {darkgreen}|Y=y}]$ is a function of $y$ .
the random variable $\mathbb {E} [*{\color {darkgreen}|Y=Y}]$ , which is a function of $Y$ after computing the expectation, is written as $\mathbb {E} [*{\color {darkgreen}|Y}]$ for brevity, in which both $*$ 's stand for the same term.
$\mathbb {E} [*{\color {darkgreen}|Y=y}]$ is a realization of $\mathbb {E} [*|Y]$ when $Y$ is observed to be $y$ , in which both $*$ 's stand for the same term.

Similarly, we have a conditional version of the law of the unconscious statistician.

Proposition. (Law of the unconscious statistician (conditional version)) Let $f_{X|Y}(x|y)$ be the conditional probability function of $X$ given $Y=y$ . Then, for each function $g(x)$ , $\mathbb {E} [g(X){\color {darkgreen}|Y=y}]={\begin{cases}\displaystyle \sum _{x}^{}g(x)f_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y}),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }g(x)f_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})\,dx,&X{\text{ is continuous}}.\end{cases}}$
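As a small worked illustration of the two formulas above, consider a made-up joint pmf on $\{0,1\}^{2}$ (the numbers are assumptions chosen only to keep the arithmetic short), with $g(x)=x^{2}+1$ :

```latex
f(0,0)=\tfrac{1}{8},\quad f(1,0)=\tfrac{3}{8},\quad f(0,1)=\tfrac{2}{8},\quad f(1,1)=\tfrac{2}{8}
\quad\Longrightarrow\quad f_{Y}(0)=f_{Y}(1)=\tfrac{1}{2},
\\[1ex]
\begin{aligned}
\mathbb{E}[X|Y=0] &= 0\cdot\frac{1/8}{1/2}+1\cdot\frac{3/8}{1/2}=\frac{3}{4},\\
\mathbb{E}[X^{2}+1|Y=1] &= (0^{2}+1)\cdot\frac{2/8}{1/2}+(1^{2}+1)\cdot\frac{2/8}{1/2}=\frac{1}{2}+1=\frac{3}{2}.
\end{aligned}
```

In each sum, the weights $f(x,y)/f_{Y}(y)$ are the conditional pmf values, so they add up to 1 for each fixed $y$ .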

Proposition. (Conditional expectation under independence) If random variables $X,Y$ are independent, $\mathbb {E} [g(X)|Y]=\mathbb {E} [g(X)]$ for each function $g$ .

Proof. $\mathbb {E} [g(X)|Y]={\begin{cases}\displaystyle \sum _{x}^{}g(x)f_{X|Y}(x|Y)=\sum _{x}^{}g(x)f_{X}(x)=\mathbb {E} [g(X)],&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }g(x)f_{X|Y}(x|Y)\,dx=\int _{-\infty }^{\infty }g(x)f_{X}(x)\,dx=\mathbb {E} [g(X)],&X{\text{ is continuous}}.\end{cases}}$

$\Box$

Remark.

This equality may not hold if $X,Y$ are not independent.

Example. Suppose the random vector $\mathbf {X} =(Y,Z)^{T}$ in which $Y,Z$ are independent random variables, and $g(\mathbf {x} )=y+z$ . Then, $\mathbb {E} [g(\mathbf {X} )|Y]=\mathbb {E} [\underbrace {Y} _{{\text{constant given }}Y}+Z|Y]=Y+\mathbb {E} [Z],$ ( $Y$ is treated as a constant because of the conditioning: once the value of $Y$ is observed, it is fixed) but $\mathbb {E} [g(\mathbf {X} )]=\mathbb {E} [Y+Z]=\mathbb {E} [Y]+\mathbb {E} [Z]\neq \mathbb {E} [g(\mathbf {X} )|Y]$ in general, since $g(\mathbf {X} )$ and $Y$ are not independent.

The properties of $\mathbb {E} [\cdot ]$ still hold for conditional expectations $\mathbb {E} [\cdot {\color {darkgreen}|Y}]$ , with every 'unconditional' expectation replaced by conditional expectation and some suitable modifications, as follows:

Proposition. (Properties of conditional expectation) For each random variable $Y$ ,

(linearity) $\mathbb {E} [\underbrace {\alpha {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}X_{1}+\underbrace {\beta {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}X_{2}+\underbrace {\gamma {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}{\color {darkgreen}|Y}]=\alpha {\color {darkgreen}(Y)}\mathbb {E} [X_{1}{\color {darkgreen}|Y}]+\beta {\color {darkgreen}(Y)}\mathbb {E} [X_{2}{\color {darkgreen}|Y}]+\gamma {\color {darkgreen}(Y)}$

for each function $\alpha (Y),\beta (Y),\gamma (Y)$ of $Y$ and for each random variable $X_{1},X_{2}$

(nonnegativity) if $X{\color {darkgreen}|Y}\geq 0$ , $\mathbb {E} [X{\color {darkgreen}|Y}]\geq 0$
(monotonicity) if $X_{1}\geq X_{2}$ , $\mathbb {E} [X_{1}{\color {darkgreen}|Y}]\geq \mathbb {E} [X_{2}{\color {darkgreen}|Y}]$ for each random variable $X_{1},X_{2}$
(triangle inequality)

$|\mathbb {E} [X{\color {darkgreen}|Y}]|\leq \mathbb {E} [|X|{\color {darkgreen}|Y}]$

(multiplicativity under independence) if $X_{1},X_{2}$ are conditionally independent given $Y$ ,

$\mathbb {E} [X_{1}X_{2}{\color {darkgreen}|Y}]=\mathbb {E} [X_{1}{\color {darkgreen}|Y}]\mathbb {E} [X_{2}{\color {darkgreen}|Y}]$

Proof. The proof is similar to the one for 'unconditional' expectations.

$\Box$

Remark.

$\alpha (Y),\beta (Y),\gamma (Y)$ are treated as constants given $Y$ , since after observing the value of $Y$ , they cannot be changed.
Each result also holds with $Y$ replaced by random vectors $(Y_{1},\dotsc ,Y_{s})^{T}$ .

The following theorem about conditional expectation is quite important.

Theorem. (Law of total expectation) For each function $g(x)$ and for each pair of random variables $X,Y$ , $\mathbb {E} {\big [}\underbrace {\mathbb {E} [g(X)|Y]} _{{\text{function of }}Y}{\big ]}=\mathbb {E} [g(X)].$

Proof. $\mathbb {E} [\mathbb {E} [g(X)|Y]]={\begin{cases}\displaystyle \sum _{y}^{}\mathbb {E} [g(X)|Y=y]f_{Y}(y)=\sum _{x}^{}{\bigg (}\sum _{y}^{}g(x)\overbrace {f_{X|Y}(x|y)} ^{f(x,y){\cancel {/f_{Y}(y)}}}{\cancel {f_{Y}(y)}}{\bigg )}=\sum _{x}^{}g(x){\bigg (}\overbrace {\sum _{y}^{}f(x,y)} ^{f_{X}(x)}{\bigg )}=\mathbb {E} [g(X)],&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }\mathbb {E} [g(X)|Y=y]f_{Y}(y)\,dy=\int _{-\infty }^{\infty }{\bigg (}\int _{-\infty }^{\infty }g(x)\underbrace {f_{X|Y}(x|y)} _{f(x,y){\cancel {/f_{Y}(y)}}}\,dx{\bigg )}{\cancel {f_{Y}(y)}}\,dy=\int _{-\infty }^{\infty }g(x){\bigg (}\underbrace {\int _{-\infty }^{\infty }f(x,y)\,dy} _{f_{X}(x)}{\bigg )}\,dx=\mathbb {E} [g(X)],&X{\text{ is continuous}}.\end{cases}}$

$\Box$
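The discrete case of this proof can be traced on a concrete table. The sketch below computes $\mathbb {E} [g(X)|Y=y]$ for each $y$ , averages these values with weights $f_{Y}(y)$ , and compares the result with the direct computation of $\mathbb {E} [g(X)]$ . The joint pmf and the function `g` are made up for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf {(x, y): probability}, with x in {0, 1, 2}, y in {0, 1}.
joint = {
    (0, 0): Fraction(2, 10), (1, 0): Fraction(1, 10), (2, 0): Fraction(1, 10),
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
}

def g(x):
    """An arbitrary function of x chosen for the example."""
    return x * x + 1

def f_Y(y):
    """Marginal pmf of Y."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_expectation(y):
    """E[g(X) | Y = y] = sum_x g(x) f(x, y) / f_Y(y)."""
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / f_Y(y)

ys = sorted({y for (_, y) in joint})

# Outer expectation over Y of the inner conditional expectation.
outer = sum(cond_expectation(y) * f_Y(y) for y in ys)

# Direct computation of E[g(X)] from the joint pmf.
direct = sum(g(x) * p for (x, _), p in joint.items())

print([cond_expectation(y) for y in ys])  # differs across y
print(outer, direct)                      # equal, as the theorem states
```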

Remark.

We can replace $g(X)$ by $g(X,Y,Z,\dotsc )$ and get

$\mathbb {E} [g(X,Y,Z,\dotsc )]=\mathbb {E} [\mathbb {E} [g(X,{\color {darkgreen}Y},Z,\dotsc ){\color {darkgreen}|Y}]]=\mathbb {E} [\mathbb {E} [g(X,{\color {darkgreen}Y,Z,\dotsc }){\color {darkgreen}|Y,Z,\dotsc }]]=\dotsb$

Corollary. (Generalized law of total probability) For each event $A$ , $\mathbb {E} _{Y}[\mathbb {P} (A|{\color {darkgreen}Y})]=\mathbb {P} (A).$

Proof.

First,

$\mathbb {E} [\mathbf {1} \{A\}|Y]=1\cdot \mathbb {P} (\mathbf {1} \{A\}=1|Y)+0\cdot \mathbb {P} (\mathbf {1} \{A\}=0|Y)=\mathbb {P} (A|Y).$

Then, using law of total expectation,

$\mathbb {E} _{Y}[\mathbb {P} (A|{\color {darkgreen}Y})]{\overset {\text{ above }}{=}}\mathbb {E} _{Y}[\mathbb {E} [\mathbf {1} \{A\}|{\color {darkgreen}Y}]]=\mathbb {E} [\mathbf {1} \{A\}]=\mathbb {P} (A).$

$\Box$

Remark.

The expectation is taken with respect to $Y$ , so we use the $\mathbb {E} _{Y}[\cdot ]$ notation. When needed, we use similar subscripts to indicate the random variable with respect to which an expectation is taken.
We can replace $Y$ by $(Y_{1},\dotsc ,Y_{s})$ , which is a random vector.
If $Y$ is discrete, then the expanded form of the result is $\sum _{i}^{}\mathbb {P} (A|{\color {darkgreen}Y=i})\mathbb {P} ({\color {darkgreen}Y=i})=\mathbb {P} (A)$ (discrete case for law of total probability).
If $Y$ is continuous, then the expanded form of the result is $\int _{\operatorname {supp} (Y)}\mathbb {P} (A|{\color {darkgreen}Y=y})f_{Y}(y)\,dy=\mathbb {P} (A)$ (continuous case for law of total probability).

Corollary. (Expectation version of law of total probability) Suppose the sample space $\Omega =A_{1}\cup A_{2}\cup \dotsb$ in which $A_{i}$ 's are mutually exclusive. Then, $\mathbb {E} [X]=\mathbb {E} [X|A_{1}]\mathbb {P} (A_{1})+\mathbb {E} [X|A_{2}]\mathbb {P} (A_{2})+\dotsb .$

Proof. Define $Y=i$ if $A_{i}$ occurs, in which $i$ is a positive integer. Then, $\mathbb {E} [X]=\mathbb {E} _{Y}[\mathbb {E} _{X}[X|Y]]=\sum _{i=1}^{\infty }\mathbb {E} _{X}[X|Y=i]\mathbb {P} (Y=i)=\sum _{i=1}^{\infty }\mathbb {E} [X|A_{i}]\mathbb {P} (A_{i})$

$\Box$

Remark.

the number of events can be finite, as long as they are mutually exclusive and their union is the whole sample space
if $X=\mathbf {1} \{B\}$ , it reduces to law of total probability

Example. Let $X$ be the human height in m. A person is randomly selected from a population consisting of the same number of men and women. Given that the mean height of a man is 1.8 m and that of a woman is 1.7 m, the mean height of the entire population is $\mathbb {E} [X]=\mathbb {E} [X|\{{\text{man selected}}\}]\mathbb {P} ({\text{man selected}})+\mathbb {E} [X|\{{\text{woman selected}}\}]\mathbb {P} ({\text{woman selected}})=1.8(1/2)+1.7(1/2)=1.75.$

Corollary. (formula of expectation conditional on event) For each random variable $X$ and event $A$ with $\mathbb {P} (A)>0$ , $\mathbb {E} [X|A]={\frac {\mathbb {E} [X\mathbf {1} \{A\}]}{\mathbb {P} (A)}}.$

Proof. By the expectation version of the law of total probability, applied to $X\mathbf {1} \{A\}$ with the partition $\Omega =A\cup A^{c}$ , $\mathbb {E} [X\mathbf {1} \{A\}]=\mathbb {E} [X\underbrace {\mathbf {1} \{A\}} _{1}|A]\mathbb {P} (A)+\mathbb {E} [X\underbrace {\mathbf {1} \{A\}} _{0}|A^{c}]\mathbb {P} (A^{c})=\mathbb {E} [X|A]\mathbb {P} (A),$ and the result follows since $\mathbb {P} (A)>0$ .

$\Box$
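A short sketch of this formula, assuming $X$ is the outcome of a fair six-sided die and $A$ is the event that the outcome is even (both choices are illustrative assumptions, and `expectation_given_event` is a hypothetical helper name):

```python
from fractions import Fraction

# pmf of a fair six-sided die (assumed example).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation_given_event(pmf, event):
    """E[X | A] = E[X 1{A}] / P(A), where event(x) says whether x lies in A."""
    p_event = sum(p for x, p in pmf.items() if event(x))
    if p_event == 0:
        raise ValueError("P(A) must be positive")
    numerator = sum(x * p for x, p in pmf.items() if event(x))  # E[X 1{A}]
    return numerator / p_event

def is_even(x):
    return x % 2 == 0

def is_odd(x):
    return x % 2 == 1

e_even = expectation_given_event(pmf, is_even)
e_odd = expectation_given_event(pmf, is_odd)
print(e_even, e_odd)  # 4 3

# Weighted average of the two conditional expectations recovers E[X].
mean = sum(x * p for x, p in pmf.items())
print(e_even * Fraction(1, 2) + e_odd * Fraction(1, 2) == mean)  # True
```

Here $\mathbb {E} [X\mathbf {1} \{A\}]=(2+4+6)/6=2$ and $\mathbb {P} (A)=1/2$ , so the formula gives $\mathbb {E} [X|A]=4$ , matching the mean of the uniform distribution on $\{2,4,6\}$ .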

Remark.

if $X=\mathbf {1} \{B\}$ , it reduces to the definition of the conditional probability $\mathbb {P} (B|A)$ by the fundamental bridge between probability and expectation

After defining conditional expectation, we can also have conditional variance, covariance and correlation coefficient, since variance, covariance, and correlation coefficient are built upon expectation.

Conditional expectations of bivariate normal distribution

Proposition. (Conditional expectations of bivariate normal distribution) Let $(X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})$ . Then, $\mathbb {E} [X|Y=y]=\mathbb {E} [X]+\rho (X,Y)\cdot {\frac {\sqrt {\operatorname {Var} (X)}}{\sqrt {\operatorname {Var} (Y)}}}(y-\mathbb {E} [Y]),{\text{ and }}\mathbb {E} [Y|X=x]=\mathbb {E} [Y]+\rho (X,Y)\cdot {\frac {\sqrt {\operatorname {Var} (Y)}}{\sqrt {\operatorname {Var} (X)}}}(x-\mathbb {E} [X]).$

Proof.

The result follows from the proposition about conditional distributions of bivariate normal distribution readily.

$\Box$

Conditional variance

Definition. (Conditional variance) The conditional variance of random variable $X$ given $Y=y$ is $\operatorname {Var} (X{\color {darkgreen}|Y=y})=\mathbb {E} [(X-\mathbb {E} [X{\color {darkgreen}|Y=y}])^{2}{\color {darkgreen}|Y=y}].$

Similarly, we have properties of conditional variance which are similar to those of variance.

Proposition. (Properties of conditional variance) For each random variable $X,Y$ ,

(alternative formula of conditional variance) $\operatorname {Var} (X{\color {darkgreen}|Y})=\mathbb {E} [X^{2}{\color {darkgreen}|Y}]-(\mathbb {E} [X{\color {darkgreen}|Y}])^{2}$
(invariance under change in location parameter) $\operatorname {Var} (X+a{\color {darkgreen}(Y)}{\color {darkgreen}|Y})=\operatorname {Var} (X{\color {darkgreen}|Y})$
(homogeneity of degree two) $\operatorname {Var} (b{\color {darkgreen}(Y)}X{\color {darkgreen}|Y})=\left(b{\color {darkgreen}(Y)}\right)^{2}\operatorname {Var} (X{\color {darkgreen}|Y})$
(nonnegativity) $\operatorname {Var} (X{\color {darkgreen}|Y})\geq 0$
(zero variance implies non-randomness) $\operatorname {Var} (X{\color {darkgreen}|Y})=0\Leftrightarrow \mathbb {P} (X=c{\color {darkgreen}(Y)|Y})=1$ for some function $c(Y)$ of $Y$
(additivity under independence) if $X_{1},\dotsc ,X_{n}$ are conditionally independent given $Y$ , $\operatorname {Var} (X_{1}+\dotsb +X_{n}{\color {darkgreen}|Y})=\operatorname {Var} (X_{1}{\color {darkgreen}|Y})+\dotsb +\operatorname {Var} (X_{n}{\color {darkgreen}|Y})$

Proof. The proof is similar to the one for properties of variance.

$\Box$

Besides the law of total expectation, we also have the law of total variance, as follows:

Proposition. (Law of total variance) For each random variable $X,Y$ , $\operatorname {Var} (X)=\mathbb {E} [\operatorname {Var} (X|Y)]+\operatorname {Var} (\mathbb {E} [X|Y]).$

Proof. ${\begin{aligned}\mathbb {E} [\operatorname {Var} (X|Y)]+\operatorname {Var} (\mathbb {E} [X|Y])&=\mathbb {E} \left[\mathbb {E} [X^{2}|Y]-(\mathbb {E} [X|Y])^{2}\right]+\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]-(\mathbb {E} [\mathbb {E} [X|Y]])^{2}\\&=\mathbb {E} [\mathbb {E} [X^{2}|Y]]{\cancel {-\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]}}{\cancel {+\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]}}-(\mathbb {E} [\mathbb {E} [X|Y]])^{2}\\&=\mathbb {E} [X^{2}]-(\mathbb {E} [X])^{2}\qquad {\text{by law of total expectation}}\\&=\operatorname {Var} (X)\end{aligned}}$

$\Box$

Remark.

We can replace $Y$ by $(Y_{1},\dotsc ,Y_{s})^{T}$ , a random vector.
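The decomposition can be checked exactly on a small discrete example. The joint pmf below is made up for illustration, and the helper names are hypothetical. The sketch computes the "within-group" term $\mathbb {E} [\operatorname {Var} (X|Y)]$ and the "between-group" term $\operatorname {Var} (\mathbb {E} [X|Y])$ separately and compares their sum with $\operatorname {Var} (X)$ .

```python
from fractions import Fraction

# Hypothetical joint pmf {(x, y): probability}.
joint = {
    (0, 0): Fraction(2, 10), (1, 0): Fraction(1, 10), (2, 0): Fraction(1, 10),
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
}

def f_Y(y):
    """Marginal pmf of Y."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_moment(k, y):
    """E[X^k | Y = y]."""
    return sum(x ** k * p for (x, yy), p in joint.items() if yy == y) / f_Y(y)

ys = sorted({y for (_, y) in joint})
cond_mean = {y: cond_moment(1, y) for y in ys}
cond_var = {y: cond_moment(2, y) - cond_mean[y] ** 2 for y in ys}

# E[Var(X|Y)]: average of the conditional variances over Y.
expected_cond_var = sum(cond_var[y] * f_Y(y) for y in ys)

# Var(E[X|Y]): variance of the conditional mean, viewed as a function of Y.
mean_of_cond_mean = sum(cond_mean[y] * f_Y(y) for y in ys)
var_cond_mean = sum(cond_mean[y] ** 2 * f_Y(y) for y in ys) - mean_of_cond_mean ** 2

# Var(X) computed directly from the joint pmf.
mean_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x * x * p for (x, _), p in joint.items()) - mean_x ** 2

print(expected_cond_var, var_cond_mean, var_x)
print(expected_cond_var + var_cond_mean == var_x)  # True
```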

Conditional variances of bivariate normal distribution

Proposition. (Conditional variances of bivariate normal distribution) Let $(X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})$ . Then, $\operatorname {Var} (X|Y=y)={\big (}1-(\rho (X,Y))^{2}{\big )}\operatorname {Var} (X),{\text{ and }}\operatorname {Var} (Y|X=x)={\big (}1-(\rho (X,Y))^{2}{\big )}\operatorname {Var} (Y).$

Proof.

The result follows from the proposition about conditional distributions of bivariate normal distribution readily.

$\Box$

Remark.

It can be observed that the exact values of $x$ and $y$ in the conditions do not matter. The result is the same for different values of them.

Conditional covariance

Definition. (Conditional covariance) The conditional covariance of $X$ and $Y$ given $Z=z$ is $\operatorname {Cov} (X,Y{\color {darkgreen}|Z=z})=\mathbb {E} [(X-\mathbb {E} [X{\color {darkgreen}|Z=z}])(Y-\mathbb {E} [Y{\color {darkgreen}|Z=z}]){\color {darkgreen}|Z=z}]$

Proposition. (Properties of conditional covariance)

(i) (symmetry) for each random variable $X,Y$ , $\operatorname {Cov} (X,Y{\color {darkgreen}|Z})=\operatorname {Cov} (Y,X{\color {darkgreen}|Z})$
(ii) for each random variable $X$ , $\operatorname {Cov} (X,X{\color {darkgreen}|Z})=\operatorname {Var} (X{\color {darkgreen}|Z})$
(iii) (alternative formula of covariance) $\operatorname {Cov} (X,Y{\color {darkgreen}|Z})=\mathbb {E} [XY{\color {darkgreen}|Z}]-\mathbb {E} [X{\color {darkgreen}|Z}]\mathbb {E} [Y{\color {darkgreen}|Z}]$
(iv) for each constant $a_{1},\dotsc ,a_{n},b_{1},\dotsc ,b_{m},c,d$ , and for each random variable $X_{1},\dotsc ,X_{n},Y_{1},\dotsc ,Y_{m}$ , $\operatorname {Cov} \left(\sum _{i=1}^{n}(a_{i}X_{i}+c),\sum _{j=1}^{m}(b_{j}Y_{j}+d){\color {darkgreen}|Z}\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}a_{i}b_{j}\operatorname {Cov} (X_{i},Y_{j}{\color {darkgreen}|Z})$
(v) for each random variable $X_{1},\dotsc ,X_{n}$ , $\operatorname {Var} (X_{1}+\dotsb +X_{n}{\color {darkgreen}|Z})=\sum _{i=1}^{n}\operatorname {Var} (X_{i}{\color {darkgreen}|Z})+2\sum _{1\leq i<j\leq n}^{}\operatorname {Cov} (X_{i},X_{j}{\color {darkgreen}|Z})$

Conditional correlation coefficient

Definition. (Conditional correlation coefficient) The conditional correlation coefficient of random variables $X$ and $Y$ given $Z=z$ is $\rho (X,Y{\color {darkgreen}|Z=z})={\frac {\operatorname {Cov} (X,Y{\color {darkgreen}|Z=z})}{\sqrt {\operatorname {Var} (X{\color {darkgreen}|Z=z})\operatorname {Var} (Y{\color {darkgreen}|Z=z})}}}.$

Remark.

Similar to the 'unconditional' correlation coefficient, the conditional correlation coefficient also lies between $-1$ and $1$ inclusive. The proof is similar, with every unconditional term replaced by its conditional counterpart.

Conditional quantile

Definition. (Conditional quantile) The conditional $\alpha$ th quantile of $X$ given $Y=y$ is $\inf\{x\in \mathbb {R} :F_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})>\alpha \}.$

Remark.

Then, we can have the conditional median, interquartile range, etc., which are defined using the conditional quantile in the same way as the unconditional ones.
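For a discrete $X$ with finitely many values, the infimum in the definition is attained at the smallest support point whose conditional cdf value exceeds $\alpha$ , for $0\leq \alpha <1$ . The sketch below follows the strict inequality used in the definition above. The joint pmf is made up for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint pmf {(x, y): probability}.
joint = {
    (0, 0): Fraction(2, 10), (1, 0): Fraction(1, 10), (2, 0): Fraction(1, 10),
    (0, 1): Fraction(1, 10), (1, 1): Fraction(2, 10), (2, 1): Fraction(3, 10),
}

def conditional_cdf(joint, y):
    """Sorted (x, F_{X|Y}(x | y)) pairs over the support of X given Y = y."""
    fy = sum(p for (_, yy), p in joint.items() if yy == y)
    running, pairs = 0, []
    for x in sorted(x for (x, yy) in joint if yy == y):
        running += joint[(x, y)] / fy
        pairs.append((x, running))
    return pairs

def conditional_quantile(joint, y, alpha):
    """inf{x : F_{X|Y}(x | y) > alpha} for a finite discrete X, with 0 <= alpha < 1."""
    for x, cdf_value in conditional_cdf(joint, y):
        if cdf_value > alpha:
            return x
    raise ValueError("alpha must be less than 1")

print(conditional_cdf(joint, 1))                       # cdf values 1/6, 1/2, 1
print(conditional_quantile(joint, 1, Fraction(1, 4)))  # 1
print(conditional_quantile(joint, 1, Fraction(1, 2)))  # 2
```

With $\alpha =1/2$ the conditional cdf equals $1/2$ at $x=1$ but does not exceed it, so the strict inequality moves the conditional median to $x=2$ .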