Independence and conditioning


Chapter 4: Independence and conditioning

1. Independence

In elementary probability, it is common to introduce conditional probability first and independence afterwards. Here, however, independence is introduced without mentioning conditioning at all. Moreover, contrary to the usual order, independent random variables are introduced before independent events. A brief introduction to conditioning is given in a later section.

Contents
 1.  Independence
 2.  Conditional expectations
   2.1.  Elementary conditional probabilities
   2.2.  Conditional expectations
   2.3.  Regular Conditional Distributions
 3.  Joint distribution with density

Definition 1. The random variables X _ i, 1\leqslant i\leqslant n, are said to be independent, iff for any Borel sets B _ i we have
\[
\mathbb P\Big(\bigcap _ {i=1}^n(X _ i\in B _ i)\Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\in B _ i).
\]
The random variables of an infinite family are said to be independent iff those in every finite subfamily are. They are said to be pairwise independent iff every two of them are independent.

Later we will see that, in terms of independent \sigma-algebras, the random variables X _ i are independent if and only if the family \{\sigma(X _ i)\} _ {i=1}^n of \sigma-algebras is independent.

Note that if X _ 1,\dots,X _ n are independent, then the random variables in every subfamily are also independent, since we may take some of the B _ i's to be \mathbb R. On the other hand, full independence follows from the apparently weaker condition: for all x _ 1,\dots,x _ n\in\mathbb R,
\[
\mathbb P\Big(\bigcap _ {i=1}^n(X _ i\leqslant x _ i) \Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\leqslant x _ i).
\]

The equivalence is not proved here. Written in terms of distribution functions, it is
\[
F(x _ 1,\dots,x _ n)=\prod _ {i=1}^{n}F _ i(x _ i).
\]

Definition 2. The events \{E _ i\} are said to be independent, iff their indicators are independent; this is equivalent to: for any finite subset \{i _ 1,\dots,i _ l\} of the index set, we have
\[
\mathbb{P}\Big(\bigcap _ {j=1}^lE _ {i _ j}\Big)=\prod _ {j=1}^{l}\mathbb{P}(E _ {i _ j}).
\]

The equivalence in this definition can be verified directly and is not shown here. The latter condition is the more common definition, since it does not rely on the notion of independent random variables and is the natural generalization of the case of two independent events.

Theorem 3. If X _ 1,\dots,X _ n are independent variables and f _ 1,\dots,f _ n are Borel measurable functions, then f _ 1(X _ 1),\dots,f _ n(X _ n) are independent random variables.

Theorem 4. Let 1\leqslant n _ 1<n _ 2<\dots<n _ k=n; f _ 1 a Borel measurable function of n _ 1 variables, f _ 2 one of n _ 2-n _ 1 variables, ..., f _ k one of n _ k-n _ {k-1} variables. If \{X _ i\} _ {i=1}^n are independent random variables then f _ 1(X _ 1,\dots,X _ {n _ 1}),\dots,f _ k(X _ {n _ {k-1}+1},\dots,X _ {n _ k}) are independent.

Theorem 5. If X and Y are independent and both have finite expectations, then \mathbb E(XY)=\mathbb E(X)\mathbb E(Y).

We prove this theorem in two ways. The first is standard and therefore longer: first consider discrete X and Y, then arbitrary nonnegative ones; finally the general case follows as usual.

The second proof can be written as follows:
\[
\mathbb E(XY)=\int _ {\Omega}XY\, \mathrm d\mathbb P=\iint _ {\mathbb R^2}xy\, \mu^2(\mathrm dx,\mathrm dy).
\]

Note that \mu^2(B _ 1\times B _ 2)=\mu _ X(B _ 1)\mu _ Y(B _ 2), where B _ 1,B _ 2\in\mathcal B. We have
\[
\mathbb E(XY)=\int _ {\mathbb R}\int _ {\mathbb R}xy\, \mu _ X(\mathrm dx)\mu _ Y(\mathrm dy)=\int _ {\mathbb R}x\, \mu _ X(\mathrm dx)\int _ {\mathbb R}y\, \mu _ Y(\mathrm dy)=\mathbb E(X)\mathbb E(Y),
\]

finishing the proof. Observe that we are using here a very simple form of Fubini's theorem (see below). Indeed, the second proof appears so much shorter only because it relies on the theory of ''product measure''. You can check that the measure \mu^2 of (X,Y) is the product measure \mu _ X\times\mu _ Y iff X and Y are independent. Product measures will not be discussed further here.

Corollary 6. If \{X _ i\} _ {i=1}^n are independent random variables with finite expectations, then \mathbb E(\prod X _ i)=\prod \mathbb E(X _ i).
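As a quick numerical illustration of the corollary (a hypothetical sketch, not part of the proof), the following Monte Carlo check estimates \mathbb E(X _ 1X _ 2X _ 3) for three independent uniform variables on [0,1] and compares it with \mathbb E(X _ 1)\mathbb E(X _ 2)\mathbb E(X _ 3)=(1/2)^3:

```python
import random

# Estimate E(X1*X2*X3) for three independent U[0,1] variables
# and compare with E(X1)*E(X2)*E(X3) = (1/2)**3 = 0.125.
random.seed(0)
N = 200_000
acc = 0.0
for _ in range(N):
    acc += random.random() * random.random() * random.random()
estimate = acc / N

print(abs(estimate - 0.125) < 0.01)  # the estimate is close to 1/8
```

The standard error of the estimate is on the order of 0.0003, so the 0.01 tolerance is comfortable.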

The above definition of independent events is a kind of abstraction of our common sense. However, it is sometimes not easy to judge independence or dependence by intuition alone. Consider the example of rolling two dice. Let A _ i denote the event ''the sum of the face values is a multiple of i''. It can be verified that A _ 2,A _ 3 are independent, while A _ 2,A _ 5 are not. Neither fact is obvious without calculation.
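The dice claims can be checked by brute-force enumeration of the 36 equally likely outcomes; the following sketch (names are illustrative) verifies both:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# A_i: "the sum of the face values is a multiple of i".
A = {i: (lambda o, i=i: (o[0] + o[1]) % i == 0) for i in (2, 3, 5)}

p2, p3, p5 = prob(A[2]), prob(A[3]), prob(A[5])
p23 = prob(lambda o: A[2](o) and A[3](o))
p25 = prob(lambda o: A[2](o) and A[5](o))

print(p23 == p2 * p3)  # True: A_2 and A_3 are independent
print(p25 == p2 * p5)  # False: A_2 and A_5 are not
```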

It is well worth pointing out that pairwise independence does not imply mutual independence. Consider the following example. We set \Omega=\{\{1\},\{2\},\{3\},\{1,2,3\}\} and assign each sample point probability 1/4. Next, let A _ i=\{\{i\},\{1,2,3\}\}; then \mathbb P(A _ i)=1/2 and \mathbb P(A _ 1\cap A _ 2)=\mathbb P(A _ 1\cap A _ 3)=\mathbb P(A _ 2\cap A _ 3)=1/4, so A _ 1,A _ 2,A _ 3 are pairwise independent. However, they are not mutually independent, since 1/4=\mathbb{P}(A _ 1\cap A _ 2\cap A _ 3)\neq\mathbb{P}(A _ 1)\mathbb{P}(A _ 2)\mathbb{P}(A _ 3)=1/8.
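This finite example can be verified mechanically (a toy encoding of the sample points as strings):

```python
from fractions import Fraction

# The four sample points of the example, each with probability 1/4.
omega = ["1", "2", "3", "123"]
P = {w: Fraction(1, 4) for w in omega}

# A_i = { {i}, {1,2,3} }, encoded by the labels of its sample points.
A = {i: {str(i), "123"} for i in (1, 2, 3)}

def prob(event):
    return sum(P[w] for w in event)

# Pairwise independence: P(A_i ∩ A_j) = 1/4 = P(A_i) P(A_j).
pairwise = all(prob(A[i] & A[j]) == prob(A[i]) * prob(A[j])
               for i, j in [(1, 2), (1, 3), (2, 3)])
# Mutual independence fails: P(A_1 ∩ A_2 ∩ A_3) = 1/4, not 1/8.
triple = prob(A[1] & A[2] & A[3])
print(pairwise, triple == Fraction(1, 8))  # True False
```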

From the above examples, we see that intuition can mislead us. But as mentioned before, the definition is a mathematical abstraction of ''independence'' in daily life, so it should not be surprising that mathematical independence differs somewhat from our intuitive picture.

The notion of independence is extremely important in probability theory. It was central to the early development of the subject (say, until the 1930s). Today a number of theories for dependent random variables have been developed, though they are still far from complete; moreover, the theory and methods of the independent case remain the basic tools for studying dependent models. In practice, there are many events whose dependence is so weak that they can be treated as independent within the tolerance of error, which facilitates the solution of problems.

Now let us state the fundamental existence theorem of product measures. Since the proof is somewhat involved, it is not given here.

Theorem 7. Let a sequence of probability measures \{\mu _ i\} on (\mathbb R,\mathcal B), or equivalently their distributions be given. There exists a probability space (\Omega,\mathcal A,\mathbb P) and a sequence of independent random variables \{X _ i\} defined on it such that for each i, \mu _ i is the probability measure of X _ i.

Next we state the famous Fubini's theorem. In the earlier review of real analysis this theorem was given only in a special case (for Lebesgue measure), so a more general version is provided here.

Let (X,\mathcal A _ 1,\mu) and (Y,\mathcal A _ 2,\lambda) be \sigma-finite measure spaces, and let f be an (\mathcal A _ 1\times \mathcal A _ 2)-measurable function on X\times Y. Then for each x\in X, f(x,\cdot) is an \mathcal A _ 2-measurable function, and for each y\in Y, f(\cdot,y) is an \mathcal A _ 1-measurable function.

Theorem 8. With the above assumption, we have:
  • If 0\leqslant f\leqslant\infty, and if
    \[
    \varphi(x)=\int _ Y f(x,\cdot)\, \mathrm d\lambda,\quad \psi(y)=\int _ X f(\cdot, y)\, \mathrm d\mu,\quad (x\in X,\, y\in Y),
    \]

    then \varphi is \mathcal A _ 1-measurable, \psi is \mathcal A _ 2-measurable, and
    \[
    \int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda.
    \]

  • If f is complex and if
    \[
    \varphi^\ast(x)=\int _ Y|f(x,\cdot)|\, \mathrm d\lambda,\quad \int _ X\varphi^\ast\, \mathrm d\mu<\infty,
    \]

    then f\in L^1(\mu\times\lambda).

  • If f\in L^1(\mu\times\lambda), then f(x,\cdot)\in L^1(\lambda) for almost all x\in X, and f(\cdot,y)\in L^1(\mu) for almost all y\in Y; the functions \varphi and \psi, defined by the formula above almost everywhere, are in L^1(\mu) and L^1(\lambda), respectively, and it still holds that
    \[
    \int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda.
    \]
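For intuition, Fubini's identity can be checked mechanically on finite measure spaces, where all integrals reduce to finite sums (a toy sketch with hypothetical measures):

```python
from fractions import Fraction

# Two finite measure spaces: mu on X = {0,1}, lambda on Y = {0,1,2}.
mu = {0: Fraction(1, 3), 1: Fraction(2, 3)}
lam = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

def f(x, y):
    return x * y + 1

# Integral over the product measure, and the two iterated integrals.
double = sum(f(x, y) * mu[x] * lam[y] for x in mu for y in lam)
iter_xy = sum(sum(f(x, y) * lam[y] for y in lam) * mu[x] for x in mu)
iter_yx = sum(sum(f(x, y) * mu[x] for x in mu) * lam[y] for y in lam)
print(double == iter_xy == iter_yx)  # True
```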

If we consider the completion of measures, we have an alternative statement of Fubini's theorem.

Theorem 9. Let (X,\mathcal A _ 1,\mu) and (Y,\mathcal A _ 2,\lambda) be \sigma-finite measure spaces. Let \overline{\mathcal A _ 1\times \mathcal A _ 2} be the completion of \mathcal A _ 1\times \mathcal A _ 2. Let f be an \overline{\mathcal A _ 1\times \mathcal A _ 2}-measurable function on X\times Y. Then all conclusions of Fubini's theorem hold, the only difference being as follows:

The \mathcal A _ 2-measurability of f(x,\cdot) can be asserted only for almost all x\in X, so that \varphi(x) is only defined a.e. with respect to \mu; a similar statement holds for f(\cdot,y) and \psi.


2. Conditional expectations

2.1. Elementary conditional probabilities
Definition 10. Let (\Omega,\mathcal A,\mathbb P) be a probability space and B\in\mathcal A. We define the conditional probability given B for any A\in\mathcal A by: \mathbb P(A| B)=0, if \mathbb P(B)=0; otherwise,
\[
\mathbb P(A| B)=\frac{\mathbb P(A\cap B)}{\mathbb P(B)}.
\]

If \mathbb P(B)>0, it is obvious that \mathbb P(\cdot| B) is a probability measure on (\Omega,\mathcal A).

Let A,B\in\mathcal A with \mathbb P(A),\mathbb P(B)>0. Then ''A,B are independent'' is equivalent to \mathbb P(A| B)=\mathbb P(A), and also to \mathbb P(B| A)=\mathbb P(B). This is the most common definition of independent events in elementary probability, since it can be interpreted well from our intuitive background. The conditional probability \mathbb P(A| B) often differs from the unconditional probability \mathbb P(A), reflecting that the two events A,B have some kind of dependence. If the two coincide, then the occurrence of B does not influence the possibility of the occurrence of A at all.

Theorem 11. Let I be a countable set and let \{B _ i\} _ {i\in I} be pairwise disjoint sets with \mathbb P(\bigcup B _ i)=1. Then for any A\in\mathcal A,
\begin{equation}\label{31}
\mathbb P(A)=\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i).
\end{equation}

For any A\in\mathcal A with \mathbb P(A)>0 and any k\in I,
\begin{equation}\label{32}
\mathbb P(B _ k| A)=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\mathbb P(A)}=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i)}.
\end{equation}

From the derivation you may think (32) is somewhat trivial. However, (32) is a well-known formula in probability theory, named Bayes' formula (or Bayes' theorem, Bayes' law, Bayes' rule, etc.), on account of its practical and even philosophical significance. The \mathbb P(B _ i) are the estimated probabilities of occurrence before any further information is available; given the new information that the event A has occurred, the probabilities of the B _ i are re-estimated as \mathbb P(B _ i|A). This is common in daily life: an event previously considered nearly impossible can be made very probable by the occurrence of another event, and vice versa. Bayes' formula characterizes this quantitatively.

If the event A is treated as a ''result'', and the B _ i's are possible causes of this result, we can interpret (31) in some sense as ''inferring the result from the causes''; Bayes' formula does the opposite, for it ''infers the causes from the result''. Knowing that the result A has occurred, we may ask which of the many possible causes led to it. This is a very common question in daily life and in research. Bayes' formula asserts that the probability of the cause B _ k given the result is proportional to \mathbb P(A|B _ k)\mathbb P(B _ k).
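A small numerical illustration of (31) and (32), with hypothetical priors and likelihoods for a two-cause setup:

```python
from fractions import Fraction

# Hypothetical data: priors P(B_i) and likelihoods P(A | B_i).
prior = {1: Fraction(1, 100), 2: Fraction(99, 100)}   # P(B_i)
like = {1: Fraction(95, 100), 2: Fraction(5, 100)}    # P(A | B_i)

# Total probability formula (31), then Bayes' formula (32).
p_A = sum(like[i] * prior[i] for i in prior)
posterior = {i: like[i] * prior[i] / p_A for i in prior}

print(p_A)            # P(A)
print(posterior[1])   # P(B_1 | A): far larger than the prior 1/100
```

Note how the occurrence of A raises the probability of the a priori unlikely cause B _ 1 from 1/100 to about 0.16.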

From the discussion above, it is not surprising that Bayes' formula plays a remarkable role in statistics.

For the purpose of the next subsection, we consider a naive example of conditional expectation.

If X is a random variable with finite expectation and A\in\mathcal A with \mathbb P(A)>0, then the expectation of X with respect to the probability measure \mathbb P(\cdot|A) is given by
\[
\mathbb E(X|A)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|A):=\frac1{\mathbb P(A)}\int _ A X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ A X)}{\mathbb P(A)}.
\]
Clearly, \mathbb P(B|A)=\mathbb E(\chi _ B|A) for all B\in\mathcal A.
2.2. Conditional expectations

Let X\sim U[0,1], that is, X has density 1 over (0,1). Assume that, given the information X=x, the random variables Y _ 1,\dots, Y _ n are independent and each has the Bernoulli distribution with parameter x, i.e., \mathbb P(Y _ i=1)=x and \mathbb P(Y _ i=0)=1-x. (Although these very common distributions have not been reviewed so far, they are used here for a better explanation of the motivation.) So far, an expression like \mathbb P(\cdot\mid X=x) has not been defined, since \mathbb P(X=x)=0. However, we need it: in this example the distributions of the Y _ i are determined under the condition X=x.
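The two-stage model above is easy to simulate even before the conditional machinery is in place. A sketch (illustrative parameters): draw X uniformly, then draw the Y _ i as Bernoulli(X); unconditionally, \mathbb P(Y _ i=1)=\mathbb E(X)=1/2.

```python
import random

# Simulate: X ~ U[0,1]; given X = x, Y_1,...,Y_n i.i.d. Bernoulli(x).
random.seed(1)
N, n = 100_000, 5
count_first = 0
for _ in range(N):
    x = random.random()
    y = [1 if random.random() < x else 0 for _ in range(n)]
    count_first += y[0]

# Unconditionally P(Y_1 = 1) = E(X) = 1/2.
print(abs(count_first / N - 0.5) < 0.01)
```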

We first consider a more general situation in this section.

Definition 12. Given an integrable X and a sub-\sigma-algebra \mathcal F of \mathcal A, the conditional expectation of X given \mathcal F, denoted by \mathbb E(X|\mathcal F), is a random variable satisfying the two properties:
  • it is \mathcal F-measurable;
  • it has the same integral as X over any set in \mathcal F, i.e.,
    \[
    \int _ A\mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P,\quad \forall A\in\mathcal F,
    \]

    or \mathbb E(X\chi _ A)=\mathbb E(\mathbb E(X|\mathcal F)\chi _ A).

For B\in\mathcal A, \mathbb P(B|\mathcal F):=\mathbb E(\chi _ B|\mathcal F) is called a conditional probability of B given \mathcal F.


Theorem 13. \mathbb E(X|\mathcal F) exists and is unique (up to equality almost surely).

For the uniqueness, let Y and Y' satisfy the two properties, and let A=\{\omega\mid Y-Y'>0\}. Clearly A\in\mathcal F, and \int _ A (Y-Y')\, \mathrm d\mathbb P=0. Hence \mathbb P(A)=0; interchanging Y and Y', we conclude that Y=Y' a.s. For the existence, consider the set function \nu on \mathcal F defined by \nu(A)=\int _ A X\, \mathrm d\mathbb P for A\in\mathcal F. It is finite-valued and countably additive, hence a ''signed measure'' on \mathcal F. If \mathbb P(A)=0 then \nu(A)=0, so \nu is absolutely continuous with respect to \mathbb P: \nu\ll\mathbb P. The existence then follows from the Radon-Nikodym theorem, the resulting ''derivative'' \mathrm d\nu/\mathrm d\mathbb P being the desired random variable.

If Y is a random variable and X is integrable, then we write \mathbb E(X|Y):=\mathbb E(X|\sigma(Y)).

The next theorem gathers some properties of conditional expectation.

Theorem 14. Consider probability space (\Omega, \mathcal A,\mathbb P) and X,Y be integrable. Let \mathcal G\subseteq\mathcal F\subseteq \mathcal A be \sigma-algebras. Then:
  1. (Linearity) \mathbb E(\lambda X+Y\mid\mathcal F)=\lambda\, \mathbb E(X|\mathcal F)+\mathbb E(Y|\mathcal F).
  2. (Monotonicity) If X\geqslant Y a.s., then \mathbb E(X|\mathcal F)\geqslant\mathbb E(Y|\mathcal F).
  3. If XY is integrable and Y is \mathcal F-measurable, then
    \[
    \mathbb E(XY|\mathcal F)=Y\, \mathbb E(X|\mathcal F),\quad \mathbb E(Y|\mathcal F)=\mathbb E(Y|Y)=Y.
    \]

  4. (Tower property)
    \[
    \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)=\mathbb E(\mathbb E(X|\mathcal G)\mid\mathcal F)=\mathbb E(X|\mathcal G).
    \]

Hence \mathbb E(X|\mathcal G)=\mathbb E(X|\mathcal F) if and only if \mathbb E(X|\mathcal F) is \mathcal G-measurable.

  5. (Triangle inequality) \mathbb E(|X|\mid \mathcal F)\geqslant |\mathbb E(X|\mathcal F)|.
  6. (Independence) If \sigma(X) and \mathcal F are independent, then \mathbb E(X|\mathcal F)=\mathbb E(X).
  7. If for any A\in\mathcal F, \mathbb P(A)=0 or \mathbb P(A)=1, then \mathbb E(X|\mathcal F)=\mathbb E(X).
  8. (Dominated convergence) Assume |X _ n|\leqslant Y and X _ n\to X a.s. Then
    \[
    \lim _ {n\to\infty}\mathbb E(X _ n|\mathcal F)=\mathbb E(X|\mathcal F)\quad \text{a.s. and in }L^1(\mathbb P).
    \]

For (3), as usual we may suppose X,Y\geqslant0. The proof consists in observing that Y\, \mathbb E(X|\mathcal F) is \mathcal F-measurable and satisfies the defining relation \int _ A Y\, \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A XY\, \mathrm d\mathbb P for all A\in\mathcal F. This is true if Y=\chi _ B with B\in\mathcal F; hence it is true if Y is a simple \mathcal F-measurable random variable, and consequently, by monotone convergence, for every nonnegative \mathcal F-measurable Y, whether the limits are finite or infinite. Note that the integrability of Y\, \mathbb E(X|\mathcal F) is part of the assertion of the property. The second equality follows (it can also be seen directly from the definition, since the defining relation holds trivially).

(4) may be the most important property of conditional expectation relating to changing \sigma-algebras. The second equality follows from (3). Now let A\in\mathcal G, then A\in\mathcal F. We apply the defining relation twice:
\[
\int _ A \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)\, \mathrm d\mathbb P=\int _ A \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P.
\]
Hence \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G) satisfies the defining relation for \mathbb E(X|\mathcal G). Since it is \mathcal G-measurable, it is equal to the latter.

For (6), let A\in\mathcal F. Then X and \chi _ A are independent. Recalling the property of expectations of independent random variables, we have \int _ A \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\mathbb E(X\chi _ A)=\mathbb E(X)\mathbb P(A)=\int _ A\mathbb E(X)\, \mathrm d\mathbb P.

The proof of (8) is omitted here.
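On a finite probability space, the conditional expectation given the \sigma-algebra generated by a partition is just the (probability-weighted) average over each cell; with the uniform measure this is the plain cell average. The following sketch (hypothetical partitions) checks the tower property (4) in that setting:

```python
from fractions import Fraction

# Uniform probability on Omega = {0,...,5}; X(w) = w^2.
omega = range(6)
X = {w: Fraction(w * w) for w in omega}
F = [{0, 1}, {2, 3}, {4, 5}]   # finer partition, generates sigma-algebra F
G = [{0, 1, 2, 3}, {4, 5}]     # coarser: each G-cell is a union of F-cells

def cond_exp(Z, partition):
    # E(Z | partition): cell average (uniform probability).
    out = {}
    for cell in partition:
        avg = sum(Z[w] for w in cell) / len(cell)
        for w in cell:
            out[w] = avg
    return out

# Tower property: E(E(X|F) | G) = E(X|G).
lhs = cond_exp(cond_exp(X, F), G)
rhs = cond_exp(X, G)
print(lhs == rhs)  # True
```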

Corollary 15. Let \mathcal F\subseteq \mathcal A be a \sigma-algebra and X be a random variable with \mathbb E(X^2)<\infty. Then \mathbb E(X|\mathcal F) is the orthogonal projection of X on L^2(\Omega,\mathcal F,\mathbb P). That is, for any \mathcal F-measurable Y with \mathbb E(Y^2)<\infty,
\[
\mathbb E(X-Y)^2\geqslant\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right]
\]
with equality if and only if Y = \mathbb E(X|\mathcal F).

Using Jensen's inequality (proved below), one can show \mathbb E[\mathbb E(X|\mathcal F)^2]\leqslant\mathbb E(X^2)<\infty. By the Cauchy-Schwarz inequality, \mathbb E|XY|<\infty. Since Y is \mathcal F-measurable, by property (3),
\[
\mathbb E[\mathbb E(X|\mathcal F)Y]=\mathbb E(XY),\quad \mathbb E[X\, \mathbb E(X|\mathcal F)]=\mathbb E\{\mathbb E[X\, \mathbb E(X|\mathcal F)\mid\mathcal F]\}=\mathbb E\left[\mathbb E(X|\mathcal F)^2\right].
\]
Therefore, we have
\begin{align*}
\mathbb E(X-Y)^2-\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right] & =\mathbb{E}\left[X^2-2XY+Y^2-X^2+2X\mathbb E(X|\mathcal F)-\mathbb E(X|\mathcal F)^2\right] \\
& =\mathbb{E}\left[Y^2-2Y\mathbb E(X|\mathcal F)+\mathbb E(X|\mathcal F)^2\right]\\
& =\mathbb E\left[(Y-\mathbb E(X|\mathcal F))^2\right]\geqslant 0.
\end{align*}

Theorem 16 (Cauchy-Schwarz inequality). For square integrable X,Y we have
\[
\mathbb E(|XY|\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F).
\]

This inequality is given here to illustrate the caution necessary in handling conditional expectations. If we consider the nonnegativity of \mathbb E((X+\lambda Y)^2\mid\mathcal F), the problem arises that for each \lambda this holds only for almost all \omega, i.e., only up to a \mathbb P-null set N _ \lambda. The union \bigcup _ {\lambda\in\mathbb R}N _ \lambda cannot be ignored without comment. We get around this difficulty by first restricting \lambda to the rational numbers. Let N=\bigcup _ {\lambda\in\mathbb Q}N _ \lambda. Then N is a \mathbb P-null set, and the nonnegativity of the quadratic form in \lambda holds for every \omega\in \Omega\setminus N and every \lambda \in \mathbb Q. For \omega\in\Omega\setminus N,
\begin{align*}
& \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E((X+\lambda Y)^2\mid\mathcal F)(\omega)\Big]\geqslant0, \\
\Longrightarrow{} & \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E(Y^2|\mathcal F)(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\
\Longrightarrow{} & \inf _ {\lambda\in\mathbb R}\Big[\mathbb E(Y^2|\mathcal F)(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\
\Longrightarrow{} & \mathbb E(XY\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)\quad \text{a.s.}
\end{align*}
Applying the same argument to |X| and |Y| yields the stated inequality with \mathbb E(|XY|\mid\mathcal F).

Alternatively, the inequality can be proved in the following way, which avoids such difficulties:
\[
\mathbb E\Big(\frac{|XY|}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\Big|\mathcal F\Big)\leqslant\frac12\mathbb E\Big(\frac{X^2}{\mathbb E(X^2|\mathcal F)}\Big|\mathcal F\Big)+\frac12\mathbb E\Big(\frac{Y^2}{\mathbb E(Y^2|\mathcal F)}\Big|\mathcal F\Big).
\]

Hence
\[
\frac{\mathbb E(|XY|\mid\mathcal F)}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\leqslant\frac12+\frac12=1.
\]

Theorem 17 (Jensen's inequality). If \varphi is convex on \mathbb R and X,\varphi(X) are integrable random variables, then for each \sigma-algebra \mathcal F\subseteq \mathcal A:
\[
\varphi(\mathbb E(X|\mathcal F))\leqslant\mathbb E(\varphi(X)|\mathcal F).
\]

Proof. One might worry that the conditional expectation of \varphi(X) does not exist. However, since the negative part \varphi(X)^- of \varphi(X) is integrable, \mathbb E(\varphi(X)|\mathcal F) exists in a generalized sense. The details are not given here; the reader is advised to consult more advanced texts.

The right derivative of a convex function always exists. Denote the right derivative of \varphi at x by \varphi _ +'(x); then for all x,y\in \mathbb R, \varphi(y)\geqslant\varphi(x)+\varphi _ +'(x)(y-x). Replacing y by X and taking conditional expectations, by the linearity and monotonicity of (generalized) conditional expectations we obtain: for each x\in\mathbb R, there exists a \mathbb P-null set N _ x such that for every \omega\in\Omega\setminus N _ x,
\[
\mathbb E [ \varphi(X)|\mathcal F ] (\omega)\geqslant\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x].
\]
Let N=\bigcup _ {x\in\mathbb Q}N _ x; then N is also a \mathbb P-null set, and the inequality holds for every \omega\in\Omega\setminus N and every x\in \mathbb Q. The right derivative \varphi _ +'(x) of \varphi is right continuous, which can be checked as follows: \varphi _ +'(x)=\lim _ {y\to x^+}(\varphi(y)-\varphi(x))/(y-x) is increasing in x, so \lim _ {y\to x^+} \varphi _ +'(y)\geqslant \varphi _ +'(x); for the reverse inequality, note that for x<y<z we have \varphi _ +'(y)\leqslant (\varphi(z)-\varphi(y))/(z-y), and letting y\to x^+ and then z\to x^+ gives \lim _ {y\to x^+} \varphi _ +'(y)\leqslant \varphi _ +'(x). Thus \lim _ {y\to x^+} \varphi _ +'(y)=\varphi _ +'(x), i.e., \varphi _ +' is right continuous.

Now the function x\mapsto\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x] is also right continuous, so its supremum over x\in \mathbb Q equals its supremum over x\in\mathbb R. Note that taking x=\mathbb E(X|\mathcal F)(\omega) gives the value \varphi(\mathbb E(X|\mathcal F)(\omega)). Hence for every \omega\in\Omega\setminus N, we have
\[
\varphi(\mathbb E(X|\mathcal F)(\omega))\leqslant\sup _ {x\in\mathbb R}\Big[\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x]\Big]\leqslant\mathbb E [ \varphi(X)|\mathcal F ] (\omega).
\]
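Jensen's inequality is easy to check on a finite space, where the conditional expectation given a partition is the cell average (a toy sketch with \varphi(x)=x^2, which is convex):

```python
from fractions import Fraction

# Uniform probability on Omega = {0,1,2,3}; F generated by {{0,1},{2,3}}.
omega = range(4)
X = {0: Fraction(1), 1: Fraction(5), 2: Fraction(-2), 3: Fraction(4)}
cells = [{0, 1}, {2, 3}]

def cond_exp(Z):
    # E(Z|F): average of Z over each cell (uniform probability).
    out = {}
    for cell in cells:
        avg = sum(Z[w] for w in cell) / len(cell)
        for w in cell:
            out[w] = avg
    return out

cX = cond_exp(X)
phi_of_cond = {w: cX[w] ** 2 for w in omega}          # phi(E(X|F))
cond_of_phi = cond_exp({w: X[w] ** 2 for w in omega})  # E(phi(X)|F)
print(all(phi_of_cond[w] <= cond_of_phi[w] for w in omega))  # True
```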

2.3. Regular Conditional Distributions

Let X be a random variable with values in a measurable space (E,\mathcal E); that is, X is \mathcal A-\mathcal E-measurable. So far we can define the conditional probability \mathbb P(A|X) for fixed A\in\mathcal A only. However, we would like to define, for every x\in E, a probability measure \mathbb P(\cdot\mid X=x) such that for any A\in\mathcal A we have \mathbb P(A|X)(\omega)=\mathbb{P}(A\mid X=x) for \omega\in\{\omega\mid X(\omega)=x\}. In this subsection, we show how to do this.

Let Z be a \sigma(X)-measurable real random variable. It can be proved that there is a map \varphi:E\to\mathbb R such that: 1. \varphi is \mathcal E-measurable; 2. \varphi(X)=Z.

Lemma 18. Let (\Omega',\mathcal A') be a measurable space and let \Omega be a nonempty set. Let f : \Omega\to\Omega' be a map. If a map g : \Omega\to\overline{\mathbb R} is \sigma(f)-\mathcal B(\overline{\mathbb R})-measurable, then there is a measurable map \varphi:(\Omega',\mathcal A')\to(\overline{\mathbb R},\mathcal B(\overline{\mathbb R})) such that g = \varphi\circ f.

We now prove this lemma. First consider the case g\geqslant0. Then g can be written as \sum _ {n=1}^{\infty} \alpha _ n\chi _ {A _ n} with \alpha _ n\geqslant0. (A nonnegative measurable function is the limit of an increasing sequence of nonnegative measurable simple functions; hence g can be written as a countable sum of nonnegative scaled indicators \alpha _ n \chi _ {A _ n}.) Since g is \sigma(f)-measurable, we have A _ 1,A _ 2,\dots\in\sigma(f), which means that for each n there is B _ n\in\mathcal A' such that f^{-1}(B _ n)=A _ n. Hence \chi _ {A _ n}=\chi _ {B _ n}\circ f.

Now define \varphi:\Omega'\to\overline{\mathbb R} by \varphi=\sum _ {n=1}^{\infty}\alpha _ n\chi _ {B _ n}. Clearly, \varphi is \mathcal A'-\mathcal B(\overline{\mathbb R})-measurable and g=\varphi\circ f.

Now drop the assumption that g is nonnegative. Then there exist measurable maps \varphi^- and \varphi^+ such that g^-=\varphi^-\circ f and g^+=\varphi^+\circ f. Note that \min(g^+(\omega),g^-(\omega))=0 for all \omega. Hence we may define \varphi(\omega'):=\varphi^+(\omega')-\varphi^-(\omega') if \varphi^+(\omega')<\infty or \varphi^-(\omega')<\infty, and \varphi(\omega'):=0 elsewhere.

Let f=X and g=Z; we obtain the \varphi mentioned above. Now set Z=\mathbb E(Y|X); then the random variable \mathbb E(Y|X) is given by \varphi(X). From this observation, we can define \mathbb E(Y\mid X=x):

Definition 19. Let Y be integrable random variable and X:(\Omega,\mathcal A)\to(E,\mathcal E). We define the conditional expectation of Y given X by \mathbb E(Y\mid X=x):=\varphi(x), where \varphi is the function defined in the way above that satisfies: \varphi is \mathcal E-measurable and \varphi(X)=\mathbb E(Y|X).

Analogously, define \mathbb P(A\mid X=x)=\mathbb E(\chi _ A\mid X=x) for A\in\mathcal A.

For B\in\mathcal A with \mathbb P(B)>0, we have known that \mathbb P(\cdot|B) is a probability measure. Is it true for \mathbb P(\cdot\mid X=x)? The question is a bit tricky since for every given A\in\mathcal A, the expression \mathbb P(A\mid X=x) is defined for almost all x only, i.e., up to x in a null set depending on A. It seems that we would have some difficulties dealing with it. But let us first take a look at some useful definitions.

Definition 20. Let (\Omega _ 1,\mathcal A _ 1), (\Omega _ 2,\mathcal A _ 2) be measurable spaces. A map \kappa: \Omega _ 1\times \mathcal A _ 2\to[0,\infty] is called a (\sigma-)finite transition kernel (from \Omega _ 1 to \Omega _ 2) if:
  1. \omega _ 1\mapsto\kappa(\omega _ 1,A _ 2) is \mathcal A _ 1-measurable for any A _ 2\in\mathcal A _ 2;
  2. A _ 2\mapsto\kappa(\omega _ 1,A _ 2) is a (\sigma-)finite measure on (\Omega _ 2,\mathcal A _ 2) for any \omega _ 1\in\Omega _ 1.

If in (2) the measure is a probability measure for all \omega _ 1\in\Omega _ 1, then \kappa is called a stochastic kernel or a Markov kernel.

For a transition kernel if we also have \kappa(\omega _ 1,\Omega _ 2)\leqslant1 for any \omega _ 1\in\Omega _ 1, then \kappa is called sub-Markov or substochastic.


Definition 21. Let Y be a random variable with values in a measurable space (E,\mathcal E) and let \mathcal F\subseteq\mathcal A be a sub-\sigma-algebra. A Markov kernel \kappa _ {Y,\mathcal F} from (\Omega,\mathcal F) to (E,\mathcal E) is called a regular conditional distribution of Y given \mathcal F if
\[
\kappa _ {Y,\mathcal F}(\omega,B)=\mathbb P(\{Y\in B\}\mid \mathcal F)(\omega)\quad \text{a.e. for all }B\in\mathcal E,
\]

that is, if
\[
\int _ A \chi _ B(Y)\, \mathrm d\mathbb P=\int _ A \kappa _ {Y,\mathcal F}(\cdot,B)\, \mathrm d\mathbb P\quad \text{for all }A\in\mathcal F,\, B\in\mathcal E.
\]

In short, the function \kappa is called a regular conditional distribution of Y given \mathcal F if: 1. \omega\mapsto\kappa(\omega,B) is a version of \mathbb P(\{Y\in B\}\mid \mathcal F) for each B\in\mathcal E; 2. B\mapsto\kappa(\omega,B) is a probability measure on (E,\mathcal E).

Consider the special case where \mathcal F=\sigma(X) for a random variable X (with values in an arbitrary measurable space (E',\mathcal E')). Define the regular conditional distribution of Y given X by the Markov kernel
\[
(x,A)\mapsto\kappa _ {Y,X}(x,A):=\mathbb P(\{Y\in A\}\mid X=x)=\kappa _ {Y,\sigma(X)}(\omega,A)\quad\text{for }\omega\in X^{-1}(\{x\}),
\]

which is well defined since \kappa _ {Y,\sigma(X)}(\cdot,A) is \sigma(X)-measurable and hence constant on X^{-1}(\{x\}); if X^{-1}(\{x\}) is empty, we assign an arbitrary value.

For regular conditional distributions in \mathbb R, we have the following theorem:

Theorem 22. Let Y:(\Omega,\mathcal A)\to(\mathbb R,\mathcal B) be real-valued. Then there exists a regular conditional distribution \kappa _ {Y,\mathcal F} of Y given \mathcal F.

For the proof, we refer to other materials.

We are also interested in the situation where Y takes values in \mathbb R^n or in even more general spaces. We now extend the result to a larger class of ranges for Y. More definitions are needed but they are only briefly stated here. A measurable space (E,\mathcal E) is called a Borel space if there exists a Borel set B\in\mathcal B(\mathbb R) and a one-to-one map \varphi:E\to B such that \varphi is \mathcal E-\mathcal B(B)-measurable and the inverse map \varphi^{-1} is \mathcal B(B)-\mathcal E-measurable. In general topology, a Polish space is a separable completely metrizable topological space (i.e., a separable topological space whose topology is induced by a complete metric). If E is a Polish space with Borel \sigma-algebra \mathcal E, then (E,\mathcal E) is a Borel space.

Theorem 23. Let \mathcal F\subseteq\mathcal A be a sub-\sigma-algebra. Let Y be a random variable with values in a Borel space (E,\mathcal E) (hence, for example, E Polish, E=\mathbb R^d, E=\mathbb R^\infty, E=C[0,1], etc.). Then there exists a regular conditional distribution \kappa _ {Y,\mathcal F} of Y given \mathcal F.

For the proof, we refer to other materials.

To conclude, we pick up again the example with which we started. Define Y=(Y _ 1,\dots,Y _ n). By the theorem above (with E=\{0,1\}^n\subseteq\mathbb R^n), a regular conditional distribution exists:
\[
\kappa _ {Y,X}(x,\cdot)=\mathbb P(Y\in\cdot\mid X=x),\quad x\in[0,1].
\]

Indeed, for almost all x\in[0,1], \mathbb P(Y\in\cdot\mid X=x) is the product of n Bernoulli distributions with parameter x.
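Integrating the kernel against the law of X recovers the unconditional distribution of Y: for a binary vector y with k ones, \mathbb P(Y=y)=\int _ 0^1 x^k(1-x)^{n-k}\, \mathrm dx=k!\,(n-k)!/(n+1)! (a beta integral). A sketch verifying that these values form a probability distribution:

```python
from fractions import Fraction
from math import factorial
from itertools import product

# P(Y = y) = ∫_0^1 x^k (1-x)^(n-k) dx = k!(n-k)!/(n+1)!, k = sum(y).
n = 3

def p_vector(y):
    k = sum(y)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

total = sum(p_vector(y) for y in product((0, 1), repeat=n))
print(total == 1)  # the probabilities of all 2^n outcomes sum to 1
```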

Theorem 24. Let X be a random variable on (\Omega,\mathcal A,\mathbb P) with values in a Borel space (E,\mathcal E). Let \mathcal F\subseteq\mathcal A be a \sigma-algebra and let \kappa _ {X,\mathcal F} be a regular conditional distribution of X given \mathcal F. Further, let f:E\to\mathbb R be measurable and f(X) be integrable. Then
\[
\mathbb E(f(X)|\mathcal F)(\omega)=\int f(x)\, \kappa _ {X,\mathcal F}(\omega,\mathrm dx)\quad\text{for }\mathbb P\text{-almost all }\omega.
\]

For the proof, we refer to other materials.
3. Joint distribution with density

In this section we consider jointly distributed random variables with a density. Consider a family of random variables \{X _ i\} _ {i\in I} and let J\subseteq I be a finite subset. We have defined the (joint) distribution of \{X _ j\} _ {j\in J}; the joint distribution function of \{X _ j\} _ {j\in J} generalizes easily from the one-dimensional case. It is also possible to define the (joint) density of \{X _ j\} _ {j\in J}, that is, a function f _ J:\mathbb R^J\to[0,\infty) such that
\[
F _ J(\boldsymbol x)=\int _ {-\infty}^{x _ {j _ 1}}\dots\int _ {-\infty}^{x _ {j _ n}}f _ J(t _ 1,\dots,t _ n)\, \mathrm dt _ 1\dots\mathrm dt _ n,\quad \forall \boldsymbol x\in\mathbb R^J.
\]

If we further assume that f _ J is continuous, then independence can be characterized in terms of the joint density:

Theorem 25. The family \{X _ i\} _ {i\in I} is independent iff for any finite J\subseteq I we have
\[f _ J(\boldsymbol x)=\prod _ {j\in J}f _ j(x _ j)\quad\forall\boldsymbol x\in\mathbb R^J,\]

where f _ j is the marginal density, obtained from the joint density by integrating out the other variables x _ k\, (k\in J,\, k\neq j).

For discrete random variables, the probability mass function characterizes independence in the same way:

Theorem 26. The family \{X _ i\} _ {i\in I} is independent iff for any finite J\subseteq I we have
\[
p _ J(\boldsymbol x)=\prod _ {j\in J}p _ j(x _ j)
\]
for all \boldsymbol x in the range of (X _ j) _ {j\in J}.

Now we consider conditional expectations and conditional distributions for discrete and continuous random variables. If A is any set in \mathcal A with \mathbb P(A)>0, we know that \mathbb P _ A(\cdot), defined by \mathbb P _ A(E)=\mathbb P(E\cap A)\mathbin{/}\mathbb P(A), is a probability measure.

In the discrete case, let Y be discrete, Y=\sum _ n y _ n\chi _ {\Omega _ n} where \{\Omega _ n\} is a partition of \Omega, and consider \mathbb E(X\mid Y=y _ n). We will see that the definition in Section 3.4.3 coincides with the one in Section 3.4.1. In Section 3.4.1 we defined such a conditional expectation as
\[
\mathbb E(X|\Omega _ n)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|\Omega _ n)=\frac1{\mathbb P(\Omega _ n)}\int _ {\Omega _ n} X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ {\Omega _ n} X)}{\mathbb P(\Omega _ n)},
\]

which means that \mathbb E(X\mid Y=y _ n)={\mathbb E(\chi _ {\Omega _ n} X)}\mathbin{/}{\mathbb P(\Omega _ n)}. In the context of Section 3.4.2, we prove \mathbb E(X|Y)=\sum _ n \mathbb E(X|\Omega _ n)\chi _ {\Omega _ n}, i.e., for any \omega\in\Omega _ n, \mathbb E(X|Y)(\omega)=\mathbb E(X|\Omega _ n). Only the defining relation needs to be checked, and it suffices to show \int _ {\Omega _ n}\mathbb E(X|\Omega _ n)\, \mathrm d\mathbb P=\int _ {\Omega _ n}X\, \mathrm d\mathbb P. The left-hand side equals \mathbb E(X|\Omega _ n)\mathbb P(\Omega _ n) and the right-hand side is \mathbb E(\chi _ {\Omega _ n}X), so the defining relation follows from the equality above. From \mathbb E(X|Y)=\sum _ n \mathbb E(X|\Omega _ n)\chi _ {\Omega _ n} we recover \mathbb E(X\mid Y=y _ n)={\mathbb E(\chi _ {\Omega _ n} X)}\mathbin{/}{\mathbb P(\Omega _ n)}.
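The formula \mathbb E(X\mid Y=y _ n)=\mathbb E(\chi _ {\Omega _ n} X)\mathbin{/}\mathbb P(\Omega _ n) can be computed exactly on a small finite space. The following sketch (the sample space, X, and the parity partition are illustrative choices): \Omega=\{0,\dots,5\} with uniform \mathbb P, Y the parity of \omega, X(\omega)=\omega.

```python
from fractions import Fraction

# Sketch: E(X | Y = y_n) = E(chi_{Omega_n} X) / P(Omega_n) on a finite space.
omega_space = range(6)
P = Fraction(1, 6)                    # uniform probability of each outcome
X = lambda w: Fraction(w)
Y = lambda w: w % 2                   # Y induces the partition {evens, odds}

def cond_exp(y):
    block = [w for w in omega_space if Y(w) == y]   # Omega_n
    p_block = P * len(block)                        # P(Omega_n)
    return sum(X(w) * P for w in block) / p_block   # E(chi X) / P(Omega_n)

print(cond_exp(0), cond_exp(1))   # averages over {0,2,4} and {1,3,5}
```

As expected, the conditional expectations are just the averages of X over the blocks of the partition: 2 on the even block and 3 on the odd block.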

Now we focus on the continuous case, where the random variables have a continuous density. For simplicity we consider real-valued X,Y with joint density f(x,y). Denote the marginals by f _ X(x)=\int f(x,y)\, \mathrm dy and f _ Y(y)=\int f(x,y)\, \mathrm dx. First we consider the case f _ Y(y)>0 for all y. We are interested in \mathbb E(X\mid Y=y). If you are acquainted with conditional densities from an undergraduate probability course, you may guess this is \int xf _ {X|Y}(x)\, \mathrm dx=\int xf(x,y)\, \mathrm dx\mathbin{/}f _ Y(y)=:g(y). Recall that \mathbb E(X\mid Y=y) means the value of the function \varphi(Y) at Y=y, so we shall show that g can serve as a version of \varphi, i.e., g(Y) is the conditional expectation \mathbb E(X|Y). On the one hand, g(Y) is \sigma(Y)-measurable (g is Borel measurable, so the composition g\circ Y is \sigma(Y)-measurable); on the other hand, we need, for any A\in\sigma(Y), \int _ A g(Y)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P.

Theorem 27. Almost surely we have
\[
\mathbb E(X\mid Y=y)=\int _ {\mathbb R} xf _ {X|Y}(x)\, \mathrm dx=\frac{\int xf(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.
\]

If the set \Lambda=\mathbb R^2\setminus\{(X(\omega),Y(\omega))\mid \omega\in\Omega\} has positive measure, then from the equality
\[
\iint _ {\Lambda}f(x,y)\, \mathrm dx\mathrm dy=\mathbb P\big((X,Y)\in\Lambda\big)=0
\]

we know that f(x,y)=0 for almost every point of \Lambda.

For any \sigma(Y)-measurable set A, there exists a Borel set B\in\mathcal B such that Y^{-1}(B)=A, so
\[
\int _ A g(Y)\, \mathrm d\mathbb P=\int _ B g(y)\, \mu _ Y(\mathrm d y)=\int _ B \frac{\int xf(x,y)\, \mathrm dx}{f _ Y(y)}\, \mu _ Y(\mathrm d y)=\iint _ {\mathbb R\times B} xf(x,y)\, \mathrm dx\mathrm dy.
\]
On the other hand,
\[
\int _ A X\, \mathrm d\mathbb P=\iint _ {(X,Y)(A)}xf(x,y)\, \mathrm dx\mathrm dy.
\]

The set (X,Y)(A) is contained in \mathbb R\times B. Are the values of the above two integrals the same? For points (x,y) in \mathbb R\times B but not in (X,Y)(A): since they all belong to \Lambda (note that A=Y^{-1}(B)), we derived above that f(x,y)=0 for almost all of them. Thus the two integrals are equal.

For the case where the positivity f _ Y(y)=\int f(x,y)\, \mathrm dx>0 does not always hold, we can define g by the relation g(y)\int f(x,y)\, \mathrm dx=\int xf(x,y)\, \mathrm dx; that is, g(y) can take any value at those y where \int f(x,y)\, \mathrm dx=0. Note that this is enough for the proof.
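Theorem 27 can be checked against a closed form. The following sketch uses the joint density f(x,y)=x+y on [0,1]^2 (an illustrative choice), for which the ratio of integrals gives \mathbb E(X\mid Y=y)=\big(\tfrac13+\tfrac y2\big)\big/\big(\tfrac12+y\big), and compares this to a midpoint-rule evaluation of the two integrals.

```python
# Sketch: E(X | Y = y) = (∫ x f(x,y) dx) / (∫ f(x,y) dx) for f(x,y) = x + y
# on the unit square, compared with the closed form (1/3 + y/2)/(1/2 + y).
def f(x, y):
    return x + y                             # joint density on [0, 1]^2

h, n = 1e-4, 10_000
xs = [(i + 0.5) * h for i in range(n)]       # midpoint grid on [0, 1]

def cond_exp(y):
    num = sum(x * f(x, y) for x in xs) * h   # ∫ x f(x, y) dx
    den = sum(f(x, y) for x in xs) * h       # ∫ f(x, y) dx = f_Y(y)
    return num / den

max_err = max(abs(cond_exp(y) - (1/3 + y/2) / (1/2 + y))
              for y in (0.1, 0.5, 0.9))
print(max_err)   # only quadrature error remains
```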

More generally, it can be proved in the same way that for integrable h(X), the conditional expectation \mathbb E(h(X)\mid Y=y) is
\[
\mathbb E(h(X)\mid Y=y)=\int _ {\mathbb R} h(x)f _ {X|Y}(x)\, \mathrm dx=\frac{\int h(x)f(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.
\]
In particular, the conditional probability is
\[
\mathbb P(X\in A\mid Y=y)=\mathbb E(\chi _ {\{X\in A\} }\mid Y=y)=\int _ {A} f _ {X|Y}(x)\, \mathrm dx.
\]
Thus we obtain a regular conditional distribution here with density f _ {X|Y}.

Theorem 28. The conditional distribution of X given Y=y has density
\[
f _ {X|Y}(x)=\frac{f(x,y)}{f _ Y(y)}=\frac{f(x,y)}{\int f(x,y)\, \mathrm dx}.
\]

