
Independence and conditioning

You can read the LaTeX document online (for the latest updated chapters) from the link: probability.pdf

Chapter 4: Independence and conditioning

1. Independence

In elementary probability it is common to introduce conditional probability first and then independence. Here, however, independence is introduced without any mention of conditioning, and independent random variables are defined before independent events, contrary to the usual order. A brief introduction to conditioning is given in a later section.

Contents
 1.  Independence
 2.  Conditional expectations
   2.1.  Elementary conditional probabilities
   2.2.  Conditional expectations
   2.3.  Regular Conditional Distributions
 3.  Joint distribution with density

Definition 1. The random variables X _ i, 1\leqslant i\leqslant n, are said to be independent, iff for any Borel sets B _ i we have
\mathbb P\Big(\bigcap _ {i=1}^n(X _ i\in B _ i)\Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\in B _ i).
The random variables of an infinite family are said to be independent iff those in every finite subfamily are. They are said to be pairwise independent iff every two of them are independent.

Later we will see that, in terms of the notion of independent events, X _ 1,\dots,X _ n are independent iff the family \{\sigma(X _ i)\} _ {i=1}^n of \sigma-algebras is independent.

Note that if X _ 1,\dots,X _ n are independent, then the random variables in every subset of them are also independent, since we may take some of the B _ i's to be \mathbb R. On the other hand, independence follows from the apparently weaker condition:
\mathbb P\Big(\bigcap _ {i=1}^n(X _ i\leqslant x _ i)\Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\leqslant x _ i),\quad \forall (x _ 1,\dots,x _ n)\in\mathbb R^n.

The equivalence is not proved here. Written in terms of distribution functions, it is
F(x _ 1,\dots,x _ n)=\prod _ {i=1}^{n}F _ i(x _ i).

Definition 2. The events \{E _ i\} are said to be independent, iff their indicators are independent; this is equivalent to: for any finite subset \{i _ 1,\dots,i _ l\} of the index set, we have
\mathbb{P}\Big(\bigcap _ {j=1}^lE _ {i _ j}\Big)=\prod _ {j=1}^{l}\mathbb{P}(E _ {i _ j}).

The equivalence in this definition can be verified directly and is not shown here. The latter condition is the more common definition, since it does not rely on the notion of independent random variables and is the natural generalization of the independence of two events.

Theorem 3. If X _ 1,\dots,X _ n are independent random variables and f _ 1,\dots,f _ n are Borel measurable functions, then f _ 1(X _ 1),\dots,f _ n(X _ n) are independent random variables.

Theorem 4. Let 1\leqslant n _ 1<n _ 2<\dots<n _ k=n; let f _ 1 be a Borel measurable function of n _ 1 variables, f _ 2 one of n _ 2-n _ 1 variables, ..., f _ k one of n _ k-n _ {k-1} variables. If \{X _ i\} _ {i=1}^n are independent random variables, then f _ 1(X _ 1,\dots,X _ {n _ 1}),\dots,f _ k(X _ {n _ {k-1}+1},\dots,X _ {n _ k}) are independent.

Theorem 5. If X and Y are independent and both have finite expectations, then \mathbb E(XY)=\mathbb E(X)\mathbb E(Y).

We prove this theorem in two ways. The first is standard and therefore longer: first consider discrete X and Y, then arbitrary nonnegative ones; the general case then follows as usual.

The second proof can be written as follows:
\mathbb E(XY)=\int _ {\Omega}XY\, \mathrm d\mathbb P=\iint _ {\mathbb R^2}xy\, \mu^2(\mathrm dx,\mathrm dy).

Note that \mu^2(B _ 1\times B _ 2)=\mu _ X(B _ 1)\mu _ Y(B _ 2) for B _ 1,B _ 2\in\mathcal B. We have
\mathbb E(XY)=\int _ {\mathbb R}\int _ {\mathbb R}xy\, \mu _ X(\mathrm dx)\mu _ Y(\mathrm dy)=\int _ {\mathbb R}x\, \mu _ X(\mathrm dx)\int _ {\mathbb R}y\, \mu _ Y(\mathrm dy)=\mathbb E(X)\mathbb E(Y),

finishing the proof! Observe that we are using here a very simple form of Fubini's theorem (see below). Indeed, the second proof appears so much shorter only because it relies on the theory of ''product measure". One can check that the measure \mu^2 of (X,Y) is the product measure \mu _ X\times\mu _ Y iff X and Y are independent. Product measures will not be discussed in more detail here.

Corollary 6. If \{X _ i\} _ {i=1}^n are independent random variables with finite expectations, then \mathbb E\big(\prod _ {i=1}^n X _ i\big)=\prod _ {i=1}^n \mathbb E(X _ i).
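The corollary can be checked exactly on a small discrete example (the distributions below are made up for illustration): under independence the joint mass function factorizes, so the expectation of the product factorizes as well.

```python
from fractions import Fraction
from itertools import product

# Two independent discrete random variables (hypothetical distributions).
X = {1: Fraction(1, 2), 2: Fraction(1, 3), 6: Fraction(1, 6)}
Y = {-1: Fraction(1, 4), 3: Fraction(3, 4)}

EX = sum(x * p for x, p in X.items())
EY = sum(y * q for y, q in Y.items())
# Independence: the joint mass of (x, y) is the product of the marginals.
EXY = sum(x * y * p * q for (x, p), (y, q) in product(X.items(), Y.items()))
assert EXY == EX * EY
```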

The above definition of independent events is an abstraction of our common sense. However, it is sometimes hard to judge independence or dependence by intuition alone. Consider rolling two dice, and let A _ i denote the event ''the sum of the face values is a multiple of i". It is easily verified that A _ 2,A _ 3 are independent, while A _ 2,A _ 5 are not; neither fact is obvious without calculation.
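The dice example can be verified by direct enumeration (a quick check, not part of the text's argument):

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of two dice; A_i = "the sum is a multiple of i".
outcomes = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = {i: (lambda w, i=i: (w[0] + w[1]) % i == 0) for i in (2, 3, 5)}

# A_2 and A_3 are independent ...
assert P(lambda w: A[2](w) and A[3](w)) == P(A[2]) * P(A[3])
# ... while A_2 and A_5 are not.
assert P(lambda w: A[2](w) and A[5](w)) != P(A[2]) * P(A[5])
```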

It is well worth pointing out that pairwise independence does not imply mutual independence. Consider the following example. Let \Omega=\{\{1\},\{2\},\{3\},\{1,2,3\}\} and assign each sample point probability 1/4. Let A _ i=\{\{i\},\{1,2,3\}\}; then \mathbb P(A _ i)=1/2 and \mathbb P(A _ 1\cap A _ 2)=\mathbb P(A _ 1\cap A _ 3)=\mathbb P(A _ 2\cap A _ 3)=1/4, so A _ 1,A _ 2,A _ 3 are pairwise independent. However, they are not mutually independent, since 1/4=\mathbb{P}(A _ 1\cap A _ 2\cap A _ 3)\neq\mathbb{P}(A _ 1)\mathbb{P}(A _ 2)\mathbb{P}(A _ 3)=1/8.
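The four-point example above can also be checked by direct computation (sample points encoded as string labels):

```python
from fractions import Fraction
from itertools import combinations

# Four sample points, each with probability 1/4; A_i = {{i}, {1,2,3}}.
A = {i: frozenset({str(i), "123"}) for i in (1, 2, 3)}

def P(S):
    return Fraction(len(S), 4)   # each sample point has probability 1/4

for i, j in combinations((1, 2, 3), 2):
    assert P(A[i] & A[j]) == P(A[i]) * P(A[j])      # pairwise independent
assert P(A[1] & A[2] & A[3]) == Fraction(1, 4)       # but 1/4 != 1/8 below
assert P(A[1]) * P(A[2]) * P(A[3]) == Fraction(1, 8)
```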

From the above examples, we may feel that intuition can mislead us. But as mentioned before, the definition is a mathematical abstraction of ''independence" in daily life, so it should not be surprising that mathematical independence differs somewhat from our intuitive picture.

The notion of independence is extremely important in probability theory. It was central to the early development of the subject (say, until the 1930s); various theories for dependent variables have since been developed, but they remain less complete, and the theory and methods of the independent case are the basis and the tools for studying dependent models. In practice there are many events whose dependence is so weak that, within the tolerated error, they may be treated as independent, which simplifies problems considerably.

Now let us state the fundamental existence theorem of product measures. Since the proof is somewhat involved, it is not given here.

Theorem 7. Let a sequence of probability measures \{\mu _ i\} on (\mathbb R,\mathcal B), or equivalently their distribution functions, be given. There exists a probability space (\Omega,\mathcal A,\mathbb P) and a sequence of independent random variables \{X _ i\} defined on it such that for each i, \mu _ i is the probability measure of X _ i.

Now a famous theorem, Fubini's theorem, will be stated. In the earlier review of real analysis this theorem was given in a special case (Lebesgue measure only), so a more general version is provided here.

Let (X,\mathcal A _ 1,\mu) and (Y,\mathcal A _ 2,\lambda) be \sigma-finite measure spaces, and let f be an (\mathcal A _ 1\times \mathcal A _ 2)-measurable function on X\times Y. Then for each x\in X, f(x,\cdot) is an \mathcal A _ 2-measurable function, and for each y\in Y, f(\cdot,y) is an \mathcal A _ 1-measurable function.

Theorem 8. With the above assumptions, we have:
  • If 0\leqslant f\leqslant\infty, and if
    \varphi(x)=\int _ Y f(x,\cdot)\, \mathrm d\lambda,\quad \psi(y)=\int _ X f(\cdot, y)\, \mathrm d\mu,\quad (x\in X,\, y\in Y),

    then \varphi is \mathcal A _ 1-measurable, \psi is \mathcal A _ 2-measurable, and
    \int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda.

  • If f is complex and if
    \varphi^\ast(x)=\int _ Y|f(x,\cdot)|\, \mathrm d\lambda,\quad \int _ X\varphi^\ast\, \mathrm d\mu<\infty,

    then f\in L^1(\mu\times\lambda).

  • If f\in L^1(\mu\times\lambda), then f(x,\cdot)\in L^1(\lambda) for almost all x\in X, and f(\cdot,y)\in L^1(\mu) for almost all y\in Y; the functions \varphi and \psi, defined by the formulas above almost everywhere, are in L^1(\mu) and L^1(\lambda), respectively, and it still holds that
    \int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda.

If we consider the completion of measures, we have an alternative statement of Fubini's theorem.

Theorem 9. Let (X,\mathcal A _ 1,\mu) and (Y,\mathcal A _ 2,\lambda) be \sigma-finite measure spaces. Let \overline{\mathcal A _ 1\times \mathcal A _ 2} be the completion of \mathcal A _ 1\times \mathcal A _ 2, and let f be an \overline{\mathcal A _ 1\times \mathcal A _ 2}-measurable function on X\times Y. Then all conclusions of Fubini's theorem hold, the only difference being the following:

The \mathcal A _ 2-measurability of f(x,\cdot) can be asserted only for almost all x\in X, so that \varphi(x) is only defined a.e. with respect to \mu; a similar statement holds for f(\cdot,y) and \psi.


2. Conditional expectations

2.1. Elementary conditional probabilities
Definition 10. Let (\Omega,\mathcal A,\mathbb P) be a probability space and B\in\mathcal A. We define the conditional probability given B for any A\in\mathcal A by: \mathbb P(A|B)=0 if \mathbb P(B)=0; otherwise,
\mathbb P(A| B)=\frac{\mathbb P(A\cap B)}{\mathbb P(B)}.

If \mathbb P(B)>0, it is obvious that \mathbb P(\cdot|B) is a probability measure on (\Omega,\mathcal A).

Let A,B\in\mathcal A with \mathbb P(A),\mathbb P(B)>0. Then ''A,B are independent" is equivalent to \mathbb P(A|B)=\mathbb P(A), and also to \mathbb P(B|A)=\mathbb P(B). This is the most common definition of independent events in elementary probability, since it matches our intuition well. The conditional probability \mathbb P(A|B) often differs from the unconditional probability \mathbb P(A), reflecting some dependence between the two events A and B. If they coincide, then the occurrence of B does not influence the probability of occurrence of A at all.

Theorem 11. Let I be a countable set and let \{B _ i\} _ {i\in I} be pairwise disjoint sets with \mathbb P(\bigcup _ {i\in I} B _ i)=1. Then for any A\in\mathcal A,
\begin{equation}\label{31} \mathbb P(A)=\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i). \end{equation}

For any A\in\mathcal A with \mathbb P(A)>0 and any k\in I,
\begin{equation}\label{32} \mathbb P(B _ k| A)=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\mathbb P(A)}=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i)}. \end{equation}

From the derivation you may find (32) somewhat trivial. However, (32) is a well-known formula in probability theory, named Bayes' formula (or Bayes' theorem, Bayes' law, Bayes' rule, etc.), because of its practical and even philosophical significance. The \mathbb P(B _ i) are the estimated probabilities of occurrence before any further information (about the occurrence of A) is available. With the new information that A has occurred, the probability of each B _ i is re-estimated. This is common in daily life: an event previously considered nearly impossible can become very likely after the occurrence of another event, and vice versa. Bayes' formula characterizes this quantitatively.

If the event A is treated as a ''result" and the B _ i as possible causes of this result, we can interpret (31) in some sense as ''inferring the result from the causes", while Bayes' formula does the opposite: it ''infers the causes from the result". Knowing that the result A has occurred, we may ask which of the many possible causes led to it. This is a very common question in daily life and in research, and Bayes' formula asserts that the posterior probability of each cause B _ i is proportional to \mathbb P(A|B _ i)\mathbb P(B _ i).

From the discussion above, it is not surprising that Bayes' formula plays a remarkable role in the field of statistics.
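A numerical illustration of (31) and (32), with made-up priors and likelihoods for three disjoint ''causes" B _ i:

```python
from fractions import Fraction

# Hypothetical numbers: priors P(B_i) and likelihoods P(A | B_i).
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
lik = [Fraction(1, 10), Fraction(1, 5), Fraction(9, 10)]

PA = sum(l * p for l, p in zip(lik, prior))        # total probability, as in (31)
post = [l * p / PA for l, p in zip(lik, prior)]    # Bayes' formula, as in (32)

assert sum(post) == 1
# The a priori least likely cause dominates a posteriori:
assert prior[2] == min(prior) and post[2] == max(post)
```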

For the purpose of the next subsection, we consider a naive example of conditional expectation.

If X is a random variable with finite expectation and A\in\mathcal A with \mathbb P(A)>0, then the expectation of X with respect to the probability measure \mathbb P(\cdot|A) can be written as
\mathbb E(X|A)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|A):=\frac1{\mathbb P(A)}\int _ A X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ A X)}{\mathbb P(A)}.
Clearly, \mathbb P(B|A)=\mathbb E(\chi _ B|A) for all B\in\mathcal A.
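A quick numerical check of the formula \mathbb E(X|A)=\mathbb E(\chi _ A X)/\mathbb P(A) on a finite uniform space (values made up):

```python
from fractions import Fraction

# Six equally likely points; hypothetical values of X and a conditioning event A.
Omega = range(6)
p = Fraction(1, 6)                      # probability of each point
X = {0: 2, 1: 4, 2: 6, 3: 1, 4: 3, 5: 5}
A = {0, 2, 4}

PA = len(A) * p                         # P(A) = 1/2
E_chiA_X = sum(X[w] * p for w in A)     # E(chi_A X)
E_X_given_A = E_chiA_X / PA
assert E_X_given_A == Fraction(11, 3)   # (2 + 6 + 3)/3
```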
2.2. Conditional expectations

Let X\sim U[0,1], that is, X has density 1 over (0,1). Assume that, given the information X=x, the random variables Y _ 1,\dots, Y _ n are independent and each has the Bernoulli distribution with parameter x, i.e., \mathbb P(Y _ i=1)=x and \mathbb P(Y _ i=0)=1-x. (Although these very common distributions have not been reviewed so far, they are used here for a better explanation of the motivation.) So far, an object like \mathbb P(\cdot\mid X=x) has not been defined, since \mathbb P(X=x)=0. However, we need it: in this example the distributions of the Y _ i are determined under the condition X=x.

We first consider a more general situation in this section.

Definition 12. Given an integrable X and a sub-\sigma-algebra \mathcal F of \mathcal A, the conditional expectation of X given \mathcal F, denoted by \mathbb E(X|\mathcal F), is a random variable satisfying the two properties:
  • it is \mathcal F-measurable;
  • it has the same integral as X over any set in \mathcal F, i.e.,
    \int _ A\mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P,\quad \forall A\in\mathcal F,

    or equivalently \mathbb E(X\chi _ A)=\mathbb E(\mathbb E(X|\mathcal F)\chi _ A).

For B\in\mathcal A, \mathbb P(B|\mathcal F):=\mathbb E(\chi _ B|\mathcal F) is called a conditional probability of B given \mathcal F.


Theorem 13. \mathbb E(X|\mathcal F) exists and is unique (up to equality almost surely).

For the uniqueness, let Y and Y' both satisfy the two properties, and let A=\{\omega\mid Y-Y'>0\}. Clearly A\in\mathcal F, and \int _ A (Y-Y')\, \mathrm d\mathbb P=0, hence \mathbb P(A)=0; interchanging Y and Y' we conclude that Y=Y' a.s. For the existence, consider the set function \nu on \mathcal F defined by \nu(A)=\int _ A X\, \mathrm d\mathbb P for A\in\mathcal F. It is finite-valued and countably additive, hence a ''signed measure" on \mathcal F. If \mathbb P(A)=0 then \nu(A)=0, so \nu is absolutely continuous with respect to \mathbb P: \nu\ll\mathbb P. The existence then follows from the Radon-Nikodym theorem, the resulting ''derivative" \mathrm d\nu/\mathrm d\mathbb P being what we desire.
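The abstract existence proof becomes concrete when \mathcal F is generated by a finite partition: the Radon-Nikodym derivative is then simply the block average. A small sketch with hypothetical numbers:

```python
from fractions import Fraction

# When F is generated by a finite partition {B_1, ..., B_m}, E(X|F) is
# constant on each block and equals the block average E(X; B_j)/P(B_j) there.
Omega = range(6)
P = {w: Fraction(1, 6) for w in Omega}              # uniform probability
X = {0: 1, 1: 4, 2: 2, 3: 2, 4: 0, 5: 3}
partition = [{0, 1}, {2, 3, 4}, {5}]

def cond_exp(Z, partition):
    E = {}
    for B in partition:
        avg = sum(Z[w] * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            E[w] = avg
    return E

Y = cond_exp(X, partition)
# Defining relation: same integral as X over every set in F (it suffices
# to check the blocks, which generate F).
for B in partition:
    assert sum(Y[w] * P[w] for w in B) == sum(X[w] * P[w] for w in B)
```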

If Y is a random variable and X is integrable, then we write \mathbb E(X|Y):=\mathbb E(X|\sigma(Y)).

The next theorem gathers some properties of conditional expectation.

Theorem 14. Consider a probability space (\Omega, \mathcal A,\mathbb P) and integrable X,Y. Let \mathcal G\subseteq\mathcal F\subseteq \mathcal A be \sigma-algebras. Then:
  1. (Linearity) \mathbb E(\lambda X+Y\mid\mathcal F)=\lambda\, \mathbb E(X|\mathcal F)+\mathbb E(Y|\mathcal F).
  2. (Monotonicity) If X\geqslant Y a.s., then \mathbb E(X|\mathcal F)\geqslant\mathbb E(Y|\mathcal F).
  3. If XY is integrable and Y is \mathcal F-measurable, then
    \mathbb E(XY|\mathcal F)=Y\, \mathbb E(X|\mathcal F),\quad \mathbb E(Y|\mathcal F)=\mathbb E(Y|Y)=Y.

  4. (Tower property)
    \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)=\mathbb E(\mathbb E(X|\mathcal G)\mid\mathcal F)=\mathbb E(X|\mathcal G).

    Hence \mathbb E(X|\mathcal G)=\mathbb E(X|\mathcal F) if and only if \mathbb E(X|\mathcal F) is \mathcal G-measurable.

  5. (Triangle inequality) \mathbb E(|X|\mid \mathcal F)\geqslant |\mathbb E(X|\mathcal F)|.
  6. (Independence) If \sigma(X) and \mathcal F are independent, then \mathbb E(X|\mathcal F)=\mathbb E(X).
  7. If \mathbb P(A)=0 or \mathbb P(A)=1 for every A\in\mathcal F, then \mathbb E(X|\mathcal F)=\mathbb E(X).
  8. (Dominated convergence) Assume |X _ n|\leqslant Y and X _ n\to X a.s. Then
    \lim _ {n\to\infty}\mathbb E(X _ n|\mathcal F)=\mathbb E(X|\mathcal F)\quad \text{a.s. and in }L^1(\mathbb P).

For (3), as usual we may suppose X,Y\geqslant0. The proof consists in observing that Y\, \mathbb E(X|\mathcal F) is \mathcal F-measurable and satisfies the defining relation \int _ A Y\, \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A XY\, \mathrm d\mathbb P for all A\in\mathcal F. This is true if Y=\chi _ B with B\in\mathcal F; hence it is true if Y is a simple \mathcal F-measurable random variable, and consequently, by monotone convergence, for every nonnegative \mathcal F-measurable Y, whether the limits are finite or infinite. Note that the integrability of Y\, \mathbb E(X|\mathcal F) is part of the assertion of the property. The second equality follows, since the defining relation holds trivially when X is replaced by Y.

(4) may be the most important property of conditional expectation for changing \sigma-algebras. The second equality follows from (3). Now let A\in\mathcal G; then A\in\mathcal F. We apply the defining relation twice:
\int _ A \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)\, \mathrm d\mathbb P=\int _ A \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P.
Hence \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G) satisfies the defining relation for \mathbb E(X|\mathcal G), and since it is \mathcal G-measurable, it equals the latter.

For (6), let A\in\mathcal F. Then X and \chi _ A are independent. Recalling the product property of expectations of independent random variables, we have \int _ A \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\mathbb E(X\chi _ A)=\mathbb E(X)\, \mathbb P(A)=\int _ A \mathbb E(X)\, \mathrm d\mathbb P.

The proof of (8) is omitted here.
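To make the tower property (4) concrete, here is a small sketch (not from the text) on a finite space where \mathcal G and \mathcal F are generated by nested partitions:

```python
from fractions import Fraction

# Numbers made up: F is finer than G, with each G-block a union of F-blocks.
Omega = range(4)
P = {w: Fraction(1, 4) for w in Omega}
X = {0: 1, 1: 3, 2: 5, 3: 7}
F = [{0}, {1}, {2, 3}]       # finer partition
G = [{0, 1}, {2, 3}]         # coarser partition

def cond_exp(Z, partition):
    E = {}
    for B in partition:
        avg = sum(Z[w] * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            E[w] = avg
    return E

# E(E(X|F)|G) = E(X|G):
assert cond_exp(cond_exp(X, F), G) == cond_exp(X, G)
```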

Corollary 15. Let \mathcal F\subseteq \mathcal A be a \sigma-algebra and X a random variable with \mathbb E(X^2)<\infty. Then \mathbb E(X|\mathcal F) is the orthogonal projection of X onto L^2(\Omega,\mathcal F,\mathbb P). That is, for any \mathcal F-measurable Y with \mathbb E(Y^2)<\infty,
\mathbb E(X-Y)^2\geqslant\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right]
with equality if and only if Y = \mathbb E(X|\mathcal F) a.s.

It can be proved from Jensen's inequality (provided later) that \mathbb E[\mathbb E(X|\mathcal F)^2]\leqslant\mathbb E(X^2). By the Cauchy-Schwarz inequality, \mathbb E|XY|<\infty. Since Y is \mathcal F-measurable, by property (3),
\mathbb E[\mathbb E(X|\mathcal F)Y]=\mathbb E(XY),\quad \mathbb E[X\, \mathbb E(X|\mathcal F)]=\mathbb E\{\mathbb E[X\, \mathbb E(X|\mathcal F)\mid\mathcal F]\}=\mathbb E\left[\mathbb E(X|\mathcal F)^2\right].
Therefore, we have
\begin{align*} \mathbb E(X-Y)^2-\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right] & =\mathbb{E}\left[X^2-2XY+Y^2-X^2+2X\mathbb E(X|\mathcal F)-\mathbb E(X|\mathcal F)^2\right] \\ & =\mathbb{E}\left[Y^2-2Y\mathbb E(X|\mathcal F)+\mathbb E(X|\mathcal F)^2\right]\\ & =\mathbb E\left[(Y-\mathbb E(X|\mathcal F))^2\right]\geqslant 0. \end{align*}
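A numerical sketch of the corollary (values made up): among predictors that are constant on the blocks of a partition, i.e. \mathcal F-measurable ones, the block average minimizes the mean squared error.

```python
from fractions import Fraction
from itertools import product

# Four equally likely points; F generated by the partition {{0,1}, {2,3}}.
Omega = range(4)
P = Fraction(1, 4)                     # probability of each point
X = {0: 0, 1: 2, 2: 1, 3: 5}
F = [{0, 1}, {2, 3}]

def mse(Y):
    return sum((X[w] - Y[w]) ** 2 * P for w in Omega)

best = {0: 1, 1: 1, 2: 3, 3: 3}        # block averages (0+2)/2 and (1+5)/2
# Compare against a grid of other F-measurable candidates:
for a, b in product(range(-2, 6), repeat=2):
    Y = {0: a, 1: a, 2: b, 3: b}
    assert mse(Y) >= mse(best)
```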

Theorem 16 (Cauchy-Schwarz inequality). For square integrable X,Y we have
\mathbb E(|XY|\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F).

This inequality is given here to illustrate the caution necessary in handling conditional expectations. If we consider the nonnegativity of \mathbb E((X+\lambda Y)^2\mid\mathcal F), the problem arises that for each \lambda this holds only for almost all \omega, i.e., only up to a \mathbb P-null set N _ \lambda. The union \bigcup _ {\lambda\in\mathbb R}N _ \lambda cannot be ignored without comment. We get around this difficulty by restricting \lambda to the rationals. Let N=\bigcup _ {\lambda\in\mathbb Q}N _ \lambda. Clearly N is a \mathbb P-null set, and the nonnegativity of the quadratic form in \lambda holds for every \omega\in \Omega\setminus N and every \lambda\in \mathbb Q. For \omega\in\Omega\setminus N,
\begin{align*} & \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E((X+\lambda Y)^2\mid\mathcal F)(\omega)\Big]\geqslant0, \\ \Longrightarrow{} & \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E(Y^2|\mathcal F)(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\ \Longrightarrow{} & \inf _ {\lambda\in\mathbb R}\Big[\mathbb E(Y^2|\mathcal F)(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\ \Longrightarrow{} & \mathbb E(XY\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)\quad \text{a.s.} \end{align*}
Applying this to |X| and |Y| gives the stated inequality.

Here is an alternative argument that avoids this difficulty:
\mathbb E\Big(\frac{|XY|}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\Big|\mathcal F\Big)\leqslant\frac12\mathbb E\Big(\frac{X^2}{\mathbb E(X^2|\mathcal F)}\Big|\mathcal F\Big)+\frac12\mathbb E\Big(\frac{Y^2}{\mathbb E(Y^2|\mathcal F)}\Big|\mathcal F\Big).

Hence
\frac{\mathbb E(|XY|\mid\mathcal F)}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\leqslant\frac12+\frac12=1.

Theorem 17 (Jensen's inequality). If \varphi is convex on \mathbb R and X,\varphi(X) are integrable random variables, then for each \sigma-algebra \mathcal F\subseteq \mathcal A:
\varphi(\mathbb E(X|\mathcal F))\leqslant\mathbb E(\varphi(X)|\mathcal F).

Proof. One might worry that the conditional expectation of \varphi(X) may not exist. However, note that the negative part \varphi(X)^- of \varphi(X) is integrable, so \mathbb E(\varphi(X)|\mathcal F) exists in a generalized sense. The details are not given here; the reader is advised to consult further references.

The right derivative of a convex function always exists. Denote the right derivative of \varphi at x by \varphi _ +'(x); then for all x,y\in \mathbb R, \varphi(y)\geqslant\varphi(x)+\varphi _ +'(x)(y-x). Replacing y by X and taking conditional expectations, by the linearity and monotonicity of (generalized) conditional expectations we have: for each x\in\mathbb R there exists a \mathbb P-null set N _ x such that for every \omega\in\Omega\setminus N _ x,
\mathbb E [ \varphi(X)|\mathcal F ] (\omega)\geqslant\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x].
Let N=\bigcup _ {x\in\mathbb Q}N _ x; then N is also a \mathbb P-null set, and the inequality holds for every \omega\in\Omega\setminus N and every x\in \mathbb Q. The right derivative \varphi _ +'(x) is right continuous, which can be checked as follows: \varphi _ +'(x)=\lim _ {y\to x^+}(\varphi(y)-\varphi(x))/(y-x) and \varphi _ +' is increasing, so \lim _ {y\to x^+} \varphi _ +'(y)\geqslant \varphi _ +'(x); for the reverse, note that \lim _ {y\to x^+} \varphi _ +'(y)\leqslant (\varphi(z)-\varphi(x))/(z-x) for every z>x, and letting z\to x^+ gives \lim _ {y\to x^+} \varphi _ +'(y)\leqslant \varphi _ +'(x). Thus \lim _ {y\to x^+} \varphi _ +'(y)=\varphi _ +'(x), i.e., \varphi _ +' is right continuous.

Now \varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x] is also right continuous in x, so its supremum over x\in \mathbb Q equals its supremum over x\in\mathbb R. Note that taking x=\mathbb E(X|\mathcal F)(\omega) yields \varphi(\mathbb E(X|\mathcal F)(\omega)). Hence for every \omega\in\Omega\setminus N, we have
\varphi(\mathbb E(X|\mathcal F)(\omega))\leqslant\sup _ {x\in\mathbb R}\Big[\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x]\Big]\leqslant\mathbb E [ \varphi(X)|\mathcal F ] (\omega).
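A quick finite check of Jensen's inequality (example data made up) with the convex function \varphi(x)=x^2:

```python
from fractions import Fraction

# Four equally likely points; F generated by the partition {{0,1}, {2,3}}.
Omega = range(4)
P = {w: Fraction(1, 4) for w in Omega}
X = {0: -1, 1: 3, 2: 0, 3: 2}
F = [{0, 1}, {2, 3}]
phi = lambda t: t * t                   # convex function

def cond_exp(Z, partition):
    E = {}
    for B in partition:
        avg = sum(Z[w] * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            E[w] = avg
    return E

left = {w: phi(v) for w, v in cond_exp(X, F).items()}       # phi(E(X|F))
right = cond_exp({w: phi(v) for w, v in X.items()}, F)      # E(phi(X)|F)
assert all(left[w] <= right[w] for w in Omega)
```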

2.3. Regular conditional distributions

Let X be a random variable with values in a measurable space (E,\mathcal E); that is, X is \mathcal A-\mathcal E-measurable. So far we can define the conditional probability \mathbb P(A|X) for fixed A\in\mathcal A only. However, we would like to define for every x\in E a probability measure \mathbb P(\cdot\mid X=x) such that for any A\in\mathcal A we have \mathbb P(A|X)(\omega)=\mathbb{P}(A\mid X=x) for \omega\in\{\omega\mid X(\omega)=x\}. In this subsection we show how to do this.

Let Z be a \sigma(X)-measurable real random variable. It can be proved that there is a map \varphi:E\to\mathbb R such that: 1. \varphi is \mathcal E-measurable; 2. \varphi(X)=Z.

Lemma 18. Let (\Omega',\mathcal A') be a measurable space and let \Omega be a nonempty set. Let f : \Omega\to\Omega' be a map. If a map g : \Omega\to\overline{\mathbb R} is \sigma(f)-\mathcal B(\overline{\mathbb R})-measurable, then there is a measurable map \varphi:(\Omega',\mathcal A')\to(\overline{\mathbb R},\mathcal B(\overline{\mathbb R})) such that g = \varphi\circ f.

We now prove this lemma. First consider the case g\geqslant0. Then g can be written as \sum _ {n=1}^{\infty} \alpha _ n\chi _ {A _ n}. (A nonnegative measurable function is the limit of an increasing sequence of nonnegative measurable simple functions, so g can be written as a sum of nonnegative measurable simple functions, i.e., as a sum of scaled indicators \alpha _ n \chi _ {A _ n}.) By the assumption on g, A _ 1,A _ 2,\dots\in\sigma(f), which means that for each n there is B _ n\in\mathcal A' with f^{-1}(B _ n)=A _ n. Hence \chi _ {A _ n}=\chi _ {B _ n}\circ f.

Now define \varphi:\Omega'\to\overline{\mathbb R} by \varphi=\sum _ {n=1}^{\infty}\alpha _ n\chi _ {B _ n}. Clearly, \varphi is \mathcal A'-\mathcal B(\overline{\mathbb R})-measurable and g=\varphi\circ f.

Now drop the assumption that g is nonnegative. Then there exist measurable maps \varphi^- and \varphi^+ such that g^-=\varphi^-\circ f and g^+=\varphi^+\circ f. Note that \min(g^+(\omega),g^-(\omega))=0 for all \omega. Hence it suffices to set \varphi(\omega'):=\varphi^+(\omega')-\varphi^-(\omega') if \varphi^+(\omega')<\infty or \varphi^-(\omega')<\infty, and \varphi(\omega'):=0 elsewhere.

Let f=X and g=Z; we obtain the map \varphi mentioned above. Now take Z=\mathbb E(Y|X); then the random variable \mathbb E(Y|X) is given by \varphi(X). From this observation, we can define \mathbb E(Y\mid X=x):

Definition 19. Let Y be an integrable random variable and X:(\Omega,\mathcal A)\to(E,\mathcal E). We define the conditional expectation of Y given X=x by \mathbb E(Y\mid X=x):=\varphi(x), where \varphi is the function constructed above, satisfying: \varphi is \mathcal E-measurable and \varphi(X)=\mathbb E(Y|X).

Analogously, define \mathbb P(A\mid X=x)=\mathbb E(\chi _ A\mid X=x) for A\in\mathcal A.

For B\in\mathcal A with \mathbb P(B)>0, we know that \mathbb P(\cdot|B) is a probability measure. Is the same true for \mathbb P(\cdot\mid X=x)? The question is a bit tricky, since for every given A\in\mathcal A the expression \mathbb P(A\mid X=x) is defined for almost all x only, i.e., up to x in a null set depending on A. It seems that we face some difficulties here, but let us first look at some useful definitions.

Definition 20. Let (\Omega _ 1,\mathcal A _ 1), (\Omega _ 2,\mathcal A _ 2) be measurable spaces. A map \kappa: \Omega _ 1\times \mathcal A _ 2\to[0,\infty] is called a (\sigma-)finite transition kernel (from \Omega _ 1 to \Omega _ 2) if:
  1. \omega _ 1\mapsto\kappa(\omega _ 1,A _ 2) is \mathcal A _ 1-measurable for any A _ 2\in\mathcal A _ 2;
  2. A _ 2\mapsto\kappa(\omega _ 1,A _ 2) is a (\sigma-)finite measure on (\Omega _ 2,\mathcal A _ 2) for any \omega _ 1\in\Omega _ 1.

If in (2) the measure is a probability measure for all \omega _ 1\in\Omega _ 1, then \kappa is called a stochastic kernel or a Markov kernel.

If a transition kernel also satisfies \kappa(\omega _ 1,\Omega _ 2)\leqslant1 for all \omega _ 1\in\Omega _ 1, then \kappa is called sub-Markov or substochastic.


Definition 21. Let Y be a random variable with values in a measurable space (E,\mathcal E) and let \mathcal F\subseteq\mathcal A be a sub-\sigma-algebra. A Markov kernel \kappa _ {Y,\mathcal F} from (\Omega,\mathcal F) to (E,\mathcal E) is called a regular conditional distribution of Y given \mathcal F if
\kappa _ {Y,\mathcal F}(\omega,B)=\mathbb P(\{Y\in B\}\mid \mathcal F)(\omega)\quad \text{a.e. for all }B\in\mathcal E,

that is, if
\int _ A \chi _ B(Y)\, \mathrm d\mathbb P=\int _ A \kappa _ {Y,\mathcal F}(\cdot,B)\, \mathrm d\mathbb P\quad \text{for all }A\in\mathcal F,\, B\in\mathcal E.

In short, the function \kappa is called a regular conditional distribution of Y given \mathcal F if: 1. \omega\mapsto\kappa(\omega,B) is a version of \mathbb P(Y\in B\mid \mathcal F) for each B\in\mathcal E; 2. B\mapsto\kappa(\omega,B) is a probability measure on (E,\mathcal E) for each \omega.

Consider the special case \mathcal F=\sigma(X) for a random variable X (with values in an arbitrary measurable space (E',\mathcal R')). Define a regular conditional distribution of Y given X by the Markov kernel
(x,A)\mapsto\kappa _ {Y,X}(x,A):=\mathbb P(\{Y\in A\}\mid X=x)=\kappa _ {Y,\sigma(X)}(X^{-1}(x),A),

and for x outside the range of X (where X^{-1}(x) is empty) we assign an arbitrary value.

For regular conditional distributions in \mathbb R, we have the following theorem:

Theorem 22. Let Y:(\Omega,\mathcal A)\to(\mathbb R,\mathcal B) be real-valued. Then there exists a regular conditional distribution \kappa _ {Y,\mathcal F} of Y given \mathcal F.

For the proof, we refer to other materials.

We are also interested in the situation where Y takes values in \mathbb R^n or in even more general spaces. We now extend the result to a larger class of ranges for Y. More definitions are needed, but they are only briefly stated here. A measurable space (E,\mathcal E) is called a Borel space if there exist a Borel set B\in\mathcal B(\mathbb R) and a one-to-one map \varphi:E\to B such that \varphi is \mathcal E-\mathcal B(B)-measurable and the inverse map \varphi^{-1} is \mathcal B(B)-\mathcal E-measurable. In general topology, a Polish space is a separable completely metrizable topological space (i.e., a separable topological space whose topology is induced by a complete metric). If E is a Polish space with Borel \sigma-algebra \mathcal E, then (E,\mathcal E) is a Borel space.

Theorem 23. Let \mathcal F\subseteq \mathcal A be a sub-\sigma-algebra. Let Y be a random variable with values in a Borel space (E,\mathcal E) (hence, for example, E Polish, E=\mathbb R^d, E=\mathbb R^\infty, E=C[0,1], etc.). Then there exists a regular conditional distribution \kappa _ {Y,\mathcal F} of Y given \mathcal F.

The proof is omitted here; it can be found in standard references.

To conclude, we pick up again the example with which we started. Define Y=(Y1,,Yn)Y=(Y _ 1,\dots,Y _ n). By the theorem above (with E={0,1}nRnE=\{0,1\}^n\subseteq\mathbb R^n), a regular conditional distribution exists:
\kappa _ {Y,X}(x,\cdot)=\mathbb P(Y\in\cdot\mid X=x),\quad x\in[0,1].

Indeed, for almost all x[0,1]x\in[0,1], P(YX=x)\mathbb P(Y\in\cdot\mid X=x) is the nn-fold product of Bernoulli distributions with parameter xx.
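This product structure can be checked by simulation. The sketch below (assuming the example is: draw XX uniformly on [0,1][0,1], then, given X=xX=x, draw Y1,,YnY _ 1,\dots,Y _ n i.i.d. Bernoulli(xx)) conditions on XX falling near x0=0.7x _ 0=0.7 and estimates P(Y=(1,,1)X≈x0)\mathbb P(Y=(1,\dots,1)\mid X\approx x _ 0), which should be close to x0nx _ 0^n:

```python
import random

random.seed(0)
n, trials = 3, 200_000
hits = total = 0
for _ in range(trials):
    x = random.random()                 # X uniform on [0, 1]
    if 0.65 <= x <= 0.75:               # condition on X near x0 = 0.7
        y = [random.random() < x for _ in range(n)]  # Bernoulli(x) draws
        total += 1
        hits += all(y)                  # event {Y = (1, ..., 1)}
estimate = hits / total
print(estimate, 0.7 ** n)  # Monte Carlo estimate vs. x0 ** n = 0.343
```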

Theorem 24. Let XX be a random variable on (Ω,A,P)(\Omega,\mathcal A,\mathbb P) with values in a Borel space (E,E)(E,\mathcal E). Let FA\mathcal F\subseteq A be a σ\sigma-algebra and let κX,F\kappa _ {X,\mathcal F} be a regular conditional distribution of XX given F\mathcal F. Further, let f:ERf:E\to\mathbb R be measurable and f(X)f(X) be integrable. Then
\mathbb E(f(X)|\mathcal F)(\omega)=\int f(x)\, \kappa _ {X,\mathcal F}(\omega,\mathrm dx)\quad\text{for }\mathbb P\text{-almost all }\omega.

The proof is omitted here; it can be found in standard references.
3. Joint distribution with density In this section we take a closer look at jointly distributed random variables with densities. Consider a family of random variables {Xi}iI\{X _ i\} _ {i\in I} and let JIJ\subseteq I be a finite subset of II. We have defined the (joint) distribution of {Xj}jJ\{X _ j\} _ {j\in J}. The joint distribution function of {Xj}jJ\{X _ j\} _ {j\in J} generalizes readily from the one-dimensional case. It is also possible to define the (joint) density of {Xj}jJ\{X _ j\} _ {j\in J}, that is, a function fJ:RJ[0,)f _ J:\mathbb R^J\to[0,\infty) such that
F _ J(\boldsymbol x)=\int _ {-\infty}^{x _ {j _ 1}}\dots\int _ {-\infty}^{x _ {j _ n}}f _ J(t _ 1,\dots,t _ n)\, \mathrm dt _ 1\dots\mathrm dt _ n,\quad \forall \boldsymbol x\in\mathbb R^J.

If we further assume that fJf _ J is continuous, then independence can be characterized in terms of the joint densities:

Theorem 25. The family {Xi}iI\{X _ i\} _ {i\in I} is independent iff for any finite JIJ\subseteq I we have
f _ J(\boldsymbol x)=\prod _ {j\in J}f _ j(x _ j)\quad\forall\boldsymbol x\in\mathbb R^J,

where fjf _ j is the marginal density, deduced from the joint density by integrating out the other variables xk(kJ,kj)x _ k\, (k\in J,k\neq j).
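A minimal numerical check of this factorization, using the hypothetical density f(x,y)=4xyf(x,y)=4xy on the unit square (so fX(x)=2xf _ X(x)=2x and fY(y)=2yf _ Y(y)=2y): the marginals computed by midpoint Riemann sums multiply back to the joint density.

```python
def f(x, y):
    # hypothetical joint density on [0, 1]^2; it factors as (2x)(2y)
    return 4.0 * x * y

N = 400
h = 1.0 / N

def f_X(x):
    # marginal density: integrate out y by a midpoint Riemann sum
    return sum(f(x, (j + 0.5) * h) for j in range(N)) * h

def f_Y(y):
    # marginal density: integrate out x
    return sum(f((i + 0.5) * h, y) for i in range(N)) * h

x0, y0 = 0.3, 0.8
print(f(x0, y0), f_X(x0) * f_Y(y0))  # both equal 0.96 up to rounding
```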

For discrete random variables, it is easy to see that the probability mass function also characterizes independence:

Theorem 26. The family {Xi}iI\{X _ i\} _ {i\in I} is independent iff for any finite JIJ\subseteq I we have
p _ J(\boldsymbol x)=\prod _ {j\in J}p _ j(x _ j)\quad \forall \boldsymbol x\in\boldsymbol x(S).

Now we consider conditional expectations and conditional distributions for discrete and continuous random variables. If AA is any set in A\mathcal A with P(A)>0\mathbb P(A)>0, we have seen that PA()\mathbb P _ A(\cdot), defined by PA(E)=P(EA)/P(A)\mathbb P _ A(E)=\mathbb P(E\cap A)\mathbin{/}\mathbb P(A), is a probability measure.

In the discrete case, let YY be discrete, say Y=nynχΩnY=\sum _ n y _ n\chi _ {\Omega _ n} where {Ωn}\{\Omega _ n\} is a partition of Ω\Omega, and consider E(XY=yn)\mathbb E(X\mid Y=y _ n). We will see that the definition in section 3.4.3 coincides with the one in section 3.4.1. In section 3.4.1 we defined such a conditional expectation by
\mathbb E(X|\Omega _ n)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|\Omega _ n)=\frac1{\mathbb P(\Omega _ n)}\int _ {\Omega _ n} X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ {\Omega _ n} X)}{\mathbb P(\Omega _ n)},

which means that E(XY=yn)=E(χΩnX)/P(Ωn)\mathbb E(X\mid Y=y _ n)={\mathbb E(\chi _ {\Omega _ n} X)}\mathbin{/}{\mathbb P(\Omega _ n)}. In the context of section 3.4.2, we prove E(XY)=nE(XΩn)χΩn\mathbb E(X|Y)=\sum _ n \mathbb E(X|\Omega _ n)\chi _ {\Omega _ n}, i.e., for any ωΩn\omega\in\Omega _ n, E(XY)(ω)=E(XΩn)\mathbb E(X|Y)(\omega)=\mathbb E(X|\Omega _ n). Only the defining relation needs to be checked, and it suffices to show ΩnE(XΩn)dP=ΩnXdP\int _ {\Omega _ n}\mathbb E(X|\Omega _ n)\, \mathrm d\mathbb P=\int _ {\Omega _ n}X\, \mathrm d\mathbb P. The left-hand side equals E(XΩn)P(Ωn)\mathbb E(X|\Omega _ n)\mathbb P(\Omega _ n) and the right-hand side equals E(χΩnX)\mathbb E(\chi _ {\Omega _ n}X), so the displayed equality above shows that the defining relation holds. From E(XY)=nE(XΩn)χΩn\mathbb E(X|Y)=\sum _ n \mathbb E(X|\Omega _ n)\chi _ {\Omega _ n} we recover E(XY=yn)=E(χΩnX)/P(Ωn)\mathbb E(X\mid Y=y _ n)={\mathbb E(\chi _ {\Omega _ n} X)}\mathbin{/}{\mathbb P(\Omega _ n)}.
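The identity E(X∣Y=yn)=E(χΩnX)/P(Ωn)\mathbb E(X\mid Y=y _ n)=\mathbb E(\chi _ {\Omega _ n}X)/\mathbb P(\Omega _ n) is easy to verify on a finite sample space. A hypothetical sketch with Ω={0,,5}\Omega=\{0,\dots,5\}, uniform P\mathbb P, X(ω)=ω+1X(\omega)=\omega+1, and the partition induced by Y(ω)=ωmod2Y(\omega)=\omega\bmod 2:

```python
omega = range(6)
P = {w: 1 / 6 for w in omega}        # uniform probability
X = {w: w + 1.0 for w in omega}      # X takes values 1, ..., 6
Y = {w: w % 2 for w in omega}        # partition of Omega into two blocks

def cond_exp(y):
    # E(X | Y = y) = E(chi_{Omega_n} X) / P(Omega_n)
    block = [w for w in omega if Y[w] == y]
    return sum(X[w] * P[w] for w in block) / sum(P[w] for w in block)

print(cond_exp(0), cond_exp(1))  # 3.0 and 4.0
# tower property: E(E(X|Y)) = E(X) = 3.5
print(sum(cond_exp(Y[w]) * P[w] for w in omega))
```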

Now we focus on the continuous case, where the random variables have a continuous density. For simplicity we consider real-valued X,YX,Y with density f(x,y)f(x,y). Denote the marginals by fX(x)=f(x,y)dyf _ X(x)=\int f(x,y)\, \mathrm dy and fY(y)=f(x,y)dxf _ Y(y)=\int f(x,y)\, \mathrm dx. First we consider the case fY(y)>0f _ Y(y)>0 for all yy. We are interested in E(XY=y)\mathbb E(X\mid Y=y). If you are familiar with conditional densities from an undergraduate probability course, you may guess that this is xfXY(x)dx=xf(x,y)dx/fY(y)=:g(y)\int xf _ {X|Y}(x)\, \mathrm dx=\int xf(x,y)\, \mathrm dx\mathbin{/}f _ Y(y)=:g(y). Recall that E(XY=y)\mathbb E(X\mid Y=y) means the value of a function φ(Y)\varphi(Y) at Y=yY=y, hence we shall show that gg can serve as a version of φ\varphi, i.e., that g(Y)g(Y) is the conditional expectation E(XY)\mathbb E(X|Y). On the one hand, g(Y)g(Y) is σ(Y)\sigma(Y)-measurable (gg is a continuous function, and the composition of a Borel function with YY is σ(Y)\sigma(Y)-measurable); on the other hand, we need for any Aσ(Y)A\in\sigma(Y), Ag(Y)dP=AXdP\int _ A g(Y)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P.

Theorem 27. Almost surely we have
\mathbb E(X\mid Y=y)=\int _ {\mathbb R} xf _ {X|Y}(x)\, \mathrm dx=\frac{\int xf(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.

If the set Λ=R2(X,Y)(Ω)\Lambda=\mathbb R^2\setminus (X,Y)(\Omega), the complement of the range of (X,Y)(X,Y), has positive measure, then from the equality
\iint _ {\Lambda}f(x,y)\, \mathrm dx\mathrm dy=\int _ {\varnothing}\mathbb{P}(\mathrm d\omega)=0

we conclude that f(x,y)=0f(x,y)=0 for almost every point of Λ\Lambda.

For any σ(Y)\sigma(Y)-measurable set AA, there exists a Borel set BBB\in\mathcal B such that Y1(B)=AY^{-1}(B)=A, so
\int _ A g(Y)\, \mathrm d\mathbb P=\int _ B g(y)\, \mu _ Y(\mathrm d y)=\int _ B \frac{\int xf(x,y)\, \mathrm dx}{f _ Y(y)}\, \mu _ Y(\mathrm d y)=\iint _ {\mathbb R\times B} xf(x,y)\, \mathrm dx\mathrm dy.
On the other hand,
\int _ A X\, \mathrm d\mathbb P=\iint _ {(X,Y)(A)}xf(x,y)\, \mathrm dx\mathrm dy.

The set (X,Y)(A)(X,Y)(A) is contained in R×B\mathbb R\times B. Are the values of the above two integrals the same? Indeed, the points (x,y)(x,y) in R×B\mathbb R\times B but not in (X,Y)(A)(X,Y)(A) all belong to Λ\Lambda (note that A=Y1(B)A=Y^{-1}(B)), and we derived above that f(x,y)=0f(x,y)=0 for almost all of them. Thus the two integrals are equal.
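Theorem 27 can also be sanity-checked numerically. Take the hypothetical joint density f(x,y)=x+yf(x,y)=x+y on the unit square; then xf(x,y)dx=1/3+y/2\int xf(x,y)\,\mathrm dx=1/3+y/2 and f(x,y)dx=1/2+y\int f(x,y)\,\mathrm dx=1/2+y, so E(X∣Y=y)=(1/3+y/2)/(1/2+y)\mathbb E(X\mid Y=y)=(1/3+y/2)/(1/2+y). A midpoint Riemann sum reproduces this closed form:

```python
def f(x, y):
    # hypothetical joint density on [0, 1]^2
    return x + y

def g(y, N=2000):
    # g(y) = (integral of x f(x, y) dx) / (integral of f(x, y) dx),
    # both over [0, 1], approximated by midpoint sums
    h = 1.0 / N
    xs = [(i + 0.5) * h for i in range(N)]
    num = sum(x * f(x, y) for x in xs) * h
    den = sum(f(x, y) for x in xs) * h
    return num / den

y0 = 0.25
exact = (1 / 3 + y0 / 2) / (1 / 2 + y0)  # closed form for this density
print(g(y0), exact)
```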

For the case where the positivity fY(y)=f(x,y)dx>0f _ Y(y)=\int f(x,y)\, \mathrm dx>0 does not always hold, we can define gg by the relation g(y)f(x,y)dx=xf(x,y)dxg(y)\int f(x,y)\, \mathrm dx=\int xf(x,y)\, \mathrm dx, i.e., g(y)g(y) may take an arbitrary value wherever f(x,y)dx=0\int f(x,y)\, \mathrm dx=0. Note that this is enough for the proof.

More generally, it can be proved in the same way that for integrable h(X)h(X), the conditional expectation E(h(X)Y=y)\mathbb E(h(X)\mid Y=y) is
\mathbb E(h(X)\mid Y=y)=\int _ {\mathbb R} h(x)f _ {X|Y}(x)\, \mathrm dx=\frac{\int h(x)f(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.
In particular, the conditional probability is
\mathbb P(X\in A\mid Y=y)=\mathbb E(\chi _ {\{X\in A\} }\mid Y=y)=\int _ {A} f _ {X|Y}(x)\, \mathrm dx.
In this way we obtain a regular conditional distribution with density fXYf _ {X|Y}.

Theorem 28. For yy with fY(y)>0f _ Y(y)>0, the conditional distribution of XX given Y=yY=y has density
f _ {X|Y}(x)=\frac{f(x,y)}{f _ Y(y)}=\frac{f(x,y)}{\int f(x,y)\, \mathrm dx}.
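As a final check, using the hypothetical density f(x,y)=x+yf(x,y)=x+y on the unit square, Theorem 28 gives fXY(x∣y)=(x+y)/(1/2+y)f _ {X|Y}(x\mid y)=(x+y)/(1/2+y), which indeed integrates to 11 in xx:

```python
def f_cond(x, y):
    # f_{X|Y}(x | y) = f(x, y) / f_Y(y) for the hypothetical density
    # f(x, y) = x + y on [0, 1]^2, whose marginal is f_Y(y) = 1/2 + y
    return (x + y) / (0.5 + y)

N = 1000
h = 1.0 / N
y0 = 0.4
mass = sum(f_cond((i + 0.5) * h, y0) for i in range(N)) * h
print(mass)  # the conditional density integrates to 1 over x in [0, 1]
```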

