The latest updated chapters of the LaTeX document can be read online at the link: probability.pdf
Chapter 4: Independence and conditioning

1. Independence

In elementary probability it is common to introduce conditional probability first and then independence. Here, however, independence is introduced first, without mentioning conditioning; moreover, independent random variables are introduced before independent events, contrary to the usual order. A brief introduction to conditioning is offered in a later section.
Contents
1. Independence
2. Conditional expectations
2.1. Elementary conditional probabilities
2.2. Conditional expectations
2.3. Regular Conditional Distributions
3. Joint distribution with density
\mathbb P\Big(\bigcap _ {i=1}^{n}(X _ i\in B _ i)\Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\in B _ i).
The random variables of an infinite family are said to be independent iff those in every finite subfamily are. They are said to be pairwise independent iff every two of them are independent.
Later we will see that, with the notion of independent σ-algebras, the random variables X_1, …, X_n are independent iff the family of σ-algebras σ(X_1), …, σ(X_n) is independent.
Note that if X_1, …, X_n are independent, then the random variables in every subset of them are also independent, since we may take some of the B_i's as ℝ. On the other hand, the independence condition can be derived from a weaker condition:
\mathbb P\Big(\bigcap _ {i=1}^{n}(X _ i\leqslant x _ i)\Big)=\prod _ {i=1}^{n}\mathbb P(X _ i\leqslant x _ i),\quad \forall x _ 1,\dots,x _ n\in\mathbb R.
The equivalence is not proved here. Written in terms of distribution functions, it is
F(x _ 1,\dots,x _ n)=\prod _ {i=1}^{n}F _ i(x _ i).
\mathbb{P}\Big(\bigcap _ {j=1}^lE _ {i _ j}\Big)=\prod _ {j=1}^{l}\mathbb{P}(E _ {i _ j}).
The equivalence in this definition can be verified directly and is not shown here. The latter is the more common definition, since it does not rely on the notion of independent random variables and is the natural generalization of the case of two independent events.
We prove this theorem (that E(XY) = E(X)E(Y) for independent integrable X and Y) in two ways. The first is standard and therefore longer: first consider discrete X and Y, then arbitrary nonnegative ones; finally the general case follows as usual.
The second proof can be written as follows:
\mathbb E(XY)=\int _ {\Omega}XY\, \mathrm d\mathbb P=\iint _ {\mathbb R^2}xy\, \mu^2(\mathrm dx,\mathrm dy).
Note that μ² = μ_X × μ_Y by independence, where μ_X and μ_Y denote the distributions of X and Y. We have
\mathbb E(XY)=\int _ {\mathbb R}\int _ {\mathbb R}xy\, \mu _ X(\mathrm dx)\mu _ Y(\mathrm dy)=\int _ {\mathbb R}x\, \mu _ X(\mathrm dx)\int _ {\mathbb R}y\, \mu _ Y(\mathrm dy)=\mathbb E(X)\mathbb E(Y),
finishing the proof! Observe that we are using here a very simple form of Fubini's theorem (see below). Indeed, the second proof appears so much shorter only because it relies on the theory of ''product measure": one can check that the distribution of (X, Y) is the product measure μ_X × μ_Y iff X and Y are independent. Details on product measures are not discussed further here.
The above definition of independent events is a kind of abstraction of our common sense. However, it is sometimes not easy to judge independence or dependence by intuition alone. Consider the example of rolling two dice. Denote by A_k the event ''the sum of the face values is a multiple of k". It can be verified that A_2 and A_3 are independent, while A_2 and A_4 are not; neither fact is obvious without calculation.
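The dice example can be checked by brute-force enumeration. The sketch below (in Python; the helper names `prob` and `A` are ours) verifies one standard choice of events: A_k is ''the sum of the face values is a multiple of k", and A_2, A_3 come out independent while A_2, A_4 do not.

```python
from itertools import product
from fractions import Fraction

# Sample space: ordered pairs of faces of two dice; uniform measure on 36 points.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    # Probability of an event (a subset of the sample space).
    return Fraction(sum(1 for w in omega if w in event), len(omega))

def A(k):
    # A_k: "the sum of the two face values is a multiple of k".
    return {w for w in omega if (w[0] + w[1]) % k == 0}

# P(A_2 ∩ A_3) = 1/6 = P(A_2) P(A_3): independent.
print(prob(A(2) & A(3)) == prob(A(2)) * prob(A(3)))  # True
# P(A_2 ∩ A_4) = 1/4 while P(A_2) P(A_4) = 1/8: not independent.
print(prob(A(2) & A(4)) == prob(A(2)) * prob(A(4)))  # False
```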
It is well worth pointing out that pairwise independence does not imply mutual independence. Consider the following example. We set Ω = {ω_1, ω_2, ω_3, ω_4} and each sample point is assigned a probability of 1/4. Next, let A = {ω_1, ω_2}, B = {ω_1, ω_3}, C = {ω_1, ω_4}; then P(A∩B) = P(A∩C) = P(B∩C) = 1/4 = (1/2)·(1/2), so A, B, C are pairwise independent. However, they are not mutually independent, since P(A∩B∩C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
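This four-point example (often attributed to Bernstein) is small enough to check mechanically; the sketch below assumes the labels A, B, C as in the text.

```python
from fractions import Fraction

# Bernstein-type example: four equally likely sample points.
p = {w: Fraction(1, 4) for w in (1, 2, 3, 4)}
A, B, C = {1, 2}, {1, 3}, {1, 4}

def prob(E):
    return sum(p[w] for w in E)

# Every pair satisfies the product rule, but the triple does not.
pairs_ok = all(prob(E & F) == prob(E) * prob(F) for E, F in [(A, B), (A, C), (B, C)])
triple_ok = prob(A & B & C) == prob(A) * prob(B) * prob(C)
print(pairs_ok, triple_ok)  # True False
```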
From the above examples we may feel that intuition can mislead us. But, as mentioned before, the definition is a mathematical abstraction of ''independence" in daily life, so it should not be surprising that mathematical independence differs somewhat from its intuitive background.
The notion of independence is extremely important in probability theory. It was central to the early development of the subject (say, until the 1930s); today a number of theories for non-independent models have been developed, though they are still far less complete. Moreover, the theory and methods built on independence are the basis and the tools for studying non-independent models. In practice, there are indeed many events whose dependence is so small that they can be treated as independent within the tolerance of error, which facilitates the solution of problems.
Now let us state the fundamental existence theorem of product measures. Since the proof is somewhat involved, it is not given here.
Now a famous theorem, Fubini's theorem, will be stated. In the earlier review of real analysis this theorem was given in a special case (Lebesgue measure only), so a more general version is provided here.
Let (X, 𝒮, μ) and (Y, 𝒯, λ) be σ-finite measure spaces, and let f be an (𝒮×𝒯)-measurable function on X×Y. Then: for each x ∈ X, the section f_x = f(x,·) is a 𝒯-measurable function, and for each y ∈ Y, the section f^y = f(·,y) is an 𝒮-measurable function.
- If 0 ⩽ f ⩽ ∞, and if
\varphi(x)=\int _ Y f(x,\cdot)\, \mathrm d\lambda,\quad \psi(y)=\int _ X f(\cdot, y)\, \mathrm d\mu,\quad (x\in X,\, y\in Y),then φ is 𝒮-measurable, ψ is 𝒯-measurable, and
\int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda. - If f is complex and if
\varphi^\ast(x)=\int _ Y|f(x,\cdot)|\, \mathrm d\lambda,\quad \int _ X\varphi^\ast\, \mathrm d\mu<\infty,then f ∈ L¹(μ×λ).
- If f ∈ L¹(μ×λ), then f_x ∈ L¹(λ) for almost all x ∈ X, and f^y ∈ L¹(μ) for almost all y ∈ Y; the functions φ and ψ, defined by the formulas above almost everywhere, are in L¹(μ) and L¹(λ), respectively, and it still holds that
\int _ X \varphi\, \mathrm d\mu=\int _ {X\times Y}f\, \mathrm d(\mu\times \lambda)=\int _ Y\psi\, \mathrm d\lambda.
If we consider the completion of measures, we have an alternative statement of Fubini's theorem.
The 𝒯-measurability of f_x can then be asserted only for almost all x, so that φ is only defined a.e. with respect to μ; a similar statement holds for f^y and ψ.
2. Conditional expectations

2.1. Elementary conditional probabilities

For A, B ∈ 𝒜 with P(B) > 0, the conditional probability of A given B is defined by
\mathbb P(A| B)=\frac{\mathbb P(A\cap B)}{\mathbb P(B)}.
If P(B) > 0, it is obvious that P(·|B) is a probability measure on (Ω, 𝒜).
Let A, B ∈ 𝒜 with P(B) > 0. Then ''A and B are independent" is equivalent to P(A∩B) = P(A)P(B), and this is also equivalent to P(A|B) = P(A). This is the most common definition of independent events in elementary probability, since it can be interpreted well from our intuitive background. The conditional probability P(A|B) often differs from the unconditional probability P(A), reflecting that the two events have some kind of dependence. If they are the same, then the occurrence of B does not influence the probability of occurrence of A at all.
Let {B_i}_{i∈I} be a countable partition of Ω with P(B_i) > 0 for each i. Then for any A ∈ 𝒜 (the law of total probability),
\begin{equation}\label{31} \mathbb P(A)=\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i). \end{equation}
For any such partition {B_i}_{i∈I} and any A with P(A) > 0,
\begin{equation}\label{32}
\mathbb P(B _ k| A)=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\mathbb P(A)}=\frac{\mathbb P(A| B _ k)\mathbb P(B _ k)}{\sum _ {i\in I}\mathbb P(A| B _ i)\mathbb P(B _ i)}.
\end{equation}
From the derivation you may think (32) is somewhat trivial. However, (32) is a well-known formula in probability theory, named Bayes' formula (or alternatively Bayes' theorem, Bayes' law, Bayes' rule, etc.), owing to its practical and even philosophical significance. The numbers P(B_i) are the estimated probabilities of occurrence without further information (about the occurrence of the event A). With the new information that A has occurred, the probability of each B_k receives a new estimate P(B_k|A). This is common in daily life: an event previously considered nearly impossible can be made very probable by the occurrence of another event, and vice versa. Bayes' formula characterizes this quantitatively.
If the event A is treated as a ''result", and the B_i's as possible causes of this result, we can interpret (31) in some sense as ''inferring the result from the causes"; Bayes' formula does the opposite, for it ''infers the causes from the result". Knowing that the result has occurred, we may ask which of the many possible causes led to it. This is a very common question in daily life and in research, and Bayes' formula asserts that the probability of the cause B_k is proportional to P(A|B_k)P(B_k).
From the discussion above, it is not surprising that Bayes' formula plays a remarkable role in the field of statistics.
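As a concrete illustration of (31) and (32), the sketch below uses two hypothetical causes with made-up numbers: priors P(B_1) = 3/4, P(B_2) = 1/4 and likelihoods P(A|B_1) = 1/10, P(A|B_2) = 4/5. Observing A turns the initially unlikely cause B_2 into the likely one.

```python
from fractions import Fraction

# Hypothetical priors P(B_i) and likelihoods P(A|B_i); the numbers are made up.
prior = {1: Fraction(3, 4), 2: Fraction(1, 4)}
lik = {1: Fraction(1, 10), 2: Fraction(4, 5)}

# Law of total probability (31): P(A) = sum_i P(A|B_i) P(B_i).
pA = sum(lik[i] * prior[i] for i in prior)

# Bayes' formula (32): P(B_k|A) = P(A|B_k) P(B_k) / P(A).
post = {k: lik[k] * prior[k] / pA for k in prior}

print(pA, post[1], post[2])  # 11/40 3/11 8/11
```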
For the purposes of the next subsection, we consider a naive example of conditional expectation.
If X is a random variable with finite expectation and A ∈ 𝒜 with P(A) > 0, then the expectation of X with respect to the probability measure P(·|A) can be written as
\mathbb E(X|A)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|A):=\frac1{\mathbb P(A)}\int _ A X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ A X)}{\mathbb P(A)}.
Clearly, E(χ_B|A) = P(B|A) for all B ∈ 𝒜.
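A quick numerical instance of E(X|A) = E(χ_A X)/P(A), on a four-point space with made-up weights:

```python
from fractions import Fraction

# Four-point space with made-up weights and a made-up random variable X.
p = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
X = {1: 1, 2: 2, 3: 3, 4: 4}
A = {2, 3, 4}

pA = sum(p[w] for w in A)                 # P(A) = 1/2
EXA = sum(X[w] * p[w] for w in A) / pA    # E(chi_A X) / P(A)
print(EXA)  # 11/4
```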
2.2. Conditional expectations
Let X be uniformly distributed on [0,1], that is, X has density χ_[0,1] over ℝ. Assume that, given the information {X = x}, the random variables Y_1, …, Y_n are independent and each has the Bernoulli distribution with parameter x, i.e., P(Y_i = 1|X = x) = x and P(Y_i = 0|X = x) = 1 − x. (Although these very common distributions have not been reviewed so far, they are used here for the convenience of a better explanation of the motivation.) So far, something like P(·|X = x) has not been defined, since P(X = x) = 0. However, we need it, since in this example the distributions of the Y_i's would be determined under the condition {X = x}.
We first consider a more general situation in this section: let 𝓕 ⊂ 𝒜 be a sub-σ-algebra and X an integrable random variable. The conditional expectation of X given 𝓕, written E(X|𝓕), is defined by two requirements:
- it is 𝓕-measurable;
- it has the same integral as X over any set in 𝓕, i.e.,
\int _ A\mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P,\quad \forall A\in\mathcal F,or equivalently E(χ_A E(X|𝓕)) = E(χ_A X).
For B ∈ 𝒜, P(B|𝓕) := E(χ_B|𝓕) is called a conditional probability of B given 𝓕.
For the uniqueness: let Y and Y' both satisfy the two properties, and let A = {Y > Y'}. Clearly A ∈ 𝓕. Then ∫_A (Y − Y') dP = 0. Hence P(Y > Y') = 0; interchanging Y and Y', we conclude that Y = Y' a.e. For the existence, consider the set function ν on 𝓕: for any A ∈ 𝓕, ν(A) = ∫_A X dP. It is finite-valued and countably additive, hence a ''signed measure" on 𝓕. If P(A) = 0, then ν(A) = 0; hence it is absolutely continuous with respect to P: ν ≪ P. The existence then follows from the Radon-Nikodym theorem, the resulting ''derivative" dν/dP being what we desired.
If Y is a random variable and X is integrable, then we write E(X|Y) := E(X|σ(Y)).
The next theorem gathers some properties of conditional expectation.
- (Linearity) E(aX + bY|𝓕) = a E(X|𝓕) + b E(Y|𝓕).
- (Monotonicity) If X ⩽ Y a.s., then E(X|𝓕) ⩽ E(Y|𝓕).
- If XY is integrable and Y is 𝓕-measurable, then
\mathbb E(XY|\mathcal F)=Y\, \mathbb E(X|\mathcal F),\quad \mathbb E(Y|\mathcal F)=\mathbb E(Y|Y)=Y. - (Tower property) Let 𝒢 ⊂ 𝓕 be sub-σ-algebras. Then
\mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)=\mathbb E(\mathbb E(X|\mathcal G)\mid\mathcal F)=\mathbb E(X|\mathcal G).Hence E(X|𝓕) = X if and only if X is 𝓕-measurable.
- (Triangle inequality) |E(X|𝓕)| ⩽ E(|X| | 𝓕).
- (Independence) If σ(X) and 𝓕 are independent, then E(X|𝓕) = E(X).
- If P(A) = 0 or 1 for any A ∈ 𝓕, then E(X|𝓕) = E(X).
- (Dominated convergence) Assume |X_n| ⩽ Y with E(Y) < ∞ and X_n → X a.s. Then
\lim _ {n\to\infty}\mathbb E(X _ n|\mathcal F)=\mathbb E(X|\mathcal F)\quad \text{a.s. and in }L^1(\mathbb P).
For (3), as usual we may suppose X ⩾ 0 and Y ⩾ 0. The proof consists in observing that Y E(X|𝓕) is 𝓕-measurable and satisfies the defining relation ∫_A Y E(X|𝓕) dP = ∫_A XY dP for all A ∈ 𝓕. This is true if Y = χ_B where B ∈ 𝓕, hence it is true if Y is a simple 𝓕-measurable random variable, and consequently also for general Y by monotone convergence, whether the limits are finite or positive infinite. Note that the integrability of Y E(X|𝓕) is part of the assertion of the property. The second equality follows (it can also be seen easily from the definition, since the defining relation holds trivially).
(4) may be the most important property of conditional expectation, relating conditional expectations under changing σ-algebras. The second equality E(E(X|𝒢)|𝓕) = E(X|𝒢) follows from (3), since E(X|𝒢) is 𝓕-measurable. For the first, let A ∈ 𝒢; then also A ∈ 𝓕. We apply the defining relation twice:
\int _ A \mathbb E(\mathbb E(X|\mathcal F)\mid\mathcal G)\, \mathrm d\mathbb P=\int _ A \mathbb E(X|\mathcal F)\, \mathrm d\mathbb P=\int _ A X\, \mathrm d\mathbb P.
Hence E(E(X|𝓕)|𝒢) satisfies the defining relation for E(X|𝒢). Since it is 𝒢-measurable, it is equal to the latter.
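On a finite space, conditional expectation given the σ-algebra generated by a partition is just block-averaging, and the tower property (4) can be checked directly. The sketch below assumes uniform weights and a hypothetical pair of nested partitions (the finer one generates 𝓕, the coarser one 𝒢 ⊂ 𝓕):

```python
from fractions import Fraction

omega = list(range(8))
p = {w: Fraction(1, 8) for w in omega}
X = {w: w * w for w in omega}  # an arbitrary random variable

fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]   # generates F
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]     # generates G, with G ⊂ F

def cond_exp(f, partition):
    # E(f | sigma(partition))(w) = average of f over the block containing w.
    out = {}
    for block in partition:
        pb = sum(p[w] for w in block)
        avg = sum(f[w] * p[w] for w in block) / pb
        for w in block:
            out[w] = avg
    return out

inner = cond_exp(X, fine)        # E(X|F)
lhs = cond_exp(inner, coarse)    # E(E(X|F)|G)
rhs = cond_exp(X, coarse)        # E(X|G)
print(lhs == rhs)  # True
```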
For (6), let A ∈ 𝓕. Then χ_A and X are independent. Recalling the product property of expectations of independent random variables, we have ∫_A X dP = E(χ_A X) = E(χ_A) E(X) = ∫_A E(X) dP, so the constant E(X) satisfies the defining relation.
The proof of (8) is omitted here.
Let E(X²) < ∞. Then for every 𝓕-measurable Y with E(Y²) < ∞,
\mathbb E(X-Y)^2\geqslant\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right]
with equality if and only if Y = E(X|𝓕) a.s.; that is, E(X|𝓕) is the best L²-approximation of X among 𝓕-measurable random variables.
By Jensen's inequality (which will be provided later), E(X|𝓕)² ⩽ E(X²|𝓕), so E(X|𝓕) ∈ L²(P); by the Cauchy-Schwarz inequality, XY and X E(X|𝓕) are integrable. Since E(X|𝓕) is 𝓕-measurable, by property (3),
\mathbb E[\mathbb E(X|\mathcal F)Y]=\mathbb E(XY),\quad \mathbb E[X\, \mathbb E(X|\mathcal F)]=\mathbb E\{\mathbb E[X\, \mathbb E(X|\mathcal F)\mid\mathcal F]\}=\mathbb E\left[\mathbb E(X|\mathcal F)^2\right].
Therefore, we have
\begin{align*}
\mathbb E(X-Y)^2-\mathbb{E}\left[(X-\mathbb E(X|\mathcal F))^2\right] & =\mathbb{E}\left[X^2-2XY+Y^2-X^2+2X\mathbb E(X|\mathcal F)-\mathbb E(X|\mathcal F)^2\right] \\
& =\mathbb{E}\left[Y^2-2Y\mathbb E(X|\mathcal F)+\mathbb E(X|\mathcal F)^2\right]\\
& =\mathbb E\left[(Y-\mathbb E(X|\mathcal F))^2\right]\geqslant 0.
\end{align*}
\mathbb E(|XY|\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F).
This inequality is given here to illustrate the caution that is necessary in handling conditional expectations. If we consider the positivity of E((X+λY)²|𝓕), the problem arises that for each λ this holds only for almost all ω, i.e., only up to a P-null set N_λ. The union ⋃_{λ∈ℝ} N_λ cannot be ignored without comment. We get around this difficulty by restricting λ to the rational numbers. Let N = ⋃_{λ∈ℚ} N_λ. Clearly N is a P-null set, and the positivity of the quadratic form in λ holds for every ω ∉ N and every λ ∈ ℚ. For ω ∉ N,
\begin{align*}
& \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E((X+\lambda Y)^2\mid\mathcal F)(\omega)\Big]\geqslant0, \\
\Longrightarrow{} & \inf _ {\lambda\in\mathbb Q}\Big[\mathbb E(Y|\mathcal F)^2(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\
\Longrightarrow{} & \inf _ {\lambda\in\mathbb R}\Big[\mathbb E(Y|\mathcal F)^2(\omega)\lambda^2+2\, \mathbb E(XY|\mathcal F)(\omega)\lambda+\mathbb E(X^2|\mathcal F)(\omega)\Big]\geqslant0,\\
\Longrightarrow{} & \mathbb E(|XY|\mid\mathcal F)^2\leqslant\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)\quad \text{a.s.}
\end{align*}
The difficulty can also be avoided by the following alternative argument. On the set where E(X²|𝓕) > 0 and E(Y²|𝓕) > 0, since these are 𝓕-measurable, the elementary inequality ab ⩽ (a² + b²)/2 together with property (3) gives
\mathbb E\Big(\frac{|XY|}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\Big|\mathcal F\Big)\leqslant\frac12\mathbb E\Big(\frac{X^2}{\mathbb E(X^2|\mathcal F)}\Big|\mathcal F\Big)+\frac12\mathbb E\Big(\frac{Y^2}{\mathbb E(Y^2|\mathcal F)}\Big|\mathcal F\Big).
Hence
\frac{\mathbb E(|XY|\mid\mathcal F)}{\sqrt{\mathbb E(X^2|\mathcal F)\mathbb E(Y^2|\mathcal F)}}\leqslant\frac12+\frac12=1.
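On a finite space the null-set subtleties disappear, and the conditional Cauchy-Schwarz inequality can be verified pointwise; the numbers below are arbitrary.

```python
from fractions import Fraction

omega = list(range(6))
p = {w: Fraction(1, 6) for w in omega}
X = {0: 1, 1: -2, 2: 3, 3: 0, 4: 2, 5: -1}
Y = {0: 2, 1: 1, 2: -1, 3: 4, 4: 0, 5: 3}
partition = [{0, 1, 2}, {3, 4, 5}]  # generates F

def cond_exp(f):
    # Block-averaging gives E(f|F) on a finite space.
    out = {}
    for block in partition:
        pb = sum(p[w] for w in block)
        avg = sum(f[w] * p[w] for w in block) / pb
        for w in block:
            out[w] = avg
    return out

lhsv = cond_exp({w: abs(X[w] * Y[w]) for w in omega})   # E(|XY| | F)
X2c = cond_exp({w: X[w] ** 2 for w in omega})           # E(X^2|F)
Y2c = cond_exp({w: Y[w] ** 2 for w in omega})           # E(Y^2|F)
ok = all(lhsv[w] ** 2 <= X2c[w] * Y2c[w] for w in omega)
print(ok)  # True
```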
(Jensen's inequality) Let φ: ℝ → ℝ be a convex function and X an integrable random variable. Then
\varphi(\mathbb E(X|\mathcal F))\leqslant\mathbb E(\varphi(X)|\mathcal F).
Proof. One may worry that the conditional expectation of φ(X) does not exist. However, since φ is convex, φ(x) ⩾ ax + b for some constants a, b, so the negative part of φ(X) is integrable and E(φ(X)|𝓕) exists in a generalized sense. The details of this are not given here; the reader is advised to check them in deeper materials.
The right derivative of a convex function always exists. Denote the right derivative of φ by φ'_+; then for all x, y we have φ(y) ⩾ φ(x) + φ'_+(x)(y − x). Replacing y by X and taking conditional expectations, by the linearity and monotonicity of (generalized) conditional expectations we have: for each x ∈ ℚ, there exists a P-null set N_x such that for every ω ∉ N_x,
\mathbb E [ \varphi(X)|\mathcal F ] (\omega)\geqslant\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x].
Let N = ⋃_{x∈ℚ} N_x; then N is also a P-null set, and the inequality holds for every ω ∉ N and every x ∈ ℚ. The right derivative φ'_+ is right continuous, which can be checked as follows: φ'_+(x) can be written as inf_{h>0} g(x, h), where g(x, h) = (φ(x+h) − φ(x))/h is increasing with respect to h, so φ'_+(t) ⩾ φ'_+(x) for t > x; we just need lim_{t↓x} φ'_+(t) ⩽ φ'_+(x), and this follows from φ'_+(t) ⩽ g(t, h) → g(x, h) as t ↓ x, for each fixed h > 0. Thus we have proved lim_{t↓x} φ'_+(t) = φ'_+(x), i.e., φ'_+ is right continuous.
Now we can see that x ↦ φ(x) + φ'_+(x)(E(X|𝓕)(ω) − x) is also right continuous. Taking its supremum over x ∈ ℚ therefore gives the same value as over x ∈ ℝ. Note that if we take x = E(X|𝓕)(ω), the expression becomes φ(E(X|𝓕)(ω)). Hence for every ω ∉ N, we have
\varphi(\mathbb E(X|\mathcal F)(\omega))\leqslant\sup _ {x\in\mathbb R}\Big[\varphi(x)+\varphi _ +'(x)[\mathbb E(X|\mathcal F)(\omega)-x]\Big]\leqslant\mathbb E [ \varphi(X)|\mathcal F ] (\omega).
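Jensen's inequality for conditional expectations can likewise be checked on a finite space, here with the convex function φ(t) = t² and arbitrary values:

```python
from fractions import Fraction

omega = [0, 1, 2, 3]
p = {w: Fraction(1, 4) for w in omega}
X = {0: -1, 1: 3, 2: 0, 3: 2}
partition = [{0, 1}, {2, 3}]  # generates the sigma-algebra F

def cond_exp(f):
    # Block-averaging: E(f|F)(w) is the average of f over the block containing w.
    out = {}
    for block in partition:
        pb = sum(p[w] for w in block)
        avg = sum(f[w] * p[w] for w in block) / pb
        for w in block:
            out[w] = avg
    return out

def phi(t):
    return t * t  # a convex function

EX = cond_exp(X)
EphiX = cond_exp({w: phi(X[w]) for w in omega})
print(all(phi(EX[w]) <= EphiX[w] for w in omega))  # True
```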
2.3. Regular Conditional Distributions

Let Y be a random variable with values in a measurable space (E, 𝓔); that is, Y is 𝒜-𝓔-measurable. So far we can define the conditional probability P(Y ∈ B|𝓕)(ω) for fixed B ∈ 𝓔 only. However, we would like to define for every ω a probability measure κ(ω, ·) on (E, 𝓔) such that for any B ∈ 𝓔 we have κ(ω, B) = P(Y ∈ B|𝓕)(ω) for almost all ω, where 𝓕 ⊂ 𝒜 is a sub-σ-algebra. In this subsection, we show how to do this.
(Factorization lemma) Let Y be a map from Ω to a measurable space (E, 𝓔), and let X be a σ(Y)-measurable real random variable. It can be proved that there is a map φ: E → ℝ̄ such that: 1. φ is 𝓔-measurable; 2. X = φ ∘ Y.
We now prove this lemma. First consider the case X ⩾ 0. Then X can be written as X = Σ_n α_n χ_{A_n} with α_n ⩾ 0 and A_n ∈ σ(Y). (We know that a nonnegative measurable function is the limit of an increasing sequence of nonnegative measurable simple functions; then it is clear that X can be written as a countable sum of nonnegative measurable simple functions, which is also a sum of scaled indicators α_n χ_{A_n}.) By the definition of σ(Y), σ(Y) = Y^{-1}(𝓔), which means that for each n there is B_n ∈ 𝓔 such that A_n = Y^{-1}(B_n). Hence χ_{A_n} = χ_{B_n} ∘ Y.
Now define φ: E → [0, ∞] by φ = Σ_n α_n χ_{B_n}. Clearly, φ is 𝓔-𝓑-measurable and X = φ ∘ Y.
Now drop the assumption that X is nonnegative. Then there exist measurable maps φ⁺ and φ⁻ such that X⁺ = φ⁺ ∘ Y and X⁻ = φ⁻ ∘ Y. Note that X⁺(ω) ∧ X⁻(ω) = 0 for all ω. Hence we just need to define φ = φ⁺ − φ⁻ where at least one of φ⁺, φ⁻ is finite, and φ = 0 elsewhere; at each point Y(ω) the values φ⁺(Y(ω)) = X⁺(ω) and φ⁻(Y(ω)) = X⁻(ω) cannot both be infinite, so X = φ ∘ Y still holds.
Let (E, 𝓔) = (ℝ, 𝓑) and let Y be a real random variable; we obtain the map φ mentioned above. Now set 𝓕 = σ(Y); then the σ(Y)-measurable random variable E(X|σ(Y)) is given by φ ∘ Y for some Borel φ. From this observation, we can define E(X|Y = y):

E(X|Y = y) := φ(y), for y ∈ ℝ.
Analogously, define P(A|Y = y) := E(χ_A|Y = y) for A ∈ 𝒜.
For B with P(B) > 0, we have known that P(·|B) is a probability measure. Is the same true for B ↦ P(B|𝓕)(ω)? The question is a bit tricky, since for every given B, the expression P(B|𝓕)(ω) is defined for almost all ω only, i.e., only up to ω in a null set depending on B. It seems that we would have some difficulty dealing with this. But let us first take a look at some useful definitions.
Let (Ω₁, 𝒜₁) and (Ω₂, 𝒜₂) be measurable spaces. A map κ: Ω₁ × 𝒜₂ → [0, ∞] is called a transition kernel if:

- ω₁ ↦ κ(ω₁, A₂) is 𝒜₁-measurable for any A₂ ∈ 𝒜₂;
- A₂ ↦ κ(ω₁, A₂) is a (σ-)finite measure on (Ω₂, 𝒜₂) for any ω₁ ∈ Ω₁.
If in (2) the measure is a probability measure for all ω₁ ∈ Ω₁, then κ is called a stochastic kernel or a Markov kernel.
For a transition kernel, if we also have κ(ω₁, Ω₂) ⩽ 1 for any ω₁ ∈ Ω₁, then κ is called sub-Markov or substochastic.
A stochastic kernel κ_{Y,𝓕} from (Ω, 𝓕) to (E, 𝓔) is called a regular conditional distribution of Y given 𝓕 if
\kappa _ {Y,\mathcal F}(\omega,B)=\mathbb P(\{Y\in B\}\mid \mathcal F)(\omega)\quad \text{a.e. for all }B\in\mathcal E,
that is, if
\int _ A \chi _ B(Y)\, \mathrm d\mathbb P=\int _ A \kappa _ {Y,\mathcal F}(\cdot,B)\, \mathrm d\mathbb P\quad \text{for all }A\in\mathcal F,\, B\in\mathcal E.
In short, the function κ_{Y,𝓕} is called a regular conditional distribution of Y given 𝓕 if: 1. ω ↦ κ_{Y,𝓕}(ω, B) is a version of P(Y ∈ B|𝓕) for each B ∈ 𝓔; 2. B ↦ κ_{Y,𝓕}(ω, B) is a probability measure on (E, 𝓔) for each ω ∈ Ω.
Consider the special case 𝓕 = σ(X) for a random variable X (with values in an arbitrary measurable space (E', 𝓔')). Define the regular conditional distribution of Y given X by the Markov kernel
(x,A)\mapsto\kappa _ {Y,X}(x,A):=\mathbb P(\{Y\in A\}\mid X=x)=\kappa _ {Y,\sigma(X)}(\omega,A)\quad\text{for }\omega\in X^{-1}(\{x\}),
and if such an ω does not exist (i.e., x is not in the range of X), we assign an arbitrary value.
For regular conditional distributions in ℝ, we have the following theorem: if Y is a real random variable and 𝓕 ⊂ 𝒜 is a sub-σ-algebra, then a regular conditional distribution of Y given 𝓕 exists. For the proof, we refer to other materials.
We are also interested in the situation where Y takes values in ℝⁿ or in even more general spaces. We now extend the result to a larger class of ranges for Y. More definitions are needed, but they are only briefly stated here. A measurable space (E, 𝓔) is called a Borel space if there exist a Borel set B ∈ 𝓑(ℝ) and a one-to-one map φ: E → B such that φ is 𝓔-𝓑(B)-measurable and the inverse map φ^{-1} is 𝓑(B)-𝓔-measurable. In general topology, a Polish space is a separable completely metrizable topological space (i.e., a separable topological space whose topology is induced by a complete metric). If E is a Polish space with Borel σ-algebra 𝓔, then (E, 𝓔) is a Borel space.
The resulting theorem reads: if Y takes values in a Borel space (E, 𝓔), then a regular conditional distribution of Y given 𝓕 exists. For the proof, we refer to other materials.
To conclude, we pick up again the example with which we started. Define Y = (Y_1, …, Y_n). By the theorem above (with E = {0, 1}ⁿ, which is a Borel space), a regular conditional distribution exists:
\kappa _ {Y,X}(x,\cdot)=\mathbb P(Y\in\cdot\mid X=x),\quad x\in[0,1].
Indeed, for almost all x ∈ [0, 1], κ_{Y,X}(x, ·) can be calculated to be the n-fold product of Bernoulli distributions with parameter x.
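This regular conditional distribution can be probed by simulation: draw X uniformly, then Y_1, …, Y_n as Bernoulli(X), and look at X conditioned on the number of successes. The check below uses the known beta-binomial fact E(X | ΣY_i = k) = (k+1)/(n+2), a consequence of the Beta posterior; the tolerance is a loose Monte Carlo bound.

```python
import random

# Monte Carlo probe of the conditional law of X given the Y's.
random.seed(0)
n, k, trials = 5, 3, 200_000
cond_samples = []
for _ in range(trials):
    x = random.random()                              # X ~ U[0,1]
    s = sum(random.random() < x for _ in range(n))   # given X = x: Y_i ~ Bernoulli(x)
    if s == k:
        cond_samples.append(x)

est = sum(cond_samples) / len(cond_samples)
target = (k + 1) / (n + 2)   # beta-binomial closed form for E(X | sum = k)
print(round(est, 3), round(target, 3))
```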
If κ_{X,𝓕} is a regular conditional distribution of a real random variable X given 𝓕, and f is a measurable function with f(X) integrable, then
\mathbb E(f(X)|\mathcal F)(\omega)=\int f(x)\, \kappa _ {X,\mathcal F}(\omega,\mathrm dx)\quad\text{for }\mathbb P\text{-almost all }\omega.
For the proof, we refer to other materials.
3. Joint distribution with density
In this section we consider more about jointly distributed random variables with density. Consider a family of random variables (X_i)_{i∈I} and let J = {j_1, …, j_n} be a finite subset of I. We have defined the (joint) distribution of X_J = (X_{j_1}, …, X_{j_n}). The joint distribution function F_J of X_J generalizes easily the one-dimensional case. It is also possible to define the (joint) density of X_J, that is, a function f_J such that
F _ J(\boldsymbol x)=\int _ {-\infty}^{x _ {j _ 1}}\dots\int _ {-\infty}^{x _ {j _ n}}f _ J(t _ 1,\dots,t _ n)\, \mathrm dt _ 1\dots\mathrm dt _ n,\quad \forall \boldsymbol x\in\mathbb R^J.
If we further assume that f_J is continuous, then the independence of X_{j_1}, …, X_{j_n} can be written in terms of the joint density:
f _ J(\boldsymbol x)=\prod _ {j\in J}f _ j(x _ j)\quad\forall\boldsymbol x\in\mathbb R^J,
where f_j is the marginal density of X_j, deduced from the joint density by integrating out the other variables x_k, k ∈ J∖{j}.
For discrete random variables, it can be seen easily that the probability mass function also characterizes independence:
p _ J(\boldsymbol x)=\prod _ {j\in J}p _ j(x _ j)\quad \forall \boldsymbol x\in\boldsymbol x(S).
Now we consider conditional expectations and conditional distributions for discrete random variables and continuous random variables. If A is any set in 𝒜 with P(A) > 0, we have known that P(·|A), defined by P(B|A) = P(A∩B)/P(A), is a probability measure.
In the discrete case, let Y be discrete with distinct values y_1, y_2, …, and let Ω_n = {Y = y_n}, so that {Ω_n} is a partition of Ω; we then consider E(X|σ(Y)). We will see that the definition in Section 2.2 coincides with the one in Section 2.1. In Section 2.1 we have defined such a conditional expectation
\mathbb E(X|\Omega _ n)=\int _ \Omega X(\omega)\, \mathbb P(\mathrm d\omega|\Omega _ n)=\frac1{\mathbb P(\Omega _ n)}\int _ {\Omega _ n} X(\omega)\, \mathbb P(\mathrm d\omega)=\frac{\mathbb E(\chi _ {\Omega _ n} X)}{\mathbb P(\Omega _ n)},
which means that E(X|Ω_n) = E(χ_{Ω_n} X)/P(Ω_n). In the context of Section 2.2, we prove E(X|σ(Y)) = Σ_n E(X|Ω_n) χ_{Ω_n}, i.e., E(X|σ(Y))(ω) = E(X|Ω_n) for ω ∈ Ω_n. The right-hand side is clearly σ(Y)-measurable, so only the defining relation needs to be checked, and it suffices to verify it for A = Ω_m. The left term equals ∫_{Ω_m} X dP = E(χ_{Ω_m} X), and the right term equals E(X|Ω_m) P(Ω_m). From the equality above we know the defining relation holds. From E(X|σ(Y)) = Σ_n E(X|Ω_n) χ_{Ω_n}, we can obtain again E(X|Y = y_n) = E(X|Ω_n).
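The identity E(X|σ(Y)) = Σ_n E(X|Ω_n) χ_{Ω_n} can be tested mechanically: build the candidate from block averages and verify the defining relation on every set of the generated σ-algebra (all unions of blocks). The weights and values below are arbitrary.

```python
from fractions import Fraction
from itertools import combinations

omega = list(range(6))
p = {w: Fraction(1, 6) for w in omega}
X = {0: 2, 1: -1, 2: 4, 3: 0, 4: 1, 5: 3}
blocks = [{0, 1}, {2, 3, 4}, {5}]  # partition generating the sigma-algebra

# Candidate: Y = sum_n E(X|Omega_n) chi_{Omega_n}.
Y = {}
for b in blocks:
    pb = sum(p[w] for w in b)
    e = sum(X[w] * p[w] for w in b) / pb
    for w in b:
        Y[w] = e

def integral(f, A):
    return sum(f[w] * p[w] for w in A)

# Check the defining relation over every union of blocks.
ok = True
for r in range(len(blocks) + 1):
    for combo in combinations(blocks, r):
        A = set().union(*combo) if combo else set()
        ok = ok and integral(X, A) == integral(Y, A)
print(ok)  # True
```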
Now we focus more on the continuous case, where the random variables have a continuous density. For simplicity we consider real-valued X and Y with joint density f. Denote the marginal density of Y by f_Y(y) = ∫ f(x, y) dx. First we consider the case f_Y(y) > 0 for all y. We are interested in E(X|Y). If you have some acquaintance with conditional densities from an undergraduate probability course, you may guess that this is g(Y), where g(y) = ∫ x f(x, y) dx / f_Y(y). Recall that E(X|Y = y) means the value at y of the factorizing function φ; hence we shall show that g can serve as such a φ, i.e., that g(Y) is just the conditional expectation E(X|Y). On the one hand, g(Y) is σ(Y)-measurable (g is a Borel function, since f is continuous and the integrals depend measurably on y, and Y is measurable, so g(Y) is σ(Y)-measurable); on the other hand, we need, for any A ∈ σ(Y), ∫_A g(Y) dP = ∫_A X dP.
\mathbb E(X\mid Y=y)=\int _ {\mathbb R} xf _ {X|Y}(x)\, \mathrm dx=\frac{\int xf(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.
If the set Λ = {(x, y): f_Y(y) = 0} has positive measure, then from the equality
\iint _ {\Lambda}f(x,y)\, \mathrm dx\mathrm dy=\mathbb P\big(f _ Y(Y)=0\big)=\int _ {\{y:\, f _ Y(y)=0\}}f _ Y(y)\, \mathrm dy=0
we know that f(x, y) = 0 at almost all points of Λ.
For any σ(Y)-measurable set A, there exists a Borel set B such that A = {Y ∈ B}, so
\int _ A g(Y)\, \mathrm d\mathbb P=\int _ B g(y)\, \mu _ Y(\mathrm d y)=\int _ B \frac{\int xf(x,y)\, \mathrm dx}{f _ Y(y)}\, \mu _ Y(\mathrm d y)=\iint _ {\mathbb R\times B} xf(x,y)\, \mathrm dx\mathrm dy.
On the other hand,
\int _ A X\, \mathrm d\mathbb P=\iint _ {(X,Y)(A)}xf(x,y)\, \mathrm dx\mathrm dy.
The set (X, Y)(A) is contained in ℝ × B. Are the values of the above two integrals the same? Actually, for points in ℝ × B but not in (X, Y)(A): they all belong to Λ (note that A = {Y ∈ B}), and we have derived previously that f = 0 at almost all of them. Thus the two integrals are equal.
For the case where the positivity does not always hold, we can define g by g(y) = ∫ x f(x, y) dx / f_Y(y) if f_Y(y) > 0 and g(y) = 0 otherwise; i.e., g(y) can be anything where f_Y(y) = 0. Note that this is enough for the proof above to go through.
More generally, it can be proved in the same way that for integrable h(X), the conditional expectation is
\mathbb E(h(X)\mid Y=y)=\int _ {\mathbb R} h(x)f _ {X|Y}(x)\, \mathrm dx=\frac{\int h(x)f(x,y)\, \mathrm dx}{\int f(x,y)\, \mathrm dx}.
In particular, the conditional probability is
\mathbb P(X\in A\mid Y=y)=\mathbb E(\chi _ {\{X\in A\} }\mid Y=y)=\int _ {A} f _ {X|Y}(x)\, \mathrm dx.
Obviously we can obtain in this way a regular conditional distribution of X given Y, with density
f _ {X|Y}(x)=\frac{f(x,y)}{f _ Y(y)}=\frac{f(x,y)}{\int f(x,y)\, \mathrm dx}.
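The formula E(X|Y = y) = ∫ x f(x, y) dx / f_Y(y) can be sanity-checked numerically. The sketch below assumes the hypothetical joint density f(x, y) = x + y on the unit square, for which the closed form is E(X|Y = y) = (1/3 + y/2)/(1/2 + y); a simple midpoint rule reproduces it.

```python
# Hypothetical joint density f(x, y) = x + y on the unit square (zero elsewhere).
def f(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, a, b, n=10_000):
    # Composite midpoint rule on [a, b].
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def cond_exp_X_given_Y(y):
    num = integrate(lambda x: x * f(x, y), 0.0, 1.0)
    den = integrate(lambda x: f(x, y), 0.0, 1.0)   # the marginal f_Y(y)
    return num / den

# Closed form at y = 0: (1/3)/(1/2) = 2/3; at y = 1: (5/6)/(3/2) = 5/9.
print(cond_exp_X_given_Y(0.0), cond_exp_X_given_Y(1.0))
```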