The LaTeX document (with the latest updated chapters) can be read online at the link: probability.pdf
Chapter 3: Random variables

1. Definitions

Roughly speaking, a random variable is just a measurable function.
It is necessary to give a general definition, as in real analysis, for logical reasons that arise in many applications. A (real, extended-valued) random variable on a set \Delta\in\mathcal A is a function X from \Delta to [-\infty,+\infty] such that, for each Borel set B,
\[
\{\omega \mid X(\omega) \in B\}\in\Delta\cap \mathcal A,
\]
where \Delta\cap \mathcal A is the trace of \mathcal A on \Delta. A complex-valued random variable is a function on a set \Delta in \mathcal A to the complex plane whose real and imaginary parts are both real, finite-valued random variables.
For a discussion of basic properties we may suppose \Delta=\Omega and that X is real and finite-valued with probability one. The general case may be reduced to this one by considering the trace of (\Omega,\mathcal A,\mathbb P) on \Delta, or on the "domain of finiteness" \Omega _ 0=\{\omega\mid|X(\omega)|<\infty\}, and taking real and imaginary parts.
Recalling the theory of measurable functions, we can characterise a random variable with the following theorem.
The probability of the set in the definition above is well defined and will be written as
\[
\mathbb{P}(X(\omega)\in B),\, \text{or }\mathbb{P}(X\in B).
\]
The next theorem relates the probability measure \mathbb P to a probability measure on (\mathbb R,\mathcal B) as discussed in Chapter 2.
\[
\mu(B)=\mathbb{P}(X^{-1}(B))=\mathbb{P}(X\in B),\quad \forall B\in\mathcal B.
\]
The collection of sets \{X^{-1}(S),\, S\subseteq\mathbb R\} is a \sigma-algebra for any function X. If X is a random variable, then the collection \{X^{-1}(B),\, B\in\mathcal B\} is called the \sigma-algebra generated by X. It is the smallest \sigma-algebra contained in \mathcal A which contains all sets of the form \{\omega\mid X(\omega)\leqslant x\}, where x\in\mathbb R. Thus Theorem 3 gives us a convenient way of representing the measure \mathbb P when it is restricted to this sub-\sigma-algebra; symbolically we may write it as follows:
\[
\mu=\mathbb P\circ X^{-1}.
\]
This \mu is called the "probability distribution measure" or probability measure of X, and its associated distribution function F according to Section 2.2 will be called the distribution function of X. Specifically, F is given by
\[
F(x)=\mu((-\infty,x])=\mathbb P(X\leqslant x).
\]
While the random variable X determines \mu and therefore F, the converse is obviously false. A family of random variables having the same distribution is said to be "identically distributed".
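As a quick numerical illustration (a hedged sketch, not part of the text's development): for X distributed as Exp(1), the distribution function is F(x)=\mathbb P(X\leqslant x)=1-e^{-x}, which can be checked against an empirical estimate from simulated draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X ~ Exp(1); its distribution function is F(x) = 1 - exp(-x).
samples = rng.exponential(scale=1.0, size=100_000)

def empirical_F(x, samples):
    """Empirical estimate of F(x) = P(X <= x)."""
    return np.mean(samples <= x)

for x in [0.5, 1.0, 2.0]:
    est = empirical_F(x, samples)
    exact = 1 - np.exp(-x)
    print(f"F({x}) ~ {est:.4f}  (exact {exact:.4f})")
```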
A random vector is just a vector each of whose components is a random variable. It is sufficient to consider the case of two dimensions, since higher dimensions differ only in notation.
The class of Borel sets in \mathbb R^2 is denoted as \mathcal B^2, and the class of sets, each of which is a finite union of disjoint product sets, forms an algebra denoted \mathcal B _ 0^2.
Now let X and Y be two random variables on (\Omega, \mathcal A,\mathbb P). The random vector (X,Y) induces a probability \nu on \mathcal B^2 as follows:
\[
\nu(A)=\mathbb P((X,Y)\in A), \quad\forall A\in\mathcal B^2,
\]
the right side being an abbreviation of \mathbb P(\omega\mid (X(\omega),Y(\omega))\in A). This \nu is called the (2-dimensional, probability) distribution or simply the probability measure of (X, Y).
Proof. Since [f\circ(X,Y)]^{-1}(\mathcal B)=(X,Y)^{-1}(f^{-1}(\mathcal B))\subseteq (X,Y)^{-1}(\mathcal B^2), it suffices to show (X,Y)^{-1}(\mathcal B^2)\subseteq \mathcal A. If A is a product of two Borel sets, say A=B _ 1\times B _ 2 with B _ 1,B _ 2\in\mathcal B, then clearly (X,Y)^{-1}(A)=X^{-1}(B _ 1)\cap Y^{-1}(B _ 2)\in\mathcal A. Hence the class of sets A for which (X,Y)^{-1}(A)\in\mathcal A contains the algebra \mathcal B _ 0^2. It can be proved that this class forms a \sigma-algebra, so it must contain \mathcal B^2 (the smallest \sigma-algebra containing \mathcal B _ 0^2). This gives the desired conclusion.
If \{X _ i\} _ {i=1}^\infty is a sequence of random variables, then \inf X _ i, \sup X _ i, \liminf X _ i, \limsup X _ i are random variables, not necessarily finite-valued with probability one though everywhere defined, and \lim X _ i is a random variable on the set \Delta on which there is either convergence or divergence to \pm\infty.
The analogue in real analysis should be well known to the reader, so we omit the proof.
A random variable X is called discrete iff there is a countable set B\subseteq\mathbb R such that \mathbb P(X\in B)=1. It is easy to see that X is discrete iff its distribution function is.
Let \{\Lambda _ j\} be a countable partition of \Omega and \{b _ j\} arbitrary real numbers; then the function \varphi defined by \varphi(\omega)=\sum _ j b _ j\chi _ {\Lambda _ j}(\omega)\, (\forall \omega\in\Omega) is a discrete random variable. We shall call \varphi the random variable belonging to the weighted partition \{\Lambda _ j;b _ j\}. Each discrete random variable X belongs to a certain partition: let \{b _ j\} be the countable set in the definition of X and let \Lambda _ j = \{\omega\mid X(\omega) = b _ j \}; then X belongs to the weighted partition \{\Lambda _ j;b _ j\}. If j ranges over a finite index set, the partition is called finite and the random variable belonging to it is called simple.
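The construction above can be sketched concretely (all names hypothetical): take the sample space [0, 1) with Lebesgue measure, a finite partition into subintervals \Lambda _ j, and weights b _ j; the simple random variable \varphi is then a lookup.

```python
# A minimal sketch of the r.v. belonging to a weighted partition
# {Lambda_j; b_j}.  Sample space: [0, 1) with Lebesgue measure;
# Lambda_j are subintervals; phi(omega) = sum_j b_j * chi_{Lambda_j}(omega).
partition = [(0.0, 0.5), (0.5, 0.8), (0.8, 1.0)]   # the sets Lambda_j
weights = [-1.0, 0.0, 2.5]                          # the values b_j

def phi(omega):
    """Value at omega of the r.v. belonging to {Lambda_j; b_j}."""
    for (lo, hi), b in zip(partition, weights):
        if lo <= omega < hi:
            return b
    raise ValueError("omega outside the sample space")

print(phi(0.25), phi(0.6), phi(0.9))   # -1.0 0.0 2.5
```

Since the index set is finite, this is a simple random variable in the sense just defined.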
Transformation in \mathbb R^n
Here we state without proof the transformation formula for measures with continuous densities under differentiable maps; the proof can be found in textbooks on calculus. With this formula we can obtain the density of some transformed random variables.
If a measure on an open set A\subseteq\mathbb R^n has continuous density f, and \varphi is an injective, continuously differentiable map of A onto B=\varphi(A), then the image measure has density
\[
f _ {\varphi}(\boldsymbol x)={f(\varphi^{-1}(\boldsymbol x))}\cdot|\det(\varphi'[\varphi^{-1}(\boldsymbol x)])|^{-1}, \quad \boldsymbol x\in B,\, \det(\varphi'[\varphi^{-1}(\boldsymbol x)])\neq0.
\]
For \boldsymbol x elsewhere, f _ \varphi(\boldsymbol x) is assigned 0.
Now let \mu be the probability measure of \boldsymbol X, i.e., \mu=\mathbb P\circ \boldsymbol X^{-1}, and suppose \boldsymbol X takes values in A almost surely, so that \mu(A)=(\mathbb P\circ \boldsymbol X^{-1})(A)=1 and (\mu\circ\varphi^{-1})(B)=\mu(A)=1. Let \boldsymbol Y=\varphi(\boldsymbol X); then the probability measure of \boldsymbol Y is \mu\circ\varphi^{-1}, and the range of \boldsymbol Y is B. When \boldsymbol X has density f, the theorem gives the density of \boldsymbol Y as
\[
f _ {\boldsymbol Y}(\boldsymbol y)={f(\boldsymbol x)}\cdot|\det(\varphi'(\boldsymbol x))|^{-1}={f(\varphi^{-1}(\boldsymbol y))}\cdot|\det(\varphi'[\varphi^{-1}(\boldsymbol y)])|^{-1},\quad \boldsymbol x=\varphi^{-1}(\boldsymbol y).
\]
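A one-dimensional numerical sketch of this formula (example distribution chosen for illustration): take X \sim Exp(1) with density f(x)=e^{-x} on A=(0,\infty) and \varphi(x)=x^2, so \varphi'(x)=2x and f _ Y(y)=f(\sqrt y)\,(2\sqrt y)^{-1}; a probability computed from this density should match a Monte Carlo estimate.

```python
import numpy as np

# X ~ Exp(1) with density f(x) = exp(-x) on A = (0, inf);
# phi(x) = x**2 maps A onto B = (0, inf), phi'(x) = 2x, so the
# transformation formula gives f_Y(y) = exp(-sqrt(y)) / (2*sqrt(y)).
rng = np.random.default_rng(1)
x = rng.exponential(size=500_000)
y = x**2

def f_Y(y):
    return np.exp(-np.sqrt(y)) / (2 * np.sqrt(y))

# Compare P(a <= Y <= b) from the formula (midpoint rule) with a
# Monte Carlo estimate from the simulated sample.
a, b = 0.5, 2.0
grid = np.linspace(a, b, 2001)
mid = (grid[:-1] + grid[1:]) / 2
prob_formula = np.sum(f_Y(mid)) * (grid[1] - grid[0])
prob_mc = np.mean((y >= a) & (y <= b))
print(prob_formula, prob_mc)   # both close to exp(-sqrt(a)) - exp(-sqrt(b))
```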
2. Expectation

The concept of "(mathematical) expectation" is the same as that of integration in the probability space with respect to the measure \mathbb P. The reader is supposed to have some acquaintance with this. The general theory is not much different. The random variables below will be tacitly assumed to be finite everywhere to avoid trivial complications.
We first define the expectation of an arbitrary positive random variable X. For each positive integer m and each integer n\geqslant0, the set \Lambda _ {mn}=\{\omega\mid n/2^m\leqslant X(\omega)<(n+1)/2^m\} belongs to \mathcal A. For each m, let X _ m denote the random variable belonging to the weighted partition \{\Lambda _ {mn};n/2^m\}. It is easy to see that \{X _ m\} is an increasing sequence of random variables with X _ m(\omega)\uparrow X(\omega) for every \omega. The expectation \mathbb E(X) of X is defined as the limit, as m\to\infty, of
\[
\sum _ {n=0}^{\infty}\frac{n}{2^m}\mathbb{P}\left(\frac{n}{2^m}\leqslant X<\frac{n+1}{2^m}\right),
\]
the limit existing, finite or infinite. For an arbitrary X, put as usual X=X^+-X^-; both X^+ and X^- are positive random variables, so their expectations are defined. Unless both \mathbb E(X^+) and \mathbb E(X^-) are +\infty, we define \mathbb E(X)=\mathbb E(X^+)-\mathbb E(X^-) with the usual convention regarding \infty. The expectation, when it exists, is also denoted by
\[
\int _ \Omega X(\omega)\, \mathbb{P}(\mathrm{d}\omega).
\]
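The dyadic sums in this definition can be sketched numerically (a hedged illustration with empirical frequencies in place of probabilities): the sum S _ m=\sum _ n (n/2^m)\,\mathbb P(n/2^m\leqslant X<(n+1)/2^m) equals \mathbb E(\lfloor 2^m X\rfloor/2^m) and increases to \mathbb E(X).

```python
import numpy as np

# Dyadic sums defining E(X) for a positive r.v., with probabilities
# replaced by empirical frequencies for X ~ Exp(1), where E(X) = 1.
rng = np.random.default_rng(2)
x = rng.exponential(size=200_000)

def dyadic_sum(x, m):
    """Empirical version of S_m: average of floor(2^m X)/2^m."""
    return np.mean(np.floor(x * 2**m) / 2**m)

for m in [0, 2, 4, 8]:
    print(m, dyadic_sum(x, m))   # increases toward the sample mean of x
```

By construction each sample value is rounded down by less than 2^{-m}, so the sums approach the sample mean from below.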
For each \Lambda\in\mathcal A, we define
\[\int _ \Lambda X(\omega)\, \mathbb{P}(\mathrm{d}\omega)=\mathbb{E}(X\cdot\chi _ {\Lambda})\]
and call it "the integral of X (with respect to \mathbb P) over the set \Lambda". As a general notation, the left member above will be abbreviated to
\[
\int _ \Lambda X\, \mathrm{d}\mathbb P.
\]
We shall say that X is integrable with respect to \mathbb P over \Lambda iff the integral above exists and is finite.
The general integral has the familiar properties of the Lebesgue integral.
- Absolute integrability. \int _ \Lambda X\, \mathrm{d}\mathbb P is finite iff \int _ \Lambda |X|\, \mathrm{d}\mathbb P<\infty.
- Linearity. \int _ \Lambda(aX+bY)\, \mathrm{d}\mathbb P=a\int _ \Lambda X\, \mathrm{d}\mathbb P+b\int _ \Lambda Y\, \mathrm{d}\mathbb P, provided that the right side is meaningful, namely not +\infty-\infty or -\infty+\infty.
- Additivity over sets. If the \Lambda _ n's are disjoint, then \int _ {\bigcup \Lambda _ n}X\, \mathrm{d}\mathbb P=\sum _ n\int _ {\Lambda _ n}X\, \mathrm{d}\mathbb P.
- Positivity. If X\geqslant 0 a.e. on \Lambda, then \int _ \Lambda X\, \mathrm{d}\mathbb P\geqslant0.
- Monotonicity. If X _ 1\leqslant X\leqslant X _ 2 a.e. on \Lambda, then \int _ \Lambda X _ 1\, \mathrm{d}\mathbb P\leqslant\int _ \Lambda X\, \mathrm{d}\mathbb P\leqslant\int _ \Lambda X _ 2\, \mathrm{d}\mathbb P.
- Mean value theorem. If a\leqslant X\leqslant b a.e. on \Lambda, then a\, \mathbb P(\Lambda)\leqslant\int _ \Lambda X\, \mathrm{d}\mathbb P\leqslant b\, \mathbb P(\Lambda).
- Modulus inequality. \left|\int _ \Lambda X\, \mathrm{d}\mathbb P\right|\leqslant\int _ \Lambda |X|\, \mathrm{d}\mathbb P.
- Dominated convergence theorem. If X _ n\to X a.e. (or merely in measure) on \Lambda and |X _ n|\leqslant Y a.e. on \Lambda, with \int _ \Lambda Y\, \mathrm{d}\mathbb P<\infty, then
\[
\lim _ {n\to\infty}\int _ \Lambda X _ n\, \mathrm{d}\mathbb P=\int _ \Lambda\lim _ {n\to\infty} X _ n\, \mathrm{d}\mathbb P=\int _ \Lambda X\, \mathrm{d}\mathbb P.
\]
- Bounded convergence theorem. If X _ n\to X a.e. (or merely in measure) on \Lambda and there is a constant M such that |X _ n|\leqslant M a.e. on \Lambda, then the dominated convergence formula above holds.
- Monotone convergence theorem. If X _ n\geqslant0 and X _ n\uparrow X a.e. on \Lambda, then the dominated convergence formula above is again true, provided that +\infty is allowed as a value for either member. The condition "X _ n\geqslant 0" may be weakened to: "\mathbb E(X _ n) > -\infty for some n".
- Integration term by term. If \sum _ n\int _ \Lambda |X _ n|\, \mathrm d\mathbb P<\infty, then \sum _ n|X _ n|<\infty a.e. on \Lambda so that \sum _ n X _ n converges a.e. on \Lambda and
\[
\int _ \Lambda\sum _ n X _ n\, \mathrm{d}\mathbb P=\sum _ n\int _ \Lambda X _ n\, \mathrm{d}\mathbb P.
\]
- Fatou's lemma. If X _ n\geqslant0 a.e. on \Lambda, then
\[
\int _ \Lambda\liminf _ {n\to\infty} X _ n\, \mathrm{d}\mathbb P\leqslant\liminf _ {n\to\infty}\int _ \Lambda X _ n\, \mathrm{d}\mathbb P.
\]
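Fatou's inequality can be strict, as the classical sliding-bump example shows; here is a hedged numerical sketch. On ([0,1) with Lebesgue measure), let X _ n=n\,\chi _ {(0,1/n)}: then X _ n(\omega)\to0 for every \omega>0, so the left side of Fatou is 0, while \mathbb E(X _ n)=1 for every n.

```python
import numpy as np

# X_n = n * 1_{(0, 1/n)} on the uniform probability space [0, 1):
# X_n -> 0 pointwise on (0, 1), yet E(X_n) = n * (1/n) = 1 for all n.
rng = np.random.default_rng(3)
omega = rng.uniform(size=400_000)

def X(n, omega):
    return n * ((omega > 0) & (omega < 1.0 / n))

for n in [1, 10, 100]:
    print(n, np.mean(X(n, omega)))   # each estimate is close to 1
```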
For any random variable X we have
\[
\sum _ {n=1}^{\infty}\mathbb P(|X|\geqslant n)\leqslant\mathbb E(|X|)\leqslant1+\sum _ {n=1}^{\infty}\mathbb P(|X|\geqslant n)
\]
so that \mathbb E(|X|)<\infty iff the series above converges.
In particular, if X takes only positive integer values, then
\[
\mathbb E(X)=\sum _ {n=1}^{\infty}\mathbb P(|X|\geqslant n).
\]
To verify the theorem, set \Lambda _ n=\{n\leqslant |X|<n+1\}, so that \mathbb E(|X|)=\sum _ {n=0}^\infty\int _ {\Lambda _ n}|X|\, \mathrm d\mathbb P, and
\[
\sum _ {n=0}^{\infty}n\, \mathbb P(\Lambda _ n)=\sum _ {n=1}^{\infty}\mathbb P(|X|\geqslant n),
\]
from which both bounds follow since n\leqslant |X|<n+1 on \Lambda _ n. When X takes only positive integer values, this equation is just the above corollary.
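Both statements can be checked exactly for simple distributions (a hedged numerical sketch; the distributions are chosen for illustration):

```python
import numpy as np

# Positive-integer-valued X ~ Geometric(p) on {1, 2, ...}:
# P(X >= n) = (1-p)**(n-1) and E(X) = 1/p, so the corollary's
# series sums to the expectation (up to negligible truncation).
p = 0.3
series_geom = sum((1 - p) ** (n - 1) for n in range(1, 200))
print(series_geom, 1 / p)

# X ~ Exp(1): E(|X|) = 1 and P(|X| >= n) = exp(-n), so the series
# equals 1/(e-1) ~ 0.582 and the theorem's bounds read 0.582 <= 1 <= 1.582.
series_exp = sum(np.exp(-n) for n in range(1, 100))
print(series_exp, "<= 1 <=", 1 + series_exp)
```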
There is a basic relation between the abstract integral with respect to \mathbb P over sets in \mathcal A on the one hand, and the Lebesgue-Stieltjes integral with respect to \mu over sets in \mathcal B on the other, induced by each random variable. We give the version in one dimension first.
Let f be a Borel measurable function on \mathbb R; then
\[
\int _ \Omega f(X(\omega))\, \mathbb P(\mathrm d\omega)=\int _ \mathbb R f(x)\, \mu(\mathrm dx)
\]
provided that either side exists.
The key point of the proof is approximation by simple functions. If f is the characteristic function of a Borel set B, then the left side is \mathbb P(X\in B) and the right side is \mu(B); these are equal by the definition of \mu. By linearity the proposition holds when f is a simple function. For positive f we construct an increasing sequence of positive simple functions f _ n\uparrow f and take limits on both sides of the identity for f _ n. The general case follows in the usual way.
As a consequence of this theorem, we have: if \mu _ X and F _ X denote, respectively, the probability measure and distribution function induced by X, then we have
\[
\mathbb E(X)=\int _ \mathbb R x\, \mu _ X(\mathrm dx)=\int _ {-\infty}^{+\infty}x\, \mathrm dF _ X(x),
\]
and more generally,
\[
\mathbb E(f(X))=\int _ \mathbb R f(x)\, \mu _ X(\mathrm dx)=\int _ {-\infty}^{+\infty}f(x)\, \mathrm dF _ X(x),
\]
with the usual proviso regarding existence and finiteness.
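A hedged numerical sketch of the identity \mathbb E(f(X))=\int f\,\mathrm d\mu _ X, with X standard normal and f(x)=x^2 (so both sides equal 1): the left side is estimated by Monte Carlo, the right side by numerically integrating f against the normal density.

```python
import numpy as np

# E(f(X)) two ways for X ~ N(0,1), f(x) = x**2, where E(X^2) = 1.
rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
lhs = np.mean(x**2)                      # Monte Carlo average of f(X)

# Right side: integral of f against the standard normal density.
grid = np.linspace(-8, 8, 400_001)
dx = grid[1] - grid[0]
density = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)
rhs = np.sum(grid**2 * density) * dx

print(lhs, rhs)   # both close to 1
```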
We shall need the generalization of the preceding theorem in several dimensions. No change is necessary except for notation, which we will give in two dimensions. Let us write the "mass element" as \mu^2(\mathrm dx,\mathrm dy) so that
\[
\nu(A)=\iint _ A\mu^2(\mathrm dx,\mathrm dy).
\]
\[
\int _ \Omega f(X(\omega),Y(\omega))\, \mathbb P(\mathrm d\omega)=\iint _ {\mathbb R^2}f(x,y)\, \mu^2(\mathrm dx,\mathrm dy).
\]
Note that f(X,Y) is a random variable.
If we take f(x,y)=x+y, we obtain
\[
\mathbb E(X+Y)=\mathbb E(X)+\mathbb E(Y).
\]
This is a useful relation.
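Note that the relation requires no independence; a quick sketch (example pair chosen for illustration) with the dependent pair Y=X^3, X uniform on (0,1), where \mathbb E(X)=1/2 and \mathbb E(Y)=1/4:

```python
import numpy as np

# Linearity of expectation for a dependent pair: Y = X**3 with
# X ~ Uniform(0,1), so E(X) = 1/2, E(Y) = 1/4, E(X+Y) = 3/4.
rng = np.random.default_rng(5)
x = rng.uniform(size=1_000_000)
y = x**3
print(np.mean(x + y), np.mean(x) + np.mean(y))   # both close to 0.75
```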
More generally, for a measurable map X from a measure space (\Omega,\mu) to a measurable space \Omega' and a measurable function f on \Omega', we have the following change-of-variables theorem (you can prove it on your own):
\[
\int_\Omega (f\circ X)\, \mathrm d\mu=\int_{\Omega'}f\, \mathrm d(\mu\circ X^{-1}).
\]
Moments

Let a\in\mathbb R, r\in\mathbb R^+, then \mathbb E(|X-a|^r) is called the absolute moment of X of order r, about a. It may be +\infty; otherwise, and if r is an integer, \mathbb E(X-a)^r is the corresponding moment. If \mu and F denote, respectively, the probability measure and distribution function induced by X, then
\begin{align*}
\mathbb E|X-a|^r & =\int _ \mathbb R |x-a|^r\, \mu(\mathrm dx)=\int _ {-\infty}^{+\infty}|x-a|^r\, \mathrm dF(x), \\
\mathbb E(X-a)^r & =\int _ \mathbb R (x-a)^r\, \mu(\mathrm dx)=\int _ {-\infty}^{+\infty}(x-a)^r\, \mathrm dF(x).
\end{align*}
For r=1, a=0, this reduces to \mathbb E(X), which is also called the mean of X. The moments about the mean are called central moments. That of order 2 is particularly important; it is called the variance, \operatorname{Var}(X), and its positive square root the standard deviation \sigma(X):
\[
\operatorname{Var}(X)=\sigma^2(X)=\mathbb E(X-\mathbb EX)^2=\mathbb E(X^2)-(\mathbb EX)^2.
\]
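The two expressions for the variance agree exactly on the empirical distribution of any sample (a minimal sketch, distribution chosen for illustration):

```python
import numpy as np

# Var(X) = E(X^2) - (E X)^2 checked on a sample from an exponential
# distribution with mean 2 (so variance 4); the two empirical
# expressions coincide up to floating-point rounding.
rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=100_000)
lhs = np.mean((x - np.mean(x)) ** 2)
rhs = np.mean(x**2) - np.mean(x) ** 2
print(lhs, rhs)   # equal up to rounding, both near 4
```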
For any positive number p, X is said to belong to L^p=L^p(\Omega,\mathcal A,\mathbb P) iff \mathbb E|X|^p<\infty.
Here are some well-known inequalities.
- Hölder's inequality: for 1<p<\infty and q satisfying 1/p+1/q=1,
\[
|\mathbb E(XY)|\leqslant \mathbb E|XY|\leqslant (\mathbb E|X|^p)^{1/p}(\mathbb E|Y|^q)^{1/q}.
\]
- Minkowski inequality: for p\geqslant1,
\[
(\mathbb E|X+Y|^p)^{1/p}\leqslant(\mathbb E|X|^p)^{1/p}+(\mathbb E|Y|^p)^{1/p}.
\]
- Cauchy-Schwarz inequality:
\[
(\mathbb E|XY|)^2\leqslant (\mathbb E|X|^2)(\mathbb E|Y|^2).
\]
- If Y\equiv1 in Hölder's inequality, then for p>1,
\[
\mathbb E|X|\leqslant (\mathbb E|X|^p)^{1/p}.
\]
- Liapounov inequality:
\[
(\mathbb E|X|^r)^{1/r}\leqslant(\mathbb E|X|^{r^\prime})^{1/{r}^\prime},\quad0<r<r^\prime<\infty.
\]
- Jensen's inequality: If \varphi is a convex function on \mathbb R, and X and \varphi(X) are integrable random variables, then
\[
\varphi(\mathbb E(X))\leqslant \mathbb E(\varphi(X)).
\]
- Chebyshev inequality: If \varphi is a strictly positive and increasing function on (0, +\infty), \varphi(u) = \varphi(-u), and X is a random variable such that \mathbb E(\varphi(X))<\infty, then for each u>0,
\[
\mathbb P(|X|\geqslant u)\leqslant\frac{\mathbb E(\varphi(X))}{\varphi(u)}.
\]
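Two of these inequalities can be checked numerically on the empirical distribution of a sample, where they hold exactly rather than only up to sampling error (a hedged sketch; the distributions and exponents are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(200_000)
y = rng.exponential(size=200_000)

# Chebyshev with phi(u) = u**2:  P(|X| >= u) <= E(X^2) / u^2.
for u in [1.0, 2.0, 3.0]:
    assert np.mean(np.abs(x) >= u) <= np.mean(x**2) / u**2

# Hoelder with p = 3, q = 3/2:  E|XY| <= (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
p, q = 3.0, 1.5
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x) ** p) ** (1 / p) * np.mean(np.abs(y) ** q) ** (1 / q)
print(lhs, "<=", rhs)
```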