Probability measures

You can read the LaTeX document online (for the latest updated chapters) from the link: probability.pdf

Chapter 1: Probability measures

1. What is probability?

What is probability? The concept is easy to grasp in everyday use, but a number of problems arise once we examine it rigorously or from a philosophical standpoint. We do not discuss those problems here, but we still hope to explain the concept of probability as clearly as possible.

Mathematically, probability is defined by a set of axioms. This axiomatization was achieved by the mathematician Kolmogorov in 1933. It is worth appreciating that the axioms are not only adequate but also simple and elegant.

Probability is a kind of set function, so we first review some basic facts about families of sets.

Contents
 1.  What is probability?
 2.  Some notes on definition and interpretation of probability
   2.1.  Classical probability
   2.2.  The frequency definition of probability
   2.3.  Subjective probability

Definition 1. A family \mathcal {A} of subsets of \Omega, i.e. a subset of the power set \mathcal P(\Omega), is called an algebra over \Omega if it has the following properties:
  1. Contains the empty set as an element: \varnothing\in \mathcal A.
  2. Closed under complementation in \Omega: E^c\in\mathcal A for all E\in\mathcal A.
  3. Closed under finite unions: \bigcup _ {i=1}^n E _ i\in\mathcal A for all n\in\mathbb N^* and E _ i\in\mathcal A.


Definition 2. An algebra \mathcal A is called a \sigma-algebra if it satisfies the following additional condition:
  4. Closed under countable unions: \bigcup _ {i=1}^\infty E _ i\in\mathcal A for every sequence of sets E _ i\in\mathcal A.
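
On a finite \Omega every family of subsets is finite, so an algebra is automatically a \sigma-algebra and the conditions can be checked by brute force. The following is a minimal sketch under that assumption; the function name `is_algebra` and the two example families are illustrative only.

```python
from itertools import combinations

def is_algebra(omega, family):
    """Check the algebra conditions for a family of subsets of a finite set omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family}
    # Every member must be a subset of omega.
    if any(not s <= omega for s in sets):
        return False
    # 1. The empty set must belong to the family.
    if frozenset() not in sets:
        return False
    # 2. Closed under complementation in omega.
    if any(omega - s not in sets for s in sets):
        return False
    # 3. Closed under pairwise unions; closure under all finite unions
    #    then follows by induction.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
bad = [set(), {1}, {2}, {1, 2, 3, 4}]  # the complement {2, 3, 4} of {1} is missing

print(is_algebra(omega, good))  # True
print(is_algebra(omega, bad))   # False
```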

The following theorems relate \sigma-algebras to monotone classes. A nonempty class \mathcal M of subsets of \Omega is called a monotone class if it is closed under unions of increasing sequences and intersections of decreasing sequences: if E _ n\in\mathcal M and E _ n\subseteq E _ {n+1} for all n, then \bigcup _ n E _ n\in\mathcal M; if E _ n\in\mathcal M and E _ n\supseteq E _ {n+1} for all n, then \bigcap _ n E _ n\in\mathcal M.

Theorem 3. An algebra is a \sigma-algebra iff it is also a monotone class.

Theorem 4. Let \mathcal A _ 0 be an algebra, \mathcal G the minimal monotone class containing \mathcal A _ 0, and \mathcal F the minimal \sigma-algebra containing \mathcal A _ 0. Then \mathcal F=\mathcal G.

Theorem 4 is one of a type called monotone class theorems. They are among the most useful tools of measure theory.

Naturally, we would like to define probability on all subsets of \Omega, but in general this is not possible: on some sample spaces the resulting "probability" could not satisfy a basic requirement, countable additivity, together with the other properties we expect of it. Therefore, we restrict ourselves to a \sigma-algebra as defined above. Let \Omega be a set and \mathcal A a \sigma-algebra over \Omega. The pair (\Omega,\mathcal A) is called a measurable space.

Definition 5. A probability measure \mathbb P on \mathcal A is a set function with domain \mathcal A satisfying the following axioms:
  1. nonnegativity: \forall E\in\mathcal A, \mathbb P(E)\geqslant0.
  2. \sigma-additivity (or countable additivity): If \{E _ i\} _ {i=1}^\infty is a countable collection of (pairwise) disjoint sets in \mathcal A, then \mathbb{P}(\bigcup E _ i)=\sum\mathbb{P}(E _ i).
  3. unit measure: \mathbb P(\Omega)=1.

The triple (\Omega,\mathcal A,\mathbb P) is called a probability space; \Omega is called the sample space, and each element \omega\in\Omega is called a sample point.
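
To make Definition 5 concrete, here is a minimal sketch of a probability space on a finite sample space, with \mathcal A taken to be the full power set. The loaded-die weights and the helper names `power_set` and `make_measure` are assumptions made for illustration. On a finite \Omega only finitely many disjoint events can be nonempty, so checking additivity for pairs of disjoint events is enough.

```python
from fractions import Fraction
from itertools import chain, combinations

def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

def make_measure(weights):
    """Return the set function P(E) = sum of the point masses in E."""
    def prob(event):
        return sum((weights[w] for w in event), Fraction(0))
    return prob

# Hypothetical loaded die: these weights are illustrative only.
weights = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 8),
           4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 8)}
omega = frozenset(weights)
events = power_set(omega)
P = make_measure(weights)

nonnegative = all(P(E) >= 0 for E in events)
unit_measure = P(omega) == 1
additive = all(P(E | F) == P(E) + P(F)
               for E in events for F in events if not (E & F))
print(nonnegative, unit_measure, additive)  # True True True
```

Exact `Fraction` arithmetic is used so that the equalities are tested without floating-point rounding.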

These axioms imply the following consequences.

Proposition 6.
  • \mathbb{P}(\varnothing)=0.
  • Finite additivity: If \{E _ i\} _ {i=1}^n is a finite collection of (pairwise) disjoint sets in \mathcal A, then \mathbb{P}(\bigcup _ i E _ i)=\sum _ i\mathbb{P}(E _ i).
  • \forall E,F\in\mathcal{A}, E\subseteq F\Rightarrow \mathbb{P}(E) \leqslant \mathbb{P}(F).
  • \forall E\in\mathcal{A}, \mathbb{P}(E)\leqslant1.
  • Total probability: if F\in\mathcal{A} and \{E _ i\} _ {i=1}^\infty\subseteq\mathcal A is a countable partition of \Omega, then \mathbb{P}(F)=\sum _ {i=1}^{\infty} \mathbb{P}\left(F \cap E _ {i}\right).
  • Complement: \forall E\in\mathcal A, \mathbb{P}\left(E^{c}\right)=1-\mathbb{P}(E).
  • Union: \forall E,F\in\mathcal{A}, \mathbb{P}(E \cup F)=\mathbb{P}(E)+\mathbb{P}(F)-\mathbb{P}(E \cap F).
  • Poincaré formula: if E _ {1}, \ldots, E _ {n} \in \mathcal{A}, then
    \[\mathbb{P}\Big(\bigcup _ {i=1}^{n} E _ {i}\Big)=\sum _ {k=1}^{n}(-1)^{k-1} \sum _ {1 \leqslant i _ {1}<\cdots<i _ {k} \leqslant n} \mathbb{P}\left(E _ {i _ {1}} \cap \cdots \cap E _ {i _ {k}}\right).\]

  • Boole's inequality: if \{E _ n\} _ {n=1}^\infty\subseteq\mathcal A, then
    \[\mathbb{P}\Big(\bigcup _ {i=1}^{\infty} E _ {i}\Big) \leqslant \sum _ {i=1}^{\infty} \mathbb{P}\left(E _ {i}\right).\]

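  • Monotone property (continuity property): if \{E _ n\} _ {n=1}^\infty is a monotone sequence in \mathcal A with limit E=\lim _ {n\to\infty}E _ n (that is, E=\bigcup _ n E _ n if the sequence is increasing and E=\bigcap _ n E _ n if it is decreasing), then \mathbb P(E _ n)\to\mathbb P(E), i.e.,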
    \[\lim _ {n \rightarrow \infty} \mathbb{P}\left(E _ {n}\right)=\mathbb{P}\left(\lim _ {n \rightarrow \infty} E _ {n}\right).\]
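
Some of these consequences can be checked numerically on a small example. The sketch below assumes a uniform measure on twelve sample points and three arbitrarily chosen events; it compares both sides of the Poincaré formula and evaluates the bound in Boole's inequality. The function name `poincare` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 13))  # twelve equally likely sample points (assumed)

def P(event):
    return Fraction(len(event), len(omega))

E = [frozenset(w for w in omega if w % 2 == 0),  # multiples of 2
     frozenset(w for w in omega if w % 3 == 0),  # multiples of 3
     frozenset(w for w in omega if w <= 4)]      # outcomes 1 to 4

def poincare(events):
    """Right-hand side of the Poincaré (inclusion-exclusion) formula."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k - 1)
        for chosen in combinations(events, k):
            total += sign * P(frozenset.intersection(*chosen))
    return total

lhs = P(frozenset.union(*E))          # probability of the union
rhs = poincare(E)                     # alternating sum over intersections
boole_bound = sum(P(e) for e in E)    # upper bound from Boole's inequality

print(lhs, rhs, lhs <= boole_bound)   # 3/4 3/4 True
```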

There is an axiom called the "axiom of continuity", which refers to the proposition
\begin{equation*}
E _ n\downarrow \varnothing\implies\mathbb P(E _ n)\to0.
\end{equation*}

It is a particular case of the continuity property above; conversely, given finite additivity, the continuity property may be deduced from it.

Theorem 7. The axioms of finite additivity and of continuity together are equivalent to the axiom of countable additivity.
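
One direction of Theorem 7 can be sketched as follows; the notation R _ n for the tail union is introduced here only for illustration. Let \{E _ i\} _ {i=1}^\infty be pairwise disjoint sets in \mathcal A with union E, and put R _ n=\bigcup _ {i>n}E _ i. Since \mathcal A is a \sigma-algebra, R _ n\in\mathcal A, and finite additivity gives
\[\mathbb P(E)=\sum _ {i=1}^{n}\mathbb P(E _ i)+\mathbb P(R _ n).\]

Each \omega\in E lies in exactly one E _ i, hence not in R _ n for n\geqslant i, so R _ n\downarrow\varnothing. The axiom of continuity then gives \mathbb P(R _ n)\to0, and letting n\to\infty yields
\[\mathbb P(E)=\sum _ {i=1}^{\infty}\mathbb P(E _ i).\]

For the converse direction, finite additivity and the continuity property are both among the consequences of countable additivity listed in Proposition 6.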

The next definition treats a subset of the sample space as a new sample space.

Definition 8. Let \Delta\subseteq \Omega. The trace of the \sigma-algebra \mathcal A on \Delta is the collection of all sets of the form \Delta\cap A where A\in\mathcal A. It is easy to see that this is a \sigma-algebra over \Delta, and we shall denote it by \Delta\cap\mathcal A. Suppose \Delta\in\mathcal A and \mathbb P(\Delta)>0; then we may define the set function \mathbb P _ \Delta on \Delta\cap\mathcal A as follows:
\[\mathbb P _ \Delta(E)=\frac{\mathbb P(E)}{\mathbb P(\Delta)}, \quad\forall E\in\Delta\cap\mathcal A.\]

It is easy to see that \mathbb P _ \Delta is a probability measure on \Delta\cap\mathcal A. The triple (\Delta, \Delta\cap\mathcal A,\mathbb P _ \Delta) will be called the trace of (\Omega,\mathcal A,\mathbb P) on \Delta.
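
As an illustration of Definition 8, the sketch below restricts a fair die to its even outcomes. The fair-die measure and the helper name `trace_measure` are assumptions of this example, not part of the definition.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # fair die (assumed for illustration)

def P(event):
    return Fraction(len(event), len(omega))

def trace_measure(P, delta):
    """Return P_delta(E) = P(E) / P(delta) for events E contained in delta."""
    p_delta = P(delta)
    if p_delta == 0:
        raise ValueError("P(delta) must be positive")
    def P_delta(event):
        event = frozenset(event)
        if not event <= delta:
            raise ValueError("event must be a subset of delta")
        return P(event) / p_delta
    return P_delta

delta = frozenset({2, 4, 6})  # the new sample space: even outcomes
P_delta = trace_measure(P, delta)

print(P_delta(delta))    # 1
print(P_delta({2}))      # 1/3
print(P_delta({2, 4}))   # 2/3
```

The new measure assigns total mass 1 to \Delta, as a probability measure on \Delta\cap\mathcal A must.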

2. Some notes on definition and interpretation of probability

The definition of probability in this chapter is clearly an abstract one. The axioms alone do not determine the probability of any particular event (beyond trivial cases such as \varnothing and \Omega), nor do they tell us how to interpret probability in practical problems. There is always a gap between abstract theory and practical applications. Because of its importance, the interpretation of probability is discussed more in the field of statistics. Generally speaking, people interpret it mainly from two perspectives: the frequentist view and the Bayesian view.

2.1. Classical probability

Probability was first defined in a "classical" way. Suppose there are only a finite number of possible "elementary events". If we cannot find any reason to think that one of the outcomes is more favoured than any other, then we have to assume that all outcomes have an equal chance of occurring. Under this assumption, the probability of an event is just the ratio of the number of elementary events it contains to the total number of elementary events. As a formula, \mathbb{P}(E)=|E|/|\Omega|.

This is reasonable. By the meaning of equal likelihood, the probability of each elementary event is \frac1m where m=|\Omega|. If E contains n elementary events, the probability of E should be n times \frac1m, i.e., \frac nm.
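
The formula \mathbb{P}(E)=|E|/|\Omega| amounts to counting. The following sketch assumes two fair, distinguishable dice, so that the 36 ordered pairs are the equally likely elementary events; the function name `classical_probability` is illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die).
omega = list(product(range(1, 7), repeat=2))

def classical_probability(event, omega):
    """P(E) = |E| / |Omega|, where event is a predicate on elementary events."""
    favourable = [w for w in omega if event(w)]
    return Fraction(len(favourable), len(omega))

p_sum_7 = classical_probability(lambda w: w[0] + w[1] == 7, omega)
p_some_six = classical_probability(lambda w: 6 in w, omega)

print(p_sum_7)     # 1/6
print(p_some_six)  # 11/36
```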

However, this looks like a circular definition. (If the definition of a concept depends on the concept itself, we are using an undefined concept to describe itself, which of course does not work as a definition, e.g. "a writer is a person who belongs to a writers' association".) How can we determine which outcomes are equally likely? They should be the ones with the same probability, but the calculation of probability in turn relies on the initial assumption of equal likelihood. If we recall the axioms, we find that they say nothing about equal likelihood. From the axiomatic point of view, the formula \mathbb{P}(E)=|E|/|\Omega| simply defines one particular probability measure \mathbb P on a finite \Omega. The value of a probability therefore depends on the choice of elementary events, which the axioms do not make for us. Once the elementary events are chosen, we can calculate all probabilities.

Example. Considering the toss of two coins, the mathematician D'Alembert thought there were three possible cases (two heads, two tails, one head and one tail), so he concluded that the probability of one head and one tail was 1/3. Had he recognised the relationship between probability and frequency and tried the experiment a few times, he might have changed his mind. Under the usual assumptions the three cases are not equally likely: "one head and one tail" contains two elementary events, (head,tail) and (tail,head), and the four elementary events are (head,head), (head,tail), (tail,head) and (tail,tail).

This is a famous example, but as mentioned above, the choice of elementary events lies outside the scope of the probability axioms. D'Alembert's treatment is not always wrong: in some physical models (for instance, Bose-Einstein statistics for indistinguishable particles), there exist events (say, AA, AB and BB) that each have probability 1/3.
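
The two treatments of the coin example can be put side by side. The sketch below applies the same counting rule to two different choices of elementary events; both produce a valid probability measure, and only the modelling assumption differs. The labels and the function name `uniform_probability` are illustrative.

```python
from fractions import Fraction
from itertools import product

def uniform_probability(event, elementary_events):
    """P(E) = |E| / |Omega| for a chosen list of elementary events."""
    favourable = [w for w in elementary_events if event(w)]
    return Fraction(len(favourable), len(elementary_events))

# Model A: ordered outcomes of two distinguishable coins (HH, HT, TH, TT).
ordered = list(product("HT", repeat=2))
p_ordered = uniform_probability(lambda w: set(w) == {"H", "T"}, ordered)

# Model B: D'Alembert's three unordered cases, assumed equally likely.
unordered = ["two heads", "one head and one tail", "two tails"]
p_unordered = uniform_probability(
    lambda w: w == "one head and one tail", unordered)

print(p_ordered, p_unordered)  # 1/2 1/3
```

Which model describes real coins is an empirical question that the axioms do not settle.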

2.2. The frequency definition of probability

It is often said that frequency is used to estimate probability. For example, suppose a die is not uniform, so that the probability of getting 1 cannot be calculated from symmetry. We can still do an experiment: throw the die a large number of times, say n times. If 1 occurs m times in these n throws, then m/n is the "frequency" of the event "1 facing up" in these n trials (each throw counts as one trial). The point of the statistical definition of probability is that this frequency is taken as an estimate of the probability of the event. The intuition behind the concept is simple: the likelihood of an event should be reflected by its frequency over many repeated trials. The idea generalises, provided the trial can be repeated a large number of times under the same conditions, so that the frequency of the event can be observed.
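
The experiment can be imitated on a computer. In the sketch below the die is simulated, so its face weights are known to the program; they stand in for the unknown physical die and are purely an assumption of this example, as is the function name `frequency_of_one`.

```python
import random

def frequency_of_one(weights, n, seed=0):
    """Throw a simulated die n times and return m / n for the face 1."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    faces = [1, 2, 3, 4, 5, 6]
    throws = rng.choices(faces, weights=weights, k=n)
    m = throws.count(1)        # number of throws showing 1
    return m / n

# Hypothetical loaded die: face 1 has probability 0.3 (illustrative value).
weights = [0.3, 0.14, 0.14, 0.14, 0.14, 0.14]
estimate = frequency_of_one(weights, n=100_000)
print(estimate)  # a value close to, but generally not equal to, 0.3
```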

The problem is that the frequency is only an estimate of the probability, not the probability itself. To obtain the true probability, we could say in theory that the probability of the event E is a number p with the following property: when the trial is repeated, the frequency of E stays near p, and as the repetitions continue the frequency gets closer and closer to p. Put simply, the probability is the limit of the frequency as the number of trials tends to infinity. This raises a question: how can we prove that such a p exists? Could its existence be only an assumption? By the very nature of a limit, we can never finish the trials. And even if p does exist as the limit of the frequency along one infinite sequence of trials, there is no guarantee whatever that another sequence of trials, carried out under the same circumstances, will yield the same p.

In general, this definition is not adopted: it never lets us obtain the probability of any event exactly, and it makes further useful results hard to derive. However, within the axiomatic theory of probability, under certain conditions the limiting frequency does exist for almost all sequences of trials. This is a famous theorem called the Law of Large Numbers. In a sense it justifies the intuitive foundation of probability as frequency discussed above, and it should quiet our misgivings about frequencies. The theorem will be discussed in a later chapter.
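
A simulation cannot prove the theorem, but it can show the behaviour it describes. The sketch below runs three independent sequences of simulated trials with the same success probability and records the frequency at increasing checkpoints. The value p = 0.3, the checkpoints and the function name `running_frequencies` are assumptions of this example.

```python
import random

def running_frequencies(p, checkpoints, seed):
    """Frequencies of success at the given (increasing) trial counts
    along one sequence of independent trials with success probability p."""
    rng = random.Random(seed)
    successes = 0
    done = 0
    result = []
    for n in checkpoints:
        while done < n:
            successes += rng.random() < p  # True counts as 1
            done += 1
        result.append(successes / n)
    return result

p = 0.3
checkpoints = [100, 10_000, 200_000]
for seed in range(3):
    print(seed, running_frequencies(p, checkpoints, seed))
```

Each seed gives a different sequence of trials, yet the frequencies at the last checkpoint should all lie close to p.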

2.3. Subjective probability

This interpretation is of a different kind. As the name suggests, it reflects a subjective judgement of how likely an event is to occur. For example, to say "the probability of rain tomorrow is 0.2" is to express the personal judgement that rain tomorrow is possible but rather unlikely. Such statements are very common, and they are subjective: an expert may give a very different probability from another person. It may seem that subjective probability is at odds with science, since science seeks objective truth, but many people would not think so. This kind of probability has gained popularity and found applications in recent years. Perhaps it is reasonable in some way, but we will not discuss it further here; its justification also lies outside the scope of the axioms.

