Contents
1. Introduction
2. Definition and basic properties
2.1. Definition
2.2. Regularity conditions and alternative representation
2.3. Sum of information
2.4. Data processing
2.5. Sufficient statistic
2.6. Relation to relative entropy
Fisher information plays a very important role in probability and statistics. It is commonly interpreted as a measure of the amount of information that a random variable carries about the unknown parameter of its distribution.
The first section starts with the definition and some basic properties. These properties justify calling the quantity Fisher "information". There are also connections between Fisher information and other notions in information theory.
The second section discusses the role of Fisher information in frequentist statistics. The Cramér-Rao inequality states that, under some conditions, no unbiased estimator of \theta can have a variance lower than the inverse of the Fisher information. Fisher information also plays an essential role in large-sample theory: under some conditions the MLE \hat\theta is a consistent and asymptotically normal estimator whose asymptotic variance attains the Cramér-Rao bound, which indicates the asymptotic optimality of the MLE.
The third section discusses the role of Fisher information in Bayesian statistics. If we want an uninformative prior distribution, a natural choice might be the uniform distribution. However, this is not always a good candidate: the uniform prior is not invariant under reparameterization, and therefore it is not truly uninformative. To address this issue, the Jeffreys prior takes the prior density proportional to the square root of the Fisher information, which yields the desired invariance.
2. Definition and basic properties
Assume \mathbf x\sim f(\boldsymbol x;\theta), where \theta is the parameter and f is the population density. \mathbf x consists of n samples \mathrm x _ 1,\mathrm x _ 2,\dots,\mathrm x _ n. For a more modern and general setting, we assume the distribution family of \mathbf x is \{\mathbb P _ \theta\}, \mu is a \sigma-finite measure with \mathbb P _ \theta\ll\mu, and f(\boldsymbol x;\theta)=\mathrm d\mathbb P _ \theta(\boldsymbol x)/\mathrm d\mu.
2.1. Definition
We first consider the case of a one-dimensional parameter. Given the n observations \mathbf x=(\mathrm x _ 1,\mathrm x _ 2,\dots,\mathrm x _ n), the Fisher information is defined as:
\[
I(\theta)=\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)^2\Big].
\]For vector-valued \boldsymbol\theta=(\theta _ 1,\dots,\theta _ k), the Fisher information matrix can be defined as:
\[
I(\boldsymbol\theta)=(I _ {ij}(\boldsymbol\theta)) _ {k\times k}
\]where
\[
I _ {ij}(\boldsymbol\theta)=\mathbb E\Big[\frac{\partial\ln f(\mathbf x;\boldsymbol\theta)}{\partial\theta _ i}\frac{\partial\ln f(\mathbf x;\boldsymbol\theta)}{\partial\theta _ j}\Big].
\]
This definition can be justified by the following example. Consider the normal distribution N(\theta,\sigma^2), where the variance \sigma^2 is known. If \sigma^2=10^{-10}, then a single observation x is adequate for estimating \theta at high precision, indicating that x contains a lot of information. If \sigma^2=10^{10}, one observation only allows us to estimate \theta at very low precision, indicating that x contains very little information. If we calculate the Fisher information, the result turns out to be 1/\sigma^2, consistent with the previous qualitative analysis.
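As a quick check of the last claim, for a single observation x\sim N(\theta,\sigma^2) with known \sigma^2 we have
\[
\ln f(x;\theta)=-\frac12\ln(2\pi\sigma^2)-\frac{(x-\theta)^2}{2\sigma^2},\qquad
\frac{\partial}{\partial\theta}\ln f(x;\theta)=\frac{x-\theta}{\sigma^2},
\]so
\[
I(\theta)=\mathbb E\Big[\Big(\frac{\mathrm x-\theta}{\sigma^2}\Big)^2\Big]=\frac{\operatorname{Var}(\mathrm x)}{\sigma^4}=\frac{1}{\sigma^2}.
\]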
The properties of Fisher information presented later can be taken as further justification of the definition. Only scalar-valued \theta is considered unless otherwise specified.
2.2. Regularity conditions and alternative representation
The properties of Fisher information usually depend on the so-called "regularity conditions". Here are some of them.
Regularity conditions
- \theta\in\Theta, where \Theta is an open interval on \mathbb R.
- f(\boldsymbol x;\theta)>0 and \partial f(\boldsymbol x;\theta)/\partial\theta exists, for all \boldsymbol x and all \theta.
- The order of differentiation and integration with respect to f can be interchanged for all \theta. Specifically,
\begin{align*}
\mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big] & =\int \frac{\partial f(\boldsymbol x;\theta)}{\partial\theta}\, \mu(\mathrm d\boldsymbol x) \\
& =\frac{\partial}{\partial\theta}\int f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x)=0,\quad \forall\, \theta\in\Theta.
\end{align*} - For all \theta,
\[
0<I(\theta)=\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)^2\Big]=\int\Big(\frac{\partial}{\partial \theta}\ln f(\boldsymbol x;\theta)\Big)^2 f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x).
\]
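Not every family satisfies these conditions. A standard counterexample is the uniform distribution on (0,\theta),
\[
f(x;\theta)=\frac{1}{\theta}\,\mathbf 1\{0<x<\theta\},\qquad\theta>0,
\]whose support depends on \theta: condition 2 fails, differentiation and integration cannot be interchanged, and the results below need not apply to this family.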
Alternative representation
Under the regularity conditions, the score \frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta) has mean zero (condition 3), so I(\theta) can also be written as its variance:
\[
I(\theta)=\operatorname{Var}\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big).
\]
If we further assume that the order of differentiation and integration can also be interchanged for \partial f/\partial\theta, for all \theta, then
\[
I(\theta)
=-\mathbb E\Big[\frac{\partial^2}{\partial\theta^2}\ln f(\mathbf x;\theta)\Big].
\]This can be derived as follows.
\begin{align*}
0 & =\frac{\partial}{\partial\theta}\mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big] =\frac{\partial}{\partial\theta}\int\Big(\frac{\partial}{\partial\theta}\ln f(\boldsymbol x;\theta)\Big)f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x) \\
& =\int\Big[\Big(\frac{\partial^2}{\partial\theta^2}\ln f(\boldsymbol x;\theta)\Big)f(\boldsymbol x;\theta)+\Big(\frac{\partial}{\partial\theta}\ln f(\boldsymbol x;\theta)\Big)\Big(\frac{\partial}{\partial\theta}f(\boldsymbol x;\theta)\Big)\Big]\, \mu(\mathrm d\boldsymbol x) \\
& =\int\Big[\Big(\frac{\partial^2}{\partial\theta^2}\ln f(\boldsymbol x;\theta)\Big)+\Big(\frac{\partial}{\partial\theta}\ln f(\boldsymbol x;\theta)\Big)^2\Big]f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x)\\
& =\mathbb E\Big[\frac{\partial^2}{\partial\theta^2}\ln f(\mathbf x;\theta)\Big]+\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)^2\Big].
\end{align*}
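The second-derivative form is often the more convenient one in computations. For a single observation x\sim\mathrm{Bernoulli}(\theta), for instance,
\[
\ln f(x;\theta)=x\ln\theta+(1-x)\ln(1-\theta),\qquad
\frac{\partial^2}{\partial\theta^2}\ln f(x;\theta)=-\frac{x}{\theta^2}-\frac{1-x}{(1-\theta)^2},
\]so
\[
I(\theta)=-\mathbb E\Big[\frac{\partial^2}{\partial\theta^2}\ln f(\mathrm x;\theta)\Big]=\frac{\theta}{\theta^2}+\frac{1-\theta}{(1-\theta)^2}=\frac{1}{\theta}+\frac{1}{1-\theta}=\frac{1}{\theta(1-\theta)}.
\]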
2.3. Sum of information
It turns out that the Fisher information of several independent samples is just the sum of their individual Fisher informations. Specifically, suppose \mathrm x _ 1,\dots,\mathrm x _ n are independently sampled from f _ 1(x _ 1;\theta),\dots,f _ n(x _ n;\theta) respectively, each of which satisfies the four regularity conditions. Then f(\boldsymbol x;\theta)=f _ 1(x _ 1;\theta)\cdots f _ n(x _ n;\theta) also satisfies the four regularity conditions, and
\[
I(\theta)=I _ 1(\theta)+\dots+I _ n(\theta).
\]Here I(\theta) and I _ i(\theta) are the Fisher information of \mathbf x and \mathrm x _ i respectively.
Proof. It is clear that regularity conditions 1 and 2 hold. Since
\[
\frac{\partial f(\boldsymbol x;\theta)}{\partial\theta}=\sum _ {i=1}^{n}f _ 1\cdots f _ {i-1}\frac{\partial f _ i}{\partial \theta}f _ {i+1}\cdots f _ n,
\]integrating and applying Fubini's theorem, we have \int|\partial f/\partial \theta|\, \mu(\mathrm d\boldsymbol x)<\infty. Again by Fubini's theorem,
\[
\int\frac{\partial f(\boldsymbol x;\theta)}{\partial\theta}\, \mu(\mathrm d\boldsymbol x)=0.
\]Thus regularity condition 3 holds.
\begin{align*}
&\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)^2\Big] =\mathbb E\Big[\Big(\sum _ {i=1}^{n}\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big)^2\Big] \\
={}& \sum _ {i=1}^{n}\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big)^2\Big]+\sum _ {i\neq j}\mathbb E \Big[\Big(\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big)\Big(\frac{\partial}{\partial \theta}\ln f _ j(\mathrm x _ j;\theta)\Big)\Big] \\
={}& \sum _ {i=1}^{n}\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big)^2\Big]+\sum _ {i\neq j}\mathbb E \Big[\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big]\mathbb E\Big[\frac{\partial}{\partial \theta}\ln f _ j(\mathrm x _ j;\theta)\Big] \\
={}& \sum _ {i=1}^{n}\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f _ i(\mathrm x _ i;\theta)\Big)^2\Big].
\end{align*}
Thus regularity condition 4 holds, and I(\theta)=I _ 1(\theta)+\dots+I _ n(\theta).
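In particular, for n i.i.d. observations from a common density f _ 1(\cdot;\theta) we get I(\theta)=nI _ 1(\theta). For example, for n independent observations from N(\theta,\sigma^2) with known \sigma^2,
\[
I(\theta)=n\cdot\frac{1}{\sigma^2}=\frac{n}{\sigma^2}.
\]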
2.4. Data processing
The Fisher information cannot increase after data processing. Specifically, suppose \mathrm t=t(\mathbf x) is a statistic with density g(t;\theta), that f and g satisfy the four regularity conditions, and that the order of differentiation and integration can be interchanged over any domain of integration. Then
\begin{equation}\label{1}
I _ {\mathbf x}(\theta)\geqslant I _ {\mathrm t}(\theta).
\end{equation}
Proof. For any measurable A,
\begin{align*}
&\int _ A\Big(\frac{\partial}{\partial\theta}\ln g(t;\theta)\Big)g(t;\theta)\, \mu _ t(\mathrm dt) =\frac{\partial}{\partial\theta}\int _ A g(t;\theta)\, \mu _ t(\mathrm dt) \\
={}&\frac{\partial}{\partial\theta}\int _ {t^{-1}(A)}f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x) =\int _ {t^{-1}(A)}\Big(\frac{\partial}{\partial\theta}\ln f(\boldsymbol x;\theta)\Big)f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x).
\end{align*}Since this holds for every measurable A, by the definition of conditional expectation,
\[
\mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big|\mathrm t\Big]=\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta).
\]By the property of conditional expectation,
\begin{align*}
& \mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\Big] =\mathbb E\Big[\mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\Big|\mathrm t\Big]\Big] \\
={} & \mathbb E\Big[\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\, \mathbb E\Big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big|\mathrm t\Big]\Big]= \mathbb E\Big[\Big(\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\Big)^2\Big].
\end{align*}Hence
\begin{align}\label{2}
\begin{split}
I _ {\mathbf x}(\theta)-I _ {\mathrm t}(\theta) & =\mathbb E\Big[\Big(\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big)^2\Big]-\mathbb E\Big[\Big(\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\Big)^2\Big] \\
& =\mathbb E\Big[\Big(\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)-\frac{\partial}{\partial\theta}\ln g(\mathrm t;\theta)\Big)^2\Big]\geqslant0.
\end{split}
\end{align}
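For a concrete instance of strict information loss, let \mathbf x=(\mathrm x _ 1,\dots,\mathrm x _ n) be i.i.d. N(\theta,\sigma^2) with known \sigma^2 and let \mathrm t=\mathrm x _ 1, which discards all but the first observation. Then
\[
I _ {\mathrm t}(\theta)=\frac{1}{\sigma^2}\leqslant\frac{n}{\sigma^2}=I _ {\mathbf x}(\theta),
\]with strict inequality whenever n>1.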
2.5. Sufficient statistic
With the above assumptions, if one more regularity condition holds, namely that for any \boldsymbol x, \partial f/\partial\theta and \partial g/\partial \theta are continuous functions of \theta, then the equality in (1) holds for all \theta if and only if \mathrm t is a sufficient statistic.
It is often said that sufficient statistics are exactly the statistics that compress the sample data without losing information about the parameter \theta. This result gives that statement a precise meaning in terms of Fisher information.
The idea of the proof is roughly sketched here. First assume I _ {\mathbf x}(\theta)= I _ {\mathrm t}(\theta) holds for all \theta. By (2) we know that \frac\partial{\partial\theta} \ln f(\boldsymbol x;\theta)=\frac\partial{\partial\theta}\ln g(t(\boldsymbol x);\theta) a.e.[\mathbb P _ \theta] for each \theta. It can then be shown that this holds for all \theta a.e.[\mu], which implies that f/g does not depend on \theta; denote it by h(\boldsymbol x). Then f(\boldsymbol x;\theta)=g(t(\boldsymbol x);\theta)h(\boldsymbol x) a.e.[\mu], and by the factorization theorem \mathrm t is a sufficient statistic. Conversely, assume \mathrm t is a sufficient statistic. Then for any \theta, f(\boldsymbol x;\theta)=g _ 0(t(\boldsymbol x);\theta)h(\boldsymbol x) a.e.[\mu], and it can be shown that there exists a function \xi(t) such that g _ 0(t(\boldsymbol x);\theta)=g(t(\boldsymbol x);\theta)/\xi(t(\boldsymbol x)) a.e.[\mu], so f(\boldsymbol x;\theta)=g(t(\boldsymbol x);\theta)h(\boldsymbol x)/\xi(t(\boldsymbol x)) a.e.[\mu]. By an argument similar to the "only if" part, this holds for all \theta a.e.[\mu]. Taking logarithms and differentiating with respect to \theta, we get \frac\partial{\partial\theta} \ln f(\boldsymbol x;\theta)=\frac\partial{\partial\theta}\ln g(t(\boldsymbol x);\theta) a.e.[\mathbb P _ \theta] for each \theta. By (2), I _ {\mathbf x}(\theta)= I _ {\mathrm t}(\theta) holds for all \theta.
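To illustrate the equality case, let \mathbf x be i.i.d. N(\theta,\sigma^2) with known \sigma^2 and take \mathrm t=\bar{\mathrm x}, which is sufficient for \theta by the factorization theorem. Since \bar{\mathrm x}\sim N(\theta,\sigma^2/n),
\[
I _ {\mathrm t}(\theta)=\frac{1}{\sigma^2/n}=\frac{n}{\sigma^2}=I _ {\mathbf x}(\theta),
\]so reducing the sample to its mean loses no information about \theta.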
2.6. Relation to relative entropy
We can also establish some connections between Fisher information and other notions in information theory. For example, the KL divergence, or relative entropy, between the true distribution f(\boldsymbol x;\theta _ 0) and another distribution f(\boldsymbol x;\theta) is defined by
\[
D _ {\mathrm{KL}}(f(\boldsymbol x;\theta _ 0)\|f(\boldsymbol x;\theta))=-\int f(\boldsymbol x;\theta _ 0)\ln\frac{f(\boldsymbol x;\theta)}{f(\boldsymbol x;\theta _ 0)}\, \mu(\mathrm d\boldsymbol x),
\]which is minimized at \theta=\theta _ 0, so both the zeroth- and first-order terms of its Taylor expansion at \theta _ 0 vanish. Under some smoothness assumptions and regularity conditions, the expansion near \theta _ 0 up to second order is
\begin{align*}
& D _ {\mathrm{KL}}(f(\boldsymbol x;\theta _ 0)\|f(\boldsymbol x;\theta)) \\
={} & -\frac12\Big[\int f(\boldsymbol x;\theta _ 0)\frac{\partial^2}{\partial\theta^2}\ln f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x)\Big]\Big| _ {\theta=\theta _ 0}(\theta-\theta _ 0)^2+o((\theta-\theta _ 0)^2).
\end{align*}By the alternative representation of Fisher information, the second derivative of the relative entropy D _ {\mathrm{KL}}(f(\boldsymbol x;\theta _ 0)\|f(\boldsymbol x;\theta)) evaluated at \theta _ 0 is just the Fisher information I(\theta _ 0). We can see that the Fisher information represents the curvature of the relative entropy.
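The normal location family gives a concrete check: for a single observation from f(\cdot;\theta)=N(\theta,\sigma^2) with known \sigma^2, a direct computation gives
\[
D _ {\mathrm{KL}}(f(x;\theta _ 0)\|f(x;\theta))=\frac{(\theta-\theta _ 0)^2}{2\sigma^2}=\frac12 I(\theta _ 0)(\theta-\theta _ 0)^2,
\]so the curvature of the relative entropy at \theta _ 0 is exactly I(\theta _ 0)=1/\sigma^2.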
