Fisher information in probability and statistics

Contents
 1.  Fisher information in Frequentist Statistics
   1.1.  Cramér-Rao inequality
   1.2.  Asymptotic property of MLE
 2.  Fisher information in Bayesian Statistics
   2.1.  Jeffreys prior
   2.2.  Idea

(Continued.)

1. Fisher information in Frequentist Statistics

1.1. Cramér-Rao inequality

In statistics, it is important to find an unbiased estimate of g(\theta) with the smallest possible variance. It is not hard to see that, for a fixed sample size, the variance cannot be arbitrarily small; that is, there is a positive lower bound (depending on \theta). There are several well-known results on this lower bound, among which the most famous is an inequality due to Cramér and Rao. The Cramér-Rao inequality has been refined since, but its basic form remains essentially unchanged. In the Cramér-Rao inequality, stated as a theorem below, the Fisher information appears in the lower bound.

Theorem 1. Assume the four regularity conditions hold, g:\Theta\to\mathbb R and g'(\theta) exists. Assume \hat g(\boldsymbol x) is an unbiased estimate of g(\theta), satisfying one more regularity condition: differentiation with respect to \theta and integration of \hat g(\boldsymbol x)f(\boldsymbol x;\theta) can be interchanged. Then
\[
\operatorname{Var}(\hat g)\geqslant \frac{(g'(\theta))^2}{I(\theta)}.
\]

Proof. Suppose I(\theta)<\infty and \operatorname{Var}(\hat g)<\infty; otherwise the inequality holds trivially. By unbiasedness and the regularity condition,
\begin{align*}
g'(\theta) & =\frac{\mathrm d}{\mathrm d\theta}g(\theta)=\frac{\partial}{\partial\theta}\int \hat g(\boldsymbol x)f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x)=\int \hat g(\boldsymbol x)\frac{\partial}{\partial\theta}f(\boldsymbol x;\theta)\, \mu(\mathrm d\boldsymbol x) \\
& =\mathbb E\Big[\hat g(\mathbf x)\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big]=\operatorname{Cov}\Big(\hat g(\mathbf x),\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big),
\end{align*}where the last equality uses \mathbb E\big[\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\big]=0. Hence, by the Cauchy-Schwarz inequality,
\[
\operatorname{Var}(\hat g(\mathbf x))\operatorname{Var}\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)\geqslant \operatorname{Cov}^2\Big(\hat g(\mathbf x),\frac{\partial}{\partial\theta}\ln f(\mathbf x;\theta)\Big)= (g'(\theta))^2.
\]By the alternative representation of I(\theta), we obtain the Cramér-Rao inequality.
∎
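As a quick numerical sanity check of Theorem 1 (a sketch of mine, not part of the original argument), the following Python snippet simulates the Bernoulli model, where the sample mean is unbiased for p, the full-sample Fisher information is n/(p(1-p)), and the Cramér-Rao bound p(1-p)/n is in fact attained.

```python
import numpy as np

# Monte Carlo check of the Cramér-Rao bound for n i.i.d. Bernoulli(p) samples.
# The sample mean is an unbiased estimate of g(p) = p, the full-sample Fisher
# information is I(p) = n / (p(1-p)), so the bound (g'(p))^2 / I(p) = p(1-p)/n.
rng = np.random.default_rng(0)
p, n, reps = 0.3, 50, 200_000

samples = rng.binomial(1, p, size=(reps, n))
p_hat = samples.mean(axis=1)            # unbiased estimator of p

empirical_var = p_hat.var()
cramer_rao_bound = p * (1 - p) / n

print(f"empirical variance: {empirical_var:.6f}")
print(f"Cramér-Rao bound  : {cramer_rao_bound:.6f}")
# Up to Monte Carlo error the two agree, so the bound is attained in this model.
```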

For vector-valued \boldsymbol\theta of dimension k, we may want to find a best unbiased estimate of \boldsymbol g(\boldsymbol \theta)\in\mathbb R^l. For two such estimates \hat{\boldsymbol g}^{(1)}(\boldsymbol x),\hat{\boldsymbol g}^{(2)}(\boldsymbol x), it is natural to compare the variances componentwise. If \operatorname{Var}(\hat g _ i^{(1)})\leqslant \operatorname{Var}(\hat g _ i^{(2)}) for all i (1\leqslant i\leqslant l), then we may say \hat{\boldsymbol g}^{(1)} is better than \hat{\boldsymbol g}^{(2)}. However, it is more common to use the criterion based on the covariance matrix, i.e., \hat{\boldsymbol g}^{(1)} is better than \hat{\boldsymbol g}^{(2)} when \operatorname{Cov}(\hat{\boldsymbol g}^{(1)})\preccurlyeq \operatorname{Cov}(\hat{\boldsymbol g}^{(2)}). Since for any \boldsymbol a\in\mathbb R^l,
\begin{align*}
\operatorname{Cov}(\hat{\boldsymbol g}^{(1)})\preccurlyeq \operatorname{Cov}(\hat{\boldsymbol g}^{(2)}) & \iff \boldsymbol a^\mathsf T\operatorname{Cov}(\hat{\boldsymbol g}^{(1)})\boldsymbol a\leqslant\boldsymbol a^\mathsf T\operatorname{Cov}(\hat{\boldsymbol g}^{(2)})\boldsymbol a \\
& \iff \operatorname{Var}(\boldsymbol a^\mathsf T\hat{\boldsymbol g}^{(1)})\leqslant\operatorname{Var}(\boldsymbol a^\mathsf T\hat{\boldsymbol g}^{(2)}),
\end{align*}we can see this is a stronger criterion. Moreover, \boldsymbol a^\mathsf T\hat{\boldsymbol g}^{(i)} is an unbiased estimator of \boldsymbol a^\mathsf T\boldsymbol g(\boldsymbol\theta). Hence when \operatorname{Cov}(\hat{\boldsymbol g}^{(1)})\preccurlyeq \operatorname{Cov}(\hat{\boldsymbol g}^{(2)}) holds, not only is each component of the first estimator at least as good as that of the second, but so is every linear combination of the components.
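In computations, the ordering \operatorname{Cov}(\hat{\boldsymbol g}^{(1)})\preccurlyeq \operatorname{Cov}(\hat{\boldsymbol g}^{(2)}) can be checked by testing whether the difference of the two covariance matrices is positive semidefinite. A minimal sketch (the matrices below are made up purely for illustration):

```python
import numpy as np

def loewner_leq(A, B, tol=1e-10):
    """Check A ⪯ B, i.e. B - A is positive semidefinite (A, B symmetric)."""
    return bool(np.all(np.linalg.eigvalsh(B - A) >= -tol))

# Hypothetical covariance matrices of two competing unbiased estimators.
cov1 = np.array([[1.0, 0.2],
                 [0.2, 0.8]])
cov2 = np.array([[1.5, 0.2],
                 [0.2, 1.1]])

print(loewner_leq(cov1, cov2))  # True: the first estimator is better in this sense,
                                # i.e. Var(aᵀ ĝ⁽¹⁾) ≤ Var(aᵀ ĝ⁽²⁾) for every a.
```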

The regularity conditions can be modified to a version for vectors, but their statements are omitted here.

Theorem 2. Assume the regularity conditions for vectors hold, \boldsymbol g:\Theta\to\mathbb R^l (l\leqslant k) and \partial g _ i(\boldsymbol \theta)/\partial\theta _ j exists for all i,j. Let D(\boldsymbol\theta)=(\partial g _ i/\partial \theta _ j)\in\mathbb R^{l\times k}, and let \hat{\boldsymbol g}(\boldsymbol x) be an unbiased estimate of \boldsymbol g(\boldsymbol \theta) whose components have finite variance. Suppose one more regularity condition holds: partial differentiation with respect to \theta _ j and integration of \hat g _ i(\boldsymbol x)f(\boldsymbol x;\boldsymbol\theta) can be interchanged. Then
\[
\operatorname{Cov}(\hat{\boldsymbol g})\succcurlyeq D(\boldsymbol\theta)I^{-1}(\boldsymbol\theta)D(\boldsymbol\theta)^\mathsf T.
\]

Proof. Let \boldsymbol S(\boldsymbol x,\boldsymbol \theta)=\big(\frac{\partial \ln f}{\partial\theta _ 1},\dots,\frac{\partial \ln f}{\partial\theta _ k}\big)^\mathsf T be the score vector. Then
\begin{gather*}
\mathbb E[\boldsymbol S(\mathbf x,\boldsymbol\theta)]=\boldsymbol 0,\quad \operatorname{Cov}(\boldsymbol S(\mathbf x,\boldsymbol\theta))=I(\boldsymbol\theta)\succ0, \\
\operatorname{Cov}\Big(\hat g _ i(\mathbf x),\frac{\partial }{\partial \theta _ j}\ln f(\mathbf x;\boldsymbol\theta)\Big)=\frac{\partial}{\partial\theta _ j}g _ i(\boldsymbol \theta).
\end{gather*}Thus
\[
0\preccurlyeq\operatorname{Cov}\begin{pmatrix}
\hat{\boldsymbol g} \\
\boldsymbol S
\end{pmatrix}
=\begin{pmatrix}
\operatorname{Cov}(\hat{\boldsymbol g}) & D(\boldsymbol\theta) \\
D(\boldsymbol\theta)^\mathsf T & I(\boldsymbol\theta)
\end{pmatrix}.
\]A positive semidefinite matrix of order l+k can be written as A^\mathsf TA where A is a square matrix of the same order. Denote the first l columns of A by A _ 1 and the last k columns by A _ 2; then
\[
\begin{pmatrix}
\operatorname{Cov}(\hat{\boldsymbol g}) & D(\boldsymbol\theta) \\
D(\boldsymbol\theta)^\mathsf T & I(\boldsymbol\theta)
\end{pmatrix}=
\begin{pmatrix}
A _ 1^\mathsf TA _ 1 & A _ 1^\mathsf TA _ 2 \\
A _ 2^\mathsf TA _ 1 & A _ 2^\mathsf TA _ 2
\end{pmatrix}.
\]The blocks in the corresponding positions are equal. Hence
\[
\operatorname{Cov}(\hat{\boldsymbol g})-D(\boldsymbol\theta)I^{-1}(\boldsymbol\theta)D(\boldsymbol\theta)^\mathsf T=
A _ 1^\mathsf T(I _ l-A _ 2(A _ 2^\mathsf TA _ 2)^{-1}A _ 2^\mathsf T)A _ 1\succcurlyeq0.
\]Here I _ l denotes the identity matrix of order l, and A _ 2^\mathsf TA _ 2=I(\boldsymbol\theta) is invertible since it is positive definite. The positive semidefiniteness holds since, by linear algebra, I _ l-A _ 2(A _ 2^\mathsf TA _ 2)^{-1}A _ 2^\mathsf T is an orthogonal projection matrix and hence \succcurlyeq0. ∎
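To see Theorem 2 at work numerically, here is a small simulation of mine (not from the original text) for n i.i.d. N(\mu,\sigma^2) observations with \boldsymbol\theta=(\mu,\sigma^2) and \boldsymbol g(\boldsymbol\theta)=\boldsymbol\theta, so that D=I _ 2 and the bound is I _ n^{-1}(\boldsymbol\theta)=\operatorname{diag}(\sigma^2/n,\,2\sigma^4/n). The sample mean attains its bound, while the unbiased sample variance s^2 has variance 2\sigma^4/(n-1), slightly above it.

```python
import numpy as np

# Monte Carlo illustration of the multiparameter Cramér-Rao bound for
# n i.i.d. N(mu, sigma^2) samples, estimating g(theta) = (mu, sigma^2)
# with (sample mean, unbiased sample variance). Here D = I_2 and
# I_n(theta) = n * diag(1/sigma^2, 1/(2 sigma^4)).
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 2.0, 4.0, 30, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
est = np.column_stack([x.mean(axis=1), x.var(axis=1, ddof=1)])   # (x̄, s²)

emp_cov = np.cov(est, rowvar=False)
bound = np.diag([sigma2 / n, 2 * sigma2**2 / n])                 # I_n^{-1}(theta)

print(np.round(emp_cov, 4))
print(np.round(bound, 4))
# Eigenvalues of the difference are >= 0 up to Monte Carlo error:
# Var(x̄) attains sigma^2/n, while Var(s²) = 2 sigma^4/(n-1) > 2 sigma^4/n.
print(np.linalg.eigvalsh(emp_cov - bound))
```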
1.2. Asymptotic property of MLE

Modern asymptotic theory in frequentist statistics relies on a large number of rather technical regularity conditions. For clarity and simplicity, we will not dwell on the technical details of these conditions; the qualitative picture is what matters here.

Basically, there are two main approaches to establishing the asymptotic properties of maximum likelihood estimation (MLE). The first starts from the definition; it is notably challenging because it requires many regularity conditions, but its advantage is that it studies the MLE itself directly. The second approach starts from the likelihood equation, which essentially studies solutions of that equation rather than the MLE itself (a solution is not necessarily an MLE). This is less formidable than the first approach, but it still demands numerous regularity conditions. For simplicity, we focus on results from the second approach.

The MLE is said to be asymptotically optimal. Suppose we have n i.i.d. samples. If \sqrt{n}(\hat\theta-\theta)\stackrel d\to N(0,V^2), then the smaller the asymptotic variance V^2 is, the better the estimator. Suppose the Fisher information is finite. When n is sufficiently large, \hat\theta is nearly unbiased, and by the Cramér-Rao inequality the asymptotic variance is bounded below by 1/I(\theta), where I(\theta) is the Fisher information of a single observation. Therefore, if \sqrt{n}(\hat\theta-\theta)\stackrel d\to N(0,1/I(\theta)), then \hat\theta may be regarded as a best asymptotically normal estimate. The MLE satisfies exactly this asymptotic distribution, so it is asymptotically optimal.

We consider the following setting: (i) The n samples are i.i.d. from f(x;\theta), where \theta\in\Theta and \Theta is an open interval in \mathbb R. (ii) The Fisher information is positive and finite for all \theta. (iii) We can differentiate under the integral signs of \int f\, \mathrm d\mu and \int \partial f/\partial\theta\, \mathrm d\mu. (iv) There exists M(x) such that for all \theta,
\[
\Big|\frac{\partial^3}{\partial\theta^3}\ln f(x;\theta)\Big|<M(x),\quad \int M(x)f(x;\theta)\, \mu(\mathrm dx)<K<\infty.
\]

(v) Different values of \theta correspond to different distributions of the samples (identifiability). Then:

Theorem 3. Under this setting, if the true value of \theta is \theta _ 0, then there exists a solution \hat\theta _ n(\mathbf x) of the likelihood equation satisfying \sqrt n(\hat\theta _ n-\theta _ 0)\stackrel d\to N(0,1/I(\theta _ 0)).

Proof. It can be proved that the likelihood equation has a solution \hat\theta _ n(\mathbf x) such that \mathbb P(\hat\theta _ n\to\theta _ 0)=1. The proof of consistency is left out.

The log likelihood function is
\[
L=\sum _ {i=1}^{n}\ln f(x _ i;\theta).
\]By Taylor expansion,
\[
0=\frac{\partial L}{\partial\theta}\Big| _ {\theta=\hat\theta _ n}=\frac{\partial L}{\partial \theta _ 0}+(\hat\theta _ n-\theta _ 0)\frac{\partial^2L}{\partial\theta _ 0^2}+ \frac{(\hat\theta _ n-\theta _ 0)^2}{2}\frac{\partial^3L}{\partial\theta _ 1^3},
\]where \theta _ 1 lies between \theta _ 0 and \hat\theta _ n. The derivatives are all taken with respect to \theta, and \partial L/\partial \theta _ 0 denotes \partial L/\partial \theta| _ {\theta=\theta _ 0}, etc. Rearranging to isolate \sqrt n(\hat\theta _ n-\theta _ 0):
\[
\sqrt n(\hat\theta _ n-\theta _ 0)=\frac{-\frac1{\sqrt n}\frac{\partial L}{\partial\theta _ 0}}{\frac1n(\frac{\partial^2L}{\partial\theta _ 0^2}+ \frac{(\hat\theta _ n-\theta _ 0)}{2}\frac{\partial^3L}{\partial\theta _ 1^3})}.
\]Since, by the central limit theorem,
\[
\frac{1}{\sqrt{n}}\frac{\partial L}{\partial\theta _ 0}=\frac{1}{\sqrt n}\sum _ {i=1}^{n}\frac{\partial}{\partial\theta _ 0}\ln f(x _ i;\theta)\stackrel d\to N(0,I(\theta _ 0)),
\]we just need to show \sqrt n(\hat\theta _ n-\theta _ 0)I(\theta _ 0) and \frac{1}{\sqrt{n}}\frac{\partial L}{\partial\theta _ 0} have the same asymptotic distribution.
\[
\sqrt n(\hat\theta _ n-\theta _ 0)I(\theta _ 0)-\frac{1}{\sqrt{n}}\frac{\partial L}{\partial\theta _ 0}=-\bigg[\frac{I(\theta _ 0)}{\frac1n(\frac{\partial^2L}{\partial\theta _ 0^2}+ \frac{(\hat\theta _ n-\theta _ 0)}{2}\frac{\partial^3L}{\partial\theta _ 1^3})}+1\bigg]\frac{1}{\sqrt{n}}\frac{\partial L}{\partial\theta _ 0}.
\]Hence it suffices to show the term in the brackets converges to zero almost surely. We have
\begin{gather*}
\frac1n\frac{\partial^2 L}{\partial\theta _ 0^2}=\frac1n\sum _ {i=1}^{n}\frac{\partial^2}{\partial\theta _ 0^2}\ln f(x _ i;\theta)\xrightarrow{\text{a.s.}} -I(\theta _ 0), \\
\frac1n\Big|\frac{\partial^3L}{\partial\theta _ 1^3}\Big|\leqslant\frac1n\sum _ {i=1}^{n}M(x _ i)\xrightarrow{\text{a.s.}}\mathbb E[M(\mathbf x)]<K.
\end{gather*}Together with \hat\theta _ n-\theta _ 0\to0 almost surely, we obtain (\hat\theta _ n-\theta _ 0)\frac1n\frac{\partial^3L}{\partial\theta _ 1^3}\to0 almost surely. Thus the term in the brackets converges to zero almost surely, so the difference above converges to zero in probability, and by Slutsky's theorem the conclusion \sqrt n(\hat\theta _ n-\theta _ 0)\stackrel d\to N(0,1/I(\theta _ 0)) holds.
∎
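Theorem 3 is easy to visualize by simulation. The sketch below is my own illustration (using the exponential model with rate \lambda, for which I(\lambda)=1/\lambda^2 and the MLE is 1/\bar x); it checks that \sqrt n(\hat\lambda _ n-\lambda _ 0) has approximately the N(0,1/I(\lambda _ 0))=N(0,\lambda _ 0^2) spread.

```python
import numpy as np

# Simulation of the asymptotic normality of the MLE for Exp(rate = lambda):
# ln f(x; lambda) = ln(lambda) - lambda*x, I(lambda) = 1/lambda^2, MLE = 1/x̄,
# so sqrt(n)(MLE - lambda) should be approximately N(0, lambda^2).
rng = np.random.default_rng(2)
lam, n, reps = 1.5, 400, 100_000

x = rng.exponential(scale=1 / lam, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)                    # MLE of the rate

z = np.sqrt(n) * (lam_hat - lam)
print(f"empirical std of sqrt(n)(MLE - lambda): {z.std():.4f}")
print(f"asymptotic std 1/sqrt(I(lambda)) = lambda: {lam:.4f}")
# The empirical standard deviation is close to lambda, matching N(0, 1/I(lambda)).
```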

It turns out that Fisher information plays an essential role in modern large sample theory of statistics, not only in the asymptotic property of MLE.
2. Fisher information in Bayesian Statistics

Bayesian inference starts by formalizing our knowledge of the parameter \theta as a distribution, known as the prior distribution p(\theta), and then "updates" this knowledge after observing n samples via Bayes' theorem:
\[
p(\theta|x)\propto p(\theta)p(x|\theta).
\]Here p(\theta|x) is known as the posterior distribution. Based on the posterior distribution, one can perform point estimation, interval estimation, hypothesis testing, etc.

Therefore, it is of great importance to construct a suitable prior distribution. In particular, if we know nothing about \theta, we should construct an "uninformative" prior. Naturally, we might use the uniform distribution over \Theta as the default prior. The problem is that the uniform distribution is not invariant under reparameterization, and therefore it is not truly uninformative. For example, suppose we choose U(0,1) as the prior for p in \mathrm{Bernoulli}(p) because of our poor knowledge of p. It is not hard to verify that p^3 is then not uniformly distributed; yet we are equally ignorant about p^3, so by the same reasoning p^3 should also be uniformly distributed. Thus the uniform distribution may not be a good choice, because our inference would depend on how the model is parameterized.
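A quick simulation (my own, not from the original text) makes the non-invariance concrete: drawing p from U(0,1) and looking at p^3 shows that the induced distribution is heavily skewed towards 0.

```python
import numpy as np

# If p ~ U(0,1), the density of q = p^3 is (1/3) q^{-2/3}, far from uniform.
rng = np.random.default_rng(3)
p = rng.uniform(size=1_000_000)
q = p**3

counts, _ = np.histogram(q, bins=10, range=(0, 1))
print(np.round(counts / counts.sum(), 2))
# Roughly [0.46, 0.12, 0.08, ...]: the mass piles up near 0, so "uniform
# ignorance" about p does not translate into uniform ignorance about p^3.
```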
2.1. Jeffreys prior

The Jeffreys prior is built on an invariance principle that addresses this issue. It is defined using the Fisher information:
\[
p(\theta)\propto \sqrt{I(\theta)}.
\]For a vector parameter \boldsymbol\theta, it is
\[
p(\boldsymbol\theta)\propto\sqrt{\det(I(\boldsymbol\theta))}.
\]
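As a standard worked example (my choice of model, not from the text): for a single Bernoulli(p) observation the Fisher information is I(p)=1/(p(1-p)), so the Jeffreys prior is p(p)\propto p^{-1/2}(1-p)^{-1/2}, i.e. the Beta(1/2,1/2) distribution. The sketch below verifies the formula for I(p) by computing \mathbb E[(\partial _ p\ln f)^2] exactly.

```python
import numpy as np

# Jeffreys prior for Bernoulli(p): ln f = x ln p + (1-x) ln(1-p), so the score
# is x/p - (1-x)/(1-p); averaging its square over x in {0, 1} gives I(p).
def fisher_info_bernoulli(p):
    return p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2   # = 1/(p(1-p))

ps = np.linspace(0.05, 0.95, 7)
print(np.allclose([fisher_info_bernoulli(v) for v in ps], 1 / (ps * (1 - ps))))  # True

jeffreys_unnormalized = np.sqrt(1 / (ps * (1 - ps)))   # ∝ Beta(1/2, 1/2) density
print(np.round(jeffreys_unnormalized, 3))
```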
The invariance property can be illustrated in the one-dimensional case. Suppose \varphi=\varphi(\theta) is a bijective reparameterization. Then
\begin{align*}
I _ {\theta}(\theta)=\mathbb E\Big[\Big(\frac{\partial}{\partial \theta}\ln f(\mathbf x;\theta)\Big)^2\Big]
&=\mathbb E\Big[\Big(\frac{\partial}{\partial \varphi}\ln f(\mathbf x;\theta(\varphi))\Big)^2\Big(\frac{\mathrm d\varphi}{\mathrm d\theta}\Big)^2\Big] \\
& =I _ {\varphi}(\varphi)\Big(\frac{\mathrm d\varphi}{\mathrm d\theta}\Big)^2.
\end{align*}The invariance requires
\[
p _ {\varphi}(\varphi)=p _ \theta(\theta)\Big|\frac{\mathrm d\varphi}{\mathrm d\theta}\Big|^{-1}.
\]Thus the desired property holds if we define the priors by p _ \theta(\theta)\propto\sqrt{I _ \theta(\theta)} and p _ {\varphi}(\varphi)\propto \sqrt{I _ \varphi(\varphi)}, since \sqrt{I _ \theta(\theta)}=\sqrt{I _ \varphi(\varphi)}\,\big|\frac{\mathrm d\varphi}{\mathrm d\theta}\big|.
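The invariance can also be checked numerically. In the sketch below (my own, for the Bernoulli model with the log-odds reparameterization \varphi=\log\frac{p}{1-p}, so \mathrm dp/\mathrm d\varphi=p(1-p)), the Jeffreys prior computed directly in \varphi coincides with the prior computed in p and then transformed by the change-of-variables formula above.

```python
import numpy as np

# Invariance check for Bernoulli(p) with phi = log(p / (1-p)), dp/dphi = p(1-p):
# (a) directly in phi:     sqrt(I_phi(phi)) = sqrt(I_p(p)) * |dp/dphi| = sqrt(p(1-p));
# (b) transform the prior: p_phi(phi) = p_p(p) * |dp/dphi| with p_p(p) ∝ sqrt(I_p(p)).
phi = np.linspace(-3, 3, 13)
p = 1 / (1 + np.exp(-phi))

direct      = np.sqrt(p * (1 - p))                       # sqrt(I_phi(phi))
transformed = np.sqrt(1 / (p * (1 - p))) * p * (1 - p)   # sqrt(I_p(p)) * |dp/dphi|

print(np.allclose(direct, transformed))   # True: both constructions give the same prior
```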
2.2. Idea

Very roughly, the idea of Jeffreys' original paper is sketched in the following.

Suppose D(P',P) in some way measures the "distance" between two distributions P' and P. For example, if P' and P have densities p'(x) and p(x) respectively, let
\[
D(P',P)=D _ {\mathrm{KL}}(P'\|P)+D _ {\mathrm{KL}}(P\|P')=\int (p'(x)-p(x))(\log p'(x)-\log p(x))\, \mathrm dx.
\]Now suppose the two distributions belong to the same parametric family P _ \theta. We set P'=P _ {\theta+\Delta\theta} and P=P _ \theta. In Section 2.6, the relation between KL divergence and Fisher information was discussed, so we know that
\[
D(P _ {\theta+\Delta\theta},P _ \theta)\approx I(\theta)(\Delta\theta)^2\implies
D(P _ {\theta+\Delta\theta},P _ \theta)^{1/2}\approx \sqrt{I(\theta)}\Delta\theta.
\]From this we can see \sqrt{I(\theta)} appears as something like a density.

We would like a "uniform" distribution over the "model space" (rather than the "parameter space"), and we want \sqrt D to be independent of the parameterization. Suppose again that \varphi=\varphi(\theta) is a bijective reparameterization. Then \varphi(\theta+\Delta\theta)\approx\varphi(\theta)+\frac{\mathrm d\varphi}{\mathrm d\theta}\Delta\theta, so that \Delta\varphi\approx\frac{\mathrm d\varphi}{\mathrm d\theta}\Delta\theta, and
\[
\sqrt{I _ \theta(\theta)}\Delta\theta\approx D(P _ {\theta+\Delta\theta},P _ \theta)^{1/2}=D(P _ {\varphi+\Delta\varphi},P _ \varphi)^{1/2}\approx \sqrt{I _ \varphi(\varphi)}\Delta\varphi.
\]This relation holds since, as shown above, the Fisher information has the property I _ {\theta}(\theta)=I _ {\varphi}(\varphi)(\frac{\mathrm d\varphi}{\mathrm d\theta})^2.
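The approximation D(P _ {\theta+\Delta\theta},P _ \theta)\approx I(\theta)(\Delta\theta)^2 is easy to check numerically; here is a small sketch of mine using the Bernoulli model, where I(p)=1/(p(1-p)) and the symmetric KL divergence is computed directly from the definition above.

```python
import numpy as np

# Symmetric KL divergence between Bernoulli(p + dp) and Bernoulli(p), compared
# with the local approximation I(p) * dp^2, where I(p) = 1/(p(1-p)).
def sym_kl_bernoulli(p1, p0):
    # sum over x in {0, 1} of (p1(x) - p0(x)) * (log p1(x) - log p0(x))
    return ((p1 - p0) * (np.log(p1) - np.log(p0))
            + ((1 - p1) - (1 - p0)) * (np.log(1 - p1) - np.log(1 - p0)))

p, dp = 0.3, 1e-3
exact  = sym_kl_bernoulli(p + dp, p)
approx = dp**2 / (p * (1 - p))            # I(p) * (Δp)^2

print(f"{exact:.10f}  vs  {approx:.10f}")  # the two agree to leading order in Δp
```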

