Next: Estimation Up: Reduction of Data Previous: Likelihood   Contents

Information in a Sample

In the next chapter we will be considering properties of estimators. One of these properties involves the variance of an estimator and our desire to choose an estimator with variance as small as possible. Some concepts and results that will be used there are introduced in this section. In particular, we will consider the notion of information in a sample, and how we measure this information when data from several experiments is combined.

Consider a distribution indexed by a real parameter $\theta$ and suppose $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are independent sets of data. Then the likelihood of the combined sample is the product of the likelihoods of the two individual samples. That is,

\begin{displaymath}L(\theta;{\bf x}, {\bf y})=L_1(\theta;{\bf x})L_2(\theta;{\bf y}) \end{displaymath}

and so

\begin{displaymath}\log L(\theta;{\bf x},{\bf y})=\log L_1(\theta;{\bf x}) + \log L_2(\theta;
{\bf y}). \end{displaymath}

The statistic that we shall be concerned with is the derivative with respect to $\theta$ of the log likelihood.



Definition 7.8
The score of a sample, denoted by $V$, is defined by

\begin{displaymath}V=\frac{\partial}{\partial\theta}\log L(\theta;{\bf X})=
\frac{L'(\theta)}{L(\theta)},\end{displaymath}

where $L(\theta)$ is shorthand for $L(\theta;{\bf X})$ and $L'(\theta)=\frac{\partial}{\partial\theta}L(\theta).$


Some properties of $V$ are given below. Rigorous proofs of these results depend on fulfillment of conditions (sometimes referred to as regularity conditions) that permit interchange of integration and differentiation operations, and on the existence and integrability of the various partial derivatives. The proofs are not required in this course but an outline of the proof of (7.7) is given at the end of this section.


Properties of V

(i)
The expected value of $V$ is zero.
(ii)
Var($V$) is called the information (or Fisher's information) in a sample and is denoted by $I_{{\bf X}}(\theta)$. Since $E(V)=0$ by (i), we have
\begin{displaymath}
I_{{\bf X}}(\theta)=\mbox{Var}(V)=E\left\{\left[\frac{\partial}{\partial\theta}\log
f({\bf X};\theta)\right]^2\right\}.
\end{displaymath} (7.7)

(iii)
Information is additive over independent experiments. For ${\bf X}$, ${\bf Y}$ independent, and writing ${\bf X}+{\bf Y}$ for the combined sample $({\bf X},{\bf Y})$, we have
\begin{displaymath}
I_{{\bf X}}(\theta)+I_{{\bf Y}}(\theta)=I_{{\bf X}+{\bf Y}}(\theta).
\end{displaymath} (7.8)

(iv)
As a special case of (iii), the information in a random sample of size $n$ is $n$ times the information in a single observation. That is,
\begin{displaymath}
I_{{\bf X}}(\theta)=nI_X(\theta).
\end{displaymath} (7.9)

(v)
The information provided by a sufficient statistic $T=t({\bf X})$ is the same as that in the sample ${\bf X}$. That is,
\begin{displaymath}
I_T(\theta)=I_{{\bf X}}(\theta).
\end{displaymath} (7.10)

(vi)
The information in a sample can also be computed by the alternative formula
\begin{displaymath}
I_{{\bf X}}(\theta)=-E\left(\frac{\partial V}{\partial\theta}\right).
\end{displaymath} (7.11)

(vii)
For $T=t({\bf X})$ a statistic,
\begin{displaymath}
I_T(\theta) \leq I_{{\bf X}}(\theta)
\end{displaymath} (7.12)

with equality holding if and only if $T$ is a sufficient statistic for $\theta$.
[This property emphasizes the importance of sufficiency. Reducing a sample to a statistic may lose information about $\theta$, and no information is lost if and only if the statistic is sufficient.]
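
As a brief sketch of why property (iii) is plausible (not a rigorous proof), write $V_{{\bf X}}$ and $V_{{\bf Y}}$ for the scores of the two individual samples; these subscripted symbols are introduced here only for illustration. Differentiating the log likelihood of the combined sample, which is the sum of the two individual log likelihoods, gives

\begin{displaymath}V=\frac{\partial}{\partial\theta}\log L_1(\theta;{\bf X})+\frac{\partial}{\partial\theta}\log L_2(\theta;{\bf Y})=V_{{\bf X}}+V_{{\bf Y}},\end{displaymath}

and, because ${\bf X}$ and ${\bf Y}$ are independent, the variance of the sum is the sum of the variances:

\begin{displaymath}\mbox{Var}(V)=\mbox{Var}(V_{{\bf X}})+\mbox{Var}(V_{{\bf Y}})=I_{{\bf X}}(\theta)+I_{{\bf Y}}(\theta).\end{displaymath}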


Comment on (i) and (vi).

A typical example where the ``regularity conditions'' do not hold is the case where $X$ is distributed $U(0,\theta)$. When the range space of $X$ depends on $\theta$, the order of integration (over $x$) and differentiation (with respect to $\theta$) cannot usually be interchanged, as is done in proving (i) and (vi). In particular, for a sample of size $1$ from $f(x;\theta)=1/\theta,\ 0<x<\theta$, we have $L(\theta; x)=1/\theta$, and

\begin{eqnarray*}
\log L(\theta;x)&=& - \log \theta\\
V=\frac{\partial}{\partial\theta}\log L(\theta;x)&=& -\frac{1}{\theta}\\
E(V)&=&\int_0^{\theta} -\frac{1}{\theta}.\frac{1}{\theta}dx\\
&=& -\frac{1}{\theta} \neq 0.
\end{eqnarray*}
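
The same example can be used to check, as a sketch, that (vi) also fails here. With $V=-1/\theta$ (a constant, not depending on $x$), we have

\begin{eqnarray*}
\frac{\partial V}{\partial\theta}&=&\frac{1}{\theta^2}\\
-E\left(\frac{\partial V}{\partial\theta}\right)&=&-\frac{1}{\theta^2}\\
E(V^2)&=&\frac{1}{\theta^2}
\end{eqnarray*}

so $E(V^2) \neq -E(\partial V/\partial\theta)$, and formula (7.11) would give a negative value for the ``information''.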





Example 7.10
For $X_1, \ldots, X_n$ a random sample from a N($\mu$, $\sigma^2$) distribution, find
(a) the information for $\mu$; (b) the information for $\sigma^2$.

(a)
We have

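\begin{eqnarray*}
f(x_i;\mu)&=&(2 \pi \sigma^2)^{-1/2}e^{-(x_i-\mu)^2/2\sigma^2}\\
\log f(x_i;\mu)&=&-\frac{1}{2}\log (2\pi\sigma^2)-\frac{(x_i-\mu)^2}{2\sigma^2}\\
V=\frac{\partial}{\partial \mu}\log f(x_i;\mu)&=&\frac{x_i-\mu}{\sigma^2}\\
\frac{\partial V}{\partial \mu}&=&-\frac{1}{\sigma^2}\\
I_X(\mu)=-E\left(\frac{\partial V}{\partial \mu}\right)&=&\frac{1}{\sigma^2}
\end{eqnarray*}



Alternatively, we note that $V^2=(X_i-\mu)^2/\sigma^4$ and, since $E(V)=0$, that

\begin{displaymath}\mbox{Var}(V)=E(V^2)=\frac{E(X_i-\mu)^2}{\sigma^4}=\frac{1}{\sigma^2}.\end{displaymath}

[Both $I_X(\mu)$ and Var($V$) are expressions for the information in a single observation. By (7.9), the information in a random sample of size $n$ is thus $I_{{\bf X}}(\mu)=n/\sigma^2$.]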
(b)
We have

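\begin{eqnarray*}
V=\frac{\partial}{\partial \sigma^2}\log f &=& -
\frac{1}{2\sigma^2}+\frac{(x_i-\mu)^2}{2\sigma^4}\\
\frac{\partial V}{\partial \sigma^2}&=&\frac{1}{2\sigma^4}-\frac{(x_i-\mu)^2}{(\sigma^2)^3}\\
I_X(\sigma^2)=-E\left(\frac{\partial V}{\partial \sigma^2}\right)&=&-\frac{1}{2\sigma^4} + \frac{\sigma^2}{(\sigma^2)^3}\\
&=&\frac{1}{2 \sigma^4}
\end{eqnarray*}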



For a sample of size $n$, $I_{{\bf X}}(\sigma^2)=n/(2 \sigma^4).$
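
As in part (a), the result for a single observation can be checked by computing Var($V$) directly. The following is a sketch that relies on the standard fact that $(X_i-\mu)^2/\sigma^2$ has a chi-square distribution with 1 degree of freedom, so that Var$[(X_i-\mu)^2]=2\sigma^4$. With $V=-\frac{1}{2\sigma^2}+\frac{(X_i-\mu)^2}{2\sigma^4}$,

\begin{eqnarray*}
E(V)&=&-\frac{1}{2\sigma^2}+\frac{\sigma^2}{2\sigma^4}=0\\
\mbox{Var}(V)&=&\frac{\mbox{Var}[(X_i-\mu)^2]}{4\sigma^8}=\frac{2\sigma^4}{4\sigma^8}=\frac{1}{2\sigma^4}
\end{eqnarray*}

which agrees with the value obtained from $-E(\partial V/\partial \sigma^2)$.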



Example 7.11
Compute the information on $p$ from $n$ Bernoulli trials with probability of success equal to $p$. Writing $q=1-p$, we have

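\begin{eqnarray*}
f(x;p)&=& p^x (1-p)^{1-x}\\
\log f(x;p)&=& x \log p + (1-x) \log (1-p)\\
V=\frac{\partial}{\partial p}\log f(x;p)&=& \frac{x}{p}-\frac{1-x}{1-p}\\
\frac{\partial V}{\partial p}&=& -\frac{x}{p^2}-\frac{1-x}{(1-p)^2}\\
-E\left(\frac{\partial V}{\partial p} \right)&=& \frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}=\frac{1}{pq}
\end{eqnarray*}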



For a sample of size $n$, the information on $p$ is $I_{{\bf X}}(p)=n/(pq)$.
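
This example can also be used to illustrate property (v). As a sketch, take the sufficient statistic $T=\sum X_i$, which has a binomial distribution with parameters $n$ and $p$; the symbols $f_T$ and $V_T$ below are introduced only for this illustration. Then

\begin{eqnarray*}
\log f_T(t;p)&=& \log {n \choose t} + t \log p + (n-t) \log (1-p)\\
V_T=\frac{\partial}{\partial p}\log f_T(t;p)&=& \frac{t}{p}-\frac{n-t}{1-p}\\
\frac{\partial V_T}{\partial p}&=& -\frac{t}{p^2}-\frac{n-t}{(1-p)^2}\\
I_T(p)=-E\left(\frac{\partial V_T}{\partial p} \right)&=& \frac{np}{p^2}+\frac{n(1-p)}{(1-p)^2}=\frac{n}{p(1-p)}
\end{eqnarray*}

which is the same as the information in the full sample, as (7.10) states.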

Outline of Proof of 7.7

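\begin{eqnarray*}
V&=&\frac{\partial}{\partial \theta}\log f(X;\theta)=\frac{f'}{f}\\
\frac{\partial V}{\partial \theta}&=&\frac{f''f-(f')^2}{f^2}=\frac{f''}{f}-V^2\\
E\left(\frac{\partial V}{\partial
\theta}\right)&=&E\left(\frac{f''}{f}\right)-E(V^2)
\end{eqnarray*}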



Now

\begin{displaymath}f''=\frac{\partial^2}{\partial \theta^2}f(X;\theta)\end{displaymath}

So

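\begin{eqnarray*}
E\left(\frac{f''}{f}\right)&=&\int_{-\infty}^{\infty}
\frac{\frac{\partial^2}{\partial \theta^2}f(x;\theta)}{f(x;\theta)}f(x;\theta)dx\\
&=&\int_{-\infty}^{\infty}\frac{\partial^2}{\partial \theta^2}f(x;\theta)dx\\
&=&\frac{\partial^2}{\partial \theta^2}
\underbrace{ \int_{-\infty}^{\infty}f(x;\theta)dx}_{=1}\\
&=&0
\end{eqnarray*}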



Substituting this into the expression for $E(\partial V/\partial \theta)$ gives

\begin{displaymath}E(V^2) \ = \ -E\left(\frac{\partial V}{\partial \theta}\right).\end{displaymath}


Comments

  1. The proof is somewhat `simplistic' in the sense that just $X$ is used rather than ${\bf X}$. The latter would require multiple integrals rather than just a single integral.
  2. The proof that $E(V)=0$ is similar.
  3. Note the line where the order of integration (with respect to $x$) and differentiation (with respect to $\theta$) is interchanged. This can only be done when regularity conditions apply. For instance, the limits on the integrals must not involve $\theta$.
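
To make comment 2 concrete, the argument for $E(V)=0$ can be sketched in the same simplified single-observation setting, again assuming the regularity conditions that allow the interchange of integration and differentiation:

\begin{eqnarray*}
E(V)&=&\int_{-\infty}^{\infty}\frac{f'(x;\theta)}{f(x;\theta)}f(x;\theta)dx\\
&=&\int_{-\infty}^{\infty}\frac{\partial}{\partial \theta}f(x;\theta)dx\\
&=&\frac{\partial}{\partial \theta}
\underbrace{ \int_{-\infty}^{\infty}f(x;\theta)dx}_{=1}\\
&=&0
\end{eqnarray*}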

Bob Murison 2000-10-31