Connection between the Fisher metric and relative entropy


Can someone prove the following connection between the Fisher information metric and relative entropy (or KL divergence) in a purely mathematically rigorous way?

$$D\bigl(p(\cdot, a+da)\,\|\,p(\cdot, a)\bigr) = \frac{1}{2}\, g_{i,j}\, da^i\, da^j + O(\|da\|^3),$$
where $a = (a^1, \dots, a^n)$, $da = (da^1, \dots, da^n)$,
$$g_{i,j} = \int \partial_i \bigl(\log p(x;a)\bigr)\, \partial_j \bigl(\log p(x;a)\bigr)\, p(x;a)\, dx,$$
and $g_{i,j}\, da^i\, da^j := \sum_{i,j} g_{i,j}\, da^i\, da^j$ is the Einstein summation convention.
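
As a quick numerical sanity check (not a proof), the relation can be tested on a family whose KL divergence is available in closed form. The short Python sketch below uses the Gaussian location-scale family $p(x;a)=\mathcal{N}(\mu,\sigma^2)$ with $a=(\mu,\sigma)$, whose Fisher matrix is $\operatorname{diag}(1/\sigma^2,\,2/\sigma^2)$; this particular family is just a convenient choice for illustration.

```python
import numpy as np

def kl_gauss(mu1, s1, mu0, s0):
    """KL( N(mu1, s1^2) || N(mu0, s0^2) ), closed form."""
    return np.log(s0 / s1) + (s1**2 + (mu1 - mu0)**2) / (2 * s0**2) - 0.5

mu, sigma = 0.3, 1.2                          # base point a = (mu, sigma)
g = np.diag([1 / sigma**2, 2 / sigma**2])     # Fisher information matrix at a

for eps in (1e-1, 1e-2, 1e-3):
    da = eps * np.array([0.7, -0.4])          # small displacement da
    kl = kl_gauss(mu + da[0], sigma + da[1], mu, sigma)   # D(p(., a+da) || p(., a))
    quad = 0.5 * da @ g @ da                  # (1/2) g_ij da^i da^j
    print(f"eps={eps:g}  KL={kl:.3e}  quad={quad:.3e}  ratio={kl/quad:.4f}")
# The ratio tends to 1 as eps -> 0; the discrepancy shrinks like O(eps^3)
# against the O(eps^2) leading term.
```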

I found the above in the nice blog of John Baez, where Vasileios Anagnostopoulos mentions it in the comments.

Kumara
source
Dear Kumara: For clarity, it would help to better explain your notation, specifically the meaning of $g_{i,j}$. Also, I think your expression is missing a constant factor of 1/2 in front of the first term of the right-hand side of the display equation. Note that what Kullback himself called divergence (using the notation $J(\cdot,\cdot)$) is the symmetrized version of what is now called the KL divergence, i.e., $J(p,q) = D(p\,\|\,q) + D(q\,\|\,p)$. The KL divergence was denoted $I(\cdot,\cdot)$ in Kullback's writings. This explains the factor of 1/2 as well. Cheers.
cardinal

Answers:


In 1946, geophysicist and Bayesian statistician Harold Jeffreys introduced what we today call the Kullback-Leibler divergence, and discovered that for two distributions that are "infinitely close" (let's hope that the Math SE guys don't see this ;-) we can write their Kullback-Leibler divergence as a quadratic form whose coefficients are given by the elements of the Fisher information matrix. He interpreted this quadratic form as the element of length of a Riemannian manifold, with the Fisher information playing the role of the Riemannian metric. From this geometrization of the statistical model, he derived the Jeffreys prior as the measure naturally induced by the Riemannian metric; this measure can be interpreted as an intrinsically uniform distribution on the manifold, although, in general, it is not a finite measure.
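
As a side note on that last point, the measure induced by the Fisher metric is the Riemannian volume element $\sqrt{\det g(\theta)}$. Here is a minimal sympy sketch for the standard Bernoulli$(\theta)$ example (chosen only for illustration), which recovers the familiar Beta(1/2, 1/2) kernel:

```python
import sympy as sp

theta = sp.Symbol("theta", positive=True)

def p(x):
    # Bernoulli pmf: p(1; theta) = theta, p(0; theta) = 1 - theta
    return theta**x * (1 - theta)**(1 - x)

# Fisher information: g(theta) = sum_x (d/dtheta log p(x; theta))^2 * p(x; theta)
g = sp.simplify(sum(sp.diff(sp.log(p(x)), theta) ** 2 * p(x) for x in (0, 1)))

print(g)           # equal to 1/(theta*(1 - theta))
print(sp.sqrt(g))  # Jeffreys prior ~ 1/sqrt(theta*(1 - theta)): the Beta(1/2, 1/2) kernel
```

In this particular example the induced measure happens to be finite (after normalization it is the Beta(1/2, 1/2) distribution); for a pure location parameter, say, it is a flat improper prior.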

To write a rigorous proof, you'll need to spell out all the regularity conditions and take care of the order of the error terms in the Taylor expansions. Here is a brief sketch of the argument.

The symmetrized Kullback-Leibler divergence between two densities f and g is defined as

$$D[f,g] = \int \bigl(f(x) - g(x)\bigr)\, \log\left(\frac{f(x)}{g(x)}\right) dx.$$

If we have a family of densities parameterized by $\theta = (\theta_1, \dots, \theta_k)$, then

$$D[p(\cdot\mid\theta),\,p(\cdot\mid\theta+\Delta\theta)] = \int \bigl(p(x\mid\theta) - p(x\mid\theta+\Delta\theta)\bigr)\, \log\left(\frac{p(x\mid\theta)}{p(x\mid\theta+\Delta\theta)}\right) dx,$$
in which $\Delta\theta = (\Delta\theta_1, \dots, \Delta\theta_k)$. Introducing the notation
$$\Delta p(x\mid\theta) = p(x\mid\theta) - p(x\mid\theta+\Delta\theta),$$
some simple algebra gives
$$D[p(\cdot\mid\theta),\,p(\cdot\mid\theta+\Delta\theta)] = \int \frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\, \log\left(1 + \frac{\Delta p(x\mid\theta)}{p(x\mid\theta+\Delta\theta)}\right) p(x\mid\theta)\, dx.$$
Using the Taylor expansion of the natural logarithm and keeping only the terms of leading order in $\Delta\theta$, we have
$$\log\left(1 + \frac{\Delta p(x\mid\theta)}{p(x\mid\theta+\Delta\theta)}\right) \approx \frac{\Delta p(x\mid\theta)}{p(x\mid\theta)},$$
and therefore
$$D[p(\cdot\mid\theta),\,p(\cdot\mid\theta+\Delta\theta)] \approx \int \left(\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\right)^{\!2} p(x\mid\theta)\, dx.$$
But
$$\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)} \approx -\frac{1}{p(x\mid\theta)} \sum_{i=1}^k \frac{\partial p(x\mid\theta)}{\partial\theta_i}\, \Delta\theta_i = -\sum_{i=1}^k \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\, \Delta\theta_i$$
(the overall sign does not matter, since this term only enters squared).
Hence
$$D[p(\cdot\mid\theta),\,p(\cdot\mid\theta+\Delta\theta)] \approx \sum_{i,j=1}^k g_{ij}\, \Delta\theta_i\, \Delta\theta_j,$$
in which
$$g_{ij} = \int \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\, \frac{\partial \log p(x\mid\theta)}{\partial\theta_j}\, p(x\mid\theta)\, dx.$$
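
As a quick numerical illustration of this symmetrized version (a sketch only, using the Gaussian family $\theta = (\mu, \sigma)$ with Fisher matrix $\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)$, which is not part of the argument above): the symmetrized divergence should match $\sum_{i,j} g_{ij}\, \Delta\theta_i\, \Delta\theta_j$ without a factor of 1/2, in line with cardinal's comment under the question.

```python
import numpy as np

def kl_gauss(mu1, s1, mu0, s0):
    """KL( N(mu1, s1^2) || N(mu0, s0^2) ), closed form."""
    return np.log(s0 / s1) + (s1**2 + (mu1 - mu0)**2) / (2 * s0**2) - 0.5

mu, sigma = 0.0, 1.0
g = np.diag([1 / sigma**2, 2 / sigma**2])       # Fisher matrix at theta = (mu, sigma)

for eps in (1e-1, 1e-2, 1e-3):
    dth = eps * np.array([0.5, 0.3])
    J = (kl_gauss(mu + dth[0], sigma + dth[1], mu, sigma)
         + kl_gauss(mu, sigma, mu + dth[0], sigma + dth[1]))  # symmetrized divergence
    quad = dth @ g @ dth                        # g_ij dtheta_i dtheta_j, no 1/2
    print(f"eps={eps:g}  J={J:.3e}  quad={quad:.3e}  ratio={J/quad:.4f}")
```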

This is the original paper:

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Royal Soc. of London, Series A, 186, 453–461.

Zen
source
Thank you very much for the nice writing. It would be nice if you could help with this as well.
Kumara
Yes, you're right. I must come out of this "abstraction trap".
Kumara
@zen You are using the Taylor expansion of the logarithm under the integral sign; why is that valid?
Sus20200
It seems crucial that you start with the symmetrized KL divergence, as opposed to the standard KL divergence. The Wikipedia article makes no mention of the symmetrized version, and so it might possibly be incorrect. en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Surgical Commander

Proof for usual (non-symmetric) KL divergence

Zen's answer uses the symmetrized KL divergence, but the result holds for the usual form as well, since it becomes symmetric for infinitesimally close distributions.

Here's a proof for discrete distributions parameterized by a scalar θ (because I'm lazy), but it can easily be rewritten for continuous distributions or a vector of parameters:

$$D(p_\theta,\, p_{\theta+d\theta}) = \sum p_\theta \log p_\theta \;-\; \sum p_\theta \log p_{\theta+d\theta}\,.$$
Taylor-expanding the last term:
$$= \underbrace{\sum p_\theta \log p_\theta - \sum p_\theta \log p_\theta}_{=\,0} \;-\; d\theta \underbrace{\sum p_\theta \frac{d}{d\theta}\log p_\theta}_{=\,0} \;-\; \frac{1}{2}\, d\theta^2 \underbrace{\sum p_\theta \frac{d^2}{d\theta^2}\log p_\theta}_{=\,-\sum p_\theta \left(\frac{d}{d\theta}\log p_\theta\right)^{\!2}} +\; O(d\theta^3)$$
$$= \frac{1}{2}\, d\theta^2 \underbrace{\sum p_\theta \left(\frac{d}{d\theta}\log p_\theta\right)^{\!2}}_{\text{Fisher information}} +\; O(d\theta^3).$$
Assuming some regularity conditions, I have used two results:
First:
$$\sum p_\theta\, \frac{d}{d\theta}\log p_\theta = \sum \frac{d}{d\theta} p_\theta = \frac{d}{d\theta} \sum p_\theta = 0,$$

Second:
$$\sum p_\theta\, \frac{d^2}{d\theta^2}\log p_\theta = \sum p_\theta\, \frac{d}{d\theta}\!\left(\frac{1}{p_\theta}\frac{d p_\theta}{d\theta}\right) = \sum p_\theta\!\left[\frac{1}{p_\theta}\frac{d^2 p_\theta}{d\theta^2} - \left(\frac{1}{p_\theta}\frac{d p_\theta}{d\theta}\right)^{\!2}\right] = \sum \frac{d^2 p_\theta}{d\theta^2} - \sum p_\theta\!\left(\frac{1}{p_\theta}\frac{d p_\theta}{d\theta}\right)^{\!2} = \underbrace{\frac{d^2}{d\theta^2}\sum p_\theta}_{=\,0} - \sum p_\theta\!\left(\frac{d}{d\theta}\log p_\theta\right)^{\!2}.$$
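
As a small numerical check of this scalar, discrete case (a sketch only; the Bernoulli$(\theta)$ family below is just a convenient example, with Fisher information $1/(\theta(1-\theta))$): the divergence $D(p_\theta,\, p_{\theta+d\theta})$ should approach $\tfrac{1}{2}\, d\theta^2$ times the Fisher information as $d\theta \to 0$.

```python
import numpy as np

def kl_bern(t1, t2):
    """KL( Bernoulli(t1) || Bernoulli(t2) ) via the defining sum."""
    p = np.array([1 - t1, t1])
    q = np.array([1 - t2, t2])
    return float(np.sum(p * np.log(p / q)))

theta = 0.3
fisher = 1 / (theta * (1 - theta))              # Fisher information of Bernoulli(theta)

for dtheta in (1e-1, 1e-2, 1e-3):
    kl = kl_bern(theta, theta + dtheta)
    approx = 0.5 * dtheta**2 * fisher           # (1/2) dtheta^2 * F(theta)
    print(f"dtheta={dtheta:g}  KL={kl:.3e}  approx={approx:.3e}  ratio={kl/approx:.4f}")
```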
Abhranil Das
source

You can find a similar relationship (for a one-dimensional parameter) in equation (3) of the following paper

D. Guo (2009), Relative Entropy and Score Function: New Information–Estimation Relationships through Arbitrary Additive Perturbation, in Proc. IEEE International Symposium on Information Theory, 814–818. (stable link).

The author refers to

S. Kullback, Information Theory and Statistics. New York: Dover, 1968.

for a proof of this result.

Primo Carnera
source
A multivariate version of equation (3) of that paper is proven in the cited Kullback text on pages 27-28. The constant 1/2 seems to have gone missing in the OP's question. :)
cardinal