Is the observed information matrix a consistent estimator of the expected information matrix?

16

I am trying to prove that the observed information matrix, evaluated at the weakly consistent maximum likelihood estimator (MLE), is a weakly consistent estimator of the expected information matrix. This is a widely quoted result, but nobody gives a reference or a proof (I have exhausted, I think, the first 20 pages of Google results and my statistics textbooks)!

Using a weakly consistent sequence of MLEs, I can use the weak law of large numbers (WLLN) and the continuous mapping theorem to get the result I want. However, I believe the continuous mapping theorem cannot be used. Instead I think the uniform law of large numbers (ULLN) needs to be used. Does anybody know of a reference that has a proof of this? I have an attempt at the ULLN but omit it for now for brevity.

I apologise for the length of this question, but notation has to be introduced. The notation is as follows (my proof is at the end).

Assume we have an iid sample of random variables $\{Y_1,\ldots,Y_N\}$ with density $f(\tilde{Y}|\theta)$, where $\theta\in\Theta\subseteq\mathbb{R}^{k}$ (here $\tilde{Y}$ is just a generic random variable with the same density as any member of the sample). The vector $Y=(Y_1,\ldots,Y_N)^{T}$ is the vector of all sample vectors, where $Y_{i}\in\mathbb{R}^{n}$ for all $i=1,\ldots,N$. The true parameter value of the density is $\theta_{0}$, and $\hat{\theta}_{N}(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_{0}$. Subject to regularity conditions, the Fisher information matrix can be written as

$I(\theta)=-E_\theta \left[H_{\theta}(\log f(\tilde{Y}|\theta))\right]$

where ${H}_{\theta}$ is the Hessian matrix. The sample equivalent is

$I_N(\theta)=\sum_{i=1}^N I_{y_i}(\theta),$

where $I_{y_i}(\theta)=-E_\theta \left[H_{\theta}(\log f(Y_{i}|\theta))\right]$. The observed information matrix is

$J(\theta) = -H_\theta(\log f(y|\theta))$,

(some people require the matrix to be evaluated at $\hat{\theta}$, but some do not). The sample observed information matrix is

$J_N(\theta)=\sum_{i=1}^N J_{y_i}(\theta)$

where $J_{y_i}(\theta)=-H_\theta(\log f(y_{i}|\theta))$.

I can prove convergence in probability of the estimator $N^{-1}J_N(\theta)$ to $I(\theta)$, but not of $N^{-1}J_{N}(\hat{\theta}_N(Y))$ to $I(\theta_{0})$. Here is my proof so far.

Now $(J_{N}(\theta))_{rs}=-\sum_{i=1}^N (H_\theta(\log f(Y_i|\theta)))_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r,s=1,\ldots,k$. If the sample is iid, then by the weak law of large numbers (WLLN) the average of these summands converges in probability to $-E_{\theta}[(H_\theta(\log f(Y_{1}|\theta)))_{rs}]=(I_{Y_1}(\theta))_{rs}=(I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs}\overset{P}{\rightarrow}(I(\theta))_{rs}$ for all $r,s=1,\ldots,k$, and so $N^{-1}J_N(\theta)\overset{P}{\rightarrow}I(\theta)$. Unfortunately we cannot simply conclude $N^{-1}J_{N}(\hat{\theta}_N(Y))\overset{P}{\rightarrow}I(\theta_0)$ by using the continuous mapping theorem, since $N^{-1}J_{N}(\cdot)$ is not the same function as $I(\cdot)$.
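To make the target statement concrete, here is a minimal numerical sketch. It is an illustration only, not a proof, and every name in it (`normal_mle`, `avg_observed_information`, the choice $\theta_0=(1,4)$, the seed) is a hypothetical choice rather than part of the problem. It assumes the $N(\mu, v)$ model with $\theta=(\mu,v)$, where the MLE and the Hessian are available in closed form, and compares $N^{-1}J_N(\hat{\theta}_N)$ with $I(\theta_0)$ for growing $N$.

```python
import math
import random

def normal_mle(ys):
    """Closed-form MLE (mu_hat, v_hat) for an i.i.d. N(mu, v) sample."""
    n = len(ys)
    mu_hat = sum(ys) / n
    v_hat = sum((y - mu_hat) ** 2 for y in ys) / n
    return mu_hat, v_hat

def avg_observed_information(ys, mu, v):
    """N^{-1} J_N(theta): minus the averaged Hessian of log f at theta = (mu, v)."""
    n = len(ys)
    s1 = sum(y - mu for y in ys) / n         # mean residual
    s2 = sum((y - mu) ** 2 for y in ys) / n  # mean squared residual
    return [[1.0 / v, s1 / v ** 2],
            [s1 / v ** 2, s2 / v ** 3 - 0.5 / v ** 2]]

def expected_information(mu, v):
    """Fisher information I(theta) of a single N(mu, v) observation."""
    return [[1.0 / v, 0.0], [0.0, 0.5 / v ** 2]]

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two 2x2 matrices."""
    return max(abs(a[r][s] - b[r][s]) for r in range(2) for s in range(2))

rng = random.Random(0)  # fixed seed, for reproducibility only
mu0, v0 = 1.0, 4.0      # hypothetical true parameter theta_0
errors = {}
for n in (100, 10_000, 200_000):
    ys = [rng.gauss(mu0, math.sqrt(v0)) for _ in range(n)]
    mu_hat, v_hat = normal_mle(ys)
    plug_in = avg_observed_information(ys, mu_hat, v_hat)
    errors[n] = max_abs_diff(plug_in, expected_information(mu0, v0))
    print(n, errors[n])
```

In this particular model $N^{-1}J_N(\hat{\theta}_N)$ coincides with $I(\hat{\theta}_N)$ up to rounding, so the convergence is inherited directly from the consistency of $\hat{\theta}_N$; that shortcut is special to the example and is exactly what is not available in general.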

Any help would be greatly appreciated.

maximum-likelihood expected-value asymptotics information fisher-information

dandar

Related: Rate of convergence of the empirical Fisher information matrix.

Does my answer below answer your question?

Dapz

1

@Dapz, please accept my sincere apologies for not replying to you until now - I made the mistake of assuming nobody would answer. Thank you for your answer below - I have upvoted it as I can see it is most useful, but I need to spend a little time considering it. Thank you for your time, and I will reply to your post below soon.

dandar

7

$\newcommand{\convp}{\stackrel{P}{\longrightarrow}}$

I guess directly establishing some sort of uniform law of large numbers is one possible approach.

Here is another.

We want to show that $\frac{J^N(\theta_{MLE})}{N} \convp I(\theta^*)$.

(As you said, we have by the WLLN that $\frac{J^N(\theta)}{N} \convp I(\theta)$. But this doesn't directly help us.)

One possible strategy is to show that

$|I(\theta^*) - \frac{J^N(\theta^*)}{N}| \convp 0$

and

$|\frac{J^N(\theta_{MLE})}{N} - \frac{J^N(\theta^*)}{N}| \convp 0.$

If both of these results hold, then we can combine them to get

$|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}| \convp 0,$

which is exactly what we want to show.
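Spelled out, the combination step is just the triangle inequality followed by a union bound (a short sketch, reading $|\cdot|$ either elementwise or as a matrix norm):

$\left|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}\right| \le \left|I(\theta^*) - \frac{J^N(\theta^*)}{N}\right| + \left|\frac{J^N(\theta^*)}{N} - \frac{J^N(\theta_{MLE})}{N}\right|,$

so that for any $\varepsilon > 0$,

$P\left(\left|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}\right| > \varepsilon\right) \le P\left(\left|I(\theta^*) - \frac{J^N(\theta^*)}{N}\right| > \frac{\varepsilon}{2}\right) + P\left(\left|\frac{J^N(\theta_{MLE})}{N} - \frac{J^N(\theta^*)}{N}\right| > \frac{\varepsilon}{2}\right) \longrightarrow 0.$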

The first equation follows from the weak law of large numbers.

The second almost follows from the continuous mapping theorem, but unfortunately our function $g()$ that we want to apply the CMT to changes with $N$ : our $g$ is really $g_N(\theta) := \frac{J^N(\theta)}{N}$ . So we cannot use the CMT.

(Comment: If you examine the proof of the CMT on Wikipedia, notice that the set $B_\delta$ they define in their proof now also depends on $N$ in our setting. We essentially need some sort of equicontinuity at $\theta^*$ over our functions $g_N(\theta)$.)

Fortunately, if you assume that the family $\mathcal{G} = \{g_N | N=1,2,\ldots\}$ is stochastically equicontinuous at $\theta^*$ , then it immediately follows that for $\theta_{MLE} \convp \theta^*$ ,

$|g_N(\theta_{MLE}) - g_N(\theta^*)| \convp 0.$

(See here: http://www.cs.berkeley.edu/~jordan/courses/210B-spring07/lectures/stat210b_lecture_12.pdf for a definition of stochastic equicontinuity at $\theta^*$ , and a proof of the above fact.)

Therefore, assuming that $\mathcal{G}$ is SE at $\theta^*$ , your desired result holds true and the empirical Fisher information converges to the population Fisher information.

Now, the key question of course is, what sort of conditions do you need to impose on $\mathcal{G}$ to get SE? It looks like one way to do this is to establish a Lipschitz condition on the entire class of functions $\mathcal{G}$ (see here: http://econ.duke.edu/uploads/media_items/uniform-convergence-and-stochastic-equicontinuity.original.pdf ).
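To make the Lipschitz route concrete, here is a sketch of the argument under an assumed (not verified) condition. Suppose there is a random bound $B_N = O_P(1)$ such that, for all $\theta$ in a neighbourhood of $\theta^*$,

$|g_N(\theta) - g_N(\theta^*)| \le B_N \, \|\theta - \theta^*\|.$

Then, on the event that $\theta_{MLE}$ lies in that neighbourhood (whose probability tends to 1 by consistency),

$|g_N(\theta_{MLE}) - g_N(\theta^*)| \le B_N \, \|\theta_{MLE} - \theta^*\| = O_P(1)\, o_P(1) = o_P(1).$

One hypothetical way to obtain such a $B_N$ is to bound the third derivatives of $\log f(y|\theta)$ near $\theta^*$ by an integrable function $M(y)$ and take $B_N$ proportional to $N^{-1}\sum_{i=1}^N M(Y_i)$, which is $O_P(1)$ by the WLLN.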

Dapz

1

The answer above using stochastic equicontinuity works very well, but here I am answering my own question by using a uniform law of large numbers to show that the observed information matrix is a strongly consistent estimator of the information matrix, i.e. $N^{-1}J_{N}(\hat{\theta}_{N}(Y))\overset{a.s.}{\longrightarrow}I(\theta_{0})$ if we plug in a strongly consistent sequence of estimators. I hope it is correct in all details.

We will use $I_{N}=\{1,2,...,N\}$ to be an index set, and let us temporarily adopt the notation $J(\tilde{Y},\theta):=J(\theta)$ in order to be explicit about the dependence of $J(\theta)$ on the random vector $\tilde{Y}$ . We shall also work elementwise with $(J(\tilde{Y},\theta))_{rs}$ and $(J_{N}(\theta))_{rs}=\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}$ , $r,s=1,...,k$ , for this discussion. The function $(J(\cdot,\theta))_{rs}$ is real-valued on the set $\mathbb{R}^{n}\times\Theta^{\circ}$ , and we will suppose that it is Lebesgue measurable for every $\theta\in\Theta^{\circ}$ . A uniform (strong) law of large numbers defines a set of conditions under which

$\underset{\theta\in\Theta}{\text{sup}}\left|N^{-1}(J_{N}(\theta))_{rs}-E_{\theta}\left[(J(Y_{1},\theta))_{rs}\right]\right|=\underset{\theta\in\Theta}{\text{sup}}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|\overset{a.s.}{\longrightarrow}0\qquad(1)$

The conditions that must be satisfied in order that (1) holds are (a) $\Theta^{\circ}$ is a compact set; (b) $(J(\tilde{Y},\theta))_{rs}$ is a continuous function on $\Theta^{\circ}$ with probability 1; (c) for each $\theta\in \Theta^{\circ}$, $(J(\tilde{Y},\theta))_{rs}$ is dominated by a function $h(\tilde{Y})$, i.e. $|(J(\tilde{Y},\theta))_{rs}|<h(\tilde{Y})$; and (d) for each $\theta\in \Theta^{\circ}$, $E_{\theta}[h(\tilde{Y})]<\infty$. These conditions come from Jennrich (1969, Theorem 2).
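A uniform statement of this kind can be checked numerically on a toy model. The sketch below is hypothetical and is not taken from Jennrich: it assumes the $N(\mu, v)$ model with $\theta=(\mu,v)$, approximates the supremum by a maximum over a finite grid on the compact box $[0,2]\times[2,8]$, and, as an explicit assumption about how the limit should be read, compares the sample average with $E_{\theta_0}[(J(Y,\theta))_{rs}]$, the expectation under the data-generating value $\theta_0$ (this coincides with $(I(\theta_0))_{rs}$ at $\theta=\theta_0$). All function names and parameter values are illustrative choices.

```python
import math
import random

def limit_entries(mu, v, mu0, v0):
    """E_{theta_0}[(J(Y, theta))_rs] for N(mu, v): the (mu,v) and (v,v) entries."""
    d = mu0 - mu
    return d / v ** 2, (v0 + d * d) / v ** 3 - 0.5 / v ** 2

def sample_entries(mu, v, ybar, y2bar):
    """N^{-1} sum_i (J(Y_i, theta))_rs, computed from the sample moments."""
    s1 = ybar - mu                               # mean of (y - mu)
    s2 = y2bar - 2.0 * mu * ybar + mu * mu       # mean of (y - mu)^2
    return s1 / v ** 2, s2 / v ** 3 - 0.5 / v ** 2

def sup_error_on_grid(ys, mu0, v0, grid):
    """Grid approximation to the sup over theta of the elementwise deviation."""
    n = len(ys)
    ybar = sum(ys) / n
    y2bar = sum(y * y for y in ys) / n
    worst = 0.0
    for mu, v in grid:
        a = sample_entries(mu, v, ybar, y2bar)
        b = limit_entries(mu, v, mu0, v0)
        # The (mu,mu) entry is 1/v on both sides, so it contributes nothing.
        worst = max(worst, abs(a[0] - b[0]), abs(a[1] - b[1]))
    return worst

rng = random.Random(1)  # fixed seed, for reproducibility only
mu0, v0 = 1.0, 4.0      # hypothetical true parameter theta_0
# Grid over the compact box mu in [0, 2], v in [2, 8].
grid = [(0.1 * i, 2.0 + 0.25 * j) for i in range(21) for j in range(25)]
sup_errors = {}
for n in (100, 10_000, 200_000):
    ys = [rng.gauss(mu0, math.sqrt(v0)) for _ in range(n)]
    sup_errors[n] = sup_error_on_grid(ys, mu0, v0, grid)
    print(n, sup_errors[n])
```

Keeping $v$ bounded away from zero is what makes the dominating function in (c) easy to find here; on a box that touches $v=0$ the entries blow up and the grid maximum would no longer be a sensible stand-in for the supremum.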

Now for any $y_{i}\in\mathbb{R}^{n}$ , $i\in I_{N}$ and $\theta'\in S\subseteq\Theta^{\circ}$ , the following inequality obviously holds

$\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(y_{i},\theta'))_{rs}-(I(\theta'))_{rs}\right|\leq\underset{\theta\in S}{\text{sup}}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|.\hspace{50pt}(2)$

Suppose that $\{\hat{\theta}_{N}(Y)\}$ is a strongly consistent sequence of estimators for $\theta_{0}$, and let $\Theta_{N_{1}}=B_{\delta_{N_{1}}}(\theta_{0})\subseteq K\subseteq \Theta^{\circ}$ be an open ball in $\mathbb{R}^{k}$ with radius $\delta_{N_{1}}\rightarrow 0$ as $N_{1}\rightarrow\infty$, and suppose $K$ is compact. Then, since strong consistency places $\hat{\theta}_{N}(Y)$ in $\Theta_{N_{1}}$ for all sufficiently large $N$ with probability 1, we have $P[\underset{N}{\text{lim}}\{\hat{\theta}_{N}(Y)\in\Theta_{N_{1}}\}]=1$. Together with (2) this implies

$P\left[\underset{N\rightarrow\infty}{\text{lim}}\left\{\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\hat{\theta}_{N}(Y)))_{rs}-(I(\hat{\theta}_{N}(Y)))_{rs}\right|\leq\underset{\theta\in\Theta_{N_{1}}}{\text{sup}}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|\right\}\right]=1.\qquad(3)$

Now $\Theta_{N_{1}}\subseteq\Theta^{\circ}$ implies conditions (a)-(d) of Jennrich (1969, Theorem 2) apply to $\Theta_{N_{1}}$ . Thus (1) and (3) imply

$P\left[\underset{N\rightarrow\infty}{\text{lim}}\left\{\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\hat{\theta}_{N}(Y)))_{rs}-(I(\hat{\theta}_{N}(Y)))_{rs}\right|=0\right\}\right]=1.\hspace{100pt}(4)$

Since $(I(\hat{\theta}_{N}(Y)))_{rs}\overset{a.s.}{\longrightarrow}(I(\theta_{0}))_{rs}$ (which holds when $I(\cdot)$ is continuous at $\theta_{0}$), (4) implies that $N^{-1}(J_{N}(\hat{\theta}_{N}(Y)))_{rs}\overset{a.s.}{\longrightarrow}(I(\theta_{0}))_{rs}$. Note that (3) holds however small $\Theta_{N_{1}}$ is, and so the result in (4) does not depend on the choice of $N_{1}$, other than that $N_{1}$ must be chosen such that $\Theta_{N_{1}}\subseteq \Theta^{\circ}$. This result holds for all $r,s=1,...,k$, and so in terms of matrices we have $N^{-1}J_{N}(\hat{\theta}_{N}(Y))\overset{a.s.}{\longrightarrow}I(\theta_{0})$.

dandar