Stan


I am reading the Stan documentation, which can be downloaded from here. I am particularly interested in the implementation of the Gelman-Rubin diagnostic. The original Gelman & Rubin (1992) paper defines the potential scale reduction factor (PSRF) as follows:

Let $X_{i,1}, \dots, X_{i,N}$ be the $i$th Markov chain sampled, and let there be $M$ independent chains sampled overall. Let $\bar{X}_i$ be the mean of the $i$th chain, and $\bar{X}$ be the overall mean. Define

$$W = \frac{1}{M} \sum_{m=1}^{M} s_m^2,$$

where

$$s_m^2 = \frac{1}{N-1} \sum_{t=1}^{N} (X_{mt} - \bar{X}_m)^2.$$

And define $B$ by

$$B = \frac{N}{M-1} \sum_{m=1}^{M} (\bar{X}_m - \bar{X})^2.$$

Define

$$\hat{V} = \left(\frac{N-1}{N}\right) W + \left(\frac{M+1}{MN}\right) B.$$

The PSRF is estimated by

$$\hat{R} = \frac{\hat{V}}{W} \cdot \frac{df+3}{df+1},$$

where $df = 2\hat{V}/\operatorname{Var}(\hat{V})$.
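To fix notation, here is a minimal NumPy sketch of these quantities for an $M \times N$ array of draws of a single scalar parameter (my own illustration, not code from Stan or from Gelman & Rubin; the function name `psrf_gr1992` is made up, and the $df$ correction factor is omitted):

```python
import numpy as np

def psrf_gr1992(chains):
    """PSRF following the W, B, V-hat definitions above (df factor omitted).

    chains: array of shape (M, N), i.e. M independent chains with
            N draws each of a single scalar parameter.
    """
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    chain_means = chains.mean(axis=1)            # X-bar_m
    grand_mean = chain_means.mean()              # X-bar
    W = chains.var(axis=1, ddof=1).mean()        # mean of the s_m^2
    B = N / (M - 1) * np.sum((chain_means - grand_mean) ** 2)
    V_hat = (N - 1) / N * W + (M + 1) / (M * N) * B
    return V_hat / W
```

For example, `psrf_gr1992(np.random.randn(4, 1000))` should return a value close to 1, since those draws are independent.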

The Stan documentation, on page 349, ignores the term with $df$ and also drops the $(M+1)/M$ multiplicative term. This is their formula:

The variance estimator is

$$\widehat{\mathrm{var}}^{+}(\theta \mid y) = \frac{N-1}{N} W + \frac{1}{N} B.$$

Finally, the potential scale reduction statistic is defined by

$$\hat{R} = \frac{\widehat{\mathrm{var}}^{+}(\theta \mid y)}{W}.$$
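For comparison, the variant quoted here differs from the 1992 version only in the coefficient on $B$; a sketch under the same assumptions as above (again with a made-up function name) could be:

```python
def psrf_bda(chains):
    """R-hat variant quoted above: var-hat+ = (N-1)/N * W + B/N,
    and R-hat = var-hat+ / W."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()
    B = N / (M - 1) * np.sum((chain_means - chain_means.mean()) ** 2)
    var_plus = (N - 1) / N * W + B / N
    return var_plus / W
```

The two sketches differ by exactly the factor the question is about: `psrf_gr1992` puts $(M+1)/(MN)$ in front of $B$ where `psrf_bda` puts $1/N$.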

From what I can see, they give no reference for this change of formula, nor do they discuss it. Usually $M$ is not too large, and can often be as low as 2, so $(M+1)/M$ should not be ignored, even if the $df$ term can be approximated by 1.

So where does this formula come from?


EDIT: I have found a partial answer to the question "where does this formula come from?": the book Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin (second edition) has exactly the same formula. However, the book does not explain how or why it is justified to ignore those terms.

Greenparker
There is no published paper on that yet, and the formula will probably change in the next few months.
Ben Goodrich
@BenGoodrich Thanks for the comment. Could you say more about the motivation for using this formula? And why will the formula change?
Greenparker
The current split R-hat formula is mostly a way to make it applicable to the case where there is only one chain. The upcoming changes mostly have to do with the fact that the underlying marginal posterior distribution may not be normal, or may not have a mean and/or variance.
Ben Goodrich
With $M=2$, $(M+1)/M = 3/2$.

Answer:


I followed the particular link given for Gelman & Rubin (1992) and it already has

$$\hat{\sigma}^2 = \frac{n-1}{n} W + \frac{1}{n} B,$$

as in the later versions, although $\hat{\sigma}^2$ is replaced with $\hat{\sigma}^2_+$ in Brooks & Gelman (1998) and with $\widehat{\mathrm{var}}^{+}$ in BDA2 (Gelman et al., 2003) and BDA3 (Gelman et al., 2013).

BDA2 and BDA3 (I couldn't check BDA1 just now) have an exercise with hints to show that $\widehat{\mathrm{var}}^{+}$ is an unbiased estimate of the desired quantity.

Brooks & Gelman (1998) has equation 1.1,

$$\hat{R} = \frac{m+1}{m}\,\frac{\hat{\sigma}^2_+}{W} - \frac{n-1}{mn},$$

which can be rearranged as

$$\hat{R} = \frac{\hat{\sigma}^2_+}{W} + \frac{\hat{\sigma}^2_+}{Wm} - \frac{n-1}{mn}.$$

We can see that the effect of the second and third terms is negligible for decision making when $n$ is large. See also the discussion in the paragraph before Section 3.1 in Brooks & Gelman (1998).
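To make the size of those terms concrete (a numerical illustration of my own, not from the papers), the last two terms combine as

$$\frac{\hat{\sigma}^2_+}{Wm} - \frac{n-1}{mn} = \frac{1}{m}\left(\frac{\hat{\sigma}^2_+}{W} - \frac{n-1}{n}\right),$$

which is small exactly in the region where the convergence decision is made, i.e. when $\hat{\sigma}^2_+/W$ is close to 1 and $n$ is large. For example, with $m=4$, $n=1000$ and $\hat{\sigma}^2_+/W = 1.05$, equation 1.1 gives $(5/4)(1.05) - 999/4000 \approx 1.063$, while the plain ratio is $1.05$; both lead to the same practical conclusion.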

Gelman & Rubin (1992) also had the term with $df$, as $df/(df-2)$. Brooks & Gelman (1998) have a section describing why this $df$ correction is incorrect, and define $(df+3)/(df+1)$ instead. The paragraph before Section 3.1 in Brooks & Gelman (1998) explains why $(df+3)/(df+1)$ can be dropped.
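As a quick check of the magnitude (my own numbers): for $df = 10$ the factor is $(df+3)/(df+1) = 13/11 \approx 1.18$, but for $df = 100$ it is $103/101 \approx 1.02$, and since $df = 2\hat{V}/\operatorname{Var}(\hat{V})$ grows with the simulation length, the factor is essentially 1 for long chains.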

It seems your source for the equations was something post Brooks & Gelman (1998), as you had $(df+3)/(df+1)$ there, whereas Gelman & Rubin (1992) had $df/(df-2)$. Otherwise Gelman & Rubin (1992) and Brooks & Gelman (1998) have equivalent equations (with slightly different notation and some terms arranged differently). BDA2 (Gelman et al., 2003) no longer has the terms $\frac{\hat{\sigma}^2_+}{Wm} - \frac{n-1}{mn}$. BDA3 (Gelman et al., 2013) and Stan introduced the split-chains version.

My interpretation of the papers, and my experience using different versions of $\hat{R}$, is that the terms that were eventually dropped can be ignored when $n$ is large, even when $m$ is not. I also vaguely remember discussing this with Andrew Gelman years ago, but if you want to be certain of the history, you should ask him.

"Usually $M$ is not too large, and can often be as low as 2"

I really do hope that this is not often the case. In cases where you want to use the split-$\hat{R}$ convergence diagnostic, you should use at least 4 chains split in half, and thus have $M=8$. You may use fewer chains if you already know that, in your specific case, convergence and mixing are fast.
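For reference, the splitting step itself is simple; a minimal sketch (reusing the hypothetical `psrf_bda` helper from above, with chains stored as an $M \times N$ array) could look like this:

```python
def split_rhat(chains):
    """Split R-hat sketch: cut each chain in half and treat the halves as
    separate chains before computing R-hat (via psrf_bda above)."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    half = N // 2
    # For odd N the middle draw is dropped so both halves have equal length.
    split = np.concatenate([chains[:, :half], chains[:, N - half:]], axis=0)
    return psrf_bda(split)
```

With 4 chains this gives the $M = 8$ half-chains mentioned above.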

Additional reference:

  • Brooks, S. P., and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434-455.
Aki Vehtari
Yes, it has the same $\hat{\sigma}^2$ as you mention, but their $\hat{R}$ statistic is $(\hat{\sigma}^2 + B/mn)/W$ multiplied by the $df$ term (look at the equation on top of page 495 in the Stat Science official version), which introduces the $(m+1)/m$ term I was talking about. In addition, look at the code and description in the R package coda, which has had the GR diagnostic since 1999.
Greenparker
I'm confused. The article via the link you provided and the article from the Stat Science web pages have only pages 457-472. I didn't check now, but years ago and last year when I checked coda, it didn't have the current recommended version.
Aki Vehtari
Note that I edited my answer. Brooks & Gelman (1998) has that $(m+1)/m$ term more clearly, and it seems you missed the last term, which mostly cancels the effect of the $(m+1)/m$ term for decision making. See the paragraph before Section 3.1.
Aki Vehtari
Sorry about that, that was a typo. It's page 465, and Gelman and Rubin have the same exact definition as Brooks and Gelman (which you state above). Equation 1.1 in Brooks and Gelman is exactly what I wrote down as well (when you rearrange some terms).
Greenparker
"We can see that the effect of second and third term are negligible for decision making when n is large", so what you are saying is that the expression in BDA and hence STAN comes from essentially ignoring these terms for large n?
Greenparker