Prediction interval for a binomial random variable

OK, let's give this a try. I'll give two answers: the Bayesian one, which in my opinion is simple and natural, and one of the possible frequentist ones.

Bayesian solution

We assume a Beta prior on $p$, i.e., $p \sim Beta(\alpha,\beta)$, because the Beta-Binomial model is conjugate, which means that the posterior distribution is also a Beta distribution, with parameters $\hat{\alpha}=\alpha+k,\hat{\beta}=\beta+n-k$ (I'm using $k$ to denote the number of successes in $n$ trials, instead of $y$). Thus, inference is greatly simplified. Now, if you have some prior knowledge about the likely values of $p$, you can use it to set the values of $\alpha$ and $\beta$, i.e., to define your Beta prior; otherwise you can assume a uniform (noninformative) prior, with $\alpha=\beta=1$, or another noninformative prior (see for example here). In any case, your posterior is

$Pr(p|n,k)=Beta(\alpha+k,\beta+n-k)$
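As a quick sanity check of the conjugate update, here is a minimal base-R sketch. The numbers (uniform prior, 70 successes in 100 trials) and the variable names are assumptions chosen only for illustration. It computes the posterior parameters and checks on a grid that prior times likelihood is proportional to the claimed Beta posterior.

```r
# Illustrative values (assumptions): uniform prior, k successes in n trials
alpha <- 1
beta  <- 1
n <- 100
k <- 70

# Conjugate update: posterior is Beta(alpha + k, beta + n - k)
alpha_post <- alpha + k
beta_post  <- beta + n - k

# Posterior mean of p and a 95% equal-tailed credible interval for p
post_mean <- alpha_post / (alpha_post + beta_post)
post_ci   <- qbeta(c(0.025, 0.975), alpha_post, beta_post)

# Prior x likelihood divided by the posterior density should be constant in p
p_grid <- seq(0.01, 0.99, by = 0.01)
unnorm <- dbeta(p_grid, alpha, beta) * dbinom(k, n, p_grid)
ratio  <- unnorm / dbeta(p_grid, alpha_post, beta_post)
```

With a uniform prior the constant in `ratio` is the marginal probability of observing $k$ successes, which is $1/(n+1)$ for every $k$.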

In Bayesian inference, all that matters is the posterior probability, meaning that once you know it, you can make inferences for all other quantities in your model. You want to make inference on the observables $y$: in particular, on a vector of new results $\mathbf{y}=y_1,\dots,y_m$, where $m$ is not necessarily equal to $n$. Specifically, for each $j=0,\dots,m$, we want to compute the probability of having exactly $j$ successes in the next $m$ trials, given that we got $k$ successes in the preceding $n$ trials; this is the posterior predictive mass function:

$Pr(j|m,n,k)=\int_0^1 Pr(j|p,m,n,k)Pr(p|n,k)dp$

However, our Binomial model for $Y$ means that, conditionally on $p$ having a certain value, the probability of having $j$ successes in $m$ trials doesn't depend on past results: it's simply

$f(j|m,p)=\binom{m}{j} p^j(1-p)^{m-j}$

Thus the expression becomes

$Pr(j|m,n,k)=\int_0^1 \binom{m}{j} p^j(1-p)^{m-j} Pr(p|n,k)dp=\int_0^1 \binom{m}{j} p^j(1-p)^{m-j} Beta(\alpha+k,\beta+n-k)dp$

The result of this integral is a well-known distribution called the Beta-Binomial distribution: skipping the intermediate steps, we get the horrible expression

$Pr(j|m,n,k)=\frac{m!}{j!(m-j)!}\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\frac{\Gamma(\alpha+k+j)\Gamma(\beta+n+m-k-j)}{\Gamma(\alpha+\beta+n+m)}$
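The expression above is easier to trust once it has been evaluated numerically. Below is a rough base-R sketch; `dbb_sketch` is a hypothetical helper name of mine (not a package function), and the parameter values are illustrative assumptions. It writes the pmf with log-Beta terms, which is the same Gamma-function ratio in a numerically safer form, and cross-checks one term against direct numerical integration over $p$.

```r
# Hypothetical helper: Beta-Binomial pmf for j successes in m trials,
# given Beta(a, b) as the current distribution of p.
# choose(m, j) * B(a + j, b + m - j) / B(a, b), computed on the log scale.
dbb_sketch <- function(j, m, a, b) {
  exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
}

# Illustrative values (assumptions)
n <- 100; k <- 70; m <- 20
alpha <- 1; beta <- 1
a_post <- alpha + k
b_post <- beta + n - k

# Full posterior predictive pmf over j = 0, ..., m
pmf <- dbb_sketch(0:m, m, a_post, b_post)

# Cross-check a single term by integrating Binomial pmf x Beta posterior over p
j <- 14
num <- integrate(function(p) dbinom(j, m, p) * dbeta(p, a_post, b_post),
                 lower = 0, upper = 1)$value
```

If the closed form is right, `sum(pmf)` should be 1 up to rounding and `num` should agree with `pmf[j + 1]` to within the tolerance of `integrate`.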

Our point estimate for $j$, given quadratic loss, is of course the mean of this distribution, i.e.,

$\mu=\frac{m(\alpha+k)}{(\alpha+\beta+n)}$
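The closed-form mean can be compared with the mean computed directly from the pmf. This is only a sketch under assumed values (uniform prior, $n=100$, $k=70$, $m=20$); `dbb_sketch` is again a hypothetical helper, not a package function.

```r
# Hypothetical helper: Beta-Binomial pmf on the log scale
dbb_sketch <- function(j, m, a, b) {
  exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
}

# Illustrative values (assumptions)
n <- 100; k <- 70; m <- 20
alpha <- 1; beta <- 1

# Mean from the closed-form expression
mu_closed <- m * (alpha + k) / (alpha + beta + n)

# Mean as sum over j of j * Pr(j | m, n, k)
jvals  <- 0:m
mu_sum <- sum(jvals * dbb_sketch(jvals, m, alpha + k, beta + n - k))
```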

Now, let's look for a prediction interval. Since this is a discrete distribution, we don't have a closed-form expression for $[j_1,j_2]$ such that $Pr(j_1\leq j \leq j_2)= 0.95$. The reason is that, depending on how you define a quantile, for a discrete distribution the quantile function is either not a function or is a discontinuous function. But this is not a big problem: for small $m$, you can just write down the $m$ probabilities $Pr(j=0|m,n,k),Pr(j\leq 1|m,n,k),\dots,Pr(j \leq m-1|m,n,k)$ and from these find $j_1,j_2$ such that

$Pr(j_1\leq j \leq j_2)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\geq 0.95$

Of course you will find more than one such pair, so ideally you would look for the smallest $[j_1,j_2]$ such that the above is satisfied. Note that

$Pr(j=0|m,n,k)=p_0,Pr(j\leq 1|m,n,k)=p_1,\dots,Pr(j \leq m-1|m,n,k)=p_{m-1}$

are just the values of the CMF (Cumulative Mass Function) of the Beta-Binomial distribution, and as such there is a closed form expression, but this is in terms of the generalized hypergeometric function and thus is quite complicated. I'd rather just install the R package extraDistr and call pbbinom to compute the CMF of the Beta-Binomial distribution. Specifically, if you want to compute all the probabilities $p_0,\dots,p_{m-1}$ in one go, just write:

library(extraDistr)  
jvec <- seq(0, m-1, by = 1) 
probs <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

where alpha and beta are the values of the parameters of your Beta prior, i.e., $\alpha$ and $\beta$ (thus 1 if you're using a uniform prior over $p$). Of course it would all be much simpler if R provided a quantile function for the Beta-Binomial distribution, but unfortunately it doesn't.
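For small $m$ it is not hard to roll a quantile function by hand. The following is a minimal base-R sketch, not a definitive implementation: `dbb_sketch` and `qbb_sketch` are hypothetical names, the quantile is defined as the smallest $j$ whose cumulative probability reaches the requested level (one of several possible conventions), and the parameter values in the call are assumptions for illustration.

```r
# Hypothetical helper: Beta-Binomial pmf on the log scale
dbb_sketch <- function(j, m, a, b) {
  exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
}

# Hypothetical quantile function: smallest j in 0..m with Pr(J <= j) >= prob
qbb_sketch <- function(prob, m, a, b) {
  cmf <- cumsum(dbb_sketch(0:m, m, a, b))
  cmf[m + 1] <- 1   # guard against the last value rounding just below 1
  sapply(prob, function(u) which(cmf >= u)[1] - 1)
}

# Illustrative call: Beta(71, 31) posterior, m = 20 future trials
q <- qbb_sketch(c(0.025, 0.5, 0.975), m = 20, a = 71, b = 31)
```

Because the distribution is discrete, the interval built from the 0.025 and 0.975 quantiles generally has coverage above 95% rather than exactly 95%.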

Practical example with the Bayesian solution

Let $n=100$, $k=70$ (thus we initially observed 70 successes in 100 trials). We want a point estimate and a 95%-prediction interval for the number of successes $j$ in the next $m=20$ trials. Then

n <- 100
k <- 70
m <- 20
alpha <- 1
beta  <- 1

where I assumed a uniform prior on $p$ ($\alpha=\beta=1$). The point estimate is then

bayesian_point_estimate <- m * (alpha + k)/(alpha + beta + n) #13.92157

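Rounding to the nearest integer gives 14 for $j$. For the prediction interval, we first compute the cumulative probabilities: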

jvec <- seq(0, m-1, by = 1)
library(extraDistr)
probabilities <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

The probabilities are

> probabilities
 [1] 1.335244e-09 3.925617e-08 5.686014e-07 5.398876e-06
 [5] 3.772061e-05 2.063557e-04 9.183707e-04 3.410423e-03
 [9] 1.075618e-02 2.917888e-02 6.872028e-02 1.415124e-01
[13] 2.563000e-01 4.105894e-01 5.857286e-01 7.511380e-01
[17] 8.781487e-01 9.546188e-01 9.886056e-01 9.985556e-01

For an equal-tailed interval, we want the smallest $j_2$ such that $Pr(j\leq j_2|m,n,k)\ge 0.975$ and the largest $j_1$ such that $Pr(j < j_1|m,n,k)=Pr(j \le j_1-1|m,n,k)\le 0.025$. This way, we will have

$Pr(j_1\leq j \leq j_2|m,n,k)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\ge 0.975-0.025=0.95$

Thus, by looking at the above probabilities, we see that $j_2=18$ and $j_1=9$ . The probability of this Bayesian prediction interval is 0.9778494, which is larger than 0.95. We could find shorter intervals such that $Pr(j_1\leq j \leq j_2|m,n,k)\ge 0.95$ , but in that case at least one of the two inequalities for the tail probabilities wouldn't be satisfied.
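Rather than reading the limits off the printed table, the same search can be scripted. This is a sketch in base R under the settings of this example (`dbb_sketch` is a hypothetical helper, not a package function). It recovers the equal-tailed limits and then brute-forces the shortest interval with coverage of at least 0.95, to show what is given up when the tail conditions are dropped.

```r
# Hypothetical helper: Beta-Binomial pmf on the log scale
dbb_sketch <- function(j, m, a, b) {
  exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
}

n <- 100; k <- 70; m <- 20; alpha <- 1; beta <- 1
pmf <- dbb_sketch(0:m, m, alpha + k, beta + n - k)
cmf <- cumsum(pmf)                      # cmf[i] = Pr(j <= i - 1)

# Equal-tailed limits
j2 <- min(which(cmf >= 0.975)) - 1      # smallest j2 with Pr(j <= j2) >= 0.975
j1 <- sum(cmf <= 0.025)                 # cmf is nondecreasing, so this count is the
                                        # largest j1 with Pr(j <= j1 - 1) <= 0.025
coverage <- sum(pmf[(j1:j2) + 1])

# Shortest interval [lo, hi] with coverage >= 0.95, ignoring the tail conditions
best <- NULL
for (lo in 0:m) {
  for (hi in lo:m) {
    cov <- sum(pmf[(lo:hi) + 1])
    if (cov >= 0.95 && (is.null(best) || hi - lo < best$hi - best$lo)) {
      best <- list(lo = lo, hi = hi, coverage = cov)
    }
  }
}
```

The shortest interval found this way drops one value from the lower end, at the price of a lower tail probability above 0.025.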

Frequentist solution

I'll follow the treatment of Krishnamoorthy and Peng, 2011. Let $Y\sim Binom(m,p)$ and $X\sim Binom(n,p)$ be independent binomial random variables. We want a $1-2\alpha$ prediction interval for $Y$, based on an observation of $X$. In other words, we look for $I=[L(X;n,m,\alpha),U(X;n,m,\alpha)]$ such that:

$Pr_{X,Y}(Y\in I)=Pr_{X,Y}(L(X;n,m,\alpha)\leq Y\leq U(X;n,m,\alpha))\geq 1-2\alpha$

The " $\geq 1-2\alpha$ " is due to the fact that we are dealing with a discrete random variable, and thus we cannot expect to get exact coverage, but we can look for an interval which always has at least the nominal coverage, i.e., a conservative interval. Now, it can be proved that the conditional distribution of $X$ given $X+Y=k+j=s$ is hypergeometric with sample size $s$ , number of successes in the population $n$ and population size $n+m$ . Thus the conditional pmf is

$Pr(X=k|X+Y=s,n,n+m)=\frac{\binom{n}{k}\binom{m}{s-k}}{\binom{m+n}{s}}$

The conditional CDF of $X$ given $X+Y=s$ is thus

$Pr(X\leq k|s,n,n+m)=H(k;s,n,n+m)=\sum_{i=0}^k\frac{\binom{n}{i}\binom{m}{s-i}}{\binom{m+n}{s}}$
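Two properties of this conditional CDF are easy to check numerically: it matches base R's `phyper`, and it comes out the same whatever $p$ generated the data. The sketch below uses illustrative values of my own choosing ($n=100$, $m=20$, $k=70$, $s=84$), and `cond_cdf` is a hypothetical helper name.

```r
# Illustrative values (assumptions): s = k + j with j = 14
n <- 100; m <- 20; k <- 70; s <- 84

# Explicit sum from the formula for H(k; s, n, n + m)
i <- 0:k
H_explicit <- sum(choose(n, i) * choose(m, s - i)) / choose(n + m, s)

# Same quantity from base R: phyper(q, successes, failures, draws)
H_phyper <- phyper(k, n, m, s)

# Conditional CDF built from the two Binomial pmfs for a given p
cond_cdf <- function(p) {
  x_all <- max(0, s - m):min(n, s)                          # values of X compatible with X + Y = s
  joint <- dbinom(x_all, n, p) * dbinom(s - x_all, m, p)    # Pr(X = x, Y = s - x)
  sum(joint[x_all <= k]) / sum(joint)
}

H_p30 <- cond_cdf(0.3)
H_p70 <- cond_cdf(0.7)
```

All four values should agree up to floating-point error, even though $p=0.3$ and $p=0.7$ give very different unconditional distributions for $X$.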

The first great thing about this CDF is that it doesn't depend on $p$ , which we don't know. The second great thing is that it lets us easily find our PI: as a matter of fact, if we observed a value $k$ of $X$, then the $1-\alpha$ lower prediction limit is the smallest integer $L$ such that

$Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$

Correspondingly, the $1-\alpha$ upper prediction limit is the largest integer $U$ such that

$Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$

Thus, $[L,U]$ is a prediction interval for $Y$ with coverage at least $1-2\alpha$ . Note that when $p$ is close to 0 or 1, this interval is conservative even for large $n$ , $m$ , i.e., its coverage is considerably larger than $1-2\alpha$ .

Practical example with the Frequentist solution

Same setting as before, but we don't need to specify $\alpha$ and $\beta$ (there are no priors in the Frequentist framework):

n <- 100
k <- 70
m <- 20

The point estimate is now obtained using the maximum likelihood estimate of the probability of success, $\hat{p}=\frac{k}{n}$ , which in turn leads to the following estimate for the number of successes in $m$ trials:

frequentist_point_estimate <- m * k/n #14

For the prediction interval, the procedure is a bit different. We look for the largest $U$ such that $Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$ , thus let's compute the above expression for all $U$ in $[0,m]$ :

jvec <- seq(0, m, by = 1)
probabilities <- phyper(k,n,m,k+jvec)

We can see that the largest $U$ such that the probability is still larger than 0.025 is

jvec[which.min(probabilities > 0.025) - 1] # 18

Same as for the Bayesian approach. The lower prediction bound $L$ is the smallest integer such that $Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$ , thus

probabilities <- 1-phyper(k-1,n,m,k+jvec)
jvec[which.max(probabilities > 0.025) - 1] # 8

Thus our frequentist "exact" prediction interval is $[L,U]=[8,18]$ .

DeltaIV