Why isn't the denominator of the covariance estimator n-2 rather than n-1?

36

The denominator of the (unbiased) variance estimator is $n-1$ because there are $n$ observations and only one parameter is being estimated.

$\mathbb{V}\left(X\right)=\frac{\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}}{n-1}$

By the same token, I wonder why the denominator of the covariance shouldn't be $n-2$ when two parameters are being estimated.

$\mathbb{Cov}\left(X, Y\right)=\frac{\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)\left(Y_{i}-\overline{Y}\right)}{n-1}$

self-study variance covariance descriptive-statistics unbiased-estimator MYaseen208

15

If you did that, you would have two conflicting definitions of the variance: one would be the first formula and the other would be the second formula applied with $Y=X$.

whuber

3

The bi/multivariate mean (expectation) is one parameter, not 2.

ttnphns

14

@ttnphns That's not correct: the bivariate mean obviously is two parameters, because it requires two real numbers to express it. (Indeed, it is a single vector parameter, but saying so merely disguises the fact that it has two components.) This shows up explicitly in the degrees of freedom for the pooled-variance t-test, for instance, where $2$ is subtracted, not $1$. What is interesting about this question is how it reveals just how vague, unrigorous, and potentially misleading is the common "explanation" that we subtract $1$ from $n$ because one parameter has been estimated.

whuber

@whuber, you're right on that point. If only $n$ (the number of independent observations) mattered, we would not spend more df in a multivariate test than in a univariate one.

ttnphns

3

@whuber: I would perhaps say that it shows that what counts as a "parameter" depends on the situation. In this case the variance is computed over the $n$ observations, and so each observation, or the overall mean, can be seen as one parameter, even if it is a multivariate mean, as ttnphns said. However, in other cases, when for example a test considers linear combinations of the dimensions, each dimension of each observation becomes a "parameter". You are right that this is a tricky issue.

amoeba says Reinstate Monica

31

Covariances are variances.

Since, by the polarization identity,

$\text{Cov}(X,Y) = \text{Var}\left(\frac{X+Y}{2}\right) - \text{Var}\left(\frac{X-Y}{2}\right),$

the denominators must be the same.
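
The identity can be checked numerically. The sketch below is only an illustration with made-up data; `sample_cov` is a hypothetical helper, and `statistics.variance` is the standard-library sample variance, which divides by $n-1$. If the variance uses $n-1$, the polarization identity reproduces the covariance with the same $n-1$ denominator.

```python
import math
import statistics

# Made-up illustrative data (an assumption, not from the thread).
x = [2.0, 4.0, 4.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 8.0, 6.0]


def sample_cov(a, b):
    """Sample covariance with denominator n - 1 (hypothetical helper)."""
    n = len(a)
    abar = sum(a) / n
    bbar = sum(b) / n
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / (n - 1)


# Polarization identity applied to the sample variance (denominator n - 1).
half_sum = [(xi + yi) / 2 for xi, yi in zip(x, y)]
half_diff = [(xi - yi) / 2 for xi, yi in zip(x, y)]
polarized = statistics.variance(half_sum) - statistics.variance(half_diff)

# The two printed values agree up to floating-point rounding.
print(polarized, sample_cov(x, y))
```

Any other denominator for the covariance would break this equality, because both variances on the right-hand side are already fixed to use $n-1$.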

whuber

20

A special case should give you an intuition; think about the following:

$\hat{\mathbb{Cov}}\left(X, X\right)= \hat{\mathbb{V}}\left(X\right)$

You are happy that the latter is $\frac{\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}}{n-1}$ due to Bessel's correction.

But replacing $Y$ by $X$ in $\hat{\mathbb{Cov}}\left(X, Y\right)$ for the former gives $\frac{\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)\left(X_{i}-\overline{X}\right)}{\text{mystery denominator}}$, so what do you now think might best fill in the blank?
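
To make the "mystery denominator" concrete, here is a small sketch with made-up numbers. The function `cov_with_denominator` is a hypothetical helper written for this illustration, and `statistics.variance` is the Bessel-corrected sample variance from the Python standard library.

```python
import math
import statistics


def cov_with_denominator(a, b, denominator):
    """Sum of centered cross-products divided by a chosen denominator."""
    n = len(a)
    abar = sum(a) / n
    bbar = sum(b) / n
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / denominator


# Made-up illustrative data (an assumption, not from the thread).
x = [3.0, 5.0, 8.0, 9.0, 12.0, 14.0]
n = len(x)

var_x = statistics.variance(x)                   # denominator n - 1
cov_xx_n1 = cov_with_denominator(x, x, n - 1)    # matches var_x
cov_xx_n2 = cov_with_denominator(x, x, n - 2)    # does not match var_x

print(var_x, cov_xx_n1, cov_xx_n2)
```

Only the $n-1$ choice makes $\hat{\mathbb{Cov}}(X, X)$ coincide with $\hat{\mathbb{V}}(X)$.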

Silverfish

1

OK. But the OP may ask: "Why consider cov(X, X) and cov(X, Y) to be in one line of logic? Why are you replacing Y by X in cov() so flippantly? Maybe cov(X, Y) is a different situation?" You didn't address that, while a (highly upvoted) answer should have, in my impression :-)

ttnphns

7

A quick and dirty answer... Let's first consider $\text{var}(X)$; if you had $n$ observations with known expected value $E(X) = 0$, you would use ${1\over n}\sum_{i=1}^n X_i^2$ to estimate the variance.

When the expected value is unknown, you can transform your $n$ observations into $n-1$ observations with known expected value by taking $A_i = X_i - X_1$ for $i = 2, \dots,n$. You will get a formula with $n-1$ in the denominator; however, the $A_i$ are not independent and you would have to take this into account. In the end you would find the usual formula.

Now for the covariance you can use the same idea: if the expected value of $(X,Y)$ were $(0,0)$, you would have ${1\over n}$ in the formula. By subtracting $(X_1,Y_1)$ from all the other observed values, you get $n-1$ observations with known expected value... and ${1\over n-1}$ in the formula. Once again, this introduces some dependence that has to be taken into account.

P.S. The clean way to do this is to choose an orthonormal basis of $\big\langle (1, \dots, 1)' \big\rangle^{\perp}$, that is, $n-1$ vectors $c_1, \dots, c_{n-1} \in \mathbb R^n$ such that

$\sum_j c_{ij}^2 = 1$ for all $i$,
$\sum_j c_{ij} = 0$ for all $i$,
$\sum_j c_{i_1j} c_{i_2j} = 0$ for all $i_1 \ne i_2$.

You can then define $n-1$ variables $A_i = \sum_j c_{ij} X_j$ and $B_i = \sum_j c_{ij} Y_j$. The $(A_i,B_i)$ are independent, have expected value $(0,0)$, and have the same variance/covariance as the original variables.

The whole point is that if you want to get rid of the unknown expectation, you drop one (and only one) observation. This works the same way in both cases.
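
One concrete choice of such a basis is the set of Helmert contrasts. This is an assumption made for illustration, since the answer does not name a particular basis. The sketch below builds these $n-1$ vectors, forms the $A_i$ and $B_i$, and compares ${1\over n-1}\sum_i A_i B_i$ with the usual sample covariance on made-up data. The names `helmert_rows` and `contrast` are hypothetical helpers.

```python
import math


def helmert_rows(n):
    """Return n - 1 orthonormal vectors in R^n, each orthogonal to (1, ..., 1)."""
    rows = []
    for i in range(1, n):
        scale = math.sqrt(i * (i + 1))
        # i entries equal to 1/scale, then -i/scale, then zeros.
        rows.append([1.0 / scale] * i + [-i / scale] + [0.0] * (n - i - 1))
    return rows


def contrast(rows, values):
    """Apply each basis vector c_i to the data: A_i = sum_j c_ij * values_j."""
    return [sum(c * v for c, v in zip(row, values)) for row in rows]


# Made-up illustrative data (an assumption, not from the thread).
x = [1.0, 4.0, 2.0, 8.0, 5.0]
y = [2.0, 3.0, 1.0, 9.0, 4.0]
n = len(x)

C = helmert_rows(n)
A = contrast(C, x)
B = contrast(C, y)

# n - 1 zero-mean pairs, so the "known mean" estimator divides by n - 1.
cov_from_contrasts = sum(a * b for a, b in zip(A, B)) / (n - 1)

xbar = sum(x) / n
ybar = sum(y) / n
usual_cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

print(cov_from_contrasts, usual_cov)
```

The two values agree because $\sum_i A_i B_i = \sum_j (X_j - \bar X)(Y_j - \bar Y)$ for any orthonormal basis of the orthogonal complement of $(1, \dots, 1)'$.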

Elvis

6

Here is a proof that the p-variate sample covariance estimator with factor $\frac{1}{n-1}$ (that is, denominator $n-1$) is an unbiased estimator of the covariance matrix:

$x' = (x_1,...,x_p)$ .

$\Sigma= E((x-\mu)(x-\mu)')$

$S = \frac{1}{n} \sum (x_i - \bar{x})(x_i - \bar{x})'$

To show: $E(S) = \frac{n-1}{n}\Sigma$

Proof: $S= \frac{1}{n}\sum x_ix_i' - \bar{x}\bar{x}'$

(1) $E(x_ix_i') = \Sigma + \mu\mu'$

(2) $E(\bar{x}\bar{x}') = \frac{1}{n} \Sigma+ \mu\mu'$

Therefore: $E(S) = \Sigma + \mu\mu' - (\frac{1}{n} \Sigma+ \mu\mu') = \frac{n-1}{n} \Sigma$

And so $S_u = \frac{n}{n-1}S$, with the final factor $\frac{1}{n-1}$, is unbiased. The off-diagonal elements of $S_u$ are your individual sample covariances.

Additional remarks:

The $n$ draws are independent. This is used in (2) to calculate the covariance of the sample mean.
Steps (1) and (2) use the fact that $Cov(x)= E[xx']-\mu\mu'$
Step (2) uses the fact that $Cov(\bar{x})= \frac{1}{n}\Sigma$
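
The result can also be verified exactly on a toy example. This is a sketch under stated assumptions: `POPULATION` is a made-up discrete distribution in which each $(x, y)$ pair is equally likely, and all function names are hypothetical. The code enumerates every possible i.i.d. sample of size $n$ and averages the estimator with exact fractions, so no simulation error is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy population: each (x, y) pair is drawn with equal probability.
POPULATION = [(0, 1), (1, 3), (2, 2), (4, 6)]


def population_cov(pop):
    """True covariance of the equally weighted discrete distribution."""
    m = len(pop)
    mx = Fraction(sum(x for x, _ in pop), m)
    my = Fraction(sum(y for _, y in pop), m)
    return sum((x - mx) * (y - my) for x, y in pop) / m


def cross_product_sum(sample):
    """Sum of (x_i - xbar)(y_i - ybar) over one sample."""
    n = len(sample)
    xbar = Fraction(sum(x for x, _ in sample), n)
    ybar = Fraction(sum(y for _, y in sample), n)
    return sum((x - xbar) * (y - ybar) for x, y in sample)


def expected_estimator(pop, n, denominator):
    """Exact expectation of sum/denominator over all i.i.d. samples of size n."""
    samples = list(product(pop, repeat=n))
    total = sum(cross_product_sum(s) / denominator for s in samples)
    return total / len(samples)


n = 3
sigma_xy = population_cov(POPULATION)
print(sigma_xy, expected_estimator(POPULATION, n, n - 1))  # equal: unbiased
print(expected_estimator(POPULATION, n, n))                # (n-1)/n times sigma_xy
print(expected_estimator(POPULATION, n, n - 2))            # (n-1)/(n-2) times sigma_xy
```

Dividing by $n$ underestimates the covariance on average, dividing by $n-2$ overestimates it, and only $n-1$ gives the population value.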

statchrist

The difficulty being in step 2 ! :)

Elvis

@Elvis It's messy. One needs to apply the rule Cov(X+Y,Z)=Cov(X,Z) + Cov(Y,Z) and recognize that the different draws are independent. Then it's basically summing up the covariance n times and scaling it down by 1/n²

statchrist

4

I guess one way to build intuition for using 'n-1' and not 'n-2' is that, for calculating the covariance, we do not need to de-mean both X and Y, but only either of the two (here $\mu_x$ and $\mu_y$ stand for the sample means), i.e.

$\sum (X-\mu_x)(Y - \mu_y) = \sum (X-\mu_x)Y \ \ \ \text{or} \ \ \ \sum (Y-\mu_y)X$
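
A minimal numerical check of this identity, with made-up data and the sample means in place of $\mu_x$ and $\mu_y$, could look like this:

```python
import math

# Made-up illustrative data (an assumption, not from the thread).
x = [1.5, 2.0, 4.5, 6.0, 7.0]
y = [0.5, 3.0, 2.5, 5.0, 9.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Center both variables, only X, or only Y.
both = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
only_x = sum((xi - xbar) * yi for xi, yi in zip(x, y))
only_y = sum(xi * (yi - ybar) for xi, yi in zip(x, y))

# All three sums agree up to floating-point rounding.
print(both, only_x, only_y)
```

The equality holds because the centered values $X_i - \bar X$ sum to zero, so the extra term $\bar Y \sum (X_i - \bar X)$ vanishes.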

Uditg_ucla

Could you elaborate on how this bears on the question of what denominator to use? The algebraic relation in evidence derives from the fact that the residuals relative to the mean sum to zero, but otherwise is silent about which denominator is relevant.

whuber

5

I came here because I had the same question as the OP. I think this answer gets at the nub of what @whuber pointed out above: that the rule of thumb df ~= n - (parameters estimated) can be "vague, unrigorous, and potentially misleading." It points out that although it looks like you need to estimate two parameters (xbar and ybar), you really only estimate one (xbar or ybar). Since the df should be the same in both cases, it must be the lower of the two. I think that is the intent here.

mpettis

1

1) Start with $df=2n$.

2) The sample covariance is proportional to $\Sigma_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$. Lose two $df$: one from $\bar{X}$, one from $\bar{Y}$, resulting in $df=2(n-1)$.

3) However, $\Sigma_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$ only contains $n$ separate terms, one from each product. When two numbers are multiplied together the independent information from each separate number disappears.

As a trite example, consider that

$24=1*24=2*12=3*8=4*6=6*4=8*3=12*2=24*1$ ,

and that does not include irrationals and fractions, e.g. $24=2\sqrt{6}*2\sqrt{6}$ , so that when we multiply two number series together and examine their product, all we see are the $df=n-1$ from one number series, as we have lost half of the original information, that is, what those two numbers were before the pair-wise grouping into one number (i.e., multiplication) was performed.

In other words, without loss of generality we can write

$(X_i-\bar{X})(Y_i-\bar{Y})=z_i-\bar{z}$ for some $z_i$ and $\bar{z}$,

i.e., $z_i=X_iY_i-\bar{X}Y_i-X_i\bar{Y}$, and $\bar{z}=-\bar{X}\bar{Y}$. From the $z$'s, which then clearly have $df=n-1$, the covariance formula becomes

$\Sigma_{i=1}^n\frac{z_i-\bar{z}}{n-1}=$

$\Sigma_{i=1}^n\frac{[(X_i-\bar{X})(Y_i-\bar{Y})]}{n-1}=$

$\frac{1}{n-1}\Sigma_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$ .

Thus, the answer to the question is that the $df$ are halved by grouping.

Carl

@whuber How on earth did I get the same thing posted twice and deleted once? What gives? Can we get rid of one of them? For future reference, is there any way to permanently delete such duplicates? I have a few hanging around and it's annoying.

Carl

As far as I can tell, you reposted your answer from the duplicate to here. (Nobody else has the power to post answers in your name.) The system strongly discourages posting identical answers in multiple threads, so when I saw that, it convinced me these two threads are perfect duplicates and I "merged" them. This is a procedure that moves all comments and answers from the source thread to the target thread. I then deleted your duplicate post here in the target thread. It will remain permanently deleted, but will be visible to you as well as to people of sufficiently high reputation.

whuber

@whuber I didn't know what happens in a merge, that a merge was taking place or what many of the rules are, despite looking things up constantly. It takes time to learn, be patient, BTW, would you consider taking stats.stackexchange.com/questions/251700/… off of Hold?

Carl
