I am a noob in statistics, so could you please help me out here.
My question is the following: what does pooled variance actually mean?
When I look up the formula for pooled variance on the internet, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathematical_Statistics/Introduction_to_Statistics_Tests/thispage/newnode19.html ):
$$S_p^2 = \frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$$
But what does it actually calculate? When I use this formula to calculate my pooled variance, it gives me the wrong answer.
For example, consider this "parent sample":
$$2, 2, 2, 2, 2, 8, 8, 8, 8, 8$$
The variance of this parent sample is $S_p^2 = 10$, and its mean is $\bar{x}_p = 5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 2,2,2,2,2 with mean $\bar{x}_1 = 2$ and variance $S_1^2 = 0$.
- The second sub-sample is 8,8,8,8,8 with mean $\bar{x}_2 = 8$ and variance $S_2^2 = 0$.
Now, clearly, using the formula above to calculate the pooled/parent variance of these two sub-samples will produce zero, because $S_1^2 = 0$ and $S_2^2 = 0$. So what does this formula actually calculate?
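On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:
$$S_p^2 = \frac{S_1^2 (n_1 - 1) + n_1 d_1^2 + S_2^2 (n_2 - 1) + n_2 d_2^2}{n_1 + n_2 - 1}$$
In the formula above, $d_1 = \bar{x}_1 - \bar{x}_p$ and $d_2 = \bar{x}_2 - \bar{x}_p$.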
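As a quick check, here is a minimal Python sketch (not part of the original post) that compares the textbook pooled formula with a combined formula that adds the between-group terms $n_i d_i^2$ and divides by $n_1 + n_2 - 1$. The function names `pooled_variance` and `combined_variance` are hypothetical, chosen only for this illustration.

```python
from statistics import mean, variance

def pooled_variance(a, b):
    """Textbook pooled variance: weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

def combined_variance(a, b):
    """Sample variance of the concatenated data, rebuilt from group summaries."""
    n1, n2 = len(a), len(b)
    grand_mean = (n1 * mean(a) + n2 * mean(b)) / (n1 + n2)
    d1 = mean(a) - grand_mean  # offset of group 1 mean from the parent mean
    d2 = mean(b) - grand_mean  # offset of group 2 mean from the parent mean
    within = (n1 - 1) * variance(a) + (n2 - 1) * variance(b)
    between = n1 * d1**2 + n2 * d2**2
    return (within + between) / (n1 + n2 - 1)

first = [2, 2, 2, 2, 2]
second = [8, 8, 8, 8, 8]

pooled = pooled_variance(first, second)
combined = combined_variance(first, second)
direct = variance(first + second)  # variance computed on the parent sample itself
print(pooled, combined, direct)
```

On this data the pooled value is zero, while the combined formula and the direct calculation on the parent sample both give 10, so the two formulas are clearly answering different questions.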
I found formulas similar to mine, for example here: http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html and also on Wikipedia, although I have to admit that they do not look exactly the same as mine.
So again, what does pooled variance actually mean? Shouldn't it mean the variance of the parent sample formed from the two sub-samples? Or am I completely wrong here?
Thank you in advance.
EDIT 1: Someone said that my two sub-samples above are pathological because they have zero variance. Fine, I can give you a different example. Consider this parent sample:
$$1, 2, 3, 4, 5, 46, 47, 48, 49, 50$$
The variance of this parent sample is $S_p^2 = 564.7$, and its mean is $\bar{x}_p = 25.5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 1,2,3,4,5 with mean $\bar{x}_1 = 3$ and variance $S_1^2 = 2.5$.
- The second sub-sample is 46,47,48,49,50 with mean $\bar{x}_2 = 48$ and variance $S_2^2 = 2.5$.
Now, if you use the "literature formula" to calculate the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. If you use "my formula" instead, you will get the correct answer.
Please understand, I use extreme examples here to show people that the formula is indeed wrong. If I used "normal data" without much variation between the groups, the results from the two formulas would be very similar, and people could dismiss the difference as rounding error rather than as a problem with the formula itself.
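For reference, the arithmetic behind the two numbers in this edit can be written out as follows, using $n_1 = n_2 = 5$, $\bar{x}_1 = 3$, $\bar{x}_2 = 48$, and $\bar{x}_p = 25.5$, so that the offsets of the group means from the parent mean are $-22.5$ and $22.5$:

$$\frac{4(2.5) + 4(2.5)}{5 + 5 - 2} = \frac{20}{8} = 2.5$$

$$\frac{4(2.5) + 5(-22.5)^2 + 4(2.5) + 5(22.5)^2}{5 + 5 - 1} = \frac{5082.5}{9} \approx 564.7$$

The only difference between the two lines is the pair of between-group terms and the denominator, and those terms account for nearly all of the parent variance here.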
Answers:
Put simply, the pooled variance is an (unbiased) estimate of the variance within each sample, under the assumption/constraint that those variances are equal.
This is explained, motivated, and analyzed in some detail in the Wikipedia entry for pooled variance.
It does not estimate the variance of a new "meta-sample" formed by concatenating the two individual samples, as you supposed. As you have already discovered, estimating that requires a completely different formula.
Pooled variance is used to combine variances from different samples by taking their weighted average, to get an "overall" variance. The problem with your example is that it is a pathological case, since each of the sub-samples has variance equal to zero. Such pathological cases have very little in common with the data we usually encounter, since there is always some variability, and if there is no variability we do not care about such variables because they carry no information. You need to notice that this is a very simple method, and that there are more complicated ways of estimating variance in hierarchical data structures that are not prone to such problems.
As for your example in the edit, it shows that it is important to state your assumptions clearly before starting the analysis. Let's say that you have $n$ data points in $k$ groups; we would denote them as $x_{1,1}, x_{2,1}, \dots, x_{n-1,k}, x_{n,k}$, where the $i$-th index in $x_{i,j}$ stands for cases and the $j$-th index stands for groups. There are several possible scenarios. You can assume that all the points come from the same distribution (for simplicity, let's assume a normal distribution),
$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2)$$
you can assume that each of the sub-samples has its own mean
$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2)$$
or its own variance
$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2_j)$$
or that each of them has its own, distinct parameters
$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2_j)$$
Depending on your assumptions, a particular method may or may not be adequate for analyzing the data.
In the first case, you wouldn't be interested in estimating the within-group variances, since you would assume that they are all the same. Nonetheless, if you aggregated the global variance from the group variances, you would get essentially the same estimate as by using pooled variance, since the definition of the sample variance is
$$\mathrm{Var}(X) = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$$
and in the pooled estimator you first multiply each group variance by its $n_j - 1$, then add the results together, and finally divide by $n_1 + n_2 - 2$.
In the second case, the means differ, but you have a common variance. This scenario is the closest to your example in the edit. Here the pooled variance would correctly estimate the common variance $\sigma^2$, while if you estimated the variance on the whole dataset you would obtain incorrect results, since you would not be accounting for the fact that the groups have different means.
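The second scenario can be illustrated with a small simulation. This is only a sketch under assumed values: two normal groups with means 0 and 10, a common standard deviation of 2 (so $\sigma^2 = 4$), 2000 points per group, and a fixed seed; none of these numbers come from the question.

```python
import random
from statistics import variance

rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 2.0             # assumed common standard deviation, so sigma**2 == 4
n = 2000                # assumed number of observations per group

group_a = [rng.gauss(0.0, sigma) for _ in range(n)]   # group with mean 0
group_b = [rng.gauss(10.0, sigma) for _ in range(n)]  # group with mean 10

# Pooled variance: weighted average of the within-group variances.
pooled = ((n - 1) * variance(group_a) + (n - 1) * variance(group_b)) / (2 * n - 2)

# Variance of the concatenated data, ignoring the group structure.
concatenated = variance(group_a + group_b)

print(f"pooled variance:       {pooled:.2f}")
print(f"concatenated variance: {concatenated:.2f}")
```

Under these assumptions the pooled value should land close to 4, while the concatenated value should be far larger (roughly $\sigma^2$ plus the squared distance of each group mean from the overall mean), because it also absorbs the gap between the two means.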
In the third case it doesn't make sense to estimate the "global" variance, since you assume that each of the groups has its own variance. You may still be interested in obtaining an estimate for the whole population, but in that case both (a) calculating the individual variances per group and (b) calculating the global variance from the whole dataset can give you misleading results. If you are dealing with this kind of data, you should think of using a more complicated model that accounts for the hierarchical nature of the data.
The fourth case is the most extreme and quite similar to the previous one. In this scenario, if you wanted to estimate the global mean and variance, you would need a different model and a different set of assumptions. You would assume that your data has a hierarchical structure and that, besides the within-group means and variances, there is a higher-level common variance, for example by assuming a model of the form $x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2_j)$,
where each sample has its own mean and variance $\mu_j, \sigma^2_j$ that are themselves draws from common distributions. In such a case, you would use a hierarchical model that takes into consideration both the lower-level and the upper-level variability. To read more about this kind of model, you can check the Bayesian Data Analysis book by Gelman et al. and their eight schools example. This is, however, a much more complicated model than the simple pooled variance estimator.
The problem is that if you just concatenate the samples and estimate the variance of the result, you are assuming they come from the same distribution and therefore have the same mean. But in general we are interested in several samples with different means. Does this make sense?
The use case for pooled variance is when you have two samples from distributions that may have different means but that you expect to have the same true variance.
An example of this is a situation where you measure the length of Alice's nose $n$ times for one sample, and measure the length of Bob's nose $m$ times for the second. Because of measurement error, each sample is likely to contain a bunch of measurements that differ on the scale of millimeters. But you expect the variance of the measurement error to be the same no matter which nose you measure.
In this case, taking the pooled variance would give you a better estimate of the variance in measurement error than taking the variance of one sample alone.
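The claim that pooling gives a better estimate can be checked with a rough simulation. The sketch below uses made-up values (true nose lengths of 50 mm and 55 mm, a measurement-error standard deviation of 1 mm, and 10 measurements per nose) and compares the mean squared error of the variance estimated from Alice's sample alone with that of the pooled variance.

```python
import random
from statistics import variance

rng = random.Random(1)  # fixed seed so the run is repeatable
sigma = 1.0             # assumed measurement-error standard deviation (mm)
n = m = 10              # assumed number of measurements per nose
trials = 2000           # number of simulated repetitions

sq_err_single = 0.0
sq_err_pooled = 0.0
for _ in range(trials):
    alice = [rng.gauss(50.0, sigma) for _ in range(n)]  # hypothetical true length 50 mm
    bob = [rng.gauss(55.0, sigma) for _ in range(m)]    # hypothetical true length 55 mm
    var_alice = variance(alice)
    var_bob = variance(bob)
    pooled = ((n - 1) * var_alice + (m - 1) * var_bob) / (n + m - 2)
    # Accumulate squared errors against the true error variance sigma**2.
    sq_err_single += (var_alice - sigma**2) ** 2
    sq_err_pooled += (pooled - sigma**2) ** 2

mse_single = sq_err_single / trials
mse_pooled = sq_err_pooled / trials
print(f"MSE using Alice's sample only: {mse_single:.3f}")
print(f"MSE using the pooled variance: {mse_pooled:.3f}")
```

With equal sample sizes the pooled estimate rests on twice as many degrees of freedom, so its mean squared error should come out noticeably smaller than that of the single-sample estimate.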
Through pooled variance we are not trying to estimate the variance of a bigger sample, using smaller samples. Hence, the two examples you gave don't exactly refer to the question.
Pooled variance is required to get a better estimate of population variance, from two samples that have been randomly taken from that population and come up with different variance estimates.
For example, suppose you are trying to gauge the variance in the smoking habits of males in London. You sample 300 males from London, twice. You end up with two variances (probably a bit different!). Now, since you did fair random sampling (to the best of your ability, as true random sampling is almost impossible), you have every right to say that both variances are valid point estimates of the population variance (London males in this case).
But how is that possible, having two different point estimates? This is why we go ahead and find a common point estimate, which is the pooled variance. It is nothing but a weighted average of the two point estimates, where the weights are the degrees of freedom associated with each sample.
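Written out, the weighted average with degrees-of-freedom weights is the usual pooled variance formula:

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = w_1 S_1^2 + w_2 S_2^2, \qquad w_i = \frac{n_i - 1}{n_1 + n_2 - 2}$$

With two samples of 300 each, as in the London example, $w_1 = w_2 = 299/598 = 1/2$, so the pooled variance is simply the plain average of the two sample variances.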
Hope this clarifies.