Johansson (2011), in "Hail the impossible: p-values, evidence, and likelihood" (here is also a link to the article in the journal), states that lower p-values are often regarded as stronger evidence against the null. Johansson implies that people will consider the evidence against the null stronger when their statistical test produces a lower p-value than when it produces a higher one. Johansson lists four reasons why the p-value cannot be used as evidence against the null:
- p is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null.
- p is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always relative in the sense of being evidence for or against a hypothesis relative to another hypothesis.
- p designates the probability of obtaining evidence (given the null), rather than the strength of evidence.
- p depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the evidential strength of observed data depends on things that did not happen and on subjective intentions.
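A minimal simulation (my sketch, not from the question; the choice of test and sample size is arbitrary) of the first point: when the null hypothesis is true, p-values from a standard test are uniformly distributed on [0, 1], so a particular value of p can never indicate evidence for the null.

```python
# Sketch illustrating the first point above: under a true null hypothesis,
# p-values are uniformly distributed on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 one-sample t-tests on samples genuinely drawn under H0 (mean = 0).
pvals = np.array([
    stats.ttest_1samp(rng.normal(loc=0.0, size=30), popmean=0.0).pvalue
    for _ in range(10_000)
])

# Each decile of [0, 1] should catch roughly 10% of the p-values.
props, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(props / len(pvals))  # every entry close to 0.10
```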
Unfortunately I cannot get an intuitive understanding from Johansson's article. To me a lower p-value indicates there is less chance that the null is true than a higher p-value does. Why are lower p-values not stronger evidence against the null?
Answers:
My personal appraisal of his arguments:
His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (though the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar. First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio; yet treating p as evidence against the null is Fisherian. Hence he conflates Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case p is a transformation of the likelihood ratio. As Cosma Shalizi puts it:
Here q(x) is the density under state "signal" and p(x) the density under state "noise". The measure for "sufficiently likely" would here be P(q(X)/p(X) > t_obs ∣ H0), which is p. Note that in correct Neyman-Pearson testing, t_obs is replaced by a fixed t_α such that P(q(X)/p(X) > t_α ∣ H0) = α.
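The second point can be sketched numerically (my illustration, not from the answer; the effect size mu under the alternative is assumed): for a one-sided z-test of H0: X ~ N(0, 1) against H1: X ~ N(mu, 1), the likelihood ratio q(x)/p(x) = exp(mu·x − mu²/2) is increasing in the observation, so the p-value is a monotone (decreasing) transformation of it.

```python
# Sketch: p as a monotone transformation of the likelihood ratio
# for a one-sided z-test (assumed alternative mean mu = 1).
import numpy as np
from scipy import stats

mu = 1.0                    # effect size under the alternative (assumption)
x = np.linspace(-3, 3, 61)  # candidate observed values x_obs

lr = np.exp(mu * x - mu**2 / 2)  # likelihood ratio q(x)/p(x)
p = stats.norm.sf(x)             # p-value = P(X > x_obs | H0)

# Larger likelihood ratio <=> smaller p-value: same ordering, reversed.
print(np.all(np.diff(lr) > 0), np.all(np.diff(p) < 0))  # True True
```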
Source
The reason that arguments like Johansson's are recycled so often seems related to the fact that P-values are an index of the evidence against the null but are not a measure of evidence. Evidence has more dimensions than a single number can measure, and there are always aspects of the relationship between P-values and evidence that people find elusive.
I have reviewed many of the arguments used by Johansson in a paper that shows the relationship between P-values and likelihood functions, and thus evidence: http://arxiv.org/abs/1311.0081 Unfortunately that paper has now been rejected three times, even though its arguments and the evidence for them have not been rebutted. (It seems to be unpalatable, rather than wrong, to referees who hold opinions like those of Johansson.)
Source
Adding to @Momo's nice answer:
Do not forget multiplicity. Given many independent p-values, and sparse non-trivial effect sizes, the smallest p-values come from the null, with probability tending to 1 as the number of hypotheses increases.
So if you tell me you have a small p-value, the first thing I want to know is how many hypotheses you have been testing.
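A quick simulation of this point (my sketch; the number of nulls, the effect size, and the test are assumed for illustration): with thousands of true nulls and only a handful of modest real effects, the single smallest p-value usually belongs to a null.

```python
# Sketch: many true nulls plus a few real effects; check how often the
# overall smallest p-value comes from a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_null, n_real, n_rep = 10_000, 5, 200
effect = 2.0  # real effect size, in units of the standard error (assumption)

null_wins = 0
for _ in range(n_rep):
    z = np.abs(np.concatenate([
        rng.normal(0.0, 1.0, n_null),     # true nulls
        rng.normal(effect, 1.0, n_real),  # real effects
    ]))
    pvals = 2 * stats.norm.sf(z)           # two-sided p-values
    null_wins += pvals.argmin() < n_null   # smallest p came from a null?

print(null_wins / n_rep)  # well above 0.5
```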
Source
Is Johansson talking about p-values from two different experiments? If so, comparing p-values may be like comparing apples to lamb chops. If experiment "A" involves a huge number of samples, even a small inconsequential difference may be statistically significant. If experiment "B" involves only a few samples, an important difference may be statistically insignificant. Even worse (that's why I said lamb chops and not oranges), the scales may be totally incomparable (psi in one and kWh in the other).
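The contrast between experiments "A" and "B" can be made concrete with a toy z-test (hypothetical numbers of my choosing, not from the answer):

```python
# Sketch: a tiny effect with a huge sample reaches significance while a
# large effect with a small sample does not, so p-values from different
# experiments are not comparable measures of effect size.
import math
from scipy import stats

def one_sample_z_p(effect, sigma, n):
    """Two-sided p-value for a one-sample z-test of H0: mean = 0."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * stats.norm.sf(abs(z))

# Experiment A: trivial difference, enormous sample -> significant.
p_a = one_sample_z_p(effect=0.01, sigma=1.0, n=200_000)
# Experiment B: sizeable difference, tiny sample -> not significant.
p_b = one_sample_z_p(effect=0.50, sigma=1.0, n=10)

print(f"A: p = {p_a:.6f}, B: p = {p_b:.4f}")  # p_a < 0.05 < p_b
```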
Source