In my area of research, a popular way of displaying data is to combine a bar chart with "handle-bars". For example, see the plot produced by the R script below.
The "handle-bars" alternate between standard errors and standard deviations depending on the author. Typically, the sample size for each "bar" is fairly small - around six.
These plots seem to be particularly popular in the biological sciences - see the first few papers of BMC Biology, vol 3 for examples.
So how would you present this data?
Why I dislike these plots
Personally I don't like these plots.
- When the sample size is small, why not just display the individual data points?
- Is it the sd or the se that is being displayed? No one agrees on which to use.
- Why use bars at all? The data don't (usually) start at 0, but at first glance the graph suggests they do.
- The graphs give no idea of the range or the sample size of the data.
R script
This is the R code I used to generate the plot. That way you can (if you want) use the same data.
#Generate the data
set.seed(1)
names = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs = c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n = 6; se = numeric(length(prevs))
for (i in 1:length(prevs)) {
  #Standard error of the mean for group i: sd/sqrt(n), simulated around prevs[i]
  se[i] = sd(rnorm(n, prevs[i], 15))/sqrt(n)
}
#Basic plot
par(fin=c(6,6), pin=c(6,6), mai=c(0.8,1.0,0.0,0.125), cex.axis=0.8)
barplot(prevs,space=c(0,0,0,3,0,0, 3,0,0), names.arg=NULL, horiz=FALSE,
axes=FALSE, ylab="Percent", col=c(2,3,4), width=5, ylim=range(0,50))
#Add in the error bars (one standard error above each bar; these are not CIs)
xx = c(2.5, 7.5, 12.5, 32.5, 37.5, 42.5, 62.5, 67.5, 72.5)
for (i in 1:length(prevs)) {
  lines(rep(xx[i], 2), c(prevs[i], prevs[i]+se[i]))
  lines(c(xx[i]+1/2, xx[i]-1/2), rep(prevs[i]+se[i], 2))
}
#Add the axis
axis(2, tick=TRUE, xaxp=c(0, 50, 5))
axis(1, at=xx+0.1, labels=names, font=1,
tck=0, tcl=0, las=1, padj=0, col=0, cex=0.1)
csgillespie
Answers:
Thanks for all your answers. For completeness I thought I should include what I usually do. I tend to use a combination of the suggestions given: dots, boxplots (when n is large), and se (or sd) ranges.
(Removed by moderator because the site hosting the image no longer appears to work correctly.)
From the dot plot, it is clear that the data are far more spread out than the "handle bar" plots suggest. In fact, there is a negative value in A3!
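Since the image is gone, here is a minimal sketch of that kind of plot in base R: the raw points for each group, with the mean and a one-se range drawn beside them. The raw data are an assumption (simulated with n = 6 per group around the question's prevs values, sd = 15), so the points will not match the original figure.

```r
# Simulated raw data: an assumption that mirrors the question's setup
# (n = 6 draws per group, group means taken from prevs, sd = 15).
set.seed(1)
groups <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs <- c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n <- 6
dat <- data.frame(
  group = factor(rep(groups, each = n), levels = groups),
  value = rnorm(n * length(prevs), mean = rep(prevs, each = n), sd = 15)
)

# Per-group mean and standard error of the mean (sd / sqrt(n))
means <- tapply(dat$value, dat$group, mean)
ses <- tapply(dat$value, dat$group, sd) / sqrt(n)

# Raw points, slightly jittered so overlapping values stay visible
stripchart(value ~ group, data = dat, vertical = TRUE, method = "jitter",
           jitter = 0.1, pch = 16, col = "grey40", ylab = "Percent")

# Mean +/- 1 se, drawn just to the right of each column of points
x <- seq_along(means) + 0.25
points(x, means, pch = 18, col = "red")
arrows(x, means - ses, x, means + ses, angle = 90, code = 3,
       length = 0.03, col = "red")
```

Swapping ses for the per-group sd gives the sd version of the same plot.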
I've made this answer a CW so I don't gain rep
Frank Harrell's (most excellent) keynote entitled "Information Allergy" at useR! last month showed alternatives to these: rather than hiding the raw data via the aggregation the bars provide, the raw data is also shown as dots (or points). "Why hide the data?" was Frank's comment.
Given alpha blending, that strikes me as a most sensible suggestion (and the whole talk was full of good, and important, nuggets).
jitter is also in plain R.
From a psychological perspective, I advocate plotting the data plus your uncertainty about the data. Thus, in a plot like you show, I would never bother with extending the bars all the way to zero, which only serves to minimize the eye's ability to distinguish differences in the range of the data.
Additionally, I'm frankly anti-bargraph; bar graphs map two variables to the same aesthetic attribute (x-axis location), which can cause confusion. A better approach is to avoid redundant aesthetic mapping by mapping one variable to the x-axis and the other variable to a different aesthetic attribute (e.g. point shape or color or both).
Finally, in your plot above, you only include error bars above the value, which hinders one's ability to compare the intervals of uncertainty relative to bars above and below the value.
Here's how I would plot the data (via the ggplot2 package). Note that I add lines connecting points in the same series; some argue that this is only appropriate when the series across which the lines are connected are numeric (as seems to be in this case), however as long as there is any reasonable ordinal relationship among the levels of the x-axis variable, I think connecting lines are useful for helping the eye associate points across the x-axis. This can become particularly useful for detecting interactions, which really stand out with lines.
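The plot and code for this answer are not preserved here, so the following is only a sketch of the approach described, assuming the ggplot2 package is installed. It assumes the letter in each group name (A, B, C) is the series and the digit (1, 2, 3) is the x-axis level, and it simulates the data as in the question (n = 6, sd = 15); none of that is confirmed by the original answer.

```r
library(ggplot2)

# Simulated summaries: an assumption based on the question's setup.
set.seed(1)
prevs <- c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n <- 6
summ <- data.frame(
  condition = rep(c("A", "B", "C"), each = 3),  # series (assumed)
  level = factor(rep(1:3, times = 3)),          # x-axis variable (assumed)
  mean = NA_real_,
  se = NA_real_
)
for (i in seq_along(prevs)) {
  x <- rnorm(n, mean = prevs[i], sd = 15)
  summ$mean[i] <- mean(x)
  summ$se[i] <- sd(x) / sqrt(n)
}

# One variable on the x-axis, the other on colour and shape;
# error bars extend both above and below each mean.
p <- ggplot(summ, aes(x = level, y = mean, colour = condition,
                      shape = condition, group = condition)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.1) +
  labs(x = "Level", y = "Percent")
print(p)
```

Dropping geom_line() gives the version without connecting lines, for cases where the x-axis levels have no sensible ordering.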
geom_ribbon() indicating the error. If you don't like producing apparent estimates for regions between 1 and 2, at least reduce the width of the error bar.
I'm curious as to why you don't like these plots. I use them all the time. Without wanting to state the blooming obvious, they allow you to compare the means of different groups and see if their 95% CIs overlap (i.e., true mean likely to be different).
It's important to get a balance of simplicity and information for different purposes, I guess. But when I use these plots I am saying: "these two groups are different from each other in some important way" [or not].
Seems pretty great to me, but I'd be interested to hear counter-examples. I suppose implicit in the use of the plot is that the data do not have a bizarre distribution which renders the mean invalid or misleading.
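One caveat when reading overlap off these plots: the question says the bars show se or sd, not 95% CIs, and with n around six the difference is large. A small sketch with a single hypothetical group (simulated, n = 6) shows how a t-based 95% CI for the mean relates to the se:

```r
# Hypothetical group: simulated values, not data from the question.
set.seed(1)
n <- 6
x <- rnorm(n, mean = 38, sd = 15)

m <- mean(x)
se <- sd(x) / sqrt(n)            # standard error of the mean

# t-based 95% CI: mean +/- t quantile * se
tq <- qt(0.975, df = n - 1)      # about 2.57 for n = 6
ci <- c(lower = m - tq * se, upper = m + tq * se)

# t.test() computes the same interval
ci_check <- as.numeric(t.test(x)$conf.int)
```

So for n = 6 a one-se handle-bar is roughly 2.6 times shorter than the half-width of a 95% CI, and bars that do not overlap need not imply non-overlapping CIs.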
If the data are rates, that is, the number of successes divided by the number of trials, then a very elegant method is a funnel plot. For example, see http://qshc.bmj.com/content/11/4/390.2.full (apologies if the link requires a subscription; let me know and I'll find another).
It may be possible to adapt it to other types of data, but I haven't seen any examples.
UPDATE:
Here's a link to an example which doesn't require a subscription (and has a good explanation for how they might be used): http://understandinguncertainty.org/fertility
They can be used for non-rate data, by simply plotting mean against standard error, however they may lose some of their simplicity.
The Wikipedia article is not great, as it only discusses their use in meta-analyses. I'd argue they could be useful in many other contexts.
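To make the idea concrete, here is a rough sketch of a funnel plot for rate data in base R. The units and their counts are hypothetical (simulated), and the control limits use the normal approximation to the binomial around the pooled rate; the linked articles may well construct their limits differently.

```r
# Hypothetical data: 30 units with differing numbers of trials.
set.seed(1)
trials <- round(runif(30, 20, 500))
successes <- rbinom(30, trials, 0.3)
rate <- successes / trials
p0 <- sum(successes) / sum(trials)   # pooled rate, used as the target

# Each unit's rate against its number of trials
plot(trials, rate, pch = 16, ylim = c(0, 0.6),
     xlab = "Number of trials", ylab = "Success rate")
abline(h = p0, lty = 2)

# Approximate 95% (z = 1.96) and 99.8% (z = 3.09) limits:
# p0 +/- z * sqrt(p0 * (1 - p0) / n), which narrow as n grows
nn <- seq(min(trials), max(trials), length.out = 200)
for (z in c(1.96, 3.09)) {
  half <- z * sqrt(p0 * (1 - p0) / nn)
  lines(nn, p0 + half, col = "grey40")
  lines(nn, p0 - half, col = "grey40")
}
```

Units falling outside the funnel differ from the pooled rate by more than their sample size alone would explain.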
I would use boxplots here; clean, meaningful, nonparametric... Or vioplot if the distribution is more interesting.
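A minimal base R sketch of the boxplot option, using simulated raw data as an assumption (n = 6 per group around the question's prevs values, sd = 15). With so few points per box the quartiles are coarse, so the raw values are overlaid as well.

```r
# Simulated raw data: an assumption based on the question's setup.
set.seed(1)
groups <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs <- c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n <- 6
dat <- data.frame(
  group = factor(rep(groups, each = n), levels = groups),
  value = rnorm(n * length(prevs), mean = rep(prevs, each = n), sd = 15)
)

# One box per group, coloured in the same repeating pattern as the bars
bp <- boxplot(value ~ group, data = dat, ylab = "Percent",
              col = rep(2:4, times = 3))

# Overlay the individual observations
stripchart(value ~ group, data = dat, vertical = TRUE, add = TRUE,
           method = "jitter", jitter = 0.1, pch = 16)
```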
Simplifying @csgillespie's terrific code from above:
I prefer geom_pointrange to errorbar and think the lines are distracting rather than helpful. Here is a version that I find much cleaner than the @James or @csgillespie version:
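The code for this answer is not preserved here. As a sketch only, assuming ggplot2 is installed and simulating the data as in the question (n = 6, sd = 15), a geom_pointrange version without connecting lines might look like this:

```r
library(ggplot2)

# Simulated summaries: an assumption based on the question's setup.
set.seed(1)
groups <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs <- c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n <- 6
summ <- data.frame(
  group = factor(groups, levels = groups),
  condition = substr(groups, 1, 1),   # A, B or C (assumed grouping)
  mean = NA_real_,
  se = NA_real_
)
for (i in seq_along(prevs)) {
  x <- rnorm(n, mean = prevs[i], sd = 15)
  summ$mean[i] <- mean(x)
  summ$se[i] <- sd(x) / sqrt(n)
}

# A point at each mean with a vertical range of +/- 1 se
p <- ggplot(summ, aes(x = group, y = mean,
                      ymin = mean - se, ymax = mean + se,
                      colour = condition)) +
  geom_pointrange() +
  labs(x = NULL, y = "Percent")
print(p)
```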