Alternative graphics to "handle bar" plots

15

In my area of research, a popular way of displaying data is to use a combination of a bar chart with "handle-bars". For example,

enter image description here

The "handle-bars" alternate between standard errors and standard deviations depending on the author. Typically, the sample size for each "bar" is fairly small - around six.

These plots seem to be particularly popular in the biological sciences - see the first few papers of BMC Biology, vol 3 for examples.

So how would you present this data?

Why I dislike these plots

Personally I don't like these plots.

  1. When the sample size is small, why not just display the individual data points?
  2. Is it the sd or the se that is being displayed? No-one agrees on which to use.
  3. Why use bars at all? The data doesn't (usually) go from 0, but a first pass at the graph suggests it does.
  4. The graphs don't give an idea of the range or sample size of the data.
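
Point 2 matters because the two quantities differ by a factor of sqrt(n). A minimal sketch, assuming one group of n = 6 simulated values (the mean of 38 and spread of 15 are chosen purely for illustration):

```r
# Illustrative only: one simulated group, not real data
set.seed(1)
n <- 6
x <- rnorm(n, mean = 38, sd = 15)

s  <- sd(x)          # standard deviation: spread of the observations
se <- s / sqrt(n)    # standard error: uncertainty of the sample mean

c(sd = s, se = se, ratio = s / se)  # ratio equals sqrt(6), about 2.45
```

For the same six points, a handle-bar drawn from the se is therefore about 2.45 times shorter than one drawn from the sd.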

R script

This is the R code I used to generate the plot. That way you can (if you want) use the same data.

# Generate the data
set.seed(1)
names = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs = c(38, 37, 31, 31, 29, 26, 40, 32, 39)

# Standard error of each group = sd / sqrt(n), with each sample
# simulated around that group's own mean (prevs[i])
n = 6; se = numeric(length(prevs))
for (i in seq_along(prevs))
  se[i] = sd(rnorm(n, prevs[i], 15)) / sqrt(n)

# Basic plot
par(fin=c(6,6), pin=c(6,6), mai=c(0.8,1.0,0.0,0.125), cex.axis=0.8)
barplot(prevs, space=c(0,0,0,3,0,0,3,0,0), names.arg=NULL, horiz=FALSE,
        axes=FALSE, ylab="Percent", col=c(2,3,4), width=5, ylim=range(0,50))

# Add the error bars (one se above each bar); xx holds the bar midpoints
xx = c(2.5, 7.5, 12.5, 32.5, 37.5, 42.5, 62.5, 67.5, 72.5)
for (i in seq_along(prevs)) {
  lines(rep(xx[i], 2), c(prevs[i], prevs[i]+se[i]))
  lines(c(xx[i]+1/2, xx[i]-1/2), rep(prevs[i]+se[i], 2))
}

# Add the axes
axis(2, tick=TRUE, yaxp=c(0, 50, 5))
axis(1, at=xx+0.1, labels=names, font=1,
     tck=0, tcl=0, las=1, padj=0, col=0, cex=0.1)
csgillespie
6
Helping your field come to a consensus on just the se v. sd question would be a major advance. They mean very different things.
John
I agree - se is usually chosen because it gives a smaller region!
csgillespie
Perhaps a more informative title?
3
Just for reference, I have seen these bar charts with error bars called "Dynamite Plots" before. Here are a few references giving the exact same recommendations as everyone else pretty much has (dot charts). Tatsuki Koyama, Beware of Dynamite Poster and Drummond & Vowler, 2011.
Andy W
1
Please add the image again if you can. Use the image uploader this time so it doesn't become a dead link.
endolith

Answers:

16

Thanks for all your answers. For completeness I thought I should include what I usually do. I tend to do a combination of the suggestions given: dots, boxplots (when n is large), and se (or sd) ranges.

(Removed by moderator because the site hosting the image no longer appears to work correctly.)

From the dot plot, it is clear that the data is far more spread out than the "handle bar" plots suggest. In fact, there is a negative value in A3!
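
Since the image is no longer available, here is a rough sketch of this kind of plot in base R. It simulates fresh raw values around the same group means as the question, so these are an assumption and not the points from the lost figure (the original raw values were never stored). Each group mean is marked with a short horizontal tick.

```r
# Assumption: raw values are re-simulated here; they are not the original data
set.seed(1)
group_names <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3")
prevs <- c(38, 37, 31, 31, 29, 26, 40, 32, 39)
n <- 6

raw <- data.frame(
  group = factor(rep(group_names, each = n), levels = group_names),
  value = rnorm(n * length(prevs), mean = rep(prevs, each = n), sd = 15)
)

# One column of points per group, jittered horizontally to reduce overlap;
# colours follow the same 2, 3, 4 cycle as the bars in the question
stripchart(value ~ group, data = raw, vertical = TRUE, method = "jitter",
           jitter = 0.1, pch = 16, col = rep(2:4, times = 3), ylab = "Percent")

# Short horizontal tick at each group mean
group_means <- tapply(raw$value, raw$group, mean)
segments(seq_along(group_means) - 0.25, group_means,
         seq_along(group_means) + 0.25, group_means, lwd = 2)
```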


I've made this answer a CW so I don't gain rep

csgillespie
3
That's a good answer. In addition, I'd suggest horizontally jittering the points, so they don't overlap, especially if you have more points per group than this. In ggplot2, the geom_jitter() will do that.
Harlan
@Harlan: I agree. Although if I had many more points I would probably use a boxplot.
csgillespie
1
I also like scatterplots for small data sets (nb, I use the term 'dotplot' to refer to a slightly different plot). However, for what it's worth, the barplot above is cleaner & easier to read than this one. I'm not sure that makes it better, but it's worth pointing out.
gung - Reinstate Monica
@Harlan: Alternatively, make the dots transparent so that multiple dots stack up and produce a darker dot?
endolith
do you have the original image to replace this dead link?
endolith
10

Frank Harrell's (most excellent) keynote entitled "Information Allergy" at useR! last month showed alternatives to these: rather than hiding the raw data via the aggregation the bars provide, the raw data is also shown as dots (or points). "Why hide the data?" was Frank's comment.

Given alpha blending, that strikes me as a most sensible suggestion (and the whole talk was full of good, and important, nuggets).
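
A minimal base R sketch of the alpha-blending idea, assuming three simulated groups of 30 values each (not data from the question). Semi-transparent points darken where observations pile up, and the group means are overlaid so the summary and the raw data share one plot.

```r
# Illustrative data: 3 groups of 30 simulated values (not from the question)
set.seed(1)
g <- rep(1:3, each = 30)
y <- rnorm(90, mean = c(38, 31, 40)[g], sd = 15)

# adjustcolor() adds an alpha channel; overlapping points build up darker marks
semi <- adjustcolor("black", alpha.f = 0.3)
plot(jitter(g, amount = 0.1), y, pch = 16, col = semi,
     xaxt = "n", xlab = "Group", ylab = "Percent", xlim = c(0.5, 3.5))
axis(1, at = 1:3, labels = c("A", "B", "C"))

# Overlay the group means as red dashes
points(1:3, tapply(y, g, mean), pch = "-", cex = 3, col = "red")
```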

Dirk Eddelbuettel
1
Is it available as a video? It sounds great.
Henrik
1
I think the word is "will be eventually" -- keynotes got recorded.
Dirk Eddelbuettel
1
this is easy in ggplot I think, i.e. had.co.nz/ggplot2/geom_jitter.html
Mike Dewar
1
jitter is also in plain R.
2
Just for the protocol, Frank's talk (in video) is now online: r-bloggers.com/RUG/2010/08/user-2010-conference-videos
Tal Galili
7

From a psychological perspective, I advocate plotting the data plus your uncertainty about the data. Thus, in a plot like you show, I would never bother with extending the bars all the way to zero, which only serves to minimize the eye's ability to distinguish differences in the range of the data.

Additionally, I'm frankly anti-bargraph; bar graphs map two variables to the same aesthetic attribute (x-axis location), which can cause confusion. A better approach is to avoid redundant aesthetic mapping by mapping one variable to the x-axis and another variable to another aesthetic attribute (eg. point shape or color or both).

Finally, in your plot above, you only include error bars above the value, which hinders one's ability to compare the intervals of uncertainty relative to bars above and below the value.

Here's how I would plot the data (via the ggplot2 package). Note that I add lines connecting points in the same series; some argue that this is only appropriate when the series across which the lines are connected are numeric (as seems to be in this case), however as long as there is any reasonable ordinal relationship among the levels of the x-axis variable, I think connecting lines are useful for helping the eye associate points across the x-axis. This can become particularly useful for detecting interactions, which really stand out with lines.

library(ggplot2)
# Reuses names, prevs and se from the R script in the question
a = data.frame(names, prevs, se)
a$let = substr(a$names, 1, 1)  # series letter (A, B, C)
a$num = substr(a$names, 2, 2)  # position within the series (1, 2, 3)
# geom_*() calls replace the older layer(geom = '...') form, which
# current ggplot2 releases reject without explicit stat and position
ggplot(data = a) +
    geom_point(
        mapping = aes(x = num, y = prevs, colour = let, shape = let)
    ) +
    geom_line(
        mapping = aes(x = num, y = prevs, colour = let, linetype = let, group = let)
    ) +
    geom_errorbar(
        mapping = aes(x = num, ymin = prevs - se, ymax = prevs + se, colour = let)
        , alpha = .5
        , width = .5
    )

enter image description here

Mike Lawrence
1
I should add that my "plot only the data and uncertainty" recommendation should be qualified: when presenting data to an audience that has experience/expertise with the variable being plotted, plot only the data and uncertainty. When presenting data to a naive audience and when zero is a meaningful data point, I'd first show the data extending to zero so that the audience can get oriented to the scale, then zoom in to show just the data and uncertainty.
Mike Lawrence
Since you've gone to the trouble of writing R code, could you include a jpeg image of the final plot? I find just uploading the image to img84.imageshack.us and linking to it is fairly easy. Oh, and thanks for the answer :)
csgillespie
@csgillespie: done.
Mike Lawrence
I've found that it's easier to read a plot like this with geom_ribbon() indicating the error. If you don't like producing apparent estimates for regions between 1 and 2, at least reduce the width of the error bar.
JoFrhwld
@JoFrhwld: I like ribbons too, though I tend to reserve them for cases where the x-axis variable is truly numeric; my version of the "don't draw lines unless the x-axis variable is numeric" rule that I profess violating in my answer above :Op
Mike Lawrence
2

I'm curious as to why you don't like these plots. I use them all the time. Without wanting to state the blooming obvious, they allow you to compare the means of different groups and see if their 95% CIs overlap (i.e., true mean likely to be different).

It's important to get a balance of simplicity and information for different purposes, I guess. But when I use these plots I am saying- "these two groups are different from each other in some important way" [or not].

Seems pretty great to me, but I'd be interested to hear counter-examples. I suppose implicit in the use of the plot is that the data do not have a bizarre distribution which renders the mean invalid or misleading.
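
One caveat when reading overlap off such a plot: a bar of one standard error is much shorter than a 95% confidence interval at small n. A minimal sketch of the difference for n = 6, assuming roughly normal data and a t-based interval (the values are simulated for illustration only):

```r
set.seed(1)
n <- 6
x <- rnorm(n, mean = 38, sd = 15)   # illustrative sample, not real data

se <- sd(x) / sqrt(n)               # standard error of the mean
t_mult <- qt(0.975, df = n - 1)     # about 2.57 for n = 6

ci <- mean(x) + c(-1, 1) * t_mult * se   # t-based 95% CI for the mean
ci
```

At this sample size the 95% interval is about 2.57 times wider than a bar of plus or minus one se.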

Chris Beeley
I've added a small section on why I dislike these plots.
csgillespie
1
@Chris check this out about interpreting overlapping CIs: pubs.amstat.org/doi/abs/10.1198/000313001317097960 Also, the original question is partly about the confusion caused by using SE and SD interchangeably when they are two different things.
tosonb1
Or, for an analysis on this site, see stats.stackexchange.com/questions/18215. @tosonb1 Your link is timing out. Could you supply a reference to the paper?
whuber
2

If the data are rates: that is number of successes divided by number of trials, then a very elegant method is a funnel plot. For example, see http://qshc.bmj.com/content/11/4/390.2.full (apologies if the link requires a subscription--let me know and I'll find another).

It may be possible to adapt it to other types of data, but I haven't seen any examples.

UPDATE:

Here's a link to an example which doesn't require a subscription (and has a good explanation for how they might be used): http://understandinguncertainty.org/fertility

They can be used for non-rate data, by simply plotting mean against standard error, however they may lose some of their simplicity.

The wikipedia article is not great, as it only discusses their use in meta-analyses. I'd argue they could be useful in many other contexts.
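
As a rough sketch of the idea for rate data: plot each unit's observed rate against its number of trials and overlay approximate 95% limits around the pooled rate. The unit sizes, the simulated outcomes, and the normal-approximation limits below are all assumptions made for illustration; exact binomial limits are a common alternative.

```r
set.seed(1)
trials    <- round(runif(30, min = 20, max = 500))   # hypothetical unit sizes
successes <- rbinom(30, size = trials, prob = 0.3)   # hypothetical outcomes
rate      <- successes / trials
pooled    <- sum(successes) / sum(trials)

# Approximate 95% control limits: they narrow as the number of trials grows
n_grid <- seq(min(trials), max(trials), length.out = 200)
half   <- 1.96 * sqrt(pooled * (1 - pooled) / n_grid)

plot(trials, rate, pch = 16, xlab = "Number of trials", ylab = "Observed rate",
     ylim = range(rate, pooled - half, pooled + half))
abline(h = pooled, lty = 2)      # pooled rate
lines(n_grid, pooled + half)     # upper limit
lines(n_grid, pooled - half)     # lower limit
```

Units falling outside the funnel are the ones whose rates differ from the pooled rate by more than sampling variation alone would suggest.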

Simon Byrne
The data isn't necessarily rates. It could be anything.
csgillespie
Subscription link, unfortunately.
Matt Parker
... but here's the Wikipedia link on funnel plots: en.wikipedia.org/wiki/Funnel_plot
Matt Parker
2

I would use boxplots here; clean, meaningful, nonparametric... Or vioplot if the distribution is more interesting.
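
A minimal sketch in base R, with the raw points drawn on top of each box so that the number of observations stays visible. The data are simulated (three hypothetical groups of six) purely for illustration.

```r
set.seed(1)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 6)),
  value = rnorm(18, mean = rep(c(38, 31, 40), each = 6), sd = 15)
)

# Boxes first (outlier symbols suppressed, since every point is drawn below)
bp <- boxplot(value ~ group, data = dat, outline = FALSE,
              ylim = range(dat$value), ylab = "Percent")

# Then the individual observations, jittered, on top of the boxes
stripchart(value ~ group, data = dat, vertical = TRUE, method = "jitter",
           jitter = 0.1, pch = 16, add = TRUE)
```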


2
I'm not sure boxplots or vioplots would be suitable with such a small sample size (n = 6)
csgillespie
Right, I admit I haven't read the question carefully enough, so it was rather a general idea; nevertheless I think that 6 points is minimal but enough for a boxplot. I have made some experiments and they were meaningful. On the other hand, obviously boxplot does not indicate the number of observations (which is an important bit of information here), so I would rather use a combination of it and points.
With 6 points - scatter plot is probably best (maybe with adding a red dot for the mean)
Tal Galili
2
I generally use boxplots with superimposed points, I find it very "visual". Violin plots, instead, are a bit hard to understand in my opinion.
nico
1
@csgillespie: What would indicate that bar and whisker plots are better? They are showing basically the same information as a boxplot (as you point out, the whiskers can represent various things), they just give the error only in one direction, which could be fairly confusing, if not disingenuous... Not arguing for boxplots. But beanplots/violinplots should still work, even for relatively low sample sizes, because it's just a gaussian density estimation, as I explained here.
naught101
1

Simplifying @csgillespie's terrific code from above:

qplot(
    data=a,
    x=num,
    y=prevs,
    colour=let,
    shape=let,
    group=let,
    ymin=prevs-se,
    ymax=prevs+se,
    position=position_dodge(width=0.25),
    geom=c("point", "line", "errorbar")
    )
James Waters
0

I prefer geom_pointrange to errorbar and think the lines are distracting rather than helpful. Here is a version that I find much cleaner than the @James or @csgillespie version:

qplot(
 data=a,
 x=num,
 y=prevs,
 colour=let,
 ymin=prevs-se,
 ymax=prevs+se,
 position=position_dodge(width=0.25),
 geom=c("pointrange"), size=I(2)
 )
Kent Johnson