The most popular recent post on this blog is our piece on how people injure themselves by age, because of course, old people falling off toilets gets more attention than any serious analysis I might do. Statisticians spend a lot of their time with these types of "distributional analyses," with many different methodologies, so it's an interesting topic to tackle. And that database with old people falling off toilets... there's a lot more there.
Most people remember some very basic distributional analysis from school: mean and variance (or standard deviation). Statisticians use many other measures to describe distributions, including kurtosis, skew, modality, and quantiles. All of these measures are interesting to statisticians and help us do our jobs more effectively. Often, though, it's more helpful for both statisticians and non-statisticians to visualize distributions, along with their differences and similarities.
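For readers who want to see those summary measures as numbers, here is a minimal sketch using only Python's standard library. The `describe` helper and the `ages` list are made up for illustration (they are not from the injury database), and the skew and kurtosis shown are the simple population versions; statistical packages often apply small-sample corrections.

```python
import statistics

def describe(values):
    """Return common distributional summaries for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Standardized third and fourth moments (population versions).
    skew = sum((x - mean) ** 3 for x in values) / (n * sd ** 3)
    kurtosis = sum((x - mean) ** 4 for x in values) / (n * sd ** 4)
    # Quartiles; "inclusive" treats the data as the whole population.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "sd": sd,
        "skew": skew,
        "kurtosis": kurtosis,
        "q1": q1,
        "median": median,
        "q3": q3,
        "modes": statistics.multimode(values),
    }

# Hypothetical ages, for illustration only.
ages = [2, 3, 3, 4, 5, 8, 12, 15, 16, 17, 18, 21, 25, 34, 47, 62, 78]
summary = describe(ages)
for name, value in summary.items():
    print(name, value)
```

A positive skew here reflects the long tail of older ages pulling the mean above the median, which is exactly the kind of thing a plot makes obvious at a glance.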
For this analysis, I extracted data from the same injury database as before, looking at seven common household objects that cause injuries:
- Basketball Equipment
- Soccer Equipment
- Ceilings and Walls
- Sewing Items
Visualizing multiple distributions is fairly easy, and I use two main types of plots: boxplots and violin plots.
A boxplot gives a visual description of several distributional parameters at once, so boxplots laid side by side let us compare different distributions. Here are the basics:
- The "box" itself ranges from the 1st quartile to the 3rd quartile, meaning that 50% of observations fall inside the box.
- The line in the middle of the box represents the median, or central observation of the distribution.
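- The "whiskers" (lines coming off the box) represent the range of non-outlier observations. Different boxplot programs define this range differently; a common convention extends each whisker to the most extreme observation within 1.5 times the interquartile range of the box.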
- The dots (in black) represent outliers.
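The pieces listed above can be computed by hand. The sketch below assumes the common 1.5 × IQR whisker convention, which may not match what any particular plotting program does; `boxplot_stats` is a hypothetical helper and the ages are invented for illustration.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Compute the numbers a boxplot draws, using the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical ages: mostly teenagers, plus two much older people.
ages = [12, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 45, 71]
stats = boxplot_stats(ages)
print(stats)
```

With these made-up ages the box spans 14 to 18, the whiskers reach 12 and 19, and the two older ages are drawn as outlier dots.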
Here's a set of boxplots of injuries by age and household objects:
This gives us a general idea of the distributions, with some interesting results:
- Injuries from athletic equipment are tightly concentrated among younger people.
- Beds, it seems, hurt people of all ages.
- Ceilings and walls (as shown before) seem to hurt people from adolescence into their forties.
This is great, but I still don't know exactly what the underlying distributions look like. Enter the violin plot.
Boxplots are great, but they are still abstract and give us only a few data points to describe a distribution. What if we want to understand more about the entire distribution curve?
The violin plot shows a smoothed version of the entire distribution, so it reveals shape that a boxplot's handful of summary numbers can hide. Here's what those look like on the same data as before.
This gives us a little more information on the distributions, showing that athletic injuries tend to center around adolescents, while bicycle injuries peak among children but remain a low risk throughout adulthood. Additionally, we see that the "Beds" distribution is bimodal, meaning that both young children and older people are likely to be hurt by beds.
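The smoothing in a violin plot is usually a kernel density estimate, though I am assuming that here rather than describing any specific plotting package. As a rough sketch, the code below builds a Gaussian kernel density estimate in plain Python and looks for its peaks. The ages, the bandwidth of 4 years, and the helper names are all made up for illustration.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

def find_peaks(density, grid):
    """Return grid points where the density is a strict local maximum."""
    heights = [density(x) for x in grid]
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if heights[i - 1] < heights[i] > heights[i + 1]
    ]

# Hypothetical ages: a cluster of small children and a cluster of older adults.
young = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
old = [68, 72, 75, 78, 80, 80, 82, 85, 88, 91]
ages = young + old

density = gaussian_kde(ages, bandwidth=4.0)
peaks = find_peaks(density, grid=list(range(0, 101)))
print(peaks)  # one peak in early childhood, one in old age
```

The bandwidth controls how much smoothing is applied: a smaller value shows more bumps, a larger one can blur two modes into one, so a bimodal shape is worth checking at more than one setting.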
Once again, good information, but there's something better.
The problem with a violin plot is that it doesn't give us the common reference points of a distribution (quartiles, median, outliers) the way a boxplot does. I know nothing is perfect, but what about a violin plot with an embedded boxplot... like this:
Ah, the best of both worlds. It's interesting to look at the beds, toilets, and walls plots, where the mode (the age where most injuries occur) falls outside of the box. Neither a boxplot nor a violin plot alone would catch this, but together they do.
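To make the "mode outside the box" point concrete, here is a hedged sketch on synthetic data, not the injury database. It finds the peak of a Gaussian kernel density estimate, compares it to the quartiles, and includes an optional function that overlays a narrow boxplot on a violin using matplotlib's `violinplot` and `boxplot`. The data, bandwidth, function names, and output file name are all assumptions for illustration, and this is not necessarily how the plots in this post were made.

```python
import math
import statistics

def kde_mode(values, bandwidth, grid):
    """Return the grid point where a Gaussian kernel density estimate peaks."""
    def density(x):
        # Unnormalized density; the constant factor does not move the peak.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return max(grid, key=density)

def plot_violin_with_box(groups, path="violin_with_box.png"):
    """Draw a violin per group with a narrow boxplot on top (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    labels = list(groups)
    data = [groups[name] for name in labels]
    positions = list(range(1, len(data) + 1))
    fig, ax = plt.subplots()
    ax.violinplot(data, positions=positions, showextrema=False)
    ax.boxplot(data, positions=positions, widths=0.1)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.set_ylabel("Age at injury")
    fig.savefig(path)
    plt.close(fig)
    return path

# Synthetic ages: a tight spike of small children plus adults spread thinly
# from 30 to 89. Most observations are adults, but the peak is in childhood.
young = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5] * 2
adults = [30 + (i * 60) // 80 for i in range(80)]
ages = young + adults

q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
mode = kde_mode(ages, bandwidth=2.0, grid=range(0, 101))
mode_outside_box = not (q1 <= mode <= q3)
print(q1, q3, mode, mode_outside_box)

# To draw the combined plot (requires matplotlib):
# plot_violin_with_box({"Synthetic ages": ages})
```

In this toy example the box covers the adult ages, so a boxplot alone would suggest a middle-aged problem, while the violin's widest point sits in early childhood.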