Tuesday, January 26, 2016

Distribution Analysis Methods, and Bed-Related Injuries

The most popular recent post on this blog is our piece on how people injure themselves by age, because of course, old people falling off toilets gets more attention than any serious analysis I might do.  Statisticians spend a lot of their time on these types of "distributional analyses," using many different methodologies, so it's an interesting topic to tackle.  And that database with old people falling off toilets... there's a lot more there.



Most people remember some very basic distributional analysis from school: Mean and Variance (Standard Deviation).  There are many other measures that statisticians use in describing distributions, including kurtosis, skew, modal behavior, quantiles, etc.  All of these measures are interesting to statisticians, and help us do our jobs more effectively.  Many times, though, it's more helpful for both statisticians and non-statisticians to visualize distributions, along with their differences and similarities.
For this analysis, I extracted data from the same injury database as before, looking at seven common household objects that cause injuries:
  • Basketball Equipment
  • Soccer Equipment
  • Ceilings and Walls
  • Sewing Items
  • Toilets
  • Beds
  • Bicycles 
Visualizing multiple distributions is fairly easy, and I use two main types of plots: boxplots and violin plots.


A boxplot gives a visual description of several distributional parameters at once, so laid side by side, we can compare different distributions.  Here are the basics:
  • The "box" itself ranges from the 1st quartile to the 3rd quartile, meaning that 50% of observations fall inside the box.
  • The line in the middle of the box represents the median, or central observation of the distribution.
  • The "whiskers" (lines coming off the box) represent the range of non-outlier observations.  Different boxplot programs define this range differently.
  • The dots (in black) represent outliers.
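The pieces described above can be computed by hand. The following is a minimal sketch, not the code behind the plots in this post: it assumes the common Tukey convention (whiskers reach the most extreme observations within 1.5 times the interquartile range of the box), and the `ages` list is made up for illustration.

```python
import statistics


def boxplot_stats(values, whisker=1.5):
    """Return the numbers a boxplot draws, using Tukey-style fences (an assumption)."""
    data = sorted(values)
    # Quartiles: the box spans q1..q3 and the line inside it is the median.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical ages of injured people for one object (not from the injury database).
ages = [12, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 22, 71]
stats = boxplot_stats(ages)
print(stats)
```

Changing `whisker`, or the quartile `method`, is exactly the kind of difference between boxplot programs mentioned in the list.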
Here's a set of boxplots of injuries by age and household objects:
This gives us a general idea of the distributions, with some interesting observations:
  • Athletic equipment injuries fall in a tight distribution centered on younger people.
  • Beds, it seems, hurt people of all ages.
  • Ceilings and walls (as shown before) seem to hurt people from adolescence into their forties.
This is great, but I still don't know exactly what the underlying distributions look like.  Enter the violin plot.



Boxplots are great, but they are still abstract and give us only a few data points to describe a distribution.  What if we want to understand more about the entire distribution curve? 
The violin plot is a smoothed picture of the entire distribution, so it shows the shape of the curve rather than just a few summary points.  Here's what those look like on the same data as before.
This gives us a little more information on the distributions, showing that athletic injuries tend to center around adolescents, while bicycles modally impact children, but are a low risk throughout adulthood.  Additionally, we see that the "Beds" distribution is bimodal, meaning that both young children and older people are likely to be hurt by beds.
Once again, good information, but there's something better.



The problem with a violin plot is that it doesn't give us the common reference points of a distribution (quartiles, median, outliers) the way a boxplot does.  I know nothing is perfect, but what about a violin plot with an embedded boxplot... like this:

Ah, the best of both worlds.  It's interesting to look at the beds, toilets, and walls plots, where the mode (the age where most injuries occur) falls outside of the box.  Neither a boxplot nor a violin plot alone would catch this, but together they do.
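To see how a mode can land outside the box, here is a small sketch on made-up data. It is an illustration only: the ages are simulated (a tight cluster of toddlers plus a wide cluster of older adults, loosely imitating the "Beds" shape), and the smoothing is a hand-rolled Gaussian kernel density estimate with an arbitrarily chosen bandwidth, which is the same idea a violin plot uses but not necessarily the same settings.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function giving the smoothed density of `sample` at a point x."""
    scale = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


random.seed(0)
# Simulated (hypothetical) ages: 200 toddlers near age 3, 600 adults near age 75.
ages = [random.gauss(3, 1) for _ in range(200)]
ages += [random.gauss(75, 15) for _ in range(600)]
ages = [min(max(a, 0), 100) for a in ages]  # keep ages in a plausible range

# The box of a boxplot: first quartile to third quartile.
q1, median, q3 = statistics.quantiles(ages, n=4)

# The widest point of a violin plot: the age with the highest smoothed density.
density = gaussian_kde(ages, bandwidth=2.0)
mode_age = max(range(0, 101), key=density)

print(f"box: {q1:.1f} to {q3:.1f}, median {median:.1f}")
print(f"mode (peak of the smoothed curve): age {mode_age}")
```

In this simulated sample the tallest peak sits with the toddlers, even though the box is drawn around the much larger and wider adult group.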

Wednesday, January 20, 2016

Lottery, Bad Math, and Worse Understanding of Financial Instruments

Earlier this week, a certain meme making the rounds on Facebook drew a lot of criticism and quiet giggles from the math literate.  Here's the meme; see if you can pick out the error...

After the quiet giggles, I realized that there were actually two errors in the meme:
  1. Math: The simple mathematical error here is that this would only give each person $4.33.
  2. Financial Instrument:  The meme shows a fundamental misunderstanding of the lottery as a financial instrument, one that is not capable of "creating money."


The math error here is pretty straightforward; in fact, someone likely just used the wrong number of digits in one of the numbers when entering this in a calculator.  The error doesn't actually bother me that much.  Arithmetic happens.  In fact, there's a pretty good reason these types of errors occur: most people don't spend their days dividing numbers in the billions by numbers in the millions.
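Here is the division as a quick sketch. The inputs are an assumption on my part: I'm taking the meme's figures to be a $1.3 billion jackpot and a population of 300 million, which is what produces $4.33. The last line is the context check: how much money the meme's answer would actually require.

```python
# Assumed inputs (taken to be the meme's figures, not stated in the text above).
jackpot = 1_300_000_000      # dollars
population = 300_000_000     # people

per_person = jackpot / population
print(f"Actual share per person: ${per_person:.2f}")

# What the meme claimed each person would get.
claimed_per_person = 4_330_000  # dollars
error_factor = claimed_per_person / per_person
print(f"The claim is off by a factor of about {error_factor:,.0f}")

# Context check: total money needed to pay everyone the claimed amount.
total_needed = claimed_per_person * population
print(f"Paying the claimed amount would take ${total_needed:,.0f}")
```

The total needed comes out in the quadrillions of dollars, which is the kind of result that should fail a reasonableness review on sight.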

Truth is, math errors are easy, and in the field that I'm in, I see these types of errors all the time.  But what prevents these regular math errors from leading to larger societal breakdowns?  Context.  Analysts consider context when evaluating the results of their equations, and review for reasonableness.  In this case, if you know anything about how the lottery works, you know that the results from the meme are unreasonable. Which brings me to a larger point on the context of the lottery...


The reason I knew the meme was wrong immediately wasn't because I did the math; it's because of the way the lottery works.  Effectively, the lottery in the United States is a money-making (for states) gambling-based financial instrument.  Here's a simplified primer:
  1. People buy tickets totaling some initial revenue number paid to the lottery.
  2. The lottery splits that revenue amount among three categories: 
    1.  Cost/Overhead coverage (smallish %)
    2.  "Profit to the State(s)"
    3.  Money for prizes/jackpots
  3. After this step, the "game" is played, payouts are made.
  4. If no one won in step three, some games (like the Powerball) allow the jackpot to aggregate towards the next drawing, which is how we got to this 1.5 billion dollar jackpot.
  5. Eventually someone wins the jackpot, and the jackpot starts over from a lower number ($40 million in the case of Powerball)
There are a lot of other rules in the playing of a lottery like the Powerball.  All of them can be accounted for actuarially, and most of them don't significantly impact the results.
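The primer above can be sketched as a toy model. Everything numeric here is an assumption for illustration: the 10% overhead and 40% state shares, the ticket sales per drawing, and the simplification that all prize money feeds the jackpot. Only the $40 million reset value comes from the primer.

```python
def split_revenue(ticket_sales, overhead_share=0.10, state_share=0.40):
    """Step 2: split one drawing's ticket revenue (shares are assumed values)."""
    overhead = ticket_sales * overhead_share
    state_profit = ticket_sales * state_share
    prizes = ticket_sales - overhead - state_profit
    return overhead, state_profit, prizes


def run_drawings(sales_per_drawing, jackpot_won, starting_jackpot=40_000_000):
    """Return the jackpot on offer at each drawing, rolling over until a win."""
    jackpot = starting_jackpot
    offered = []
    for sales, won in zip(sales_per_drawing, jackpot_won):
        _, _, prizes = split_revenue(sales)
        jackpot += prizes  # simplification: all prize money goes to the jackpot
        offered.append(jackpot)
        if won:
            jackpot = starting_jackpot  # step 5: start over after a win
    return offered


# Hypothetical ticket sales for three drawings; the jackpot is won on the third.
sales = [100_000_000, 150_000_000, 300_000_000]
won = [False, False, True]

offered = run_drawings(sales, won)
total_sales = sum(sales)
total_prize_money = sum(split_revenue(s)[2] for s in sales)

print([f"${j:,.0f}" for j in offered])
print(f"Buyers paid ${total_sales:,.0f}; prizes received ${total_prize_money:,.0f}")
```

However the shares are set, the money added to the prize pool is always less than what ticket buyers paid in.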

The important part here is that there are no external inputs to the Powerball (meaning that no one, whether government, private donor, or anyone else, is net infusing money outside of ticket sales) and the Powerball as a financial instrument doesn't create money for ticket buyers (the annuity option does grow the prize, but that growth is just long-term investment returns; see below).  In aggregate it loses money for buyers, as the State takes away a share of ticket sales.

But how does this relate back to our initial meme?

  • For the lottery to be large enough that each American could take $4.3 million from it, each American would (on average) have to spend over $4.3 million on Powerball tickets since someone last claimed the jackpot.
  • There is something innately sad about people not realizing that in aggregate, the lottery is a net loss.
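The first point is just the revenue split run backwards. As a rough sketch, assuming (hypothetically) that half of every ticket dollar reaches the prize pool:

```python
# Amount the meme claimed each American would receive.
claimed_per_person = 4_330_000  # dollars

# Assumed share of ticket sales that ends up as prize money (hypothetical value).
prize_share = 0.50

# Ticket spending per person needed to fund that payout from ticket sales alone.
required_spend_per_person = claimed_per_person / prize_share
print(f"Each person would need to spend ${required_spend_per_person:,.0f}")
```

Because the prize share is always below 100%, the required spending is always larger than the payout.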


I should probably have ended this blog entry there, but there's one last *numbers* piece to lottery payouts that people should probably know.  That $1.5 billion advertised does not equal how much money you would receive if you won today.  In the end, you have two options in how to claim your lottery prize:
  • Lump sum.  You get the full amount all at once. 
  • Structured Annuity: The lottery keeps your money, invests it, and pays it to you in 30 annual payments.  Because of the investment over 30 years, you get an aggregate amount larger than the lump sum.
Here's the problem: That $1.5 billion jackpot number is the aggregate of the annuity payout (not adjusted for inflation, either), not the lump sum amount.  In essence the lottery quotes the larger of the two amounts, in a way that is probably misleading to most people who don't understand annuities.
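To get a feel for the gap, here is a rough present-value sketch. It rests on assumptions that are mine, not the lottery's: 30 equal annual payments with the first paid immediately, and a 3% annual investment return. The real payment schedule and rates may differ.

```python
def lump_sum_value(advertised_total, years=30, annual_return=0.03):
    """Cash needed today to fund `years` equal annual payments (assumed schedule).

    The first payment is made immediately; each later payment is discounted
    by the assumed annual investment return.
    """
    payment = advertised_total / years
    return sum(payment / (1 + annual_return) ** t for t in range(years))


advertised = 1_500_000_000  # the advertised jackpot, in dollars
lump_sum = lump_sum_value(advertised)

print(f"Advertised (sum of annuity payments): ${advertised:,.0f}")
print(f"Approximate lump sum at 3%:           ${lump_sum:,.0f}")
print(f"Lump sum as a share of advertised:    {lump_sum / advertised:.0%}")
```

The higher the assumed return, the smaller the lump sum needed to back the same advertised number, which is why the headline figure overstates what a winner could take home today.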


Obviously this post has deviated substantially from the normal content of this blog, but I think it makes a few key points:
  • In the age of calculators, arithmetic errors are usually merely data entry errors, which can be caught through analysis of the context of the data.
  • More concerning about our initial meme is that the error demonstrates a misunderstanding of the lottery as a financial instrument.  Not realizing the source of the money and the mechanisms behind the lottery can lead to irrational expectations of outcomes from playing the lottery, especially for under-educated populations.
  • The lottery tends to overstate actual winnings by using aggregate annuitized winnings rather than the easier-to-understand lump sum payout.