Tuesday, January 26, 2016

Distribution Analysis Methods, and Bed-Related Injuries

The most popular recent post on this blog is our piece on how people injure themselves by age, because of course, old people falling off toilets gets more attention than any serious analysis I might do.  Statisticians spend a lot of their time with these types of "distributional analyses," with many different methodologies, so it's an interesting topic to tackle.  And that database with old people falling off toilets... there's a lot more there.



Most people remember some very basic distributional analysis from school: Mean and Variance (Standard Deviation).  There are many other measures that statisticians use in describing distributions, including kurtosis, skew, modal behavior, quantiles, etc.  All of these measures are interesting to statisticians, and help us do our jobs more effectively.  Many times, though, it's more helpful for both statisticians and non-statisticians to visualize distributions, along with their differences and similarities.
For this analysis, I extracted data from the same injury database as before, looking at seven common household objects that cause injuries:
  • Basketball Equipment
  • Soccer Equipment
  • Ceilings and Walls
  • Sewing Items
  • Toilets
  • Beds
  • Bicycles 
Visualizing multiple distributions is fairly easy, and I use two main types of plots: boxplots and violin plots.


A boxplot gives a visual description of several distributional parameters at once, so laid side by side, we can compare different distributions.  Here are the basics:
  • The "box" itself ranges from the 1st quartile to the 3rd quartile, meaning that 50% of observations fall inside the box.
  • The line in the middle of the box represents the median, or central observation of the distribution.
  • The "whiskers" (lines coming off the box) represent the range of non-outlier observations.  Different boxplot programs define outliers differently, so the whiskers can end in different places.
  • The dots (in black) represent outliers.
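To make those pieces concrete, here is a minimal sketch of how the box, whiskers, and outliers could be computed.  The ages are made up for illustration (they are not from the injury database), and the 1.5 × IQR whisker rule is an assumption: it is a common convention, but as noted above, programs differ.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Compute the pieces a boxplot draws.

    Assumes the common 1.5 * IQR rule for outliers; other programs
    may use a different rule.
    """
    data = sorted(values)
    # Quartiles: the box runs from q1 to q3, with the median inside it.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers reach to the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical ages of injured people, invented for this example.
ages = [12, 13, 14, 14, 15, 15, 16, 17, 18, 19, 21, 64]
stats = boxplot_stats(ages)
print(stats)
```

In this toy sample the single 64-year-old lands beyond the upper fence, so it would be drawn as a dot rather than stretching the whisker.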
Here's a set of boxplots of injuries by age and household objects:
This gives us a general idea of the distributions, with some interesting outputs:
  • Younger people in a tight distribution tend to hurt themselves on athletic equipment.
  • Beds tend to hurt people of all ages it seems.
  • Ceilings and walls (as shown before) seem to hurt people from adolescence into their forties.
This is great, but I still don't know exactly what the underlying distributions look like.  Enter the violin plot.



Boxplots are great, but they are still abstract and give us only a few data points to describe a distribution.  What if we want to understand more about the entire distribution curve? 
The violin plot is a smoothed version of the entire distribution, so it shows the shape of the curve rather than just a few summary points.  Here's what those look like on the same data as before.
This gives us a little more information on the distributions, showing that athletic injuries tend to center around adolescents, while bicycles modally impact children, but are a low risk throughout adulthood.  Additionally we see that the "Beds" distribution is bi-modal, meaning that both young children, and older people are likely to be hurt by beds. 
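For the curious, the smoothing behind a violin plot is typically a kernel density estimate.  Here is a minimal sketch of that idea, assuming a Gaussian kernel and a hand-picked bandwidth of 5 years (plotting libraries usually choose the bandwidth for you).  The ages are invented to mimic a bi-modal shape like the "Beds" one; they are not the real data.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at x.

    Each observation contributes a small Gaussian bump; the estimate
    is the average of those bumps.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density

def local_peaks(density, grid):
    """Grid points where the density is higher than at both neighbors."""
    ys = [density(x) for x in grid]
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]
    ]

# Hypothetical ages: a cluster of small children and a cluster of seniors.
ages = [1, 2, 2, 3, 3, 4, 5, 70, 74, 78, 80, 82, 85, 88]
density = gaussian_kde(ages, bandwidth=5.0)
peaks = local_peaks(density, list(range(0, 101)))
print(peaks)  # one peak per cluster means the curve is bi-modal
```

A smaller bandwidth would make the curve bumpier and a larger one would blur the two clusters together, which is why the bandwidth choice matters when reading a violin plot.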
Once again, good information, but there's something better.



The problem with a violin plot is it doesn't give us those common references of distribution (quartiles, median, outliers) like a boxplot.  I know nothing is perfect, but what about a violin plot with an embedded boxplot... like this:

Ah, the best of both worlds.  Interesting to look at the beds, toilets, and walls plots, where the mode (age where most injuries occur) falls outside of the box.  Neither a boxplot nor a violin plot would catch this, but together they do.
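Here is a small sketch of how a mode can land outside the box.  The ages are made up: a spike of toddlers plus a wide, thin spread of older people.  I use the most common exact age as the mode, which is a simplification of reading the peak off a smoothed violin.

```python
import statistics

# Hypothetical ages: four two-year-olds, then one person every five years
# from age 10 through 85.
ages = [2, 2, 2, 2] + list(range(10, 90, 5))

q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
mode = statistics.mode(ages)

# The box only covers q1..q3, so a mode below q1 sits outside it.
mode_outside_box = not (q1 <= mode <= q3)
print(q1, median, q3, mode, mode_outside_box)
```

The spike holds fewer than a quarter of the observations, so the first quartile sits above it even though no other age is as common.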

Wednesday, January 20, 2016

Lottery, Bad Math, and Worse Understanding of Financial Instruments

Earlier this week, a certain meme making the rounds on Facebook drew a lot of criticism and quiet giggles from the math literate.  Here's the meme, see if you can pick out the error...

After quiet giggles, I realized that there were actually two errors in the graph:
  1. Math: The simple mathematical error here is that this would only give each person $4.33.
  2. Financial Instrument:  The meme shows a fundamental misunderstanding of the lottery as a financial instrument, one that is not capable of "creating money."


The math error here is pretty straightforward.  In fact, someone likely just used the wrong number of digits in one of the numbers when entering this in a calculator.  The error doesn't actually bother me that much; arithmetic happens.  In fact, there's a pretty good reason these types of errors occur: most people don't spend their days dividing numbers in the billions by numbers in the millions.

Truth is, math errors are easy, and in the field that I'm in, I see these types of errors all the time.  But what prevents these regular math errors from leading to larger societal breakdowns?  Context.  Analysts consider context when evaluating the results of their equations, and review for reasonableness.  In this case, if you know anything about how the lottery works, you know that the results from the meme are unreasonable. Which brings me to a larger point on the context of the lottery...
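Here is a quick sketch of that kind of reasonableness check.  The inputs are assumptions on my part: a $1.3 billion jackpot and 300 million people are not stated in the text above, but they are consistent with the $4.33 per person figure.

```python
import math

# Assumed inputs, chosen because they reproduce the $4.33 figure.
jackpot = 1_300_000_000
population = 300_000_000

per_person = jackpot / population
print(f"Each person gets about ${per_person:.2f}")

# The meme's claim, for comparison.
claimed_per_person = 4_330_000
ratio = claimed_per_person / per_person
print(f"The claim is off by roughly 10^{round(math.log10(ratio))}")

# Context check: how big would the pool need to be to pay the claim?
required_pool = claimed_per_person * population
print(f"Required pool: ${required_pool:,} vs. jackpot: ${jackpot:,}")
```

The last line is the context check in miniature: if the required pool is wildly larger than anything ticket sales could produce, the answer is wrong no matter what the calculator said.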


The reason I knew the meme was wrong immediately wasn't because I did the math; it was because of the way the lottery works.  Effectively, the lottery in the United States is a money-making (for states), gambling-based financial instrument.  Here's a simplified primer:
  1. People buy tickets totaling some initial revenue number paid to the lottery.
  2. The lottery splits that revenue amount among three categories: 
    1.  Cost/Overhead coverage (smallish %)
    2.  "Profit to the State(s)"
    3.  Money for prizes/jackpots
  3. After this step, the "game" is played, payouts are made.
  4. If no one won in step three, some games (like the Powerball) allow the jackpot to aggregate towards the next drawing, which is how we got to this 1.5 billion dollar jackpot.
  5. Eventually someone wins the jackpot, and the jackpot starts over from a lower number ($40 million in the case of Powerball)
There are a lot of other rules in the playing of a lottery like the Powerball.  All of them can be accounted for actuarially, and most of them don't significantly impact the results.
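The primer above can be sketched in a few lines.  The split percentages and sales figures are made up for illustration (the actual shares are not given here), and the sketch ignores smaller prizes and the minimum starting jackpot.

```python
from fractions import Fraction

# Hypothetical split of each dollar of ticket revenue.
OVERHEAD_SHARE = Fraction(10, 100)
STATE_SHARE = Fraction(40, 100)
PRIZE_SHARE = 1 - OVERHEAD_SHARE - STATE_SHARE

def run_drawings(sales_per_drawing, winning_drawing):
    """Roll the prize share forward until someone wins.

    Returns total ticket sales and the jackpot at the time of the win.
    """
    jackpot = Fraction(0)
    total_sales = Fraction(0)
    for i, sales in enumerate(sales_per_drawing):
        total_sales += sales
        jackpot += sales * PRIZE_SHARE  # only this share goes to prizes
        if i == winning_drawing:
            break
    return total_sales, jackpot

# Made-up sales for three drawings; the jackpot is hit on the third.
sales = [100_000_000, 150_000_000, 250_000_000]
total_sales, jackpot = run_drawings(sales, winning_drawing=2)
net_to_players = jackpot - total_sales
print(int(total_sales), int(jackpot), int(net_to_players))
```

Whatever shares you plug in, the jackpot can never exceed what was paid in, because ticket sales are the only input.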

The important part here is that there are no external inputs to the Powerball (meaning that no one -- government, private donor etc.-- is net infusing money outside of ticket sales) and the Powerball as financial instrument doesn't create money for ticket buyers (well, as an annuity, but that amount is just long-term investment returns, see below).  In aggregate it loses money for buyers, as the State takes away a share of ticket sales.

But how does this relate back to our initial meme?

  • For the lottery to be large enough that each American could take $4.3 million from it, each American (on average) would have to spend over $4.3 million on Powerball tickets since someone last claimed the jackpot.
  • There is something innately sad about people not realizing that in aggregate, the lottery is a net loss.


I should probably have ended this blog entry there, but there's one last *numbers* piece to lottery payouts that people should probably know.  That $1.5 billion advertised does not equal how much money you would receive if you won today.  In the end, you have two options in how to claim your lottery prize:
  • Lump sum.  You get the full amount all at once. 
  • Structured Annuity: The lottery keeps your money, invests it, and pays it to you in 30 annual payments.  Because of the investment over 30 years, you get an aggregate amount larger than the lump sum.
Here's the problem: That $1.5 billion jackpot number is the aggregate of the annuity payout (not adjusted for inflation, either), not the lump sum amount.  In essence the lottery quotes the larger of the two amounts, in a way that is probably misleading to most people who don't understand annuities.
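A rough sketch of why the two numbers differ, under simplifying assumptions: 30 equal annual payments with the first paid immediately, and a 3% annual return.  Both the level payment schedule and the rate are my assumptions for illustration; the real schedule and rates may differ.

```python
ADVERTISED = 1_500_000_000   # the aggregate annuity figure
YEARS = 30
ASSUMED_RATE = 0.03          # hypothetical annual investment return

# Assume level payments that add up to the advertised amount.
payment = ADVERTISED / YEARS
payments = [payment] * YEARS

# Money needed today to fund each future payment at the assumed rate.
lump_sum = sum(p / (1 + ASSUMED_RATE) ** year for year, p in enumerate(payments))

print(f"Advertised (sum of payments): ${sum(payments):,.0f}")
print(f"Implied lump sum today:       ${lump_sum:,.0f}")
```

The higher the assumed return, the smaller the lump sum needed to fund the same advertised total, which is why the headline number always looks better than the cash option.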


Obviously this post has deviated substantially from the normal content of this blog.  However, I think it makes a few key points:
  • In the age of calculators, arithmetic errors are usually merely data entry errors, which can be caught through analysis of the context of the data.
  • More concerning is that the error in our initial meme demonstrates a misunderstanding of the lottery as a financial instrument.  Not understanding the source and mechanisms behind the lottery can lead to irrational expectations of outcomes from playing it, especially for under-educated populations.
  • The lottery tends to overstate actual winnings by using aggregate annuitized winnings rather than the easier-to-understand lump sum payout.

Thursday, January 14, 2016

Brownback State of the State

On Tuesday night, the same night as Obama's last State of the Union Address, there was another little speech here in Kansas.  It was the Brownback State of the State Address, held at 5:30 PM.  I don't get home until about five, and quite honestly I'd rather spend my early evening with my two-and-a-half-year-old daughter.  Brownback's loss, I didn't watch.

Over the last couple of days though, I've seen quite a few commentaries online, critical of the speech, so I thought I would check it out.  Most interesting: an accusation that Brownback called school spending.... "immoral."

So I downloaded the speech, "datafied" it and went to analyzing.  My normal text mining methods are inappropriate because the speech is effectively one single document, so sentiment and topic modeling won't really work.  But I can do other stuff.  First a word cloud:

That's interesting, but it's not exactly surprising that the number one word associated with a Kansas State of the State address is the word Kansas.  Let's remove Kansas, Kansan, and State and see what we get.
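A word cloud is just word counts drawn with size, so the removal step can be sketched like this.  The sentence below is a placeholder I made up, not text from the speech, and the short stop-word list is also an assumption.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real ones are much longer.
STOP_WORDS = {"is", "are", "and", "the", "of"}

def word_counts(text, drop=()):
    """Count words, ignoring case, stop words, and any extra words in `drop`."""
    words = re.findall(r"[a-z']+", text.lower())
    excluded = STOP_WORDS | {w.lower() for w in drop}
    return Counter(w for w in words if w not in excluded)

# Placeholder text, not the actual speech.
speech = (
    "Kansas is working. Kansans are working people, "
    "and the state of Kansas is growing."
)
counts = word_counts(speech, drop=["Kansas", "Kansan", "Kansans", "State"])
print(counts.most_common(3))
```

Dropping the dominant words lets the next tier of terms take up the space in the cloud.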

 The top words here seem to be around "working" and "people" with less focus on the economy.  Also mentioned heavily are the President, the word "rural" and "welfare."

As I mentioned before, with a singular document, topic modeling isn't a valid method, but there are obviously topics covered in the speech.  I reviewed the speech, and found it broke down pretty easily into sections that could be coded.  Here's what the speech looks like in terms of time devoted to topics:

A good amount of the speech was dedicated to bragging about "program successes" (initial brag, welfare to work), with two other major topics being critiques of the President (terrorism, Obamacare).  Interestingly, the topic in Kansas politics that generally gets the most attention (education) received less attention from the governor (only 7.8%) than water sustainability.

This is all interesting in terms of where time was spent, but my original reason for looking into the speech was the "immoral" comment: was it real?  I found it in the speech text; here is specifically what was said:

Yet today, of the more than $4 billion the state puts into education funding, not nearly enough goes toward instruction. That’s highly inefficient, if not immoral, denying Kansans from putting their education dollars were they want it…behind a good teacher.

Saturday, January 9, 2016

ALA MidWinter-Day One Twitter Mining

As I've posted on this blog before, occasionally I tag along with my librarian wife to her professional conferences.  These conferences give me an opportunity to see my wife's professional world, as well as relax and unwind a bit from the normal stressful life of an analyst.  It's a lot of fun, and librarians are a surprisingly fun group to hang out with.

This week I am at the ALA Midwinter conference in Boston, and having a great time.  One interesting way to experience the conference is through the twitter hashtag #alamw16.  This morning I realized it might be fun to point my text mining algorithms towards tweets about the conference, to somewhat scientifically figure out what the chatter is about.

I will just jump right in.  For text mining this data, I downloaded tweets from Thursday night through Saturday morning using the conference hashtag.  First I created something easy: a wordcloud, which gives us an idea of the most common words used and their relative frequencies.  Unsurprisingly, the most frequently used word during the days that people were travelling was the location of the conference: Boston.  The words "librarian" and "book" also pop up high, as do some time- and place-based terms like "tomorrow" and "exhibit."

Then I moved on to a topic modelling methodology, which looks at the general topics people are talking about in the data.  

If you haven't read this blog before, here's a primer on how topic models work:  The algorithm looks at a set of documents (tweets) and finds terms that are often used in conjunction with each other, and from these words it can derive topics from the documents, which can otherwise be difficult to "observe."  I use a version of topic models called Correlated Topic Models, which involves an additional covariance calculation across the top of the model; in my experience this helps with short documents, such as tweets.
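To give a feel for the "terms used in conjunction with each other" idea, here is a tiny sketch that only counts which words appear together in the same tweet.  This is not a Correlated Topic Model, or any topic model at all; it is just the raw co-occurrence signal such models build on.  The tweets are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how many documents each pair of terms appears in together."""
    pairs = Counter()
    for doc in docs:
        # Use each distinct term once per document, in a fixed order,
        # so that (a, b) and (b, a) are counted as the same pair.
        terms = sorted(set(doc.lower().split()))
        pairs.update(combinations(terms, 2))
    return pairs

# Made-up tweets, not from the #alamw16 data.
tweets = [
    "visit our booth in boston",
    "stop by our booth today",
    "excited for boston",
    "so excited for the exhibit",
]
pairs = cooccurrence(tweets)
print(pairs.most_common(5))
```

Pairs such as ("booth", "our") show up together repeatedly while ("booth", "excited") never does, which is the kind of pattern a topic model turns into separate vendor and excitement topics.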

That was a longer explanation than I intended, and most people just want to look at the documents.  Here is a list of the topics and the most disproportionately represented terms in each topic:

What do these topics represent?  I dug into the data to describe the observed topics, and here's a summary of what people are really talking about on the #alamw16 hashtag:
  1. Topic 1: A topic of vendors asking attendees to stop by and see them while in Boston.
  2. Topic 2: A topic about the work of the ALA Executive Board, and generally about librarianship.
  3. Topic 3: A topic of attendees talking about their excitement towards the convention.
  4. Topic 4: A topic of attendees and vendors talking about how great Boston looks/is as a host for the convention.
  5. Topic 5: A topic of librarians and vendors talking about books they are excited about, as well as some other exciting new products.

For those that really like wordclouds, here's a wordcloud on each of the topics:

Topic 1 (this one rendered small, largely because of the importance of the word "booth"):

 Topic 2
 Topic 3
 Topic 4
 Topic 5