Wednesday, December 30, 2015

How We Injure Ourselves By Age

Yesterday I found this article that looks at wall punching by age, gender, and various other dimensions.  The article itself was interesting (wall punching seems to correlate with the rate of change in male testosterone), but what was more interesting was the underlying dataset, the NEISS.

The NEISS is a dataset of injuries reported by hospitals, with details on the injured person, the objects involved in the injury, and the nature of the act that caused it.  I downloaded the data and immediately found it interesting, and quite rich.

I had a few questions about the original article's wall-punching analysis, specifically: did women see an uptick in wall punching in their mid-teens, or was that trend limited to men?  I replicated the analysis below; while both men and women see an increase in wall punching in their mid-teens, the women's increase is less pronounced.

I also noticed a lot of people in the dataset were being injured by their toilets, largely through falling while on their toilet.  It would be amusing to create the same chart as above, but for toilet injuries.

That's fascinating: it appears to be an inversion of the other distribution.  This makes sense, though, as toilet injuries seem to be related to older people falling on the toilet.  But why more women than men?  Because there are simply more old women.  For a moment, though, let's ignore gender and plot both toilet and wall injuries on the same chart:

That's interesting because it shows that the injury curves are inversions of each other, but the number of injuries from toilets never quite reaches that from wall punching... or does it?  The problem is that there are far more people per year of age in their teens and twenties than in their seventies and eighties.  What if we control for the population age distribution, and turn this into an annual risk-rate analysis:
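That population adjustment is simple to sketch.  Here's a minimal Python version with made-up counts (the real inputs would be NEISS injury counts and Census population totals by age):

```python
import pandas as pd

# Hypothetical inputs: injury counts and population totals by age.
injuries = pd.DataFrame({"age": [20, 45, 80],
                         "wall_punch": [900, 150, 5],
                         "toilet": [30, 90, 400]})
population = pd.DataFrame({"age": [20, 45, 80],
                           "people": [4_300_000, 4_100_000, 1_200_000]})

# Join on age, then convert raw counts to an annual rate per 100,000 people.
rates = injuries.merge(population, on="age")
for col in ("wall_punch", "toilet"):
    rates[col + "_per_100k"] = rates[col] / rates["people"] * 100_000

print(rates[["age", "wall_punch_per_100k", "toilet_per_100k"]])
```

Dividing by the population at each age is what lets a few hundred toilet injuries among eighty-year-olds outweigh a few thousand wall punches among twenty-year-olds.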

This final chart shows two things.  First, the annual risk rate for people in their eighties and nineties due to toilets is far higher than the annual risk rate from younger people punching walls.  Second, there's a point, sometime in your forties, when your risk from falling on the toilet becomes higher than your risk from punching things... a true sign of maturity.

Update 2015-12-31 8:42AM

Someone disagreed with my conclusion that the difference between female and male toilet injuries was due to demographic lifespan issues (women live longer, so there are more old women).  The contention was that differences in how men and women use toilets cause the gap, and they appear to be correct.  While lifespan still accounts for about half of the gender variance in total toilet injuries among the elderly, women still have a higher overall risk rate.  Here's a chart controlling for gender population differences in older Americans.

Tuesday, December 29, 2015

Year End Summary and Good News

I haven't blogged as much as normal recently, but we'll blame that on the holidays and end of year, or on a bit of good news I'll get to in a bit.  But first, some commentary on this blog.


I started this blog in December 2014 as an outlet for creativity and analysis, and to get me out of my day-to-day life.  In those regards, the blog has been an outstanding success.  Unexpectedly, the blog has turned into a resume builder too.  In interviews it can be difficult for an analyst to demonstrate their true skillset; this blog gives me a portfolio of work.  In fact, I had a couple of interviews this year where I could use this blog as part of the interview process, and received positive feedback from managers.

That's great but boring, and everyone loves lists, so how about our final top five posts of 2015?
  1. Daraprim: Price increase or Leveraged Financial System - I talk about the actions of the person who many have called the "most hated man of 2015"
  2. My Data Science Toolkit - A list of my five favorite Data Science software tools.
  3. Peer Group Determination: Library Peer Groups  - A project I conducted with my wife to use a novel methodology to identify peer groups for public libraries. 
  4. Kansas Election Fraud - My first in what ended up being a seven part series criticizing the work of statisticians who claim that statistical anomalies point to a rigged voting system.
  5. Kansas Election Fraud: Part 6 Sedgwick County Suburbs - A followup to the post above, where I dig into why the initial anomalies exist using GIS mapping technology.

In sum, this blog started as a small project for me where I planned on posting 2-5 times per month, but quickly morphed into a 130+ post blog with just over 100 readers per day.  I look forward to where 2016 will take this blog.


Remember when I mentioned that I had used this blog in a couple of job interviews this year?  One of those interviews turned into an exciting new position for me.  I will be starting a new job across the state line in Missouri on January 4th.  I am very excited and looking forward to new challenges and opportunities.  My last day at my current job will be December 31st, making this a new year and a new beginning.

Tuesday, December 22, 2015

First Look: Mass Shooting Data

I've wanted to take a deep dive into mass shooting data for quite a while, but I didn't want to do it in the heat of the moment following another mass shooting.  Over the next few days I am going to dig deep into the mass shooting data we have available, what it means, and why the numbers differ between sources.


There are two main datasets with mass shooting data, the Mother Jones data and the Shooting Tracker data.  Here is a brief summary of each dataset:

  • Mother Jones Data: Mother Jones focuses on multiple-death, non-gang public mass shootings.  Essentially, the kind of thing we see on the news.
  • Shooting Tracker Data: Shooting tracker focuses on any event where multiple people are shot, a very basic definition of mass shootings.
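The gap between the two definitions can be made concrete.  Here's a toy sketch where a Shooting Tracker-style count only requires multiple people shot, while a Mother Jones-style count requires multiple deaths (the incidents and the exact thresholds here are illustrative, not taken from either dataset):

```python
# Toy incident list: (people shot, people killed) per incident.
incidents = [(4, 0), (5, 1), (6, 4), (3, 3), (9, 2), (4, 4)]

# Shooting Tracker-style definition: 4+ people shot, regardless of deaths.
shooting_tracker = [i for i in incidents if i[0] >= 4]

# Mother Jones-style definition: multiple deaths (illustrative 4+ killed cutoff).
mother_jones = [i for i in incidents if i[1] >= 4]

print(len(shooting_tracker), "vs", len(mother_jones))
```

Even on this tiny toy list the broader definition counts more than twice as many "mass shootings," which is exactly the shape of the disagreement between the real datasets.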


For this post I created an initial comparison of the data, just to get a sense of the differences between the sources.  The first issue is that the Shooting Tracker data only goes back three years, while Mother Jones goes back to 1982.  We can generally work around this, but any longitudinal analysis will have to be based on the Mother Jones data.

As one might expect, the Shooting Tracker data tells us the number of mass shootings in the United States is much higher than Mother Jones does.  In fact, Shooting Tracker tells us that we average one mass shooting a day, whereas Mother Jones tells us we average one a quarter:

An additional difference is when mass shootings occur: Mother Jones shows most shootings occur during the week, whereas Shooting Tracker shows shootings occur disproportionately on weekends.

There's even a disagreement on the seasonality of mass shootings.  Mother Jones's mass shootings are scattered fairly evenly throughout the year, whereas Shooting Tracker shows a strong summer bias.


This is just an initial first-look at mass shooting statistics, but it shows an important deviation in the way we talk about mass shootings.  I will dig into these datasets more in the next few days, attempting to understand the following:
  • Why are the datasets so different?
  • What would make a researcher choose one data set or another?
  • Which dataset is a more accurate presentation of "risk"?

Thursday, December 17, 2015

Martin Shkreli Dropped Some Hints

Today I woke up to what most of the world considered good news: "Pharma CEO" and expensive-music buyer Martin Shkreli had been arrested.  This post is going to involve a lot of embedded tweets, so let's get started.  I actually learned about the arrest from Ben Casselman's epic tweet:


It appears that the arrest is on charges related to prior securities fraud, not his current venture; what initially brought him press (the pricing of Daraprim) is not related to his current legal issues.  I have written about Shkreli's Daraprim pricing once before on this blog, specifically pointing out... well, this:
The CEO's rhetoric tells us that he is leveraging higher pricing against the financial-insurance system, and effectively betting on the ability to extract large mid-term profits from it.  The insurance system, as it exists,  enables this type of cost increase by giving *ordinary* people *extraordinary* ability to pay for effectively one-time services.
Actually, I pointed out three things in my blog:
  1. While it's fathomable that the drug was under-priced to the point of not being profitable, the price shock he used was likely exorbitant. 
  2. His claims of using pricing to create capital for future research is likely just "CEO BS."
  3. He's just leveraging against incentives in the current insurance system.


Shkreli had alluded to using the insurance system to leverage his profits, but had never come out and said it.  Yesterday, he did.  Starting with this nice-sounding tweet: no one ever pays more than $10 out of pocket!

And he continued to say nice-sounding things like this:
But the best tweet of the day from Shkreli (where he actually admitted to the concept of my prior blog post) was this one:

Effectively, Shkreli is outright saying: let me charge the insurance system huge prices, or I'll just give the drug away for free.  The full details of why this works relate to the diffuse impacts of the insurance system, this being a low-use drug, and what happens when you artificially give people more "ability to pay" for something.  I detailed all of that in my prior post.


This situation gives me mixed feelings.  I'm not really a supporter of single-payer health insurance, but Shkreli may be a good argument for it.  At least in this sense: as long as the current quasi-governmental insurance incentive system exists, "bad actors" like Shkreli will have incentives to push the system for personal profit.  Even if not every pharma CEO is like Shkreli, this will likely lead to many small price increases due to relative consumer price insensitivity.  Shkreli could be the initial call for a reformed healthcare system.

Friday, December 11, 2015

Creating Better GIS Maps in QGIS

Quite often on this blog I use GIS mapping, especially maps of Kansas, to make a point.  Generally when I am making a map I use a simple block colored map, like this one of counties:

That map is great, especially for Kansas policy wonks, or people who can identify areas of Kansas without greater context.  But what if we want more contextual data on our map, like roads, city names, and topographical features?  This is very easy in QGIS, using the following steps:

  1. Install the QGIS OpenLayers plugin using the Plugins Menu.
  2. From the Web menu, choose the OpenLayers plugin.
  3. Choose which layer you want to add to your map (I have best luck with the OpenStreetMap layer).
  4. Move the OpenStreetMap layer to be the bottom layer.
  5. If your top layer (what you are analyzing) is a polygon, it will likely cover your context layer.  Change this top layer to be partially transparent using this dialogue.  

And here's our output (same map as we were looking at before):

This gives even more context on close views, like this one:

And is even more helpful for close-up in-town views (like a precinct map of Northern Johnson County Kansas):

Wednesday, December 9, 2015

County Level Unemployment and Determining Cause from Correlation

(A bit of data/mapping ADHD, we're going to go through a lot of maps, very fast!)

Someone pointed out to me that the Kansas side of the KC metro area currently has its lowest unemployment rate in 15 years.  Though this is interesting and positive, it started me down a path that led to far too many maps, and a tie-in to some potentially spurious voting correlations.


If you have spent much time on the Kansas side of the Kansas City metro, you know it's a tale of two counties: white, affluent Johnson County and mixed, poor Wyandotte County.  Here's how the two counties compare on selected metrics:

There are huge differences between these adjacent counties, and I wondered how they played out in terms of unemployment rate (disclaimer: unemployment rate isn't a great metric for a variety of reasons, but it works in this scenario).  Luckily, the Bureau of Labor Statistics offers county-level unemployment statistics.  Here's a map of unemployment rates I created for northeast Kansas (2014, annualized):

This is close to what might be expected: Johnson County has the lowest rate in northeast Kansas, Wyandotte County has the highest, and the margin between them (3.1%) is striking.  But while I have this data on my desktop, why don't we look at the entire state?

Three things struck me about this map:
  • There's a huge urban-rural divide (not surprising, very rural, agricultural areas can have very low unemployment rates for multiple reasons).
  • Southeast Kansas is a rural area with a high unemployment rate.  This is also not surprising, as this has been a high-poverty rural area for the past few decades.
  • This looks a little like another map I made.  Specifically this one: 

What is that map?  A map of the percent of voters voting for Sam Brownback by county.  Interesting.  I wondered if there might be a correlation, and two charts showed that there is a fairly significant one.
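The correlation itself is a one-line computation.  Here's a sketch with made-up county-level numbers (the real inputs would be the BLS unemployment rates and the vote shares by county):

```python
import numpy as np

# Hypothetical county-level data: unemployment rate (%) and Brownback vote share (%).
unemployment = [2.5, 2.8, 3.1, 4.0, 4.8, 5.5, 6.2, 7.4]
vote_share = [68, 70, 65, 58, 55, 50, 47, 42]

# Pearson correlation between the two series.
r = np.corrcoef(unemployment, vote_share)[0, 1]
print(f"r = {r:.2f}")
```

In these toy numbers the correlation is strongly negative (lower unemployment, higher Brownback share), which is the pattern the charts show.  The strength of a correlation, of course, says nothing by itself about causation, which is the point of the rest of this post.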

Significant correlation! Low unemployment led to Brownback's win in Kansas! .. probably not..

People say "correlation is not causation" so often that it annoys me.  But this is a great case for explanation.  My thoughts:
  • A priori: One way people debug a correlation/causation question is by looking for an a priori or functional theory.  This roughly means that we can develop a reasonable theory for the causal mechanisms underlying the correlation.  In this case, we have a pretty clear (and compelling) a priori theory: people in low-unemployment counties view the economy as performing well and tend to prefer the status quo (incumbent candidates).
  • Covariates: A "gotcha" in correlation/causation questions is outside factors simultaneously causing or impacting both correlated variables (this is likely what is occurring here):
    • Other variables: The counties with low unemployment have other things in common (for example: agricultural, more conservative, whiter, more rural).  All of these things lead to supporting more conservative candidates independent of the county-level unemployment rate.
    • Pre-existing preferences: The counties with lower unemployment rates voted for Brownback in 2010, before he had any impact on those rates and before an incumbency bias would have been established.
There's another factor at hand, though, which *could* have an effect.  Areas with lower unemployment rates could be more conservative generally, due to rational benefits.  The argument here is that low unemployment leads to lower political demand for social services, which are generally considered a liberal policy.  In this case unemployment rates could be at play, but more broadly as a general indicator of well-being, not as a preference for an individual candidate.


Obviously this post has been a bit ADHD, but a few takeaways:
  • The Kansas side of the KC metro region may have historically low unemployment rates, but that is in no way homogeneous across the region.
  • There is a significant correlation between unemployment rate and propensity to vote for Sam Brownback.  Brownback won counties with < 3% unemployment with nearly 70% of the vote.
  • It's unlikely that the low unemployment rates in those counties are directly responsible for Brownback's support; rather, counties with lower demand for social services may have more conservative preferences.

And a few more maps: unemployment rates over time:



2010 (height of recession)

Tuesday, December 1, 2015

KU Losses and a Simulation Engine

My mid-season bet that KU would go winless this year turned out to be true, so I probably need to post about how right I was.  Don't worry though, this isn't a blog post of me simply gloating.  Well, the first part is, but I have built a helpful piece of code too, which we'll get to later.


So let's do a post mortem analysis of whether my predictions for KU Football were accurate.  What was the initial prediction I made?  

About a 50-50 chance of a winless season starting from the first week of the conference season (game four).

Some people might look at this prediction and say "50-50 means you don't know," which is kind of true.  I wasn't sure whether KU would go winless or not.  But in this case the absolute probability doesn't matter; the information gain does.  Information gain, in this case, means how much more we know about the probability of something over a "par" or average probability.

College football teams go winless about 0.2% of the time, making this a fairly rare event.  To be able to say that KU had a 50% chance of going winless means that they were effectively 250 times more likely to go winless than any random college football team, a huge gain in information from this prediction.
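The arithmetic behind that factor is just the ratio of the predicted probability to the base rate:

```python
# Base rate: roughly 0.2% of college football teams go winless in a season.
base_rate = 0.002

# My prediction for KU's season.
predicted = 0.50

# How many times more likely the prediction says KU is to go winless
# than a randomly chosen college football team.
lift = predicted / base_rate
print(lift)
```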

As an example, let's say that the daily probability that an elevator in your building would get stuck on a ride is 0.1%. However, I have modeled the performance of elevators in your building, and tell you that the elevator you're about to get on has a 50-50 chance of getting stuck.   I still "don't know" whether the elevator will get stuck,  but the model is actually quite useful because it provides a lot of information about this specific elevator ride over the normal 0.1% par probability. In essence, the 50% probability is not certain, but is still useful.

On the other hand, an event predicted at the 50% level should come true about 50% of the time, so how can we be sure that it wasn't actually more likely than I had originally predicted?  Without looking at my estimates over a series of seasons, there isn't a good way to determine the accuracy of the overall predictions.  Some cynics would have claimed prior to the season that KU would almost certainly (>90%?) go winless.  It's hard to falsify that statement now that the team has gone 0-12; however, there were a few games that were played close (SDSU, Texas Tech, TCU), and that tells me the team had a legitimate shot to win a game along the way.  Add this to the *probability* of when a team under a new coach will play well, and 50-50 still seems like it was a reasonable estimate (imagine if KU had played the way they did in the TCU game when they were playing Iowa State).



Now that I have that out of my system, how about some actual statistical work?  One piece of my toolset that I've had in SQL or Excel but never in R is a batch probability simulation engine.  The point of a simulation engine is to take a set of events with probabilities attached to them and simulate them thousands of times, to get a sense of how things might turn out together (e.g., likely outcomes for a season).  A concrete way of looking at this is like letting Madden play 1,000,000 seasons (computer versus computer), and then setting probabilities based on which team wins the Super Bowl most often.

To write a probability simulation engine you need a few general parts:
  • A set of input probabilities (e.g. a vector of probabilities of a team winning each of their games this season).
  • A matrix of random draws, with columns = number of events (games) and rows = number of simulations.
  • A win/loss classifier that compares the random draws to the set event probabilities.
  • A summarizer, to tally the total numbers of wins and losses and the season outcomes.
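The parts above can be sketched in a few lines.  This is a minimal Python/numpy version of the idea (my actual code is in R, and the per-game win probabilities here are purely illustrative, not my fitted numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_seasons(win_probs, n_sims=100_000):
    """Simulate n_sims seasons; return the win total of each simulated season."""
    p = np.asarray(win_probs)
    # Matrix of uniform random draws: rows = simulations, columns = games.
    draws = rng.random((n_sims, p.size))
    # A game counts as a win when its draw falls below that game's win probability.
    return (draws < p).sum(axis=1)

# Illustrative 12-game schedule: three winnable non-conference games,
# then nine conference games as a heavy underdog.
ku = [0.55, 0.30, 0.20] + [0.05] * 9
wins = simulate_seasons(ku)

print("P(winless):", (wins == 0).mean())
print("P(bowl eligible, wins >= 6):", (wins >= 6).mean())
```

The "par" power-conference comparison later in this post works the same way: just swap in a probability vector of three .80 non-conference games and nine .50 conference games.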

My code to do this is below.  There are some additional pieces you can add, for instance bootstrap modifiers that account for dependencies between events, and other modifiers to run many teams at once.  I'll work on those later.

How does this actually work?  I simulated KU's season 1 million times (it only takes about 2 seconds) and summarized the results.  Here's how the seasons played out in terms of number of wins (including some higher-probability wins, e.g. SDSU):

That's a bit depressing.  Even including the "easier" non-conference season, KU would go winless 37% of the time.  KU would become bowl eligible (wins >= 6) once in every 10,000 seasons.

Here’s a look at just the conference season.  Over 50% of the time, KU won zero games.

How bad was KU compared to a "par" team?  I made a par power-conference team which has a .80 probability to win each of three non-conference games, and a .50 probability to win each conference game.  Here's what that looks like.

And here’s the R code that got me here (this is really simple, but I will expand on it to handle multiple scenarios, and simulate full leagues).


Just a couple of points to cap this one off:
  • I was generally correct about KU's chances of winning a game this season (gloating).
  • It's fairly straightforward, after creating probabilities of winning each game, to simulate how teams may actually perform during the year.
  • With KU's current performance, they would go winless one out of every three years, and go to a bowl once every 10,000 seasons.