Thursday, June 4, 2015

Normalize Your Data!

Hey, a non-football related post!  

So occasionally I see someone trying to make an argument with data and get really annoyed.  Generally this annoyance comes from people using an inappropriate metric to make their point.  Two examples in the last week:

  • Someone arguing that the United States has an issue, because more people are murdered here than in other countries.
  • Someone arguing Kansas may have an issue, because we spend less than the national average on education.
Both arguments failed with me because they didn't properly normalize data.  I'll look at both of them in depth. 


So, my first annoyance was related to someone claiming that America has more murders than other countries.  That may be true, but the graph backing up the claim was bogus, because it looked at aggregate murders and all of the comparison countries were much smaller than the United States.  What drives aggregate murder counts?  Population.

To demonstrate how this kind of analysis can lead people to make massively false claims, I'll start with my own bogus claim and back it up with data (on a more fun subject, no less):

Americans have a drinking problem because they drink much more than Eastern European countries. 
And a graph to back it up:

Wow!  We drink almost three times as much as the Germans!  THE GERMANS!!!! 

(Insert picture of Oktoberfest here for effect).

But not really.  To analyze how individuals are impacted by aggregate numbers you have to normalize for population.  It's an easy calculation, just dividing the aggregate total by the population.
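The division described above can be sketched in a few lines of Python. The country names and figures here are made up purely for illustration (they are not the numbers behind my charts), and `per_capita` is just a helper name I picked:

```python
# Hypothetical, made-up figures for illustration only (not real statistics).
countries = {
    "Bigland": {"total_liters": 9_000_000_000, "population": 300_000_000},
    "Smallland": {"total_liters": 4_000_000_000, "population": 40_000_000},
}


def per_capita(total, population):
    """Normalize an aggregate total by the population that produced it."""
    return total / population


# Per-person consumption for each (hypothetical) country.
rates = {
    name: per_capita(d["total_liters"], d["population"])
    for name, d in countries.items()
}

# Sort from highest to lowest per-capita consumption and print.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1f} liters per person")
```

In this toy data, Bigland drinks more than twice as much in aggregate, but Smallland drinks more than three times as much per person. The ordering flips once population is divided out.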

Here's a more accurate view of per capita beer consumption by country, comparing the United States to some other high consumers of beer.  

I have no idea what is up with the Czech Republic; I'm guessing they really like their Pilsner Urquell.

This may all seem trivial, but real decisions are made on these types of numbers, and if policy makers are led to believe that the first chart is accurate, then policy decisions are made to combat a problem that doesn't exist.  It could be a big deal, and my next analysis demonstrates a more likely scenario.


Late last night a tweet from a journalist popped up on my feed.  Here it is:

I had two thoughts on this:

  • Does "turn it inside out" mean that he thinks the Kansas Legislature will try to lie with the facts?
  • Or is he asking people in the know to look at the numbers and see what they can?
Either way, I thought I would dig into the data.  I found two problems with comparing states to a national mean (the average across States):
  1. The data was not normalized for factors that impact the cost of doing business in a State, specifically, cost of living.
  2. Because of that failure to normalize this data and other distributional aspects, the data was likely skewed in a way that would drive up the mean.

As a result, I needed to make a couple of adjustments to the analysis: first, compensating for distributional skew by looking at each state's rank among the other states rather than its distance from the mean; then, adjusting for cost of living (as an imperfect proxy for cost differentials). 

First, the median chart: it shows that Kansas is 24th out of 51 states (DC counts).  Kansas is effectively a median State.  Also, if you look at the shape of the distribution in this chart, you can see that it is highly skewed.

Cost of living isn't a perfect measure to normalize for cost differentials, but because the bulk of school costs are related to paying salaries, it works for this purpose.  So I normalized using a cost of living index, which shows that Kansas moves up one place to 25th.  Obviously this is not a significant change in result, but if you look at other States, they move around significantly.  Conclusion: Kansas is about in the middle, spending-wise. 
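For anyone who wants to try this at home, here is a rough sketch of a cost-of-living adjustment followed by a re-ranking. The state labels and numbers are invented for illustration, and I'm assuming an index where 100 is the national average and a ranking where 1 is the highest spender; your index and ranking convention may differ.

```python
# Hypothetical data: (per-pupil spending in dollars, cost-of-living index).
# Assumption: the index is scaled so that 100 = national average.
states = {
    "A": (12000, 120.0),
    "B": (10000, 90.0),
    "C": (9000, 100.0),
}


def adjust(spending, col_index):
    """Express spending in 'national average' dollars."""
    return spending * 100 / col_index


def rank(values):
    """Rank names by value, 1 = highest (an assumed convention)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


raw = {name: spending for name, (spending, _) in states.items()}
adjusted = {name: adjust(spending, col) for name, (spending, col) in states.items()}

raw_rank = rank(raw)
adj_rank = rank(adjusted)

for name in states:
    print(f"{name}: raw rank {raw_rank[name]}, adjusted rank {adj_rank[name]}")
```

In this toy example, the expensive state A drops behind the cheaper state B once spending is expressed in comparable dollars, while C stays put.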

A little unexpected that Wyoming moves up to be a top spending state, but if I had to guess, I would think it's because of a relatively low cost of living (after normalization) and poor economies of scale.


Normalization matters, because it allows us to control for the big factors that impact numbers, like cost of living and population differences.

Nerdy Conclusion:  Some interesting nerdiness from our second analysis.  First, the original distribution is skewed as expected, with a skew ratio of .94 and a standard deviation of $3207.  The correlation between cost of living and school spending is .66, which is huge, obviously (potentially endogenous because good schools cost more, but that's a topic for a different day).  The normalized distribution reduces skew to .31, and reduces standard deviation to about $2100.
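If you want to compute these kinds of summary statistics yourself, here is a minimal sketch using only the Python standard library. The data is invented, and I'm assuming the plain moment-based skewness coefficient and population standard deviation, which may not match the exact definitions behind the numbers above:

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pop_std(xs):
    """Population standard deviation (divides by n)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def skewness(xs):
    """Moment-based skewness: average cubed z-score."""
    m, s = mean(xs), pop_std(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-pupil spending and cost-of-living index (100 = average).
spending = [8000, 9000, 9500, 10000, 18000]
col_index = [90, 95, 100, 105, 140]

# Spending expressed in cost-of-living-adjusted dollars.
adjusted = [s * 100 / c for s, c in zip(spending, col_index)]

print("skew (raw):", round(skewness(spending), 2))
print("std (raw):", round(pop_std(spending)))
print("correlation:", round(pearson(spending, col_index), 2))
print("std (adjusted):", round(pop_std(adjusted)))
```

One high-spending, high-cost state is enough to pull the mean well above the median in this toy data, which is the same reason I compared ranks instead of distance from the mean.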

**Quick side note:  This post is intended only to speak about the issue of normalization, not the NORMATIVE issue of whether Americans should drink more beer, or Kansas should spend more or less on education.  A related post tackles the also non-normative question of whether spending matters.
