Sunday, June 28, 2015

Exponential Growth!!! Maybe not...

It's much easier to sell a business plan based on exponential growth than one based on diminishing returns.

A few years ago, I asked a junior team member for an update on a product he was working on.  He was the assigned analytics resource on the product, and I was curious how the new product was proceeding.  He laid out the way the product would work, followed by the general business plan.  He capped his statement off with "which would lead to exponential growth...," a claim he had been handed by the product manager.

The last statement got my attention.  So I asked a couple of questions such as "so, how exactly would that lead to exponential business growth?" and "has anyone modeled out how this functionally leads to exponential growth?"  The answer was no.


The reason I am posting on this is something I see far too often.  Marketing and product managers often tell executives that a product will take a while to get off the ground, but will soon see exponential growth.  When these claims don't come true, analysts are left to blame things like lack of consumer buy-in, or being late to market.  In reality, the failure goes back to the original growth projection.


So, to give an example of how people get their exponential growth claims wrong, let's look at my example from earlier.  Here's an explanation of how I saw this problem:
  • It was a two-sided stochastic process where the business acquired customers of type A and type B.  
  • Each customer had a subset of goods A{} and B{}, and each of the goods in A{xyz} had to find an appropriate match B{xyz} for our company to make money.  
  • The business plan was largely centered on increasing the number of type B customers on the site, thus increasing the chances of finding matches for each of the type A customers.  
  • The distribution of matches was loosely correlated between A and B, however the matches were sparse with a long tail, so the probability of any one B{} matching an individual A{} is less than 1%.
  • Each "good" leaves after it is matched.  So duplicate B{} matches to each A{} do not increase profits.
This sounds a lot like a dating site, doesn't it?  It's not, but it behaves like one.

This clearly wasn't the junior analyst's fault, so I took my issue directly to the source:  the product manager.  When I pressed the product manager who made the exponential growth claim, here's his explanation of how growth would occur:

  1. We'll go get more type B's to enter data on the site.
  2. We'll see the NETWORK EFFECT.
  3. Network effect leads to exponential growth.
So this is effectively the same business model as the South Park gnomes, except with the network effect substituted for the ???.  But does it work? Not in the way assumed here.


The reason this isn't an exponential growth problem is simple: this is a dubious invocation of the network effect.  Rather than a true network where each individual potentially interacts on an ongoing basis with every other user (think: Facebook), the individuals here are divided into two teams, can only connect to the other team, and are removed after connecting once.  Simply speaking:

Because duplicate matches don't create additional "growth," each incremental addition to B has a diminished potential of positively impacting the business.
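To make that concrete, here's a quick sketch of the process (hypothetical numbers; the only thing carried over from above is that any one B{} matches a given A{} with probability under 1%):

```python
# Sketch of the matching process described above (hypothetical numbers).
# p is the <1% chance that any one B{} good matches a given A{} good.
def expected_match_rate(n_b_goods, p=0.005):
    """Chance a given A{} good has found at least one match
    after n_b_goods candidates have entered the site."""
    return 1 - (1 - p) ** n_b_goods

# Marginal value of each additional batch of 100 B goods:
gains = [expected_match_rate(n + 100) - expected_match_rate(n)
         for n in range(0, 1000, 100)]

# Each successive gain is smaller than the last: diminishing returns,
# not exponential growth.
assert all(gains[i] > gains[i + 1] for i in range(len(gains) - 1))
```

Because duplicate matches are worthless, every new B good competes with the ones already there, and the curve flattens instead of compounding.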

So, in reality, the growth is a diminishing returns game.  Which looks like this:

Rather than the expected exponential growth, which looks like this:


This post is a simple warning:  there are a lot of dubious claims of exponential growth out there.  Always try to model out the underlying processes with realistic probabilities, before buying in, or before giving the analytics "go-ahead" to new projects. BTW, last update I have on the project stated above is that it was cancelled after not achieving desired growth results.

Saturday, June 27, 2015

Anatomy of an Analysis: How your Toolkit comes together.

This is a follow-up to my original post on my Data Science Toolkit, which received a hugely positive response.  One of the questions I get from young analysts is "what tool should I use for this task?"  Generally when they ask this, they are weighing a couple of software products, both capable of completing the task, but one with a distinct advantage. So, the advice I give goes something like this:
Use whatever tool will get you to an accurate answer fastest, while putting you in the best position to follow up with the data in the future. 
Rules like this are nice, but I think it's more interesting to show process.  Conveniently, my recent post on people driving to avoid potential sales tax involved every tool from my toolkit post.  So, let's walk through my process as a demonstration for young analysts.


  1. I read the initial post by another analyst on Twitter. I was initially annoyed, but also curious.  I used Google Maps to calculate the distance to Missouri from my home, and then Excel for some "back of the napkin" calculations.  I realized, hey, I have something here; maybe I should write this one up? 
  2.  I needed a framework for analysis, what tool is good for "what if" scenarios?  Excel.  So I ran some final numbers in Excel, based on miles from the border, driving costs as well as time costs, and then created a chart of the cost curves.
  3. After the framework was developed, I knew I needed to apply it, and this certainly had a geographic aspect.  I acquired a couple of Johnson County specific shapefiles (Census Blocks; Major Highways) and imported them to QGIS.
  4. From my framework analysis I knew my mileage cutoffs (4,6,9), but I needed a geographically defined way to implement this. For this, I used some custom R functions I developed previously that are implementations of Haversine's method, and allow for a backsolve (put in mileages, get coordinates).  
  5. Next I needed to code my shapefiles by the distance breaks.  For this, I used the Python engine internal to QGIS to create the maps and color-code them for data viz effect. I also used this Python script to summarize census block data by distance to border, so that I could say things like "70% of people in Joco live within nine miles of the border." (I used Notepad++ for this code, and all the other code created for this project.)
  6. After completing my analysis, I wanted to quickly validate the data.  Luckily I have a Haversine backsolve for SQL, which allowed me to validate this analysis.  Also: GIS shapefiles have an internal *.dbf file that holds the attribute table, which holds the analytical information you need (lat/long, population, demographics, etc) and can be easily imported to SQL.  I validated, and everything checked out.
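My actual backsolve functions live in R, but the idea translates directly.  Here's a minimal Python sketch of the haversine distance and its destination-point "backsolve" (the 38.9/-94.6 coordinates are just an illustrative Johnson County-ish point, not the ones from my analysis):

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

def backsolve(lat, lon, bearing_deg, miles):
    """Put in a mileage and bearing, get coordinates back
    (the spherical destination-point formula)."""
    d = miles / EARTH_RADIUS_MI  # angular distance
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    phi2 = math.asin(math.sin(phi1) * math.cos(d)
                     + math.cos(phi1) * math.sin(d) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(d) * math.cos(phi1),
                             math.cos(d) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)

# Round-trip check: a point 6 miles due west of an illustrative start
lat2, lon2 = backsolve(38.9, -94.6, 270.0, 6.0)
```

Round-tripping is the easy validation: backsolve 6 miles due west, feed the result back into haversine_miles, and you should get 6 miles back.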

Monday, June 22, 2015

Fitness Week #1: Distribution and Descriptives

I've been promising for a few weeks an update on fitness tracking results, then not delivering.  I'm going to fix that today.  In fact, I'm going to take this entire week to fix that, starting my first ever "Fitness Week" - which is actually just me taking a very nerdy look at numbers.

Here are the posts I'm laying out for this week:

Today: This post.  Just a general update on what my fitness tracker has been tracking.
Tuesday: On "targeting" and its distributional impacts.
Wednesday: Aggregate fatigue, what happened to me three weeks ago.
Thursday: Product review, of my Fitness Tracker (Garmin Fit 2)
Friday: My updated model of fitness data.


Let's start today with some easy data (it is Monday after all).  How active am I per my fitness tracker?  Here's the summary of my steps per day data:

I know, ugly and pulled straight from Excel, but it shows what we need.  Just some observations on steps:
  • From some research, the average American takes only 5,900 steps a day (seriously?), so I am at least three times the average.
  • My original goal was 20,000 step average, so, goal achieved!
  • The mean is above the median, consistent with our skew number.  Also (for non-stats people) this basically means that though I average over 20,000 steps a day, I don't get 20,000 steps on my typical day.  (Fairly important for goal setting.)
  • My sleep numbers look less skewed, and the average is in a good range: 7-8 hours of sleep a night.
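For the non-stats readers, the mean-above-median point is easy to see with made-up numbers (these are illustrative, not my actual tracker data):

```python
import statistics

# Illustrative step counts: five ordinary weekdays plus two big weekend days
steps = [17_500, 18_200, 19_000, 16_800, 21_000, 34_000, 31_000]

mean = statistics.mean(steps)      # 22,500 -- above 20,000
median = statistics.median(steps)  # 19,000 -- the "typical" day

# A few big days pull the mean up past the typical day: right skew.
assert mean > median
```

That's exactly the goal-setting trap: this hypothetical average tops 20,000 even though most individual days don't.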


But what does this actually look like? First, let's look at steps. Here's a cumulative distribution graph that I like to put together when analyzing relative frequencies.  

The distribution is obviously not normal, but in my mind it's not that unexpected.

  • 65% of the time, I get between 16,000 and 22,000 steps.  Further analysis shows I should consider this my "average weekday."
  • 10% of the time, I get between 10,000 and 16,000 steps.  These appear to be low outlier weekdays, generally when I'm especially tired or busy.
  • 25% of the time, I get over 22,000 steps.  These are generally weekend days.
Now let's look at sleep:
  • My sleep patterns are fairly normally distributed, which is amazing, because (even as a statistician) I RARELY see anything with this normal of a distribution.  
  • The variance is a bit disturbing.  Sure, most nights I'm between 6.5 and 8 hours, but there are still quite a few nights where I'm below 6 or above 8.  I'm going to review the literature on sleep to see what's going on there.  


So, I'm meeting my goals on steps which makes me very happy. My sleeping hours are normally distributed, but I'm not convinced this is evidence that I am sleeping "correctly."  I'll spend the rest of the week modeling and diving into these numbers more deeply.  

Wednesday, June 17, 2015

The Briefcase: A Weird Reality Show

I was wrong in a tweet.  No really, I was wrong on the internet, AND I am admitting it.  Here's how it all started.

A few weeks ago my wife sent a link to an article about a new show, "The Briefcase."  I can't find the article she sent me, but here's a link to a similar article.

A quick summation: this show gives poor people money, and they get to make some decisions about helping people in need or keeping the money.  It's a fascinating concept, but the press dubbed it "poverty porn" and claimed that it exploits poverty.


So a few nights ago, I watched the show, and had a bit of a different reaction than the press.  I've talked on this blog about being relatively well off but also very cheap.  As such, I drive a 2010 Honda Fit because it's cheap, efficient, and does what I need it to do.  I saw that the people on the show were driving nicer vehicles than me... which led to this tweet:

I thought I had expressed myself fairly succinctly, in a grumpy, cheap, old man kind of way.  Then the producer of the show responded:

So, I thought about this for a couple of hours and realized some things:

  1. He's right.
  2. The show ISN'T poverty porn in a general sense.  This is distinctly something else.
  3. The media got this one really wrong, which indicates a lack of understanding of poverty.


So, some background.  You could read this elsewhere, but here's the quick summary:
  • A "poor" family is given a briefcase with $100K.
  • They are also given information about another "poor" family; this information increases throughout the show.
  • They have to choose how much money to keep for themselves and how much to give to the other family.
  • The family is completely blind to this, but the other family is playing the exact same game, looking back at the original family.
So, what does the output of this game look like?  The players believe their "win" will simply be the percent of the $100K that they choose to keep, represented by this simple graph:

In reality, their "win" will be what they keep, plus what the other family gives them (this is the blind part of the game that they are unaware of).  It's represented by a more complex graph (percent the family chooses to keep on the bottom, percent the other family chooses to give at right):

So what does the output of this game give us? It's actually fairly simple and doesn't require fancy charts or equilibrium calculations.  

Some people may assume that this is a prisoner's dilemma problem, as it is a blind, two-player game.  But it's not.  At all.  The reason?  There is no interdependence of results.  I will get the amount I choose to keep, no matter what the other family decides to do.  Thus the rational solution is to keep all the money.  But what about the other couple?

Two main factors deter players from keeping the money for themselves, thus undermining rational action:
  • Guilt over the plight of the other family (they need it worse).
  • Fear of the optics of keeping all the money on public television (looking bad).
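In pure-money terms, the payoff structure above is easy to sketch, and it shows why keeping everything is the dominant choice no matter what the other family does:

```python
def payoff(my_keep, other_give, pot=100_000):
    """What a family walks away with: the share of their own briefcase
    they keep, plus whatever share the other family gives them."""
    return my_keep * pot + other_give * pot

# My payoff rises with my_keep regardless of the other family's choice,
# so there is no interdependence of results -- not a prisoner's dilemma.
for other_give in (0.0, 0.5, 1.0):
    assert payoff(1.0, other_give) > payoff(0.5, other_give)
```

Guilt and optics are the only forces pushing against that dominant strategy, which is what makes the show interesting to watch.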


So, back to the episode I watched, and why my initial tweet was wrong.  It's fairly simple:

The people featured on this show aren't poor, and are generally better off than the median American.  On the show I watched, the two families made $60K and $70K, compared to the $51,900 median American household income. (source: 538)
The families both had a lot of debt, but that debt seemed largely due to poor decision making, which means this is a story about highly leveraged middle class Americans.  This is IN FACT a story about the new American middle class.  Which leads us to our next point...


Time to coin a phrase: DECISION PORN.  Effectively, this show creates a spectacle of already over-leveraged middle Americans trying to make decisions to benefit their financial situations, weighed against those of another family (or just the desire not to look selfish on TV).  And it is somewhat entertaining.

Am I writing off the plight of these families?  Somewhat, but keep in mind they are generally better off than average Americans.

The part of the show that struck me most as an analogy to middle-class Americans: upon seeing a potential financial windfall, the initial reaction of both families is to put themselves into a MORE leveraged position.

  • Family one: Wants to buy a boat.  Granted it's to start a new business, but it would still cost more than they would receive from the show, and it's a hugely speculative move.
  • Family two: Wants to adopt a foreign child.  Certainly sounds like a nice thing to do, but also is a huge immediate cost, and creates a high ongoing cost of raising a child.

Now for the big question, will I be watching the show in the future?  Probably not.  While the concept is fascinating, I get really uncomfortable watching other people make bad decisions, especially regarding large sums of money.  I think that watching this show would actually give me a good amount of anxiety.

Tuesday, June 16, 2015

Tuesday Jams: Dog Fashion Disco

Some days at work, I need to do some high energy coding, and I need to be a little ADD about it, and a whole lot of creative.  Enter another one of my favorite bands.

Dog Fashion Disco is a band that I liked for quite a while. Then they broke up.  Some members formed a more electronic, but still pretty cool, band called Polkadot Cadaver, which released a couple of albums with great names (Purgatory Dance Party and Last Call in Jonestown).  But now Dog Fashion Disco is back together, with a new album, which seems to be pretty good.

Some examples of their music.

First, a favorite of mine (over scenes from a great TV show):
And for our older readers, a cover of an older song:
And from the new album (moderately NSFW):

Monday, June 15, 2015

Up this week: No.. Really

At the beginning of last week I created this optimistic post about upcoming posts.  I honestly thought I would be creating these two posts in the upcoming week.  The results?

Failure.  Well, I don't consider it a failure actually, but I promised two blog posts that never came to fruition, a music one, and one on fitness tracking data.  Instead of posting on those relatively mundane topics, I created three more timely (and controversial) posts on distributional impacts of tax policy in the state I live (Kansas).  Post 1 - Post 2 - Post 3  

I also realized I had an email that should probably become a blog entry, which became this entry on using R in production.  A good post? Yes, but not what I promised readers.

I think it was the right call, but this is just a promise to readers that I will be posting the two promised posts from last week, as well as at least one additional post.  Anything you'd like to see me post on?  Let me know at:

Saturday, June 13, 2015

R in production diaries: sqlsave slowness

A few weeks ago, I received a text from a former colleague, here's what it read:

Do you do any bulk writes from R to a db?  Sqlsave is slow.  I'm considering writing a file and picking it up with another tool.

I knew exactly what he was talking about.  sqlSave() is a function in the RODBC package for writing from R to SQL databases.  I also knew that he was likely refactoring my code, or code that he had partially copied from me at one time.


Fortunately, a couple of years ago I had migrated off of sqlSave() to another solution.  Here was my response to my colleague, in email:

Easier to type on here.
We don't have to do any bulk saves in production, everything there is a single transaction, so we're saving back 6-7 records at any time transactionally (times 100 tx's per minute, yes it's a lot at a time but not a lot at any one execution).

As for SQLSAVE. I got fed up with the command and no longer use it.  We have an insert stored procedure that we call from R.  Much faster.  Much more efficient and easier.

We basically call an ODBC execute command wrapper, then just execute the proc with parameters to insert our data.

I don't know how this would work from your point of view, but you could (sans proc), in theory, create a text string in R that is essentially the bulk insert you want to do, and execute it AS SQL.  Which I think is the key... send the command, and let the insert be handled by the ODBC driver as a native ODBC SQL command, not as a hybrid weirdo SQLSAVE.


  • sqlSave() runs slow in production jobs.  I'm not sure why, but I would guess it falls into a general class of problems I call "meta-data shenanigans."
  • Using SQL directly to write back to the database in some way is generally a faster solution.
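For readers who want the flavor of the "build the SQL yourself" approach without a stored procedure, here's a rough Python sketch (the table and column names are made up; in R you'd build the same string with paste() and hand it to sqlQuery()):

```python
def build_bulk_insert(table, columns, rows):
    """Build one multi-row INSERT as a plain string, so the driver runs
    it as a single native SQL command instead of row-by-row saves.
    (Sketch only: for untrusted input, use parameterized queries.)"""
    def fmt(v):
        if v is None:
            return "NULL"
        if isinstance(v, str):
            return "'%s'" % v.replace("'", "''")  # escape single quotes
        return str(v)
    values = ", ".join("(%s)" % ", ".join(fmt(v) for v in row) for row in rows)
    return "INSERT INTO %s (%s) VALUES %s" % (table, ", ".join(columns), values)

sql = build_bulk_insert("scores", ("id", "score"), [(1, 0.97), (2, 0.85)])
# sql: INSERT INTO scores (id, score) VALUES (1, 0.97), (2, 0.85)
```

One statement, one round trip, and the database does what databases are good at.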
I received a thank-you text yesterday from my former colleague: he refactored, and his job runs faster.  All is right in the world again.

Friday, June 12, 2015

Tax Analysis Part 3: House Tax Plan June 12th

This morning my Twitter/Facebook feeds were littered with friends very upset about the Kansas legislature.  Again.  Essentially the story goes like this:
  • The Kansas House passed what is described as the "largest tax increase" in Kansas history.
  • It was passed at 4am, after putting massive pressure on a few legislators to change their votes.
  • It is mainly a change to sales tax, which is described as making Kansas one of the most regressive states in the country.
  • It was passed after threats by the governor to massively cut university budgets, including university athletics.
That's the story people are telling at least.  But this isn't a political blog, this is a numbers blog.  What will the new tax plan do to average citizens?


It's been a long week, and I'm tired, so not many words here.  If you want a methodological description look at this prior post.  If you want my disclaimers, and additional analysis of food sales taxes, look at this other prior post.

If you're not a tax nerd, skip to results.

Here are my general assumptions, given my reading of last night's bill:
  • Sales tax goes from 6.15% to 6.5% (slightly less of an increase than prior proposal)
  • No reduction of food sales tax (food taxed at 6.5%)
  • No significant change to most people's income tax rates.
These are my "new" general assumptions, and my prior posts contain other assumptions and disclaimers. Once again, let me know if you find any of these to be in error, or would like me to run the numbers under a different set of assumptions.


Once again, not a lot of words this morning, but of all the proposals, this is the most regressive option.  Major factors: food sales taxed at full rate and no change to highest bracket income taxes.

Here is an updated chart of the tax change.  The gray line indicates the shift in tax burden (measured by increased tax bill) by income level:

And my ever growing matrix:


Once again, the most regressive option was passed, especially with the food sales tax reduction removed. Whether this plan benefits you (err, is better for you than the other options) largely depends on your income level. And now we wait for the Kansas Senate.

Wednesday, June 10, 2015

Income Versus Sales Tax Pt. 2: Exempted Food

This is a followup to yesterday's post analyzing the different impact between raising income tax and raising sales tax.  There's a proposal to (in FY 2017) tax grocery purchases at a different rate than other sales-taxed items. 


If you're not a tax wonk, or nerd, skip to the results. 

For long methodology and background, see yesterday's post.  This post focuses on the current plan to charge a lower (4.95%) sales tax on food, beginning in Fiscal Year 2017.  A few methodological notes:

  • Regular sales tax would stay at the 6.55% rate from yesterday.  
  • Food sales tax would fall to 4.95%.
  • The biggest estimate I had to make in this analysis is the portion of taxable purchases people in different income groups make on food.  I used USDA data, which shows (unsurprisingly) that households in the lowest income groups spend the highest percent of their income on food (~40% in the lowest quintile).  
  • I used the USDA data to model estimates of the portion of purchases that would be taxed at the lower food rate, by income group.
  • Once again, if anyone has any questions, or would like me to re-run numbers under different assumptions, I'm happy to do that.  However, I believe this is an honest attempt to accurately represent the impacts of the tax change.


FIRST: Because food sales taxes make up a larger portion of low income family expenditures, this is a massive move to a more "progressive" tax system.  First, compare yesterday's sales tax change (all purchases from 6.15% to 6.55%), where the lowest income groups saw a 5%-plus tax bill increase:

To today's (general purchases from 6.15 to 6.55, food from 6.15 to 4.95), where the lowest income groups see a tax cut:

This shows that lowering the food sales tax significantly reduces the additional tax on lower income groups, and is in fact about a 4% cut to their overall tax burden.

SECOND:  Why does this have such a massive impact on effective sales tax rates?  Simple: the downward change to the food sales tax is three times the magnitude of the upward change to the general sales tax.  This has an interesting consequence: if more than a quarter of your sales-taxable purchases are food, this is a net tax cut.
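That 25% breakeven falls straight out of the two rate changes:

```python
# General sales tax rises 0.40 points (6.15 -> 6.55);
# food sales tax falls 1.20 points (6.15 -> 4.95).
general_change = 6.55 - 6.15
food_change = 4.95 - 6.15

def net_rate_change(food_share):
    """Net change in effective sales tax rate, given the share of
    sales-taxable purchases that are food."""
    return (1 - food_share) * general_change + food_share * food_change

# The downward food change is 3x the upward general change, so the
# breakeven share is 0.40 / (0.40 + 1.20) = 25%.
assert abs(net_rate_change(0.25)) < 1e-9  # breakeven
assert net_rate_change(0.40) < 0          # heavier food spending: net cut
assert net_rate_change(0.10) > 0          # lighter food spending: net increase
```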

And once again for you nerds who like charts:


Two easy things here.  First, moving the food sales tax lower is a big move towards a progressively oriented sales tax.  In fact, even with the general rate rising from 6.15 to 6.55, reducing the food sales tax makes this a net tax cut on low income families.

  • There are a lot of things in the current bill, and at this moment (9pm, June 10th) no one knows how it will end up.  Some provisions could impact this analysis, including ending certain sales tax exemptions and ending the low income food sales tax credit.
  • I'm really cheap.  Ask my wife.  Will not spend money.  Likes to save.  As such, I have probably erred on the conservative side with spending assumptions.  I don't think this materially or directionally impacts my analysis, but the numbers will vary with household spending habits.  In short: the more you spend on non-food items, the more this costs you.  That 25% number is key.
  • You may notice that after accounting for food tax, only higher income households see a tax increase, and that increase is moderate.  Two things: this is generally in line with revenue estimates I have seen, and this analysis is specific to four-person families living under certain conditions.  Households with fewer people and lower percent food expenditures will see more of a tax increase.

Tuesday, June 9, 2015

Impact of Sales Versus Income Tax: Some Numbers

If you follow the Kansas legislature like I have recently, you've seen many debates on the correct way to increase taxes.  The debate arose from a $400 million budget hole, leading to the longest legislative session in state history as lawmakers sought a solution.

The tax policy debate centers on which mechanism to use to raise taxes: Sales or Income taxes. Generally the sides are:
  • Liberals: Because sales taxes are regressive and harm those who can afford it least, we should raise the top income rate.
  • Conservatives: Because consumption based taxes are economically superior to income taxes, we should raise the sales tax.
So, how would these tax changes actually impact families at different income levels?


For this methodology I am making estimates of how a sales tax versus income tax impacts people in different income brackets in Kansas.  A few assumptions need to be made:
  • For sales tax, I'm using the assumption of a rate raise from 6.15% to 6.55%.
  • For income tax, I'm assuming a change in the top rate from the current 4.8% back to the 2012 level of 6.25%.
  • I used a four person family, married filing jointly, as our example for all cases.
  • There are additional downstream economic impacts of any tax policy; these are real and largely occur over time, but they do not significantly impact this analysis.  I've largely ignored these effects, including reductions in consumption, business spending, and multiplier effects.
  • These tax changes don't produce the same revenue for the State, but they have been offered as competing ideas, so I consider the comparison largely valid.
  • Sales tax is regressive largely because high income families spend a smaller % of their income on sales-taxable items.  It's difficult to know what that actually looks like, but I used a couple of sources to estimate a curve (see plot below).  (source 1) (source 2) Also, if anyone has Kansas-specific data, I'd be happy to re-run the numbers.


First a preface on terminology:

  • Total Tax Bill: State only, sales tax + income tax paid.
  • New Total Tax Bill: State only, sales tax + income tax paid, after proposed change.
  • Effective Tax Rate: effective rate of taxes, Total Tax Bill divided by Income.
  • Tax % Increase: Percent change in Total Tax Bill.
  • Tax Rate Increase: Percentage point change in Effective Tax Rate.
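To show how those terms fit together, here's a toy calculation for a single hypothetical household (the income, taxable-spend share, and income tax figures are made up for illustration; they are not the numbers behind my charts):

```python
def total_tax_bill(income, taxable_spend_share, sales_rate, income_tax):
    """Total Tax Bill = state sales tax paid + state income tax paid."""
    return income * taxable_spend_share * sales_rate + income_tax

# Hypothetical household: $50,000 income, 35% of it spent on
# sales-taxable items, $1,200 state income tax.
old = total_tax_bill(50_000, 0.35, 0.0615, 1_200)  # Total Tax Bill
new = total_tax_bill(50_000, 0.35, 0.0655, 1_200)  # New Total Tax Bill

effective_rate_change = (new - old) / 50_000  # "Tax Rate Increase" (points)
pct_bill_change = (new - old) / old           # "Tax % Increase"
```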

Looking at the current tax model, we see a flat, but mildly progressive, tax structure.  People generally pay between 3.5% and 5.3% of total income in State sales and income taxes.

With the sales tax option, everyone's tax liability increases slightly. However, if we calculate the % change, the poorest Kansans will pay 6% ($40) more while richer Kansans will pay only 1.5% ($160) more than they currently do.  The effective rate spread moves to 3.7%:5.4%.   This would be considered a move towards a regressive tax, because the lowest earners see the highest % increase.

For the income tax option, we only changed the rate on the highest earners, back to 2012 values.  For this option, we see no change for household incomes less than $50,000.  We do however see a large % change in higher earning households, with $200K+ households seeing a 21% increase in Total Tax Bill YoY.  The new rate spread moves to 3.5%:6.4%, a very progressive move.

And for those of you who prefer to see all of the numbers in a convenient chart, with a bit more information:


Overall, a change to sales tax will impact low-income Kansans much more than a change to income tax, especially if the income tax change only applies to the top tier.  A change to the income tax system like this, however, has the potential to significantly and quickly increase the tax rate on higher income Kansans.

I am not taking a side in this (in my tax bracket, I know which solution benefits me), but I hope that this analysis at least contributes to the discussion/understanding of tax policies.  Feel free to direct message me or comment on this blog if you have any questions or methodological concerns.

Monday, June 8, 2015

What's Next?!

Quite a bit of blogging over the past couple of weeks, but I wanted to give a road map of where we're going this week, and also solicit some user feedback. 

First, the request for user feedback:

This blog has been around for almost six months, and gets quite a bit of traffic from various sources including twitter, reddit, dailykos, and organic search.  It has branched in several directions, and if we based our future on prior traffic, we would create new posts around Kansas Political Issues and Data Science Toolkits.

But we want to be sensitive to our users' real needs and wants.  So, what would you like from us?  If you have ideas, you can comment on this post or email to this address.  For starters, here are some examples of things we're considering looking at:

  • Releasing player-level fantasy football ratings for 2015, based on our models from prior years that had good results.  This would be along the same vein as our 2015 pre-season predictions.
  • Continuing our analysis of Kansas education funding found here.  We were kind of burned out with looking for data after this one, but this still seems to be relevant.  Also, we have some overall methodological concerns with the 2006 study.
  • Analysis of the tax impact between sales and income tax: what is the effective change for citizens with varying levels of income?
  • An update on how R in production is going.  This would include server traffic, timing stats, and a summary of issues and resolutions we have encountered.  
  • A case study in how we create production-ified machine learning models, from start to finish.
  • Early 2016 election predictions?
  • More sports?
  • More politics?
  • etc?

And for a preview, here's what is coming up later this week (for sure):
  1. Tuesday Jams: (of course, we have to have music) I don't want to spoil it, but the band's initials are DFD.
  2. Cumulative Fatigue: This is in our series on modeling fitness tracker data.  After weeks of ramping up activity level using my Garmin fitness tracker (read: no days where I ran less than four miles since April 1), I hit the wall and my body fell apart.  But can we predict when this will happen, so we can prevent it and use it in training plans?

Friday, June 5, 2015

Florida Hospital Deaths May Not Be Systematic

This afternoon I was confronted with two news stories regarding a hospital that has what is being reported as a disproportionately high mortality rate during infant heart surgery.  Obviously this is a sad story, but because it's about a statistical anomaly, it is interesting to me.


Specifically, I wanted to know: is this a case of something systematic occurring (doctors bad at surgery, something nefarious, etc.), or is it possible that this is just a "statistical anomaly"... in essence, is this just an outlier hospital?

Here are two news stories for background (the feds are investigating this now):

Story 1 and Story 2


From a statistical standpoint I can't prove that something systematic isn't occurring, but I can speak to the likelihood that this hospital's death rate is a random outlier, effectively due to "sampling error".  Here is the data I was able to glean from the articles:

  • The national average death rate for these surgeries is 3.3%
  • The average at this hospital is 12.5%.  They cited 8 deaths, so I can calculate that our "n" is 64.

The question here is: Is the St. Mary's hospital death rate significantly different than the national average?  Because this is just testing one sample proportion against the population, I used the binomial exact test.  Here's my R output:


The important piece here is the p-value, which is approximately 0.0012.  That means about a 1.2 in 1000 chance of this difference being due to random chance.  So this is obviously unexpected... but is it really?

The problem with this logic is that there may be a few thousand hospitals that perform this type of surgery.  And if each has a 1.2-in-1,000 chance of showing a mortality rate of 12.5% or greater by chance alone, then it is quite likely that at least one of them will.

In short summary, we know this is a very unlikely data point.  However, given the number of hospitals that perform this type of surgery, it is quite possible that this is just sampling variation.
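That "at least one outlier somewhere" calculation can be sketched directly; the 2,000-hospital count below is an assumption for illustration (the post only says "a few thousand"):

```python
# If each of ~2,000 hospitals independently has a 0.0012 chance of a death
# rate this extreme, the chance that at least one hospital somewhere shows
# it by luck alone is very high.
p_single = 0.0012
n_hospitals = 2000
p_at_least_one = 1 - (1 - p_single) ** n_hospitals
print(round(p_at_least_one, 2))  # about 0.91
```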

Thursday, June 4, 2015

Normalize Your Data!

Hey, a non-football related post!  

So occasionally I see someone trying to make an argument with data and get really annoyed.  Generally this annoyance comes from people using an inappropriate metric to make their point.  Two examples from the last week:

  • Someone arguing that the United States has an issue, because more people are murdered here than in other countries.
  • Someone arguing Kansas may have an issue, because we spend less than the national average on education.
Both arguments failed to persuade me because the data wasn't properly normalized.  I'll look at both of them in depth.


So, my first annoyance was related to someone claiming that America has more murders than other countries.  That may be true, but the graph backing up the claim was bogus, because it looked at aggregate murders, and all of the comparison countries were much smaller than the United States.  What drives aggregate murder counts?  Population.

To demonstrate how this kind of analysis can lead people to make massively false claims, I'll start with my own bogus claim and back it up with data (on a more fun subject, no less):

Americans have a drinking problem because they drink much more than European countries.
And a graph to back it up:

Wow!  We drink almost three times as much as the Germans!  THE GERMANS!!!! 

(Insert picture of Oktoberfest here for effect).

But not really.  To analyze how individuals are impacted by aggregate numbers, you have to normalize for population.  It's an easy calculation: just divide the aggregate total by the population.
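As a sketch, here's what that normalization looks like in Python; the consumption and population numbers below are made-up placeholders, not the figures behind the charts:

```python
# Hypothetical aggregate beer consumption (million liters) and population
# (millions).  Illustrative values only, not the data behind the charts.
countries = {
    "United States": {"total_liters_m": 24000, "population_m": 320},
    "Germany": {"total_liters_m": 8500, "population_m": 81},
    "Czech Republic": {"total_liters_m": 1500, "population_m": 10.5},
}

# Normalize: aggregate total divided by population -> liters per person.
per_capita = {
    name: d["total_liters_m"] / d["population_m"] for name, d in countries.items()
}
for name, liters in sorted(per_capita.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {liters:.0f} liters per person")
```

Even with placeholder numbers, the ordering flips: the biggest aggregate consumer is nowhere near the top once you divide by population.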

Here's a more accurate view of per capita beer consumption by country, comparing the United States to some other high-consumers of beer.  

I have no idea what is up with Czech Republic, I'm guessing they really like their Pilsner Urquell.

This may all seem trivial, but real decisions are made on these types of numbers, and if policy makers are led to believe that the first chart is accurate, then policy decisions are made to combat a problem that doesn't exist.  It could be a big deal, and my next analysis demonstrates a more likely scenario.


Late last night a tweet from a journalist popped up on my feed.  Here it is:

I had two thoughts on this:

  • Does "turn it inside out" mean that he thinks the Kansas Legislature will try to lie with the facts?
  • Or is he asking people in the know to look at the numbers and see what they can find?
Either way, I thought I would dig into the data.  I found two problems with comparing states to a national mean:
  1. The data was not normalized for factors that impact the cost of doing business in a state, specifically cost of living.
  2. Because of that failure to normalize, and because of the shape of the distribution, the data was skewed in a way that drives up the mean.

As a result, I made two normalizations to the analysis: first compensating for distributional skew by looking at Kansas's rank among the states rather than its distance from the mean, then adjusting for cost of living (as an imperfect proxy for cost differentials).

First, the ranking chart: it shows that Kansas is 24th out of 51 states (DC counts).  Kansas is effectively a median state.  Also, if you look at the shape of the distribution in this chart, you can see that it is highly skewed.

Cost of living isn't a perfect measure to normalize for cost differentials, but because the bulk of school costs are salaries, it works for this purpose.  So I normalized using a cost of living index, which shows that Kansas moves up one place, to 25th.  Obviously this is not a significant change in result, but other states move around significantly.  Conclusion: Kansas is about in the middle, spending-wise.
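Here's a minimal Python sketch of the two-step adjustment; the spending figures and cost-of-living indices below are made up to illustrate the method, not the real data behind the charts:

```python
# Illustrative per-pupil spending ($) and cost-of-living index (100 = US
# average).  Values are invented to show the method, not the real state data.
states = {
    "Kansas":   {"spending": 9800,  "col_index": 90},
    "Wyoming":  {"spending": 15700, "col_index": 92},
    "New York": {"spending": 19800, "col_index": 131},
    "Utah":     {"spending": 6500,  "col_index": 93},
}

# Step 1: rank on raw spending (1 = highest spender).
raw_rank = {s: r for r, s in enumerate(
    sorted(states, key=lambda s: -states[s]["spending"]), start=1)}

# Step 2: deflate by cost of living, then re-rank.
adjusted = {s: d["spending"] / (d["col_index"] / 100) for s, d in states.items()}
adj_rank = {s: r for r, s in enumerate(
    sorted(adjusted, key=lambda s: -adjusted[s]), start=1)}

for s in states:
    print(f"{s}: raw rank {raw_rank[s]}, adjusted rank {adj_rank[s]}")
```

Note how, in this toy data, a high-spending state with a low cost of living jumps up the rankings after the adjustment, while a middle state barely moves.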

It's a little unexpected that Wyoming moves up to be a top-spending state, but if I had to guess, I would attribute it to a relatively low cost of living (which inflates its spending after normalization) and poor economies of scale.


Normalization matters, because it allows us to control for the big factors that impact numbers like cost of living and population differences.

Nerdy Conclusion:  Some interesting nerdiness from the second analysis.  The original distribution has a skewness of .94 and a standard deviation of $3,207.  The correlation between cost of living and school spending is .66, which is huge, obviously (and potentially endogenous, because good schools cost more, but that's a trick for a different day).  Normalization reduces the skewness to .31 and the standard deviation to about $2,100.

**Quick side note:  This post is intended only to speak about the issue of normalization, not the NORMATIVE issue of whether Americans should drink more beer, or Kansas should spend more or less on education.  A related post tackles the also non-normative question of whether spending matters.

Tuesday, June 2, 2015

NFL 2015 Predictions: Finale and Playoffs

Bringing my prediction analysis of the 2015 season to a close, finally, though this is generally a lot more fun than the data I usually look at.


The projected playoff picture looks fairly similar to last year.  In the AFC especially, we simply swap in the Houston Texans, and pull out the Bengals.

In the NFC, things are a bit more complicated:  Lions, Cardinals, and Panthers are out, but the Saints, Eagles, and Falcons are in.

Overall, I hope to get at least three of the six playoff teams correct in each conference (which would show my models were at least somewhat predictive for good teams over the season).


You may have noticed that my above chart includes three columns for each team.  The first is the output from my initial model, found here.  The last column is how many games the team won last year.  These columns are similar in many cases, though the projections are generally compressed towards the mean.

The middle column is a little more complex to explain.  It has to do with the difference between picking which team will win each week, versus how many games a team will win in a season.

Let's choose a bland, but explanatory example.  Let's say that my model predicts that a team has a 75% chance to win each of its games (it never does, each game has a different probability based on opponent and other factors, but this example works):  
  • Original Model: It seems redundant, but teams generally win 75% of the time when they have a projected 75% chance to win.  This means that if all 16 games are projected at 75%, the team would win 12 of them on average that season.
  • Picks Model: If I were a sportswriter who has to make picks each week about who will win each game, it may seem that I should pick the team 12 times and their opponent 4 times.  But this isn't right: because the team is a 75% favorite in each game, I should pick them every time to assure my best chance of being right.  I know they will lose about 4 of their games, but I don't know which ones (without external knowledge).  The issue here is that although they are a favorite in each game, they also have a 25% chance to lose each one, which aggregates into 4 likely losses.
Now think about smaller probabilities.  For instance, a team could have a 51% chance to win each of its games.  In this case, it would be the rightful favorite in every game, but would only end up winning about half of them.
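To make the distinction concrete, here's a purely illustrative Python sketch of expected season wins versus a picks-style count:

```python
# A team with win probability p in each of 16 games:
# - expected season wins = the sum of the per-game win probabilities
# - the "picks" model picks them in every game where they are favored (p > 0.5)
def expected_wins(win_probs):
    return sum(win_probs)

def picked_wins(win_probs):
    return sum(1 for p in win_probs if p > 0.5)

heavy_favorite = [0.75] * 16
slight_favorite = [0.51] * 16

print(expected_wins(heavy_favorite), picked_wins(heavy_favorite))    # 12.0 16
print(expected_wins(slight_favorite), picked_wins(slight_favorite))  # ~8.2 16
```

The gap between the two columns is exactly the gap between the original model and the picks model: a team favored in all 16 games can still rationally be expected to lose 4 (or nearly 8) of them.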

The picks model is valuable in a number of ways, especially in identifying which teams have the potential to win a lot of games this year if they win every game in which they are slightly favored (can they beat the odds?).  It also helps to explain why teams that seem to be better than all of their opponents still end up with 6 or 7 losses.

Finally, consider the case of my Kansas City Chiefs.  This is a team that I project will win 9 games, but that will be favored in 12.  As a fan, it's important to understand that while the Chiefs may look good to win a majority of their games, many of those games are not high-probability wins, so the favored record may not materialize.

Tuesday Jams: What's in my bag?

A little change in direction for this week's music.  In fact, I'm not picking any music at all, I'm letting other people pick music for me (a meta music pick?).

If you've ever been to Amoeba Records in LA, you know it's an awesome record store.  One of the coolest things they do is their "What's in my bag" series.  For the series, they bring in musicians and other celebrities to show what records they're buying right now.  

It's a cool series that I like, and it helps me find a lot of new musicians and bands.  You can search YouTube for their videos, with celebrities like Gerald Casale, Bob Odenkirk, The Zombies, J Mascis, Dave Grohl, and Weird Al, but here are a few of my favorites.

Trevor Dunn (bassist for various awesome bands) picks out some weird music:

JG Thirlwell (from the band Foetus) picks out a variety of even weirder music than Trevor:

Deltron 3030 record a video. Dan the Automator spends his time picking out kitschy movies, while Del picks out various rock and hip-hop albums, which is a pretty good explanation of Deltron 3030 as a group:

Monday, June 1, 2015

NFL 2015 Predictions: NFC South

I saved the most exciting predictions for last!

Not really, but this answers the question: Will any NFC South team have a winning record this year?  
The answer, per the models, is YES, but just barely.  One major difference between this division and others is that every team gets better or stays the same.  The Saints make the playoffs, which is not too surprising if you follow this division.

NFL 2015 Predictions: NFC North

On to the NFC North...

It doesn't take a statistical model to determine that a team whose quarterback looks like this will continue to sit at the bottom of its division.

The model shows the Packers will continue to rule this division, but this year without competition from the Lions.  The model essentially shows the Lions getting lucky last year, and this year they will return to being a below average team, according to their fundamental statistics.

The models show the Vikings continuing to be the Vikings, though their results are largely dependent on the playing status of a certain child abuser.