Sunday, June 28, 2015

Exponential Growth!!! Maybe not...

It's much easier to sell a business plan based on the idea of exponential growth than one based on diminishing returns.

A few years ago, I asked a junior team member for an update on a product he was working on.  He was the assigned analytics resource on the product, and I was curious how the new product was proceeding.  He laid out the way the product would work, followed by the general business plan, and capped it off with "...which would lead to exponential growth," a claim he had picked up from the product manager.

The last statement got my attention, so I asked a couple of questions: "So, how exactly would that lead to exponential business growth?" and "Has anyone modeled out how this functionally leads to exponential growth?"  The answer to both was no.

EXPONENTIAL GROWTH AS A SELLING POINT

The reason I am posting on this is something I see far too often: marketing and product managers claim to executives that a product will take a while to get off the ground, but will soon see exponential growth.  Often these claims don't come true, leaving analysts to blame things like lack of consumer buy-in or being late to market.  In reality, the failure goes back to the original growth projection.

THE CLAIM


So, to give an example of how people get their exponential growth claims wrong, let's look at my example from earlier.  Here's how I saw the problem:
  • It was a two-sided stochastic process where the business acquired customers of type A and type B.  
  • Each customer had a subset of goods A{} and B{}, and each of the goods in A{xyz} had to find an appropriate match B{xyz} for our company to make money.  
  • The business plan was largely centered on increasing the number of type B customers on the site, thus increasing the chances of finding matches for each of the type A customers.  
  • The distribution of matches was loosely correlated between A and B; however, matches were sparse with a long tail, so the probability of any one B{} matching an individual A{} was less than 1%.
  • Each "good" leaves after it is matched.  So duplicate B{} matches to each A{} do not increase profits.
This sounds a lot like a dating site, doesn't it?  It's not, but it sounds like one.

This clearly wasn't the junior analyst's fault, so I took my issue directly to its source: the product manager.  When I pressed the product manager who had made the exponential growth claim, here's how he explained growth would occur:

  1. We'll go get more type B's to enter data on the site.
  2. We'll see the NETWORK EFFECT.
  3. Network effect leads to exponential growth.
So this is effectively the same business model as the South Park gnomes, except with the network effect substituted for the ???.  But does it work?  Not in the way assumed here.

THE REALITY

The reason this isn't an exponential growth problem is simple: this is a dubious invocation of the Network Effect.  Rather than a true network where each individual potentially interacts on an ongoing basis with every other user (think: Facebook), the individuals here are divided into two teams, can only connect to the other team, and are removed after connecting once.  Simply speaking:

Because duplicate matches don't create additional revenue, each incremental addition to B has a diminishing potential to positively impact the business.
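To see why, here's a minimal simulation sketch of the matching process described above.  The numbers (1,000 A-side goods, a 1% match probability, up to 5,000 B-side goods) are illustrative assumptions, not the real product's parameters:

# Illustrative only: fixed pool of A-side goods, each new B-side good has a
# small chance of matching any individual A good, and a good only pays once.
n_a     <- 1000                      # A-side goods waiting for a match
p_match <- 0.01                      # chance any one B matches any one A (the "< 1%")
b_range <- seq(0, 5000, by = 100)    # number of B-side goods added

# Probability an A good is matched by at least one of b B goods: 1 - (1 - p)^b
expected_matches <- n_a * (1 - (1 - p_match)^b_range)

plot(b_range, expected_matches, type = "l",
     xlab = "Type-B goods added",
     ylab = "Expected A goods matched (revenue events)")
# The curve saturates toward n_a: each additional B adds less than the last.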

So, in reality, the growth is a diminishing returns game.  Which looks like this:



Rather than the expected exponential growth, which looks like this:

CONCLUSION

This post is a simple warning: there are a lot of dubious claims of exponential growth out there.  Always try to model out the underlying processes with realistic probabilities before buying in, or before giving the analytics "go-ahead" to new projects.  BTW, the last update I have on the project above is that it was cancelled after failing to achieve the desired growth.



Saturday, June 27, 2015

Anatomy of an Analysis: How your Toolkit comes together.

This is a follow-up to my original post on my Data Science Toolkit, which received a hugely positive response.  One of the questions I get from young analysts is "what tool should I use for this task?"  Generally when they ask this, they are weighing a couple of software products, both capable of completing the task, but one with a distinct advantage.  So, the advice I give goes something like this:
Use whatever tool will get you to an accurate answer fastest, while putting you in the best position to follow up with the data in the future.
Rules like this are nice, but I think it's more interesting to show process.  As it happens, my recent post on people driving to avoid potential sales tax involved every tool in my toolkit post.  So, let's go through my process as a demonstration to young analysts.

PROCESS

  1. I read an initial post by another analyst on Twitter.  I was initially annoyed, but also curious.  I used Google Maps to calculate the distance from my home to Missouri, and then Excel for some "back of the napkin" calculations.  I realized: hey, I have something here, maybe I should write this one up.
  2. I needed a framework for analysis; what tool is good for "what if" scenarios?  Excel.  So I ran some final numbers in Excel, based on miles from the border, driving costs as well as time costs, and then created a chart of the cost curves.
  3. After the framework was developed, I knew I needed to apply it, and this certainly had a geographic aspect.  I acquired a couple of Johnson County specific shapefiles (Census Blocks; Major Highways) and imported them to QGIS.
  4. From my framework analysis I knew my mileage cutoffs (4, 6, 9), but I needed a geographically defined way to implement them.  For this, I used some custom R functions I had developed previously that implement the haversine formula and allow for a backsolve (put in mileages, get coordinates back); a rough sketch of the idea follows this list.
  5. Next I needed to code my shapefiles by the distance breaks.  For this, I used the Python engine internal to QGIS to create the maps and color-code them for data viz effect.  I also used this Python script to summarize census block data by distance to the border, so that I could say things like "70% of people in Joco live within nine miles of the border."  (I used Notepad++ for this code, and all the other code created for this project.)
  6. After completing my analysis, I wanted to quickly validate the data.  Luckily I have a Haversine backsolve for SQL, which allowed me to check this analysis independently.  Also: GIS shapefiles have an internal *.dbf file holding the attribute table, which contains the analytical information you need (lat/long, population, demographics, etc.) and can be easily imported to SQL.  I validated, and everything checked out.
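For reference, here is a rough R sketch of the haversine idea mentioned in step 4.  These are not my original production functions, and the coordinates and bearing in the example are illustrative placeholders:

# Great-circle distance between two lat/long points (haversine formula)
haversine_miles <- function(lat1, lon1, lat2, lon2, r = 3959) {
  to_rad <- function(d) d * pi / 180
  lat1 <- to_rad(lat1); lon1 <- to_rad(lon1)
  lat2 <- to_rad(lat2); lon2 <- to_rad(lon2)
  a <- sin((lat2 - lat1) / 2)^2 +
       cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  2 * r * asin(sqrt(a))                  # distance in miles
}

# "Backsolve": put in a start point, bearing, and mileage; get coordinates back
destination_point <- function(lat, lon, bearing_deg, dist_miles, r = 3959) {
  to_rad <- function(d) d * pi / 180
  to_deg <- function(x) x * 180 / pi
  phi1 <- to_rad(lat); lam1 <- to_rad(lon); theta <- to_rad(bearing_deg)
  delta <- dist_miles / r
  phi2 <- asin(sin(phi1) * cos(delta) + cos(phi1) * sin(delta) * cos(theta))
  lam2 <- lam1 + atan2(sin(theta) * sin(delta) * cos(phi1),
                       cos(delta) - sin(phi1) * sin(phi2))
  c(lat = to_deg(phi2), lon = to_deg(lam2))
}

# Example: points 4, 6, and 9 miles due west (bearing 270) of a made-up
# point near the state line
sapply(c(4, 6, 9), function(m) destination_point(38.99, -94.61, 270, m))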

Monday, June 22, 2015

Fitness Week #1: Distribution and Descriptives

I've been promising for a few weeks an update on fitness tracking results, then not delivering.  I'm going to fix that today.  In fact, I'm going to take this entire week to fix that, starting my first ever "Fitness Week" - which is actually just me taking a very nerdy look at numbers.

Here are the posts I'm laying out for this week:

Today: This post.  Just a general update on what my fitness tracker has been tracking.
Tuesday: On "targeting" and its distributional impacts.
Wednesday: Aggregate fatigue, what happened to me three weeks ago.
Thursday: Product review, of my Fitness Tracker (Garmin Fit 2)
Friday: My updated model of fitness data.


SUMMARY 

Let's start today with some easy data (it is Monday after all).  How active am I per my fitness tracker?  Here's the summary of my steps per day data:


I know, it's ugly and pulled straight from Excel, but it shows what we need.  Just some observations on steps:
  • From some quick research, the average American takes only 5,900 steps a day (seriously?), so I am at least three times the average.
  • My original goal was 20,000 step average, so, goal achieved!
  • The mean is above the median, verifying our skew number.  Also (for non-stats people), this basically means that though I average over 20,000 steps a day, I don't get 20,000 steps on my average day.  (Fairly important for goal setting.)
  • My sleep numbers look less skewed, and the average is in a good range of 7-8 hours of sleep a night.

DISTRIBUTION

But what does this actually look like?  First let's look at steps.  Here's a cumulative distribution graph that I like to put together when analyzing relative frequencies.



The distribution is obviously not normal, but in my mind it's not that unexpected.

  • 65% of the time, I get between 16,000 and 22,000 steps.  Further analysis shows I should consider this my "average weekday."
  • 10% of the time, I get between 10,000 and 16,000 steps.  These appear to be low outlier weekdays, generally when I'm especially tired or busy.
  • 25% of the time, I get over 22,000 steps.  These are generally weekend days.
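The chart above came out of Excel, but for anyone who wants to reproduce this kind of cumulative view, here's a minimal R sketch.  The step counts below are made up for illustration, since the raw tracker export isn't included here:

# Hypothetical daily step counts, for illustration only
steps <- c(18500, 21200, 17400, 23800, 16900, 20500, 25600, 19800,
           14200, 22300, 18100, 26900, 20900, 17800, 21700)

summary(steps)          # mean vs. median hints at the skew
plot(ecdf(steps),
     xlab = "Steps per day",
     ylab = "Cumulative proportion of days",
     main = "Cumulative distribution of daily steps")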
Now let's look at sleep:
  • My sleep patterns are fairly normally distributed, which is amazing, because (even as a statistician) I RARELY see anything this close to a normal distribution.  
  • The variance is a bit disturbing.  Sure, most nights I'm between 6.5 and 8 hours, but there are still quite a few nights where I'm below 6 or above 8.  I'm going to review the literature on sleep to see what's going on there.  


CONCLUSION

So, I'm meeting my goals on steps, which makes me very happy.  My sleeping hours are normally distributed, but I'm not convinced this is evidence that I am sleeping "correctly."  I'll spend the rest of the week modeling and diving into these numbers more deeply.  


Wednesday, June 17, 2015

The Briefcase: A Weird Reality Show

I was wrong in a tweet.  No really, I was wrong on the internet, AND I am admitting it.  Here's how it all started.

A few weeks ago my wife sent a link to an article about a new show, "The Briefcase."  I can't find the article she sent me, but here's a link to a similar article.

A quick summation: the show gives poor people money, and they then have to make decisions about helping another family in need or keeping the money.  It's a fascinating concept, but the press was dubbing it "poverty porn" and claiming that it exploits poverty.

THE TWEET

So a few nights ago, I watched the show and had a bit of a different reaction than the press.  I've talked on this blog about being relatively well off but also very cheap.  As such, I drive a 2010 Honda Fit because it's cheap, efficient, and does what I need it to do.  I saw that the people on the show were driving nicer vehicles than me... which led to this tweet:


I thought I had expressed myself fairly succinctly, in a grumpy, cheap, old man kind of way.  Then the producer of the show responded:


So, I thought about this for a couple of hours and realized some things:

  1. He's right.
  2. The show ISN'T poverty porn in a general sense.  This is distinctly something else.
  3. The media got this one really wrong, which indicates a lack of understanding of poverty.

THE GAME

So, some background.  You could read it elsewhere, but here's the quick summary:
  • A "poor" family is given a briefcase with $100K.
  • They are also given information about another "poor" family; this information increases throughout the show.
  • They have to choose how much money to keep for themselves and how much to give to the other family.
  • The family is completely blind to the fact that the other family is playing the exact same game, looking back at them.
So, what does the output of this game look like?  Fairly simple: the players believe their "win" will simply be the percentage of the $100K that they choose to keep, represented by this simple graph:


In reality, their "win" will be what they keep plus what the other family gives them (this is the blind part of the game that they are unaware of), represented by a more complex graph (percent the family chooses to keep on the bottom, percent the other family chooses to give at the right):


So what does the output of this game give us? It's actually fairly simple and doesn't require fancy charts or equilibrium calculations.  

Some people may assume that this is a prisoner's dilemma problem, as it is a blind, two-player game.  But it's not.  At all.  The reason?  There is no interdependence of results.  I will get the amount I choose to keep, no matter what the other family decides to do.  Thus the rational solution is to keep all the money.  But what about the other family?
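Before getting to that: a quick sketch of the payoff structure, simplified to ignore the social costs discussed next:

# Payoff for one family: the fraction of the $100K they keep, plus whatever
# fraction the other family chooses to give them (which they can't influence)
payoff <- function(keep_frac, other_gives_frac, pot = 100000) {
  keep_frac * pot + other_gives_frac * pot
}

payoff(keep_frac = 1.0, other_gives_frac = 0.0)   # 100000
payoff(keep_frac = 1.0, other_gives_frac = 0.5)   # 150000
payoff(keep_frac = 0.5, other_gives_frac = 0.5)   # 100000
# Keeping everything dominates: no interdependence, hence no prisoner's dilemma.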

Two main factors deter players from keeping the money for themselves, undermining purely rational play:
  • Guilt over the plight of the other family (they need it worse).
  • Fear of the optics of keeping all the money on public television (looking bad).

WHY THIS ISN'T POVERTY PORN

So, back to the episode I watched, and why my initial tweet was wrong.  It's fairly simple:

The people featured on this show aren't poor, and are generally better off than the median American.  On the show I watched, the two families made $60K and $70K, compared to $51,900 for the median American household (source: 538).
The families both had a lot of debt, but that debt seemed largely due to poor decision making, which means this is a story about highly leveraged middle-class Americans.  This is IN FACT a story about the new American middle class.  Which leads us to our next point...

IF NOT POVERTY PORN, THEN WHAT?

Time to coin a phrase: DECISION PORN.  Effectively, this show creates a spectacle of already over-leveraged middle Americans trying to make decisions to benefit their financial situations, weighed against those of another family (or just the desire not to look selfish on TV).  And it is somewhat entertaining.

Am I writing off the plight of these families?  Somewhat, but keep in mind they are generally better off than average Americans.

The part of the show that struck me most as an analogy to middle-class Americans: upon seeing a potential financial windfall, the initial reaction of both families is to put themselves into a MORE leveraged position.


  • Family one: Wants to buy a boat.  Granted it's to start a new business, but it would still cost more than they would receive from the show, and it's a hugely speculative move.
  • Family two: Wants to adopt a foreign child.  Certainly sounds like a nice thing to do, but also is a huge immediate cost, and creates a high ongoing cost of raising a child.

Now for the big question, will I be watching the show in the future?  Probably not.  While the concept is fascinating, I get really uncomfortable watching other people make bad decisions, especially regarding large sums of money.  I think that watching this show would actually give me a good amount of anxiety.




Saturday, June 13, 2015

R in production diaries: sqlsave slowness

A few weeks ago, I received a text from a former colleague; here's what it read:

Do you do any bulk writes from R to a db?  Sqlsave is slow.  I'm considering writing a file and picking it up with another tool.

I knew exactly what he was talking about.  sqlSave() is a function in the RODBC package for writing from R to SQL databases.  I also knew that he was likely refactoring my code, or code that he had partially copied from me at one time.

THE SOLUTION 


Fortunately, a couple of years ago, I had migrated off of sqlSave() to another solution.  Here was my response to my colleague, by email:

Easier to type on here.
We don't have to do any bulk saves in production, everything there is a single transaction, so we're saving back 6-7 records at any time transactionally (times 100 tx's per minute, yes it's a lot at a time but not a lot at any one execution).

As for SQLSAVE. I got fed up with the command and no longer use it.  We have an insert stored procedure that we call from R.  Much faster.  Much more efficient and easier.

We basically call an ODBC execute command wrapper, then just execute the proc with parameters to insert our data.

I don't know how this would work from your point of view, but you could (sans proc) in theory, create a text string in R that is essentially the bulk insert you want to do, and execute it AS SQL.  Which I think is the key... send the command, and let the insert be handled by the ODBC driver as a native ODBC SQL command, not as a hybrid weirdo SQLSAVE.
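To make that concrete, here's a minimal sketch of both approaches using RODBC.  The DSN, stored procedure name, table, and columns are placeholders, not the actual production code:

library(RODBC)

# Placeholder records; in production these come from the scoring transaction
df <- data.frame(id = 1:3, label = c("a", "b", "c"), value = c(0.1, 0.2, 0.3),
                 stringsAsFactors = FALSE)

conn <- odbcConnect("my_dsn")   # DSN name is a placeholder

# Option 1: call an insert stored procedure, one execution per record
for (i in seq_len(nrow(df))) {
  sqlQuery(conn, sprintf("EXEC dbo.usp_insert_score %d, '%s', %f",
                         df$id[i], df$label[i], df$value[i]))
}

# Option 2 (sans proc): build one multi-row INSERT string and send it as raw SQL
vals <- paste0("(", df$id, ", '", df$label, "', ", df$value, ")",
               collapse = ", ")
sqlQuery(conn, paste0("INSERT INTO dbo.scores (id, label, value) VALUES ", vals))

odbcClose(conn)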

CONCLUSION

  • sqlSave() runs slow in production jobs.  Not sure why, but I would guess it falls into a general class of problems I call "metadata shenanigans."
  • Using SQL directly to write back to the database in some way is generally a faster solution.
I received a thank-you text yesterday from my former colleague; he refactored and his job runs faster.  All is right in the world again.

Friday, June 5, 2015

Florida Hospital Deaths May Not Be Systematic

This afternoon I was confronted with two news stories regarding a hospital that has what is being reported as a disproportionately high mortality rate during infant heart surgery.  Obviously this is a sad story, but because it's about a statistical anomaly, it is interesting to me.

BACKGROUND

Specifically, I wanted to know: is this a case of something systematic occurring (doctors bad at surgery, something nefarious, etc.), or is it possible that this is just a "statistical anomaly"... in essence, is this just an outlier hospital?

Here are two news stories for background (the feds are investigating this now):

Story 1 and Story 2

METHOD

From a statistical standpoint I can't prove that something systematic isn't occurring, but I can speak to the likelihood of this hospital's death rate being a random outlier, effectively due to "sampling error."  Here is the data I was able to glean from the articles:


  • The national average death rate for these surgeries is 3.3%
  • The average at this hospital is 12.5%.  They cited 8 deaths, so I can calculate that our "n" is 64 (8 / 0.125 = 64).

The question here is: is the St. Mary's hospital death rate significantly different from the national average?  Because this is just testing one sample proportion against a population value, I used the exact binomial test.  Here's my R output:
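(The output screenshot isn't reproduced here, but the call itself is a one-liner.  A sketch of it; the one-sided alternative is my assumption about how the question was framed.)

# 8 deaths in 64 surgeries, tested against the 3.3% national rate
binom.test(x = 8, n = 64, p = 0.033, alternative = "greater")
# The post reports a p-value of approximately 0.0012.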


CONCLUSION

The important piece here is the p-value, which is approximately 0.0012.  That means roughly a 1.2-in-1,000 chance of seeing a death rate this extreme by chance alone if the hospital's true rate matched the national average.  So this is obviously unexpected... but is it really?

The problem with this logic is that there may be a few thousand hospitals that perform this type of surgery.  And if each has a 1.2-in-1,000 chance of showing a mortality rate of 12.5% or greater, then it becomes fairly likely that at least one of them would, purely by chance.  (For example, with 1,000 such hospitals, the chance that at least one shows a result this extreme is 1 - (1 - 0.0012)^1000, or roughly 70%.)

In short, this is a very unlikely data point for any single hospital.  However, given the number of hospitals that perform this type of surgery, it is possible that this result is due to sampling variation alone.

Thursday, June 4, 2015

Normalize Your Data!

Hey, a non-football related post!  

So occasionally I see someone trying to make an argument with data and get really annoyed.  Generally this annoyance comes from people using an inappropriate metric to make their point.  Two examples from the last week:

  • Someone arguing that the United States has an issue, because more people are murdered here than in other countries.
  • Someone arguing Kansas may have an issue, because we spend less than the national average on education.
Both arguments failed with me because they didn't properly normalize the data.  I'll look at each of them in depth. 


ALCOHOL CONSUMPTION

So, my first annoyance was related to someone claiming that America has more murders than other countries.  That may be true, but the graph backing up the claim was bogus, because it looked at aggregate murders and all of the comparison countries were much smaller than the United States.  What drives aggregate murder totals?  Population.

To demonstrate how this kind of analysis can lead people to make massively false claims, I'll start with my own bogus claim and back it up with data (on a funner subject, no less):

Americans have a drinking problem because they drink much more than Eastern European countries. 
And a graph to back it up:


Wow!  We drink almost three times as much as the Germans!  THE GERMANS!!!! 

(Insert picture of Oktoberfest here for effect).


But not really.  To analyze how individuals are impacted by aggregate numbers, you have to normalize for population.  It's an easy calculation: just divide the aggregate total by the population.
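In R it's one line.  A trivial sketch, with made-up aggregate figures purely for illustration:

# Made-up aggregate beer consumption (litres) and populations, illustration only
total_litres <- c(USA = 24e9, Germany = 8e9, Czech = 1.5e9)
population   <- c(USA = 320e6, Germany = 81e6, Czech = 10.5e6)

per_capita <- total_litres / population   # normalize the aggregate by population
sort(per_capita, decreasing = TRUE)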

Here's a more accurate view of per capita beer consumption by country, comparing the United States to some other high-consumers of beer.  







I have no idea what is up with the Czech Republic; I'm guessing they really like their Pilsner Urquell.

This may all seem trivial, but real decisions are made on these types of numbers, and if policy makers are led to believe that the first chart is accurate, then policy decisions are made to combat a problem that doesn't exist.  It could be a big deal, and my next analysis demonstrates a more likely scenario.

EDUCATION FUNDING

Late last night a tweet from a journalist popped up on my feed.  Here it is:

I had two thoughts on this:

  • Does "turn it inside out" mean that he thinks the Kansas Legislature will try to lie with the facts?
  • Or is he asking people in the know to look at the numbers and see what they can?
Either way, I thought I would dig into the data.  I found two problems with comparing states to a national mean or average of states:
  1. The data was not normalized for factors that impact the cost of doing business in a state, specifically cost of living.
  2. Because of that failure to normalize, and other distributional aspects, the data was likely skewed in a way that would drive up the mean.

As a result, I needed to make a couple of adjustments to the analysis: first compensating for distributional skew by ranking Kansas against the other states rather than comparing it to the mean, then adjusting for cost of living (as an imperfect proxy for cost differences). 
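Here's a minimal sketch of both adjustments.  The states, spending figures, and index values below are placeholders, not the actual dataset:

# Placeholder data: per-pupil spending and a cost-of-living index (100 = US average)
states <- data.frame(
  state     = c("Kansas", "Wyoming", "New York", "Utah"),
  spending  = c(9800, 15700, 19800, 6500),
  col_index = c(90, 95, 130, 92),
  stringsAsFactors = FALSE
)

# Adjustment 1: rank against other states instead of comparing to the mean
states$raw_rank <- rank(-states$spending)

# Adjustment 2: deflate spending by cost of living, then re-rank
states$adj_spending <- states$spending / (states$col_index / 100)
states$adj_rank     <- rank(-states$adj_spending)

states[order(states$adj_rank), ]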

First, the median chart: it shows that Kansas is 24th out of 51 states (DC counts).  Kansas is effectively a median state.  Also, if you look at the shape of the distribution in this chart, you can see that high skew exists.





Cost of living isn't a perfect measure to normalize for cost differentials, but because the bulk of school costs are related to paying salaries, it works for this purpose.  So I normalized using a cost of living index, which shows that Kansas moves one place, to 25th.  Obviously this is not a significant change in result, but if you look at other states, they move around significantly.  Conclusion: Kansas is about in the middle, spending-wise. 


A little unexpected that Wyoming moves up to be a top spending state, but if I had to guess, I would think it's because of a relatively low cost of living (after normalization) and poor economies of scale.

CONCLUSION

Normalization matters because it allows us to control for big factors, like cost of living and population differences, that drive aggregate numbers.

Nerdy conclusion: some interesting nerdiness on the second analysis.  The original distribution has a skew of .94 and a standard deviation of $3,207.  The correlation between cost of living and school spending is .66, which is obviously huge (potentially endogenous, because good schools cost more, but that's a topic for a different day).  The normalized distribution reduces the skew to .31 and the standard deviation to about $2,100.


**Quick side note:  This post is intended only to speak about the issue of normalization, not the NORMATIVE issue of whether Americans should drink more beer, or Kansas should spend more or less on education.  A related post tackles the also non-normative question of whether spending matters.