Monday, June 29, 2015

Fitness Week Summary

One last post to wrap up Fitness Week.  I got so distracted by tax policy and by whether I could save money by driving to Missouri (read: me being cheap) that I forgot my last post for Fitness Week.  That post was supposed to be on modeling fitness data, but first, how did I actually do during Fitness Week?

FITNESS WEEK STATS

Fitness Week ended up being a fairly typical week for me, though slightly above average in aggregate, as shown below:

  
A few notes:
  • Each weekday was close to its average over time, with each day slightly exceeding the Garmin-set goal, per my earlier analysis on "targeting."
  • Monday and Thursday were my two worst days.
  • Saturday and Sunday were both high days, over 30,000 steps.  That's a record; I've never before had two consecutive days over 30K steps.

MODELING THE DATA

Now an update on modeling fitness data to predict activity levels.  I've posted on this a few times before; here's my most recent post for review.

For you nerds here's my model.  I don't have a ton of data (about 100 days) so I have to use a fairly limited methodology.  I'm using a log-log regression model to predict daily steps using six variables (3 logged "priors" and 3 fixed effects).  R-squared is .77, which is OK, though accounting for variance is fairly easy here, because a good amount of the variance is just weekends and travel days deviating sharply from the mean.  Here's what the regression looks like, with variable definitions in the next section.
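For the curious, here's a minimal sketch of what that model looks like in R. The variable names are my shorthand for the definitions in the next section, and `fit_data` stands in for a data frame of daily tracker observations (one row per day), so treat this as the shape of the model rather than the exact code:

```r
# Sketch of the log-log step model: three logged "priors" plus three 0/1
# fixed effects. `fit_data` is an assumed data frame with one row per day.
step_model <- lm(
  log(steps) ~ log(steps_prior)          # steps yesterday
             + log(steps_three_prior)    # steps over the prior three days
             + log(hours_three_prior)    # sleep over the prior three days
             + weekend + alone + travel, # indicator variables
  data = fit_data
)

summary(step_model)  # coefficients on the logged terms read as elasticities
```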


For you non-nerds here are the simple results:

Most Important Factors
  •  Weekend: On weekends, all else equal, I take approximately 30% more steps than on weekdays.
  • Steps_Prior: Each 1% increase in steps yesterday leads to a 0.4% decrease in steps today.  Intuitively: the more active I was yesterday, the more tired I am today.
  • Steps_Three_Prior: Each 1% increase in the steps over the past three days leads to a 0.97% increase in steps today.  Essentially: one of the best predictors of steps today, is how I've done recently.
Less Important Factors
  • Hours_Three_Prior: Each 1% increase in sleep in the last three days leads to a 0.37% decrease in steps today.  Likely related to cumulative fatigue theory.
  • Alone: Days when I'm home alone I tend to take 13% more steps.
  • Travel: Days when I have to travel for work or family, I take 11% fewer steps. 

CONCLUSION

A couple of things here:
  1. Fitness week ended up being a fitness tracker success.  I exceeded my weekly average, and set a record for consecutive days over 30K steps.
  2. My fitness model is coming together well.  It's fairly predictive, and all the input variables make a lot of sense.  It's nice to be able to make ceteris paribus approximations of the impact of certain factors (e.g. How much more will I move on the weekends?  How much does having a big day today impact tomorrow?).  I will continue working on these models, adding additional factors and potentially acquiring other people's data (shameless solicitation?).

Sunday, June 28, 2015

Exponential Growth!!! Maybe not...

It's much easier to sell a business plan based on the idea of exponential growth than one based on diminishing returns.

A few years ago, I asked a junior team member for an update on a product he was working on.  He was the assigned analytics resource on the product, and I was curious how the new product was proceeding.  He laid out the way the product would work, followed by the general business plan.  He capped his statement off with "which would lead to exponential growth...," a line he had been told by the product manager.

The last statement got my attention.  So I asked a couple of questions such as "so, how exactly would that lead to exponential business growth?" and "has anyone modeled out how this functionally leads to an exponential growth problem?"  The answer was no.

EXPONENTIAL GROWTH AS A SELLING POINT

The reason I am posting on this is that it's something I see far too often.  Many times marketing and product managers claim to executives that a product will take a while to get off the ground, but will soon see exponential growth.  Often these claims don't come true, leaving analysts to blame things like lack of consumer buy-in or being late to market.  In reality the failure traces back to the original growth projection.

THE CLAIM


So to give an example of how people get their exponential growth claims wrong, let's look at my example from earlier.  Here's an explanation of how I saw this problem:
  • It was a two-sided stochastic process where the business acquired customers of type A and type B.  
  • Each customer had a subset of goods A{} and B{}, and each of the goods in A{xyz} had to find an appropriate match B{xyz} for our company to make money.  
  • The business plan was largely centered on increasing the number of type B customers on the site, thus increasing the chances of finding matches for each of the type A customers.  
  • The distribution of matches was loosely correlated between A and B, but it was sparse with a long tail, so the probability of any one B{} matching an individual A{} was less than 1%.
  • Each "good" leaves after it is matched.  So duplicate B{} matches to each A{} do not increase profits.
This sounds a lot like a dating site, doesn't it?  It's not, but it sounds like one.
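To make the diminishing-returns point concrete before getting to the product manager's version, here's a quick simulation sketch of the structure above. The 1% match probability comes from the bullet list; the customer and goods counts are made-up illustration numbers:

```r
# Expected matched A-goods as the pool of B-goods grows. Since duplicate
# matches add nothing, each A-good is either matched or not:
# P(matched) = 1 - (1 - p)^n_b, which saturates rather than compounding.
p   <- 0.01                    # chance a single B-good matches a given A-good
n_a <- 1000                    # A-goods waiting for a match (made-up)
n_b <- seq(0, 2000, by = 50)   # size of the B-good pool

expected_matches <- n_a * (1 - (1 - p)^n_b)

plot(n_b, expected_matches, type = "l",
     xlab = "B-goods on the site", ylab = "Expected matched A-goods")
# The curve flattens out: diminishing returns, not exponential growth.
```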

This clearly wasn't the junior analyst's fault, so I took my issue directly to the source: the product manager.  When I pressed the product manager who made the exponential growth claim, here's his explanation of how growth would occur:

  1. We'll go get more type B's to enter data on the site.
  2. We'll see the NETWORK EFFECT.
  3. Network effect leads to exponential growth.
So this is effectively the same business model as the South Park gnomes, except with the network effect substituted for the ???.  But does it work? Not in the way assumed here.

THE REALITY

The reason this isn't an exponential growth problem is simple: it's a dubious invocation of the network effect.  Rather than a true network where each individual potentially interacts on an ongoing basis with every other user (think: Facebook), the individuals are divided into two teams, can only connect to the other team, and after connecting once are removed.  Simply speaking:

Because duplicate matches don't create additional "growth," each incremental addition to B holds a diminished potential of positively impacting the business.

So, in reality, the growth is a diminishing returns game.  Which looks like this:



Rather than the expected exponential growth, which looks like this:

 CONCLUSION

This post is a simple warning: there are a lot of dubious claims of exponential growth out there.  Always try to model out the underlying processes with realistic probabilities before buying in, or before giving the analytics "go-ahead" to new projects.  BTW, the last update I have on the project described above is that it was cancelled after not achieving the desired growth results.



Saturday, June 27, 2015

Anatomy of an Analysis: How your Toolkit comes together.

This is a follow-up to my original post on my Data Science Toolkit, which received a hugely positive response.  One of the questions I get from young analysts is "what tool should I use for this task?"  Generally when they ask this question, they are weighing a couple of software products, both capable of completing the task, but one with a distinct advantage. So, the advice I give goes something like this:
Use whatever tool will get you to an accurate answer fastest, while putting you in the best position to follow up with the data in the future. 
Rules like this are nice, but I think it's more interesting to show process.  Interestingly, my recent post on people driving to avoid potential sales tax involved every tool in my toolkit post.  So, let's go through my process as a demonstration to young analysts.

PROCESS

  1. I read an initial post by another analyst on Twitter. I became initially annoyed, but also curious.  I used Google Maps to calculate the distance to Missouri from my home, and then Excel for some "back of the napkin" calculations.  I realized, hey, I have something here; maybe I should write this one up? 
  2. I needed a framework for analysis. What tool is good for "what if" scenarios?  Excel.  So I ran some final numbers in Excel, based on miles from the border, driving costs, and time costs, and then created a chart of the cost curves.
  3. After the framework was developed, I knew I needed to apply it, and this certainly had a geographic aspect.  I acquired a couple of Johnson County specific shapefiles (Census Blocks; Major Highways) and imported them to QGIS.
  4. From my framework analysis I knew my mileage cutoffs (4, 6, 9), but I needed a geographically defined way to implement this. For this, I used some custom R functions I developed previously that implement the haversine formula and allow for a backsolve (put in mileages, get coordinates); see the sketch after this list.
  5. Next I needed to code my shapefiles by the distance breaks.  For this, I used the Python engine internal to QGIS to create the maps and color code them for data viz effect. I also used this Python script to summarize census block data by distance to border, so that I could say things like "70% of people in Joco live within nine miles of the border."  (I used Notepad++ for this code, and all the other code created for this project.)
  6. After completing my analysis, I wanted to quickly validate the data.  Luckily I have a Haversine backsolve for SQL, which allowed me to validate this analysis.  Also: GIS shapefiles have an internal *.dbf file that holds the attribute table, which holds the analytical information you need (lat/long, population, demographics, etc) and can be easily imported to SQL.  I validated, and everything checked out.
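As a footnote to step 4, here's a rough sketch of the kind of great-circle helpers I mean. These are not my production functions, just one standard spherical implementation of the distance and the "backsolve" (start point, bearing, and mileage in; coordinates out); the example coordinates are hypothetical:

```r
R_MILES <- 3959  # mean Earth radius in miles

# Haversine distance between two lat/long points (degrees in, miles out).
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R_MILES * asin(sqrt(pmin(1, a)))
}

# "Backsolve": given a start point, bearing (degrees), and miles, return the
# destination coordinates -- put in mileages, get coordinates.
destination_point <- function(lat, lon, bearing_deg, miles) {
  to_rad <- pi / 180
  lat1 <- lat * to_rad; lon1 <- lon * to_rad; brg <- bearing_deg * to_rad
  d <- miles / R_MILES
  lat2 <- asin(sin(lat1) * cos(d) + cos(lat1) * sin(d) * cos(brg))
  lon2 <- lon1 + atan2(sin(brg) * sin(d) * cos(lat1),
                       cos(d) - sin(lat1) * sin(lat2))
  c(lat = lat2 / to_rad, lon = lon2 / to_rad)
}

# Example: a point 9 miles due west of a hypothetical border coordinate.
destination_point(38.90, -94.61, bearing_deg = 270, miles = 9)
```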

Friday, June 26, 2015

Kansas Sales Tax: Drive to Missouri?


I have lived in Johnson County (JoCo for locals), Kansas for about a year now.  For our out-of-state readers, Johnson County is considered the most affluent county in Kansas by a fairly wide margin, and is located on the Missouri border.  It is home to many of the Kansas-side Kansas City suburbs, and some of the largest corporations in Kansas City (notably Sprint and Garmin).

I live in the eastern half of Johnson County, about ten miles from Missouri.  Since most of Kansas City is in Missouri, you might think that I cross the state line pretty often.  But that's not true.  Actually, in the last year, I've crossed the line only five times that I can remember:

  • 2 - Kansas City Royals Games.
  • 1 - To see David Sedaris at the Kaufman Center.
  • 1 - To see a college basketball game at the Sprint Center.
  • 1 - Some dog festival my wife wanted to go to.
Ok, this may just mean I'm lame (which I am), but it points to something: while Missouri is relatively close to most of Johnson County, it's not like I live "half" my life in Missouri.  Ten miles is still ten miles, and it takes quite an incentive for me to drive that far, especially with the amenities already close.

WHY THIS MATTERS

So why does this matter?  My previous analysis looked at changes in Kansas tax policy, specifically looking at food taxes.   Basically, things are going to be more expensive for Kansas consumers this year.  Then yesterday I see this tweet:

Which can really be taken no other way than this: the 20% of Kansans living in JoCo could save money by buying groceries in Missouri.  My thought: no way that I'm driving to freaking Missouri every time I need to buy groceries.  My actual thoughts were along the lines of ...


But it's not like me to base my decisions on grandmothers from old westerns, so I ran some numbers. 

THE NUMBERS

Ok, we'll use the initial numbers from the tweet and assume a family of four would spend about $670 less based on sales tax differentials (though this appears to "miss" the way pricing works).  That's the savings amount, but the analysis didn't account for the cost of doing business across the border.  What are those costs?  (A quick sketch of the arithmetic follows the list.)
  • Physical cost of driving (wear/tear on car; gas).  We'll use the federal mileage rate.
  • Cost of time to drive.  We'll use a few hourly estimates here ($10, $25, $50).
  • Assuming once-weekly grocery shopping.
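Here's a minimal sketch of the arithmetic behind the chart. The $670 and the hourly values come from above; the 2015 federal mileage rate, the 52 trips per year, and the assumed average driving speed are my own inputs, so the exact break-even mileages shift if you change them:

```r
# Net annual savings from border shopping as a function of one-way miles.
annual_savings <- 670        # family-of-four estimate from the tweet
trips_per_year <- 52         # once-weekly grocery shopping
mileage_rate   <- 0.575      # 2015 federal mileage rate, $/mile
avg_mph        <- 60         # assumed average driving speed

net_savings <- function(miles_one_way, hourly_value) {
  round_trip <- 2 * miles_one_way
  drive_cost <- round_trip * mileage_rate
  time_cost  <- (round_trip / avg_mph) * hourly_value
  annual_savings - trips_per_year * (drive_cost + time_cost)
}

miles <- seq(0, 12, by = 0.5)
plot(miles, net_savings(miles, 10), type = "l",
     xlab = "One-way miles to Missouri", ylab = "Net annual savings ($)")
lines(miles, net_savings(miles, 25), lty = 2)
lines(miles, net_savings(miles, 50), lty = 3)
abline(h = 0, col = "gray")  # break-even line
```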

The following chart accounts for the cost of time and driving over a year, and reduces the $670 by those costs.  The horizontal (X) axis is the number of miles to Missouri.  

You see that the cost savings shrink quickly as your distance to Missouri increases. Where each line crosses the $0 axis is the break-even point.  Which means, just to break even, here are the maximum miles to drive:



Note: the value of time here may sound high, but keep in mind that Johnson County is the richest in Kansas, with a lot of highly paid corporate employees, and sometimes more than one person in a family goes shopping.  Also, paying for convenience is a huge thing in JoCo; many people have cleaners, and they pay these kinds of rates to save their own time.

MAPPING 

Based on the numbers from above, we can map the break-even points for each group.  Keep in mind that even though these are break-even points, the amount of savings after accounting for costs is still very low.  The red areas are within the break-even distance.  Green areas are outside of the break-even driving distance.


The 4 mile limit (this includes 29% of Johnson County residents).  I think this is the most likely scenario when you include potential multiple shopping trips per week and people's value of time on the weekends:



The 6 mile limit (this includes 48% of Johnson County residents):




Finally the 9 mile limit (this includes 70% of Johnson County residents):






CONCLUSION

Though I may consider making larger purchases on the Missouri side, I doubt that I will be driving to Missouri for groceries.  Also, with the focus on convenience, and relative affluence of Johnson County, I doubt many other Johnson County residents will be driving for groceries, either.





Thursday, June 25, 2015

Fitness Week #4: Garmin Vivo Fit 2 Review

A few months ago I started tracking my fitness using Google Fit on my phone.  It wasn't ideal, but it was good for a while. Soon I bought a Garmin Vivo Fit 2, after reviewing several products and determining what I wanted.  After about three months, I'm generally happy with the product.  Here's my review.

I'm not the average user.  What I want in a fitness tracker is this: accurate, complete data.  In long form:
My fitness tracker would capture every step and moment of sleep, and have a "dump" button that dumps everything it tracks into tabular form, where my ETLs can pick it up and write what I need into a database.  
I haven't found a fitness tracker that will do exactly that, but I found one that I thought was pretty good.  Here's a picture and some pros and cons:



PROS

  • Always on.  Only requires charging about once a year, and it goes on my wrist, so I literally never have to take it off.  My wife uses a different tracker, one that you wear on clothing and that requires weekly charging, so if she were modeling her data, she'd be annoyed by that built-in gap.
  • Tracks sleep.  Not all fitness trackers track sleep, but this one does; fairly accurately from what I can tell.  I like it because it allows me to work on variance reduction in my sleeping patterns.  I've also found that hours of sleep is predictive in a few of my models.
  • Easy to sync.  All I have to do is push a button for a few seconds, and it auto syncs to my phone.  Easy.
  • Data is accurate.  I've done a couple of tests, one wearing a manual pedometer, the other just knowing my stride length and how far I've run.  Generally, the Garmin Vivo Fit 2 is within 3% of actual steps taken.

CONS

  • Data export is imperfect.  To create my database of activity for analysis, I have to create spreadsheets and import them, but the Garmin tool doesn't create great data (sometimes using days of week rather than actual dates, and not giving me a "dump all" option, instead forcing me to export one- or four-week blocks); see the cleanup sketch after this list.
  • Doesn't count elevation.  My wife's tracker counts elevation in terms of flights of stairs.  Because my runs often include a big "hill" component, this would be nice for me.
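For anyone curious what that cleanup looks like in practice, here's a rough sketch of the kind of fix-up I mean. The column names and the week-block layout are made up for illustration; the real export varies by block size:

```r
# Hypothetical cleanup of an export block labeled by day of week instead of
# actual dates. `DayOfWeek` and the file name are illustrative assumptions.
fix_block_dates <- function(block, week_start_date) {
  day_offset <- match(block$DayOfWeek,
                      c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) - 1
  block$Date <- as.Date(week_start_date) + day_offset
  block
}

# Usage sketch: stamp real dates onto a one-week export block.
# wk <- read.csv("garmin_week_export.csv", stringsAsFactors = FALSE)
# wk <- fix_block_dates(wk, week_start_date = "2015-06-22")
```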

CONCLUSION

I really like this device, and I'll continue using it.  I think in the future these devices may better integrate with my homebrew data systems, but for now, it is at least a working, accurate system.

Wednesday, June 24, 2015

Fitness Week #3: Cumulative Fatigue

A couple of weeks ago I started feeling kind of bad.  Tired. Run down.  Not wanting to work out.  Not wanting to go to work.  Etc.  What was going on?  The impacts quickly showed up in my fit tracker data over the course of a week.

FITNESS DATA

This chart really tells the whole story...

The chart shows a few things:

  • My step totals on weekdays have generally increased over time.
  • Each week, I generally see a few days over and under the trend line.  
  • On 6/4 I had my lowest step total since 4/2.  
  • I then proceeded to have five consecutive weekdays below the trend line.
This was all a bit bizarre to me, as I reviewed my data on 6/11 at mid-day.  But then I started thinking about what had been going on lately:
  • Slowly increasing my intensity and overall amount of exercise.
  • I had run at least four miles, every day, since some time in February* (see note below).
  • I hadn't really taken a break in about four months.
So, I took 6/11 off.  No running.  And guess what?  The next day I felt a lot better. I hypothesized that my activity level over time had reached a point of cumulative fatigue, where my body was just worn out and needed a day off.  But could I have prevented that?

USING THE DATA


A few years ago, I read a book that talked about the impact of cumulative fatigue.  Basically there is an advantage of running on tired legs over time, but too much running on tired legs ends with the body breaking down. Fairly intuitive stuff, but there is some disagreement in the literature ranging from people who see little risk in cumulative fatigue to those who see a lot of risk and believe that running less can be inherently better.

The expert opinions are fine and all, but what can I do to keep myself from having fatigue induced down weeks?  I'm taking a multi-pronged approach:


  • Forcing myself to take one day every two weeks off from running.
  • Developing a metric that tracks my aggregate fatigue load in recent periods (past two weeks), then compares it to prior historical trends to spot possible issues.** (see note 2 below)
  • Training a model based on the above derived metric, in order to predict days where I may be approaching a fatigue issue, and should take a day off.
I'll post more in the future on this, but generally, I'm taking the short term step of forced days off, but hope to have a model to predict "needed days off" in the near future.




*I may give some more background on my running "habit" at a later time.  But the gist of it is: I run 40-50 miles every week, but don't run races.  I used to run races, but generally don't have time anymore. So I just run for exercise.  It's good fun.

**The metric I'm playing with is aggregate steps over the period, "penalized" for variance.  Variance is generally good for cumulative fatigue, because a single low-mileage day can *theoretically* allow healing from many high-mileage days.
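To make that footnote concrete, here's a rough sketch of the metric. The 14-day window matches the "past two weeks" idea above; the penalty weight is a made-up placeholder I would have to tune:

```r
# Cumulative-fatigue load: total steps over the trailing two weeks, reduced
# ("penalized") when those days are highly variable, since a single low day
# in the window theoretically allows some recovery.
fatigue_load <- function(steps, window = 14, penalty = 2) {
  sapply(seq_along(steps), function(i) {
    if (i < window) return(NA)           # not enough history yet
    recent <- steps[(i - window + 1):i]
    sum(recent) - penalty * sd(recent)   # higher variance lowers the load
  })
}

# Usage sketch: flag days where the load is high versus my own history.
# load <- fatigue_load(fit_data$steps)
# flag <- load > quantile(load, 0.9, na.rm = TRUE)
```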

Tuesday, June 23, 2015

Fitness Week #2: Targeting

Yesterday an article on fitness trackers popped up on my Facebook timeline, found here.  I thought it was an interesting article on people's interaction with wearable fitness technology, even if I disagree with some of the negative sentiments.  Specifically the following passage:

Perhaps more alarming, many felt under pressure to reach their daily targets (79%) and that their daily routines were controlled by Fitbit (59%). Add to this that almost 30% felt that Fitbit was an enemy and made them feel guilty, and suddenly this technology doesn’t seem so perfect.

My specific disagreement here is with the notion that feeling "under pressure" is a bad thing.  On the contrary, I like to feel that I'm setting goals that take me far out of the comfort zone, toward the realm of accomplishment.  But that's not the point of this blog.

Here's what is: How much am I ruled by the targets in my fitness tracker?

BACKGROUND

So how might targets affect my behavior using a fitness tracker?  In two general ways that I can think of:
  • Psychological targets: I set internal goals, just wanting to hit 20,000; 30,000; 40,000 steps for the day, etc.
  • Vivo Fit targets: My Garmin Vivo Fit sets a dynamic target for me each day that grows as I become more active.  It started at a low default (6,500) but has grown to about 16,500, which is where it has been for about the last month.

You can read about my prior models of my fitness data here and at other prior posts, but if I want to measure the impact of targets, I may want to control for some meta factors, specifically day of the week.  This graph gives a reasonable view of day of week variation.


So this proves that weekends are significantly different from weekdays, to the tune of 6,000-8,000 steps.

Side note:   My Thursdays especially suck, which isn't particularly surprising.  On Thursdays I'm generally exhausted from the week, and I have a day packed full of sit-in-place meetings at work.

What does this mean?  Let me develop a hypothesis: because I'm closer to the Vivo Fit goals on weekdays (due to less time, more office time, etc.), I'm likely more driven to hit those numbers, whereas on weekends when I have more slack (and know I'll easily hit the Vivo Fit goals) I'm more likely driven by psychological thresholds.  Let's see what the numbers show...


WEEKDAYS

I hypothesized that during the week, my behavior would track to the Vivo Fit targets, and thus my numbers would hover around those targets.  How do I prove that though?  

See graph below.  Over the most recent time period, my target has been between 16,000 and 18,000, whereas previously my goal was much lower.  The chart below shows an obvious impact: as my target shifted, my behavior shifted.  Though I'm not walking a ton more steps each day (17,440 versus 18,450; t-test p-value = .26), I'm far less likely (10% versus 25%) to have a "low step day."
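For reference, the comparison above is just a two-sample t-test on weekday step counts before and after the target moved up. A minimal sketch, assuming a `fit_data` frame with daily steps, a weekday flag, and a flag for whether the Vivo Fit target was in the higher range that day (the 16,000 cutoff for a "low step day" is a placeholder):

```r
# Weekday steps before vs. after the Vivo Fit target rose to ~16-18K.
weekday_before <- fit_data$steps[fit_data$weekday & !fit_data$high_target]
weekday_after  <- fit_data$steps[fit_data$weekday &  fit_data$high_target]

t.test(weekday_after, weekday_before)  # difference in mean weekday steps
mean(weekday_before < 16000)           # share of "low step days" before...
mean(weekday_after  < 16000)           # ...and after (placeholder cutoff)
```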


WEEKENDS

My hypothesis states that for weekends I would see bumps at psychological thresholds, and this is fairly easy to see in the data.  Two major points stick out:


  • 20,000: A huge bump here, indicates me just trying to get to the 20K mark on a weekend day.
  • 40,000: I've never had between 34,000 and 40,000 steps, yet I have two days at 40K.  To be honest, the data is much more obvious here, because those two days are 40,091 and 40,125.  Each barely over the mark, indicating me trying to hit a target (I remember walking in circles to hit one; please don't judge).



CONCLUSION

I had a strong reason to believe that since getting my Vivo Fit 2, my life had been largely about targeting specific numbers.  I now have fairly strong evidence to prove this.  Will this change my behavior?  Probably not, but hey, at least I've quantified my quantification of fitness.

Monday, June 22, 2015

Fitness Week #1: Distribution and Descriptives

I've been promising for a few weeks an update on fitness tracking results, then not delivering.  I'm going to fix that today.  In fact, I'm going to take this entire week to fix that, starting my first ever "Fitness Week" - which is actually just me taking a very nerdy look at numbers.

Here are the posts I'm laying out for this week:

Today: This post.  Just general update on what my fitness tracker has been tracking.
Tuesday: On "targeting" and its distributional impacts.
Wednesday: Aggregate fatigue, what happened to me three weeks ago.
Thursday: Product review of my fitness tracker (Garmin Vivo Fit 2)
Friday: My updated model of fitness data.


SUMMARY 

Let's start today with some easy data (it is Monday after all).  How active am I per my fitness tracker?  Here's the summary of my steps per day data:


I know, it's ugly and pulled straight from Excel, but it shows what we need.  Just some observations on steps:
  • Based on some research, the average American takes only 5,900 steps a day (seriously?), so I am at least three times the average.
  • My original goal was 20,000 step average, so, goal achieved!
  • The mean is above the median, verifying our skew number.  Also (for non-stats people) this basically means that though I average over 20,000 steps a day, I don't get 20,000 steps on my average days.  (fairly important for goal setting)
  • My sleep numbers look less skewed, and the average is in a good range: 7-8 hours of sleep a night.

DISTRIBUTION

But what does this actually look like? First let's look at steps. Here's a cumulative distribution graph that I like to put together when analyzing relative frequencies.  
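If you want to build one of these yourself, the cumulative distribution is a one-liner in R. A quick sketch, again assuming a `fit_data` frame of daily observations:

```r
# Cumulative distribution of daily steps (fit_data$steps is assumed).
plot(ecdf(fit_data$steps),
     main = "Cumulative distribution of daily steps",
     xlab = "Steps per day", ylab = "Proportion of days at or below")

# Handy companions: what share of days fall in a given band?
mean(fit_data$steps >= 16000 & fit_data$steps <= 22000)
mean(fit_data$steps > 22000)
```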



The distribution is obviously not normal, but in my mind it's not that unexpected.

  • 65% of the time, I get between 16,000 and 22,000 steps.  Further analysis shows I should consider this my "average weekday."
  • 10% of the time, I get between 10,000 and 16,000 steps.  These appear to be low outlier weekdays, generally when I'm especially tired or busy.
  • 25% of the time, I get over 22,000 steps.  These are generally weekend days.
Now let's look at sleep:
  • My sleep patterns are fairly normally distributed, which is amazing, because (even as a statistician) I RARELY see anything with this normal of a distribution.  
  • The variance is a bit disturbing.  Sure most nights I'm between 6.5 and  8 hours, but there are still quite a few nights where I'm below 6 or above 8.  I'm going to review the literature on sleep to see what's going on there.  


CONCLUSION

So, I'm meeting my goals on steps which makes me very happy. My sleeping hours are normally distributed, but I'm not convinced this is evidence that I am sleeping "correctly."  I'll spend the rest of the week modeling and diving into these numbers more deeply.  


Wednesday, June 17, 2015

The Briefcase: A Weird Reality Show

I was wrong in a tweet.  No really, I was wrong on the internet, AND I am admitting it.  Here's how it all started.

A few weeks ago my wife sent a link to an article about a new show, "The Briefcase."  I can't find the article she sent me, but here's a link to a similar article.

A quick summation: the show gives poor people money, and they get to make some decisions about helping people in need or keeping the money.  It's a fascinating concept, but the press was dubbing it "poverty porn" and claiming that it exploits poverty.

THE TWEET

So a few nights ago, I watched the show, and had a bit of a different reaction than the press.  I've talked on this blog about being relatively well off but also very cheap.  As such I drive a 2010 Honda Fit because it's cheap, efficient, and does what I need it to do.  I saw that the people on the show were driving nicer vehicles than me... which led to this tweet:


I thought I had expressed myself fairly succinctly, in a grumpy, cheap, old man kind of way.  Then the producer of the show responded:


So, I thought about this for a couple of hours and realized some things:

  1. He's right.
  2. The show ISN'T poverty porn in a general sense.  This is distinctly something else.
  3. The media got this one really wrong, which indicates a lack of understanding of poverty.

THE GAME

So, some background, you could read elsewhere, but the quick summary:
  • A "poor" family is given a briefcase with $100K.
  • They are also given information about another "poor" family; this information increases throughout the show.
  • They have to choose how much money to keep for themselves, how much to give to the other family.
  • The family is completely blind to the fact that the other family is going through the exact same experiment, looking back at the original family.
So, what does the output of this game look like?  The players believe their "win" will simply be the percent of the $100K that they choose to keep, represented by this simple graph:


In reality, their "win" will be what they keep, plus what the other family gives them (this is the blind part of the game that they are unaware of).  This is represented by a more complex graph (percent the family chooses to keep on the bottom, percent the other family chooses to give at right):


So what does the output of this game give us? It's actually fairly simple and doesn't require fancy charts or equilibrium calculations.  

Some people may assume that this is a prisoner's dilemma problem, as it is a blind, two player game.  But it's not.  At all.  The reason?  There is no interdependence of results.  I will get the amount I choose to keep, no matter what the other person decides to do.  Thus the rational solution is to keep all the money.  But what about the other couple?
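To see why there's no interdependence, here's a tiny sketch of the payoff grid. The fractions are illustrative; the point is that within any column (a fixed gift from the other family), keeping everything is always best:

```r
# One family's "win" = what they keep of their $100K + what the other family gives.
payoff <- function(keep_frac, other_give_frac) {
  100000 * keep_frac + 100000 * other_give_frac
}

keep <- seq(0, 1, by = 0.25)
give <- seq(0, 1, by = 0.25)
grid <- outer(keep, give, payoff)
dimnames(grid) <- list(paste0("keep=", keep), paste0("other gives=", give))
grid
# Within every column, the best row is keep=1: keeping it all is dominant,
# which is why this isn't a prisoner's dilemma.
```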

Two main factors deter players from keeping the money for themselves, thus undermining rational action:
  • Guilt over the plight of the other family (they need it worse).
  • Fear of the optics of keeping all the money on public television (looking bad).

WHY THIS ISN'T POVERTY PORN

So, back to the episode I watched, and why my initial tweet was wrong.  It's fairly simple:

The people featured on this show aren't poor, and are generally better off than the median American.  On the show I watched, the two families made $60K and $70K, compared to $51,900 for the median American household. (source: 538)
The families both had a lot of debt, but that debt seemed largely due to poor decision making, which means this is a story about highly leveraged middle class Americans.  This is IN FACT a story about the new American middle class.  Which leads us to our next point...

IF NOT POVERTY PORN, THEN WHAT?

Time to coin a phrase.  DECISION PORN.  Effectively this show is creating a spectacle of already over-leveraged middle-class Americans trying to make decisions to benefit their financial situations, weighed against those of another family (or, just the desire not to look selfish on TV).  And it is somewhat entertaining.

Am I writing off the plight of these families?  Somewhat, but keep in mind they are generally better off than average Americans.

The part of the show that struck me most as an analogy to middle-class Americans: upon seeing a potential financial windfall, the initial reaction of both families is to put themselves into a MORE leveraged position.


  • Family one: Wants to buy a boat.  Granted it's to start a new business, but it would still cost more than they would receive from the show, and it's a hugely speculative move.
  • Family two: Wants to adopt a foreign child.  Certainly sounds like a nice thing to do, but also is a huge immediate cost, and creates a high ongoing cost of raising a child.

Now for the big question, will I be watching the show in the future?  Probably not.  While the concept is fascinating, I get really uncomfortable watching other people make bad decisions, especially regarding large sums of money.  I think that watching this show would actually give me a good amount of anxiety.




Tuesday, June 16, 2015

Tuesday Jams: Dog Fashion Disco

Some days at work, I need to do some high energy coding, and I need to be a little ADD about it, and a whole lot of creative.  Enter another one of my favorite bands.

Dog Fashion Disco is a band that I liked for quite a while. Then they broke up.  Some members formed a more electronic, but still pretty cool band called Polkadot Cadaver.  They released a couple of albums with great names (Purgatory Dance Party and Last Call in Jonestown).  But now Dog Fashion Disco is back together, with a new album, which seems to be pretty good.

Some examples of their music.

First, a favorite of mine (over scenes from a great TV show):
And for our older readers, a cover of an older song:
And from the new album (moderately NSFW):

Monday, June 15, 2015

Up this week: No.. Really

At the beginning of last week I created this optimistic post about upcoming posts.  I honestly thought I would be creating these two posts in the upcoming week.  The results?

Failure.  Well, I don't consider it a failure actually, but I promised two blog posts that never came to fruition: a music one, and one on fitness tracking data.  Instead of posting on those relatively mundane topics, I created three more timely (and controversial) posts on the distributional impacts of tax policy in the state I live in (Kansas).  Post 1 - Post 2 - Post 3  

I also realized I had an  email that should probably become a blog entry, which became this entry on using R in production.  A good post? Yes, but not what I promised readers.

I think it was the right call, but this is just a promise to readers that I will be posting the two promised posts from last week, as well as at least one additional post.  Anything you'd like to see me post on?  Let me know at: leviabowles@gmail.com



Saturday, June 13, 2015

R in production diaries: sqlsave slowness

A few weeks ago, I received a text from a former colleague, here's what it read:

Do you do any bulk writes from R to a db?  Sqlsave is slow.  I'm considering writing a file and picking it up with another tool.

I knew exactly what he was talking about.  sqlSave() is a function in the RODBC package for writing from R to SQL databases.  I also knew that he was likely refactoring my code, or code that he had partially copied from me at one time.

THE SOLUTION 


Fortunately, a couple of years ago, I had migrated off of sqlSave() to another solution.  Here was my response to my colleague, in email:

Easier to type on here.
We don't have to do any bulk saves in production, everything there is a single transaction, so we're saving back 6-7 records at any time transactionally (times 100 tx's per minute, yes it's a lot at a time but not a lot at any one execution).

As for SQLSAVE. I got fed up with the command and no longer use it.  We have an insert stored procedure that we call from R.  Much faster.  Much more efficient and easier.

We basically call an ODBC execute command wrapper, then just execute the proc with parameters to insert our data.

I don't know how this would work from your point of view, but you could (sans proc) in theory, create a text string in R that is essentially the bulk insert you want to do, and execute it AS SQL.  Which I think is the key... send the command, and let the insert be handled by the ODBC driver as a native ODBC SQL command, not as a hybrid weirdo SQLSAVE.
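For anyone who wants to try that last suggestion, here's a rough sketch of what it looks like with RODBC. The DSN, table, and column names are placeholders, and this toy version only handles numeric columns:

```r
# Build the INSERT yourself and let the ODBC driver run it as plain SQL,
# instead of going through sqlSave(). Names below are placeholders.
library(RODBC)

ch <- odbcConnect("my_dsn")

write_rows <- function(ch, df) {
  values <- apply(df, 1, function(r) paste0("(", paste(r, collapse = ", "), ")"))
  qry <- paste0("INSERT INTO my_table (col1, col2, col3) VALUES ",
                paste(values, collapse = ", "))
  sqlQuery(ch, qry)   # executed as a native SQL command
}

# write_rows(ch, my_data_frame)
odbcClose(ch)
```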

CONCLUSION

  • sqlSave() runs slowly in production jobs.  I'm not sure why, but I would guess it is in a general class of problems I call "meta-data shenanigans."
  • Using SQL directly to write back to the database in some way is generally a faster solution.
I received a thank you text yesterday from my former colleague, he refactored and his job runs faster.  All is right in the world again.

Friday, June 12, 2015

Tax Analysis Part 3: House Tax Plan June 12th

This morning my Twitter/Facebook feeds were littered with friends very upset about the Kansas legislature.  Again.  Essentially the story goes like this:
  • The Kansas House passed what is described as the "largest tax increase" in Kansas history.
  • It was passed at 4am, after putting massive pressure on a few legislators to change their votes.
  • It is mainly a change to sales tax, which is described as making Kansas one of the most regressive states in the country.
  • It was passed after threats by the governor to massively cut university budgets, including university athletics.
That's the story people are telling at least.  But this isn't a political blog, this is a numbers blog.  What will the new tax plan do to average citizens?

METHODOLOGY

It's been a long week, and I'm tired, so not many words here.  If you want a methodological description look at this prior post.  If you want my disclaimers, and additional analysis of food sales taxes, look at this other prior post.

If you're not a tax nerd, skip to results.

Here are my general assumptions, given my reading of last night's bill:
  • Sales tax goes from 6.15% to 6.5% (slightly less of an increase than prior proposal)
  • No reduction of food sales tax (food taxed at 6.5%)
  • No significant change to most people's income tax rates.
These are my "new" general assumptions, and my prior posts contain other assumptions and disclaimers. Once again, let me know if you find any of these to be in error, or would like me to run the numbers under a different set of assumptions.

RESULTS

Once again, not a lot of words this morning, but of all the proposals, this is the most regressive option.  Major factors: food sales taxed at full rate and no change to highest bracket income taxes.

Here is an updated chart of the tax change.  The gray line indicates the shift in tax burden (measured by increased tax bill) by income level:



And my ever growing matrix:

CONCLUSION

Once again, the most regressive option was passed, especially with the removal of the food sales tax reduction. Whether this plan benefits you (err, is better for you than the other options) largely depends on your income level. And now we wait for the Kansas Senate.

Wednesday, June 10, 2015

Income Versus Sales Tax Pt. 2: Exempted Food

This is a followup to yesterday's post analyzing the different impact between raising income tax and raising sales tax.  There's a proposal to (in FY 2017) tax grocery purchases at a different rate than other sales-taxed items. 


METHODOLOGY

If you're not a tax wonk, or nerd, skip to the results. 

For long methodology and background, see yesterday's post.  This post focuses on the current plan to charge a lower (4.95%) sales tax on food, beginning in Fiscal Year 2017.  A few methodological notes:

  • Regular sales tax would stay at the 6.55% rate from yesterday.  
  • Food sales tax would fall to 4.95%.
  • The biggest estimate I had to make in this analysis is the portion of taxable purchases people in different income groups make on food.  I used USDA data, which shows (unsurprisingly) that households in the lowest income groups spend the highest percent of their income on food (~40% in the lowest quintile).  
  • I used the USDA data to model estimates of the portion of purchases that would be taxed at the lower food rate, by income group.
  • Once again, if anyone has any questions, or would like me to re-run the numbers under different assumptions, I'm happy to do that.  However, I believe this is an honest attempt to accurately represent the impacts of the tax change.

RESULTS

FIRST: Because food sales taxes make up a larger portion of low-income family expenditures, this is a massive move to a more "progressive" tax system.  To see it, compare yesterday's sales tax change (all purchases from 6.15% to 6.55%), where the lowest income groups saw a 5%-plus tax bill increase:

To today's (general purchases from 6.15 to 6.55, food from 6.15 to 4.95) where lowest income groups see a tax cut:

This shows that lowering the food sales tax significantly reduces the additional tax on lower income groups, and is in fact about a 4% cut to their overall tax burden.


SECOND:  Why does this have such a massive impact on effective sales tax rates?  Simple: because the downward change to the food sales tax is three times the magnitude of the upward change to the general sales tax.  This has an interesting consequence: if you make more than a quarter of your sales-taxable purchases on food, this is a net tax cut.
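That quarter-of-purchases threshold falls straight out of the two rate changes. A quick check:

```r
# Change in sales tax per dollar of taxable spending, where f is the share of
# that spending on food (rates from the bill: 6.15% -> 6.55% general,
# 6.15% -> 4.95% food).
delta_per_dollar <- function(f) (1 - f) * (0.0655 - 0.0615) + f * (0.0495 - 0.0615)

delta_per_dollar(c(0.10, 0.25, 0.40))    #  0.0024  0.0000 -0.0024
uniroot(delta_per_dollar, c(0, 1))$root  # break-even food share: 0.25
```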


And once again for you nerds who like charts:

CONCLUSION

Two easy takeaways here.  First, moving the food sales tax lower is a big move toward a more progressive sales tax.  Second, even with the general rate rising from 6.15 to 6.55, reducing the food sales tax makes this a net tax cut for low-income families.

DISCLAIMERS:
  • There are a lot of things in the current bill, and at this moment (9pm, June 10th) no one knows how it will end up, and some could impact this analysis, including ending certain sales tax exemptions, and ending the low income food sales tax credit.
  • I'm really cheap.  Ask my wife.  Will not spend money.  Likes to save.  As such, I have probably erred on the conservative side with spending assumptions.  I don't think this materially or directionally impacts my analysis, however the numbers will vary with different household spending habits.  In short, if you generally spend more on non-food items, this costs you more.  That 25% number is key.
  • You may notice that after accounting for food tax, only higher income households see a tax increase, and that increase is moderate.  Two things: first, this is generally in line with revenue estimates I have seen.  Second, this analysis is specific to four-person families living under certain conditions.  Households with fewer people and a lower percentage of food expenditures will see more of a tax increase.

Tuesday, June 9, 2015

Impact of Sales Versus Income Tax: Some Numbers

If you follow the Kansas legislature like I have recently, you've seen many debates on the correct way to increase taxes.  The debate has risen from a $400 million budget hole, leading to the longest legislative session in state history in search of a solution.

The tax policy debate centers on which mechanism to use to raise taxes: Sales or Income taxes. Generally the sides are:
  • Liberals: Because sales taxes are regressive and harm those who can afford it least, we should raise the top income rate.
  • Conservatives: Because consumption based taxes are economically superior to income taxes, we should raise the sales tax.
So, how would these tax changes actually impact families at different income levels?

METHODOLOGY

For this methodology I am making estimates of how a sales tax versus income tax impacts people in different income brackets in Kansas.  A few assumptions need to be made:
  • For sales tax, I'm using the assumption of a rate raise from 6.15% to 6.55%.
  • For income tax, I'm assuming a change in the top rate from the current 4.8% back to the 2012 level of 6.25%.
  • I used a four person family, married filing jointly, as our example for all cases.
  • There are additional downstream economic impacts of any tax policy, which are real and largely occur over time, but which do not significantly impact this analysis.  I've largely ignored these effects, including reductions in consumption, business spending, and multiplier effects.
  • These tax changes don't produce the same revenue for the State, but they have been offered as competing ideas, so I consider the comparison largely valid.
  • Sales tax is regressive largely because high income families spend a smaller percentage of their income on sales-taxable items.  It's difficult to know what that actually looks like, but I used a couple of sources to estimate a curve (see plot below). (source 1) (source 2)  Also, if anyone has Kansas-specific data, I'd be happy to re-run the numbers.  A rough sketch of the comparison mechanics follows this list.
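To make the mechanics of the comparison concrete, here's a stripped-down sketch. The sales rates are the ones above; the taxable-spend share and the income tax bracket threshold are crude placeholders (the real analysis uses the estimated spending curve and the actual Kansas brackets), so treat it as the shape of the calculation rather than the numbers behind the charts:

```r
# Increase in annual tax bill under each option, for a hypothetical household.
# spend_share and top_bracket_start are placeholders, not my estimates.
compare_options <- function(income, spend_share = 0.35, top_bracket_start = 30000) {
  taxable_spend <- income * spend_share
  sales_increase  <- taxable_spend * (0.0655 - 0.0615)  # everyone pays a bit more
  income_increase <- max(0, income - top_bracket_start) *
                     (0.0625 - 0.048)                   # only income in the top bracket
  c(sales_increase = sales_increase, income_increase = income_increase)
}

t(sapply(c(25000, 50000, 100000, 200000), compare_options))
```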





RESULTS


First a preface on terminology:

  • Total Tax Bill: State only, sales tax + income tax paid.
  • New Total Tax Bill: State only, sales tax + income tax paid, after proposed change.
  • Effective Tax Rate: effective rate of taxes, Total Tax Bill divided by Income.
  • Tax % Increase: Percent change in Total Tax Bill.
  • Tax Rate Increase: Percentage point change in Effective Tax Rate.

CURRENT STATE
A look at the current tax model shows a flat, but mildly progressive, tax structure.  People generally pay between 3.5% and 5.3% of total income in state sales and income taxes.

SALES TAX OPTION
With the sales tax option, everyone's tax liability increases slightly. However, if we calculate the % change, the poorest Kansans will pay 6% ($40) more while richer Kansans will pay only 1.5% ($160) more than they currently do.  The effective rate spread moves to 3.7%-5.4%.  This would be considered a move towards a regressive tax, because the lowest earners see the highest % increase.



INCOME TAX OPTION
For the income tax option, we only changed the rate on the highest earners, back to 2012 values.  For this option, we see no change for household incomes less than $50,000.  We do however see a large % change for higher earning households, with $200K+ households seeing a 21% increase in Total Tax Bill YoY.  The new rate spread moves to 3.5%-6.4%, a very progressive move.


And for those of you who prefer to see all of the numbers in a convenient chart, with a bit more information:

CONCLUSION

Overall, a change to the sales tax will impact low-income Kansans much more than a change to the income tax, especially if the income tax change only applies to the top tier.  A change to the income tax system like this, however, has the potential to significantly and quickly increase the tax rate on higher income Kansans.

I am not taking a side in this (in my tax bracket, I know which solution benefits me), but I hope that this analysis at least contributes to the discussion and understanding of tax policies.  Feel free to direct message me or comment on this blog if you have any questions or methodological concerns.



Monday, June 8, 2015

What's Next?!

Quite a bit of blogging over the past couple of weeks, but I wanted to give a road map of where we're going this week, and also solicit some user feedback. 

First, the request for user feedback:

This blog has been around for almost six months, and gets quite a bit of traffic from various sources including twitter, reddit, dailykos, and organic search.  It has branched in several directions, and if we based our future on prior traffic, we would create new posts around Kansas Political Issues and Data Science Toolkits.

But we want to be sensitive to our users' real needs and wants.  So, what would you like from us?  If you have ideas, you can comment on this post or email to this address.  For starters here are some examples of things we're considering looking at:

  • Releasing player-level fantasy football ratings for 2015, based on our models from prior years that had good results.  This would be along the same vein as our 2015 pre-season predictions.
  • Continuing our analysis of Kansas education funding found here.  We were kind of burned out with looking for data after this one, but this still seems to be relevant.  Also, we have some overall methodological concerns with the 2006 study.
  • Analysis of the tax impact between sales and income tax?  What is the effective change for citizens with varying levels of income?
  • An update on how R in production is going.  This would include server traffic, timing stats, and a summary of issues and resolutions we have encountered.  
  • A case study in how we create production-ified machine learning models, from start to finish.
  • Early 2016 election predictions?
  • More sports?
  • More politics?
  • etc?

And for a preview, here's what is coming up later this week (for sure):
  1. Tuesday Jams: (of course, we have to have music) I don't want to spoil it, but the band's initials are DFD.
  2. Cumulative Fatigue: This is in our series on modeling fitness tracker data.  After weeks of ramping up activity level using my Garmin fitness tracker (read: no days where I ran less than four miles since April 1), I hit the wall and my body fell apart.  But can we predict when this will happen, both to prevent it and to use it in training plans?

Friday, June 5, 2015

Florida Hospital Deaths May Not Be Systematic

This afternoon I was confronted with two news stories regarding a hospital that has what is being reported as a disproportionately high mortality rate during infant heart surgery.  Obviously this is a sad story, but because it's about a statistical anomaly, it is interesting to me.

BACKGROUND

Specifically, I wanted to know: is this a case of something systematic occurring (doctors bad at surgery, something nefarious, etc.), or is it possible that this is just a "statistical anomaly"... in essence, is this just an outlier hospital?

Here are two news stories for background (the feds are investigating this now):

Story 1 and Story 2

METHOD

From a statistical standpoint I can't prove that something systematic isn't occurring, but I can speak to the likelihood of this hospital's death rate being a random outlier, effectively due to "sampling error."  Here is the data I was able to glean from the articles:


  • The national average death rate for these surgeries is 3.3%
  • The average at this hospital is 12.5%.  They cited 8 deaths, so I can calculate that our "n" is 64.

The question here is: Is the St. Mary's hospital death rate significantly different from the national average?  Because this is just testing one sample proportion against the population, I used the binomial exact test.  Here's my R output:
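For reference, the call that produces it is a single line, using the counts gleaned above (8 deaths out of 64, against the 3.3% national rate):

```r
# One-sided exact binomial test: is this hospital's death rate higher than
# the national rate of 3.3%?
binom.test(x = 8, n = 64, p = 0.033, alternative = "greater")
```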


CONCLUSION

The important piece here is the p-value which is approximately 0.0012.  That means about a 1.2 in 1000 chance of this difference being due to random chance.  So this is obviously unexpected... but is it really?

The problem with this logic is that there may be a few thousand hospitals that perform this type of surgery.  And if each has a 1.2 in 1,000 chance of having a mortality rate of 12.5% or greater, then it's quite likely that at least one would, purely by chance.

In short summary, we know this is an instance of a very unlikely data point.  However, given the number of hospitals that perform this type of surgery, it is possible that this is due to statistical sampling.

Thursday, June 4, 2015

Normalize Your Data!

Hey, a non-football related post!  

So occasionally I see someone trying to make an argument with data and get really annoyed.  Generally this annoyance comes from people using an inappropriate metric to make their point.  Two examples from the last week:

  • Someone arguing that United States has an issue, because more people are murdered here than in other countries.
  • Someone arguing Kansas may have an issue, because we spend less than the national average on education.
Both arguments failed with me because they didn't properly normalize data.  I'll look at both of them in depth. 


ALCOHOL CONSUMPTION

So, my first annoyance was related to someone claiming that America has more murders than other countries.  That may be true, but the graph backing up the claim was bogus, because it looked at aggregate murders and all of the comparison countries were much smaller than the United States.  What drives aggregate murder counts?  Population.

To demonstrate how this kind of analysis can lead people to make massively false claims, I'll start with my own bogus claim and back it up with data (on a funner subject, nonetheless):

Americans have a drinking problem because they drink much more than Eastern European countries. 
And a graph to back it up:


Wow!  We drink almost three times as much as the Germans!  THE GERMANS!!!! 

(Insert picture of Oktoberfest here for effect).


But not really.  To analyze how individuals are impacted by aggregate numbers you have to normalize for population.  It's an easy calculation, just dividing the aggregate total by the population.

Here's a more accurate view of per capita beer consumption by country, comparing the United States to some other high-consumers of beer.  







I have no idea what is up with Czech Republic, I'm guessing they really like their Pilsner Urquell.

This may all seem trivial, but real decisions are made on these types of numbers, and if policy makers are led to believe that the first chart is accurate, then policy decisions are made to combat a problem that doesn't exist.  It could be a big deal, and my next analysis demonstrates a more likely scenario.

EDUCATION FUNDING

Late last night a tweet from a journalist popped up on my feed.  Here it is:

I had two thoughts on this:

  • Does "turn it inside out" mean that he thinks the Kansas Legislature will try to lie with the facts?
  • Or is he asking people in the know to look at the numbers and see what they can?
Either way, I thought I would dig into the data.  I found two problems with comparing states to a national mean or average of states:
  1. The data was not normalized for factors that impact the cost of doing business in a State, specifically, cost of living.
  2. Because of that failure to normalize this data and other distributional aspects, the data was likely skewed in a way that would drive up the mean.

As a result, I needed to make a couple of normalizations to the analysis: first, compensating for distributional skew by looking at Kansas's ranking against other states rather than its distance from the mean; then adjusting for cost of living (as an imperfect proxy for cost differentials). 
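A minimal sketch of those two adjustments, assuming a data frame of per-pupil spending and a cost-of-living index by state (the column and file names are mine, not the actual source data):

```r
# Rank states on per-pupil spending, then re-rank after dividing by a
# cost-of-living index. Column names are illustrative assumptions.
# spend <- read.csv("state_spending.csv")   # state, per_pupil, col_index

spend$raw_rank      <- rank(-spend$per_pupil)                    # 1 = highest spender
spend$adj_per_pupil <- spend$per_pupil / (spend$col_index / 100) # COL-adjusted dollars
spend$adj_rank      <- rank(-spend$adj_per_pupil)

spend[spend$state == "Kansas", c("raw_rank", "adj_rank")]
```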

First the median chart.  It shows that Kansas is 24th out of 51 states (DC counts), so Kansas is effectively a median state.  Also, if you look at the shape of the distribution in this chart, you can see that high skew exists.





Cost of living isn't a perfect measure to normalize for cost differentials, but because the bulk of school costs are related to paying salaries, it works for this purpose.  So I normalized using a cost of living index, which shows that Kansas moves one place to 25th.  Obviously this is not a significant change in result, but if you look at other states they move around significantly.  Conclusion: Kansas is about in the middle, spending wise. 


A little unexpected that Wyoming moves up to be a top spending state, but if I had to guess, I would think it's because of a relatively low cost of living (after normalization) and poor economies of scale.

CONCLUSION

Normalization matters, because it allows us to control for the big factors that impact numbers like cost of living and population differences.

Nerdy Conclusion:  On our second analysis, some interesting nerdiness.  First, the original distribution has a skewness of .94 and a standard deviation of $3,207.  The correlation between cost of living and school spending is .66, which is huge, obviously (potentially endogenous, because good schools cost more, but that's a topic for a different day).  The normalized distribution reduces skewness to .31 and the standard deviation to about $2,100.


**Quick side note:  This post is intended only to speak about the issue of normalization, not the NORMATIVE issue of whether Americans should drink more beer, or Kansas should spend more or less on education.  A related post tackles the also non-normative question of whether spending matters.

Tuesday, June 2, 2015

NFL 2015 Predictions: Finale and Playoffs

Bringing my prediction analysis of the 2015 season to a close, finally, though this is generally a lot more fun than the data I usually look at.

PLAYOFFS

The projected playoff picture looks fairly similar to last year.  In the AFC especially, we simply swap in the Houston Texans, and pull out the Bengals.

In the NFC, things are a bit more complicated:  Lions, Cardinals, and Panthers are out, but the Saints, Eagles, and Falcons are in.

Overall, I hope to just get three out of six  teams correct in each conference (which shows my models were at least somewhat predictive for good teams over the season).



MAKING PICKS, PROBABILITY AGGREGATION

You may have noticed that my above chart includes three columns for each team.  The first is the output from my initial model, found here.  The last column is how many games the team won last year.  These columns are similar in many cases, though the projections are generally compressed towards the mean.

The middle column is a little more complex to explain.  It has to do with the difference between picking which team will win each week, versus how many games a team will win in a season.

Let's choose a bland, but explanatory example.  Let's say that my model predicts that a team has a 75% chance to win each of its games (it never does, each game has a different probability based on opponent and other factors, but this example works):  
  • Original Model: It seems redundant, but teams generally win 75% of the time when they have a projected 75% chance to win.  This means that if all games are projected at 75%, they would win 12 of their 16 games on average that season.
  • Picks Model: If I were a sportswriter who had to make picks each week about who will win each game, it may seem that I would pick the team 12 times, and their opponent 4 times.  But this isn't true: because the team is a 75% favorite in each game, I should pick them every time to assure my best chance of being right.  I know that they will in all likelihood lose 4 of their games, but I don't know which ones (without external knowledge).  The issue here is that although they are a favorite in each game, they also have a 25% chance to lose each, which aggregates into 4 likely losses.  
Now think about lesser probabilities of winning. For instance, a team could have a 51% chance to win each of its games.  In this case, it would be the rightful favorite in each game, but would only end up winning about half of them.
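The distinction is easy to see with a made-up vector of per-game win probabilities:

```r
# Expected season wins vs. "picks" wins. Probabilities are illustrative only.
p_win <- rep(0.75, 16)   # favored at 75% in every game

sum(p_win)               # expected wins: 12
sum(p_win > 0.5)         # picks model: picked to win all 16

p_win2 <- rep(0.51, 16)  # a slight favorite everywhere
sum(p_win2)              # expected wins: ~8.2
sum(p_win2 > 0.5)        # still picked to win all 16
```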

The picks model is valuable in a number of ways, especially in considering which teams have a potential to win a lot of games this year if they win in each game they are slightly favored (can they beat the odds?).  It also helps to explain why teams that seem to be better than all of their opponents end up with  6 or 7 losses.

Finally, consider the case of my Kansas City Chiefs. This is a team that I project will win 9 games, but that will be favored in 12.  As a fan it's important to understand that the Chiefs may look likely to win a majority of their games, but it may not happen, because those wins are not high-probability wins.

Tuesday Jams: What's in my bag?

A little change in direction for this week's music.  In fact, I'm not picking any music at all, I'm letting other people pick music for me (a meta music pick?).

If you've ever been to Amoeba Records in LA, you know it's an awesome record store.  One of the coolest things they do is their "What's in my bag" series.  For the series, they bring in musicians and other celebrities to show what records they're buying right now.  

It's a cool series that I like, and it helps me find a lot of new musicians and bands.  You can search YouTube for their videos, with celebrities like Gerald Casale, Bob Odenkirk, The Zombies, J Mascis, Dave Grohl, and Weird Al, but here are a few of my favorites.

Trevor Dunn (bassist for various awesome bands) picks out some weird music:


JG Thirlwell (from the band Foetus) picks out a variety of even weirder music than Trevor:


Deltron 3030 recorded a video. Dan the Automator spends his time picking out kitschy movies, while Del picks out various rock and hip-hop albums, which is a pretty good explanation of Deltron 3030 as a group:




Monday, June 1, 2015

NFL 2015 Predictions: NFC South

I saved the most exciting predictions for last!

Not really, but this answers the question: Will any NFC South team have a winning record this year?  
The answer, per the models, is YES, but not very winning.  One major difference between this division and others is that every team gets better or stays the same.  The Saints make the playoffs, which isn't too surprising if you follow this division.



NFL 2015 Predictions: NFC North

On to the NFC North...




It doesn't take a statistical model to determine that the team whose quarterback looks like this will continue to sit at the bottom of their division.

The model shows the Packers will continue to rule this division, but this year without competition from the Lions.  The model essentially shows the Lions getting lucky last year, and this year they will return to being a below average team, according to their fundamental statistics.

The models show the Vikings continuing to be the Vikings, of course their results are largely dependent on the play status of a certain child abuser.