Sunday, May 31, 2015

NFL 2015 Predictions: NFC East

Next up is the NFC East.  

Another division I don't actively follow, and nothing too interesting here anyway.  The Cowboys fall back to parity with the Eagles (who have a slight statistical edge, so I show them as division winners).

The Giants and Redskins each win two more games than last year, but don't gain enough to get in the playoffs.  

Editor's note on this Division.  The last two times I have played Fantasy Football, I have won my league fairly easily, both times with Tony Romo at quarterback.  I like Romo in fantasy football, because he is a middle level QB that I can wait a few rounds to pick up.  I don't like wasting early picks on quarterbacks.

NFL 2015 Predictions: NFC West

And now the NFC West.  Interesting stuff here.

The Seahawks essentially stay the same at 12-4 at the top of the division.  This is actually a huge prediction: because team quality tends to regress back toward the mean, it effectively means that, statistically speaking, the Seahawks were better than a 12-4 team last year.

Other teams in this division get worse, the Cardinals much worse.  The Cardinals fall from 11-5 to 6-10.  This effectively means that the Cardinals won 11 games last year on very weak underlying statistics.

As someone from Kansas City, it pains me to say this, but the Rams get a little better.  Fortunately, they still fail to make the playoffs.

Saturday, May 30, 2015

NFL 2015 Predictions: AFC North

Finally a division with massive shifts!  

Only one AFC North team is projected at the same position as last year, and that's just the Browns continuing to sit at the bottom.  In fact, the Browns get worse next year (meaning that their seven wins last year actually exceeded what we would expect from their playing statistics).

The Steelers are projected to win two fewer games, likely due to the "rapist penalty" in the model.  (Just kidding.  But still.  Looking at you, Roethlisberger.)

The Ravens move up to the top of the division (meaning their prior 10 wins was actually an under-performance against statistics) and the Bengals fall out of playoff contention.  

Friday, May 29, 2015

NFL 2015 Predictions: AFC South

And we're on to the next division: the AFC South.  To be honest I don't follow any of the teams in this division, so I'm a bit clueless about the results here.

This is the first division where we see a flip between teams, this time the 3rd and 4th place teams in the division.  Unfortunately this flip is between two teams that fared horribly last year, and still won't be playoff contenders.  

The models saw, however, that the Titans played significantly better than their two wins would indicate, so they are awarded a six-win projection for next year.  The Jaguars also performed better than their three wins, but not by as much relative to their record as the Titans' projected four-win gain.

The top of the division will be dominated by the same two teams.  The Texans may outperform the forecast especially if Arian Foster can stay healthy (ok, ok, I know ONE thing about this division).

NFL 2015 Predictions: AFC East

This morning I noticed my full league win/loss predictions were still on my desktop.  I thought, why don't I just release a division a day for a while?  

For those of you who didn't read my prior post on modeling NFL team results before the season, this is a followup to yesterday's post.


As any football fan knows, the New England Patriots have dominated this conference for the better part of a decade, but they may have to play with fully inflated balls next year.  Also my model has a hard time accounting for Tom Brady's potential four game suspension.

The Patriots came out of the model as a borderline 10/11 win team.  Due to the outstanding questions regarding the impact of deflategate, I didn't round up.  But they are clearly still the best team in the division.

Of note are the New York Jets, who are projected to win three more games this year than last, which is a pretty big statistical jump.  The main reason is that the Jets' underlying statistics show a team with the potential to play somewhat better football.


Finally, a methodological note on these predictions:  These predictions are a general estimate of how well or poorly a team will do next year, especially compared to their division.  Are they perfect estimates? No, for a variety of reasons I cited yesterday.  But some additional explanation:

The estimates tend to be variance reducing, meaning that they smooth out volatility.  An example of this is that in 2014 the best record in the NFL was 12-4, a record which five teams achieved.  However in my rankings only one or two teams will get this rating. This is intentional, as predictions like this tend to predict well for overall trends (e.g. the Denver Broncos will win more games than other AFC West teams), but miss teams that have exceptionally good or bad years.

Overall, the goal is to give an early prediction of how teams stack-up and are likely to perform.  Will there be teams that greatly outperform or under-perform this forecast?  Absolutely.  But this at least serves as a starting point.

Thursday, May 28, 2015

Early NFL Football Predictions: 2015

*non-nerds just skip to the bottom where I predict the Chiefs fail to win the AFC West.
** edit 5/29, I'm releasing a series on these models, the next post (on the AFC East) is found here.

After reading Nate Silver's book I experienced a bit of professional jealousy.  Not because Silver is more famous than I am, but because he gets to look at cooler data.

While I'm spending my days looking at customer and underwriting data, guys like Silver get to analyze political and baseball data all day.  I like my paycheck, so this is all fine by me, but hey, I can at least look at some sports data in my free time, right?

Problem is: I don't really like baseball.  Growing up in Kansas...  Well, just look at the play of the Royals from 1986-2013 to figure out why I don't like baseball.  But I do like football.


Open Question: can we create effective very-early predictive models on NFL data?  The data exists and it seems as though it would have some predictive power.  What if I set out to predict a team's overall record at the end of the NFL regular season (January 2016) before the season begins?

I can obviously create a model, and that model may be predictive, but will that forecast be any good?  Some distinct challenges of NFL data:

  • Small data set:  NFL teams only play 16 games a year (versus 162 for MLB, 82 for NBA).  Thus we don't have a ton of prior-year data to predict with, and both that data AND the results are somewhat volatile.
  • Very early predictions: We're in late May right now, and the teams haven't even started pre-season games. Some players aren't even signed to their teams.  
  • Injuries: Injuries play a huge part in NFL performance week to week.  Can I really make accurate predictions for the Denver Broncos if I think there's a >50% chance that Peyton Manning will miss 4 or more games?
I went forward creating a statistical model from game-level historical NFL data, going back to 2008.  It's a logistic model that predicts the probability that a team will win a game.  
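To make the setup concrete, here's a rough sketch of this kind of logistic model in Python (the features, data, and coefficients here are illustrative stand-ins, not the actual inputs to my model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative game-level features: prior-season point differential,
# turnover margin, and a home-field indicator (all stand-ins).
n = 500
X = np.column_stack([
    rng.normal(0, 7, n),    # avg point differential
    rng.normal(0, 1, n),    # turnover margin per game
    rng.integers(0, 2, n),  # home (1) / away (0)
])

# Simulate outcomes so that better fundamentals win more often.
logit = 0.15 * X[:, 0] + 0.5 * X[:, 1] + 0.4 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)

# Win probability for a strong team at home vs. a weak team on the road.
strong_home = model.predict_proba([[10.0, 1.0, 1]])[0, 1]
weak_away = model.predict_proba([[-10.0, -1.0, 0]])[0, 1]
```

The output of a model like this is a per-game win probability, which can then be summed across a team's 16 scheduled games to get an expected season win total.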


So, first question: is the model predictive?  Answer: yes.  A couple of ways to look at this.

First when the model predicts a 20% chance of a team winning a game, does that team win about 20% of the time?  The answer here is yes, and here's a graphical illustration:

Win Probability on Bottom, Actual Outcome Vertical
With a bit of volatility around very low and very high probability projections (largely due to low volume), the actual win % progresses upward as the projected probabilities move left to right (R code at the bottom of this post shows how I create these types of outputs).

But this is game by game; is the model predictive when aggregated by season?   A couple of ways to evaluate, against two different base metrics:

  1. Average model.  Does my model perform better than a model where we guess each team would just get the average 8 wins?  Answer: yes, it's about twice as predictive as this model.
  2. Static model. Does my model perform better than a model where each team gets the same number of wins this year over last year?  Yes, it's about 2.5 times more predictive than that model.


I know everyone is only interested in who will win the AFC West, but first a few observations from the model:
  • Because the NFL season is relatively low-N, and a few improbable results can shift a team's seasonal win total significantly, we find that NFL results are fairly volatile.
  • This is evidenced in the "Average Model" outperforming the "Static Model": in essence, in many cases it's better to guess that a team will "return to the mean" next season than that it will continue to perform as it did in prior seasons.
  • The model reflects these volatile results.  Total wins last year were much less predictive as a variable than metrics around football fundamentals.  Some metrics that outperformed prior-year wins (I specified these several different ways in models): home/away, average offensive yards, defensive yards allowed, point differentials, turnover ratios, points scored, and points allowed.

And finally, the moment you've been waiting for: the AFC West results.  

Basically, my Chiefs stay the same and don't win the division; all other teams return slightly toward the mean.  The Chiefs are a *bubble* playoff team.  Personally, I don't think they'll make it.

I do have these statistics for all other divisions, which I may post at a future date, if they are of interest.

Addendum: R code for prediction versus actual probability segmentation:
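A comparable segmentation can be sketched as follows (Python rather than R here, with simulated predictions standing in for real model output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated predicted win probabilities, with outcomes drawn to match
# them (i.e., a perfectly calibrated toy model).
pred = rng.uniform(0, 1, 2000)
actual = (rng.random(2000) < pred).astype(int)

# Bucket predictions into 10%-wide bins and compare the average
# predicted probability in each bin to the actual win rate.
bins = pd.cut(pred, bins=np.arange(0, 1.1, 0.1))
calib = (pd.DataFrame({"pred": pred, "actual": actual})
         .groupby(bins, observed=True)
         .mean())
```

For a well-calibrated model, the actual column climbs roughly in step with the predicted column from the lowest bin to the highest.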

Tuesday, May 26, 2015

Tuesday Jams: David Bowie

I listen to quite a few older David Bowie albums in the office.  I know the music is old, but Bowie was an innovator in popular music, and to be honest the music still sounds quite modern.  Also, the diversity of his early music (from metal, to rock, to pop, to electronica) seems to keep my mind fresh in the office and doesn't allow my mind to go stagnant.

I'll just post about a couple of my favorite albums.  First, The Man Who Sold The World.  Most people my age associate this album with the Nirvana Unplugged cover of the title track.  But that track is a weird Latin-sounding outlier on the album.  The other tracks are generally hard rock and early metal jams.  

Station to Station was released in early 1976, and is a bridge between his more conventional early 70's work and late 70's more bizarre albums with Brian Eno.  Weird enough for me to enjoy, but still conventional enough to have a little bit of mass appeal.  Of note is the track TVC-15, which sounds a bit like a BETTER and less rapey version of Robin Thicke's Blurred Lines.

Saturday, May 23, 2015

Cleveland Police Department Diversity, And Yet Another Metric

After yesterday's post on diversity indexes, I assumed I wouldn't be looking at diversity measurement again in the near future.  However, with today's news of another acquittal in a case of white police killing an unarmed African American, I thought some people might be interested in these statistics specific to Cleveland.

The case was similar to prior cases: essentially, two unarmed African Americans were killed by Officer Michael Brelo in Cleveland after a police chase.  Here's an article that covers more details than I care to cover here.


Yesterday, I used the metric of a "diversity index" to compare police departments to their constituencies (among other things).  The diversity index is calculated as an effective number of races, a metric borrowed from other areas of study.  Please see yesterday's post for a detailed methodology.

For the Cleveland police department, I was only able to find data broken down into three categories: black, white, other.  For consistency I aggregated general population data this way as well.  Here's what I found:

                  Cleveland Population   Cleveland Police
Black                    0.53                 0.27
White                    0.33                 0.67
Other                    0.14                 0.06
Diversity Index          2.41                 1.90

So 2.41 versus 1.90 on the diversity index (effective races) is a fairly large difference, but it isn't huge.  Compared to yesterday's look at the US Senate, at about 1.2 versus 2.2 for the US population as a whole, this is actually a mild deviation.  But this analysis suffers from the same issue as yesterday's analysis of Ferguson, MO and discussion of apartheid South Africa: in cases like these, the diversity index tends to massively understate the demographic differences.

When the majority is flip flopped between the general population and the police department as it is in these cases (see table above, African Americans make up the majority of the Cleveland Population, yet a small minority of the police force), we need another metric.


Enter the chi-squared goodness of fit test.  Which, BTW, "goodness of fit" is really what it's called.  I spent 20 minutes once convincing my boss that this was a real stats test.  No really.  This test is useful in determining whether a sample (the police department in this case) matches the categorical (race) distribution of the underlying general population.  For all of you math nerds, here's how the statistic is calculated:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

Where O is the observed units in a category, and E is the expected units in a category.

So, to implement, we calculate the expected count for each racial group and compare it to the observed count.  Here's the calculation from the data above:

                  Cleveland Population   Cleveland Police   Chi-Squared Statistic
Black                    0.53                 0.27                 201.15
White                    0.33                 0.67                 523.92
Other                    0.14                 0.06                  70.86
Diversity Index          2.41                 1.90          Total:  795.92
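The arithmetic behind that table can be sketched directly (Python; the department size of 1,500 officers is my assumption for converting proportions to counts, so the totals land near, but not exactly on, the figures above):

```python
# Proportions from the tables above.
pop = {"Black": 0.53, "White": 0.33, "Other": 0.14}
police = {"Black": 0.27, "White": 0.67, "Other": 0.06}

# Assumed department size (hypothetical), used only to turn
# proportions into observed/expected counts.
n_officers = 1500

# Per-category chi-squared contributions: (O - E)^2 / E.
contrib = {}
for race in pop:
    observed = police[race] * n_officers
    expected = pop[race] * n_officers
    contrib[race] = (observed - expected) ** 2 / expected

chi_sq = sum(contrib.values())
```

Against a chi-squared distribution with k − 1 = 2 degrees of freedom, anything above 13.8 clears the .001 p-value threshold, and this total clears it by a mile.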

But what do these chi-square values mean?
  • The total (795.9) is tested for significance against the chi-squared distribution at k-1 (2) degrees of freedom.  This is a massively significant test statistic, as the threshold for a .001 p-value is only 13.8. (NON-NERDS READ: HUGE DIFFERENCE HERE)
  • Each individual statistic is a measure of the variation in each group.  This can give us an idea of where the biggest statistical anomalies exist.  In this case, of course, it's that whites make up far more of the police force than the general population.
But really, what we learn from this case is that we don't need fancy test statistics or analytics to know how different these two groups are.  It is obvious in their simple descriptive statistics.  But using exaggerated cases like this can sometimes help us develop frameworks for analysis.


Social:  From a statistical point of view, there is once again a massive demographic difference between cops and the general population.  But honestly, in this case the difference is so huge, we don't really need significance testing, or any fancy analytics to test that.

Stats:  Obviously, in this case the diversity index shows its shortcomings.  However, using general "goodness of fit" testing to determine differences across all groups seems to hold up better in instances of flipped demographics.  I'm going to continue pursuing this and determine whether I can use the goodness of fit test as a good measure of overall demographic differences.

Friday, May 22, 2015

Reduction of Multinomial Data: Measuring Diversity with a Single Number

A decade ago, three coworkers confronted me with a major data problem.  They were trying to make a meaningful statement on whether or not certain entities had become more racially diverse over time.  They were looking at numbers like this (these are 2010 total US census numbers):

The problem they faced was this: how do we show a change in racial statistics over time (generally expressed as a series of six or seven numbers) in a way that's accurate and easy to measure and communicate?  

The analysts had a few good starts, but none of them really measured diversity:
  • Percentage of non-white people?
  • Ratio of minority to majority groups?
So I took a couple of days to think...


(skip if you're just interested in actual results)

After a few days of thinking, I came up with a metric, which I called the Effective Number of Races, or my diversity index, calculated as:

diversity index = 1 / Σ pᵢ²

Where p is the proportion of the total population in each racial group.  Admittedly, this metric isn't perfect, but it reacts to diversity appropriately: homogeneous populations, with only a few small minority groups, get very low diversity scores; heterogeneous populations with more and larger minority groups get higher scores.
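As a quick sketch of the calculation (Python; the function name is mine):

```python
def diversity_index(proportions):
    """Effective number of races: 1 / sum of squared proportions."""
    assert abs(sum(proportions) - 1.0) < 1e-9, "proportions must sum to 1"
    return 1.0 / sum(p * p for p in proportions)

# A perfectly even four-group population counts as four effective races,
even = diversity_index([0.25, 0.25, 0.25, 0.25])  # 4.0
# while a 90/5/5 split is barely more than one effective race.
skewed = diversity_index([0.90, 0.05, 0.05])
```

Plugging in the Cleveland numbers above reproduces the 2.41 and 1.90 figures, which is a handy sanity check.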

This metric isn't necessarily a new idea; it's derived from other fields where it's necessary to measure heterogeneity in multinomial variables. Two examples: in economics (effective number of firms: how many firms are REALLY active in this sector?) and political science (effective number of parties: how many parties are REALLY active in this parliamentary system?).  


So for comparative purposes I can use this score to compare diversity among different populations and sub-populations.  For instance:

  • US Populations in general: 2.21
  • US House of Representatives: 1.53
  • US Senate: 1.13
  • US President: 1.0 (This is a joke, FYI)
This analysis clearly demonstrates, through a single number, something that we already know: the US government is significantly less diverse than the nation as a whole.


This is good for demonstrative purposes, but what about predictive purposes?  Can diversity be predictive of political, social, or economic outcomes?

So let's start with politics.  Will a more diverse population lead to different political outcomes?  A priori theory would tell us that more diverse populations in the US would favor the Democratic Party (I won't go into the reasons why).

To test this, I regressed the diversity index for each state against the percentage of Democrats in the state legislature.  I found that a 1-point change in the diversity index led to an 11-percentage-point gain in Democratic share. This is a significant correlation. 
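The mechanics of that regression look roughly like this (Python, with made-up state-level data standing in for the real values, and the 11-point slope built in for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical state-level data: a diversity index for 50 states, and a
# Democratic share constructed with a built-in 11-point-per-unit slope.
diversity = rng.uniform(1.1, 2.5, 50)
pct_dem = 30 + 11 * diversity + rng.normal(0, 5, 50)

# Simple least-squares fit of Democratic share on the diversity index;
# the slope is the percentage-point change per 1-point diversity change.
slope, intercept = np.polyfit(diversity, pct_dem, 1)
```

With real data the slope is the estimate of interest, and its standard error tells you whether the correlation is significant.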

(BTW, to do this I had to calculate diversity values for each State, if you're interested in that data let me know)


One of the best uses for the diversity index is its ability to measure diversity over time, showing how a population has changed in a single number.  I've plotted the diversity index at three points in time (1960, 2010, 2050) against a simple % White metric.  Two thoughts:
  • I'm personally shocked at how large this change is.
  • The absolute slope of the diversity index is much steeper (and here more accurate) because it measures not just the reduction of the white proportion, but also the growth of multiple minority populations.


One of the cases I wanted to analyze has been covered heavily in the media: the diversity of the population of Ferguson, MO compared to that of its police department.  Here are the values I calculated:

  • Ferguson MO: 1.85
  • Ferguson PD: 1.12
As we may expect, there is a huge differential in the diversity values.  Unfortunately though, because Ferguson is overwhelmingly African American and the Police department is majority white, the diversity index actually underestimates the effective differential here. 

This doesn't occur often, but a good analogy would be South African apartheid.  The diversity index would not be a great metric in that case because both values would read close to one: both the population and the leadership were homogeneous groups.  Only, in this case, one was homogeneously black and the other homogeneously white.  In essence: the diversity index is not a good proxy for measures of inequality.


The diversity index can serve as a powerful metric to measure differences between populations, subpopulations, change over time, and to predict the impact of diversity on various outcomes.  However, in instances of flipped majority groups (Apartheid South Africa, for instance) it is not a good proxy measure for inequality.

Wednesday, May 20, 2015

Halfway Book Review: The Signal and the Noise

This post should make my librarian wife quite happy, as it is a book review.  And actually, a book that she picked out for me to read, Nate Silver's The Signal and the Noise.

I'm currently only halfway through the book, but I have a lot of thoughts, and I'm not sure I'll get around to writing another blog entry when I finish.  To summarize, the book is an explainer for why some predictions are good and some are bad.  The book is written from a simplistic perspective, such that even people with no background in statistics can understand the underlying concepts.

Silver walks the reader through several fields where predictions are made: baseball player performance, earthquakes, the weather, economic downturns, and election results.  He takes us through the reasons that many fields are bad at predictions (earthquake research) and some are fairly good (weather predictions, by the government).  

A big downside to this book, if you are someone who makes predictions for a living, and your job depends on it, is that there are a lot of stories of failed predictions.  Specifically anecdotes about when researchers were confident in their prediction and confidence levels, but ended up being horribly wrong.  If you make predictions for a living, some of these stories are nightmare inducing, or at least sleep reducing.

A theme that runs through the first half of the book is that context matters a lot in predictions, and that forecasters need to understand things that aren't represented in data.  Examples of this may be how well a political candidate speaks in public or if a baseball player likes to party or takes care of his body.  These may not be evident in original data, but will impact eventual outcomes.  (My guess is that this later turns into a pitch for bayesian statistical methods, but we'll see.)

This take on context in political predictions takes me back to an earlier post on this blog, where I was critical of Nate Silver for making a potentially out-of-context prediction about the 2014 Kansas governor's race.  My specific criticism was that a forecast giving Paul Davis a major edge was seriously lacking in Kansas social and historical context. In this book's terms, the political context was important (which Silver would undoubtedly agree with), and I was simply the advantaged forecaster, with 30+ years of living in Kansas (a lot of contextual knowledge).  

In summary, I definitely recommend this book to anyone who 1. won't be kept up at night by it and 2. wants to understand the way statistical predictions work.  My wife has read this book, and I think it gave her some great insight into the way I think about and approach problems.  But you don't have to take my word for it (SORRY, ALWAYS WANTED TO DO THAT).

Friday, May 15, 2015

Kansas Education Funding Analysis Part 1

Public education funding in Kansas is a huge mess.  The last ten years have seen multiple lawsuits, annual battles in the State Legislature on education spending, and massive changes in the State education funding formula.  

Why is this such a big deal in Kansas?  A couple of factors.  First, the Kansas Constitution has a section that says the State must adequately fund education.  Second, the Kansas legislature is largely made up of small-government fiscal conservatives, so it is relatively difficult to increase public spending for anything.  Twice in the last decade, citizens have filed lawsuits against the State to force an increase in public spending.  Twice they have (essentially) won.

At the heart of this issue seem to be two main questions:
  • How much do we actually need to spend on schools to meet constitutional requirements?
  • If we fund schools more, will we get better results?


If you're just interested in whether or not spending matters to educational outcomes, skip to the conclusion section.

I have mentioned before that some of my first professional work was on a Kansas education "cost study" for the State auditor's office, about a decade ago.  During that study I did quite a bit of work on relative teacher salaries, but also some work on the relationship between spending and education outcomes.  For that study we contracted with a couple of professors out of Syracuse; their study can be found here.  (That study goes into more detail, so if you're very interested in method, start there.)

Now, 10 years on in my career, I know I can do the spending to education outcomes research on my own, specifically replicating their original research and answering some questions:
  • How have coefficient values and relationships evolved over the past 10 years?
  • Does increased spending continue to relate to better outcomes in Kansas education?


My methodology here is to replicate other education research on the spending-to-outcome relationship; this is just part 1 of a potentially many-part series.  Please keep in mind:
  • I'm just one dude running this analysis while my other queries run.
  • The original study cost the State over a million dollars, and took up six full months of the State audit office's time (read: 25ish staff).
  • I'm going to build models slowly, as data comes available, so what I have for today is just a truncated model, but it's a start for a conversation.
I went to the State of Kansas Department of Education website looking for data, and found a nice data warehouse where I can run custom queries.  I pulled down data for the past three complete school years, did a bit of data cleaning, and pulled out what appeared to be the top variables.  I also bucketized district size, as had been done in the original study.

The model type here is a cost function, which estimates how much something will cost based on various input factors.  Certain measurable factors increase costs for a district: educating more kids in poverty, having fewer kids (being less efficient), and, theoretically, performing better on standardized tests should all cost more money.

Here's my variable list and an explanation:

  • PERPUPSEPDN: Our dependent variable, per pupil dollars spent.
  • AVGASS: Our most important independent variable, average assessment values for each district (how well do kids do on standardized assessments).
  • FREELUNCH: % of kids on free lunch.  Kids in poverty are more difficult to educate, so this increases cost.
  • TEACHSAL: Average salary of teachers in the district.
  • VALPUP: Per pupil property values.  This is part of the efficiency variables used in the original study.  Effectively, these efficiency variables measure factors that make it easier for a school district to spend money in inefficient ways.
  • ENROLLCAT: Categories for different school district sizes.
  • YEAR: Fixed effect for what year we are measuring.
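Put together, a minimal version of that cost function looks something like this (Python with statsmodels formula syntax; the data frame is a made-up stand-in, though the variable names are the ones above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300

# Fabricated district-year records using the variable names above.
df = pd.DataFrame({
    "AVGASS": rng.uniform(50, 90, n),         # avg assessment score
    "FREELUNCH": rng.uniform(0, 0.8, n),      # share of kids on free lunch
    "TEACHSAL": rng.normal(45000, 5000, n),   # avg teacher salary
    "VALPUP": rng.uniform(30000, 200000, n),  # property value per pupil
    "ENROLLCAT": rng.choice(["small", "medium", "large"], n),
    "YEAR": rng.choice([2012, 2013, 2014], n),
})
# Spending built with a positive outcomes coefficient, for illustration.
df["PERPUPSEPDN"] = (4000 + 25 * df["AVGASS"] + 2000 * df["FREELUNCH"]
                     + 0.05 * df["TEACHSAL"] + 0.005 * df["VALPUP"]
                     + rng.normal(0, 300, n))

# Cost function: spending as a function of outcomes, poverty, salaries,
# an efficiency factor, district-size buckets, and a year fixed effect.
model = smf.ols("PERPUPSEPDN ~ AVGASS + FREELUNCH + TEACHSAL + VALPUP"
                " + C(ENROLLCAT) + C(YEAR)", data=df).fit()
```

In the real model, the sign and significance of the AVGASS coefficient is the interesting question: does buying better assessment results cost more?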


So, what did I find?  

The important AVGASS variable is positive, and approaching statistical significance, meaning that some kind of relationship likely exists. 

Percent of kids receiving free lunch (a proxy for poverty) shows that districts with a lot of kids in poverty still spend more to get the same results.  Also, "property rich" districts still outspend "property poor" districts.

Keeping in mind that this data is still a bit noisy, and that I'm not yet controlling for all of the factors of the original study or using as many years of data, this is quite promising.  I can generally conclude that spending is still significantly related to education outcomes.*

Next Steps:  For the next steps here, I will try to acquire more years of data and more attributes, clean the current data set (from what I've seen, it's likely I have some data entry errors), and work on a better model.  Eventually, I may try to calculate the actual spending levels required to hit specific outcome levels for different school districts.

*Quick footnote from above.  I'm purposely avoiding terms that insinuate causal linkages, largely because this analysis has not yet fleshed that issue out.  Do I think it's likely that spending more can create better results?  Yes.  But I also know how confounding some of these issues can be.  Specifically, intervening and collinear variables mean that the observed relationship isn't as simple as "spend more, get better."  It's likely that more affluent districts both tend to spend more money AND get better results for other reasons (less "unmeasurable" poverty, fewer other social problems, parents with de facto higher education levels, etc).  My point: this doesn't prove cause, though through iterated analysis we should be able to move in that direction.

Tuesday, May 12, 2015

My Top R Libraries

A couple of weeks ago I posted a list of my top five data science software tools, which received quite a few pageviews and shares across the internet.  As someone told me, people just freaking love numbered lists.

Later that week I saw Kirk Borne post on Twitter about the most-downloaded R packages, which was interesting, but a bit predictable.  The top packages contained elements relevant across the many fields that use R: packages like plyr, which has a lot of handy data-handling tools, and ggplot2, which is used to plot data.  This list was interesting, but for people just starting in the field, I thought a post on data-science-specific tools would be useful.  So, in a non-random order, here they are:

  1. e1071:  Aside from having the weirdest name of the R packages, this one is probably one of the most useful.  Because of the industry I'm in, the Support Vector Machine functionality gets heavy use, and I seem to always have this package loaded.  Outside of the SVM, there are a few other useful functions, including a Naive Bayes classifier, as well as a couple of clustering algorithms.
  2. randomForest: I've talked about the problems of decision trees before on this blog, so if I'm fitting any type of tree it's generally a random forest.  If you don't know how these work, it essentially fits several trees based on randomized subsets of features, and averages (bags) the results.  It's a great library, but does contain one piece of "functionality" that annoys me.  An informational message on calling the library, which is just annoying to those of us who use R in a production server.  See image below, with RODBC working without message, but randomForest providing an unnecessary message.
  3. RODBC: Speaking of RODBC, hey wait a minute didn't I just complain that Borne's list contained just a lot of general, non data-science specific packages? Don't care.  This one is just too useful. RODBC is the package I use to connect to databases for data pulling and pushing.    Best part?  Data comes into R with correct data types, which doesn't always happen when you import from flat files or csv's.   (A quick note though, I use RJDBC for similar functionality in production, because we use MSSQL and this allows me to use a *.jar proprietary driver)
  4. topicmodels (RTextTools?): These are the two libraries I use  for text mining.  Topicmodels provides a Latent Dirichlet Allocation (LDA) function that I use often.  To be  completely honest the two packages are complementary, and I can't remember which functions are contained in each package (I generally call them both at the same time), but together they provide most of the tools I need to process text data, as well as creating Document-Term matrices.
  5. nnet: If I want to test whether a simple neural network might perform well, or out-perform another model specification, I turn to this package.  While there are many other packages providing various types of neural networks, this is the standard for a neural network in its simplest form.  I will use this as a test first before turning to the more complex and dynamic packages.

Honorable Mention:

The list above contains the most valuable functions I use, however these functions below also make my work life much easier:

For data processing: plyr
For Twitter connectivity: twitteR (requires that you acquire a twitter developer's account)
For geo processing: rgeos, geosphere
For visualization/GUI functionality: rattle, grDevices, ggplot2

Friday, May 8, 2015

Have you tried logarithms?

So, cue XKCD reference.  

Randall Munroe is making fun of varying levels of technical knowledge and rigor in different fields.  It's a funny cartoon, and hopefully not too offensive to any readers of this blog involved with Sociology or Literary Criticism (I doubt there are many).

The irony, though, is in the first panel, which plays off the seemingly ignorant question "Have you tried logarithms?"

In analytics, I actually say "have you tried logarithms?" quite a bit.  The reason is simple: to emulate different shapes of relationships that occur in nature, sometimes variable transformations are necessary.

 Background Information

Although logarithmic transformations are used across many model types, the most common setting is linear regression.  If you understand why we log variables in linear regression, it's easy to apply the same idea to other models.  There are many ways to transform variables, but here's a quick primer focusing on logarithmic transformations.

  • linear-linear: Neither the dependent nor the independent variable is logged.  A one-unit change in X changes Y by the coefficient.  This is your normal straight-line relationship.
  • linear-log: Only the independent variable is logged.  A one-unit change in log(X) changes Y by the coefficient.  This looks like a normal log curve, and fits many diminishing-returns problems.
  • log-linear: Only the dependent variable is logged.  A one-unit change in X changes log(Y) by the coefficient.  This creates an exponential curve, and is appropriate for exponential-growth relationships.
  • log-log: Both dependent and independent variables are logged.  With natural logarithms, the coefficient can be read as the % change in Y from a 1% change in X.  (Think calculus: the key relationship is that the derivative of ln(X) is 1/X.)  This is used often in econometrics to represent elastic relationships.
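The four specifications can be sketched in R with simulated data (x and y below are made up; the true relationship is multiplicative, with an elasticity of 0.5):

```r
# Toy illustration of the four specifications.
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 * sqrt(x) * exp(rnorm(100, sd = 0.1))  # curved, multiplicative relationship

lin_lin <- lm(y ~ x)            # straight line
lin_log <- lm(y ~ log(x))       # diminishing returns
log_lin <- lm(log(y) ~ x)       # exponential growth
log_log <- lm(log(y) ~ log(x))  # elasticity; the true value here is 0.5
coef(log_log)[["log(x)"]]
```

Since log(y) = log(2) + 0.5*log(x) + noise by construction, the log-log fit recovers an elasticity near 0.5.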

Real Life Example

Remember my fitness tracker data from earlier in the week?  It struck me after that analysis that my activity didn't follow linear patterns.  For instance, it seemed like my increased activity on the weekend varied in a non-linear fashion, and that there was an elastic relationship between steps yesterday and steps today.

So I logged some variables and re-ran my regression.  Here's what I got:

Does it fit better? Well, R-squared as a metric has its issues, but we can use it as a rough measure of fit.  My original model had an R-squared of 68%, whereas this model sits at 71%.  A small improvement, but there's one problem: log-transforming the dependent variable changes its variance, so comparing the R-squared of the logged model directly against the unlogged one is inaccurate.  So what do we do?

I didn't document this step, but you can create predictions from the logged equation in R, unlog those predictions, and then calculate an unlogged R-squared for the logged equation.  (If you have any questions on this or want code, you can ask in the comments.)  This unlogged R-squared for the logged equation sits at 70%.
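Here's a sketch of that back-transform comparison.  The data frame below is simulated stand-in data, not my actual tracker export:

```r
# Simulated stand-in for the tracker data: 35 days of steps with a
# weekend bump and yesterday's steps as a predictor.
set.seed(1)
df <- data.frame(weekend = rep(c(0, 0, 0, 0, 0, 1, 1), 5))
df$steps <- round(8000 * exp(0.6 * df$weekend + rnorm(35, sd = 0.15)))
df$steps_prior <- c(8000, head(df$steps, -1))

# Fit the model on the log scale...
m_log <- lm(log(steps) ~ log(steps_prior) + weekend, data = df)

# ...back-transform the predictions to the original scale...
pred_steps <- exp(predict(m_log))

# ...then compute R-squared against the unlogged dependent variable.
ss_res <- sum((df$steps - pred_steps)^2)
ss_tot <- sum((df$steps - mean(df$steps))^2)
r2_unlogged <- 1 - ss_res / ss_tot
```

One caveat worth noting: exp() of a log-scale prediction is a biased estimate of the mean on the original scale, so a more careful comparison would apply a correction such as Duan's smearing estimator.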


In regards to Munroe's original joke, sometimes "Have you tried logarithms?" is the right question to ask.  In this case I got a 2-point boost to R-squared, which isn't huge, but it is about 6% of the previously unexplained variance.  And I've seen entire news stories and lawsuits over weaker correlations than that.

Wednesday, May 6, 2015

Modeling Fitness Tracking and Messing Up Models

I've never met a metric that I couldn't screw up in some way after measuring, modeling, and focusing on it. 

This happens to me quite a bit at work: someone will say "hey, we really need to work on our 'X' ratio," and then everyone works on the X ratio for a few weeks, which predictably makes X go up.  This creates a couple of problems, though:
  • X gets better for abnormal, and often un-model-able, reasons (maybe not a concern for the business, but certainly irritating to me).
  • X sometimes gets better at the expense of the business.  I should probably blog on this later, but I've seen businesses sacrifice long-term revenue many times to improve a single KPI.

But this is a fitness blog entry, so let's start with that.

So, good news on the fitness data front:
  • I now have over a month of data to model.
  • My activity level has increased over 30% since my first week of tracking.
Except that taken together, these two things don't necessarily lead to predictive data models.  But more on that later.  Here is a summary chart of progress over the weeks.

Data Modeling:

I'm trying to create a model that predicts a future day's activity level from the various data I'm collecting.  So far, here are the attributes I have for modeling:

  • steps: how many steps I took today (the dependent variable).
  • weekend: is today a weekend?
  • day_of_week: what day of the week is it?
  • week: what week is it?
  • steps_prior: how many steps did I take yesterday?
  • hours_three_prior: how many hours of sleep did I get over the last three nights?
  • sleep_hours: hours of sleep last night.
  • travel: a binary for whether or not I was traveling that day.
  • sick: was I sick today?
Obviously, with only 35 observations I can't use all of these attributes, but I can start with a small model.

First I plotted some data just to get some ideas about trends.  A few interesting correlations found below.

I found that the simple correlation between steps yesterday and steps today was positive.  Does this imply that by simply moving more I can create an infinite feedback loop, continuously raising my activity level?  Probably not, but it's interesting.

Next I looked at my weekend variable (plot below).  I knew that I move more on most weekend days than on weekdays, but now I can quantify it.  Variation is also a lot larger on weekends.  Looking through some past data, I realized that the relatively low-activity weekend days generally occur when I'm traveling, which is why I added the "travel" variable to my models.

So can I model this? Absolutely.  Here's a summary of the attributes in the model I created.

  • Weekend: I get (on average) 9,000 more steps on weekend days than on weekdays.
  • Steps Prior: Ceteris paribus, for every step I take today, I will take 0.25 fewer steps tomorrow.  This contradicts the positive correlation above, but it makes more a priori sense (if I move a lot today, I'll be sore or tired and move less tomorrow).  Why did the sign flip?  Likely because the simple correlation was really picking up weekend effects and intra-week trends (my general increase in activity over time).
  • Travel: On travel days I get 6,000 fewer steps.
  • Factor(week): Week-level fixed effects, generally increasing over time.  I probably need a better methodology here, but I had to add this factor to make any of the models really work.  Why?  Because, as the first graph in this post shows, I've changed my behavior so much week-to-week that models are far less predictive if I don't control for it.
In common terms, weekends have always been more active than weekdays, but my behavior has changed so much over time that current weekdays are almost as active as the weekends when I first started tracking.  This is similar to the effect we saw in the "steps prior" correlation mentioned above.
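For anyone curious, the model specification described above looks roughly like this in R.  The data frame is simulated stand-in data, and the effect sizes baked into it are illustrative, not my actual estimates:

```r
# Simulated 35-day stand-in for the tracker data.
set.seed(1)
n <- 35
fit_df <- data.frame(
  week    = rep(1:5, each = 7),
  weekend = rep(c(0, 0, 0, 0, 0, 1, 1), 5),
  travel  = rbinom(n, 1, 0.1)
)
fit_df$steps <- 7000 + 9000 * fit_df$weekend - 6000 * fit_df$travel +
  1000 * fit_df$week + round(rnorm(n, sd = 800))
fit_df$steps_prior <- c(NA, head(fit_df$steps, -1))  # lagged steps

# Weekend, yesterday's steps, travel, plus week-level fixed effects.
m <- lm(steps ~ weekend + steps_prior + travel + factor(week), data = fit_df)
summary(m)
```

lm() silently drops the first row (steps_prior is NA on day one), and factor(week) adds one dummy per week after the first, which is how the week-level fixed effects enter the model.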

I think I have a reasonable start on a model, and some interesting insight into my activity level.  Tracking data this way is fun (to me) and also interesting.  Would love to get my hands on someone else's data to see if general relationships hold up.

On the other hand, much like a business KPI that gets distorted as a result of measurement and focus, my fitness data has changed so significantly since the beginning of this experiment, that it has become more difficult to model what my "natural state" activity level might be.

Tuesday, May 5, 2015

Tuesday Jam: Of Montreal

I knew I was old the day that one of my "favorite bands" got a lot of attention from National Public Radio (NPR)...

I was traveling last week so I couldn't blog, but I promise some real data content within the next couple of days.

As you can tell by my prior posts, I listen to quite a bit of metal, but sometimes I need something a bit more relaxed and toned-down in the office on a stressful day.

Enter Of Montreal, a band from Georgia with a long string of weird, enjoyable albums.  I recommend two mid-career albums, Sunlandic Twins and Hissing Fauna, Are You the Destroyer?, as a good starting point.

They also apparently put on cool live shows (I've never seen them live).  I found the concert below a few months ago on NPR.  Although it's a great show and a good sampling of music from several albums, it's still on NPR.  So... just go watch some videos on YouTube if you don't want to feel old.

Ok, so the embed code doesn't work well, so here's the link to the video: Link.

Also of note is this song; props to the first reader who can identify who used an altered-lyrics version of it in their mid-2000s advertising...