Wednesday, October 19, 2016

A Layperson's Guide to Multivariate Regression Outputs

Multivariate regression is a common technique used to predict a single outcome (dependent variable) using many predictors (independent variables).  For data scientists, multivariate regression is often the first statistical predictive technique to learn, and the easiest to understand and describe mathematically.  

For a lay person, the math of multivariate regression can seem daunting.  However, the general concepts can be described using knowledge of high school algebra and geometry.  Essentially, multivariate regression is the process of determining the line that best fits a set of data across multiple factors.  We can easily visualize this when we think about a two-dimensional problem, such as the pace in the Boston Marathon by age: 

The blue line represents the mathematically determined line of best fit.  As this looks like a coordinate plane from high school algebra, we can describe this line (and thus, the mean relationship between age and pace) using a y=mx+b equation.  Here that relationship looks like:

pace = 0.029 × age + 8.06

For example, this equation predicts that a 40-year-old runs about 0.029 × 40 + 8.06 ≈ 9.2 minutes per mile.
This concept is easy to understand in two dimensions, but the "multi" in multivariate regression refers to multiple predictive variables.  In this case, the line now extends into multiple dimensions, and instead of our simple high school y=mx+b we now have something like:

y = m1x1 + m2x2 + m3x3 + m4x4 + ... + b
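For the curious, here's a quick sketch of how a computer actually finds those m's and b. This uses made-up numbers in Python (not the real marathon data), but the least-squares mechanics are the same:

```python
import numpy as np

# Synthetic stand-in for the marathon data (not the blog's actual dataset):
# pace (minutes/mile) driven by age and gender plus random noise.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, 500)
gender = rng.integers(0, 2, 500)  # 0 = male, 1 = female
pace = 0.03 * age + 0.8 * gender + 8.0 + rng.normal(0, 0.5, 500)

# Least squares solves for the m's and b in y = m1*x1 + m2*x2 + b
X = np.column_stack([age, gender, np.ones(len(age))])
m1, m2, b = np.linalg.lstsq(X, pace, rcond=None)[0]
print(round(m1, 2), round(m2, 2), round(b, 1))
```

Because the data was generated with known coefficients, the fitted m's and b land very close to 0.03, 0.8, and 8.0.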
This sounds complex, and it certainly can be. The complex outputs that authors sometimes post in conjunction with the analyses do not help a non-statistical reader understand what the analysis means.  Below I will describe how to interpret multivariate regression outputs and what they mean in terms of describing the world, without having to understand advanced statistics.  I created a simple regression, predicting Boston Marathon pace by age and gender, and printed the output below:  

Here's what each element means, one by one (elements 4, 7, and 11 are all you really need to understand):
  1. Formula: shows what predictive elements went into the equation, and what is being predicted, separated by the ~ symbol.
  2. Residuals: tells us statistics about the "error" in the equation, essentially how far each of our data points are from our final predicted line.
  3. Variables: a list of all the predictive variables placed in the equation plus the "intercept," or the high school "b" from our predictive equation.
  4. Coefficient Estimate: this is the "m" from our high school equation, and tells the slope of the line between each independent variable and the dependent variable.  For instance, the data above tells us that pace tends to increase with age (coefficient is positive) at a rate of about 0.03 minutes per year.
  5. Standard Error: we can think about this as the standard deviation of data around our coefficient estimates.  Essentially, how much variation do we believe could exist in all of our "m" elements in the equation?
  6. t-value: this is the t statistical value, which is calculated by dividing the coefficient (4) by the standard error (5).  The interpretation isn't straightforward, so as a non-statistical person you can largely ignore this.
  7. P: the p-value is calculated from our t-value (you can look it up in a table).  There are two definitions we can use:
    1. (dumb statistical definition, skip) The value represents the probability of observing a coefficient at least this extreme (relative to its standard error), assuming that the null hypothesis is correct. 
    2. This tells you if an independent variable is statistically significant, meaning, if it is likely to have a non-random predictive relationship with the dependent variable.  In essence: does the independent variable have a meaningful relationship with the dependent variable?  The lower this is, the more likely the relationship is significant, with values below 0.05 being "significant" under common statistical standards.
  8. Asterisks: the asterisks represent at what level of significance each variable is significant in the model, redundant to (7).
  9. Sig Codes: these serve as a key for (8) above, and are derived from the P-value, (7).
  10. Residual error: this can largely be ignored, though the "degrees of freedom" reflects how many rows were used in the dataset (the row count minus the number of estimated coefficients).
  11. Multiple-R-Square: this is a measure of the quality of the model, expressed in how much variance is "explained" in the dependent variable. Using the prior example, if we were going to predict the pace of runners in the Boston Marathon, and knew nothing except average times for past finishers, our best predictive strategy would be to just guess the average for each competitor (R-squared = 0).  But if we know the age and gender of the competitor, we can create a linear model to represent their relationships to pace, and use those to inform our predictions (model above, R-squared is 0.086).  If we had prior paces for each finisher, we may be able to  improve our predictive model, and could measure that improvement by the R-squared value. This value runs from 0 to 1, and gets continuously "better" as it increases. 
  12. F-statistic: This one can also largely be ignored, though the p-value tells whether the regression as a whole is statistically significant.
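To make elements 4 through 7 and 11 concrete, here's a small Python sketch (toy data, not the marathon output) showing how the t-value and R-squared fall out of a fitted line:

```python
import numpy as np

# Toy illustration of how the output elements relate: t = coefficient /
# standard error, and R-squared compares the model's errors to simply
# guessing the mean every time.
rng = np.random.default_rng(1)
x = rng.uniform(18, 70, 200)
y = 0.03 * x + 8.0 + rng.normal(0, 0.5, 200)

X = np.column_stack([x, np.ones(len(x))])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ coef

# Standard error of the slope (element 5), then its t-value (element 6)
dof = len(x) - 2
s2 = resid @ resid / dof
se_slope = np.sqrt(s2 / ((x - x.mean()) @ (x - x.mean())))
t_value = coef[0] / se_slope

# Multiple R-squared (element 11): share of variance "explained"
r_squared = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(round(t_value, 1), round(r_squared, 3))
```

A large t-value pushes the p-value (element 7) toward zero, which is why the two move together in the printed output.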

And a test!  For the regression below I've added a single factor to the equation.  The variable e_a is a categorical indicator (it takes values 0 or 1) and indicates whether or not the runner is part of some group.  Mail your answers to these questions (no prizes, only knowing you're right):

  1. What group of runners do you think the variable e_a represents? (hint: e_a is an acronym)
  2. How did the addition of this variable impact the model in terms of quality and other variables?

Wednesday, September 21, 2016

Distribution Analytics and My Fitness Tracker

The last few weeks have been quite crazy so I haven't had much time to write on the blog.  Also I'm working under so many NDA's (non-disclosure agreements) that I don't have much work or contracting items that I can blog about.  Actually, I've been somewhat hesitant to talk to anyone about anything lately, for fear of breaking an agreement.  Luckily, today I have some every-day data, with data science applications that we can initially explore.

Today I can break my blog-drought, based on some very cool behavior out of my activity tracker.  About a year ago, I did some work looking at the data I capture from my activity tracker (Garmin Vivo Fit 2).   I created some regression models to predict/understand daily steps patterns which led to some interesting observations:
  • My activity patterns tend to be auto-regressive and counter-cyclical.  In essence, my daily step counts are negatively correlated with steps the prior day: high-step days follow low-step days, and vice-versa.
  • My weekends are much more active than my weekdays.  This was by a factor of about 5k-10K steps, but now seems to be much less.
  • Sleep had a weird multi-day effect.  My fitness tracker also tracked when I slept, which led to some interesting step-sleep interactions.  Generally, more steps led to a bit more sleep that night (marginally measurable), but sleep had a multi-day effect on steps.  In essence, the best sleep-based predictor of steps was the prior five-day average of sleep time.
  • Here are a couple of recent weeks of the available data.  In essence, I average a little over 20K steps per day (80+ miles per week), and most days (now) are similar in steps.
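The counter-cyclical pattern above can be checked with a lag-1 autocorrelation. Here's a Python sketch on simulated step counts (not my actual Garmin export):

```python
import numpy as np

# Hypothetical daily step counts with the counter-cyclical pattern
# described above: a big day tends to be followed by a smaller one.
rng = np.random.default_rng(2)
steps = [20000.0]
for _ in range(364):
    # Each day pulls away from yesterday's deviation from the mean
    steps.append(20000 - 0.5 * (steps[-1] - 20000) + rng.normal(0, 3000))
steps = np.array(steps)

# Lag-1 autocorrelation: correlate each day with the previous day
lag1 = np.corrcoef(steps[:-1], steps[1:])[0, 1]
print(round(lag1, 2))
```

A negative lag-1 value is exactly the "high-step days follow low-step days" signature; a value near zero would mean each day is independent of the last.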

Analysis of my own fitness tracker distributions last year was interesting; however, I wondered how my activity level compared to the rest of the population.  I had researched generally, and found several studies saying Americans on average take about 5,000 steps per day.  This is interesting because it is only about a quarter of my daily step total.  I wondered if some demographics were impacting this; for instance, is it possible that older Americans were significantly dragging down the average?

Then, my Garmin app did something cool.  Specifically, it sent me this chart.

A summary of points from the chart:
  • I'm in the top 1% of activity level for people in my age and gender group (mid-30's dudes).
  • The average for my demo group appears to be about 6K-8K.  
  • The data is significantly skewed, with a long right-hand tail (which I'm in).  This indicates a group of people who also take a significantly higher number of steps than the average.
This is great data, and gives my activity level a bit of good context.  But it really makes me want to start a database so I can create auto-regressive models for all users.  More directly, knowing that pooled data exists, I want access to the Garmin data warehouse so I can apply data science techniques to this data; I doubt they offer this, though.

I was really sold on this data, then my activity tracker brought me back down to earth.  Specifically, it sent me this chart:

In sum: 
  • I get less sleep than almost 90% of people my age (under 7 hours). 
  • The average amount of sleep for people my age is just below 8 hours. 
I don't know exactly what to do with this information... maybe it's time to create full-blown auto-regressive sleep models? See if I can diagnose why I sleep less?  Maybe.  Though I don't feel tired.  Here's how Garmin describes a week of my sleep.

Nothing obvious there, except some nights that I don't get to sleep until well after 11 pm (likely due to late-night contracting).  Also a few *awake* bars in the middle of the night, likely due to a certain three-year-old yelling for me.  Nothing obvious, but maybe some models will help me figure this one out, as I seek to find out why I'm killing it with my activity level, and sucking it up with sleeping.  More coming soon...

Wednesday, August 31, 2016

Windows, Contests, and Mis-Perceived Risks

Over the past three weeks, I've seen the same promotional advertisement (on radio, TV, and internet channels) what seems like 100 times. Here's the promotion:

According to the fine print underlying the ad, this promotion can be summarized as:
Sign up for new windows in your home from this company before August 29th, and if the high temperature on Labor day at Kansas City International Airport is at least 97 degrees, your windows are free. There are also financing options, which allow you to defer any payments for a full year.
My wife and I bought an older house in Lenexa, KS (KC Metro) two years ago, and planned on installing new windows sometime in the next ten years, so this is an interesting deal to me. But what are the chances of free windows from this deal? I wanted to look at three questions:
  1. What are the odds of the terms of the deal coming true (Temp >= 97 on Sept 5th)?
  2. How can this business get people to sign up for a financial gamble on a big purchase?
  3. How can a business afford to risk this much product/financials?


My first reaction to this kind of problem is to go download a ton of data and calculate probabilities. But I wondered how less data-savvy individuals might assess the risk probability (the average person won't have a good way to estimate this probability). I took a quick poll of people I knew in KC, asking their perception of the odds of a 97-degree Labor Day. Their values ranged from 8% to 50%, with a central tendency in the 15-20% range. So generally people assumed that there was a one-in-five chance of this contest paying out.

Then I downloaded all available history for KCI high temps from NOAA as well as additional confirmatory data from other Kansas City weather stations. In the last 43 years, there has only been one September 5th with a temperature at or above 97 degrees. I plotted high temperatures on September 5th over time.  

So initially, once in every 43 years would be a probability of about 2.3%, but with only one case, and a sample of 43, we're in a bit of a data crunch. So I expanded my sample in two ways. First, I looked at surrounding days, analyzing all days between September 3rd and 7th. Then I looked at additional weather stations around Kansas City, extending the view back to 1950. My final estimate:
The chance of collecting on this promotion is about 4.5%, or less than 1-in-20. 
(Nerd note: Obviously there's a clustering issue, as daily temps aren't independent, and same-day temps at different local weather stations aren't either. However, the extra data does add credibility to the estimate.)
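The estimate itself is simple to reproduce: the payout probability is just the share of historical observations at or above the 97-degree cutoff. Here's a Python sketch with made-up temperatures (not the actual NOAA download):

```python
import numpy as np

# Hypothetical pooled sample of early-September highs: roughly 5 days
# (Sept 3-7) across many station-years, drawn from a plausible
# distribution rather than the real NOAA records.
rng = np.random.default_rng(3)
highs = rng.normal(88, 5, 430)

# Empirical payout probability: fraction of days at or above 97 degrees
p_payout = np.mean(highs >= 97)
print(round(p_payout, 3))
```

With these made-up numbers the estimate lands in the low single digits, the same ballpark as the 4.5% figure above.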

So there's a chance, but a very small one, and people (judging both by my informal poll and by the existence of the promotional material) seem to over-estimate the chances of this occurring. I wondered if there was any time of the year when there would be a greater than 20% chance of a high temperature at 97 degrees or above. (To meet this condition the 80th percentile temp would have to be above 97 degrees.)

The below graph shows mean, 80th percentile, and record highs throughout the Kansas City summer months. Though the high can exceed 97 degrees from mid-June until mid-September, the probability of that has historically exceeded 20% (i.e., the 80th percentile) on only one day of the year (July 17th). Due to surrounding data points and limited data, that value is likely subject to volatility, and potentially anomalous.
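The 80th-percentile check works like this, sketched in Python with hypothetical temperature distributions (not the real station data):

```python
import numpy as np

# Hypothetical distributions of daily highs: a hot mid-July day versus
# an early-September day. A >20% chance of hitting 97 degrees means the
# 80th percentile of the distribution must sit above 97.
rng = np.random.default_rng(4)
mid_july_highs = rng.normal(94, 6, 200)
early_sept_highs = rng.normal(86, 6, 200)

print(round(np.percentile(mid_july_highs, 80), 1))
print(round(np.percentile(early_sept_highs, 80), 1))
```

With these assumed distributions, only the mid-July percentile clears the 97-degree bar, matching the one-day-a-year finding above.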


A few thoughts on why I think people would opt in for this type of marketing, despite low payout odds:
  • Mis-perceived analysis of risk (they over-estimate the probability of getting free windows).  A few functional theories on how this happens:
    • Perception of weather as being more extreme than it actually is.
    • Not being familiar with actual average temperatures and/or temperature data.
    • Perception of global warming as having an extreme effect, and increasing probability (I doubt too many people buy this).
    • Misjudging the gradient of temperature decline in late summer.  See chart above; in this case people don't recognize how quickly average temps decline from early August through early September.
  • Doing it anyway. It's possible that there are quite a few people like me that need new windows, and if all else is equal (other discounts don't apply at other times of the year, this is the best company, etc) then why not take a chance to get free windows. It only makes sense.  


My prior job was largely in risk analysis, so it is natural for me to consider why a business would take this kind of risk. It seems that giving away free windows to everyone who signed up for this deal could effectively end a company. Keep in mind, it isn't like the lottery, where one person wins. If there's a winner, EVERYONE (who bought windows) in Kansas City wins. The answer? Well, a few ways to mitigate risk:
  • Nationwide company. Info on this company shows it's a big, nationwide company. It's possible that they run this contest in other markets, and with the 4.5% risk threshold assume a few will pay out. If margin AND incremental marketing lift (effectively, how many people take the deal) are sufficiently high in other markets, then the net payoff of the risk will outweigh the losses in the ... 'hot markets' ... where they have to give away free windows.
  • Insure contest risk.  There is an ability to get an insurance policy against someone winning certain kinds of high-dollar contests. Think "hole in one" contests. Actually, this is my favorite instance of that from the early 1990s.
  • Other margin factors.  There's a possibility that while the windows are free, other costs actually make up the bulk of margin dollars, and these other costs partially mitigate the costs of the contest. What if installation is actually a high-margin undertaking, and part of that margin can offset the cost of free windows?
There are likely other ways to mitigate this risk, but I'm certain this company has an understanding of the risk of payout, and likely understands the consequences of *winning.*


From this blog entry, I've laid out that the contest has relatively low chances of winning (4.5%), but people may misperceive that risk for one of many reasons. Even if everyone *wins* the contest, the business likely understands the loss they would take, and has taken actions to mitigate that risk financially. How likely is it now? Seems low, the forecast for Monday, September 5th is 88 degrees.

Oh, and by the way, since this contest is over (August 29th has passed) I went over to the website to see what the new deal was: 20% off of the price of windows. What does that mean? You could either have a 4.5% chance at FREE windows, or a 100% (advertised) chance at 20% off of windows.

Thursday, August 25, 2016

Hitler, Trump, Hillary and Social Network Data

Last night on Facebook I saw an interesting post from someone I went to school with... here's the general nature of his complaint: 
He's fairly conservative (a Trump supporter) and recently created a post on Facebook that likened Hillary Clinton to Hitler.  One of his friends reported the post to Facebook, and Facebook removed the post from visibility, labeling it offensive.
This post actually struck me fairly hard. Two reasons:
  1. I may not agree that Hillary is effectively Adolf Hitler, but it seems firmly within this guy's rights to compare political candidates to Hitler. In fact, there's nothing really more American than comparing people you disagree with to Nazis (there's even an internet adage about this: Godwin's law).
  2. I see people compare Donald Trump to Hitler all the time. Literally... all the time, with no apparent repercussions. Is there a double standard on Facebook in comparing Hillary to Hitler? This is especially interesting to me considering Facebook's reported bias against conservatives (which they blamed on rogue employees).
So, I thought it might be interesting to dig into the data of Hitler references on contemporary social media, and see what people most often discuss in relation to Hitler. Here's what I found on Twitter:
  1. The most common term, by far, used when discussing Hitler is "Trump."
  2. Right now, there are only really two Hitler related topical discussions on Twitter: one related to Trump and the other pertaining to Syria.


I downloaded the last ten days of tweets (all tweets) containing the hashtag "#Hitler," which was around 15K total tweets. I ran a cleanup algorithm I designed to remove tweets that contain duplicate content, likely tweetbot posts, and spam. The result was about 3,500 clean tweets that mentioned Hitler and were unlikely to be spam.  

Then I conducted some text-data cleaning steps; you can look at some prior posts on this site to understand what this entails, but generally it removes frequent words that are less meaningful (e.g. "the") and reduces words to their stem, or root meaning (e.g. "run" and "running" are interpreted the same way).
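For readers who want to see what those two cleaning steps look like, here's a minimal Python sketch with a toy stopword list and a deliberately crude stemmer (real pipelines use proper stopword lists and stemmers like Porter's, not this):

```python
import re

# Toy stopword list for illustration only
STOPWORDS = {"the", "a", "is", "to", "and", "of"}

def crude_stem(word):
    # Naive suffix stripping; a real stemmer is far more careful
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(tweet):
    # Lowercase, tokenize, drop stopwords, then stem each token
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(clean("Running the race and run it again"))
```

After cleaning, "Running" and "run" both reduce to the same stem, so frequency counts treat them as one term.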


I wanted to solve two questions with this analysis:

  1. What do people talk about most when they say Hitler on social media (sub-question: how often is Trump mentioned?)
  2. What are the general topics of conversation regarding Hitler?

The first part of the analysis was easy, and I found that Tweets with the term "Hitler" most often use the word "Trump" followed by a series of words related to elections including "like" and "support." Terms related to nazis are also popular, specifically, "adolf" and "nazi."

Of note, "Hillary" or "Clinton" aren't in the top 20 terms associated with Hitler, though Trump's Twitter user name is, as well as marginally associated words "never" (as in never Trump) and "vote." Searching the data, Hillary was mentioned about 90 times in the data set, so Hitler tweets are only about 12% as likely to contain Hillary as they are to contain Trump.

And here's what those tweets look like as a wordcloud:

Next, I wanted to discover the underlying topics inside the data, for which I used correlated topic models (CTMs). I've written about them before on this blog, and technical specifications can be found here. For this analysis, I reduced my dataset to just tweets that hash-tagged "Hitler," for two reasons: 
  • faster processing
  • only analyze tweets with the strongest relation to Hitler
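I fit the CTMs in R; as a rough stand-in for readers who want to experiment, here's the same general idea (discovering topics from a document-term matrix) using scikit-learn's LDA in Python, on a few hand-made example tweets rather than the real dataset:

```python
# LDA as an illustrative substitute for correlated topic models; the
# "tweets" below are invented for the example, not the downloaded data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "trump hitler never trump vote fraud",
    "never trump hitler comparison vote",
    "assad syria hitler chemical attack",
    "syria assad hitler war crimes",
]

# Build the document-term matrix, then fit a two-topic model
counts = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of lda.components_ scores every term's weight in one topic
print(lda.components_.shape)
```

LDA assumes topics are independent, while CTMs allow topics to correlate; for a quick "what are people talking about" pass on tweets, either surfaces the dominant term clusters.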
The algorithm seemed to converge on only two detected topics. What does that mean? Over the past week, people on Twitter mentioning Hitler have generally been talking about one of two things: 1. Trump and 2. Assad of Syria. Here are the top terms by these two topics:

One might argue that Trump supporters are also mentioning Hitler, but the term output makes it more clear who is invoking Hitler and Trump.  It is generally the "Never" Trump movement as evidenced by the terms "Never" and "Trump The Fraud." For a bit more color, here are the associated wordclouds from each topic.

First Assad:

And now Trump (my favorite random association is "Vote Hitler"):

For some final color on this, here is an example of a couple of Tweets, generally representative of the Trump category:


A few takeaways:
  • The most common topic discussed on Twitter when using the term Hitler is "Trump." Users seem to continuously make the comparison of Trump to Hitler, without being punished by the social media platform.
  • The two main subjects of discussion when talking about Hitler on Twitter appear to be Trump and Assad. The Assad connection isn't hugely surprising given recent news out of the Middle East. The comparison of Trump to Hitler is a bit more jarring, given that he is the presidential candidate of a major US party.
QUICK COMMENTARY: The guy with the banned post from the beginning of this entry just posted that he challenged the banned post with Facebook, and Facebook gave in and said that it was in fact appropriate. This is interesting, and it tends to coincide with previous articles stating that individuals at Facebook who receive front line complaints are often biased against conservative views, and tend to over-extend censorship towards conservative positions. Obviously, I looked at a different social network, but it is commonplace to compare Trump to Hitler, so it seems like the Hillary to Hitler comparison shouldn't be censored. 

Friday, August 19, 2016

Trump and Breitbart Alliance: A Match Made on Emotion?

For the Trump campaign, this week has been fairly crazy, highlighted by naming the head of the conservative news organization Breitbart (Steve Bannon) as the CEO of his campaign. There has been quite a bit of punditry on this subject, but that's not really the place of this blog. I'm focusing on a follow-up to our prior post on Trump's use of disgust to drive engagement, emotional reaction, and (ultimately) political support.  First, a summary of earlier findings:

  • Trump not only receives engagement boosts from using disgust based language, but also from using anger, fear, sadness, and trust. He doesn't see boosts from joy, anticipation, or surprise.
  • Neither Hillary Clinton nor Bernie Sanders get a statistically significant boost from using emotional language on Twitter like Trump. Both of these candidates see flat relationships with emotion and engagement. (Full disclosure, the author of this blog sees a positive and significant boost in retweets from using disgust language.)
  • Breitbart news sees a similar boost in engagement (from emotional speech) as Trump. This may go some way toward explaining the Trump/Breitbart alliance: they use similar tactics to engage users, AND their user bases respond to similar types of language.


(non-nerds can skip)

For this project, I used effectively the same code and methodology as my prior post on Trump's use of disgust in tweets.  I made a few improvements (some code at the end of this post).
  • I downloaded the twitter feeds for four internet news sites (Foxnews, CNN, Breitbart, MSNBC), and scored those feeds in the same way I scored candidates in my prior post.  I also followed the same normalization strategy, controlling for incident rate of emotion in the data set, and tendency towards emotional language for each candidate.
  • I created linear models for each candidate/news agency (entity) by emotion pair (8 entities * 8 emotions = 64 models), with an observation level of "tweet", dependent variable of retweets, and predictor variable of emotion.  This model shows at what rate emotions drive engagement for each entity.
  • I created an output matrix of the emotions, reporting only statistically significant results, for easy comparison of engagement by emotion and entity.


For commentary on the meaning of the charts, please reference the prior post on Trump's use of disgust. The basics are this: we use a sentiment mining algorithm to measure the overall emotion of tweets, and then aggregate the results to each user (candidates and news agencies) and relate that to engagement to determine which emotions drive social engagement results for which groups.

First, our emotional term index, which shows that Breitbart is actually the least emotional-sounding news agency, compared to Foxnews, which is the most emotional.

Next we summarize the normalized emotional tendencies of each news agency.  Breitbart (of special interest) wins on use of "disgust," MSNBC wins hugely on "surprise", and CNN wins on "trust" emotions.  

That last chart demonstrated an important component related to our prior post: Trump's strategic ally Breitbart news also tends to use a lot of disgust emotional language and signaling in their tweets. But are they as successful in driving engagement by use of disgust as Trump is? Time for a statistical test.

My prior post received quite a bit of traffic, but was a bit intellectually lacking in my opinion. I had demonstrated Trump's relationship with disgust, and his followers' reaction to it, but I hadn't looked at two other dimensions:
  1. Do other candidates get engagement boosts from disgust or other emotions? 
  2. Does Trump get boosts from other emotions outside of disgust?
To test these hypotheses I created 64 linear models, which serve as statistical tests to determine the effect of emotional language on engagement as measured by retweets (side note: I threw in my own tweets as a comparison). The code for those models is found at the bottom of this post, and the results are in the heatmap directly below.

The red results above are statistically insignificant, with green results showing statistically significant coefficients (emotions that give candidates a significant boost). The numbers in the chart represent the actual coefficients:
  1. News agencies (except for Breitbart) see very little engagement response to emotions. The relationships we do see for Fox and CNN are related to emotions we may see with breaking news (anticipation, joy, surprise).
  2. Breitbart news sees a stronger relationship with emotional tweets. The strongest engagement-generating emotions for Breitbart are anger, disgust, and sadness.
  3. Sanders and Clinton do not see increases in engagement when they use emotional language.
  4. Trump sees the largest positive relationship with engagement and emotional language, which is strongest on the emotions similar to Breitbart.
  5. Full disclosure: the author of this blog sees positive engagement from disgust tweets too.

An easier visualization of the disgust measure, looking first at candidates, then at news agencies (x axis is a disgust rating, y axis is a retweet count)


A few bullet points in closing:
  • The Breitbart/Trump alignment makes sense both from the way they speak AND from the way their followers engage with their tweets.
  • Other candidates and news agencies see much less engagement from openly emotional tweeting.
  • Trump and Breitbart both get more engagement in their tweets by using anger, fear, disgust, sadness, and trust.

# Model dimensions
o <- c("CNN", "BREITBART", "FoxNews", "MSNBC")
e <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")

# jj holds the coefficient for each entity/emotion model; pval holds its p-value
jj <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))
pval <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))

for (j in e) {
  for (i in o) {
    dd <- subset(mydata, user == i)
    model <- as.formula(paste("retweetCount ~", j))
    temp <- summary(lm(model, data = dd))
    jj[i, j] <- temp$coefficients[2, 1]    # coefficient estimate
    pval[i, j] <- temp$coefficients[2, 4]  # p-value
  }
}

# Keep only statistically significant coefficients (p <= 0.05)
out <- ifelse(pval <= 0.05, 1, 0) * jj

Wednesday, August 17, 2016

Pop Culture Data Scientists in TV and Movies

A few months ago my wife was watching House of Cards and called me into the room... the conversation went something like this:

Wife: Hey there's a data scientist on this show!
Me: Really?  .. why...
Wife: Yeah, he's a weirdo, just like you!
Me: .. thanks

I watched a couple of episodes and found out the character was in fact much weirder than I am.  But I also thought this was the first data scientist title I had seen fictionalized in popular culture. Given that this blog has been a bit too serious lately, I thought it would be interesting to compile a list of fictional data scientists, assess whether they are really data scientists, and apply another fictional metric: would Levi hire them?

An interesting thing I found in researching this is that since the concept of data science is relatively new, there are very few actual data scientists in pop culture. It also may be difficult to fictionalize what data scientists actually do in any kind of interesting way, but .. whatever...

The list is short, and most people on the list wouldn't necessarily identify as data scientists, but here we go:

Aidan MacAllan, House of Cards:  

Synopsis: Aidan is a data scientist who apparently works for the government, or the President and his wife directly and does... data science-y stuff for them. Most of the tasks resemble real data science tasks (large web analytics, identifying targets for marketing based on who has been close to gun violence) and others aren't really data science but sound cool (tapping a phone). The portrayal of Aidan as a data scientist is romanticized, but some of the tasks are at least in the ballpark of what we can do.

Personality Portrayal: My wife was wrong, he's much weirder than I am. The only similarities between Aidan and me are consistently messed-up hair and listening to death metal.  But he dances naked when he's alone compiling code (I DO NOT DO THIS). And the character is portrayed as a weird artistic savant, which is good for Hollywood story lines, but is overplayed relative to the real personalities of most data scientists.

Would I hire?  No.

Reasoning: The naked dancing thing, questionable data ethics, and overstatement of certainty.

Seth Bregman and Peter Sullivan: Margin Call. 

Synopsis: Another Kevin Spacey project, strangely enough, maybe he likes data science! I like the movie Margin Call a lot, partially because the movie is about quants getting things right. In this movie Seth and Peter are identified as risk analysts, but had this movie been released in 2016 rather than 2011, there's a good chance they would be data scientists. At the beginning of the film they work with a model that shows their firm to be over-leveraged, which sets up the rest of the film. Though not strictly data scientists in title, they create models/simulations out of large data sets, which is effectively what data scientists do.

Personality Portrayal:  These guys aren't portrayed as weirdos, but more junior Wall-Street guys that are maybe a bit more numbers/science focused. One aspect of the movie that parallels modern data scientists is that they have broad academic backgrounds (e.g. astrophysics) and have been brought into business to solve large financial modeling problems.

Would I hire? Yes.

Reasoning: They are portrayed as generally competent, and aren't afraid to escalate issues to management, which is important.

Peter Brand: Moneyball.

Synopsis: This isn't even really a data scientist at all, but I'm already hitting the bottom of the well with examples. Jonah Hill plays this character as a complete nerd who is basically a statistician and sabermetrics expert. As portrayed on film, it doesn't appear he has the computer science skills required for a data science career. He does one very important thing for data scientists in the film, though: he continually tells those around him that they are looking at the wrong metrics, and that by focusing on metrics that actually create production, they can run a better business.

Personality Portrayal: Nerdy, classic statistician who likely plays Magic: The Gathering in his free time. That may just be an offshoot of casting Hill in the role, though.

Would I hire? Maybe.

Reasoning: Yes, if I had a lot of tasks that lent themselves to econometric rather than machine learning models, and could pair him with a data engineer to cover the computer science side of the job.

Max Cohen: Pi.  

Synopsis: This is well before what we think of as modern data science, but the essentials are there: looking for patterns in large, somewhat unstructured data sets. Basically, an unemployed number theorist starts analyzing stock market data and finds patterns. He makes accurate predictions based on these patterns (seemingly tied to a 216-digit number). He then moves on to numeric/textual analysis of the Torah (which, by the way, is a fairly common form of junk science). In the end the numbers drive him crazy, and he solves the problem with a power drill to the brain (seriously).

Personality Portrayal: Paranoid, possibly schizophrenic, intelligent, with extremely bad headaches. I've seen data scientists get to this point, but it's usually solved by a good night's rest rather than a power drill.

Would I hire? No.

Reasoning: He's crazy.

Honorable Mention/Exclusions:
  • Alan Turing, The Imitation Game:  Excluded because Alan Turing was a real (awesome) person.
  • Artificial Intelligence, Her, Ex Machina: Generally nameless data scientists, and highly speculative futurism.  I prefer real data scientists solving real problems.

Thursday, August 11, 2016

Contracting Rates, Competition, LinkedIn's Race to the Bottom

A few weeks ago I was looking to connect with more people in data science (read: networking), so I did something out of the ordinary for me: I joined a LinkedIn group. It was a group for programming in the R statistical language, and I assumed I would join it, read some posts, and maybe it would keep me engaged in the field.

Questions started popping up in the group, generally about the best ways to make certain graphs in ggplot2 or how to handle certain data frame manipulation tasks. I ignored them. Then a question caught my interest:
I've been asked how much the hourly rate is for a freelancer and I have no idea. Could anyone provide a ballpark range in US dollars?
Going against my gut, I engaged with LinkedIn.


The question was interesting, and I contract on occasion, so I thought I would type a quick response. I clicked through on the question and saw the following first answer:
Hi M----, I am a consulting statistician with many years experience using R. I charge $40/hour for my services.
Things just got weird. I'm more familiar with contract rates in the $150-$200 per hour range. Very weird. What could be going on, and why is this person's rate so low? First, a few facts on data science salaries:
  • Median salary for a data scientist in the US is about $112,000 (~$55 an hour + benefits)
  • Mean salary for a data scientist in the US is about $125,000 (~$61 an hour + benefits)
  • Contract or freelance gigs in any field generally pay more per increment (e.g., per hour) than full-time work of similar scope, for a couple of reasons:
    • Contractors have less stable employment and thus charge more as a hedge against instability and opportunity costs (time spent marketing, invoicing, etc.).
    • Businesses using contractors avoid the fixed costs of a full-time hire (benefits, severance, idle time), so they are willing to pay a premium in the per-increment rate.
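The salary-to-hourly arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming a standard 2,080-hour work year (52 weeks × 40 hours) and an illustrative 2× contractor premium; the post's ~$55 and ~$61 figures imply a slightly smaller hours divisor, and the premium multiplier is my assumption, not a quoted figure.

```python
# Rough conversion between annual salary and an hourly-equivalent rate.
HOURS_PER_YEAR = 52 * 40  # 2,080 hours; a common full-time approximation

def salary_to_hourly(annual_salary):
    """Convert an annual salary to a rough hourly equivalent (ignores benefits)."""
    return annual_salary / HOURS_PER_YEAR

def contractor_rate(annual_salary, premium=2.0):
    """Hypothetical contract rate: base hourly times a premium that covers
    benefits, employment instability, and unbilled time (marketing, invoicing)."""
    return salary_to_hourly(annual_salary) * premium

print(round(salary_to_hourly(112_000)))  # median salary -> ~54/hour
print(round(salary_to_hourly(125_000)))  # mean salary -> ~60/hour
print(round(contractor_rate(125_000)))   # ~120/hour, near the low end of typical contract rates
```

Under these assumptions, a full-time mean salary of $125K maps to roughly $60/hour, and a 2× premium lands right at the bottom of the $120-$300 contracting range discussed below.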
The original question wasn't looking for a data scientist per se (just a statistician with R experience), but the skill sets and pay rates are fairly similar. So the question became: why were some data-science-related contract rates so far below normal US data science salaries?


A lot of people ended up responding to the LinkedIn question, so I had a fairly large sample of analysts and their self-reported rates for contract work. Because it was LinkedIn, I could also click through to their resumes and determine their experience and educational backgrounds, as well as other demographic factors. Per the post, I also referenced the website Upwork, which is something like an Uber for freelancers in various fields, to increase my sample.

In the data, I found three basic groups in the posts:
  • Workers living overseas, especially in South Asia, who were willing to work for sub-par wages ($30-40 an hour). This group seemed to charge less for a number of reasons:
    • Businesses incur a bit of risk in working in these areas, which shifts wage rates down.
    • The exchange rates and local cost of living in these areas make lower wages more tolerable to data scientists living in the area.
    • Data science and IT jobs in that region (substitute employment) pay less than in the United States.
  • Workers living in the US with thin or non-existent resumes, willing to work for sub-par wages ($30-50 an hour). These individuals generally had strong educational backgrounds (some with PhDs), but their resumes lacked any substantive analytical experience. Some people have great educational credentials but are unemployable for various reasons (e.g., personality, work ethic), so seeing highly educated people out of work isn't hugely surprising. A couple of reasons these rates are likely under-market:
    • Many may be willing to work under market because they are CURRENTLY unemployed and/or unemployable.
    • Much of a data scientist's value in the workplace comes from solving real-world business projects. These workers realize that they have substantively less to sell themselves on to large employers.
    • For unemployable individuals, there is very little potential for substitute employment (a real data science job), so they are willing to take temporary work for far less.
  • Workers living in the US or similar countries, with long resumes, who charge typical contracting rates ($120-$300 an hour). These were generally US residents with backgrounds similar to mine, having worked in analytics and data science for many years at large companies. They generally hold $100K+ day jobs and will contract in their free time if a company will "make it worth it."


Though data science contractors with business experience in the United States are extremely well compensated, those with limited experience or working overseas cost a fraction of the price. These overseas and low-experience resources are likely best suited to low-level coding or entry-level data science projects. However, as one responder to the original LinkedIn post noted, they come caveat emptor, which may be why many businesses pay higher rates for more reliable talent.