Wednesday, August 31, 2016

Windows, Contests, and Mis-Perceived Risks

Over the past three weeks, I've seen the same promotional advertisement (in radio, TV, internet channels) what seems like 100 times. Here's the promotion:

According to the fine print underlying the ad, this promotion can be summarized as:
Sign up for new windows in your home from this company before August 29th, and if the high temperature on Labor Day at Kansas City International Airport is at least 97 degrees, your windows are free. There are also financing options, which allow you to defer any payments for a full year.
My wife and I bought an older house in Lenexa, KS (KC Metro) two years ago, and planned on installing new windows sometime in the next ten years, so this is an interesting deal to me. But what are the chances of free windows from this deal? I wanted to look at three questions:
  1. What are the odds of the terms of the deal coming true (Temp >= 97 on Sept 5th)?
  2. How can this business get people to sign up for a financial gamble on a big purchase?
  3. How can a business afford to risk this much product/financials?


My first reaction to this kind of problem is to go download a ton of data and calculate probabilities. But I wondered how less data-savvy individuals might assess the risk (the average person won't have a good way to estimate this probability). I took a quick poll of people I know in KC, asking their perception of the odds of a 97-degree Labor Day. Their values ranged from 8% to 50%, with a central tendency in the 15-20% range. So people generally assumed there was roughly a one-in-five chance of this contest paying out.

Then I downloaded all available history for KCI high temps from NOAA as well as additional confirmatory data from other Kansas City weather stations. In the last 43 years, there has only been one September 5th with a temperature at or above 97 degrees. I plotted high temperatures on September 5th over time.  

So initially, once in every 43 years would be a probability of about 2.3%, but with only one case in a sample of 43, we're in a bit of a data crunch. So I expanded my sample in two ways. First, I looked at surrounding days, analyzing all days between September 3rd and 7th. Then I looked at additional weather stations around Kansas City, extending the view back to 1950. My final estimate:
The chance of collecting on this promotion is about 4.5%, or less than 1-in-20. 
(Nerd note: Obviously there's a clustering issue, as daily temps aren't independent, and same-day temps at different local weather stations aren't either. However, the extra data does add to the credibility of the estimate.)
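
To see how thin a 1-in-43 sample really is, here is a minimal R sketch that puts an exact binomial confidence interval around the single-day estimate. It uses only the two numbers quoted above (one qualifying year out of 43). Treating years as independent draws is an assumption, and the pooled 4.5% figure is not reproduced here because the underlying station data isn't included in this post.

```r
# Counts taken from the text: one Sept 5th at or above 97F in 43 years.
hits <- 1
years <- 43

# Point estimate of the payout probability.
p_hat <- hits / years

# Exact (Clopper-Pearson) 95% interval; assumes years are independent draws.
ci <- binom.test(hits, years)$conf.int

cat(sprintf("estimate: %.1f%%, 95%% CI: %.2f%% to %.1f%%\n",
            100 * p_hat, 100 * ci[1], 100 * ci[2]))
```

The interval is wide enough to contain both the raw 2.3% and the pooled 4.5%, which is why expanding the sample was worth the trouble.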

So there's a chance, but a very small one, and people (judging by my poll, and by the very existence of the promotion) seem to over-estimate the chances of this occurring. I wondered if there was any time of the year when there would be a greater than 20% chance of a high temperature at 97 degrees or above. (To meet this condition, the 80th percentile temp would have to be above 97 degrees.)

The graph below shows mean, 80th percentile, and record highs throughout the Kansas City summer months. Though the high can exceed 97 degrees from mid-June until mid-September, the probability of that has historically only exceeded 20% (i.e., the 80th percentile is above 97) on one day of the year (July 17th). Given the surrounding data points and the limited data, that value is likely subject to volatility, and potentially anomalous.
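
For anyone who wants to build the same kind of summary, here is a rough R sketch of the per-day calculation. The temperatures are SYNTHETIC (random draws with invented means and spread, not NOAA values), and the names `highs` and `summarise_day` are mine, so treat this as an illustration of the method rather than a reproduction of the chart.

```r
set.seed(42)

# SYNTHETIC stand-in for a station history: 43 years of highs for two dates.
# The means and spread are invented for illustration, not NOAA values.
highs <- expand.grid(year = 1973:2015, day = c("07-17", "09-05"),
                     stringsAsFactors = FALSE)
highs$tmax <- round(rnorm(nrow(highs),
                          mean = ifelse(highs$day == "07-17", 92, 84),
                          sd = 7))

# The three series in the chart, plus the share of years at or above 97F.
summarise_day <- function(x, threshold = 97) {
  c(mean = mean(x),
    p80 = unname(quantile(x, 0.80)),
    record = max(x),
    share_hot = mean(x >= threshold))
}

# One row per calendar day, one column per summary statistic.
by_day <- t(sapply(split(highs$tmax, highs$day), summarise_day))
print(round(by_day, 2))
```

With real data you would have one group per calendar day across the whole summer, and the days worth betting on are the ones where `p80` climbs above the threshold.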


A few thoughts on why I think people would opt in for this type of marketing, despite low payout odds:
  • Mis-perceived analysis of risk (they over-estimate the probability of getting free windows).  A few functional theories on how this happens:
    • Perception of weather as being more extreme than it actually is.
    • Not being familiar with actual average temperatures and/or temperature data.
    • Perception of global warming as having an extreme effect, and increasing probability (I doubt too many people buy this).
    • Misjudging the gradient of temperature decline in late summer. As the chart above shows, average temps decline quickly from early August through early September, and people may not recognize how fast that happens.
  • Doing it anyway. It's possible that there are quite a few people like me that need new windows, and if all else is equal (other discounts don't apply at other times of the year, this is the best company, etc) then why not take a chance to get free windows. It only makes sense.  


My prior job was largely in risk analysis, so it is natural for me to consider why a business would take this kind of risk. It seems that giving away free windows to everyone who signed up for this deal could effectively end a company. Keep in mind, it isn't like the lottery, where one person wins. If there's a winner, EVERYONE (who bought windows) in Kansas City wins. The answer? Well, there are a few ways to mitigate the risk:
  • Nationwide company. Info on this company shows it's a big, nationwide company. It's possible that they run this contest in other markets and, with a 4.5% risk of payout in each, assume a few will pay out. If margin AND incremental marketing lift (effectively, how many people take the deal) are sufficiently high in the other markets, then the net payoff of the risk will outweigh the losses in the ... 'hot markets' ... where they have to give away free windows.
  • Insure contest risk.  There is an ability to get an insurance policy against someone winning certain kinds of high-dollar contests. Think "hole in one" contests. Actually, this is my favorite instance of that from the early 1990s.
  • Other margin factors.  There's a possibility that while the windows are free, other costs actually make up the bulk of margin dollars, and these other costs partially mitigate the costs of the contest. What if installation is actually a high-margin undertaking, and part of that margin can offset the cost of free windows?
There are likely other ways to mitigate this risk, but I'm certain this company understands the risk of payout, and likely understands the consequences of *winning.*
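
The nationwide-company argument can be made a bit more concrete with a short R sketch. The 4.5% figure is my estimate from above; the market counts are HYPOTHETICAL, and the calculation assumes each market has the same payout probability and that markets are independent (nearby cities sharing a heat wave would violate that).

```r
p_payout <- 0.045              # per-market payout probability, from the estimate above
n_markets <- c(1, 5, 10, 20)   # HYPOTHETICAL numbers of markets running the contest

# Expected number of markets that pay out, and the chance at least one does.
# Assumes independent markets with identical odds.
expected_payouts <- n_markets * p_payout
p_at_least_one <- 1 - (1 - p_payout)^n_markets

print(data.frame(n_markets,
                 expected_payouts,
                 p_at_least_one = round(p_at_least_one, 3)))
```

The point of the spread is that the company trades one small chance of a catastrophic loss for a fairly predictable cost: on average fewer than one market in twenty pays out, and the rest fund it.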


In this blog entry, I've laid out that the contest has a relatively low chance of paying out (4.5%), but people may misperceive that risk for one of many reasons. Even if everyone *wins* the contest, the business likely understands the loss it would take, and has taken actions to mitigate that risk financially. How likely is it now? It seems low: the forecast for Monday, September 5th is 88 degrees.

Oh, and by the way, since this contest is over (August 29th has passed) I went over to the website to see what the new deal was: 20% off of the price of windows. What does that mean? You could either have a 4.5% chance at FREE windows, or a 100% (advertised) chance at 20% off of windows.
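
A quick expected-value comparison of the two offers, sketched in R. The order price is HYPOTHETICAL (the ratio between the two offers doesn't depend on it), and the sketch ignores risk preferences and any differences in fine print.

```r
price <- 10000    # HYPOTHETICAL window order, in dollars
p_free <- 0.045   # estimated chance the contest pays out
discount <- 0.20  # advertised follow-up offer

# Average saving from the contest vs. the guaranteed saving from the discount.
expected_saving_contest <- p_free * price
saving_discount <- discount * price

cat(sprintf("contest: $%.0f expected, discount: $%.0f guaranteed\n",
            expected_saving_contest, saving_discount))
```

On average the guaranteed 20% off is worth more than four times as much as the shot at free windows.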

Friday, August 19, 2016

Trump and Breitbart Alliance: A Match Made on Emotion?

For the Trump campaign, this week has been fairly crazy, highlighted by naming the head of a conservative news organization (Stephen Bannon of Breitbart) as the CEO of his campaign. There has been quite a bit of punditry on this subject, but that's not really the place of this blog. I'm focusing on a follow-up to our prior post on Trump's use of disgust to drive engagement, emotional reaction and (ultimately) political support.  First, a summary of earlier findings:

  • Trump not only receives engagement boosts from using disgust based language, but also from using anger, fear, sadness, and trust. He doesn't see boosts from joy, anticipation, or surprise.
  • Neither Hillary Clinton nor Bernie Sanders get a statistically significant boost from using emotional language on Twitter like Trump. Both of these candidates see flat relationships with emotion and engagement. (Full disclosure, the author of this blog sees a positive and significant boost in retweets from using disgust language.)
  • Breitbart news sees a similar boost in engagement (from emotional speech) as Trump. This may help explain the Trump/Breitbart alliance: they use similar tactics to engage users, AND their user bases respond to similar types of language.


(non-nerds can skip)

For this project, I used the same effective code and methodology as my prior post on Trump's use of disgust in tweets.  I made a few improvements (some code at the end of this post).
  • I downloaded the Twitter feeds for four internet news sites (Foxnews, CNN, Breitbart, MSNBC), and scored those feeds in the same way I scored candidates in my prior post.  I also followed the same normalization strategy, controlling for the incidence rate of each emotion in the data set, and the tendency towards emotional language for each entity.
  • I created linear models for each candidate/newsagency (entity) by emotion pair (8 entities * 8 emotions = 64 models) with an observation level of "tweet", dependent variable of retweets, and predictor variable of emotion.  This model shows at what rates emotions drive engagement for each entity.
  • I created an output matrix of the emotions, reporting only statistically significant results, for easy comparison of engagement by emotion and entity.


For commentary on the meaning of the charts, please reference the prior post on Trump's use of disgust. The basics are this: we use a sentiment mining algorithm to measure the overall emotion of tweets, and then aggregate the results to each user (candidates and news agencies) and relate that to engagement to determine which emotions drive social engagement results for which groups.

First, our emotional term index, which shows that Breitbart is actually the least emotional-sounding news agency, compared with Foxnews, which is the most emotional.

Next we summarize the normalized emotional tendencies of each news agency.  Breitbart (of special interest) wins on use of "disgust," MSNBC wins hugely on "surprise", and CNN wins on "trust" emotions.  

That last chart demonstrated an important component related to our prior post: Trump's strategic ally Breitbart news also tends to use a lot of disgust emotional language and signaling in their tweets. But are they as successful in driving engagement by use of disgust as Trump is? Time for a statistical test.

My prior post received quite a bit of traffic, but was a bit intellectually lacking in my opinion. I had demonstrated Trump's relationship with disgust, and his followers' reaction to it, but I hadn't looked at two other dimensions:  
  1. Do other candidates get engagement boosts from disgust or other emotions? 
  2. Does Trump get boosts from other emotions outside of disgust?
To answer these questions, I created 64 linear models, which serve as statistical tests of the effect of emotional language on engagement as measured by retweets (side note: I threw in my own tweets as a comparison). The code for those models is found at the bottom of this post, and the results are in the heatmap directly below.  

The red results above are statistically insignificant, with green results showing statistically significant coefficients (emotions that give candidates a significant boost). The numbers in the chart represent the actual coefficients:
  1. News agencies (except for Breitbart) see very little engagement response to emotions. The relationships we do see for Fox and CNN are related to emotions we may see with breaking news (anticipation, joy, surprise).
  2. Breitbart news sees a stronger relationship with emotional tweets. The strongest engagement-generating emotions for Breitbart are anger, disgust, and sadness.
  3. Sanders and Clinton do not see increases in engagement when they use emotional language.
  4. Trump sees the largest positive relationship with engagement and emotional language, which is strongest on the emotions similar to Breitbart.
  5. Full disclosure: the author of this blog sees positive engagement from disgust tweets too.

An easier visualization of the disgust measure, looking first at candidates, then at news agencies (x axis is a disgust rating, y axis is a retweet count)


A few bullet points in closing:
  • The Breitbart/Trump alignment makes sense both from the way they speak AND from the way their followers engage with their tweets.
  • Other candidates and news agencies see much less engagement from openly emotional tweeting.
  • Trump and Breitbart both get more engagement in their tweets by using anger, fear, disgust, sadness, and trust.

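 # Model dimensions: one row per entity (Twitter user), one column per emotion
 o <- c("CNN", "BREITBART", "FoxNews", "MSNBC")
 e <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")
 # jj holds slope coefficients; rsq (despite the name) holds their p-values
 jj <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))
 rsq <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))
 for (j in e) {
   for (i in o) {
     # mydata: one row per tweet, with user, retweetCount, and one score column per emotion
     dd <- subset(mydata, user == i)
     # Build retweetCount ~ <emotion> as a formula instead of eval(parse(...))
     temp <- summary(lm(reformulate(j, response = "retweetCount"), data = dd))
     jj[i, j] <- temp$coefficients[2, 1]   # slope on the emotion score
     rsq[i, j] <- temp$coefficients[2, 4]  # p-value for that slope
   }
 }
 # After all models are fit: flag p <= 0.05, then zero out insignificant coefficients
 rsq <- ifelse(rsq <= 0.05, 1, 0)
 out <- rsq * jj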

Wednesday, August 17, 2016

Pop Culture Data Scientists in TV and Movies

A few months ago my wife was watching House of Cards and called me into the room... the conversation went something like this:

Wife: Hey there's a data scientist on this show!
Me: Really?  .. why...
Wife: Yeah, he's a weirdo, just like you!
Me: .. thanks

I watched a couple of episodes and found out the character was in fact much weirder than I am.  But I also thought this was the first data scientist title I had seen fictionalized in popular culture. Given that this blog has been a bit too serious lately, I thought it would be interesting to compile a list of fictional data scientists, whether they are really data scientists, and another fictional metric: would Levi hire them?

An interesting thing I found in researching this is that since the concept of data science is relatively new, there are very few actual data scientists in pop culture. It also may be difficult to fictionalize what data scientists actually do in any kind of interesting way, but .. whatever...

The list is short, and most people on the list wouldn't necessarily identify as data scientists, but here we go:

Aidan MacAllan, House of Cards:  

Synopsis: Aidan is a data scientist who apparently works for the government, or the President and his wife directly and does... data science-y stuff for them. Most of the tasks resemble real data science tasks (large web analytics, identifying targets for marketing based on who has been close to gun violence) and others aren't really data science but sound cool (tapping a phone). The portrayal of Aidan as a data scientist is romanticized, but some of the tasks are at least in the ballpark of what we can do.

Personality Portrayal: My wife was wrong, he's much weirder than me. The only similarities between Aidan and me are consistently messed-up hair and listening to death metal.  But he dances naked when he's alone compiling code (I DO NOT DO THIS). And the character is portrayed as a weird-artistic-savant, which is good for Hollywood story lines, but is exaggerated relative to the real personalities of most data scientists.

Would I hire?  No.

Reasoning: The naked dancing thing, questionable data ethics, and overstatement of certainty.

Seth Bregman and Peter Sullivan: Margin Call. 

Synopsis: Another Kevin Spacey project, strangely enough, maybe he likes data science! I like the movie Margin Call a lot, partially because the movie is about quants getting things right. In this movie Seth and Peter are identified as risk analysts, but had this movie been released in 2016 rather than 2011, there's a good chance they would be data scientists. At the beginning of the film they work with a model that shows their firm to be over-leveraged, which sets up the rest of the film. Though not strictly data scientists in title, they create models/simulations out of large data sets, which is effectively what data scientists do.

Personality Portrayal:  These guys aren't portrayed as weirdos, but more junior Wall-Street guys that are maybe a bit more numbers/science focused. One aspect of the movie that parallels modern data scientists is that they have broad academic backgrounds (e.g. astrophysics) and have been brought into business to solve large financial modeling problems.

Would I hire? Yes.

Reasoning: They are portrayed as generally competent, and aren't afraid to escalate issues to management, which is important.

Peter Brand: Moneyball.

Synopsis: This isn't even a data scientist really at all, but I'm already hitting the bottom of the well with examples. Jonah Hill plays this character as a complete nerd who is basically a statistician and sabermetrics expert. As portrayed on film, it doesn't appear he has the computer science skills required for a data science career. He does one very important thing for data scientists in the film though: he continually tells those around him that they are looking at the wrong metrics, and that by focusing on metrics that actually create production, they can run a better business.

Personality Portrayal: Nerdy, classic statistician who likely plays Magic: The Gathering in his free time. This may just be an offshoot of casting Hill to play the role though.

Would I hire? Maybe.

Reasoning: Yes, if I had a lot of tasks that lent themselves to econometric rather than machine learning models, and I could pair him with a data engineer to do the computer science side of the job.

Max Cohen: Pi.  

Synopsis: This is way before what we think of as modern data science, but the essentials are there: looking for patterns in large and somewhat unstructured data sets. Basically, an unemployed number theorist starts analyzing data in stock markets and finds patterns. He makes accurate predictions based on these patterns (seemingly related to a 216-digit number). He then moves on to numeric/textual analysis of the Torah (which, BTW, is a fairly common junk science). In the end, the numbers drive him crazy and he solves the problem with a power drill to the brain (seriously).

Personality Portrayal: Paranoid, potentially schizophrenic, intelligent, with extremely bad headaches. I've seen data scientists get to this point, but it's usually solved more easily by a good night's rest than with a power drill.

Would I hire? No.

Reasoning: He's crazy.

Honorable Mention/Exclusions:
  • Alan Turing, The Imitation Game:  Excluded because Alan Turing was a real (awesome) person.
  • Artificial Intelligence, Her, Ex Machina: Generally nameless data scientists, and highly speculative futurism.  I prefer real data scientists solving real problems.

Thursday, August 11, 2016

Contracting Rates, Competition, LinkedIn's Race to the Bottom

A few weeks ago I was looking to connect with more people in data science (read: networking) so I did something out of the ordinary for me: I joined a LinkedIn group. It was a group related to R statistical engine programming, and I assumed I would join it, read some posts, and maybe it would keep me engaged in the field.

Questions started popping up in the group, generally related to the best ways to make certain graphs in ggplot2 or how to handle certain dataframe manipulation tasks. I ignored. Then a question caught my interest:
I've been asked how much the hourly rate is for a freelancer and I have no idea. Could anyone provide a ballpark range in US dollars?
Going against my gut, I engaged with LinkedIn.


The question was interesting, and I contract on occasion so I thought I should type a quick response. I clicked through on the question, and saw the following first answer:
Hi M----, I am a consulting statistician with many years experience using R. I charge $40/hour for my services.
Things just got weird. I'm more familiar with contract rates in the $150-$200 per hour range. Very weird. What could be going on, and why is this person's rate so low?  First a few facts on data science salaries:
  • Median salary for a data scientist in the US is about $112,000 (~$55 an hour + benefits)
  • Mean salary for a data scientist in the US is about $125,000 (~$61 an hour + benefits)
  • Contract or freelance gigs, in any field, generally pay more per increment (on an hourly basis) than full-time gigs of similar scope, for a couple of reasons:
    • Contractors have less stable employment and thus earn more as a hedge against instability and opportunity costs (time spent marketing, invoicing, etc).
    • Businesses using contractors incur much lower total costs than if they hired a full-time employee, so they are willing to pay a premium on the hourly rate.
The original question wasn't looking for a data scientist per se (just a statistician with R experience) but the skillsets and pay rates are somewhat similar.  The question came to my mind: why were some data science related contract rates so far below normal data science salaries in the US?
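
For reference, the salary-to-hourly conversion above is simple division. Here is a small R sketch; the 2,040 hours per year is my ASSUMPTION (roughly 51 weeks at 40 hours), chosen because it matches the rounded figures quoted above. A more common 2,080-hour year would put each rate about a dollar lower.

```r
annual_salary <- c(median = 112000, mean = 125000)  # figures quoted above
hours_per_year <- 2040  # ASSUMPTION: roughly 51 weeks at 40 hours

# Hourly equivalent of a salaried job, before benefits.
hourly <- annual_salary / hours_per_year
print(round(hourly))  # median 55, mean 61
```

Either way, a $40/hour contract rate sits well below what a salaried data scientist earns per hour even before benefits are counted.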


A lot of people ended up responding to the LinkedIn question, so I had a fairly large sample of analysts and their self-reported charge for contract work. Because it was LinkedIn I could also click-thru to their resumes and determine their experience and education backgrounds, as well as other demographic factors. Per the post, I also referenced the website Upwork, which is kind of like an Uber service for Freelancers in various fields (to increase my sample).

In the data, I found three basic groups in the posts:
  • Workers living overseas, especially in South Asia who were willing to work for sub-par wages ($30-40 an hour). This group seemed to charge less for a number of reasons:
    • Businesses incur a bit of risk in working in these areas, which shifts wage rates down.
    • The exchange rates and local cost of living in these areas make lower wages more tolerable to data scientists living in the area.
    • Data science and IT jobs in that region (substitute employment) pay less than in the United States.
  • Workers living in the US with thin or non-existent resumes, willing to work for sub-par wages ($30-50 an hour). These individuals generally had strong educational backgrounds (some with PhDs), but had resumes that lacked any substantive analytical experience. I know that there are some people who have great educational backgrounds but are unemployable for various reasons (e.g., personality, work ethic), so it isn't hugely surprising to see highly educated people out of work. A few reasons that these rates are likely under-market:
    • Many may be willing to work under market because they are CURRENTLY unemployed and/or unemployable.
    • Much of a data scientist's value in the workplace comes from solving real-world business projects. These workers realize that they have substantively less to sell themselves on to large employers.
    • For unemployable individuals, there is very little potential for substitute employment (getting a real data science job), so they are willing to take temporary work for far less.
  • Workers living in the US or similar countries, with long resumes, who charge rates similar to mine ($120-$300 an hour). These were generally US residents with a background similar to mine, having worked in analytics and data science for many years within large companies. These are generally people with $100K+ day jobs who will contract in their free time, if a company will "make it worth it."


Though data science contractors with business experience in the United States are extremely well compensated, those with limited experience or working overseas cost a fraction of the price. These overseas and low-experience resources are likely best suited for low-level coding or entry-level data science projects. However, as one responder to the original LinkedIn post said, they come caveat emptor, which may be the reason many businesses pay higher rates for more guaranteed talent.

Monday, August 1, 2016

Quantitative Analysis: Trump's use of Disgust to Drive Engagement

I went down a bit of a rabbit hole this morning. I was using a sentiment mining algorithm to classify text by emotion (similar to the positive/negative polarity analysis from last week), and noticed the least-used emotion my algorithm was assigning was "disgust."

This isn't odd given the field I work in, but I wondered if there are fields, or sets of text, that would skew disgust differently. Then I remembered a tweet... and some articles I read last winter about Donald Trump's status as a disgust/hygiene candidate. Here's one of those tweets:

And I was off, downloading a few thousand presidential candidate tweets, once again. This time though, I limited analysis to the Trump disgust theory, whether, 1. he's the candidate that uses the disgust emotion the most, and 2. whether using disgust has been a successful strategy for him to date. (Side note: I could write a dissertation on emotional language used by candidates.) This is what I found:

  • Analyzing emotional language of the final three candidates, Trump skews highest in disgust, and it is his strongest skew-over-others emotion. (Sanders is strongest on fear, Clinton on trust.)
  • Trump is successful in generating positive interaction from disgust-tweets, generating an additional 1000 retweets for disgust tweets over non-disgust tweets.


The basic idea of this theory is that Trump gains an advantage by signaling disgust (re: some political event) to his followers and advocating himself as the hygienic (purity) candidate.  Examples of this behavior?
  • Framing America's problems in terms of infectious disease: problem from outside of US, brought in by immigrants. 
  • Response to Megyn Kelly, purportedly referencing menstruation: "... blood coming out of her wherever."
  • Comments on Hillary Clinton's "potty break" (I have a two-year-old) during a Democratic debate.
The underlying functional theories of why and how Trump uses disgust to attract voters aren't the place of this blog, but two good starting places are this New Republic article and this article from the New York Times.


SHORT: I used sentiment mining technology which analyzes the texts of individual comments (tweets, in this case) and determines the underlying emotions of each text.

LONG (skip if non-nerdy): I pulled a sample of tweets from the past four months using the Twitter API, and cleaned the data using standard text cleaning methods (removing punctuation, numbers, stem words, etc). I applied sentiment mining technology to the same tweets I pulled last week, this time using an emotion classifier rather than a polarity classifier. (The classifier is available in the R package syuzhet; I wrote additional code to summarize the data. If you want my code, ask.)

The emotion classifier breaks down emotions into the eight primary emotions: anger, fear, sadness, disgust, surprise, anticipation, trust, and joy.  For more about those emotions, see Plutchik's Wheel of Emotions, which is where this fairly standard list is derived.

Then I segmented the tweets by poster (Clinton,Trump,Sanders) and summarized their postings using a double normalized index of emotional strength. The scores were normalized by:
  1. The candidates' tendency to use emotional words in tweets (Bernie was the most emo, Hillary the least).
  2. Each emotion's propensity over the entire population (Trust and Anticipation were most common). 
By using this two-way normalization, we can determine the individual strength of emotion displayed by each candidate, controlled for each candidate's tendency to be emotional AND the in-nature distribution of each emotion.
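
Here is a minimal R sketch of one plausible reading of that two-way normalization: divide each candidate's share of emotional words for an emotion by that emotion's share across everyone. This is not my production code, and the counts are HYPOTHETICAL numbers invented so the example runs on its own.

```r
# HYPOTHETICAL counts of emotion words: rows = candidates, columns = emotions.
counts <- matrix(c(30, 10, 20,
                   15, 25, 10,
                   20, 10, 40),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Clinton", "Trump", "Sanders"),
                                 c("trust", "disgust", "fear")))

# Normalization 1: share of each candidate's emotional words per emotion,
# which controls for how emotional each candidate is overall.
row_share <- counts / rowSums(counts)

# Normalization 2: each emotion's share across all candidates combined,
# which controls for how common each emotion is in general.
overall_share <- colSums(counts) / sum(counts)

# Index > 1 means the candidate over-uses that emotion relative to the field.
index <- sweep(row_share, 2, overall_share, "/")
print(round(index, 2))
```

Reading across a row shows which emotion a candidate leans on most relative to the field, which is how a "strongest skew" emotion gets picked out.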


First, the two-way normalization forced me to calculate a net emotionality score.  This score is effectively the average number of highly-emotional words per tweet by candidate.  Here's what that looks like.  (I also added analysis of my tweets.)

Bernie is most emotional, followed by Trump and Hillary.  I am the least emotional in my terms, which may derive from being a quant... or not a presidential candidate.

Second, I created an index to show which candidates skew towards which emotions.  This also allows us to determine which emotion is relatively strongest to each candidate (highest value compared to values of other candidates), which breaks down like this:
  • Bernie: Fear
  • Trump: Disgust
  • Clinton: Trust
A chart of these emotions.

In relative terms, Trump clearly skews towards more disgust-related words than the other candidates, and this seems to be his strongest difference with both Bernie and Hillary. Analysis of the data this way allows for a quantitative validation of the disgust theory put forward earlier this year by other publications.

Does leveraging disgust-emotions help Trump gain engagement and voters?

Third, in two prior posts on polarity, I demonstrated how Trump's use of negative words in tweets (highly negative polarity) drove engagement on twitter (roughly: the more negative Trump goes, the more engagement he gets; not true for Hillary or Bernie). Can we demonstrate the same for disgust emotions?

Here I plotted disgust words per tweet by average retweets (best measure of engagement overall). Once again there is a positive relationship between Twitter engagement and the emotion of "disgust." In fact, tweets that were algorithmically classified as emotion = "disgust" were retweeted about 1,000 more times (20% more) than Trump's non-disgust tweets (p = 0.025).
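
For the curious, here is a rough R sketch of that kind of test on a 0/1 disgust flag. The tweets are SYNTHETIC: I bake in a 1,000-retweet gap on a 5,000-retweet baseline purely so the example has something to find, and the column names are illustrative, so the p-value it prints says nothing about the real data.

```r
set.seed(7)

# SYNTHETIC tweets: a 0/1 disgust flag and a retweet count (values invented).
n <- 400
tweets <- data.frame(disgust = rbinom(n, 1, 0.2))
tweets$retweets <- pmax(0, round(5000 + 1000 * tweets$disgust +
                                   rnorm(n, sd = 2000)))

# With a 0/1 predictor, the slope is the difference in mean retweets
# between disgust and non-disgust tweets.
fit <- lm(retweets ~ disgust, data = tweets)
slope <- unname(coef(fit)["disgust"])
p_value <- summary(fit)$coefficients["disgust", "Pr(>|t|)"]

# Same quantity computed directly from the two group means.
gap <- mean(tweets$retweets[tweets$disgust == 1]) -
  mean(tweets$retweets[tweets$disgust == 0])

cat(sprintf("extra retweets for disgust tweets: %.0f (p = %.3f)\n",
            slope, p_value))
```

The regression and the simple difference in means give the same number; the model just adds the significance test.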


This post provided a novel method to determine relative strength of emotions used by candidates, and also test a theory about a candidate's (Trump) use of a specific emotion to gain an advantage. To reiterate my findings:
  • Analyzing emotional language of the final three candidates, Trump skews highest in disgust, and it is his strongest skew-over-others emotion. (Sanders is strongest on fear, Clinton on trust.)
  • Trump is successful in generating positive interaction from disgust-tweets, generating an additional 1000 retweets for disgust tweets over non-disgust tweets.