Thursday, August 20, 2015

Sentiment Mining: Do Tweets Correlate with Team Performance?

It's been a while since I blogged about football, and it's about time again.  I know the cool sport for statisticians to like is baseball, but keep in mind I grew up in Kansas during the 80's and 90's, and that the Royals sucked from roughly 1986-2013.

BACKGROUND AND GOALS

A couple of background items.  First, I watch quite a bit of football and have an interest in predicting the outcomes of games.  Second, I'm interested in sentiment mining: classifying text (tweets) by the general emotion expressed.

The football models I made earlier in the summer generally outperform other "pre-season" models that are based simply on win-loss records, because mine try to measure underlying performance beyond the prior year's aggregate wins and losses. One thing they don't account for is "soft" factors that are difficult to measure in data: for instance, injuries to important players, players under-performing but still winning, off-field issues, Tom Brady getting divorced, etc.

So my goal in sentiment mining is to enhance my predictions by also including general external sentiment about how a team is doing week to week.  That is, Twitter can serve as a generalized proxy for those soft factors.  Here are my research questions:
  1. Can we use Twitter data to provide additional color to predictive algorithms?
  2. Does the sentiment of tweets about a particular team correlate with performance?
  3. Does that relationship run forward (sentiment predicts later performance), backward (past performance shows up in current sentiment), or both?
Because the regular season hasn't started yet, I can't prove forward predictivity, but I can determine if tweets correlate backwards to last year's performance.  


METHOD AND DATA

For methodology I downloaded tweets associated with the hashtag used for each NFL team.  Then I used my sentiment mining algorithm (naive Bayes) to determine the emotion and polarity for each tweet.  The St. Louis Rams were enough of an outlier (3x more negative than other teams) that I removed them (they're going through some stuff right now).
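To make the classification step concrete, here is a minimal sketch of a naive Bayes polarity classifier and the per-team "share of negative tweets" calculation. It is not my actual algorithm: the training sentences, the function names (`train_naive_bayes`, `classify`, `negative_share`), and the whitespace tokenizer are all made up for illustration, and it only handles polarity, not emotion.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real tweets would need more cleaning."""
    return text.lower().split()

def train_naive_bayes(labeled_tweets):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def negative_share(tweets, model):
    """Fraction of tweets classified as negative."""
    labels = [classify(t, model) for t in tweets]
    return labels.count("negative") / len(labels)

# Made-up training examples, for illustration only (not real tweets).
TRAINING = [
    ("great win today", "positive"),
    ("love this team", "positive"),
    ("what a great defense", "positive"),
    ("terrible loss again", "negative"),
    ("awful offense fire the coach", "negative"),
    ("hate this terrible season", "negative"),
]

model = train_naive_bayes(TRAINING)
# Hypothetical tweets pulled for one team's hashtag.
team_tweets = ["great defense today", "terrible offense", "awful loss", "love this win"]
share = negative_share(team_tweets, model)
```

Running `negative_share` once per team hashtag gives the one number per team that the rest of the analysis uses.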

And the results.  There is a statistically significant negative correlation between wins last year and % of negative tweets this year.  That means the fewer games a team won last year, the more negative tweets this year.  Here's what that looks like:



The biggest outlier was the Bills, who were one of our more negative teams even though they had nine wins last year.  They didn't make the playoffs, though, which led me to a second and more statistically significant conclusion:

Playoff teams generally had about 5 percentage points fewer negative tweets than non-playoff teams.  Roughly 1 in 5 tweets about playoff teams were negative, while 1 in 4 tweets about non-playoff teams were negative.

One last piece of not-statistically-significant but otherwise interesting evidence.  Our two lowest negative-tweet receiving teams?  The Seahawks and Patriots, the two teams that played in last year's Super Bowl.
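For anyone who wants to try the two comparisons above (wins vs. share of negative tweets, and playoff vs. non-playoff teams), here is a rough sketch. The rows in `TEAMS` are placeholder numbers I invented so the code runs; they are not my data, and the field names are assumptions. The sketch computes a Pearson correlation and a difference in group means only; it does not run a significance test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mean_negative_share(teams, made_playoffs):
    """Average negative-tweet share for playoff or non-playoff teams."""
    shares = [t["neg_share"] for t in teams if t["playoffs"] == made_playoffs]
    return sum(shares) / len(shares)

# Placeholder rows for illustration only (not the real team data).
TEAMS = [
    {"team": "A", "wins": 12, "neg_share": 0.18, "playoffs": True},
    {"team": "B", "wins": 11, "neg_share": 0.20, "playoffs": True},
    {"team": "C", "wins": 9, "neg_share": 0.27, "playoffs": False},
    {"team": "D", "wins": 6, "neg_share": 0.24, "playoffs": False},
    {"team": "E", "wins": 3, "neg_share": 0.26, "playoffs": False},
]

# Correlation between last year's wins and this year's negative share.
r = pearson_r([t["wins"] for t in TEAMS], [t["neg_share"] for t in TEAMS])

# Gap in negative share, non-playoff minus playoff teams.
gap = mean_negative_share(TEAMS, False) - mean_negative_share(TEAMS, True)

print(f"r = {r:.2f}, playoff gap = {gap:.3f}")
```

A negative `r` matches the direction described above (fewer wins, more negative tweets), and a positive `gap` means non-playoff teams drew the larger share of negative tweets.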

CONCLUSION

Just a few points here:
  • The #Rams have real problems, as demonstrated by their true outlier status.
  • I have at least some evidence that Twitter sentiment follows team performance, and that we can model performance using our naive Bayes classifier.
  • More to come: I will be including this information in my Week 1 models to determine if it can add value.



