Wednesday, May 18, 2016

Trump Gets A Large Boost from Negative Tweets

Earlier this week I posted on Presidential candidate tweets (Donald Trump versus Hillary Clinton), covering how they differ and where they are similar (spoiler alert: everyone is talking about Trump... a lot).  That analysis was largely qualitative, but I conducted a more quantitative analysis as well.

I'll include some nerdy details below (as well as code), but in short the analysis used sentiment models to determine the underlying sentiment (positive, negative) of candidate tweets.  Here's a quick definition of sentiment modeling:
"refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials."

Effectively what we're doing is measuring the underlying feelings and emotions expressed in a tweet, and then reducing those to positive and negative.  This analysis is interesting partially because Hillary is criticized in the press for being overly negative, and not smiling enough, which is a weird criticism for a Presidential candidate.  But can we validate that notion with data?  I will post visualizations below, but here are a few quick findings:
  • The candidates are similar in aggregate sentiment with Trump being slightly more positive.
  • But Trump's sentiment is much more volatile (wider distribution), and he has a monopoly on the most negative tweets.
  • Trump gets more engagement with followers the more negative he is, aside from a certain "taco bowl" tweet.

INITIAL DISTRIBUTIONS

Our sentiment model assigns each tweet a score: the number of positive words minus the number of negative words.  Scores below zero indicate net negative sentiment and scores above zero indicate net positive sentiment, with larger magnitudes meaning stronger sentiment in that direction.
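To make the scoring concrete, here is a minimal sketch of a net word-count score in base R.  The word lists and example tweets are made up for illustration (the real analysis used a much larger word list), and `net_score` is a hypothetical helper name; the full function I actually used is in the CODE section.

```r
# Toy dictionaries (hypothetical; the real word list is much larger)
positive_words <- c("great", "love", "win")
negative_words <- c("weak", "fail", "crisis")

# Score = number of positive words minus number of negative words
net_score <- function(tweet, positive_words, negative_words) {
  clean <- tolower(gsub("[[:punct:]]", "", tweet))  # strip punctuation, lowercase
  words <- unlist(strsplit(clean, "\\s+"))          # split on whitespace
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

net_score("A weak plan that will fail. Total crisis!", positive_words, negative_words)  # -3
net_score("We love this country and we will win", positive_words, negative_words)       # 2
```

The first tweet contains three negative words and no positive ones, so it scores -3; the second contains two positive words and scores 2.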

Here's a distribution analysis of Clinton, Sanders, and Trump tweets.  Note that Trump's tweets are shifted slightly right (more positive), but his distribution includes more tweets on the fringes and less central tendency (a lower peak at the mode, a kurtosis effect).  In effect, Trump's tweets are a bit more bipolar than Clinton's.



Here's a look at some distribution statistics.  Note that Trump has a significantly wider distribution and is, on net, slightly more positive (also note that I included my own tweets in this :)).
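For readers who want to build a similar table, the sketch below computes a few distribution statistics per candidate in base R.  The scores are synthetic and the candidate labels "A" and "B" are placeholders, not the data from this post; `excess_kurtosis` and `summarize_scores` are hypothetical helper names.

```r
# Synthetic scores for illustration only; not the data from this post
tweets <- data.frame(
  candidate = rep(c("A", "B"), each = 6),
  score     = c(-1, 0, 0, 1, 0, 1,   -4, 3, 0, -2, 4, 1)
)

# Excess kurtosis computed by hand (base R has no built-in kurtosis function)
excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Center, spread, range, and tail weight for one candidate's scores
summarize_scores <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x),
    kurtosis = excess_kurtosis(x))
}

# One row per candidate, one column per statistic
stats <- t(sapply(split(tweets$score, tweets$candidate), summarize_scores))
print(round(stats, 2))
```

In this toy data, candidate B has the wider spread (larger sd, more extreme min and max) while the two means stay close together, which is the kind of pattern described above.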



TWEET EXAMPLES

While in the data, I found that some of Trump's more negative tweets were at least a bit amusing.  Should we look at them?  Let's do it in two ways.  First (to be fair to both candidates) I created wordclouds of each candidate's most negative tweets. (First Trump, then Clinton.)






The interesting information here is that when Trump goes negative he most often uses the names of opponents, and words like weak, total, and fail.  When Clinton goes negative she is most often referring to families, women, guns, and crisis.  I also ran this for Bernie, who seems to go most negative when talking about the country, poverty, and America.
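A wordcloud is just a picture of word frequencies, so the underlying step can be sketched in base R as below.  The tweets, scores, cutoff, and tiny stop-word list are all made up for illustration and are not the data behind the clouds in this post.

```r
# Hypothetical tweets and sentiment scores, for illustration only
tweets <- c("Weak and total fail", "Total disaster, so weak", "Great rally tonight")
scores <- c(-2, -2, 1)

# Keep the most negative tweets (cutoff of -2 is an arbitrary assumption)
most_negative <- tweets[scores <= -2]

# Clean, split into words, drop stop words, and count frequencies
clean <- tolower(gsub("[[:punct:]]", "", most_negative))
words <- unlist(strsplit(clean, "\\s+"))
stop_words <- c("and", "so", "the", "a")   # tiny stop list, an assumption
freq <- sort(table(words[!words %in% stop_words]), decreasing = TRUE)
print(freq)
```

The resulting counts (names and frequencies) are the kind of input a plotting package such as wordcloud expects.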





Wordclouds are fun, but Trump has the ten most negatively rated tweets on the list, so let's just read them (I stripped these down just to the simple text -- you're welcome).



IT GETS BETTER

At dinner with my wife last night I described the project and she was quite interested.  Together we came up with another interesting question: engagement.  Do internet users engage with Trump and Clinton more or less when they are more positive and negative?

The easiest way to measure engagement is retweets, because they also capture a network effect.  I had to remove one tweet from this analysis ... this one.


It was a far outlier: it's actually sentimentally positive (he loves Hispanics), but the reaction and underlying sentiment are something completely different.  It is also an advertisement for one of his businesses, which ensures it is positive in tone.  Anyway, I created a method to compare tweet sentiment to average engagement.

The chart below compares Clinton to Trump, showing an interesting fact: when Trump goes more negative he gets more engagement and response.  Clinton's pattern is not as clear, and she does best with slightly positive tweets. (Engagement is indexed so that 1.0 = that user's average retweets.)
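Here is a rough sketch of how an engagement index like this can be computed in base R.  The data frame, its column names (`user`, `score`, `retweets`), and the numbers are synthetic placeholders, not the data from this post.

```r
# Synthetic tweets for illustration only; not the data from this post
tweets <- data.frame(
  user     = c("A", "A", "A", "A", "B", "B", "B", "B"),
  score    = c(-2, -1, 0, 1, -2, -1, 0, 1),
  retweets = c(6000, 4000, 3000, 3000, 900, 1000, 1100, 1000)
)

# Index each tweet to its author's mean retweets (1.0 = that user's average)
tweets$engagement <- tweets$retweets / ave(tweets$retweets, tweets$user)

# Average engagement index at each sentiment score, per user
by_score <- aggregate(engagement ~ user + score, data = tweets, FUN = mean)
print(by_score[order(by_score$user, by_score$score), ])
```

Indexing to each user's own average puts accounts with very different follower counts on the same scale, so the shapes of the two curves can be compared directly.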


And another view of Trump's activity:


From a statistics perspective this is a significant correlation (r = 0.14), but with a low amount of explained variance (r squared is about 0.02, or roughly 2%).  For every point of negativity (net negative word) Trump gets, on average, an additional 240 retweets.
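The sketch below shows one way to get these kinds of numbers in base R: `cor.test` for the correlation and its p-value, and `lm` for the retweets-per-point slope.  The data are simulated (I plugged in a slope of 240 and a lot of noise just to mimic the situation), so the printed values are not the results from this post.

```r
set.seed(1)  # simulated data for illustration only; not the data from this post
n        <- 500
score    <- sample(-4:4, n, replace = TRUE)               # net sentiment score
retweets <- 3000 - 240 * score + rnorm(n, sd = 4000)      # noisy, assumed relationship

negativity <- -score                      # flip the sign so higher = more negative
test <- cor.test(negativity, retweets)    # Pearson r and its p-value
fit  <- lm(retweets ~ negativity)         # slope = extra retweets per point of negativity

cat("r =", round(test$estimate, 2), " p =", signif(test$p.value, 2), "\n")
cat("R^2 =", round(summary(fit)$r.squared, 3), "\n")
cat("slope =", round(coef(fit)["negativity"]), "retweets per point\n")
```

With a single predictor, R squared is just the square of r, which is why a correlation can be statistically significant on a large sample and still explain only a small share of the variance.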


CONCLUSION

In this case, sentiment models provided a lot of interesting information, more so than our initial analysis from Monday.  The main takeaways are:
  • Trump in aggregate is slightly more positive than Hillary.
  • Trump's sentiment is a bit more volatile.
  • Trump gets a ton of engagement by going negative.

CODE

For this analysis I used the word list and initial code from the R-Bloggers website, with a couple of changes:
  1. I created a normalized word score, which adjusts for tweet length when evaluating the most-net-negative tweets.
  2. The code didn't run out of the box for me (I believe some objects are mis-named), so I fixed these issues.
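As a quick illustration of the first change, the normalized score is simply the net score divided by the number of words in the tweet.  The two tweets below are hypothetical.

```r
# Two hypothetical tweets with the same net score but different lengths
net_scores  <- c(short = -2, long = -2)
word_counts <- c(short = 5,  long = 20)

# Normalized score = net score / number of words in the tweet
normalized <- net_scores / word_counts
print(normalized)   # short = -0.4, long = -0.1
```

Both tweets have a net score of -2, but the short one is far more negative per word, so it ranks as the more negative tweet after normalization.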

Here's my finalized function:

 library(plyr)     # provides laply  
 library(stringr)  # provides str_split  
 sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){  
  scores = laply(tweets,  
          function(tweet, positive_words, negative_words){  
           #print(positive_words)  
           tweet = gsub("[[:punct:]]", "", tweet)  # remove punctuation  
           tweet = gsub("[[:cntrl:]]", "", tweet)  # remove control characters  
           tweet = gsub('\\d+', '', tweet)     # remove digits  
           # Let's have error handling function when trying tolower  
           tryTolower = function(x){  
            # create missing value  
            y = NA  
            # tryCatch error  
            try_error = tryCatch(tolower(x), error=function(e) e)  
            # if not an error  
            if (!inherits(try_error, "error"))  
             y = tolower(x)  
            # result  
            return(y)  
           }  
           # use tryTolower with sapply  
           tweet = sapply(tweet, tryTolower)  
           # split sentence into words with str_split function from stringr package  
           word_list = str_split(tweet, "\\s+")  
           words = unlist(word_list)  
           #print(words)  
           # compare words to the dictionaries of positive & negative terms  
           #print(positive_words)  
           positive.matches = match(words, positive_words)  
           #print(positive.matches)  
           negative.matches = match(words, negative_words)  
          # print(negative.matches)  
           # get the position of the matched term or NA  
           # we just want a TRUE/FALSE  
           positive_matches = !is.na(positive.matches)  
           #print(positive_matches)  
           negative_matches = !is.na(negative.matches)  
           # final score  
           score = sum(positive_matches) - sum(negative_matches)  
           leng <- length(words)  
           score_two <- score/leng  
           return(c(score,score_two))  
          }, positive_words, negative_words, .progress=.progress )  
  return(scores)  
 }  
 score = sentiment_scores(mydata$text, positives, negatives, .progress='text')  
 mydata$score=score[,1]  
 mydata$score_two = score[,2]  
