Wednesday, May 18, 2016

Trump Gets A Large Boost from Negative Tweets

Earlier this week I posted on Presidential candidate tweets (Donald Trump versus Hillary Clinton), looking at how they differ and where they're similar (spoiler alert: everyone is talking about Trump... a lot).  That analysis was largely qualitative, but I conducted a more quantitative analysis as well.

I'll include some nerdy details below (as well as code) but the analysis used sentiment models to determine the underlying sentiment (positive, negative) of candidate tweets.  Here's a quick definition of sentiment modeling:
"refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials."

Effectively what we're doing is measuring the underlying feelings and emotions expressed in a tweet, and then reducing those to positive and negative.  This analysis is interesting partially because Hillary is criticized in the press for being overly negative, and not smiling enough, which is a weird criticism for a Presidential candidate.  But can we validate that notion with data?  I will post visualizations below, but here are a few quick findings:
  • The candidates are similar in aggregate sentiment with Trump being slightly more positive.
  • But Trump's sentiment is much more volatile (wider distribution), and he has a monopoly on the most negative tweets.
  • Trump gets more engagement with followers the more negative he is, aside from a certain "taco bowl" tweet.


Our sentiment model algorithm assigns a score to each tweet ranging from negative to positive, with negative numbers being increasingly negative sentiment, while positive numbers are increasingly positive sentiment.
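A minimal sketch of how that scoring works (the word lists here are hypothetical stand-ins for the full opinion lexicon):

```r
# toy example: hypothetical mini-dictionaries standing in for the full opinion lexicon
positives <- c("great", "win", "love")
negatives <- c("weak", "fail", "sad")

tweet <- "Total fail: weak candidate will fail"
# strip punctuation, lowercase, split into words
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", tweet)), "\\s+"))

# net sentiment: positive matches minus negative matches
score <- sum(words %in% positives) - sum(words %in% negatives)
score  # -3
```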

Here's a distribution analysis of Clinton, Sanders, and Trump tweets.  Note that Trump's tweets are shifted slightly right (more positive), but his distribution includes more tweets on the fringes and less central tendency (a lower modal value, a kurtosis effect).  In effect, Trump's tweets are a bit more bipolar than Clinton's.

Here's a look at some distribution statistics; note that Trump has a significantly wider distribution and is, on net, slightly more positive (also note that I included my own tweets in this :)).


While in the data, I found that some of Trump's more negative tweets were at least a bit amusing.  Should we look at them?  Let's do it in two ways.  First (to be fair to both candidates) I created wordclouds of each candidate's most negative tweets. (First Trump, then Clinton.)

The interesting information here is that when Trump goes negative he most often uses the names of opponents, and words like weak, total, and fail.  When Clinton goes negative she is most often referring to families, women, gun, and crisis.  I also ran this for Bernie, who seems to go most negative when talking about the country, poverty, and America.

Wordclouds are fun, but Trump has the ten most negatively rated tweets on the list, so let's just read them (I stripped these down to the simple text -- you're welcome).
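Pulling the most negative tweets out of a scored data frame is a one-liner; a sketch on hypothetical data:

```r
# hypothetical scored data frame standing in for the real tweet data
mydata <- data.frame(
  text  = c("weak total fail", "great win", "so sad", "love america"),
  score = c(-3, 2, -1, 2),
  stringsAsFactors = FALSE)

# the most negative tweets, plain text only (here we only have a few rows)
worst <- head(mydata[order(mydata$score), "text"], 10)
worst  # most negative first
```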


At dinner with my wife last night I described the project and she was quite interested.  Together we came up with another interesting question: engagement.  Do internet users engage with Trump and Clinton more or less when they are more positive and negative?

The easiest way to measure engagement is retweets, because it also measures a network effect.  I had to remove one tweet from this analysis ... this one.

That tweet was a far outlier: it actually scores as sentimentally positive (he loves Hispanics), but the reaction and underlying context are something completely different.  It's also an advertisement for one of his businesses, which helps keep its tone positive.  Anyway, I created a method to compare tweet sentiment to average engagement.

The chart below compares Clinton to Trump and shows an interesting fact: when Trump goes more negative he gets more engagement and response; Clinton's pattern is not as clear, and she does best with slightly positive tweets. (Engagement is indexed to 1.0 = average retweets on a per-user basis.)

And another view of Trump's activity:

From a statistics perspective this is a significant correlation (r = 0.14), though with a low amount of explained variance.  For every point of negativity (net negative word), Trump gets (on average) an additional 240 retweets.
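That figure is the slope from a simple linear regression of retweets on sentiment score; a sketch of the calculation on simulated data (the numbers below are made up to mimic the pattern):

```r
# hypothetical scored data standing in for the real tweets:
# more negative tweets tend to draw more retweets
set.seed(1)
score    <- sample(-5:5, 200, replace = TRUE)
retweets <- 3000 - 240 * score + rnorm(200, sd = 2000)

fit <- lm(retweets ~ score)
cor(score, retweets)   # the correlation coefficient (r)
coef(fit)["score"]     # average retweets gained/lost per point of sentiment
```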


In this case, sentiment models provided a lot of interesting information, more so than our initial analysis from Monday.  The main takeaways are:
  • Trump in aggregate is slightly more positive than Hillary.
  • Trump's sentiment is a bit more volatile.
  • Trump gets a ton of engagement by going negative.


For this analysis I used the word list and initial code from the R-Bloggers website, with a couple of changes:
  1. I created a normalized word score, which adjusts for tweet length when evaluating the most-net-negative tweets.
  2. The code didn't run out of the box for me (I believe some objects were mis-named), so I fixed those issues.

Here's my finalized function:

 library(plyr)
 library(stringr)

 sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){
   scores = laply(tweets,
     function(tweet, positive_words, negative_words){
       tweet = gsub("[[:punct:]]", "", tweet)  # remove punctuation
       tweet = gsub("[[:cntrl:]]", "", tweet)  # remove control characters
       tweet = gsub('\\d+', '', tweet)         # remove digits
       # error handling function when trying tolower
       tryTolower = function(x){
         # create missing value
         y = NA
         # tryCatch error
         try_error = tryCatch(tolower(x), error=function(e) e)
         # if not an error
         if (!inherits(try_error, "error"))
           y = tolower(x)
         # result
         return(y)
       }
       # use tryTolower with sapply
       tweet = sapply(tweet, tryTolower)
       # split sentence into words with str_split function from stringr package
       word_list = str_split(tweet, "\\s+")
       words = unlist(word_list)
       # compare words to the dictionaries of positive & negative terms;
       # match() gives the position of the matched term or NA -- we just want TRUE/FALSE
       positive_matches = !is.na(match(words, positive_words))
       negative_matches = !is.na(match(words, negative_words))
       # final score, plus a length-normalized score
       score = sum(positive_matches) - sum(negative_matches)
       leng = length(words)
       score_two = score/leng
       return(c(score, score_two))
     }, positive_words, negative_words, .progress=.progress)
   return(scores)
 }

 score = sentiment_scores(mydata$text, positives, negatives, .progress='text')
 mydata$score = score[,1]
 mydata$score_two = score[,2]

Monday, May 16, 2016

Most Common Terms Used by Candidates

I've once again been very busy the last couple of weeks, so the posting has been, well, really non-existent.  It's probably a good idea if I put something on the blog, such that I don't forget that I have one.

This morning I was filing away some recent twitter code, and thinking about the presidential race.  It appears that we're in for a Trump versus Clinton general election, and now most of the dialogue resides between these two candidates.  I ran some quick analyses, looking at sentiment and topics which I'll post in followups, but for an initial look I came up with word clouds representing the twitter dialogue between the two candidates.

TRUMP- Trump spends most of his time talking about Trump (largely quoting other people talking about Trump).  He also seems very thankful and uses words like America, Great, and Will. Bottom Line: Trump spends most of his time talking about himself, thanking supporters, and "making America Great again."

CLINTON- Clinton spends most of her time talking about... wait for it... TRUMP.  She also uses her own name, and talks about Women, Care, Families, and Children. Bottom Line: Clinton is focused on two primary dimensions: 1. defeating Trump and 2. social issues, largely women's and family issues. (If you were around in the mid 90's you might remember "It Takes A Village.")

BERNIE- He's still technically in the election, and the code is easy to run.  Bernie spends most of his time talking about voting, but also uses words like "Make, Must, Right." Bottom Line: Bernie spends most of his time encouraging his supporters to vote and making other calls to action for the country (classical campaigning, really).

BOWLES- Okay, this is my own Twitter feed, but ironically I may talk more about serious economic issues than any of the three candidates, largely talking about tax rates, interest, data, and impacts.  Bottom Line: I'm a nerd that talks about tax policy too much online.

Friday, May 6, 2016

OSHA Preliminary Death Data: How Do We Die at Work?

Earlier this week Twitter comedian, "devops thought leader," and Edward Scissorhands expert John Hendren (who tweets under the handle @fart) posted a tweet that caught my eye.  Specifically:

  • A normal person might react to this by saying "how horrible."
  • A nerdy or curious, but otherwise normal person might open the file, look at some of the accidents in horror and then close it.  
  • A normal data analyst would probably open the file, summarize by day of the week, and then say something lame like "most workplace deaths occur on... Friday or something."
  • I reacted in none of these ways.  I decided to make the world's most horrible word clouds (and use some cool text mining implementations). 


The file itself is just a web available CSV.  The really nice thing about the R stats language is that it's amazingly simple to read these types of files into data.

j <- read.csv("")

On inspection there are five attributes available in the table:
  •  $ Fiscal.Year
  •  $ Summary.Report.Date
  •  $ Date.of.Incident
  •  $ Company  
  •  $ Preliminary.Description.of.Incident
The data isn't very rich, in reality.  Sure, we know when people were killed at work and what company they worked for.  But any information on how they died, or what killed them, is locked up in irregular, uncoded text that appears to be written somewhat haphazardly. For a flavor of that text data, I created a word cloud:

For more flavor, here's actually the most awful description I found in the data:

Decedent was dumping a load of offal from a tractor trailer. He was in the process of dumping offal into a bin when the tailgate malfunctioned. Decedent was freeing the tailgate, it released, and the load swept the decedent into the offal bin. Decedent drowned in the bin.
Horrible. But what if we could use the text data to measure underlying ways that people die at work?


I used a method I've used quite a bit before on this blog, Correlated Topic Models, to measure the underlying topics -- essentially a way of summarizing and differentiating the ways people die at work.  For each of the topics I created a word cloud (for effect) and pulled some examples of the original OSHA text descriptions.

Topic One: Falls (Many Times off Roofs)

Topic Two: Industrial Injuries (explosions, falling into tanks)

Topic Three: Found Unresponsive (Generally Natural Causes)

Topic Four: Electrocution

Topic Five: Tractor Trailers and Warehouses

Topic Six: Hit by Something (Cranes, Trucks, Booms)

Topic Seven: Crushed (by Various Things, Low Frequency)

And here are some of the individual descriptions of our categories above:

Decedent fell 4-ft, 8-inches off a platform, striking his head.
Decedent was using a scaffold above 10 to 15 feet while painting. Instead of extending the scaffold, he used a step ladder on the scaffold, and fell off the scaffold.
Decedent was trimming a tree and fell 60 feet to the ground.
Worker was sandblasting under a bridge and fell 124-feet from a two-point suspension scaffold.
Decedent was working inside a mobile home, mixing propane and butane to make a substitute refrigerant. A fire occurred. The cause of death was determined to be carbon monoxide asphyxiation.
Decedent was walking across a tank and fell through a hatch into a tank of boiling water. He either drowned or died of thermal burns.
Worker was washing flights on an auger of a concrete machine and was pulled into the flights of the auger.
The worker was trapped in a large auger attached to a grain silo.
The decedent returned from his break to his work area and was sitting when he fell over. He was transported to a local hospital where he was pronounced dead. It was determined the decedent died from natural causes.
The decedent was not feeling well and left work early. While he was sitting waiting for the bus he collapsed and was non-responsive from an apparent heart attack. He was transported to the hospital where he was pronounced dead.
The worker was found behind the sales counter unconscious and unresponsive. The worker was pronounced dead at the scene by the coroner's office at 4:30pm.
Worker was found unresponsive in the employee restroom.
Worker was performing welding duties aboard a marine vessel. An electrode from his welding equipment contacted the sweat on his neck, causing an electric shock.
Worker was electrocuted.
Worker was trimming trees and was electrocuted, after the aerial lift contacted an overhead powerline, causing the bucket truck to become energized.
Worker was under a home doing plumbing work in a tunnel, dug under the concrete foundation using an electrical shovel type drill, and was electrocuted.
Worker was struck by a pick-up truck that was backing up from a warehouse.
Worker was waiting in line to be loaded out at a mill. He had exited his truck to check on something on the trailer in front of him. While doing so the truck pulled forward. The trailer tires struck and traveled over the worker.
Worker was unloading a vessel and was struck by a loose spinning cargo sling chain.
Worker was standing along side his semi-trailer, as it was being unloaded by a powered industrial truck. A bundle of steel weighing 2700-pounds fell, striking him.
Decedent was observed standing behind the parts counter, conscious but bleeding from the head.
Worker was assigned to check out a semi-trailer's front right air bag. While positioned between the two trailer axles, he was struck on the head by the left front air bag's base cup.
Worker was performing work on a gas drilling rig and was struck in the chest by a drill pipe.
Worker was operating a line 2 extrusion machine and was struck by a plug at the end of the line.
Worker was crushed by steam roller.
Worker was working alone and found by other employees having either fallen into or been inadvertently crushed by a moving part on a piece of equipment.
Worker was crushed between a forklift and a parked flatbed truck.
Worker was crushed by a cable after being pulled into a motorized capstan.


This data is interesting, as it gives us some insight into how people die at work.  But measuring topics this way also allows us to summarize the data using our newly observed topics.  To do this, we simply average the probability that each death belongs to each category.  This gives us a sense of the relative frequency of each type of death in our underlying data set.
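A sketch of that summary step, using a hypothetical matrix standing in for the model's per-document topic probabilities (in the topicmodels package these come from `posterior(ctm)$topics`):

```r
set.seed(3)
# hypothetical matrix standing in for posterior(ctm)$topics:
# one row per death, one column per topic, rows sum to 1
post <- matrix(runif(100 * 7), nrow = 100, ncol = 7)
post <- post / rowSums(post)
colnames(post) <- paste0("topic", 1:7)

# relative frequency of each death type = average topic probability
topic_share <- colMeans(post)
round(sort(topic_share, decreasing = TRUE), 3)
```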

Ok, falls seem like a bad thing in the workplace, and unresponsive is number two, though that seems to be mostly people dying of natural causes at work.  But what about my "day of the week" question from earlier...

Weekends are low, and, well... it looks like Wednesday is the most dangerous day in the workplace, especially in warehouse environments.  But how about another look into daily skews:

Natural causes (unresponsive) tend to over-skew on Sundays, but that may be due to low death volume or industrial jobs having the day off.  Other skews exist too: Tuesday is a big fall day, Thursday is a big electrocution and industrial accident day, and Saturday is a relatively big day to get crushed.  Actually, these are easier to read as a 1.0 index:
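A sketch of how the 1.0 index is computed, on hypothetical topic/weekday assignments: each cell is a topic's share on that weekday divided by the topic's overall share.

```r
set.seed(4)
# hypothetical topic assignments and weekdays standing in for the real data
df <- data.frame(
  topic   = sample(1:7, 500, replace = TRUE),
  weekday = sample(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"), 500, replace = TRUE))

# each topic's share within a weekday, divided by its overall share
tab   <- prop.table(table(df$weekday, df$topic), margin = 1)
index <- sweep(tab, 2, prop.table(table(df$topic)), "/")
round(index, 2)   # 1.0 = that topic's average rate
```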

And some full code for this project:

 library(tm)
 library(wordcloud)
 library(RColorBrewer)
 library(topicmodels)

 j <- read.csv("")
 mydata <- j
 mydata$text <- mydata$Preliminary.Description.of.Incident
 #clean data frame
 try(mydata$text <- tolower(mydata$text))
 mydata$text <- gsub("@\\w+", "", mydata$text)
 mydata$text <- gsub("[[:punct:]]", "", mydata$text)
 mydata$text <- gsub("http\\w+", "", mydata$text)
 #create corpus
 corp <- Corpus(VectorSource(mydata$text))
 #clean corpus
 corp <- tm_map(corp, content_transformer(tolower))
 corp <- tm_map(corp, removeNumbers)
 corp <- tm_map(corp, removePunctuation)
 corp <- tm_map(corp, removeWords, stopwords("english"))
 corp <- tm_map(corp, removeWords, c("decedent","worker"))
 #stem words into roots
 corp <- tm_map(corp, stemDocument, "english")
 corp <- tm_map(corp, removeWords, c("decedent","worker"))
 corp <- tm_map(corp, stripWhitespace)
 matx <- DocumentTermMatrix(corp)
 #print frequent terms as an overall word cloud
 par(bg = "black")
 wordcloud(corp, scale=c(5,0.5), max.words=400, random.order=FALSE,
           rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
 #fit a seven-topic correlated topic model and assign each death to a topic
 ctm <- CTM(matx, 7)
 mydata$topics <- topics(ctm)
 m <- mydata
 #per-topic word clouds, one PNG each
 for(i in c(1:7)){
      z <- subset(m, topics == i)
      z <- Corpus(VectorSource(z$text))
      z <- tm_map(z, removeWords, stopwords("english"))
      z <- tm_map(z, removeWords, c("decedent","worker"))
      #stem words into roots
      z <- tm_map(z, stemDocument, "english")
      z <- tm_map(z, removeWords, c("decedent","worker"))
      pal2 <- brewer.pal(8,"Dark2")
      png(paste(i,".png",sep = ""), width=8, height=8, units="in", res=300)
      par(bg = "black")
      wordcloud(z, scale=c(5,0.5), max.words=400, random.order=FALSE,
                rot.per=0.35, colors=pal2)
      dev.off()
 }

Thursday, May 5, 2016

Social Media Data: Identifying Tweet Schedulers

I've been working with social media data recently, which always makes me a bit more attuned to my interactions on Twitter and Facebook. I saw a post from an acquaintance that was interesting: the person was talking about their current emotional state, though the post was obviously scheduled in advance.

The ability to schedule posts is quite useful for organizations trying to make multiple contacts with users, or hitting users in off-hour time periods.  To those of us that work with social media data, it's fairly easy to identify scheduled posts on their face.  If an algorithm could be developed to identify likely users of scheduled posts, we could remove them from algorithmic analysis, which is a good thing for a few reasons:
  • Scheduled posts are more likely to be spam.
  • Scheduled posts are more likely to be out-of-time (not posted in direct relation to a current event).
  • Scheduled posts are more likely to be emotional fraud (people posting about how great they are.. scheduling their own emotions.. which is fairly difficult to do).
In my career there's a fairly big payoff for identifying tweet schedulers and spammers, but other people seem interested in this too (wanting to know when their friends are just generating content, rather than being sincere on social media).  I'll look at both a simple (every-day user) and data-derived-algorithmic way to identify scheduled posts.


In order to predict or identify a process or event, we need an understanding of said process, and how it differs from "ordinary" events.  Here's what scheduled social media processes look like:
  1. You use a scheduling program (TweetDeck, HootSuite, Buffer).
  2. You pick a time that you want to post.
  3. You create content you want to post.
Using this knowledge of the process we can design a method to determine if someone is using a tweet scheduler. I'll use my prior Twitter data mining of the #ksleg hashtag.  Here's my R code to pull the data:

 library(twitteR)

 #search settings
 searchterm <- "#ksleg"
 numtweets <- 10000000
 since_date <- "2016-02-10"
 min_freq_terms <- 100
 topic_num <- 4
 #auth in (API keys redacted)
 setup_twitter_oauth(consumer_key = "",
                     consumer_secret = "",
                     access_token = NULL,
                     access_secret = NULL)
 #pull from twitter
 tweets <- searchTwitter(searchterm, n = numtweets, since = since_date)
 #pull out retweets
 tweets <- strip_retweets(tweets,
                          strip_manual = TRUE,
                          strip_mt = TRUE)
 #JSON to Data Frame
 tweets <- twListToDF(tweets)
 #create a clean object
 mydata <- tweets
 #create a field of direct URL to each tweet
 mydata$tl <- paste("https://twitter.com/", mydata$screenName,
                    "/status/", mydata$id, sep = "")

This creates a nice data frame, which I exported to Excel for demonstrative purposes.  We'll start our analysis from there.


Non-Technical Explanation:  The easiest-to-identify element of a scheduled tweet is its origin, because scheduled social media posts usually come from some external program.  If you just want to know whether your friends are using a scheduler, you can often identify it from the post itself; for instance, on Facebook an icon will appear next to the post like this:

Technical Explanation: Using our downloaded data, identifying tweet scheduling programs is just as easy with the *statusSource* field from our data frame above.  This field shows the program of origin for each tweet, whether it came through the web interface, Buffer, or Hootsuite. Here's a screenshot that shows that column, with yellow indicating common scheduler programs.
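A sketch of that check on hypothetical *statusSource* values (the real field wraps the client name in an HTML anchor tag, so it's worth stripping the tags before tabulating):

```r
# hypothetical statusSource values standing in for the real column
mydata <- data.frame(
  statusSource = c('<a href="http://twitter.com">Twitter Web Client</a>',
                   '<a href="https://bufferapp.com">Buffer</a>',
                   '<a href="https://bufferapp.com">Buffer</a>',
                   '<a href="http://www.hootsuite.com">Hootsuite</a>'),
  stringsAsFactors = FALSE)

# strip the anchor tags, then tabulate clients
mydata$client <- gsub("<[^>]*>", "", mydata$statusSource)
sort(table(mydata$client), decreasing = TRUE)
```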

Identifying tweet scheduling programs gives us some good candidates for tweet schedulers, but you can also directly tweet from Buffer or Hootsuite so we need to look at other factors.


Identifying the time of a social media post is fairly easy, both online and through our tweet metadata data set.  Here's what the timestamp looks like in each:

But how do those help us identify tweet schedulers?

Non-Technical Explanation:  Look for your friends that tend to post everything at the same "minute" mark.  Most notable are those that always post on the hour, half hour, or ten-minute increment.  But also the people who schedule on the 5's or other seemingly pre-set timing (I have a friend that tends to schedule posts on the 3's).

Technical Explanation: Digital analysis, a simple phenomenon in which psychologically chosen distributions don't match random, natural, or stochastic distributions.  To understand this, you can read about Benford's distribution and its application to forensic accounting.  For an algorithmic implementation of this analysis, run each user's "minute" distribution against the overall distribution using a Chi-Squared Goodness of Fit test.  Users whose minutes don't fit a flat or global distribution are more likely to be tweet schedulers.
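A sketch of that test on one hypothetical user's posting minutes:

```r
# hypothetical minute marks for one user's posts, piled onto :00 and :30
minutes <- c(rep(0, 12), rep(30, 10), 7, 22, 45, 51)

# counts across all 60 possible minute marks, tested against a flat prior
obs  <- tabulate(minutes + 1, nbins = 60)
test <- suppressWarnings(chisq.test(obs, p = rep(1/60, 60)))
test$p.value   # a tiny p-value -> posting minutes are unlikely to be random
```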


The content of the posts often varies significantly for post schedulers too.  In simple terms, the decision to schedule your posts is a very deliberate content-creation choice. A few simple rules for the content created by this process:
  • The posts are less likely to be a reply to another user's post.
  • The posts are MORE likely to contain a link or other pre-planned content.
  • The posts are tonally more likely to be a call to action in some way.
Non-Technical Explanation: Your friend's posts seem like a pre-planned call to action, aren't responses, and tend to contain links or marketing material.

Technical Explanation:  For items 1 and 2 (replies and links) you can fairly easily create factors in your data set that identify these types of posts (and either remove them from consideration, or model association probabilities).  For the call to action we can use sentiment and topic modeling to attempt to identify posts that resemble "asks" or "calls to action."
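A sketch of those two flags on hypothetical tweet text:

```r
# hypothetical tweets standing in for the pulled data frame
mydata <- data.frame(
  text = c("@someone thanks for the kind words",
           "Check out our new post https://t.co/abc123",
           "just thinking out loud"),
  stringsAsFactors = FALSE)

# flag replies (leading @mention) and links (http/https)
mydata$is_reply <- grepl("^@\\w+", mydata$text)
mydata$has_link <- grepl("https?://", mydata$text)
mydata[, c("is_reply", "has_link")]
```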


As I mentioned before, there is a business advantage to identifying scheduled posts, especially if we can create an algorithm and automate it.  Then we can use this algorithm to identify fraudulent and spam users, and move them out of analytical processes.

Starting on that algorithm (using our data from above), and implementing the three areas above I coded a solution that does the following things:
  • Removes tweets that came from the Twitter web application (these are unlikely to be scheduled).
  • Removes posts that are replies to other users.
  • Calculates a "goodness of fit" test at the user level, comparing last-digit date-time frequencies (minutes) to a flat-line prior.
  • Calculates the % of posts that are content creation (largely links and graphics).
  • Calculates each user's % of time posting from services versus the web application.
  • Topic-models tweets to find "call to action" type tweets.
  • Builds a classifier that marries the above factors together in a robust way to classify tweets by their overall likelihood of being scheduled.
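As a rough illustration of the final step, here's one crude way to marry a few of those factors into a single score (the features, weights, and user names below are all hypothetical):

```r
# hypothetical per-user features standing in for the factors computed above
users <- data.frame(
  user          = c("org_account", "normal_person"),
  gof_pvalue    = c(0.001, 0.62),  # chi-squared fit of minutes to a flat prior
  pct_links     = c(0.90, 0.10),   # share of posts containing links
  pct_scheduler = c(0.85, 0.00),   # share of posts from scheduler clients
  stringsAsFactors = FALSE)

# crude scheduler-likelihood score with hypothetical equal weights
users$sched_score <- with(users,
  ((gof_pvalue < 0.05) + pct_links + pct_scheduler) / 3)
users[order(-users$sched_score), c("user", "sched_score")]
```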
I'm still working on refining the classifier, and hope to bring in some more sophisticated methods eventually, but the initial results are good.  Here are some tweets from the top two *likely to be scheduler* accounts (click to view):


Scheduled social media posts are a fact of life now, and they are useful for organizations trying to optimize messaging. They also are useful for spammers and others looking to use social media for manipulative purposes.  Either way, there is a business and analytical advantage in being able to discriminate between organic and scheduled posts, and my initial attempts at an algorithm provide a road-map.