Thursday, August 25, 2016

Hitler, Trump, Hillary and Social Network Data

Last night on Facebook I saw an interesting post from someone I went to school with... here's the general nature of his complaint: 
He's fairly conservative (a Trump supporter) and recently created a post on Facebook that likened Hillary Clinton to Hitler.  One of his friends reported the post to Facebook, and Facebook removed the post from visibility, labeling it offensive.
This post actually struck me fairly hard. Two reasons:
  1. I may not agree that Hillary is effectively Adolf Hitler, but it seems firmly within this guy's rights to compare political candidates to Hitler. In fact, there's nothing more American than comparing people you disagree with to Nazis (there's even an internet adage about this: Godwin's law).
  2. I see people compare Donald Trump to Hitler all the time. Literally... all the time, with no apparent repercussions. Is there a double standard on Facebook in comparing Hillary to Hitler? This is especially interesting to me considering Facebook's reported bias against conservatives (which they blamed on rogue employees).
So, I thought it might be interesting to dig into the data of Hitler references on contemporary social media, and see what people most often discuss in relation to Hitler. Here's what I found on Twitter:
  1. The most common term, by far, used when discussing Hitler is "Trump."
  2. Right now, there are really only two Hitler-related topical discussions on Twitter: one related to Trump and the other pertaining to Syria.

QUICK METHODOLOGY 

I downloaded the last ten days of tweets (all tweets) containing the hashtag "#Hitler," which was around 15K total tweets. I ran a cleanup algorithm I designed to remove tweets that contain duplicate content, likely tweetbot posts, and spam. The result was about 3,500 clean tweets that mentioned Hitler and were unlikely to be spam.  

Then I conducted some text-data cleaning steps. You can look at some prior posts on this site to understand what this entails, but generally it removes frequent words that are less meaningful (e.g. "the") and reduces words to their stem, or root meaning (e.g. "run" and "running" are interpreted the same way).
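The cleaning steps above can be sketched in R with the tm package. This is a minimal illustration, not my actual pipeline; the character vector `tweet_text` is a stand-in for the downloaded tweets.

```r
library(tm)

# Assumed input: a character vector of tweet texts
corpus <- VCorpus(VectorSource(tweet_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drops "the", "and", etc.
corpus <- tm_map(corpus, stemDocument)    # "running" -> "run"
corpus <- tm_map(corpus, stripWhitespace)
```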

THE DATA

I wanted to solve two questions with this analysis:

  1. What do people talk about most when they say Hitler on social media (sub-question: how often is Trump mentioned?)
  2. What are the general topics of conversation regarding Hitler?

The first part of the analysis was easy: I found that tweets with the term "Hitler" most often use the word "Trump," followed by a series of election-related words including "like" and "support." Terms related to Nazis are also popular, specifically "adolf" and "nazi."

Of note, "Hillary" and "Clinton" aren't in the top 20 terms associated with Hitler, though Trump's Twitter user name is, as are the marginally associated words "never" (as in Never Trump) and "vote." Searching the data, Hillary was mentioned about 90 times in the data set, so Hitler tweets are only about 12% as likely to contain Hillary as they are to contain Trump.
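Ranking terms like this amounts to summing the columns of a document-term matrix. A minimal sketch, assuming a cleaned tm corpus named `corpus`:

```r
library(tm)

# Build a document-term matrix from the cleaned corpus and rank terms by frequency
dtm <- DocumentTermMatrix(corpus)
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(term_freq, 20)  # the top 20 terms associated with Hitler
```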


And here's what those tweets look like as a wordcloud:




Next, I wanted to discover the underlying topics inside the data, for which I used correlated topic models (CTMs). I've written about them before on this blog, and technical specifications can be found here. For this analysis, I reduced my dataset to just tweets that hash-tagged "Hitler," for two reasons: 
  • faster processing
  • only analyze tweets with the strongest relation to Hitler
The algorithm seemed to converge on only two detected topics. What does that mean? Over the past week, people on Twitter mentioning Hitler have generally been talking about one of two things: (1) Trump and (2) Assad of Syria. Here are the top terms for these two topics:
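Fitting a CTM is available in the R package topicmodels. Here is a hedged sketch, assuming a DocumentTermMatrix `dtm` built from the hashtag-filtered tweets; the choice of k = 2 reflects where the analysis above settled:

```r
library(topicmodels)

# Fit a correlated topic model with two topics to the tweet document-term matrix
ctm_fit <- CTM(dtm, k = 2)

terms(ctm_fit, 10)   # top 10 terms per topic
topics(ctm_fit)      # most likely topic assignment per tweet
```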



One might argue that Trump supporters are also mentioning Hitler, but the term output makes it clear who is invoking Hitler and Trump: it is generally the "Never Trump" movement, as evidenced by the terms "Never" and "Trump The Fraud." For a bit more color, here are the associated wordclouds for each topic.

First Assad:



And now Trump (my favorite random association is "Vote Hitler"):


For some final color on this, here is an example of a couple of Tweets, generally representative of the Trump category:



CONCLUSION

A few takeaways:
  • The most common topic discussed on Twitter when using the term Hitler is "Trump." Users seem to continuously make the comparison of Trump to Hitler, without being punished by the social media platform.
  • The two main subjects of discussion when talking about Hitler on Twitter appear to be Trump and Assad. The Assad result isn't hugely surprising given recent news out of the Middle East. The Trump comparison to Hitler is a bit more jarring, given that he is the presidential candidate of a major US party.
QUICK COMMENTARY: The guy with the banned post from the beginning of this entry just posted that he challenged the removal with Facebook, and Facebook relented, saying the post was in fact appropriate. This is interesting, and it tends to coincide with previous articles stating that the individuals at Facebook who field front-line complaints are often biased against conservative views and tend to over-extend censorship of conservative positions. Obviously, I looked at a different social network, but it is commonplace to compare Trump to Hitler, so it seems like the Hillary-to-Hitler comparison shouldn't be censored.

Friday, August 19, 2016

Trump and Breitbart Alliance: A Match Made on Emotion?

For the Trump campaign, this week has been fairly crazy, highlighted by the naming of Stephen Bannon, executive chairman of the conservative news organization Breitbart News, as the campaign's chief executive. There has been quite a bit of punditry on this subject, but that's not really the place of this blog. I'm focusing on a follow-up to our prior post on Trump's use of disgust to drive engagement, emotional reaction and (ultimately) political support. First, a summary of earlier findings:

  • Trump not only receives engagement boosts from using disgust-based language, but also from using anger, fear, sadness, and trust. He doesn't see boosts from joy, anticipation, or surprise.
  • Neither Hillary Clinton nor Bernie Sanders get a statistically significant boost from using emotional language on Twitter like Trump. Both of these candidates see flat relationships with emotion and engagement. (Full disclosure, the author of this blog sees a positive and significant boost in retweets from using disgust language.)
  • Breitbart news sees a similar boost in engagement (from emotional speech) as Trump. This may go some way toward explaining the Trump/Breitbart alliance: they use similar tactics to engage users, AND their user bases respond to similar types of language.


 METHODOLOGY

(non-nerds can skip)

For this project, I used effectively the same code and methodology as my prior post on Trump's use of disgust in tweets. I made a few improvements (some code at the end of this post).
  • I downloaded the Twitter feeds for four internet news sites (Foxnews, CNN, Breitbart, MSNBC), and scored those feeds the same way I scored candidates in my prior post. I also followed the same normalization strategy, controlling for the incidence rate of each emotion in the data set and each entity's tendency toward emotional language.
  • I created linear models for each candidate/news-agency (entity) by emotion pair (8 entities * 8 emotions = 64 models), with an observation level of "tweet," a dependent variable of retweets, and a predictor variable of emotion. These models show the rate at which each emotion drives engagement for each entity.
  • I created an output matrix of the emotions, reporting only statistically significant results, for easy comparison of engagement by emotion and entity.

RESULTS

For commentary on the meaning of the charts, please reference the prior post on Trump's use of disgust. The basics are this: we use a sentiment mining algorithm to measure the overall emotion of tweets, and then aggregate the results to each user (candidates and news agencies) and relate that to engagement to determine which emotions drive social engagement results for which groups.

First, our emotional term index, which shows that Breitbart is actually the least emotional-sounding news agency, compared to Foxnews, which is the most emotional.


Next we summarize the normalized emotional tendencies of each news agency.  Breitbart (of special interest) wins on use of "disgust," MSNBC wins hugely on "surprise", and CNN wins on "trust" emotions.  


That last chart demonstrated an important component related to our prior post: Trump's strategic ally Breitbart news also tends to use a lot of disgust emotional language and signaling in their tweets. But are they as successful in driving engagement by use of disgust as Trump is? Time for a statistical test.

My prior post received quite a bit of traffic, but was a bit intellectually lacking in my opinion. I had demonstrated Trump's relationship with disgust, and his followers' reaction to it, but I hadn't looked at two other dimensions:
  1. Do other candidates get engagement boosts from disgust or other emotions? 
  2. Does Trump get boosts from other emotions outside of disgust?
To test these questions I created 64 linear models, which serve as statistical tests of the effect of emotional language on engagement, measured by retweets (side note: I threw in my own tweets as a comparison). The code for those models is found at the bottom of this post, and the results are in the heatmap directly below.


The red results above are statistically insignificant, with green results showing statistically significant coefficients (emotions that give candidates a significant boost). The numbers in the chart represent the actual coefficients:
  1. News agencies (except for Breitbart) see very little engagement response to emotions. The relationships we do see for Fox and CNN involve emotions we might associate with breaking news (anticipation, joy, surprise).
  2. Breitbart news sees a stronger relationship with emotional tweets. The strongest engagement-generating emotions for Breitbart are anger, disgust, and sadness.
  3. Sanders and Clinton do not see increases in engagement when they use emotional language.
  4. Trump sees the largest positive relationship with engagement and emotional language, which is strongest on the emotions similar to Breitbart.
  5. Full disclosure: the author of this blog sees positive engagement from disgust tweets too.

An easier visualization of the disgust measure, looking first at candidates, then at news agencies (x-axis is a disgust rating, y-axis is a retweet count):






CONCLUSION

A few bullet points in closing:
  • The Breitbart/Trump alignment makes sense both from the way they speak AND from the way their followers engage with their tweets.
  • Other candidates and news agencies see much less engagement from openly emotional tweeting.
  • Trump and Breitbart both get more engagement in their tweets by using anger, fear, disgust, sadness, and trust.




 # Model dimensions: entities (Twitter users) and the eight primary emotions
 o <- c("CNN", "BREITBART", "FoxNews", "MSNBC")
 e <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")
 # Matrices for output: model coefficients and p-values
 jj <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))
 pv <- matrix(nrow = length(o), ncol = length(e), dimnames = list(o, e))
 for (j in e) {
   for (i in o) {
     dd <- subset(mydata, user == i)
     # Build retweetCount ~ <emotion> directly rather than via eval(parse())
     fit <- summary(lm(reformulate(j, response = "retweetCount"), data = dd))
     jj[i, j] <- fit$coefficients[2, 1]  # slope for the emotion term
     pv[i, j] <- fit$coefficients[2, 4]  # p-value for the slope
   }
 }
 # Keep only coefficients significant at the .05 level
 out <- ifelse(pv <= .050, 1, 0) * jj

Wednesday, August 17, 2016

Pop Culture Data Scientists in TV and Movies

A few months ago my wife was watching House of Cards and called me into the room... the conversation went something like this:

Wife: Hey there's a data scientist on this show!
Me: Really?  .. why...
Wife: Yeah, he's a weirdo, just like you!
Me: .. thanks

I watched a couple of episodes and found out the character was in fact much weirder than I am. But I also thought this was the first data scientist title I had seen fictionalized in popular culture. Given that this blog has been a bit too serious lately, I thought it would be interesting to compile a list of fictional data scientists, assess whether they're really data scientists, and add another fictional metric: would Levi hire them?

An interesting thing I found in researching this is that, since the concept of data science is relatively new, there are very few actual data scientists in pop culture. It also may be difficult to fictionalize what data scientists actually do in any kind of interesting way, but... whatever.

The list is short, and most people on the list wouldn't necessarily identify as data scientists, but here we go:


Aidan MacAllan, House of Cards:  


Synopsis: Aidan is a data scientist who apparently works for the government, or the President and his wife directly, and does... data science-y stuff for them. Most of his tasks resemble real data science tasks (large-scale web analytics, identifying marketing targets based on who has been close to gun violence), and others aren't really data science but sound cool (tapping a phone). The portrayal of Aidan as a data scientist is romanticized, but some of the tasks are at least in the ballpark of what we can do.

Personality Portrayal: My wife was wrong, he's much weirder than me. The only similarities between Aidan and me are consistently messed-up hair and listening to death metal. But he dances naked when he's alone compiling code (I DO NOT DO THIS). And the character is portrayed as a weird artistic savant, which is good for Hollywood story lines, but overplays the real personalities of most data scientists.

Would I hire?  No.

Reasoning: The naked dancing thing, questionable data ethics, and overstatement of certainty.




Seth Bregman and Peter Sullivan: Margin Call. 


Synopsis: Another Kevin Spacey project, strangely enough, maybe he likes data science! I like the movie Margin Call a lot, partially because the movie is about quants getting things right. In this movie Seth and Peter are identified as risk analysts, but had this movie been released in 2016 rather than 2011, there's a good chance they would be data scientists. At the beginning of the film they work with a model that shows their firm to be over-leveraged, which sets up the rest of the film. Though not strictly data scientists in title, they create models/simulations out of large data sets, which is effectively what data scientists do.


Personality Portrayal:  These guys aren't portrayed as weirdos, but more junior Wall-Street guys that are maybe a bit more numbers/science focused. One aspect of the movie that parallels modern data scientists is that they have broad academic backgrounds (e.g. astrophysics) and have been brought into business to solve large financial modeling problems.

Would I hire? Yes.

Reasoning: They are portrayed as generally competent, and aren't afraid to escalate issues to management, which is important.




Peter Brand: Moneyball.

Synopsis: This isn't even a data scientist really at all, but I'm already hitting the bottom of the well with examples. Jonah Hill plays this character as a complete nerd who is basically a statistician and sabermetrics expert. As portrayed on film, it doesn't appear he has the computer science skills required for a data science career. He does one very important thing for data scientists in the film though: he continually tells those around him that they are looking at the wrong metrics, and that by focusing on metrics that actually create production, they can run a better business.

Personality Portrayal: Nerdy, classic statistician that likely plays Magic the Gathering in his free time. This may just be an offshoot of casting Hill to play the role though.

Would I hire? Maybe.

Reasoning: Yes, if I had a lot of tasks that lent themselves to econometric rather than machine-learning models. And I could pair a data engineer with him to handle the computer science side of the job.




Max Cohen: Pi.  

Synopsis: This is way before what we think of as modern data science, but the essentials are there: looking for patterns in large and somewhat unstructured data sets. Basically, an unemployed number theorist starts analyzing data in stock markets and finds patterns. He makes accurate predictions based on these patterns (seemingly related to a 216-digit number). He then moves on to numeric/textual analysis of the Torah (which, BTW, is a fairly common junk science). In the end, the numbers drive him crazy and he solves the problem with a power drill to the brain (seriously).

Personality Portrayal: Paranoid, potentially schizophrenic, intelligent, with extremely bad headaches. I've seen data scientists get to this point, but it's usually solved by a good night's rest rather than a power drill.

Would I hire? No.

Reasoning: He's crazy.



Honorable Mention/Exclusions:
  • Alan Turing, The Imitation Game:  Excluded because Alan Turing was a real (awesome) person.
  • Artificial Intelligence, Her, Ex Machina: Generally nameless data scientists, and highly speculative futurism.  I prefer real data scientists solving real problems.

Thursday, August 11, 2016

Contracting Rates, Competition, LinkedIn's Race to the Bottom

A few weeks ago I was looking to connect with more people in data science (read: networking) so I did something out of the ordinary for me: I joined a LinkedIn group. It was a group related to R statistical engine programming, and I assumed I would join it, read some posts, and maybe it would keep me engaged in the field.

Questions started popping up in the group, generally related to the best ways to make certain graphs in ggplot2 or how to handle certain dataframe manipulation tasks. I ignored. Then a question caught my interest:
I've been asked how much the hourly rate is for a freelancer and I have no idea. Could anyone provide a ballpark range in US dollars?
Going against my gut, I engaged with LinkedIn.

BACKGROUND

The question was interesting, and I contract on occasion so I thought I should type a quick response. I clicked through on the question, and saw the following first answer:
Hi M----, I am a consulting statistician with many years experience using R. I charge $40/hour for my services.
Things just got weird. I'm more familiar with contract rates in the $150-$200 per hour range. Very weird. What could be going on, and why is this person's rate so low? First, a few facts on data science salaries:
  • Median salary for a data scientist in the US is about $112,000 (~$55 an hour + benefits)
  • Mean salary for a data scientist in the US is about $125,000 (~$61 an hour + benefits)
  • Contract or freelance gigs in any field generally pay more per increment (per hour) than comparable full-time gigs, for a couple of reasons:
    • Contractors have less stable employment and thus earn more as a hedge against instability and opportunity costs (time spent marketing, invoicing, etc).
    • Businesses using contractors incur much lower total costs than if they hired a full-time employee, so they are willing to pay a premium on the per-increment rate.
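As a back-of-envelope check on those numbers (assuming a 2,080-hour work year; the contractor premium multiplier is purely illustrative, not a figure from this post):

```r
# Convert annual salary to an approximate hourly rate
salary_to_hourly <- function(salary, hours = 2080) salary / hours

salary_to_hourly(112000)  # median salary: roughly $54/hour
salary_to_hourly(125000)  # mean salary:   roughly $60/hour

# A contractor hedging against instability might quote a multiple of that:
round(salary_to_hourly(125000) * 2.5)  # lands in the $150/hour neighborhood
```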
The original question wasn't looking for a data scientist per se (just a statistician with R experience) but the skillsets and pay rates are somewhat similar.  The question came to my mind: why were some data science related contract rates so far below normal data science salaries in the US?

FIGURING OUT DATA

A lot of people ended up responding to the LinkedIn question, so I had a fairly large sample of analysts and their self-reported charge for contract work. Because it was LinkedIn I could also click-thru to their resumes and determine their experience and education backgrounds, as well as other demographic factors. Per the post, I also referenced the website Upwork, which is kind of like an Uber service for Freelancers in various fields (to increase my sample).

In the data, I found three basic groups in the posts:
  • Workers living overseas, especially in South Asia who were willing to work for sub-par wages ($30-40 an hour). This group seemed to charge less for a number of reasons:
    • Businesses incur a bit of risk in working in these areas, which shifts wage rates down.
    • The exchange rates and local cost of living in these areas make lower wages more tolerable to data scientists living in the area.
    • Data science and IT jobs in that region (substitute employment) pay less than in the United States.
  • Workers living in the US with thin or non-existent resumes, willing to work for a sub-par wage ($30-50 an hour). These individuals generally had strong educational backgrounds (some with PhDs), but had resumes that lacked any substantive analytical experience. I know that there are some people who have great educational backgrounds but are unemployable for various reasons (e.g. personality, work ethic), so it isn't hugely surprising to see highly educated people out of work. A couple reasons that these rates are likely under-market:
    • Many may be willing to work under market because they are CURRENTLY unemployed and/or unemployable.
    • Much of a data scientist's value in the workplace comes from solving real-world business projects. These workers realize that they have substantively less to sell themselves on to large employers.
    • For unemployable individuals, there is very little potential for substitute employment (getting a real data science job) so they are willing to work temporary work for far less.
  • Workers living in US or similar countries, with long resumes, who use similar contracting rates ($120-$300 an hour). These were generally US residents with a background similar to mine, having worked in analytics and data science for many years within large companies. These are generally people with $100K+ day jobs, that will contract in their free time, if a company will "make it worth it."


CONCLUSION

Though data science contractors with business experience in the United States are extremely well compensated, those with limited experience or working overseas cost a fraction of the price. These overseas and low-experience resources are likely best suited for projects of low-level coding or entry-level data science; however, as one responder to the original LinkedIn post said, they come caveat emptor, which may be why many businesses pay higher rates for more guaranteed talent.



Monday, August 1, 2016

Quantitative Analysis: Trump's use of Disgust to Drive Engagement

I went down a bit of a rabbit hole this morning. I was using a sentiment mining algorithm to classify text by emotion (similar to the positive/negative polarity analysis from last week), and noticed the least-used emotion my algorithm was assigning was "disgust."

This isn't odd given the field I work in, but I wondered if there are fields, or sets of text, that would skew disgust differently. Then I remembered a tweet... and some articles I read last winter about Donald Trump's status as a disgust/hygiene candidate. Here's one of those tweets:


And I was off, downloading a few thousand presidential candidate tweets, once again. This time though, I limited the analysis to the Trump disgust theory: 1. is he the candidate that uses the disgust emotion the most, and 2. has using disgust been a successful strategy for him to date? (Side note: I could write a dissertation on emotional language used by candidates.) This is what I found:

  • Analyzing the emotional language of the final three candidates, Trump skews highest in disgust, and it is his strongest skew-over-others emotion. (Sanders is strongest on fear, Clinton on trust.)
  • Trump is successful in generating positive interaction from disgust-tweets, generating an additional 1000 retweets for disgust tweets over non-disgust tweets.

DISGUST THEORY

The basics of this theory are that Trump gains an advantage by signaling disgust (re: some political event) to his followers and presenting himself as the hygienic (purity) candidate. Examples of this behavior?
  • Framing America's problems in terms of infectious disease: problem from outside of US, brought in by immigrants. 
  • Response to Megyn Kelly, purportedly referencing menstruation: "... blood coming out of her wherever."
  • Comments on Hillary Clinton's potty break (I have a two-year-old) during a Democratic debate.
The underlying functional theories of why and how Trump uses disgust to attract voters aren't the place of this blog, but two good starting places are this New Republic article and this article from the New York Times.

METHODOLOGY


SHORT: I used sentiment mining technology which analyzes the texts of individual comments (tweets, in this case) and determines the underlying emotions of each text.

LONG (skip if non-nerdy): I pulled a sample of tweets from the past four months using the Twitter API, and cleaned the data using standard text-cleaning methods (removing punctuation, numbers, stem words, etc.). I applied sentiment mining technology to the same tweets I pulled last week, this time using an emotion classifier rather than a polarity classifier. (The classifier is available in the R package syuzhet; I wrote additional code to summarize the data -- if you want my code, ask.)

The emotion classifier breaks down emotions into the eight primary emotions: anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. For more about those emotions, see Plutchik's Wheel of Emotions, which is where this fairly standard list is derived from.

Then I segmented the tweets by poster (Clinton, Trump, Sanders) and summarized their postings using a double-normalized index of emotional strength. The scores were normalized by:
  1. The candidates' tendency to use emotional words in tweets (Bernie was the most emo, Hillary the least).
  2. Each emotion's propensity over the entire population (Trust and Anticipation were most common). 
By using this two-way distribution normalization, we can determine the individual strength of emotion displayed by each candidate, controlling for each candidate's tendency to be emotional AND the natural distribution of each emotion.
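One way to sketch this pipeline in R, using syuzhet's NRC emotion classifier. The data frame `tweets`, with columns `user` and `text`, is an assumption standing in for my actual data:

```r
library(syuzhet)

# Score each tweet on the eight NRC emotions (first 8 columns of the output)
emo <- get_nrc_sentiment(tweets$text)[, 1:8]

# Aggregate raw emotion counts per candidate
by_user <- aggregate(emo, by = list(user = tweets$user), FUN = sum)
m <- as.matrix(by_user[, -1])

# Normalization 1: within candidate, controlling for overall emotionality
m <- m / rowSums(m)
# Normalization 2: within emotion, controlling for each emotion's base rate
m <- sweep(m, 2, colSums(m), "/")
```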

RESULTS

First, the two-way normalization forced me to calculate a net emotionality score.  This score is effectively the average number of highly-emotional words per tweet by candidate.  Here's what that looks like.  (I also added analysis of my tweets.)



Bernie is most emotional, followed by Trump and Hillary. I am the least emotional in my word choice, which may derive from being a quant... or from not being a presidential candidate.

Second, I created an index to show which candidates skew towards which emotions.  This also allows us to determine which emotion is relatively strongest to each candidate (highest value compared to values of other candidates), which breaks down like this:
  • Bernie: Fear
  • Trump: Disgust
  • Clinton: Trust
A chart of these emotions.



In relative terms, Trump clearly skews towards more disgust-related words than the other candidates, and this seems to be his strongest difference with both Bernie and Hillary. Analysis of the data this way allows for a quantitative validation of the disgust theory put forward earlier this year by other publications.

Does leveraging disgust-emotions help Trump gain engagement and voters?

Third, in two prior posts on polarity, I demonstrated how Trump's use of negative words in tweets (highly negative polarity) drove engagement on twitter (roughly: the more negative Trump goes, the more engagement he gets; not true for Hillary or Bernie). Can we demonstrate the same for disgust emotions?

Here I plotted disgust words per tweet against average retweets (the best overall measure of engagement). Once again there is a positive relationship between Twitter engagement and the emotion of "disgust." In fact, tweets that were algorithmically classified as emotion = "disgust" were retweeted about 1,000 more times (20% more) than Trump's non-disgust tweets (p = 0.025).
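That comparison can be sketched as a simple linear model; the column names here (`retweets`, `disgust`) and the data frame `trump` are illustrative assumptions, not the actual variable names in my code:

```r
# Regress retweet counts on a disgust indicator for Trump's tweets;
# the indicator's coefficient estimates the retweet boost for disgust tweets,
# and its p-value tests whether the boost is significant
fit <- lm(retweets ~ I(disgust > 0), data = trump)
summary(fit)
```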


CONCLUSION

This post provided a novel method to determine relative strength of emotions used by candidates, and also test a theory about a candidate's (Trump) use of a specific emotion to gain an advantage. To reiterate my findings:
  • Analyzing the emotional language of the final three candidates, Trump skews highest in disgust, and it is his strongest skew-over-others emotion. (Sanders is strongest on fear, Clinton on trust.)
  • Trump is successful in generating positive interaction from disgust-tweets, generating an additional 1000 retweets for disgust tweets over non-disgust tweets.

Tuesday, July 19, 2016

Donald Trump Is Getting An Even Bigger Reward for Negativity Now

Last night's Republican National Convention in Cleveland sparked my interest in Donald Trump's persona and campaign... again. It was a weird mix of reality TV stars, war heroes, and grieving mothers (who had a poor understanding of what next of kin means... but that's another story). The night reminded me a bit of my earlier post that demonstrated that Trump gets more positive response by going negative (with the opposite result for Hillary Clinton).

For me, the interest combined with opportunity (work-related reason to pull out my sentiment mining code today).  My findings here are interesting, though consistent with prior work (seriously, go read that prior post for context).  In a way things got "worse."  Here's what I found:
  • Trump continues to be significantly more negative than the rest of the field.  
  • Trump continues to be significantly more bipolar in sentiment (higher sentiment variance) than the rest of the field.
  • Trump is getting an even higher boost from his negative tweets now. Whereas historically he was receiving 230 additional retweets per negative word, he's currently getting over 800 incremental retweets per negative word.

METHODOLOGY

I'm using the same methodology as last time, downloading tweets using the Twitter API. Then I use fairly common stemming and word removal techniques, and throw my sentiment mining algorithm at it (code released in prior post). Also, this time I limited to July tweets, so I cut out all of the tweets used in my prior analysis.

The sentiment score derived is simply a score based on how negative or positive a tweet is using an algorithmic analysis of the words used in that tweet. More negatively scored tweets use more negative words, positive scored tweets use more positive words.
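In syuzhet this scoring is essentially a one-liner; the example sentences below are invented for illustration:

```r
library(syuzhet)

# Each string gets a signed polarity score: negative words pull the score
# down, positive words push it up
get_sentiment(c("a terrible, dishonest disaster",
                "a wonderful, beautiful win"))
```

Under the default syuzhet dictionary, the first string should score negative and the second positive.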

RESULTS

There are many ways to cut this data, but I will run it a few ways for quick analysis. Just quick data, findings, and descriptions.

First, what do the distributions of tweet sentiments for each candidate look like? I left Bernie in because he's... still... apparently in the race. And myself, for full disclosure.



And a visual distribution of the two main candidates:


  • Trump is the only one with negative average tweet sentiment.
  • Trump also has the widest distribution of sentiment (visually, standard deviation), showing some bipolarity in his tweets.
  • Clinton is the most positive tweeter of the group.
  • I'm the most active on twitter of the group (hmmm).
Next we look at a more interesting area: how many retweets each candidate gets by net sentiment score. In essence, this analysis looks at how Twitter users react to each candidate's tweets. The more retweets, the more interaction; the fewer, the less. If we correlate this with the sentiment of tweets, we can determine what kind of sentiment draws the most interaction for each candidate. Here's what Trump looks like (sentiment score on the horizontal axis, retweets per tweet on the vertical):


And Clinton:



A few takeaways:
  • The relationship for Trump is statistically significant, and we can infer that Trump gets 800 incremental retweets (on average) for each negative word he uses.  (more negative == more interactions)
  • The Clinton correlation is statistically insignificant, and almost completely flat.
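The "incremental retweets per negative word" figure is just the slope of a simple least-squares line through points like the ones in the charts above. Here's a minimal sketch of that calculation; the data points are fabricated for illustration, not the real tweet pull.

```python
# Sketch of the retweets-vs-sentiment fit: the ordinary least squares
# slope of retweet counts on net sentiment scores.  The data below is
# made up for illustration only.
from statistics import fmean

def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )

sentiment = [-4, -3, -2, -1, 0, 1, 2]            # net sentiment per tweet
retweets = [4100, 3300, 2600, 1700, 900, 600, 400]  # retweets per tweet

# A negative slope means more negative tweets draw more retweets.
print(f"retweets per unit of sentiment: {ols_slope(sentiment, retweets):.0f}")
```

A full analysis would also report the p-value on that slope (significant for Trump, insignificant for Clinton), which any standard regression routine provides.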

And just for fun, let's look at the most negative tweet for each Tweeter:









CONCLUSION

Final thoughts:
  • Trump continues to be more negative on Twitter than Clinton, and also more bipolar.  This behavior may be somewhat of a reaction to his environment, though, as Trump gets continuous positive feedback for more negative tweets.
  • Clinton's tweets continue to be more neutral, and somewhat less volatile in sentiment.  Her followers also do not react systematically more positively toward more negative (or positive) tweets, as Trump's followers do.

Oh, and I know you just hung around for the word clouds, so here you go.

Donald (Thanks people, Making America Great Again):


Hillary (mainly talks about Donald):



Bernie:

And... me.





Friday, July 1, 2016

Search Origin and "Hidden Identities"

I've been very busy lately, so apologies to my regular readers for the lack of posts.  Here's a post I've had in the works for a while: a bit of a "meta" post on what types of people find this blog from different sources.

BACKGROUND


I've been running this blog for nearly a year and a half now, and in the last few months it has seen quite an improvement in organic search traffic (i.e., people finding this blog via search engines). Somewhat ironically, I don't do a lot of deep-dive data science work on the blog's own stats because, well, honestly, the numbers are small and I have access to much more interesting web analytics data.

Over the past few weeks, however, I noticed something interesting in the blog statistics: the blog gets quite a bit of traffic from a relatively obscure (or at least low-use) search engine: DuckDuckGo.  I first heard of DuckDuckGo in depth from Bruce Schneier in his book Data and Goliath, a book that takes fairly extreme views on cyber security.  I was surprised to see the volume of hits to datasciencenotes.com from this relatively obscure search engine, and I wondered what was going on.

What is DuckDuckGo?  Effectively, it's a search engine that doesn't track users and offers a theoretically more secure view of the internet.  It also gives "user agnostic" search results, which is a topic for a more in-depth post another day.

(Side note, Data and Goliath is worth a read. I deal with Big Data in my everyday working life and don't live in the paranoia of Schneier, but I appreciate his point of view and perspective.)

THE DATA


The first thing I noted when analyzing the Google Analytics data was that Google search traffic and DuckDuckGo search traffic tended to land on different resources on the blog.  So I dug into the Google Analytics data on traffic sources and landing pages; here's a summary of top landing pages by search engine:


On the Google side, we see that most people are directed to my homepage (which is a good thing, btw), followed by two posts on specific issues related to the R statistical engine, a Bernie Sanders post, a Voter Fraud post, and a few more general data science posts.  Generally speaking, traditional search users find this blog for its intent: data science, with a couple of "pop-science" posts mixed in.

The DuckDuckGo results, on the other hand, go exclusively to Bernie Sanders and Election Fraud related posts.  The Election Fraud and Bernie Sanders posts do get quite a bit of traffic overall on this blog, but comparatively little from Google and traditional search channels.
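The landing-pages-by-search-engine summary is essentially a count of (source, landing page) pairs. A minimal sketch of that tabulation; the session rows below are fabricated examples, not the real Google Analytics export, and the page paths are hypothetical.

```python
# Sketch of the landing-page-by-source summary: count landing pages
# per referring search engine.  The (source, page) rows are made-up
# examples standing in for a Google Analytics export.
from collections import Counter, defaultdict

sessions = [
    ("google", "/"), ("google", "/"), ("google", "/r-tips"),
    ("duckduckgo", "/bernie-sanders"), ("duckduckgo", "/election-fraud"),
    ("duckduckgo", "/bernie-sanders"),
]

by_source = defaultdict(Counter)
for source, page in sessions:
    by_source[source][page] += 1

# Top landing pages for each search engine, most-visited first.
for source, pages in by_source.items():
    print(source, pages.most_common(3))
```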

CONCLUSION


There are two main reasons this difference in search engine referrals could occur:

  1. The type of people who use DuckDuckGo could be more interested in Bernie Sanders and election fraud. There's actually quite a bit of face validity to this view: both election fraud truthers and Bernie Sanders voters believe that the system is rigged, and both are somewhat paranoid of systems (voting; economic) that actively monitor, control, and put down "defectors."
  2. The search engine DuckDuckGo could be better optimized toward my Bernie Sanders and Election Fraud posts than Google is.  This one is difficult to falsify because I don't have a list of the search terms used to find this blog on DuckDuckGo.  A quick test of both websites (from a clean browser) gives similar results, and it is known that DuckDuckGo relies on other search engines for results, so it seems somewhat unlikely that optimization is creating a large variation in search results.  However, if DuckDuckGo is self-optimizing, and these posts create more clicks within a more paranoid user group, it's possible that optimization is still in play.
It's possible that a combination of both factors is at play, but on its face it's more likely that the users of DuckDuckGo are more interested in protecting their search and browsing patterns from government intrusion.  That's interesting, but maybe not all that surprising.

What's more interesting to me is that these users are interested in protecting their identities while *searching* the internet, but not while browsing it. What does that mean?  Essentially this: while election fraud/Bernie Sanders users protect their identities by using DuckDuckGo to find this blog, once they reach the website I can generally ascertain a lot about them by looking at logs and IP-related data.