Friday, July 24, 2015

Text Mining: Mining the BlackLivesMatter Hashtag

Last night I asked my wife if she had read yesterday's post on topic mining the #ksleg hashtag, remarking that I had never had topic models converge to news stories quite that well, blah blah blah; nerd nerd nerd.

Honestly, she looked a little unimpressed.  And then she finally said, "Why don't you do that on the #blacklivesmatter hashtag?"

She was right.  I had mined a policy-wonk-centered, Kansas-specific hashtag.  Why?  I should do something with more volume, something that people actually care about.  And so I went to downloading.


Just a few nerd notes on mining larger hashtags in R:

  • The API limits search requests, and I hit that limit a few times; you just have to wait it out.
  • Changing your search is as simple as changing the search term and the date constraints in the code.
  • This hashtag gets a lot of volume, so random subsampling is necessary.  I used two point-in-time samples, one from the 17th and one from the 22nd.
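A minimal sketch of the per-day random subsampling idea, written in Python rather than the R used for this post. The tweet records, the `date` and `text` field names, the `subsample_by_day` helper, and the sample size of 100 are all assumptions for illustration, not the actual download or the sizes used here.

```python
import random
from collections import defaultdict


def subsample_by_day(tweets, per_day, seed=42):
    """Draw up to `per_day` tweets at random, without replacement, from each day."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_day = defaultdict(list)
    for tweet in tweets:
        by_day[tweet["date"]].append(tweet)

    sample = {}
    for day, day_tweets in sorted(by_day.items()):
        k = min(per_day, len(day_tweets))  # never ask for more than exists
        sample[day] = rng.sample(day_tweets, k)
    return sample


# Hypothetical stand-in data: placeholder tweets for the two sampled days.
tweets = (
    [{"date": "2015-07-17", "text": f"tweet {i}"} for i in range(500)]
    + [{"date": "2015-07-22", "text": f"tweet {i}"} for i in range(800)]
)

sample = subsample_by_day(tweets, per_day=100)
print({day: len(rows) for day, rows in sample.items()})
# {'2015-07-17': 100, '2015-07-22': 100}
```

Sampling each day separately keeps the two days the same size, so a busier day does not dominate the combined model.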


This data ended up being fascinating, so I ran quite a few models.  My sample included two days, 2015-07-17 and 2015-07-22.  I created an individual model for each day, and also looked at the general high-frequency terms.  All the outputs for each day are in the appendix below, but here are some general patterns I discovered:
  • Both days' tweets were dominated by the case of Sandra Bland, an African American woman who died in police custody.
  • 2015-07-17 was the one-year anniversary of Eric Garner's death (more about that here).  I didn't know this before downloading the data, but it was apparent that this was a major topic of the day.
  • There was a clear change in the Sandra Bland related topics between the days:
    • On the 17th, the terms were more factual about the case or events (jail, mystery, death, vigil).
    • On the 22nd, the terms had grown angrier: the "f" word was now a top word, along with murder, demand, investigation, and kill.
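For anyone curious how the per-day high-frequency terms can be computed, here is a rough Python sketch (the post's own analysis was done in R). The example tweets are invented placeholders built from a few of the terms listed above, and the tiny stopword list and the `top_terms` helper are assumptions for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "for", "to", "and", "is", "rt"}


def top_terms(texts, n=5):
    """Return the n most frequent terms across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and drop URLs before tokenizing.
        cleaned = re.sub(r"https?://\S+", " ", text.lower())
        tokens = re.findall(r"[a-z']+", cleaned)
        # Skip stopwords and very short tokens.
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)


# Invented placeholder tweets, NOT the downloaded data.
by_day = {
    "2015-07-17": [
        "Vigil tonight for #SandraBland",
        "Mystery around jail death #SandraBland",
        "Vigil at the jail",
    ],
    "2015-07-22": [
        "Demand an investigation #SandraBland",
        "We demand justice",
        "Investigation now",
    ],
}

for day, texts in by_day.items():
    print(day, top_terms(texts, n=3))
```

Running the same counter on each day separately is what makes the shift in vocabulary between the two dates visible.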
The two per-day models and their descriptive statistics are in the appendix below.  The combined model is fairly interesting, though.  Here's what the algorithm found:

Here's my analysis of each topic:

  1. This is a Sandra Bland topic that centers around finding justice (DOJ, Justice, Investigation).
  2. This topic also has Sandra Bland undertones, but focuses on the racism-specific notions of the case.
  3. This topic focuses on the African American community using the media and Twitter to fight racism.  (read: Black Twitter)
  4. This is an Eric Garner one-year anniversary topic.
  5. This is another Sandra Bland related topic, largely around white cops killing black people, and that "needing to stop."
The topics here are fairly tight and obvious, but can we demonstrate how they changed over time?  This chart shows tweets allocated to each topic for each day:

Effectively, the Eric Garner topic went away and was replaced with more calls for justice for Sandra Bland.  The other topics stayed the same for the most part, with a slight increase in calls to end white police killing black people.
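The by-day comparison boils down to a cross-tabulation of topic against date. Here is a hedged Python sketch of that step (again, not the R code behind this post). It assumes each tweet is allocated to its single most probable topic, which may not match how the chart was built, and the probability rows below are made up purely to show the mechanics.

```python
from collections import Counter, defaultdict


def dominant_topic(probs):
    """Return the 1-based index of the most probable topic for one tweet."""
    return max(range(len(probs)), key=lambda i: probs[i]) + 1


def topic_shares_by_day(rows):
    """rows: iterable of (day, topic_probabilities). Returns {day: {topic: share}}."""
    counts = defaultdict(Counter)
    for day, probs in rows:
        counts[day][dominant_topic(probs)] += 1

    shares = {}
    for day, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[day] = {t: c / total for t, c in sorted(topic_counts.items())}
    return shares


# Made-up topic probabilities for a five-topic model, NOT the post's results.
rows = [
    ("2015-07-17", [0.1, 0.1, 0.1, 0.6, 0.1]),
    ("2015-07-17", [0.1, 0.1, 0.1, 0.6, 0.1]),
    ("2015-07-17", [0.2, 0.1, 0.1, 0.5, 0.1]),
    ("2015-07-17", [0.6, 0.1, 0.1, 0.1, 0.1]),
    ("2015-07-22", [0.6, 0.1, 0.1, 0.1, 0.1]),
    ("2015-07-22", [0.5, 0.2, 0.1, 0.1, 0.1]),
    ("2015-07-22", [0.7, 0.1, 0.1, 0.0, 0.1]),
    ("2015-07-22", [0.1, 0.1, 0.1, 0.1, 0.6]),
]

shares = topic_shares_by_day(rows)
for day, by_topic in sorted(shares.items()):
    print(day, by_topic)
```

Using shares rather than raw counts keeps the days comparable even when the samples differ in size.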

Let's be honest here though, we all just want to look at word clouds.  First a word cloud of the 17th (prominent Garner references):
Now a word cloud of the 22nd (Garner gone, now "justice" references):


JULY 22nd

Frequent terms from July 22nd:

Topics from July 22nd:

JULY 17th

Frequent terms from July 17th:

Topics from July 17th:
