Wednesday, July 22, 2015

Text Mining Part Three: A Better Use Case

Mining my own tweets yesterday was interesting, but it wasn't the most illustrative example of good text analysis. There are a few reasons for that, but it's mainly about the weird way that I use Twitter. So what happens with a "tighter" and more topic-friendly list of tweets? Here's a quick look using some Kansas government tweets.

DATA SET

For this dataset I used a more "topical" set of tweets, specifically those with the hashtag "#ksleg", meaning that they have something to do with Kansas government. I've blogged about these kinds of topics before regarding taxes, elections, and education.

To pull this data, you need to authenticate from R into Twitter using the setup_twitter_oauth() function. Then retrieving the tweets is as simple as:

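library(twitteR)
# Search for English-language #ksleg tweets posted since the start of 2015
tweets <- searchTwitter("#ksleg", n = 10000000, lang = "en", since = "2015-01-01")
# Drop retweets (including manual "RT" and "MT" ones), then convert to a data frame
tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)
tweets <- twListToDF(tweets)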
*Quick note: this actually only returns about a week of data, due to a limitation in the Twitter search API.

Following this step, I went through all the data cleaning steps implemented in my post from last week.

RESULTS

So what does data under this hashtag look like? Common words are Kansas, KSED (hashtag used for Kansas educators), teacher, Brownback, and State.  Since everyone else in the world likes wordclouds, here you go:
Another way to view this is findFreqTerms(matx.new,20), which just gives us a list of words used at least 20 times in the tweets.
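If you want to see what that kind of frequency filter is doing without the tm package, here's a minimal base R sketch. It is not the code I ran: the `tweets_text` vector is a made-up stand-in for the cleaned tweets, and the cutoff is lowered from 20 to 2 only because the toy sample is tiny.

```r
# Hypothetical stand-in for the cleaned tweet text; not the real #ksleg data.
tweets_text <- c(
  "kansas teachers rally at the statehouse",
  "brownback signs kansas budget",
  "kansas schools face budget cuts"
)

# Split each tweet on whitespace and pool all the words together.
words <- unlist(strsplit(tolower(tweets_text), "\\s+"))

# Count how often each word appears across all tweets.
term_counts <- table(words)

# Keep words at or above a minimum count (the post uses 20;
# 2 is used here only because the sample is so small).
min_count <- 2
frequent_terms <- sort(names(term_counts[term_counts >= min_count]))
print(frequent_terms)
```

On this toy sample only "budget" and "kansas" clear the cutoff, which is the same idea as the list above, just at a much smaller scale.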




But can we model this data into topics? Yes. And much better than my personal tweet data. Using the same procedures as yesterday, here are the topics created:


You may not be familiar with Kansas politics, but the fit here is much better than yesterday, and here are the "realities" behind each topic:

Topic 9: this is a low-frequency topic; it relates to the impact of education policies on kids.

Another interesting way to look at this data is by the percent of tweets in each topic by day. Interestingly, on the day that the story behind topic 1 broke, over 40% of #ksleg tweets were on that subject. Topic 2 broke on the 17th, corresponding to a spike in related tweets. The chart below demonstrates this:
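If you want to compute daily topic shares like these yourself, here's a minimal base R sketch. The `scored` data frame and its `day` and `topic` columns are hypothetical names with made-up values, assuming each tweet has already been assigned a single topic; it is not the actual #ksleg output.

```r
# Hypothetical scored tweets: one row per tweet, with the day it was
# posted and the topic it was assigned to. Not the real #ksleg data.
scored <- data.frame(
  day   = as.Date(c("2015-07-16", "2015-07-16", "2015-07-16",
                    "2015-07-17", "2015-07-17")),
  topic = c(1, 1, 2, 2, 2)
)

# Cross-tabulate tweets by day (rows) and topic (columns).
counts <- table(scored$day, scored$topic)

# Convert counts to percentages within each day (margin = 1 means rows).
pct_by_day <- round(100 * prop.table(counts, margin = 1), 1)
print(pct_by_day)
```

Each row sums to 100, so a story breaking on a given day shows up as one topic's column jumping for that row.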




CONCLUSION

Overall, my topic models work much better on data that is more topic-focused, like tweets about government.  If you're wondering about my tweeting in the last week, I had two tweets with the hashtag that were scored, one in topic 1, the other in topic 5.  

Have I mentioned that you can create word clouds within each topic?  Because I know that's all people really want.  Wordclouds.  To finish, some topical wordclouds:


Topic 1:

Topic 4:

Topic 6:

Topic 7:

