Friday, July 17, 2015

Text Mining First Step: Topic Modeling Your Tweets Pt. 1

Many businesses are seeing value from mining the vast amounts of text data they accumulate, which means text mining skills are in high demand for analysts.  For analysts new to mining text, the analysis can seem intimidating.  From a primal-analyst perspective it goes something like this:
Analyzing words?  I don't do words!  I do numbers!

So I make fun of myself.  But on a serious note, the amount of pre-processing, data cleaning, and understanding of "language" concepts can make textual analysis a scary prospect.  Whenever tackling a new type of analysis, I find it best to start with small, familiar datasets.  Luckily, Twitter data both works well for text mining and is easy to get started with.  In this post I'll take you through the basic steps of importing and cleaning twitter data; tomorrow I will show the results of some initial topic models.

ACCESSING TWITTER DATA

The twitteR library in R provides a nice direct way to connect to twitter and import data.  Unfortunately this requires a twitter developer account and some authentication shenanigans, so it's not the easiest way to get started.  Instead, if you're just looking for a quick intro, I suggest going into the settings of your twitter account, and requesting your archive as a CSV.
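
If you do want to go the twitteR route, the setup looks roughly like the sketch below.  The keys, tokens, and handle are placeholders you would swap for the values from your own developer app.

  # The twitteR route: requires keys and tokens from a twitter developer account
  library(twitteR)

  # Placeholders: substitute the values from your developer app
  setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                      consumer_secret = "YOUR_CONSUMER_SECRET",
                      access_token    = "YOUR_ACCESS_TOKEN",
                      access_secret   = "YOUR_ACCESS_SECRET")

  # Pull recent tweets from a timeline and convert them to a data frame
  tweets_raw <- userTimeline("your_handle", n = 3200)
  tweets_df  <- twListToDF(tweets_raw)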

INITIAL CODE


OK, so now we have the data.  What do we need to do to get started?  Here are the initial steps to prepare the data:


  1. Load your packages (I use tm, RTextTools, topicmodels, and wordcloud, in addition to my normal load).
  2. Read in the data.  Simply reading in the CSV provided by twitter is straightforward in R.
  3. Create a corpus from the data.  Per the tm package documentation, corpora are "collections of documents containing (natural language) text."
  4. Clean the text.  There are a few steps here (a sketch in R appears after this list); most of them are fairly obvious in what they do, but a few deserve comment:
    1. "Stopwords" are words that effectively do not matter.  Think of articles ("the", "a","an") which are contained in a document, but don't add a lot of value to meaning.
    2. That second word-removal step.  Yeah.  About that.  I found that I have certain vocabulary quirks: words I use often that don't add a lot of value.  Text mining helped me realize this, but it's annoying, and I promise to stop using the word "just" in all my tweets.
    3. Stemming: the process of reducing words to their root.  That way when I say "run" and "running", the words are interpreted the same way by our predictive algorithms.
  5. Create a Document Term Matrix.  This builds a matrix of words and how often each one shows up in each document (tweet), which is what makes the data numerical so we can feed it to an algorithm.
  6. Remove null rows.  Step 5 sometimes creates rows where every term count is zero (tweets whose words were all removed during cleaning); those need to be dropped before modeling.
  7. Descriptive statistics: What words are most common (how I learned about my use of the word "just")?  Getting a basic view of the data helps identify additional cleaning and gives you the lay of the data, just like in any other dataset.
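
Putting those steps together, a minimal sketch in R might look something like this (the file name tweets.csv and the exact quirk-word list are placeholders; adjust them to your own archive and vocabulary):

  # 1. Load packages (tm does the cleaning; wordcloud comes in later for the plot)
  library(tm)

  # 2. Read in the archive CSV from twitter (file name is a placeholder)
  tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)

  # 3. Create a corpus from the tweet text column
  corpus <- Corpus(VectorSource(tweets$text))

  # 4. Clean the text: lower case, strip punctuation and numbers,
  #    drop standard stopwords, drop personal filler words, and stem
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, c("just"))   # my vocabulary quirks; add your own
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)

  # 5. Create the Document Term Matrix (rows = tweets, columns = stemmed words)
  dtm <- DocumentTermMatrix(corpus)

  # 6. Remove rows where every term count is zero
  row_totals <- apply(dtm, 1, sum)
  dtm <- dtm[row_totals > 0, ]

  # 7. Descriptive statistics: which words show up most often?
  freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  head(freq, 20)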





SUMMARY

So today's post is a fairly straightforward how-to on cleaning data.  Tomorrow we'll get into actually modeling text data.

Except, I don't feel that today's post is very meaningful.  I mean, where's the content?  See the last step in my code.  Here's the wordcloud it created (I'm not a fan of wordclouds, but whatever).  
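
Generating one like it takes only a couple of lines on top of the freq vector from the sketch above, something like:

  # Build the wordcloud from the term frequencies computed earlier
  library(wordcloud)
  library(RColorBrewer)
  set.seed(1234)   # makes the layout reproducible
  wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
            colors = brewer.pal(8, "Dark2"))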

The word cloud makes a lot of sense.  At the center is "kwhit14", which is the twitter handle for my wife, who I talk to regularly on social media, sometimes when we're in the same room.  The next two words are "work" (I tweet about work stuff quite a bit) and "think" (on those psychological tests I'm told that I am a "thinker" rather than a "feeler", meaning that I would express what I think rather than what I feel).  The words descend from there: a lot of common words, references to people I talk to often online, and terms I use often.
