Wednesday, July 22, 2015

Text Mining Part 2: Fitting Topic Models

I finally have time to write my second post on text mining.  My first post was late last week, and dealt with the rather boring subject of importing and cleaning data.  This one gets into some actual analysis.

THE BASICS

So what types of models will I be creating here?  Topic models.  Topic models are a way to analyze the terms (words) used in documents (for this purpose, tweets) and group the documents together into topics.  In simple terms, the algorithms find which words are often used together and cluster those word commonalities into topics.  It's a way to infer unobserved (latent) groups, the topics, from the observed distribution of words.  From a predictive standpoint, we can then apply those topics back to the original documents, and group our tweets by topic.

Before starting on a model, a good way to start thinking about this is to look at which words associate most highly with which other words.  In R, the tm package provides this functionality through the findAssocs() function.  I ran it for two things I tweet about often: running and taxes.
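For anyone following along, here is a minimal sketch of that kind of call.  The tweets are a made-up stand-in for my cleaned data from Part 1, and the 0.5 correlation cutoff is an assumption chosen for illustration, not the value I used.

```r
library(tm)

# Hypothetical stand-in for the cleaned tweets from Part 1
tweets <- c(
  "long trail run this morning",
  "ten mile run on the trail",
  "long run then a mile cooldown",
  "sales tax and income tax overhaul",
  "income tax deduction debate",
  "sales tax growth in kansas"
)

# Build a corpus and a term-document matrix of word counts
corpus <- VCorpus(VectorSource(tweets))
tdm <- TermDocumentMatrix(corpus)

# Terms whose correlation with "run" across documents is at least 0.5
# (the cutoff is an assumption for this toy example)
run_assocs <- findAssocs(tdm, "run", corlimit = 0.5)
print(run_assocs)
```

findAssocs() returns a list with one named vector of correlations per search term, so run_assocs$run holds the words most associated with "run".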

The results make sense here.  The top three words associated with "run" are long, mile, and trail.  That means that tweets containing the word "run" are more likely to contain those words than other tweets are.

The top words associated with "tax" are sale, income, overhaul, growth, outstrip, deduction, and elasticity.  If you've followed my blogging or tweeting on Kansas politics in recent months, this all makes great sense.  But will we be able to see these groups in the final topic models?

THE MODELS

There are two basic types of topic models that I will use in this analysis: Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM).  Both work in similar ways: each assumes a prior distribution over a document's topic proportions (Dirichlet for LDA, logistic normal for CTM) and uses it to model the underlying topics.  The main difference is that the Correlated Topic Model also estimates a covariance structure among the topics, which allows topics to correlate with one another and generally makes the model fit better.

You can email me or comment below if you want to understand the underlying math behind these types of models, but this is more of a how-to explainer.

How do we get started in R?  The R code is straightforward.  First build the models using the LDA and CTM functions (from the topicmodels package), then get a list of high-probability words for either model using the terms function.  There are more parameters, obviously, but I want to start simple.
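A minimal sketch of those two steps follows.  The toy tweets, the choice of k = 2 topics, and the seed are all assumptions for illustration; a real run needs far more documents and a larger k.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for the cleaned tweets from Part 1
tweets <- c(
  "long trail run this morning",
  "ten mile run on the trail",
  "long run then a mile cooldown",
  "easy trail run before work",
  "sales tax and income tax overhaul",
  "income tax deduction debate",
  "sales tax growth in kansas",
  "kansas budget and tax policy"
)

# Topic models take a document-term matrix of raw counts
dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))

# Number of topics; 2 is an assumption that suits this tiny example
k <- 2

# Fit both models; the seed only makes the sketch reproducible
lda_model <- LDA(dtm, k = k, control = list(seed = 2015))
ctm_model <- CTM(dtm, k = k, control = list(seed = 2015))

# Five highest-probability terms per topic, one column per topic
lda_terms <- terms(lda_model, 5)
ctm_terms <- terms(ctm_model, 5)
print(lda_terms)
print(ctm_terms)
```

Both functions fail on documents with no terms left after cleaning, so empty rows need to be dropped from the matrix before fitting.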



How good are the models provided here?  They're OK.  Unfortunately my tweets are fairly sparse topic-wise, but I expected to see two major topics pop up: running and Kansas legislative policy.  And I did, essentially, and on its face the CTM appears to be the better fit to the data.  Here are the high-probability term lists:


The CTM found both of my original topics, contained within topic 3 (running) and topic 8 (Kansas policy).  I'll post later on how to fit these models (potentially with better data), and how to evaluate how well they divide topics, but this is a good start.

One last thing for now.  Can we pull a list of tweets on a particular topic using the models derived above?  Yes.  Here are the top two tweets associated with my topic 8, which generally associates with policy.
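One way to do that lookup is sketched below, using the per-document topic probabilities that posterior() returns.  The toy tweets, k = 2, and the topic number are assumptions for illustration; on my real model the topic of interest was 8.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for the cleaned tweets from Part 1
tweets <- c(
  "long trail run this morning",
  "ten mile run on the trail",
  "long run then a mile cooldown",
  "easy trail run before work",
  "sales tax and income tax overhaul",
  "income tax deduction debate",
  "sales tax growth in kansas",
  "kansas budget and tax policy"
)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))
ctm_model <- CTM(dtm, k = 2, control = list(seed = 2015))

# Documents-by-topics matrix of topic probabilities for each tweet
doc_topics <- posterior(ctm_model)$topics

# Topic to look up; 1 is an arbitrary choice for this toy model
topic_id <- 1

# Row positions of the two tweets with the highest weight on that topic
top_rows <- order(doc_topics[, topic_id], decreasing = TRUE)[1:2]
top_tweets <- tweets[top_rows]
print(top_tweets)
```

The rows of the matrix are in the same order as the original tweets, which is what lets the row positions index straight back into the text.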






