Saturday, January 9, 2016

ALA Midwinter Day One: Twitter Mining

As I've posted on this blog before, occasionally I tag along with my librarian wife to her professional conferences.  These conferences give me an opportunity to see my wife's professional world, as well as a chance to relax and unwind a bit from the normal stressful life of an analyst.  It's a lot of fun, and librarians are a surprisingly fun group to hang out with.

This week I am at the ALA Midwinter conference in Boston, and having a great time.  One interesting way to experience the conference is through the Twitter hashtag #alamw16.  This morning I realized it might be fun to point my text mining algorithms at tweets about the conference, to somewhat scientifically figure out what the chatter is about.

I'll jump right in: for this analysis I downloaded tweets from Thursday night through Saturday morning using the conference hashtag.  First I created something easy, a wordcloud, which shows the most common words and their relative frequencies.  Unsurprisingly, the most frequent word during the days people were travelling was the location of the conference: Boston.  The words "librarian" and "book" also rank highly, along with some time- and place-based terms like "tomorrow" and "exhibit."
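
The counting behind a wordcloud is just term frequency over the cleaned tweets.  Here's a minimal Python sketch of that step; the tweets, tokenizer, and stopword list are all made-up stand-ins for the real #alamw16 data, which I can't reproduce here:

```python
import re
from collections import Counter

# Hypothetical sample tweets standing in for the downloaded #alamw16 data.
tweets = [
    "Excited to be in Boston for #alamw16!",
    "Stop by our booth at #alamw16 in Boston",
    "Any librarian recommendations for a good book? #alamw16",
]

# A tiny illustrative stopword list; real analyses use a much larger one.
STOPWORDS = {"to", "be", "in", "for", "by", "our", "at", "a", "any", "the"}

def word_frequencies(docs):
    """Lowercase, tokenize, drop stopwords, and count term frequencies."""
    counts = Counter()
    for doc in docs:
        for token in re.findall(r"#?\w+", doc.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts

freqs = word_frequencies(tweets)
print(freqs.most_common(3))  # the hashtag and "boston" top the list
```

A wordcloud library then just maps these counts onto font sizes, so the frequency table is the part that matters analytically.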



Then I moved on to topic modelling, which identifies the general topics people are talking about in the data.

If you haven't read this blog before, here's a primer on how topic models work: the algorithm looks at a set of documents (here, tweets) and finds terms that are often used in conjunction with each other; from these co-occurring terms it derives topics that would otherwise be difficult to observe directly.  I use a variant called Correlated Topic Models, which adds a covariance calculation on top of the model; in my experience this helps with short documents such as tweets.

That was a longer explanation than I intended, and most people just want to see the results.  Here is a list of the topics and the most disproportionately represented terms in each:


What do these topics represent?  I dug into the data to describe the observed topics, and here's a summary of what people are really talking about on the #alamw16 hashtag:
  1. Topic 1: A topic of vendors asking attendees to stop by and see them while in Boston.
  2. Topic 2: A topic about the work of the ALA Executive Board, and generally about librarianship.
  3. Topic 3: A topic of attendees talking about their excitement towards the convention.
  4. Topic 4: A topic of attendees and vendors talking about how great Boston looks/is as a host for the convention.
  5. Topic 5: A topic of librarians and vendors talking about books they are excited about, as well as some other new products.
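
"Disproportionately represented" can be measured several ways; my exact metric isn't spelled out above, but a common choice is lift: a term's weight within a topic divided by its overall frequency in the corpus.  A small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical terms and topic-term distributions (each row sums to 1);
# these numbers are invented purely to illustrate the lift calculation.
terms = np.array(["booth", "boston", "book", "excited"])
topic_term = np.array([
    [0.50, 0.20, 0.10, 0.20],   # made-up topic 1 term distribution
    [0.05, 0.25, 0.45, 0.25],   # made-up topic 2 term distribution
])
corpus_freq = np.array([0.20, 0.30, 0.30, 0.20])  # overall term frequencies

# Lift > 1 means the term is over-represented in that topic relative to
# the corpus as a whole; broadcasting divides every row by corpus_freq.
lift = topic_term / corpus_freq
for k, row in enumerate(lift, start=1):
    top = terms[row.argsort()[::-1][:2]]
    print(f"Topic {k} standout terms: {list(top)}")
```

This is why "booth" can headline a topic even when "boston" is more common overall: what matters is how concentrated a term is in the topic, not its raw count.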


For those who really like wordclouds, here's a wordcloud for each of the topics:

Topic 1 (this one rendered small, largely because of the importance of the word "booth"):

Topic 2
Topic 3
Topic 4
Topic 5
