Wednesday, July 22, 2015

Text Mining Part Three: A Better Use Case

Mining my own tweets yesterday was interesting, but it wasn't the most illustrative example of good text analysis.  There are a few reasons for that, but it's mainly about the weird way that I use Twitter.  So what happens with a "tighter" and more topic-friendly list of tweets?  Here's a quick look using some Kansas government tweets.


For this dataset I used a more "topical" set of tweets, specifically those with the hashtag "#ksleg", meaning that they have something to do with Kansas government.  I've blogged about these kinds of topics before regarding taxes, elections, and education.  

To pull this data, you need to authenticate from R into Twitter using the setup_twitter_oauth() function.  Then retrieving the tweets is as simple as:

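library(twitteR)
# Search English-language #ksleg tweets, drop retweets, convert to a data frame
tweets <- searchTwitter("#ksleg", n = 10000000, lang = "en", since = "2015-01-01")
tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)
tweets <- twListToDF(tweets)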
*Quick note: this actually only returns about a week of data, due to a limitation in the Twitter search API.

Following this step, I went through all the data cleaning steps implemented in my post from last week.


So what does data under this hashtag look like? Common words are Kansas, KSED (hashtag used for Kansas educators), teacher, Brownback, and State.  Since everyone else in the world likes wordclouds, here you go:
Another way to view this is findFreqTerms(,20), which just gives us a list of words used at least 20 times in the tweets.
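
To make the frequency cutoff concrete, here is a minimal base R sketch that mimics it on a tiny made-up document-term matrix. The matrix, its counts, and the cutoff of 3 are assumptions for illustration only; on real data, findFreqTerms() from tm does this job on a DocumentTermMatrix.

```r
# Toy document-term matrix: rows are tweets, columns are terms (made-up counts)
dtm <- matrix(
  c(1, 0, 2,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("tweet", 1:4), c("kansas", "teacher", "ksed"))
)

# Total number of times each term appears across all tweets
term_totals <- colSums(dtm)

# Keep only the terms that appear at least `min_count` times
min_count <- 3
frequent_terms <- names(term_totals)[term_totals >= min_count]
print(frequent_terms)
```

Raising min_count shortens the list, which is a quick way to check whether the common words match what you expect before modeling.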

But can we model this data into topics?  Yes.  And much better than my personal tweet data.  Same procedures as yesterday, here are the topics created:

You may not be familiar with Kansas politics, but the fit here is much better than yesterday, and here are the "realities" behind each topic: 

Topic 9: this is a low-frequency topic; relates to impact of education policies on kids.

Another interesting way to look at this data is by the percent of tweets in each topic by day.  Interestingly, on the day that the story behind topic 1 broke, over 40% of #ksleg tweets were on that subject.  Topic two broke on the 17th, corresponding to a spike in related tweets.  The chart below demonstrates this:
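
A rough sketch of how the percent-by-day breakdown can be computed in base R, assuming each tweet already has a posting day and an assigned topic. The data frame below is invented for illustration and is not the #ksleg data.

```r
# Hypothetical scored tweets: the day each was posted and its assigned topic
scored <- data.frame(
  day   = c("07-15", "07-15", "07-15", "07-16", "07-16",
            "07-17", "07-17", "07-17", "07-17"),
  topic = c(1, 1, 3, 1, 2, 2, 2, 2, 3)
)

# Count tweets per day and topic, then convert each row (day) to percentages
counts <- table(scored$day, scored$topic)
pct_by_day <- round(100 * prop.table(counts, margin = 1), 1)
print(pct_by_day)
```

Using margin = 1 makes each day sum to 100%, so a spike in one topic shows up as a large share of that day's row.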


Overall, my topic models work much better on data that is more topic-focused, like tweets about government.  If you're wondering about my tweeting in the last week, I had two tweets with the hashtag that were scored, one in topic 1, the other in topic 5.  

Have I mentioned that you can create word clouds within each topic?  Because I know that's all people really want.  Wordclouds.  To finish, some topical wordclouds:

Topic 1: 

 Topic 4: 

Topic 6: 
 Topic 7: 

Text Mining Part 2: Fitting Topic Models

Finally have time to write my second post on text mining.  My first post was late last week, and dealt with the rather boring subject of importing and cleaning data.  This one gets into some actual analysis.


So what types of models will I be creating here?  Topic models.  Topic models are a way to analyze the terms (words) used in documents (for this purpose, tweets) and group the documents together into topics.  In simple terms, the algorithms used find which words are often used together and cluster those word commonalities into topics.  It's a way to measure unobserved groups (topics) in underlying distributions.  From a predictive standpoint, we can then apply those topics back to the original documents, and group our tweets by topic.

Before starting on a model, a good way to start thinking about this is to look at which words associate most highly with which other words. R provides this functionality through the findAssocs() function.  I ran it for two things I tweet about often: running and taxes.  

The results make sense here.  The top three words associated with "run" are long, mile, and trail.  That means that documents with the word run are more likely to contain those words than other tweets.
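
The idea behind word associations can be sketched in base R as a correlation between term columns of a document-term matrix, which is my understanding of what findAssocs() reports. The matrix below is made up for illustration; it is not my tweet data.

```r
# Toy document-term matrix: 1 means the term appears in that tweet (made-up data)
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 0, 1,
    1, 0, 1, 0,
    0, 0, 0, 1,
    0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("run", "mile", "trail", "tax"))
)

# Correlation of every term with "run" across documents
assoc <- cor(dtm)["run", ]

# Drop the term itself and sort from most to least associated
assoc <- sort(assoc[names(assoc) != "run"], decreasing = TRUE)
print(round(assoc, 2))
```

A high positive value means the two words tend to show up in the same tweets; a negative value means they tend not to.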

The top words associated with tax are sale, income, overhaul, growth, outstrip, deduction, and elasticity.  If you've followed my blogging or tweeting on Kansas politics in recent months, this all makes great sense.  But will we be able to see these groups in the final topic models?


There are two basic types of topic models that I will use in this analysis: Latent Dirichlet Allocation and Correlated Topic Models. Both models work in similar ways: each assumes a prior distribution over topic proportions (Dirichlet for LDA, logistic normal for CTM) and uses that distribution to model the underlying topics.  The main divergence is that the Correlated Topic Model also estimates a covariance structure between topics, which adds the bonus of correlation between topics and generally makes the model fit better.

You can email me or comment below if you want to understand the underlying math behind these types of models, but this is more of a how-to explainer.

How do we get started in R?  The R code is straightforward, as seen below.  First build the models using the LDA and CTM commands, then you can get a list of high probability words for either model using the "terms" command.  There are more parameters obviously, but I want to start simple.
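
For readers without the original code in front of them, here is a minimal sketch of the steps just described, assuming the topicmodels package and its LDA(), CTM(), and terms() functions. The counts, term names, k = 2, and the seed are all made up for illustration; this is not the model fit to my tweets.

```r
# Sketch only: the body runs when the topicmodels package is installed
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(topicmodels)
  set.seed(42)

  # Made-up term counts: 20 documents by 6 terms (adding 1 avoids empty rows)
  counts <- matrix(
    rpois(20 * 6, lambda = 2) + 1L,
    nrow = 20,
    dimnames = list(paste0("doc", 1:20),
                    c("run", "mile", "trail", "tax", "budget", "school"))
  )
  # Convert to the sparse format used by topicmodels
  dtm <- slam::as.simple_triplet_matrix(counts)

  # Fit both model types with the same (arbitrary) number of topics
  k <- 2
  lda_fit <- LDA(dtm, k = k, control = list(seed = 42))
  ctm_fit <- CTM(dtm, k = k, control = list(seed = 42))

  # Top 3 highest-probability terms per topic for each model
  lda_terms <- terms(lda_fit, 3)
  ctm_terms <- terms(ctm_fit, 3)
  print(lda_terms)
  print(ctm_terms)
}
```

With random counts like these the topics are meaningless; the point is only the shape of the calls. On real data, the document-term matrix from the cleaning steps takes the place of the toy matrix.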

How good are the models provided here?  They're ok.  Unfortunately my tweets are fairly sparse topic-wise, but I expected to see two major topics pop up: running and Kansas legislative policy.  And I did, essentially, and it appears on its face that the CTM is the better fit to the data.  Here are the high probability terms list:

The CTM model found both of my original topics, contained within topic 3 (running) and topic 8 (Kansas policy). I'll post later on how to fit these models (potentially with better data), and how to evaluate how good they are in dividing topics, but this is a good start.

One last thing for now.  Can we pull a list of tweets on a particular topic using the models derived above?  Yes.  Here are the top two tweets associated with my topic 8, which generally associates with policy.
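
One way to pull the top tweets for a topic, sketched in base R under the assumption that you have a matrix of per-tweet topic probabilities (with topicmodels, posterior(model)$topics should give a matrix of this shape). The tweets, topic names, and probabilities below are invented for illustration.

```r
# Hypothetical tweets and their topic probabilities (rows: tweets, columns: topics)
tweets <- c(
  "Long run on the trail today",
  "Sales tax receipts fell again",
  "New income tax overhaul proposed",
  "Coffee first"
)
doc_topics <- matrix(
  c(0.90, 0.10,
    0.20, 0.80,
    0.05, 0.95,
    0.50, 0.50),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("running", "policy"))
)

# Rank tweets by their probability of belonging to the "policy" topic
ranked <- order(doc_topics[, "policy"], decreasing = TRUE)

# Keep the top two
top_n <- 2
top_tweets <- tweets[ranked[seq_len(top_n)]]
print(top_tweets)
```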

Friday, July 17, 2015

Text Mining First Step: Topic Modeling Your Tweets Pt. 1

Many businesses are seeing value from mining the vast amounts of text data they accumulate, meaning text mining skills are in high demand for analysts.  For analysts new to mining text, the analysis can seem intimidating.  From a primal-analyst perspective it goes something like this:
Analyzing words?  I don't do words!  I do numbers!

So I make fun of myself.  But on a serious note, the amount of pre-processing, data cleaning, and understanding of "language" concepts can make textual analysis a scary prospect.  Whenever tackling a new type of analysis, I find it best to start with small, familiar datasets.  Luckily, Twitter data both works well for text mining, and is easy to get started with.  In this post I'll take you through the basic steps of importing and cleaning Twitter data; tomorrow I will show the results of some initial topic models.


The twitteR library in R provides a nice direct way to connect to Twitter and import data.  Unfortunately this requires a Twitter developer account and some authentication shenanigans, so it's not the easiest way to get started.  Instead, if you're just looking for a quick intro, I suggest going into the settings of your Twitter account, and requesting your archive as a CSV.


OK, so now we have the data.  What do we need to do to get started?  Here are the initial steps to prepare the data:

  1. Read in your packages (I use tm, RTextTools, topicmodels, and wordcloud, in addition to my normal load).
  2. Read in the data.  Simply reading in the CSV provided by Twitter is straightforward in R.
  3. Create a corpus from the data.  Per R documentation, corpora are "collections of documents containing (natural language) text."
  4. Clean the text.  There are a few steps here in the code below; most of them are fairly obvious in what they do, but a few comments:
    1. "Stopwords" are words that effectively do not matter.  Think of articles ("the", "a","an") which are contained in a document, but don't add a lot of value to meaning.
    2. That second word removal step.  Yeah. About that. I found that I have certain vocabulary quirks, that I use often, and don't add a lot of value.  Text mining helped me realize this initially, but it's annoying, and I promise to stop using the word "just" in all my tweets.
    3. Stemming: stemming is the process of taking words and reducing them to their root.  That way when I say "run" and "running", the words are interpreted the same way by our predictive algorithms.
  5. Create a Document Term Matrix.  This is the process of creating a matrix of words, and how often they show up in each document (tweet).  This is the process of making the data numerical, so we can feed it to an algorithm.
  6. Remove null rows. Step 5 creates "0" rows sometimes.  Need to remove them. 
  7. Descriptive Statistics:  What words are most common (how I learned about my use of the word "just")?  Getting a basic view of the data can be helpful for identifying additional cleaning and getting a layout of the data, just like in any other dataset.
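
The steps above can be sketched in base R as a rough stand-in for the tm workflow. The three tweets and the short stopword list are made up for illustration, and stemming is left out to keep the sketch dependency-free (tm's stemDocument() handles that step on a real corpus).

```r
# Step 2 stand-in: made-up tweets instead of the CSV archive
tweets <- c(
  "Just went for a run on the trail!",
  "The sales tax is up again.",
  "just the"
)

# Step 4: lower-case, strip punctuation, and remove stopwords
# (a tiny illustrative stopword list, including my personal quirk "just")
stop_words <- c("the", "a", "an", "is", "on", "for", "just")
clean <- gsub("[[:punct:]]", "", tolower(tweets))
tokens <- strsplit(clean, "\\s+")
tokens <- lapply(tokens, function(words) {
  words[nzchar(words) & !(words %in% stop_words)]
})

# Step 5: document-term matrix (rows: tweets, columns: terms, cells: counts)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# Step 6: drop tweets that have no terms left after cleaning
dtm <- dtm[rowSums(dtm) > 0, , drop = FALSE]

# Step 7: most common terms across all remaining tweets
word_counts <- sort(colSums(dtm), decreasing = TRUE)
print(word_counts)
```

The third tweet contains only stopwords, so it becomes an all-zero row and is removed in step 6, which is exactly the case that step exists for.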


So today's post is a fairly straightforward how-to on cleaning data. Tomorrow we'll get into actually modelling text data.  

Except, I don't feel that today's post is very meaningful.  I mean, where's the content?  See the last step in my code.  Here's the wordcloud it created (I'm not a fan of wordclouds, but whatever).  

The word cloud makes a lot of sense.  At the center is "kwhit14" which is the twitter handle for my wife, who I talk to regularly on social media, sometimes when we're in the same room.  The next two words are "work" (I tweet about work stuff quite a bit) and "think" (on those psychological tests I'm told that I am a "thinker" rather than a "feeler" meaning that I would express what I think rather than what I feel).  The words descend from there, a lot of common words, and references to people I talk to often online, and terms I use often.   

Monday, July 13, 2015

Local Gender Ratios Part 2

Last week, my research for a friend led me down a rabbit hole of local gender ratios.  That same research quickly helped me find some racial anomalies in northwest Kansas, which turned out to be influenced by prisons.

I had promised a second post on gender ratios, but the other post was a bit more interesting, so it took precedence. After looking at the county level last Thursday, I drilled into the census block level.  The census block is a lower level of data, of varying geographical size, for which there is detailed census demographic data.  What I found was kind of interesting.


First a map of gender ratio by census block.  The map is shown below.

My main observation was this: larger, more rural census blocks tend to have more men, while smaller (city/town) blocks have more women.  But there is a lot of variance in a map like this.  What if we focus on a smaller area?  Here's a map of Johnson County, Kansas only:

The Johnson County map demonstrates the same pattern, with more urban, dense census blocks being more female (redder) while outlying areas tend to be more male (bluer).

What causes this?  I have a few a priori guesses, but nothing strong:

  • Males tending to be more comfortable living alone in the woods.
  • "Feminized" jobs (read: secretaries, nurses, teachers) tend to lie more in cities and denser areas.
  • Older (urban core) communities tend to have older populations, so underlying correlation of age:gender ratio takes effect.
  • Women tend to move to town after their farmer-husbands die. (seen this in my own family)
That's an ocular analysis of gender skews.  Is there any statistical validity, though?


Can we model gender ratios by other data? If you want to skip the nerd stuff, here's your short answer:

It can be modeled, and significant predictors found, but the model isn't hugely "predictive."

What factors appeared to matter in predicting gender ratio?  Here are our variables:

  • Percent Female: Dependent variable.  What we're predicting.
  • Dense: Population Density. Should be positive, as denser populations seem to lean female.
  • Med_age: Median age of population.  We know that older populations are more female, so this should be positive.
  • Vac_Perc:  We assumed that housing availability likely mattered, and assumed that the amount of vacant housing would be negatively correlated to number of females.
  • Renter_Perc: Another housing variable, this time percent of housing units that are rented rather than owned.  Likely positive to female rate.
So do the models work?  Yes and no.  Here's the first model, on statewide data:

Notice that each variable is significant and in the correct direction.  A couple of comments though.  First, the R-squared is low, so while we have significant predictors, we don't have the most "predictive" model.  This tends to happen quite a bit when predicting percentages, as we are here; in reality, there are likely a lot of factors that impact gender ratios, including many local effects that are difficult to measure.  Second, we have over 120K observations here, meaning a lot of statistical power.  So while our p-values are *very low*, this isn't necessarily representative of very predictive variables.
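
The point about statistical power can be illustrated with a small simulation. This is a sketch on simulated data, not the census block data: the variable names echo the model above, but the effect size, noise level, and seed are assumptions chosen only to show a weak effect at a large sample size.

```r
set.seed(1)

# Simulate about as many observations as the statewide model had
n <- 120000

# A weak true relationship buried in a lot of noise (all values are made up)
med_age <- rnorm(n, mean = 40, sd = 10)
perc_female <- 50 + 0.05 * (med_age - 40) + rnorm(n, sd = 10)

# Fit a simple linear regression and pull out the fit statistics
fit <- lm(perc_female ~ med_age)
fit_summary <- summary(fit)
r_squared <- fit_summary$r.squared
p_value <- coef(fit_summary)["med_age", "Pr(>|t|)"]

cat("R-squared:", signif(r_squared, 2), "\n")
cat("p-value:  ", signif(p_value, 2), "\n")
```

The predictor comes out highly significant even though it explains well under 1% of the variance, which is the same pattern as a low R-squared with very low p-values.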

Because of the statistical power issue, and concerns about whether different counties behave differently, I also ran the analysis for a sub-sample of counties.  Generally the relationships hold up, but are weaker (due to lower N in more rural counties).  First, the county I live in, Johnson County:

Next the county where I grew up, Saline County:

Next a more rural county, Lincoln County:

And finally, a VERY rural county, Gove County (only 2600 people live here):


A few easy bullet points for a conclusion:
  • Gender ratios vary geographically, sometimes in very significant ways.
  • At least part of these variations appear to be systematic, and correlated to other variables.
  • We know at least some of the factors that determine gender ratio by county, however the global model isn't extremely predictive: likely many local factors at play.
  • My friend (from the initial analysis) should spend her time in rural areas, with young populations, vacant housing, and few renters.

Friday, July 10, 2015

African American Clusters in Rural America: Prisons

In yesterday's post looking at gender ratios, I noted the weird highly "male" clusters in Western Kansas. Per the post, those clusters turned out to be mainly explained by prisons, as well as labor-based immigration.  

I thought about this a little last night.  If prisons *move* people demographically (realizing they don't impact actual community dynamics due to incarceration) what other demographic shifts could occur due to the location of prison populations versus actual population?  


To think about how prisons may shift demographics, I needed to think about how prison populations are different from the general population. More men are incarcerated than women, obviously, but I also knew that more African Americans are incarcerated. This could be exacerbated in rural communities, due to the homogeneously white populations in these areas.  The question became whether the skew was large enough in Kansas to impact demographics. I found my answer here, with this fact:

African Americans are overrepresented by a factor of about five in the Kansas prison system versus the general population.

I plotted the data on the map, and then focused on the area of northwest Kansas as defined by the area west of Salina, and generally north of Great Bend.  This area is interesting to me because both my mother and my father came from here and most of my family still lives in Mitchell and Osborne counties.

Here is what a map of counties by % African American in Northwest Kansas looks like:

So, a lot of red, but three counties with greater than 1% African American populations.  Explanations?

  • Ellsworth County:  Most of the African Americans appear to be associated with the prison.
  • Norton County: Most of the African Americans appear to be associated with the prison.
  • Graham County: Actually a cool story here, about ex-slaves forming a new town on the Western Kansas high prairie.  Not appropriate for this blog but read about it here.
If you prefer data in a table format, here's an easier look with a couple of other data points.  There is also a fairly large African American population in Ellis County, largely associated with the college (Fort Hays State University).  Still, in aggregate that county is less than 1% African American.

After some cross referencing with Kansas Prison records, census data, and a few calculations, it appears that there are about 414 African Americans in prison in northwest Kansas.  Which leads to a startling statistic:

37% (more than a third) of African Americans living in northwest Kansas are prisoners. 
I realize that this doesn't mean that 37% of African Americans living in the area go to jail, but instead there are two underlying mechanisms that end in this demographic distortion:

  • African Americans are incarcerated at a rate higher than whites.
  • Prisoners are shipped from more African American heavy parts of the State, to the very white northwest Kansas. 
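
As a quick back-of-the-envelope check, the two numbers above (about 414 prisoners, 37% of the total) imply the size of the total African American population in the area. The calculation below uses only those two rounded figures, so the result is approximate.

```r
# Figures quoted above
prisoners <- 414
share_in_prison <- 0.37

# Implied total African American population in northwest Kansas
implied_population <- prisoners / share_in_prison

# Implied number who are not prisoners
implied_non_prisoners <- implied_population - prisoners

cat("Implied total:       ", round(implied_population), "\n")
cat("Implied non-prisoners:", round(implied_non_prisoners), "\n")
```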


I don't do a lot of commentary on this blog, but I will here, as northwest Kansas is sort of my "homeland."  Two final scenarios to think about:

  • Imagine being an African American from eastern Kansas, incarcerated and moved west into a town that's 99% white, but your fellow inmates are disproportionately black.  The optics of that, combined with the cultural history of our country, create a disturbing situation.

  • Imagine being a young white child growing up in a county with a prison.  Sure, you see African Americans in media, but the only ones in your community are prisoners or former prisoners.  How might that impact the development of your perceptions of the world?

Thursday, July 9, 2015

Analytics for Dating: Local Gender Ratios?

The other day I heard a friend in her mid-30's say: "I used to be on Match, but they were always connecting me to guys in their 50's."  I thought this was weird: did Match have some kind of built-in "sugar-daddy" bias?  Maybe.  I remember reading the book "Dataclysm" last year, which pointed towards that type of behavior from guys on dating sites.  But could something else in the data (demographics) be to blame?

Today a light bulb went off: I had recently seen some weird age trending data for Kansas showing a proportional dropoff in people in their 30's living in Kansas.  Were the two related?


So in a quick conversation on twitter yesterday, I was led in the direction of how Kansas may be demographically different than the US in general.  I ran a few numbers, but one *anomaly* stuck out to me: Kansas has proportionately fewer people between 34 and 47 than one would expect based on US population data.  This double axis graph lays it out fairly well:

Could this be the root cause of my friend's issue, just more people in their 50's than 30's?  I thought maybe, but the effect size is fairly low (about a 5% differential). 

But what if the age anomaly was somehow gender driven, then maybe?  I broke down the numbers by gender.  No huge difference, though another trend emerged: the gender ratio (ratio of males to females) is higher in Kansas in young adulthood than in other parts of the US.  And... down the rabbit hole I went.


Quick background on gender ratio (or human sex ratio, yes I know the difference, but one matters to initial biology, the other to dating, so, we have to consider each).  The ratio varies over different human populations, but generally is about 1.06 at birth (more males than females).

The ratio steadily declines with age as men die faster due to biology and doing dumb things (shout out to my wife, yes I'll go to the doctor this decade).  Sometime in early adulthood the ratio approaches 1.0, and dips down significantly beyond that.  By 60 the ratio has fallen to 0.93, by age 80 it is around 0.7.

Although the ratio starts well above 1.0, the early deaths of males mean that population averages are generally less than 1.0.  For instance the ratios for Kansas and the United States as a whole are .984 and .967 respectively.  By that measure, there are more dudes proportionally in Kansas, so my friend should have no problems finding a guy?  (cue Levi getting slapped)
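
To put those two ratios on a more familiar scale, a male-to-female ratio r converts to a share of the population that is male as r / (1 + r). A small sketch using the two figures above:

```r
# Male-to-female ratios quoted above
ratio <- c(Kansas = 0.984, US = 0.967)

# Convert each ratio to the share of the population that is male
share_male <- ratio / (1 + ratio)

# Gap between Kansas and the US, in percentage points
gap_points <- 100 * (share_male[["Kansas"]] - share_male[["US"]])

print(round(100 * share_male, 1))
cat("Gap (percentage points):", round(gap_points, 2), "\n")
```

Seen this way, the statewide difference is well under one percentage point, so any real dating-market effect has to come from local or age-specific ratios rather than the state average.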


So what does this all mean?  If gender ratios vary significantly by geography, then what are some potential research questions and implications?

  • What underlying factors drive gender ratio trends?
  • What makes communities more female friendly versus male friendly?
  • Because gender ratio varies by age, are "older" areas inherently more female?
  • A dating implication? : Go to this high-ratio county to meet men tonight!!!
Out of curiosity, I mapped gender ratios by Kansas counties.  To my moderate surprise, there was quite a bit of variation across the state.   Specifically, there were some clear heavily-male outlier counties. Here's the map by county.

I wrote a quick email to my friend, suggesting that she go live in one of the three heavily-male blue counties (Norton, Pawnee, Ellsworth).  Wait. Those outliers are actually unrealistically high. What is happening?

Prisons.  Those three rural counties are home to large male correctional facilities.  Correction email to friend: you should move to one of those three counties if you're into the dating prisoners thing.

On to the next tier of outlier male counties: 
  • Leavenworth: Prison.
  • Riley: Military Base.
  • Ottawa: Unsure, though I believe they have a jail that houses out-of-county inmates.
  • Two counties in far Southwest KS: Unsure, but they are surrounded by a statistically significant male leaning block.
For our out-of-state readers, Southwest Kansas has been defined by a massive influx of Central American immigrants over the past few decades.  These immigrants move here generally looking for work in meat packing and other agriculturally centered businesses in Southwest Kansas. Looking at the data, it appears that this movement for jobs in southwest Kansas has also impacted the gender ratio.  

Here's a map of Kansas counties by % Hispanic.  Note that the highly Hispanic area in southwest Kansas correlates nearly perfectly to the gender ratio outliers in the same area.


I started out trying to solve an anomaly on a dating site for a friend, and ended up looking at a lot of different issues.  

From a women seeking men perspective: I can say if women want to put themselves in areas with abundant men, then look at areas with high "traditionally male" labor demand.  Also prisons, if you're into that kind of thing.  I have a lower level (within Johnson County) analysis I will post tomorrow, that's a bit more informative, useful, and interesting.

Why are there few 30-somethings in Kansas?  No great hypothesis here, but a few that are testable.

  • May not be a "dip" at 34, but instead two ongoing factors.  An aging native-Kansas population causing a hump to the right side of the distribution, combined with 20-somethings moving into the state to find work (hence male skew), such as Fort Riley and southwest Kansas. This would cause an apparent lack of 30-somethings, but is actually just an abundance in the high and low age groups.
  • Liberals will tell you that 30-somethings are leaving the state in droves to escape the Brownback administration and go to economically better states.  I will say that I've had several friends leave Kansas for Colorado in the past five years, so this is anecdotally possible.
  • Birth rate issues: haven't really explored this at all, but it could have shifted over that time period.