Friday, July 31, 2015

Text Mining: #ksleg Followup

Last week's tweeting and blogging about mining the #ksleg (Kansas government related tweets) hashtag was a hit, so I decided to do it again.  First, a wordcloud (because I know that's what everyone cares about).  It looks like Brownback, budget, audit, cut, and ksed (education) are hot topics.

And I used correlated topic modeling to model this week's tweets as well.  What did I find?

This week, the algorithm only found three general topics in the data, and if you follow Kansas politics, these all make great sense:

  1. Covers the Kansas Legislative Post Audit committee's meeting, important news regarding a foster care audit.
  2. Covers issues with the Kansas education system, specifically regarding funding and teacher pay. 
  3. Covers a Thursday press conference, where the governor made additional budget cuts. 

And (this is pseudo-ironic-satire), the #ksleg hashtag tweeting leader-board for the week.  A tie for first place, but BryanLowry3 wins in the "retweeted" tie breaker*

And, some example top tweets by topic above:

Topic 1: 

Topic 2:

Topic 3:

*Quick note for the data scientists: this bit of satire is easy to replicate for any Twitter search term.  Essentially, you use the twitteR library to download all the tweets, convert them into a data frame, and then summarize the data.
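
Here's a minimal sketch of the "summarize" step in base R.  It assumes the tweets are already in a data frame shaped like the output of twListToDF(), with screenName and retweetCount columns; the handles and counts below are invented for illustration and are not real #ksleg data.

```r
# Toy stand-in for the data frame returned by twListToDF(); all values are invented.
tweets <- data.frame(
  screenName   = c("user_a", "user_b", "user_a", "user_c", "user_b"),
  retweetCount = c(3, 10, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Tweets posted and retweets received, per account.
n_tweets   <- tapply(tweets$retweetCount, tweets$screenName, length)
n_retweets <- tapply(tweets$retweetCount, tweets$screenName, sum)

leaderboard <- data.frame(
  screenName = names(n_tweets),
  tweets     = as.vector(n_tweets),
  retweets   = as.vector(n_retweets),
  stringsAsFactors = FALSE
)

# Rank by tweet count; ties are broken by retweets received.
leaderboard <- leaderboard[order(-leaderboard$tweets, -leaderboard$retweets), ]
print(leaderboard)
```

In this toy data two accounts tie on tweet count, so the retweet total decides the order, which is the same tie breaker used in the leader-board above.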

Thursday, July 30, 2015

Text Mining Lafayette Shooting: Followup

Last week I posted some topic modeling results about the immediate twitter reaction to the Lafayette shooting (post found here).  I thought it would be interesting to revisit the data a week later, and see if the conversation has changed.  Short post here, but some interesting findings.

First the wordcloud:

Lots of "gun" and "control" talk.  But also the word "argument" .. and something about a comedian named Jim Jefferies.

Interesting, but what do the actual models look like?  This time, it split into fewer topics:

To summarize the topics:

  1. Focused on gun control directly, with a lot of tweets linking to a video of a comedian named Jim Jefferies "destroying" arguments against gun control.
  2. Focused on facts of the case.  Specifically related to the injured and still living victims.
  3. Focused on gun control.  A lot of demands on Republican candidates to refuse donations from the NRA.
  4. Focused on gun control generally.   Many tweets using the hashtag "#gunsense"
  5. Focused on the victims that died.  Specifically related to the two women, funerals, and other facts.

And finally by frequency:

So a few conclusive points:
  • About a week after the shooting, 70% of tweets still using the hashtag relate to gun control topics.  This is roughly double the proportion from immediately after the shooting.
  • One thing missing is Zayn Malik, and anything grossly off topic in general, which is good news.  It appears that once a hashtag is no longer trending, spammers stop using it.
  • To close out, here is something interesting to look at.  The tweet that "ranks" (has highest probability of association) with each topic:

Topic 1:

Topic 2:

Topic 3:

Topic 4 (This is cool, because it shows how topic modeling groups together "unobserved" topics.  This isn't explicitly saying gun control, but it certainly is a gun control related tweet)

Topic 5:

Monday, July 27, 2015

Kansas Teacher Salaries: The $7,000 Mistake

Last Friday, Governor Brownback in Kansas held a press conference to cover various topics from abortion to education.  Recent news stories in Kansas had held that teachers were leaving the state  (and profession) in record numbers due to a de-funding of education.  Brownback moved to address this issue with the following chart:

So, if you follow this blog you know I am already annoyed at the chart that's not zero scaled.  But Brownback just annoyed many more people, for far less nerdy reasons.  Turns out, the numbers were wrong, to a scale of about $7,000.  In essence, the administration claimed that teachers made $7,000 more than their Missouri counterparts, when salaries are effectively equal.  The media was immediately all over this; if you want to read more on that, here's a summary.  It seems to me this shouldn't be hugely difficult data to validate...

There's another problem here though:  Teacher salaries vary across the country for dozens of legitimate reasons; even if the $7,000 difference was accurate (it wasn't), we shouldn't jump to conclusions before exploring reasons behind the variance.


I've looked at teacher pay before in my career, and referenced that before on this blog.  For a quick background, there are several categories of things that impact teacher pay:
  • Attributes specific to teacher (education level, experience)
  • Attributes specific to district (cost of living, ability to pay, working conditions, amenities)
Because the administration is looking at aggregate numbers, the most relevant numbers are the district numbers.  Of those numbers, the most significant by far is cost of living (Pearson r = .77). So, at the very least, any comparison of teacher compensation should look at differences in cost of living.


When addressing teacher pay, there are generally two accusations made by the teaching community:

  1. Teachers are leaving for Missouri where salaries are higher and conditions are better.
  2. Teachers in Kansas are paid less than those in other states.
For the first point here, I don't think that state-level aggregate data is even the correct analysis.  More interesting would be comparisons of Northeast Kansas to Northwest Missouri salaries, or pairwise comparisons of adjacent border county districts.  

But that's probably not even sufficient for a few reasons.  Teachers are likely leaving not just because of salary differential, but perceived future differentials: They feel things are getting worse, and it will continue that direction (read some media reports in Kansas for context).  That's difficult to measure or model.  Maybe I'll try later this week though.  

But the second question is still out there, and quite a bit easier to answer: adjusted for cost of living, are Kansas teacher salaries similar to national averages?


I used data that's a couple years old, but appears to be the most reliable source on the matter.  In this data, Kansas teachers make in aggregate $47,464, while Missouri teachers make $47,517.  Very close, but the national average is north of $56,000, a number that is pulled up by a few high-population, high cost of living States.  For methodology, I simply regressed the teacher salary data by state against state costs of living.
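
For the curious, here's a minimal sketch of that kind of regression in base R.  The state labels, cost of living index values, and salaries are invented placeholders (not the data I used), and the column names are my own for illustration.

```r
# Hypothetical state-level data: a cost of living index and average teacher salary.
# These numbers are invented for illustration only.
states <- data.frame(
  state  = c("A", "B", "C", "D", "E", "F"),
  cost   = c(90, 95, 100, 110, 125, 140),
  salary = c(45000, 47500, 50000, 54000, 62000, 70000)
)

# Regress average salary on cost of living.
fit <- lm(salary ~ cost, data = states)

# Share of salary variation explained by cost of living.
r2 <- summary(fit)$r.squared

# Percent above (+) or below (-) the salary the model expects
# given each state's cost of living.
states$pct_vs_expected <- 100 * (states$salary - fitted(fit)) / fitted(fit)
print(states)
```

The pct_vs_expected column is the "adjusted for cost of living" comparison: a state below the fit line gets a negative number.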

So what happens if we control for cost of living?  

A couple of points here: 
  • The model is fairly predictive, with cost of living accounting for almost 60% of all aggregate level teacher pay variation.
  • Kansas and Missouri both fall beneath the line.  Kansas has a higher cost of living than Missouri, so it is actually a little worse off.  The by-line here:  Kansas and Missouri both pay teachers between 3.5% and 4% less than national averages, adjusted for cost of living. Here's a summary:

Some nerdiness coming up, so non-nerds can skip...

I had a hypothesis (for various reasons, contact me if you want to know why) that a log-log transformation of this data would fit better.  It didn't really improve results much, and the fit line was essentially the same.  In fact, when I corrected the R-squared from the log-log model (the one stated on the chart below is invalid due to the impact on variance of a logarithmic transformation), the R-squared values were nearly identical between the two specifications.
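
One common way to make that correction is to back-transform the log-log predictions and compute R-squared against the outcome in its original units.  The sketch below does that on simulated data; it is an assumption about how such a correction can be done, not necessarily the exact calculation I ran, and it ignores the small bias that comes from exponentiating log-scale predictions.

```r
set.seed(1)

# Simulated stand-in data (not the teacher salary data).
cost   <- runif(50, 85, 150)
salary <- 500 * cost^0.95 * exp(rnorm(50, sd = 0.05))

linear_fit <- lm(salary ~ cost)
loglog_fit <- lm(log(salary) ~ log(cost))

# summary(loglog_fit)$r.squared is measured on the log scale, so it is not
# comparable to the linear model. Back-transform the predictions and
# compute R-squared against salary in dollars instead.
pred_dollars      <- exp(fitted(loglog_fit))
r2_loglog_dollars <- 1 - sum((salary - pred_dollars)^2) /
                         sum((salary - mean(salary))^2)

r2_linear <- summary(linear_fit)$r.squared
print(c(linear = r2_linear, loglog_backtransformed = r2_loglog_dollars))
```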


Just a couple of quick bullet points:
  • When adjusted for cost of living, Kansas teachers make on average 0.4% less than Missouri teachers.  
  • Both Kansas and Missouri teachers make at least 3.5% less than the national average.
  • A more in depth study would be required to determine why teachers are leaving the state, including analysis of atmosphere, and perceived future salary risks.  I have a few ideas on how to analyze this, and will try to get to this in the next couple of weeks.

Friday, July 24, 2015

Text Mining: Lafayette Shooting

Today I learned that Zayn Malik may be coming back to One Direction.  That's not the point of this post, but unfortunately, more on that later.

This morning I noticed that the #Lafayetteshooting hashtag was getting a lot of action, and at least on my Twitter timeline, the talk was skewed towards a gun control conversation.  I've heard two kinds of accusations around this, that people are either too willing or not willing enough to talk about gun control after a mass shooting.

People are obviously talking gun control already, but can we quantify it?  Hey, that code I've been kicking around the last couple of days could help...


I downloaded tweets from mid-afternoon today regarding the Lafayette shooting from yesterday.  I created a word cloud to get a sense of common words.  Like I suspected, frequent words centered around "gun" .. and ... national tequila day? What?

If gun control is such a large part of our conversation, then it should easily segregate into topics.  So I ran a CTM analysis:

The topics above can be summarized generally in this way:
  1. Focuses on facts of the shooting: gunman's name, media reports, etc.
  2. Focuses on emotional reactions: victims, prayers, thoughts, tragedy, etc.
  3. Focuses on:  .. national tequila day.. and Zayn Malik.  WHAT?  Ok this is something that happens on twitter, people hijack popular hashtags and use them for their own purposes.  FYI, Zayn isn't coming back to OD, it's just a silly rumor.
  4. Focuses on gun control: specifically "yet another mass killing, need more gun control."
  5. Focuses on gun control:  though many of the tweets are anti-gun control.
  6. Focuses on mental health issues: mental, illness, lone, white, gunmen.
  7. This is a low frequency topic, that focuses on general facts of the case, theater, movie, Lafayette, etc.


Using a posterior estimate, I can estimate what portion of tweets are on gun control by adding topics four and five, and that's roughly a third.  So effectively, the day after this mass shooting, we see that about a third of the conversation has to do with gun control. Closer to 40%, if we remove Zayn Malik and National Tequila Day. 
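
Here's a minimal sketch of that arithmetic.  It uses a tiny invented matrix of per-tweet topic probabilities; with the topicmodels package, posterior(model)$topics gives a matrix of this shape.  The numbers are made up and will not reproduce the shares quoted above.

```r
# Toy document-by-topic probability matrix (rows are tweets, each row sums to 1).
# Values are invented for illustration.
topic_probs <- rbind(
  c(0.70, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02),
  c(0.05, 0.05, 0.05, 0.60, 0.20, 0.03, 0.02),
  c(0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05),
  c(0.10, 0.10, 0.05, 0.15, 0.50, 0.05, 0.05)
)
colnames(topic_probs) <- paste0("topic", 1:7)

# Average probability mass per topic across all tweets.
topic_share <- colMeans(topic_probs)

# Gun control share: topics 4 and 5 combined.
gun_control <- sum(topic_share[c("topic4", "topic5")])

# Same share after dropping the off-topic cluster (topic 3) and renormalizing.
gun_control_ex_spam <- gun_control / (1 - topic_share[["topic3"]])

print(round(c(all_tweets = gun_control, excluding_spam = gun_control_ex_spam), 3))
```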

Text Mining: Mining the BlackLivesMatter Hashtag

Last night I asked my wife if she had read yesterday's post on topic mining the #ksleg hashtag, remarking I have never had topic models converge to news stories quite that well, blah blah blah; nerd nerd nerd. 

Honestly, she looked a little unimpressed.  And then finally said, "why don't you do that on the #blacklivesmatter hashtag?"

She was right.  I had mined a policy-wonk centered, Kansas specific hashtag.  Why?  I should do something with more volume, that people actually care about.  And so I went to downloading.  


Just a few nerd notes on mining larger hashtags in R:

  • The API has a limit for search results that I hit a few times, you just have to wait it out.
  • Changing your search is really as simple as changing the term in code, as well as time constraints.
  • This hashtag gets a lot of volume, so random subsampling is necessary.  I used two point-in-time samples, one from the 22nd, one from the 17th.
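
The subsampling itself is a one-liner in base R.  A minimal sketch, assuming the day's tweets are already in a data frame; the toy data frame and the sample size of 1,000 are arbitrary choices for illustration, not the sizes I actually used.

```r
set.seed(42)  # make the sample reproducible

# Hypothetical pool of downloaded tweets for one day.
day_tweets <- data.frame(
  id   = 1:5000,
  text = paste("tweet", 1:5000),
  stringsAsFactors = FALSE
)

# Draw a simple random sample of rows, without replacement.
sample_size <- 1000  # arbitrary size chosen for this sketch
keep        <- sample(nrow(day_tweets), size = sample_size)
day_sample  <- day_tweets[keep, ]
```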


This data ended up being fascinating, so I ran quite a few models. My sample included two days, 2015-07-17 and 2015-07-22.  I created an individual model for each day, as well as looked at the general high-frequency terms.  All the outputs for each day are below in "appendix" but here are some general patterns I discovered:
  • Both days' tweets were dominated by the Sandra Bland case, an African American woman who died in police custody.
  • 2015-07-17 was the one year anniversary of Eric Garner's death, more about that here.  I didn't know this before downloading the data, but it was apparent that this was a major topic of the day.
  • There was a clear change in the Sandra Bland related topics between the days:
    • On the 17th, the terms were more factual about the case or events (jail, mystery, death, vigil). 
    • On the 22nd, the terms had grown angrier: the "f" word was now a top word, along with murder, demand, investigation, and kill.
Those two models and descriptive statistics are below in my appendix.  But the combined model is fairly interesting. Here's what the algorithm found:

Here's my analysis of each topic:

  1. This is a Sandra Bland topic that centers around finding justice (DOJ, Justice, Investigation).
  2. This topic also has Sandra Bland undertones, but focuses on the racism-specific notions of the case.
  3. This topic focuses on the African American community using the media and twitter to fight racism.  (read: Black twitter)
  4. This is an Eric Garner one year anniversary topic.
  5. This is another Sandra Bland related topic, largely around white cops killing black people, and that "needing to stop."
The topics here are fairly tight and obvious, but can we demonstrate how they changed over time?  This chart shows tweets allocated to each topic for each day:

Effectively, the Eric Garner topic went away and was replaced with more calls for Justice for Sandra Bland.  The other topics stayed the same for the most part, with a slight increase with the calls to end white police killing black people.
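
For the nerds, the per-day allocation behind a chart like this can be sketched in a few lines of base R.  The data frame below is invented; in practice the topic column would come from something like topics(model) in the topicmodels package, which returns the most likely topic for each document.

```r
# Toy data: one row per tweet with its date and most likely topic.
# All values are invented for illustration.
assigned <- data.frame(
  day   = as.Date(c("2015-07-17", "2015-07-17", "2015-07-17", "2015-07-17",
                    "2015-07-22", "2015-07-22", "2015-07-22", "2015-07-22")),
  topic = c(4, 4, 1, 2, 1, 1, 5, 2)
)

# Cross-tabulate day by topic, then convert each row to percentages,
# so every day's topic shares sum to 100.
counts       <- table(assigned$day, assigned$topic)
share_by_day <- round(100 * prop.table(counts, margin = 1), 1)
print(share_by_day)
```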

Let's be honest here though, we all just want to look at word clouds.  First a word cloud of the 17th (prominent Garner references):
Now a word cloud of the 22nd (Garner gone, now "justice" references):


JULY 22nd

Frequent terms from July 22nd:

Topics from July 22nd:

JULY 17th

Frequent terms from July 17th:

Topics from July 17th:

Wednesday, July 22, 2015

Text Mining Part Three: A Better Use Case

Mining my own tweets yesterday was interesting, but wasn't the most illustrative example of good text analysis.  There are a few reasons for that, but it's mainly about the weird way that I use twitter.  So what happens with a "tighter" and more topic friendly list of tweets?  Here's a quick look using some Kansas government tweets.


For this dataset I used a more "topical" set of tweets, specifically with the hashtag "#ksleg", meaning that they have something to do with Kansas government.  I've blogged about these kinds of topics before regarding taxes, elections, and education.  

To pull this data, you need to authenticate from R into Twitter using the setup_twitter_oauth() function.  Then retrieving the tweets is as simple as:

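library(twitteR)  # assumes setup_twitter_oauth() has already been called
tweets <- searchTwitter("#ksleg", n = 10000000, lang = "en", since = "2015-01-01")
tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)  # drop retweets, manual RTs, and MTs
tweets <- twListToDF(tweets)  # convert the list of status objects to a data frame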
*Quick note: this actually only returns about a week of data, due to a limitation in the Twitter search API.

Following this step, I went through all the data cleaning steps implemented in my post from last week.


So what does data under this hashtag look like? Common words are Kansas, KSED (hashtag used for Kansas educators), teacher, Brownback, and State.  Since everyone else in the world likes wordclouds, here you go:
Another way to view is the findFreqTerms(,20), which just gives us a list of words used more than 20 times in the tweets.

But can we model this data into topics?  Yes.  And much better than my personal tweet data.  Same procedures as yesterday, here are the topics created:

You may not be familiar with Kansas politics, but the fit here is much better than yesterday, and here are the "realities" behind each topic: 

Topic 9: this is a low-frequency topic; relates to impact of education policies on kids.

Another interesting way to look at this data is by the percent of tweets in each topic by day.  Interestingly, on the day that the story behind topic 1 broke, over 40% of #ksleg tweets were on that subject.  Topic 2 broke on the 17th, corresponding to a spike in related tweets.  The chart below demonstrates this:


Overall, my topic models work much better on data that is more topic-focused, like tweets about government.  If you're wondering about my tweeting in the last week, I had two tweets with the hashtag that were scored, one in topic 1, the other in topic 5.  

Have I mentioned that you can create word clouds within each topic?  Because I know that's all people really want.  Wordclouds.  To finish, some topical wordclouds:

Topic 1: 

Topic 4: 

Topic 6: 
Topic 7: 

Text Mining Part 2: Fitting Topic Models

Finally have time to write my second post on text mining.  My first post was late last week, and dealt with the rather boring subject of importing and cleaning data.  This one gets into some actual analysis.


So what types of models will I be creating here?  Topic models.  Topic models are a way to analyze the terms (words) used in documents (for this purpose, tweets) and group the documents together into topics.  In simple terms, the algorithms used find which words are often used together and cluster those word commonalities into topics.  It's a way to measure unobserved groups (topics) in underlying distributions.  From a predictive standpoint, we can then apply those topics back to the original documents, and group our tweets by topic.

Before starting on a model, a good place to get started thinking about this, is to look at which words associate most highly with which other words. R provides this functionality through the findAssocs() function.  I ran it for two things I tweet about often: running and taxes.  

The results make sense here.  The top three words associated with "run" are long, mile, and trail.  That means that documents with the word run are more likely to contain those words than other tweets.
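
As I understand it, findAssocs() reports how strongly each term's column in the document term matrix correlates with the column for the term you ask about.  Here's a minimal base R sketch of that idea on a tiny invented matrix; it illustrates the concept and is not the tm implementation.

```r
# Tiny invented document-term matrix: rows are tweets, columns are term counts.
dtm <- rbind(
  c(run = 1, long = 1, mile = 1, tax = 0, income = 0),
  c(run = 1, long = 1, mile = 0, tax = 0, income = 0),
  c(run = 0, long = 0, mile = 0, tax = 1, income = 1),
  c(run = 0, long = 0, mile = 0, tax = 1, income = 0),
  c(run = 1, long = 0, mile = 1, tax = 0, income = 0)
)

# Correlate the "run" column with every other term column.
run_assoc <- cor(dtm[, "run"], dtm[, colnames(dtm) != "run"])[1, ]
print(sort(run_assoc, decreasing = TRUE))
```

Terms that show up in the same tweets as "run" get positive correlations, while terms that only appear in the tax tweets come out negative.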

The top words associated with tax are sale, income, overhaul, growth, outstrip, deduction, and elasticity.  If you've followed my blogging or tweeting on Kansas politics in recent months, this all makes great sense.  But will we be able to see these groups in the final topic models?


There are two basic types of topic models that I will use in this analysis: Latent Dirichlet Allocation and Correlated Topic Models. Both models work in similar ways, assuming a prior distribution (Dirichlet and logistic normal, respectively) and using those distributions to model the underlying topics.  The main divergence in the models is that the Correlated Topic Model then calculates covariance over the top, which adds the bonus of correlation between topics, and makes the model generally fit better.

You can email me or comment below if you want to understand the underlying math behind these types of models, but this is more of a how-to explainer.

How do we get started in R?  The R code is straightforward, as seen below.  First build the models using the LDA and CTM commands, then you can get a list of high probability words for either model using the "terms" command.  There are more parameters obviously, but I want to start simple.
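
A minimal sketch of those calls, assuming the topicmodels package is installed.  Instead of my tweets, it uses a small slice of the AssociatedPress document term matrix that ships with the package, and the choice of two topics is arbitrary.

```r
library(topicmodels)

# Stand-in data: a small slice of the AssociatedPress document-term matrix
# bundled with topicmodels (the tweet data from this post is not used here).
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]

k <- 2  # number of topics; arbitrary for this sketch

# Fit both model types on the same matrix.
lda_fit <- LDA(dtm, k = k, control = list(seed = 1))
ctm_fit <- CTM(dtm, k = k, control = list(seed = 1))

# Top five highest-probability terms per topic for each model.
lda_terms <- terms(lda_fit, 5)
ctm_terms <- terms(ctm_fit, 5)

# Most likely topic for each document under the LDA fit.
lda_topics <- topics(lda_fit)
```

With real tweets you would swap in your own document term matrix and try several values of k.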

How good are the models provided here?  They're ok.  Unfortunately my tweets are fairly sparse topic wise, but I expected to see two major topics pop up: running and Kansas legislative policy.  And I did, essentially, and it appears on face that the CTM is the better fit to the data.  Here are the high probability terms list:

The CTM model found both of my original topics, contained within topic 3 (running) and topic 8 (Kansas policy). I'll post later on how to fit these models (potentially with better data), and how to evaluate how good they are in dividing topics, but this is a good start.

One last thing for now.  Can we pull a list of tweets on a particular topic using the models derived above?  Yes.  Here are the top two tweets associated with my topic 8, which generally associates with policy.

Friday, July 17, 2015

Text Mining First Step: Topic Modeling Your Tweets Pt. 1

Many businesses are seeing value from mining the vast amounts of text data they accumulate, meaning text mining skills are in high demand for analysts.  For analysts new to mining text, the analysis can seem intimidating.  From a primal-analyst perspective it goes something like this:
Analyzing words?  I don't do words!  I do numbers!

So I make fun of myself.  But on a serious note, the amount of pre-processing, data cleaning, and understanding of "language" concepts can make textual analysis a scary prospect.  Whenever tackling a new type of analysis, I find it best to start with small, familiar datasets.  Luckily, Twitter data both works well for text mining, and is easy to get started with.  In this post I'll take you through the basic steps of importing and cleaning twitter data; tomorrow I will show the results of some initial topic models.


The twitteR library in R provides a nice direct way to connect to twitter and import data.  Unfortunately this requires a twitter developer account and some authentication shenanigans, so it's not the easiest way to get started.  Instead, if you're just looking for a quick intro, I suggest going into the settings of your twitter account, and requesting your archive as a CSV.


OK, so now we have the data.  What do we need to do to get started?  Here are the initial steps to prepare the data:

  1. Read in your packages (I use tm, RTextTools, topicmodels, and wordcloud, in addition to my normal load).
  2. Read in the data.  Simply reading in the csv provided by twitter is straightforward in R.
  3. Create a corpus from the data.  Per R documentation, corpora are "collections of documents containing (natural language) text."
  4. Clean the text.  A few steps here in code below, most of them are fairly obvious in what they do, but a few comments.
    1. "Stopwords" are words that effectively do not matter.  Think of articles ("the", "a", "an") which are contained in a document, but don't add a lot of value to meaning.
    2. That second word removal step.  Yeah. About that. I found that I have certain vocabulary quirks, that I use often, and don't add a lot of value.  Text mining helped me realize this initially, but it's annoying, and I promise to stop using the word "just" in all my tweets.
    3. Stemming: stemming is the process of taking words and reducing them to their root.  That way when I say "run" and "running", the words are interpreted the same way by our predictive algorithms.
  5. Create a Document Term Matrix.  This is the process of creating a matrix of words, and how often they show up in each document (tweet).  This is the process of making the data numerical, so we can feed it to an algorithm.
  6. Remove null rows. Step 5 creates "0" rows sometimes.  Need to remove them. 
  7. Descriptive Statistics:  What words are most common (how I learned about my use of the word "just")?  Getting a basic view of the data can be helpful for identifying additional cleaning and getting a layout of the data, just like in any other dataset.
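
The steps above can be sketched roughly as follows, assuming the tm package (plus SnowballC for stemming) is installed.  The three example tweets are invented stand-ins for the text column of the archive CSV, and the extra word list contains only "just" for illustration.

```r
library(tm)

# A few invented tweets standing in for the text column of the archive CSV.
tweet_text <- c(
  "Just went running on the trail, 10 miles!",
  "Kansas sales tax receipts are just below estimates.",
  "the a an"
)

# Step 3: build a corpus. Step 4: clean it.
corpus <- VCorpus(VectorSource(tweet_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("just"))  # personal filler words
corpus <- tm_map(corpus, stemDocument)            # needs SnowballC installed
corpus <- tm_map(corpus, stripWhitespace)

# Step 5: one row per tweet, one column per term.
dtm <- DocumentTermMatrix(corpus)

# Step 6: drop tweets that have no terms left after cleaning.
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Step 7: terms appearing at least once (raise the threshold on real data).
frequent <- findFreqTerms(dtm, lowfreq = 1)
print(frequent)
```

The third toy tweet is nothing but stopwords, so it ends up as an empty row and gets removed in step 6.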


So today's post is a fairly straight-forward how-to clean data. Tomorrow we'll get into actually modelling text data.  

Except, I don't feel that today's post is very meaningful.  I mean, where's the content?  See the last step in my code.  Here's the wordcloud it created (I'm not a fan of wordclouds, but whatever).  

The word cloud makes a lot of sense.  At the center is "kwhit14" which is the twitter handle for my wife, who I talk to regularly on social media, sometimes when we're in the same room.  The next two words are "work" (I tweet about work stuff quite a bit) and "think" (on those psychological tests I'm told that I am a "thinker" rather than a "feeler" meaning that I would express what I think rather than what I feel).  The words descend from there, a lot of common words, and references to people I talk to often online, and terms I use often.   

Monday, July 13, 2015

Local Gender Ratios Part 2

Last week, my research for a friend led me down a rabbit hole of local gender ratios.  That same research quickly helped me find some racial anomalies in northwest Kansas, which turned out to be influenced by prisons.

I had promised a second post on gender ratios, but the other post was a bit more interesting, so it took precedence. After looking at the county level last Thursday, I drilled into the census block level.  The census block is a lower level of data, of varying geographical size, for which there is detailed census demographic data.  What I found was kind of interesting.


First a map of gender ratio by census block.  The map is shown below.

My main observation was this: larger, more rural census blocks tend to have more men, while smaller (city/town) blocks have more women.  But a lot of variance in a map like this.  What if we focus on a smaller area.  Here's a map of Johnson County Kansas only:

The Johnson County map demonstrates the same pattern, with more urban, dense census blocks being more female (redder) while outlying areas tend to be more male (bluer).

What causes this?  I have a few a priori guesses, but nothing strong:

  • Males tending to be more comfortable living alone in the woods.
  • "Feminized" jobs (read: secretaries, nurses, teachers) tend to lie more in cities, denser areas.
  • Older (urban core) communities tend to have older populations, so underlying correlation of age:gender ratio takes effect.
  • Women tend to move to town after their farmer-husbands die. (seen this in my own family)
That's an ocular analysis of gender skews, any statistical validity though?


Can we model gender ratios by other data? If you want to skip the nerd stuff, here's your short answer:

It can be modeled, and significant predictors found, but the model isn't hugely "predictive."

What factors appeared to matter in predicting gender ratio?  Here are our variables:

  • Percent Female: Dependent variable.  What we're predicting.
  • Dense: Population Density. Should be positive, as denser populations seem to lean female.
  • Med_age: Median age of population.  We know that older populations are more female, so this should be positive.
  • Vac_Perc:  We assumed that housing availability likely mattered, and assumed that the amount of vacant housing would be negatively correlated to number of females.
  • Renter_Perc: Another housing variable, this time percent of housing units that are rented rather than owned.  Likely positive to female rate.
So do the models work?  Yes and no.  Here's the first model, on statewide data:

Notice that each variable is significant and in the correct direction.  A couple of comments though.  First, the R-squared is low, so while we have significant predictors, we don't have the most "predictive" model.  This tends to happen quite a bit when predicting percentages like we are here; in reality, there are likely a lot of factors that impact gender ratios, many of them local effects that are difficult to measure.  Second, we have over 120K observations here, meaning a lot of statistical power.  So while our p-values are *very low*, this isn't necessarily representative of very predictive variables.
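
To see how tiny p-values and a low R-squared can coexist, here's a minimal simulation in base R.  Everything in it is made up: the sample size only roughly matches the one mentioned above, two predictors stand in for the full model, and the coefficients are arbitrary, chosen so that real but weak effects sit under a lot of noise.

```r
set.seed(123)
n <- 120000  # roughly the sample size mentioned above

# Simulated predictors and outcome; none of this is census data.
dense      <- rexp(n)
med_age    <- rnorm(n, mean = 40, sd = 10)
pct_female <- 50 + 0.3 * dense + 0.05 * med_age + rnorm(n, sd = 10)

fit <- summary(lm(pct_female ~ dense + med_age))

# Tiny p-values thanks to the large sample...
print(fit$coefficients[, "Pr(>|t|)"])
# ...yet only a sliver of the variance is explained.
print(fit$r.squared)
```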

Because of the statistical power issue, and concerns about whether different counties behave differently, I also ran the analysis for a sub-sample of counties.  Generally the relationships hold up, but are weaker (due to lower N in more rural counties).  First, the county I live in, Johnson County:

Next the county where I grew up, Saline County:

Next a more rural county, Lincoln County:

And finally, a VERY rural county, Gove County (only 2600 people live here):


A few easy bullet points for a conclusion:
  • Gender ratios vary geographically, sometimes in very significant ways.
  • At least part of these variations appear to be systematic, and correlated to other variables.
  • We know at least some of the factors that determine gender ratio by county, however the global model isn't extremely predictive: likely many local factors at play.
  • My friend (from the initial analysis) should spend her time in rural areas, with young populations, vacant housing, and few renters.

Friday, July 10, 2015

African American Clusters in Rural America: Prisons

In yesterday's post looking at gender ratios, I noted the weird highly "male" clusters in Western Kansas. Per the post, those clusters turned out to be mainly explained by prisons, as well as labor-based immigration.  

I thought about this a little last night.  If prisons *move* people demographically (realizing they don't impact actual community dynamics due to incarceration) what other demographic shifts could occur due to the location of prison populations versus actual population?  


To think about how prisons may shift demographics, I needed to think about how prison populations are different from the general population. More men are incarcerated than women, obviously, but I also knew that more African Americans are incarcerated. This could be exacerbated in rural communities, due to the homogeneously white populations in these areas.  The question came to if the skew was large enough in Kansas to impact demographics. I found my answer here, with this fact:

African Americans are over represented by a factor of about five in the Kansas prison system versus the general population.

I plotted the data on the map, and then focused on the area of northwest Kansas as defined by the area west of Salina, and generally north of Great Bend.  This area is interesting to me because both my mother and my father came from here and most of my family still lives in Mitchell and Osborne counties.

Here is what a map of counties by % African American in Northwest Kansas looks like:

So, a lot of red, but three counties with greater than 1% African American populations.  Explanations?

  • Ellsworth County:  Most of the African Americans appear to be associated with the prison.
  • Norton County: Most of the African Americans appear to be associated with the prison.
  • Graham County: Actually a cool story here, about ex-slaves forming a new town on the Western Kansas high prairie.  Not appropriate for this blog but read about it here.
If you prefer data in a table format, here's an easier look with a couple of other data points.  There is also a fairly large African American population in Ellis County, largely associated with the college (Fort Hays State University).  Still, in aggregate that county is less than 1% African American.

After some cross referencing with Kansas Prison records, census data, and a few calculations, it appears that there are about 414 African Americans in prison in northwest Kansas.  Which leads to a startling statistic:

37% (more than a third) of African Americans living in northwest Kansas are prisoners. 
I realize that this doesn't mean that 37% of African Americans living in the area go to jail, but instead there are two underlying mechanisms that end in this demographic distortion:

  • African Americans are incarcerated at a rate higher than whites.
  • Prisoners are shipped from more African American heavy parts of the State, to the very white northwest Kansas. 


I don't do a lot of commentary on this blog, but I will here, as northwest Kansas is sort of my "homeland."  Two final scenarios to think about:

  • Imagine being an African American from eastern Kansas, incarcerated and moved west into a town, that's 99% white, but your fellow inmates are disproportionately black.  The optics of that, combined with the cultural history of our country create a disturbing situation.

  • Imagine being a young white child growing up in a county with a prison.  Sure, you see African Americans in media, but the only ones in your community are prisoners or former prisoners.  How might that impact the development of your perceptions of the world?

Thursday, July 9, 2015

Analytics for Dating: Local Gender Ratios?

The other day I heard a friend in her mid-30's say: "I used to be on, but they were always connecting me to guys in their 50's."  I thought this was weird, did Match have some kind of built-in "sugar-daddy" bias?  Maybe.  I remember reading the book "Dataclysm" last year, which pointed towards that type of behavior from guys on dating sites.  But could something else in the data (demographics) be to blame?

Today a light-bulb went off,  I had recently seen some weird age trending data for Kansas showing a proportional dropoff in people in their 30's living in Kansas.  Were the two related?


So in a quick conversation on twitter yesterday, I was led in the direction of how Kansas may be demographically different than the US in general.  I ran a few numbers, but one *anomaly* stuck out to me: Kansas has proportionately fewer people between 34 and 47 than one would expect based on US population data.  This double axis graph lays it out fairly well:

Could this be the root cause of my friend's issue, just more people in their 50's than 30's?  I thought maybe, but the effect size is fairly small (about a 5% differential).

But what if the age anomaly was somehow gender driven?  I broke down the numbers by gender.  No huge difference, though another trend emerged: the gender ratio (ratio of males to females) is higher in Kansas in young adulthood than in other parts of the US.  And... down the rabbit hole I went.


Quick background on gender ratio (or human sex ratio, yes I know the difference, but one matters to initial biology, the other to dating, so, we have to consider each).  The ratio varies over different human populations, but generally is about 1.06 at birth (more males than females).

The ratio steadily declines with age as men die faster due to biology and doing dumb things (shout out to my wife, yes I'll go to the doctor this decade).  Sometime in early adulthood the ratio approaches 1.0, and dips down significantly beyond that.  By 60 the ratio has fallen to 0.93, by age 80 it is around 0.7.

Although the ratio starts well above 1.0, the early deaths of males mean that population averages are generally less than 1.0.  For instance the ratios for Kansas and the United States as a whole are .984 and .967 respectively.  By that measure, there are more dudes proportionally in Kansas, so my friend should have no problems finding a guy?  (cue Levi getting slapped)
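
For anyone who wants to reproduce this kind of number, the ratio is just males divided by females, computed per age band and then overall. This is a minimal sketch: the counts below are made up (chosen only so the per-band ratios match the 1.06, 0.93, and 0.7 figures quoted above), and the function name is my own.

```python
def sex_ratio(males: float, females: float) -> float:
    """Return the number of males per female."""
    if females <= 0:
        raise ValueError("females must be positive")
    return males / females

# Hypothetical (males, females) counts per age band, NOT real census data.
counts_by_age = {
    "at birth": (106, 100),
    "age 60": (93, 100),
    "age 80": (70, 100),
}

ratios_by_age = {age: sex_ratio(m, f) for age, (m, f) in counts_by_age.items()}

# The overall ratio pools the counts; it is not the average of the band ratios.
total_males = sum(m for m, _ in counts_by_age.values())
total_females = sum(f for _, f in counts_by_age.values())
overall_ratio = sex_ratio(total_males, total_females)

for age, ratio in ratios_by_age.items():
    print(f"{age}: {ratio:.2f}")
print(f"overall: {overall_ratio:.3f}")
```

With real data you would swap in actual counts by age and geography; the pooled ratio falls below 1.0 whenever the older, more female bands outweigh the younger ones.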


So what does this all mean?  If gender ratios vary significantly by geography, then what are some potential research questions and implications?

  • What underlying factors drive gender ratio trends?
  • What makes communities more female friendly versus male friendly?
  • Because gender ratio varies by age, are "older" areas inherently more female?
  • A dating implication? : Go to this high-ratio county to meet men tonight!!!
Out of curiosity, I mapped gender ratios by Kansas counties.  To my moderate surprise, there was quite a bit of variation across the state.  Specifically, there were some clear heavily-male outlier counties.  Here's the map by county.

I wrote a quick email to my friend, suggesting that she go live in one of the three heavily-male blue counties (Norton, Pawnee, Ellsworth).  Wait. Those outliers are actually unrealistically high. What is happening?

Prisons.  Those three rural counties are home to large male correctional facilities.  Correction email to friend: you should move to one of those three counties if you're into the dating prisoners thing.
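
I spotted these outliers by eye on the map, but the same idea can be sketched in code. The version below flags counties whose ratio sits far above the median, using the median absolute deviation as a yardstick. To be clear, this is one common approach and not necessarily what I did; the county names and ratios are placeholders, and the cutoff multiplier `k` is an arbitrary assumption.

```python
from statistics import median

def flag_high_ratio(ratios: dict, k: float = 3.0) -> list:
    """Return names whose ratio exceeds median + k * scaled MAD."""
    values = list(ratios.values())
    med = median(values)
    # Median absolute deviation, scaled by 1.4826 so it is comparable
    # to a standard deviation for roughly normal data.
    mad = median([abs(v - med) for v in values])
    cutoff = med + k * 1.4826 * mad
    return sorted(name for name, v in ratios.items() if v > cutoff)

# Placeholder ratios (males per female), NOT the real county figures.
example_ratios = {
    "County A": 0.97,
    "County B": 0.99,
    "County C": 1.01,
    "County D": 1.00,
    "County E": 0.98,
    "County F": 1.45,  # stands in for a county with a large male prison
}

print(flag_high_ratio(example_ratios))
```

A flag like this only says "look closer"; it took local knowledge (prisons, a military base) to explain what was actually going on.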

On to the next tier of outlier male counties: 
  • Leavenworth: Prison.
  • Riley: Military Base.
  • Ottawa: Unsure, though I believe they have a jail that houses out-of-county inmates.
  • Two counties in far Southwest KS: Unsure, but they are surrounded by a statistically significant male leaning block.
For our out-of-state readers, Southwest Kansas has been defined by a massive influx of Central American immigrants over the past few decades.  These immigrants generally move here looking for work in meat packing and other agriculturally centered businesses.  Looking at the data, it appears that this movement for jobs in southwest Kansas has also impacted the gender ratio.

Here's a map of Kansas counties by % Hispanic.  Note that the highly Hispanic area in southwest Kansas correlates nearly perfectly to the gender ratio outliers in the same area.
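
The "correlates nearly perfectly" claim is from comparing the two maps visually. If you wanted to put a number on it, a Pearson correlation between county % Hispanic and county gender ratio would be a reasonable first pass. A minimal sketch follows; the five data points are invented for illustration and are not the actual county values.

```python
from math import sqrt

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented county-level values, NOT real data.
pct_hispanic = [5, 10, 30, 50, 60]
gender_ratio = [0.97, 0.99, 1.02, 1.05, 1.08]

r = pearson(pct_hispanic, gender_ratio)
print(f"r = {r:.3f}")
```

Keep in mind that a high correlation across counties would not by itself show that immigration drives the ratio; it would just be consistent with the labor-demand story.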


I started out trying to solve an anomaly on a dating site for a friend, and ended up looking at a lot of different issues.  

From a women seeking men perspective: I can say if women want to put themselves in areas with abundant men, then look at areas with high "traditionally male" labor demand.  Also prisons, if you're into that kind of thing.  I have a lower level (within Johnson County) analysis I will post tomorrow, that's a bit more informative, useful, and interesting.

Why are there few 30-somethings in Kansas?  No great hypothesis here, but a few that are testable.

  • May not be a "dip" at 34, but instead two ongoing factors.  An aging native-Kansas population causing a hump to the right side of the distribution, combined with 20-somethings moving into the state to find work (hence male skew), such as Fort Riley and southwest Kansas. This would cause an apparent lack of 30-somethings, but is actually just an abundance in the high and low age groups.
  • Liberals will tell you that 30-somethings are leaving the state in droves to escape the Brownback administration and go to economically better states.  I will say that I've had several friends leave Kansas for Colorado in the past five years, so this is anecdotally possible.
  • Birth rate issues: haven't really explored this at all, but it could have shifted over that time period.

Tuesday, July 7, 2015

Tuesday Jams: Judgment Night Soundtrack

It's been a while since I've written about music on my data blog, but I've been listening to an album quite a bit lately, so I thought I would write this up.  The album is the Judgment Night Soundtrack (side note: I've never seen the movie Judgment Night, and have heard it was horrible).

If you remember the late 90's/early 2000's you probably remember a lot of really bad rap metal and nu-metal.  This album may be somewhat to blame for that.  Released in 1993, the album was a simple formula: match up a rapper/rap group with a rock band and have them do a song together.  Interesting concept, and fairly new at the time, save for an Aerosmith/Run DMC collaboration, and a few others.  I've seen many nu-metal bands say this album was inspiration for the crap they turned out.

Though, the results of this album are generally much better than the late 90's rap metal, and it's fun music to relax to.  A few highlights, here's a live performance of one of the songs on the Arsenio Hall show:

And here is the full album:

Monday, July 6, 2015

Updates On Prior Analysis and Future Possibilities

It's been almost a week since I last blogged, largely due to busy life stuff and a touch of writer's (analyzer's?) block.  So, I thought I would give some updates on prior analysis, as well as what I have planned for this week.


It's been about a week since my fitness week series, and I haven't really been paying too much attention to these numbers.  A couple of notes though:

  • I have finally set a hard weekly goal of 150,000 steps a week.  Looking at history, this seemed to be both an attainable and a sufficiently difficult goal, one that would have me averaging more than 21,000 steps each day.
  • Weekend targeting has become "worse." If you look at my original post on targeting psychological thresholds, you can see a fairly pronounced pattern.  In my more recent data the pattern has become even more pronounced, especially at the 30K mark.  (side note: I don't think this is necessarily bad, just shows me setting hard targets and going after them)
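
The math behind the weekly goal is simple enough to sketch. The helper below also works out what is still needed per day partway through a week; the function names and the mid-week example numbers are hypothetical.

```python
WEEKLY_GOAL = 150_000  # steps per week, the goal stated above

def required_daily_average(goal: int = WEEKLY_GOAL, days: int = 7) -> float:
    """Average steps per day needed to hit the weekly goal."""
    return goal / days

def remaining_per_day(steps_so_far: int, days_left: int,
                      goal: int = WEEKLY_GOAL) -> float:
    """Steps per remaining day needed to still reach the goal."""
    if days_left <= 0:
        raise ValueError("days_left must be positive")
    return max(goal - steps_so_far, 0) / days_left

print(f"{required_daily_average():,.0f} steps per day")
# Hypothetical mid-week check: 80,000 steps done with 3 days left.
print(f"{remaining_per_day(80_000, 3):,.0f} steps per remaining day")
```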


A few things have happened in this space since I last blogged about it.  First, the Kansas court system has ruled that Kansas doesn't spend enough money on education.  A lot of fallout with that, but it essentially means that Kansas may need to again increase taxes in the next few months, leading to more tax debate, and more distributional impacts as we saw from my prior analysis.   A lot of potential for interesting debates and analysis in Kansas tax policy.

Speaking of that area, one of my posts receiving the most input over the last week has been my look at the idea of driving to Missouri to avoid the sales tax.  I have also seen a few news articles and other analyses on this subject over the last few weeks.  Unfortunately many of these "analysts" don't look at the costs and opportunity costs of the drive, and thus simply see a lower rate as an opportunity for cost savings.

These types of analyses have the potential to be materially harmful, especially to the lowest income individuals.  First, they are most likely to impact the behavior of lowest income individuals (unlikely to "run" their own numbers; more cost sensitive) by making the drive seem to make sense.  Second, the driving has a disparately higher impact on these individuals due to higher effective costs (wear on older cars, lower gas mileage, less "slack" time, tighter budgets).
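
To make the cost side concrete, here is a minimal break-even sketch: the tax saved on a basket has to exceed the cost of the drive plus the value of the time spent. Every number below (tax rates, mileage cost, trip length, hourly value of time) is a placeholder I picked for illustration, not the actual Kansas or Missouri rate or a figure from my earlier post.

```python
def trip_cost(round_trip_miles: float, cost_per_mile: float,
              hours: float, value_of_time: float) -> float:
    """Out-of-pocket driving cost plus the opportunity cost of time."""
    return round_trip_miles * cost_per_mile + hours * value_of_time

def break_even_basket(home_rate: float, away_rate: float, cost: float) -> float:
    """Pre-tax purchase size at which tax savings equal the trip cost."""
    rate_gap = home_rate - away_rate
    if rate_gap <= 0:
        raise ValueError("no tax savings if the away rate is not lower")
    return cost / rate_gap

# Placeholder inputs, NOT real rates or measured costs.
cost = trip_cost(round_trip_miles=20, cost_per_mile=0.50,
                 hours=0.75, value_of_time=10.0)
basket = break_even_basket(home_rate=0.09, away_rate=0.07, cost=cost)

print(f"Trip cost: ${cost:.2f}")
print(f"Break-even purchase: ${basket:,.2f}")
```

Raising the per-mile cost (older car, worse mileage) or shrinking the basket (tighter budget) pushes the trip further from breaking even, which is the disparate-impact point above.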


No one has invited me to play in any fantasy leagues this year (not even my league from last year; I may be "uninvited" for using the PED of statistical modeling in a fantasy league).  So a few things on that:
  • I'll be anxiously awaiting any fantasy league invitations.
  • My earlier model that created team level projections also created game-level projections, so I'll be tracking my model performance during this entire football season, to see how well it does.
  • I have also created a continuous model, that will allow me to make in-season weekly game projections using good models.  Can I beat the sportswriters?
  • I may create player fantasy ratings before fantasy drafts occur.  Not sure I will find time and data to do this, but it can at least be a goal at this point.
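
For tracking how well the game-level projections do over the season, two simple scorecards are pick accuracy and the Brier score (mean squared error of the win probabilities). This is a sketch of how that tracking could look, not my actual model code; the probabilities and outcomes are invented.

```python
def pick_accuracy(probs: list, outcomes: list) -> float:
    """Share of games where the favored side (p >= 0.5) matched the result."""
    hits = sum((p >= 0.5) == bool(o) for p, o in zip(probs, outcomes))
    return hits / len(probs)

def brier_score(probs: list, outcomes: list) -> float:
    """Mean squared gap between predicted win probability and result (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Invented predictions: probability the home team wins, and what happened.
predicted = [0.8, 0.6, 0.3]
home_won = [1, 0, 0]

print(f"accuracy: {pick_accuracy(predicted, home_won):.2f}")
print(f"Brier:    {brier_score(predicted, home_won):.3f}")
```

Lower Brier is better, and always guessing 50/50 scores 0.25, so that is the bar a model (or a sportswriter) needs to clear.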


So, there's your update.  Hopefully I'll find time to blog about some of these things later in the week!