Wednesday, December 30, 2015

How We Injure Ourselves By Age

Yesterday I found this article that looks at wall punching by age, gender, and various other dimensions.  I found the article interesting (wall punching seems to correlate with rates of change in male testosterone), but what was more interesting was the underlying dataset, the NEISS.

The NEISS is a dataset of injuries reported by hospitals, with details on the injured person, the objects involved in the injury, and the nature of the act that caused it.  I downloaded the data and immediately found it to be interesting and quite rich.

I had a few questions about the original article's wall punching analysis, specifically: did women see an uptick in wall punching in their mid-teens, or was that trend limited to men?  I replicated the analysis below; while both men and women see an increase in wall punching in their mid-teens, the women's increase is less pronounced.

I also noticed a lot of people in the dataset were being injured by their toilets, largely through falling while on their toilet.  It would be amusing to create the same chart as above, but for toilet injuries.

That's fascinating, it appears to be an inversion of the other distribution.  This makes sense though as toilet injuries seem to be related to older people falling on the toilet.  But why more women than men?  Because there are simply more old women.  But for a moment, let's ignore gender and plot both toilet and wall injuries on the same chart:

That's interesting because it shows that the injury curves are inversions of each other, but the number of injuries from toilets never quite reaches that from wall punching.... or does it?  The problem is that there are far more people per year of age in their teens and twenties than in their seventies and eighties.  What if we control for the population age distribution and turn this into an annual risk rate analysis:

This final chart shows two things.  First, the annual risk rate for people in their eighties and nineties due to toilets is far higher than the annual risk rate for younger people punching walls.  Second, there's a point, sometime in your forties, when your risk from falling on the toilet becomes higher than your risk from punching things... a true sign of maturity.
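The adjustment behind this chart is just injuries divided by population at each age.  As a sketch (in Python here rather than the R used for the charts, and with made-up counts standing in for the real NEISS and census figures):

```python
# Convert raw injury counts into an annual risk rate per 100,000 people
# by dividing each age group's count by its population.
# All numbers below are illustrative placeholders, not real data.

injuries_by_age = {25: 1200, 45: 400, 85: 900}                    # hypothetical counts
population_by_age = {25: 4_300_000, 45: 4_100_000, 85: 700_000}   # hypothetical population

risk_per_100k = {
    age: injuries_by_age[age] / population_by_age[age] * 100_000
    for age in injuries_by_age
}
```

With real data, `injuries_by_age` would come from the NEISS file and `population_by_age` from census age-distribution estimates.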

Update 2015-12-31 8:42AM

Someone disagreed with my conclusion that the difference between female and male toilet injuries was due to demographic lifespan issues (women live longer, so there are more old women).  The contention was that differences in how men and women use toilets cause the gap, and they appear to be correct.  While lifespan still accounts for about half the gender variance in total toilet injuries among the elderly, women still have a higher overall risk rate.  Here's a chart controlling for gender population differences in older Americans.

Tuesday, December 29, 2015

Year End Summary and Good News

I haven't blogged as much as normal recently, but we'll blame that on the holidays and end of year, or on a bit of good news I'll get to in a bit.  But first, some commentary on this blog.


I started this blog in December 2014 as an outlet for creativity and analysis, and to get me out of my day-to-day life.  In those regards, the blog has been an outstanding success.  Unexpectedly, the blog has turned into a resume builder too.  In interviews it can be difficult for an analyst to demonstrate their true skillset; this blog gives me a portfolio of work.  In fact, I had a couple of interviews this year where I could use this blog as part of the interview process, and received positive feedback from managers.

That's great but boring, and everyone loves lists, so how about our final top five posts of 2015?
  1. Daraprim: Price increase or Leveraged Financial System - I talk about the actions of the person who many have called the "most hated man of 2015"
  2. My Data Science Toolkit - A list of my five favorite Data Science software tools.
  3. Peer Group Determination: Library Peer Groups  - A project I conducted with my wife to use a novel methodology to identify peer groups for public libraries. 
  4. Kansas Election Fraud - My first in what ended up being a seven-part series criticizing the work of statisticians who claim that statistical anomalies point to a rigged voting system.
  5. Kansas Election Fraud: Part 6 Sedgwick County Suburbs - A followup to the post above, where I dig into why the initial anomalies exist using GIS mapping technology.

In sum, this blog started as a small project for me where I planned on posting 2-5 times per month, but quickly morphed into a 130+ post blog with just over 100 readers per day.  I look forward to where 2016 will take this blog.


Remember when I mentioned that I had used this blog in a couple of job interviews this year?  One of those interviews turned into an exciting new position for me.  I will be starting a new job across the state line in Missouri on January 4th.  I am very excited and looking forward to new challenges and opportunities.  My last day at my current job will be December 31st, making this a new year and a new beginning.

Tuesday, December 22, 2015

First Look: Mass Shooting Data

I've wanted to take a deep dive into mass shooting data for quite a while, but I didn't want to do it in the heat of the moment following another mass shooting.  Over the next few days I am going to take a deep look at the mass shooting data we have available, what it means, and why the numbers differ between sources.


There are two main datasets with mass shooting data, the Mother Jones data and the Shooting Tracker data.  Here is a brief summary of each dataset:

  • Mother Jones Data: Mother Jones focuses on multiple-death, non-gang public mass shootings.  Essentially, the kind of things we see on the news.  
  • Shooting Tracker Data: Shooting tracker focuses on any event where multiple people are shot, a very basic definition of mass shootings.


For this post I created an initial comparison of the data, just to get a sense of the differences between the datasets.  The first issue is that the Shooting Tracker data only goes back three years, whereas Mother Jones goes back to 1982.  We can generally work around this, but any longitudinal analysis will have to be based on the Mother Jones data.

As one might expect, the Shooting Tracker data tells us the number of mass shootings in the United States is much higher than Mother Jones does.  In fact Shooting Tracker tells us that we average one mass shooting a day, whereas Mother Jones tells us we average one a quarter:

An additional difference is when mass shootings occur: Mother Jones shows most shootings occur during the week, whereas Shooting Tracker shows shootings occurring disproportionately on weekends.

There's even a disagreement on the seasonality of mass shootings.  Mother Jones mass shootings are scattered fairly evenly throughout the year, whereas Shooting Tracker shows a strong summer bias.


This is just a first look at mass shooting statistics, but it already shows an important divergence in the way we talk about mass shootings.  I will dig into these datasets more in the next few days, attempting to understand the following:
  • Why are the datasets so different?
  • What would make a researcher choose one data set or another?
  • Which dataset is a more accurate presentation of "risk"?

Thursday, December 17, 2015

Martin Shkreli Dropped Some Hints

Today I woke up to what most of the world considered good news: "Pharma CEO" and expensive-music buyer Martin Shkreli had been arrested.  This post is going to involve a lot of embedded tweets, so let's get started.  I actually learned about the arrest from Ben Casselman's epic tweet:


It appears that the arrest is on charges related to prior securities fraud, not his current venture; what initially brought him press (the pricing of Daraprim) is not related to his current legal issues.  I have written about Shkreli's Daraprim pricing once before on this blog, specifically pointing out... well, this:
The CEO's rhetoric tells us that he is leveraging higher pricing against the financial-insurance system, and effectively betting on the ability to extract large mid-term profits from it.  The insurance system, as it exists,  enables this type of cost increase by giving *ordinary* people *extraordinary* ability to pay for effectively one-time services.
Actually, I pointed out three things in my blog:
  1. While it's fathomable that the drug was under-priced to the point of not being profitable, the price shock he used was likely exorbitant. 
  2. His claims of using pricing to create capital for future research is likely just "CEO BS."
  3. He's just leveraging against incentives in the current insurance system.


Shkreli had alluded to using the insurance system to leverage his profits, but never came out and said it.  Yesterday, he did.  It started with this nice-sounding tweet: no one ever pays more than $10 out of pocket!

And he continues to say nice sounding things like this:
But the best tweet of the day from Shkreli (where he actually admitted to the concept of my prior blog post) was this one:

Effectively, Shkreli is outright saying: let me charge the insurance system huge prices, or I'll just give the drug away for free.  The full details of why this happens relate to the diffuse impacts of the insurance system, this being a low-use drug, and what happens when you artificially give people more "ability to pay" for something.  I went into greater detail in my prior post.


This situation gives me mixed feelings.  I'm not really a supporter of single-payer health insurance, but Shkreli may be a good argument for it.  At least in this sense: as long as the current quasi-governmental insurance incentive system exists, "bad actors" like Shkreli will have incentives to push the system for personal profit.  Even if not every pharma CEO is like Shkreli, this will likely lead to many small price increases due to relative consumer price insensitivity.  Shkreli could be the initial call for a reformed healthcare system.

Friday, December 11, 2015

Creating Better GIS Maps in QGIS

Quite often on this blog I use GIS mapping, especially maps of Kansas, to make a point.  Generally when I am making a map I use a simple block colored map, like this one of counties:

That map is great, especially for Kansas policy wonks, or people who can identify areas of Kansas without greater context.  But what if we want more contextual data on our map, like roads, city names, and topographical features?  This is very easy in QGIS, using the following steps:

  1. Install the QGIS OpenLayers plugin using the Plugins Menu.
  2. From the Web menu, choose the OpenLayers plugin.
  3. Choose which layer you want to add to your map (I have best luck with the OpenStreetMap layer).
  4. Move the OpenStreetMap layer to be the bottom layer.
  5. If your top layer (what you are analyzing) is a polygon, it will likely cover your context layer.  Change this top layer to be partially transparent using this dialogue.  

And here's our output (same map as we were looking at before):

This gives even more context on close views, like this one:

And is even more helpful for close-up in-town views (like a precinct map of Northern Johnson County Kansas):

Wednesday, December 9, 2015

County Level Unemployment and Determining Cause from Correlation

(A bit of data/mapping ADHD, we're going to go through a lot of maps, very fast!)

Someone pointed out to me that the Kansas side of the KC metro area currently has its lowest unemployment rate in 15 years.  Though this is interesting and positive, it started me down a path that led to far too many maps, and a tie-in to potentially spurious voting correlations.


If you have spent much time on the Kansas side of the Kansas City metro, you know it's a tale of two counties: white, affluent Johnson County and mixed, poor Wyandotte County.  Here's how the two counties compare on selected metrics:

Huge differences between these adjacent counties, but I wondered how that played out in terms of unemployment rate (disclaimer: the unemployment rate isn't a great metric for a variety of reasons, but it works in this scenario).  Luckily, the Bureau of Labor Statistics offers county-level unemployment statistics.  Here's a map of unemployment rates I created for northeast Kansas (2014, annualized):

This is close to what might be expected: Johnson County has the lowest rate in northeast Kansas, Wyandotte County has the highest, and the margin between them (3.1%) is striking.  But while I have this data on my desktop, why don't we look at the entire state?

Three things struck me about this map:
  • There's a huge urban-rural divide (not surprising, very rural, agricultural areas can have very low unemployment rates for multiple reasons).
  • Southeast Kansas is a rural area with a high unemployment rate.  This is also not surprising, as this has been a high-poverty rural area for the past few decades.
  • This looks a little like another map I made.  Specifically this one: 

What is that map?  A map of percent of voters voting for Sam Brownback by county.  Interesting.  I wondered if there might be a correlation.  Two charts proved that there is a fairly significant correlation.

A significant correlation!  Low unemployment led to Brownback's win in Kansas!  ...probably not.

People say "correlation is not causation" so often that it annoys me.  But this is a great case for explanation.  My thoughts:
  • A priori: One way people probe a correlation/causation question is by looking for an a priori or functional theory.  This roughly means asking whether we can develop a reasonable theory for the causal mechanism underlying the correlation.  In this case, we have a pretty clear (and compelling) a priori theory: people in low-unemployment counties view the economy as performing well and tend to prefer the status quo (incumbent candidates).
  • Covariates: A "gotcha" in correlation/causation reasoning is outside factors simultaneously causing or impacting both correlated variables (this is likely what is occurring here):
    • Other variables: The counties with low unemployment have other things in common (for example: agricultural, more conservative, whiter, more rural).  All of these things lead to supporting more conservative candidates, independent of the county-level unemployment rate.
    • Pre-existing preferences: The counties with lower unemployment rates voted for Brownback in 2010, before he had any impact on those rates and before an incumbency bias would have been established.
There's another factor at hand though, which *could* have an effect.  Areas with lower unemployment rates could be more conservative generally, due to rational benefits.  The argument here is that low unemployment leads to lower political demand for social services, which are generally considered a liberal policy.  In this case unemployment rates could be at play, but more broadly as a general indicator of well being, and not a preference for an individual candidate.


Obviously this post has been a bit ADHD, but a few takeaways:
  • The Kansas side of the KC metro region may have historically low unemployment rates, but that is in no way homogeneous across the region.
  • There is a significant correlation between unemployment rate and propensity to vote for Sam Brownback.  Brownback won counties with < 3% unemployment with nearly 70% of the vote.
  • It's unlikely that low county unemployment rates are directly responsible for Brownback's support; instead, counties with lower demand for social services may have more conservative preferences.

And a few more maps, unemployment rates over time:



2010 (height of recession)

Monday, December 7, 2015

Obama, Gun Control, and ISIL

Late last week, the White House announced that President Obama would address the country Sunday night regarding the recent San Bernardino shootings.  There was quite a bit of speculation on what the speech would cover, but as more information was made available, it seemed the speech would cover the following three items:
  • Sympathy for families.
  • Policy speech for containing ISIS.
  • Gun control speech.
Leading into the speech, I noticed the gun control portion seemed to get the most attention, but that turned out not to be the nature of the speech.


The speech itself was almost entirely focused on two issues: strategies currently in use or planned for the fight against ISIS (foreign/immigration policy), and a request for tolerance toward Muslim Americans.  Gun policy was mentioned, but was only a minor part of the speech; measured by either time or word count, the gun control portion represented about 7% of the total.  In fact, here is the entire portion of the speech related to gun control:
To begin with, Congress should act to make sure no one on a no-fly list is able to buy a gun. What could possibly be the argument for allowing a terrorist suspect to buy a semi-automatic weapon? This is a matter of national security.
We also need to make it harder for people to buy powerful assault weapons like the ones that were used in San Bernardino. I know there are some who reject any gun safety measures. But the fact is that our intelligence and law enforcement agencies -- no matter how effective they are -- cannot identify every would-be mass shooter, whether that individual is motivated by ISIL or some other hateful ideology. What we can do -- and must do -- is make it harder for them to kill.
Only a minority of Obama's speech was on gun control, but what about the reaction?


To compare the Obama speech to its reaction, we need to compare the terms used in the original speech with those used in the reaction (here measured by Twitter).  I imported the speech as a text document into R, then downloaded unique tweets that used the hashtag #Obamaspeech.  After a quick descriptive analysis and topic modeling, a couple of realizations:
  • A majority of the tweets are either supporting or mocking the President's speech, without reference to policy.
  • A minority of tweets actually deal with the real policy issues.
As the tweets are fairly topic-sparse, I focused on the most common terms used by the President and tweeters respectively.  The terms are shown after stemming (which reduces each word to its root) and stopword removal (which drops common English words).  Here's the list of the top twenty words for each set of text:

Most notably, the most common word used in tweets was "gun," a word not even found in the President's top 20 (and, as noted, only 7% of the speech in total).  Also of note: the American people generally use the term ISIS, whereas the President prefers ISIL.  Even Donald Trump weighed in on this one:
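The stemming and stopword preprocessing described above was done in R; a rough Python equivalent looks like the following.  This is only a sketch: the stemmer is a crude suffix-stripper, and `STOPWORDS` is a tiny illustrative subset of a real stopword list.

```python
from collections import Counter
import re

# Tiny illustrative stopword set (a real list has a few hundred entries).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "we"}

def crude_stem(word):
    """Very rough stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def top_terms(text, n=20):
    """Top n stemmed, stopword-filtered terms in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    stems = [crude_stem(w) for w in words if w not in STOPWORDS]
    return Counter(stems).most_common(n)
```

For example, `top_terms("guns and the gun debate: gunning for guns", 1)` collapses the "gun"/"guns" variants onto one stem.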

As much as I hate to admit it, people prefer word clouds to lists of words.  Here's a comparison between what the president said, and what tweeters were saying:



Interestingly, a speech in which the President focused on strategies for defeating ISIS and tolerance toward American Muslims led to a larger Twitter conversation about gun control.  Gun control is obviously a hot-button issue, and this is yet another way to measure that: Obama spent only 7% of his speech on gun control, yet "gun" became the most common word used on Twitter in response.  Also, Donald Trump was right about something: almost EVERYONE says ISIS instead of ISIL.

Tuesday, December 1, 2015

Revenue Estimates, Part 2

Another month of tax revenues released by the State of Kansas, another month of arguments on Twitter.  Beyond amusement at the arguments, is there a reason I post on this type of thing?  Yes: the arguments are common to other data problems, specifically in how people misunderstand, misstate, and mislead.

I've posted on this before, but the short of it is this: Kansas governor Sam Brownback reduced tax rates in an attempt to grow the economy, tax revenues declined (predictably), and the government has struggled to create a workable budget and satisfy the budgetary demands of schools and other government functions.  More recently, the initial Fiscal Year 2016 revenues were missing estimates, forcing a lower revision of the estimates in November 2015 (less money for government functions).  This of course intensified the political argument over the tax cuts. 

The argument between policy wonks really comes down to two political dimensions: size of government and economic effect of tax cuts.  Here are the general positions I've seen:

  • Modified Laffers: We can cut tax rates, spurring economic growth, and within a couple of years tax collections (read: the potential size of government) will be greater under the new lower rates than they would have been under the old higher rates (government stays the same size or grows).
  • Pure Laffers: We can cut tax rates, and cut the size of government to what we believe is a reasonable level.  This will spur economic growth, and thus an increase in tax revenues.  In turn we can again reduce rates in the future, spurring more economic growth.
  • Keynesians: Reducing tax rates won't necessarily improve the economy, especially when combined with short-term reductions in government spending.  The reduction could lead to a death spiral in which government is continually underfunded and spending is cut again and again.
  • General Cynics (OK, I only know one person like this): Tax cuts will not create growth or increase revenues, but we should cut taxes anyway because teachers and other government employees make too much money (read: the government is too big).
Anyways, that lays out the landscape, let's look at the arguments, largely the misconceptions and what is true.  I'll cover three main issues: actual revenue amounts, accuracy of estimates, and the long-term revenue growth.


One of the arguments made this year is that tax revenues are actually up.  This is true, but only for the last fiscal year (2015).  Revenues are down across the fiscal years since 2012.  In fact, in FY 2015 the State took in more than $400 million less than it did in 2012, a 7.6% drop.  Since 2008 (the last year prior to the recession) revenues have increased only 2% in total, while averaging 5.6% annual growth over the past 40 years (more on this later).

FYI, FY 2016 revenues will not exceed 2008 either, with the current revenue estimate at 6.1 billion dollars.


One of the criticisms leveled at the administration is that they continually miss revenue estimates, so the estimates must be bad.  I looked into this and found that by normal statistical measures (mean absolute deviation, mean squared error), the last three years have been very accurate.  In general, from 1976-2012, estimators missed by an absolute average of 5.2%, whereas in the last three years they missed by 1.1%, 4.9%, and 1.1%.
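The accuracy measure here, the average absolute percentage miss, is simple to compute.  A sketch with placeholder estimate/actual pairs, not the real Kansas figures:

```python
# Mean absolute percentage miss between revenue estimates and actuals.
# The dollar figures below are illustrative placeholders.

def mean_abs_pct_miss(estimates, actuals):
    """Average of |actual - estimate| / estimate, expressed as a percent."""
    misses = [abs(a - e) / e * 100 for e, a in zip(estimates, actuals)]
    return sum(misses) / len(misses)

# Three hypothetical fiscal years (in billions):
avg_miss = mean_abs_pct_miss([6.0, 6.2, 6.1], [5.93, 5.90, 6.03])
```

Note that this treats positive and negative misses symmetrically; as discussed below, the direction of the miss matters too.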

Great?  Actually, this is a big problem.  If we look at the past 25 years, the estimates are hedged low during non-recession years (actuals end up higher, creating slack).  That slack can be used as a buffer in recession years.  The problem: the last three years are the first time since 1988 that a revenue estimate has been missed in a non-recession year.  The chart below demonstrates that variance.  If there is no slack in non-recession budgets, a recession in the next few years could create a massive budgetary problem.


One of the questions I see a lot is about the longer-term trends.  Specifically, how much have revenues grown recently?  For this I looked at the last 40 years, 1976-2015.  The numbers have increased quite a bit, mostly for explainable reasons; here are the most important:
  • Population Growth: More people = more government spending.  Kansas's population has increased from about 2.3 million in 1976 to 2.9 million today.
  • Inflation: In common terms, things get more expensive over time, so government has to spend more on salaries, benefits, and goods in order to provide the same level of service.  Though it's not clear that CPI is a good estimator of inflation's impact on government expenditures.
  • Role of Government: Government provides more in terms of services than it has in prior periods of American history.
This is another place where metrics get tricky.  
  • If we look at raw numbers, revenues have increased by over 800% since 1976.  But this is a compound growth problem, so that doesn't mean 800% / 40 years = 20% growth a year.
  • Using a correct compound growth calculation, the growth is actually 5.6% a year.*
  • Compensating for population growth, the actual growth rate is 4.9% annually.  By way of comparison CPI (which may or may not be a good measure of inflation in this case) grew by 3.7% annually over this time period.
Quick point of reference: the correct equation for compound annualized growth rate is below.  I know this is an area many people struggle with mathematically, but it's important to realize the equation is exponential because you are effectively "growing prior growth":
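In code form, the calculation is one line.  Using the roughly ninefold (800%+) increase over the 40-year window described above:

```python
def cagr(start, end, years):
    """Compound annualized growth rate: (end/start)^(1/years) - 1."""
    return (end / start) ** (1 / years) - 1

# Revenues up over 800%, i.e. roughly 9x, over 40 years:
rate = cagr(1.0, 9.0, 40)   # about 0.056, matching the ~5.6% annual figure
```

The exponent is what makes naive "total growth divided by years" so misleading: each year's growth compounds on all the growth before it.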

I'll likely do more in the future to parse out government expansion versus inflationary increases, but this gives you an idea for now.  It appears revenue has grown 4.9% annually while inflation has grown 3.7%.  1ish% annual increase in size of government?  Maybe.  But there are a lot of other factors that could play into that number.   A couple of charts to inform the discussion.


A few points we can take away from this:
  • The tax cuts, as could be expected, had an immediate negative impact on tax revenues.
  • Tax revenue projections have been good from a statistical perspective, but bad from the positions that:
    • This is not a recession.
    • This is a negative miss.
    • This may mean that not enough slack exists to deal with future recessions.
  • Historically, per capita revenues have grown at approximately 4.9% a year, over a period that averaged 3.7% CPI growth.

KU Losses and a Simulation Engine

My mid-season bet that KU would go winless this year turned out to be true, so I probably need to post about how right I was.  Don't worry, this isn't a blog post of me simply gloating.  Well, the first part is, but I have built a helpful piece of code too, which we'll get to later.


So let's do a post mortem analysis of whether my predictions for KU Football were accurate.  What was the initial prediction I made?  

About a 50-50 chance of a winless season starting from the first week of the conference season (game four).

Some people might look at this prediction and say "50-50 means you don't know," which is kind of true.  I wasn't sure whether KU would go winless or not.  But in this case the absolute probability doesn't matter; the information gain does.  Information gain here means how much more we know about the probability of something over a "par" or average probability.

College football teams go winless about 0.2% of the time, making this a fairly rare event.  To be able to say that KU had a 50% chance of going winless means that they were effectively 250 times more likely to go winless than any random college football team, a huge gain in information from this prediction.

As an example, let's say that the daily probability that an elevator in your building would get stuck on a ride is 0.1%. However, I have modeled the performance of elevators in your building, and tell you that the elevator you're about to get on has a 50-50 chance of getting stuck.   I still "don't know" whether the elevator will get stuck,  but the model is actually quite useful because it provides a lot of information about this specific elevator ride over the normal 0.1% par probability. In essence, the 50% probability is not certain, but is still useful.
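One crude way to quantify this kind of information gain is the ratio of the model's probability to the par probability:

```python
# Lift of a model's probability over the baseline ("par") rate.

def lift_over_par(model_p, par_p):
    """How many times more likely the model says the event is vs. par."""
    return model_p / par_p

ku_lift = lift_over_par(0.50, 0.002)        # 50% vs the ~0.2% winless base rate
elevator_lift = lift_over_par(0.50, 0.001)  # 50% vs the 0.1% stuck-elevator rate
```

By this measure, the KU prediction carried about a 250x lift over par, and the elevator model about 500x, which is why a "50-50" call can still be very informative.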

On the other hand, an event predicted at the 50% level should come true about 50% of the time; how can we be sure it wasn't actually more likely than I originally predicted?  Without looking at my estimates over a series of seasons, there isn't a good way to determine the accuracy of the overall predictions.  Some cynics would have claimed prior to the season that KU would almost certainly (>90%?) go winless.  It's hard to falsify that statement now that the team has gone 0-12; however, there were a few games that were played close (SDSU, Texas Tech, TCU), and that tells me the team had a legitimate shot to win a game along the way.  Add to this the *probability* of when a team under a new coach will play well, and 50-50 still seems like it was a reasonable estimate (imagine if KU had played the way they did in the TCU game when they were playing Iowa State).



Now that I have that out of my system, how about some actual statistical work?  One piece of my toolset that I've had in SQL or Excel but never in R is a batch probability simulation engine.  The point of a simulation engine is to take a set of events with probabilities attached to them and simulate them thousands of times, to get a sense of how things might turn out together (e.g., likely outcomes for a season).  A concrete way of looking at this is letting Madden play 1,000,000 seasons (computer versus computer), then setting probabilities based on which team wins the Super Bowl most often.

To write a probability simulation engine you need a few general parts:
  • A set of input probabilities (e.g. a vector of probabilities of the team winning each of its games this season).
  • A matrix of random draws, with columns equal to the number of events (games) and rows equal to the number of simulations.
  • A win/loss classifier that compares each random draw to the corresponding event probability.
  • A summarizer to tally total wins and losses and the resulting season outcomes.

My code to do this is below.  There are actually some different pieces you can add in, for instance bootstrap modifiers that account for dependencies between events, and other modifiers to run many teams at once.  I'll work on that later.
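For readers who don't use R, the same four-part engine can be sketched in Python.  This version loops over simulations rather than building the full matrix, but the classifier-and-summarizer logic is the same; the win probabilities in the example are hypothetical, not the actual KU estimates:

```python
import random

def simulate_seasons(win_probs, n_sims=100_000, seed=42):
    """Simulate n_sims seasons for one team.

    win_probs: probability of winning each game (one entry per game).
    Returns a dict mapping total wins -> fraction of simulated seasons.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_sims):
        # Classifier: a fresh uniform draw below the game's win probability
        # counts as a win for that game.
        wins = sum(rng.random() < p for p in win_probs)
        counts[wins] = counts.get(wins, 0) + 1
    # Summarizer: convert counts into a win-total distribution.
    return {w: c / n_sims for w, c in sorted(counts.items())}

# Hypothetical nine-game conference slate for a weak team:
dist = simulate_seasons([0.10, 0.05, 0.15, 0.05, 0.10, 0.05, 0.20, 0.05, 0.10])
```

With these made-up probabilities, a bit over 40% of simulated seasons come out winless, and `dist` gives the full distribution of win totals.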

How does this actually work?  I simulated KU's season 1 million times (it only takes about 2 seconds) and summarized the results.  Here's how the seasons shook out in terms of number of wins (including some higher-probability wins, e.g. SDSU):

That's a bit depressing.  Even including the "easier" non-conference season KU would go winless 37% of the time.  KU would become bowl eligible (wins>=6) once in every 10,000 seasons.

Here’s a look at just the conference season.  Over 50% of the time, KU won zero games.

How bad was KU compared to a "par" team?  I made a par power-conference team with three non-conference games at a .80 probability of winning, and a .50 probability of winning each conference game.  Here's what that looks like.

And here’s the R code that got me here (this is really simple, but I will expand on it to handle multiple scenarios, and simulate full leagues).


Just a couple of points to cap this one off:
  • I was generally correct about KU's chances of winning a game this season (gloating).
  • It's fairly straightforward, after creating probabilities of winning each game, to simulate how teams may actually perform during the year.
  • With KU's current performance, they will go winless one out of every three years, and go to a bowl once every 10,000 seasons.

Tuesday, November 24, 2015

The Different Ways We Talk About Candidates

A couple of months ago I found a website with extremely rich data, an event that usually makes me very happy.  This website didn't have that effect on me.  I was trying to figure out the weight of a specific baseball player and stumbled upon a database of detailed celebrity body measurements (all women, of course), found here.  Later I found that the data included political candidates, which raised a question in my mind about the different ways we talk about men and women in politics.

Simultaneously, I was looking for a way to measure the presence of certain ideas across the internet.  I can already measure sentiments and topics on twitter, but Twitter is only a portion of the internet, and most people access the internet through Google search when seeking out new information.  Could I write code that would start my text mining operations through Google Search?


(NON-Nerds Skip this)

I had a social question (how we talk about candidates based on gender) and a coding/statistical concept to test (mining Google search results).  I went forward with a formalized test plan:
  • I would use the Google Search API to pull results for "Candidate's Name" + body measurements.
  • I would capture the data and turn it into mine-able text.
  • I would compare the top words across candidates.  (Note: rate limits on the Google API, as well as some Google restrictions, slow me down; in the future I may apply more sophisticated text mining techniques.)
I wrote some code to pull the Google search results.  The Google API only returns four results per request, so I wrote a loop to pull them four at a time.  Here's what that looks like (building step by step for ease of understanding):
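The code itself isn't reproduced in this copy. Here is a rough sketch of the batching loop, in Python rather than the author's R, written against the old (now-deprecated) Google AJAX Search API, which returned four results per request via a `start` offset; the URL and response field names are my assumptions, not the author's code:

```python
import json
import urllib.parse
import urllib.request

def fetch_page(query, start):
    """One request to the deprecated Google AJAX Search API (four results per call)."""
    url = ("https://ajax.googleapis.com/ajax/services/search/web?v=1.0"
           "&q=" + urllib.parse.quote(query) + "&start=" + str(start))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Keep title + snippet text for each result, for later text mining
    return [r["titleNoFormatting"] + " " + r["content"]
            for r in data["responseData"]["results"]]

def search_results(query, n_pages=11, fetcher=fetch_page):
    """Pull n_pages batches of four results and flatten them into one list of snippets."""
    texts = []
    for page in range(n_pages):
        texts.extend(fetcher(query, start=page * 4))
    return texts
```

Eleven pages of four results gives the 44 results per candidate mentioned below; injecting `fetcher` keeps the loop testable without hitting the long-dead endpoint.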


So what are the results of googling Candidate Names + Body Measurements?  I googled four candidates, two men, two women.  My observations:
  • Men: The men's results were generally about the campaign, with each returning a few references to BMI (Body Mass Index).
  • Women: The women's results were heavily focused on the size of their bodies.  In fact, the top four words for each woman were the same: size, weight, height, and bra.  

This table shows the top 10 words returned for each candidate.  This is obviously a small sample (four candidates, only the top 44 Google results for each) but is interesting nonetheless.  
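The top-word comparison itself is a straightforward count. A sketch of that step (in Python rather than the author's R; the tokenizer and the small stopword list are illustrative assumptions):

```python
import re
from collections import Counter

# A small illustrative stopword list (a real analysis would use a fuller one)
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on",
             "her", "his"}

def top_words(texts, n=10):
    """Tokenize result snippets, drop stopwords, and return the n most common words."""
    tokens = [w for text in texts
              for w in re.findall(r"[a-z']+", text.lower())
              if w not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(n)]
```

Run per candidate on the 44 snippets each, this yields the per-candidate top-10 lists the table summarizes.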

And because I know everyone likes wordclouds (sigh), I created wordclouds for each candidate at the bottom of this post, below the conclusion.


Some final takeaways from this analysis:
  • It's definitely possible to text mine Google results in order to measure how prevalent an idea is on the internet.  I need to refine my methodology and implement more sophisticated techniques in the future, but the basic scraping method works.  
  • There exists relatively little information on the internet regarding the body measurements of male candidates.  And I really wanted to know Ben Carson's waist to hip ratio!
  • Female candidates are talked about online far more in terms of their bodies.  I'm not an expert in feminist discourse analysis, or even really qualified to give an opinion here, but I have certainly measured a difference in the way candidates are talked about online.




Friday, November 20, 2015

Corrected Polling Numbers

A few weeks ago I posted a fairly hefty critique of a survey conducted by Fort Hays State University researchers on the political climate in Kansas.  The survey claimed a lot of things, but the issue receiving the most press was that Kansas Governor Brownback had an 18% approval rate.  I took issue with that number for various reasons, largely due to demographic skews in the data, hinting at sampling or response bias.


Sometime later a Twitter user asked me: if not 18%, what did I think Brownback's approval rating really was?  I looked again at the skews, did some quick math adjusting for prior demographic distributions and likely errors, and came up with a range.  This was really just me trying to back into a number from bad polling data.  Here's my response on Twitter:
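The adjustment described here is essentially post-stratification: take each demographic group's approval rate and reweight it from the sample's skewed mix to the population's known mix. A minimal sketch with hypothetical numbers (illustration only, not the survey's actual figures):

```python
def poststratify(approval_by_group, sample_shares, population_shares):
    """Reweight per-group approval rates from the survey's sample mix
    to the known population mix (simple post-stratification)."""
    raw = sum(approval_by_group[g] * sample_shares[g] for g in approval_by_group)
    adjusted = sum(approval_by_group[g] * population_shares[g] for g in approval_by_group)
    return raw, adjusted

# Hypothetical figures for illustration only (not from the FHSU survey):
approval = {"democrat": 0.05, "independent": 0.20, "republican": 0.45}
sample = {"democrat": 0.45, "independent": 0.30, "republican": 0.25}      # skewed sample
population = {"democrat": 0.25, "independent": 0.30, "republican": 0.45}  # actual electorate
raw, adjusted = poststratify(approval, sample, population)
```

With these made-up numbers an oversampled-Democrat survey reads 19.5% while the reweighted estimate is 27.5%, which is the flavor of correction that moves a headline 18% up toward the mid-twenties.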


This week another survey was published that reviews the approval rates of all governors in the US.  You can find that study here.  I haven't fully vetted the methodology, but it indicates they at least tried to deal with demographic issues.

What did that study tell us?
Brownback's approval rate is 26%.  LOOK THAT'S IN MY RANGE!
But that dataset also provides approval ratings for other governors; what can those tell us?


While I was correct that Brownback's likely approval rate is above 18%, it is still dismal compared to other governors'.  In fact, Brownback is 9 percentage points below any other governor, a huge outlier.  I could bore you with p-values, z-scores (-2.8), and other statistical nerdery, but two charts easily describe how bad his approval rate is. (Brownback in red.)
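For the curious, the z-score is just how many standard deviations Brownback sits from the mean of all governors. A quick sketch with hypothetical approval rates (illustration only; the study's actual data is what gives the -2.8 cited above):

```python
import statistics

def z_score(value, others):
    """Standard deviations between `value` and the mean of the full set."""
    all_vals = others + [value]
    return (value - statistics.mean(all_vals)) / statistics.stdev(all_vals)

# Hypothetical approval rates for the other governors (illustration only):
other_governors = [0.35, 0.42, 0.48, 0.51, 0.55, 0.58, 0.62, 0.65, 0.70, 0.74]
z = z_score(0.26, other_governors)  # clearly negative: a low outlier
```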


Takeaways bullets:
  • Brownback's approval rate is likely above 18%, closer to 26% (read: I was right).
  • Brownback has the lowest approval rate among US governors.
  • Brownback's approval rating is an extreme low outlier.  

Wednesday, November 18, 2015

Tragedy Hipsters and the Connection of Mizzou to Paris

Over the weekend I was confronted with a new term that needed no explanation: Tragedy Hipster.  The term refers to people who respond to tragedies by citing other tragedies, and the behavior was prevalent all weekend.

Specifically, there was a string of tweets from Mizzou protesters complaining that the world was more upset about the Paris attacks (120 deaths) than about the issues surrounding the Mizzou protests earlier in the week (racism).  Other "hipsters" mentioned the Beirut bombings, which received much less press coverage than Paris.  Later in the weekend, I noted another string of tweets referencing both Paris and Mizzou that tended to be from conservatives complaining about the earlier tragedy hipsters.  All fascinating to me.

So a few data questions emerged: can we identify tragedy hipster behavior in data?  Can we differentiate the hipsters from their critics?  How much of the linkage between Paris and Mizzou came from the original hipsters versus their critics?  I looked at Twitter data to examine the behavior of tragedy hipsters.

A few takeaways from the data:
  • There was a high amount of tweet traffic following the initial event comparing Mizzou to Paris.
  • Much of that first wave could be considered "Tragedy Hipsters," while others were just asking for prayers for both.
  • Following that initial wave, the tweets were mainly from conservatives criticizing the initial wave.


First, a definition: 
Tragedy Hipster: someone who reacts to the initial news of a tragedy by trying to cite a cooler tragedy.  (Side note: "cooler" might mean lesser known, more deaths, better cause, etc.)
This usage is a direct corollary to the way music hipsters talk about music, referencing a "cooler" (read: more obscure, harder core, weirder) band whenever someone brings up music.  The music instance usually goes something like this:
Person 1: Hey have you heard the new Red Fang album?
Hipster: Red Fang sucks, they're just a rip off of the band Sleep.
Person 1: Whatever, hipster.
If that's what music hipster behavior looks like, what does a tragedy hipster look like?  Topic mining the weekend's tweets containing the words "Paris" and "Mizzou" found a specific topic (topic 3, see below) of both direct and indirect tragedy hipster behavior.  Here are some examples:

Direct comparisons:



Let's mine some data, shall we? Here's what I did:
  • Downloaded tweets from the beginning of the attack until Tuesday at noon that contained both "Paris" and "Mizzou"
  • Topic mined the data to find true topics underlying tweets, looking for tragedy hipsters versus their backlash.
  • Analyzed topics used and how they changed over time.
  • Sentiment mined the data for emotions.
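The topic-mining step above isn't shown in code, and the author's model isn't specified (likely LDA or similar). As a dependency-free stand-in, a toy keyword-seeded classifier illustrates the grouping step; the seed words below are my guesses based on the five topics described later, not the fitted model:

```python
import re

# Hypothetical seed words per topic (assumption, loosely based on the topics described)
TOPIC_SEEDS = {
    "conservative_criticism": {"tcot", "foxnews", "liberal"},
    "spotlight": {"spotlight", "stole", "media"},
    "tragedy_hipster": {"pray", "beirut", "also", "forget"},
}

def assign_topic(tweet):
    """Assign a tweet to the seeded topic with the most keyword hits
    (a toy stand-in for a fitted topic model)."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    scores = {t: len(words & seeds) for t, seeds in TOPIC_SEEDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

A real topic model discovers its topics from the data rather than from seed lists, but the downstream analysis (counting topic usage over time) works the same either way.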
First a word cloud, just to identify the top words.  Unsurprisingly, "attack" is the top word overall, but it's closely followed by "spotlight," "stole," and "unbelievable."

A topic model might help explain why those odd terms appear; the model converged on five topics:

On analyzing the tweets, topics 1, 2, 4, and 5 were mainly made up of conservative criticisms of "Tragedy Hipster" behavior, and topic 3 was... well, tragedy hipsters.  By topic:

  • Topic 1: TCOT (Top Conservatives on Twitter)/Fox News references criticizing Mizzou protesters.
  • Topic 2: People talking about Mizzou activists having their spotlight stolen.
  • Topic 3: Tragedy Hipster behavior, a good amount of it showing legitimate sympathy/empathy.
  • Topic 4: Another topic of conservatives making fun of Mizzou protesters.
  • Topic 5: Another topic of students being mad about losing their media spotlight.

Because the "Tragedy Hipster" topic segregated so well from other topics, we can plot topic usage over time to show how the situation evolved. The chart below shows time in three hour blocks (UTC) and the proportion in each topic.

Note that topic 3 dominated the conversation for the first few hours after the attacks but was supplanted by the other four topics over time.  Specifically, the initial ratio was 4:1 tragedy hipster to conservative, but after the first day the ratio reversed to between 1:4 and 1:9.  Overall, the total reaction has been 30% tragedy hipster, 70% conservative backlash.  In essence, the initial hipster reaction regarding Mizzou and Paris led to a few days of criticism from conservatives.
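The time-block chart boils down to flooring each tweet's timestamp to its three-hour block and computing topic proportions within each block. A sketch of that aggregation (in Python rather than the author's R, with hypothetical data):

```python
from collections import Counter
from datetime import datetime, timezone

def topic_share_by_block(tweets, block_hours=3):
    """Group (timestamp, topic) pairs into fixed-width time blocks and
    return each block's topic proportions."""
    blocks = {}
    for ts, topic in tweets:
        # Floor the timestamp to the start of its block (e.g. 00:00, 03:00, ...)
        block = ts.replace(minute=0, second=0, microsecond=0,
                           hour=(ts.hour // block_hours) * block_hours)
        blocks.setdefault(block, Counter())[topic] += 1
    return {block: {t: n / sum(c.values()) for t, n in c.items()}
            for block, c in sorted(blocks.items())}
```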

I embedded tweets for topic 3 above, so it's only fair I embed a couple of tweets associated with the other four topics:

I did one last thing with the data: sentiment mining.  Nothing much of note here, except that the emotion expressed was significantly angrier than in other sets of tweets I've looked at, including Kansas Legislature tweets and the Royals.


The concept of a tragedy hipster is somewhat fascinating.  On one hand, I understand the part of human nature that leads us to react strongly to the tragedies that seem closest to us (white people dying in a western country). On the other hand, I understand the part of human nature that, when faced with a tragedy getting attention, points out another tragedy that is *worse* in some way, or closer to home (Mizzou students seeing discrimination, versus people they don't know dying in Paris).

But are there really any takeaways from this?  In my mind:
  • The initial tragedy hipster tweets were simultaneously overshadowed and made more popular by their conservative reactions.
  • This whole situation created a lot of anger between groups, possibly undermining progress made in the Mizzou protests.
  • If you're prone to tragedy-hipster-type behavior, be aware of the optics: there will be a backlash, and many will attempt to make you look foolish.

One last thing: for fun, I wrote some code to auto-create wordclouds based on each topic. Here are the five for this analysis.  A couple of notable topics:
  • Topic 1 is a mess of conservative symbols, due to the repeated tweets of some conservative commentators.  
  • Topic 3 is notable in being much different (as noted before) from the other four topics.