Wednesday, December 30, 2015

How We Injure Ourselves By Age

Yesterday I found this article, which looks at wall punching by age, gender, and various other dimensions. The article itself was interesting (wall punching seems to track male testosterone change rates), but what was more interesting was the underlying dataset, the NEISS.

The NEISS is a dataset of injuries reported by hospitals, with details on the injured person, the objects involved in the injury, and the nature of the act that caused it.  I downloaded the data and immediately found it to be interesting and quite rich.

I had a few questions about the original article's wall punching analysis, specifically: did women see an uptick in wall punching in their mid-teens, or was that trend limited to men?  I replicated the analysis below; while both men and women see an increase in wall punching in their mid-teens, the women's increase is less pronounced.


I also noticed a lot of people in the dataset were being injured by their toilets, largely through falling while on their toilet.  It would be amusing to create the same chart as above, but for toilet injuries.


That's fascinating: it appears to be an inversion of the other distribution.  This makes sense, though, as toilet injuries seem to be related to older people falling on the toilet.  But why more women than men?  Because there are simply more old women.  For a moment, let's ignore gender and plot both toilet and wall injuries on the same chart:



That's interesting because it shows that the injury curves are inversions of each other, but the number of injuries from toilets never quite reaches the number from wall punching... or does it?  The problem is that there are far more people per year in their teens and twenties than in their seventies and eighties.  What if we control for the population age distribution and turn this into an annual risk rate analysis:



This final chart shows two things.  First, the annual risk rate for people in their eighties and nineties due to toilets is far higher than the annual risk rate for younger people punching walls.  Second, there's a point, sometime in your forties, when your risk from falling on the toilet becomes higher than your risk from punching things... a true sign of maturity.
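
The per-capita calculation is essentially injury counts divided by the census population at each age.  Here's a rough sketch of the idea; the file names and column names ("narrative", "age", "population") are placeholders rather than the actual NEISS field names:

# Sketch of the per-capita rate idea; file and column names are hypothetical placeholders.
library(dplyr)
library(ggplot2)

neiss  <- read.csv("neiss_2014.csv", stringsAsFactors = FALSE)   # hypothetical NEISS extract
census <- read.csv("census_age.csv", stringsAsFactors = FALSE)   # hypothetical: columns age, population

injuries <- neiss %>%
  mutate(cause = ifelse(grepl("punch", narrative, ignore.case = TRUE) &
                        grepl("wall",  narrative, ignore.case = TRUE), "wall punching",
                 ifelse(grepl("toilet", narrative, ignore.case = TRUE), "toilet", NA))) %>%
  filter(!is.na(cause)) %>%
  count(age, cause)                                 # injury counts by age and cause

rates <- injuries %>%
  inner_join(census, by = "age") %>%
  mutate(annual_rate = n / population)              # injuries per person per year

ggplot(rates, aes(x = age, y = annual_rate, color = cause)) +
  geom_line() +
  labs(x = "Age", y = "Estimated annual injury rate")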


Update 2015-12-31 8:42AM

Someone disagreed with my conclusion that the difference between female and male toilet injuries was due to demographic lifespan issues (women live longer, so there are more old women).  The contention was that differences in how men and women use toilets cause the gap, and that appears to be correct.  While lifespan still accounts for about half the gender variance in total toilet injuries among the elderly, women still have a higher overall risk rate.  Here's a chart controlling for gender population differences in older Americans.

Tuesday, December 29, 2015

Year End Summary and Good News

I haven't blogged as much as normal recently, but we'll blame that on the holidays and the end of the year, or on some good news I'll get to shortly.  But first, some commentary on this blog.

THE BLOG


I started this blog in December 2014 as an outlet for creativity and analysis, and to get me out of my day-to-day routine.  In those regards, the blog has been an outstanding success.  Unexpectedly, the blog has turned into a resume builder too.  In interviews it can be difficult for an analyst to demonstrate their true skillset; this blog gives me a portfolio of work.  In fact, I had a couple of interviews this year where I could use this blog as part of the interview process, and received positive feedback from managers.

That's great but boring, and everyone loves lists, so how about our final top five posts of 2015?
  1. Daraprim: Price increase or Leveraged Financial System - I talk about the actions of the person who many have called the "most hated man of 2015"
  2. My Data Science Toolkit - A list of my five favorite Data Science software tools.
  3. Peer Group Determination: Library Peer Groups  - A project I conducted with my wife to use a novel methodology to identify peer groups for public libraries. 
  4. Kansas Election Fraud - My first in what ended up being a seven part series criticizing the work of statisticians who claim that statistical anomalies point to a rigged voting system.
  5. Kansas Election Fraud: Part 6 Sedgwick County Suburbs - A followup to the post above, where I dig into why the initial anomalies exist using GIS mapping technology.

In sum, this blog started as a small project for me where I planned on posting 2-5 times per month, but quickly morphed into a 130+ post blog with just over 100 readers per day.  I look forward to where 2016 will take this blog.

GOOD NEWS

Remember when I mentioned that I had used this blog in a couple of job interviews this year?  One of those interviews turned into an exciting new position for me.  I will be starting a new job across the state line in Missouri on January 4th. I am very excited and looking forward to new challenges and opportunities.  My last day at my current job will be December 31st, making this a new year and a new beginning.

Tuesday, December 22, 2015

First Look: Mass Shooting Data

I've wanted to take a deep dive into mass shooting data for quite a while, but I didn't want it to be in the heat of the moment following another mass shooting.  Over the next few days I am going to take a deep look at the mass shooting data we have available, what it means, and why the numbers differ between sources.


THE DATA

There are two main datasets with mass shooting data, the Mother Jones data and the Shooting Tracker data.  Here is a brief summary of each dataset:

  • Mother Jones Data: Mother Jones focuses on multiple-death, non-gang public mass shootings.  Essentially, the kind of things we see on the news.  
  • Shooting Tracker Data: Shooting tracker focuses on any event where multiple people are shot, a very basic definition of mass shootings.

INITIAL COMPARISON

For this post I created an initial comparison of the data, just to get a sense of the differences between the datasets.  The first issue is that the Shooting Tracker data only goes back three years, whereas Mother Jones goes back to 1982.  We can generally work around this, but any longitudinal analysis will have to be based on the Mother Jones data.

As one might expect, the Shooting Tracker data tells us the number of mass shootings in the United States is much higher than Mother Jones does.  In fact Shooting Tracker tells us that we average one mass shooting a day, whereas Mother Jones tells us we average one a quarter:




Another difference is when mass shootings occur: Mother Jones shows most shootings occurring during the week, whereas Shooting Tracker shows shootings occurring disproportionately on weekends.


There's even disagreement on the seasonality of mass shootings.  Mother Jones mass shootings are scattered fairly evenly throughout the year, whereas Shooting Tracker shows a strong summer bias.
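
For reference, the weekday and seasonality tabulations are only a few lines of R once each dataset is exported to CSV; the "date" column name and format below are assumptions about those files:

# Rough sketch of the weekday/month comparison; the CSV file names and "date" column are assumptions.
mj <- read.csv("mother_jones.csv",     stringsAsFactors = FALSE)
st <- read.csv("shooting_tracker.csv", stringsAsFactors = FALSE)

tabulate_dates <- function(df, label) {
  d <- as.Date(df$date, format = "%m/%d/%Y")          # adjust the format to the actual files
  data.frame(source  = label,
             weekday = weekdays(d),
             month   = months(d))
}

both <- rbind(tabulate_dates(mj, "Mother Jones"),
              tabulate_dates(st, "Shooting Tracker"))

# Share of incidents by day of week and by month, within each source
prop.table(table(both$source, both$weekday), margin = 1)
prop.table(table(both$source, both$month),   margin = 1)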



CONCLUSION

This is just an initial look at mass shooting statistics, but it shows an important divergence in the way we talk about mass shootings.  I will dig into these datasets more in the next few days, attempting to understand the following:
  • Why are the datasets so different?
  • What would make a researcher choose one data set or another?
  • Which dataset is a more accurate presentation of "risk"?


Thursday, December 17, 2015

Martin Shkreli Dropped Some Hints

Today I woke up to what most of the world considered good news: "Pharma CEO" and expensive-music buyer Martin Shkreli had been arrested.  This post is going to involve a lot of embedded tweets, so let's get it started.  I actually learned about the arrest from Ben Casselman's epic tweet:


HISTORY ON SHKRELI


It appears that the arrest is on charges related to prior securities fraud, not his current venture; what initially brought him press (the pricing of Daraprim) is not related to his current legal issue.  I have written about Shkreli's Daraprim pricing once before on this blog, specifically pointing out... well, this:
The CEO's rhetoric tells us that he is leveraging higher pricing against the financial-insurance system, and effectively betting on the ability to extract large mid-term profits from it.  The insurance system, as it exists,  enables this type of cost increase by giving *ordinary* people *extraordinary* ability to pay for effectively one-time services.
Actually, I pointed out three things in my blog:
  1. While it's fathomable that the drug was under-priced to the point of not being profitable, the price shock he used was likely exorbitant. 
  2. His claims of using pricing to create capital for future research is likely just "CEO BS."
  3. He's just leveraging against incentives in the current insurance system.

NEW TWEETS

Shkreli had alluded to using the insurance system to leverage his profits, but never came out and said it.  Yesterday, he did.  Starting with this nice-sounding tweet: no one ever pays more than $10 out of pocket!



And he continues to say nice sounding things like this:
But the best tweet of the day from Shkreli (where he actually admitted to the concept of my prior blog post) was this one:


Effectively, Shkreli is outright saying: let me charge the insurance system huge prices, or I'll just give the drug away for free.  The full details of why this works relate to the diffuse impacts of the insurance system, this being a low-use drug, and what happens when you artificially give people more "ability to pay" for something.  I covered that in more depth in my prior post.

CONCLUSION/TAKEAWAYS

This situation gives me mixed feelings. I'm not really a supporter of single-payer health insurance, but Shkreli may be a good argument for it.  At least in this sense: as long as the current quasi-governmental insurance incentive system exists, "bad actors" like Shkreli will have incentives to push the system for personal profit.  Even if not every pharma CEO is like Shkreli, this will likely lead to marginal pricing increases due to relative consumer price insensitivity.  Shkreli could be the initial call for a reformed healthcare system.




Friday, December 11, 2015

Creating Better GIS Maps in QGIS

Quite often on this blog I use GIS mapping, especially maps of Kansas, to make a point.  Generally when I am making a map I use a simple block colored map, like this one of counties:


That map is great, especially for Kansas policy wonks, or people who can identify areas of Kansas without greater context.  But what if we want more contextual data on our map, like roads, city names, and topographical features?  This is very easy in QGIS, using the following steps:

  1. Install the QGIS OpenLayers plugin using the Plugins Menu.
  2. From the Web menu, choose the OpenLayers plugin. 
  3. Choose which layer you want to add to your map (I have best luck with the OpenStreetMap layer).
  4. Move the OpenStreetMap layer to be the bottom layer.
  5. If your top layer (what you are analyzing) is a polygon, it will likely cover your context layer.  Change this top layer to be partially transparent using this dialogue.  


And here's our output (same map as we were looking at before):


This gives even more context on close views, like this one:



And it is even more helpful for close-up, in-town views (like a precinct map of northern Johnson County, Kansas):


Wednesday, December 9, 2015

County Level Unemployment and Determining Cause from Correlation

(A bit of data/mapping ADHD, we're going to go through a lot of maps, very fast!)

Someone pointed out to me that the Kansas side of the KC metro area currently has its lowest unemployment rate in 15 years.  Though this is interesting and positive, it started me down a path that led to far too many maps, and a tie-in to potentially spurious voting correlations.

JOHNSON V WYANDOTTE

If you have spent much time on the Kansas side of the Kansas City metro, you know it's a tale of two counties: white, affluent Johnson County and a mixed, poor Wyandotte County.  Here's how the two counties compare on selected metrics:


Huge differences between these adjacent counties, but I wondered how that played out in terms of unemployment rate (disclaimer: the unemployment rate isn't a great metric for a variety of reasons, but it works in this scenario).  Luckily, the Bureau of Labor Statistics offers county-level unemployment statistics.  Here's a map of unemployment rates I created for northeast Kansas (2014, annualized):




This is close to what might be expected: Johnson County has the lowest rate in northeast Kansas, Wyandotte County has the highest, and the margin between them (3.1 percentage points) is striking.  But while I have this data on my desktop, why don't we look at the entire state?
 

Three things struck me about this map:
  • There's a huge urban-rural divide (not surprising, very rural, agricultural areas can have very low unemployment rates for multiple reasons).
  • Southeast Kansas is a rural area with a high unemployment rate.  This is also not surprising, as this has been a high-poverty rural area for the past few decades.
  • This looks a little like another map I made.  Specifically this one: 



What is that map?  A map of the percent of voters voting for Sam Brownback by county.  Interesting.  I wondered if there might be a correlation.  Two charts show that there is a fairly significant correlation.
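
For the curious, the correlation check itself is simple once the county unemployment rates and Brownback vote shares are merged into one table.  A sketch, with placeholder column names:

# Correlation sketch; "unemployment_rate" and "brownback_pct" are hypothetical column names
# for the merged BLS (2014) and 2014 election data.
counties <- read.csv("ks_county_unemployment_votes.csv", stringsAsFactors = FALSE)

cor.test(counties$unemployment_rate, counties$brownback_pct)   # Pearson correlation
fit <- lm(brownback_pct ~ unemployment_rate, data = counties)  # simple linear fit
summary(fit)

plot(counties$unemployment_rate, counties$brownback_pct,
     xlab = "County unemployment rate (2014)",
     ylab = "Brownback vote share (2014)")
abline(fit)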







Significant correlation! Low unemployment led to Brownback's win in Kansas! ... probably not.

People say "correlation is not causation" so often that it annoys me.  But this is a great case for explanation.  My thoughts:
  • A priori. One way people debug the correlation/causation question is by looking for an a priori or functional theory.  This roughly means that we can develop a reasonable theory for the causal mechanisms underlying the correlation.  In this case, we have a pretty clear (and compelling) a priori theory: people in low-unemployment counties view the economy as performing well and tend to prefer the status quo (incumbent candidates).
  • Covariates: A "gotcha" in correlation/causation is outside factors simultaneously causing or impacting both correlating variables (this is likely what is occurring here):
    • Other variables: The counties with low unemployment have other things in common (for example: agricultural, more conservative, whiter, more rural).  All of these things lead to supporting more conservative candidates, independent of county-level unemployment rate. 
    • Pre-existing preferences: The counties with lower unemployment rates voted for Brownback in 2010, before he had any impact on those rates and before any incumbency bias would have been established.
There's another factor at hand though, which *could* have an effect.  Areas with lower unemployment rates could be more conservative generally, due to rational benefits.  The argument here is that low unemployment leads to lower political demand for social services, which are generally considered a liberal policy.  In this case unemployment rates could be at play, but more broadly as a general indicator of well being, and not a preference for an individual candidate.

CONCLUSION

Obviously this post has been a bit ADHD, but a few takeaways:
  • The Kansas side of the KC metro region may have historically low unemployment rates, but that is in no way homogeneous across the region.
  • There is a significant correlation between unemployment rate and propensity to vote for Sam Brownback.  Brownback won counties with < 3% unemployment with nearly 70% of the vote.
  • It's unlikely that the low unemployment rates in certain counties are directly responsible for Brownback's support; alternatively, counties with lower demand for social services may have more conservative preferences.

And a few more maps, unemployment rates over time:

1990

2000

2010 (height of recession)







Tuesday, December 1, 2015

KU Losses and a Simulation Engine

My mid-season bet that KU would go winless this year turned out to be true, so I probably need to post how right I was about that.  Don't worry though, this isn't a blog post of me simply gloating.  Well the first part is, but I have built a helpful piece of code too, which we'll get to later.  

WAS PREDICTION ACCURATE?

So let's do a post mortem analysis of whether my predictions for KU Football were accurate.  What was the initial prediction I made?  

About a 50-50 chance of a winless season starting from the first week of the conference season (game four).
<Gloating>

Some people might look at this prediction and say "50-50 means you don't know," which is kind of true.  I wasn't sure whether KU would go winless or not.  But in this case the absolute probability doesn't matter; the information gain does.  Information gain, in this case, means how much more we know about the probability of something over a "par" or average probability.

College football teams go winless about 0.2% of the time, making this a fairly rare event.  To be able to say that KU had a 50% chance of going winless means that they were effectively 250 times more likely to go winless than any random college football team, a huge gain in information from this prediction.

As an example, let's say that the daily probability that an elevator in your building would get stuck on a ride is 0.1%. However, I have modeled the performance of elevators in your building, and tell you that the elevator you're about to get on has a 50-50 chance of getting stuck.   I still "don't know" whether the elevator will get stuck,  but the model is actually quite useful because it provides a lot of information about this specific elevator ride over the normal 0.1% par probability. In essence, the 50% probability is not certain, but is still useful.

On the other hand, an event predicted at the 50% level should come true about 50% of the time, so how can we be sure that it wasn't actually more likely than I had originally predicted?  Without looking at my estimates over a series of seasons, there isn't a good way to determine the accuracy of the overall predictions.  Some cynics would have claimed prior to the season that KU would almost certainly (>90%?) go winless.  It's hard to falsify that statement now that the team has gone 0-12; however, there were a few games that were played close (SDSU, Texas Tech, TCU), and that tells me the team had a legitimate shot to win a game along the way.  Add this to the uncertainty of *when* a team under a new coach will start playing well, and 50-50 still seems like it was a reasonable estimate (imagine if KU had played the way they did in the TCU game when they were playing Iowa State).

</Gloating>


SIMULATING SEASON

Now that I have that out of my system, how about some actual statistical work?  One piece of my toolset that I've had in SQL or Excel, but never in R, is a batch probability simulation engine.  The point of a simulation engine is to take a set of events with probabilities attached to them and simulate them thousands of times, to get a sense of how things might turn out together (likely outcomes for a season).  A concrete way of looking at this is like letting Madden play 1,000,000 seasons (computer versus computer), and then setting probabilities based on which team wins the Super Bowl most often.

To write a probability simulation engine you need a few general parts:
  • A set of input probabilities (e.g. a vector of probabilities of a team winning each of their games this season).
  • Creating a matrix of random probabilities with columns = the number of events (games) and rows = the number of simulations.
  • A win/loss classifier that compares random probabilities to the set event probabilities.
  • A summarizer, to summarize the total numbers of wins and losses, and the season outcomes.

My code to do this is below.  There are some additional pieces you can add in, for instance bootstrap modifiers that account for dependencies between events, and other modifiers to run many teams at once.  I'll work on that later.
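
A minimal sketch of such an engine, with made-up per-game win probabilities standing in for real model output, looks something like this:

# Minimal batch simulation engine sketch. The win probabilities below are
# illustrative placeholders, not actual model output.
set.seed(2015)

p_win  <- c(0.80, 0.60, 0.15, 0.10, 0.05, 0.20, 0.10, 0.15, 0.05, 0.10, 0.25, 0.10)  # one entry per game
n_sims <- 1e6

# Matrix of uniform random draws: one row per simulated season, one column per game
draws <- matrix(runif(n_sims * length(p_win)), nrow = n_sims)

# Win/loss classifier: a game is a win when its draw falls below that game's win probability
wins_per_season <- rowSums(sweep(draws, 2, p_win, "<"))

# Summarize season outcomes
table(wins_per_season) / n_sims          # distribution of season win totals
mean(wins_per_season == 0)               # share of winless seasons
mean(wins_per_season >= 6)               # share of bowl-eligible seasons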

How does this actually work?  I simulated KU's season 1 million times (it only takes about 2 seconds), and summarized the results. Here's how the seasons shook out in terms of number of wins (including some higher-probability wins, e.g. SDSU):

That's a bit depressing.  Even including the "easier" non-conference season KU would go winless 37% of the time.  KU would become bowl eligible (wins>=6) once in every 10,000 seasons.

Here’s a look at just the conference season.  Over 50% of the time, KU won zero games.


How bad was KU compared to a "par" team?  I made a par power-conference team with three non-conference games at a .80 probability of winning and a .50 probability of winning each conference game.  Here's what that looks like.


And here’s the R code that got me here (this is really simple, but I will expand on it to handle multiple scenarios, and simulate full leagues).





CONCLUSION

Just a couple of points to cap this one off:
  • I was generally correct about KU's chances of winning a game this season (gloating).
  • It's fairly straightforward, after creating probabilities of winning each game, to simulate how teams may actually perform during the year.  
  • With KU's current performance, they will go winless one out of every three years, and go to a bowl once every 10,000 seasons.



Tuesday, November 24, 2015

The Different Ways We Talk About Candidates

A couple of months ago I found a website with extremely rich data, an event which usually makes me very happy.  This website didn't have that effect on me.  I was trying to figure out the weight of a specific baseball player and stumbled upon a database of detailed celebrity body measurements (all women, of course), found here.  Later I found that the data included political candidates, and it raised a question in my mind about the different ways we talk about men and women in politics.

Simultaneously, I was looking for a way to measure the presence of certain ideas across the internet.  I can already measure sentiments and topics on twitter, but Twitter is only a portion of the internet, and most people access the internet through Google search when seeking out new information.  Could I write code that would start my text mining operations through Google Search?

THE TEST

(NON-Nerds Skip this)

I had a social idea (how we talk about candidates based on gender) and a coding/statistical concept to test: mining Google search results.  I went forward with a formalized test plan:
  • I would use the google search API to pull results for "Candidate's Name" + Body Measurements.
  • I would capture the data and turn it into mine-able text.
  • I would compare the top words returned for each candidate.  (Note: rate limits on the Google API, as well as some Google restrictions, slow me down; in the future I may apply more sophisticated text mining techniques.)
I wrote some code to pull the Google Search results; the Google API only allows us to pull 4 results at a time, so I wrote a loop to pull four at a time.  Here's what that looks like (building step by step for ease of understanding):
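
The loop looks roughly like the sketch below.  As a stand-in I'm showing Google's Custom Search JSON API (which pages 10 results at a time rather than 4); the endpoint parameters and JSON field names are assumptions to verify against the API documentation:

# Sketch of paging through search results and collecting snippet text.
# Credentials, endpoint parameters, and field names are placeholders/assumptions.
library(httr)
library(jsonlite)

search_query <- '"Ben Carson" body measurements'        # example query
api_key <- "YOUR_KEY"                                   # placeholder credentials
cx      <- "YOUR_SEARCH_ENGINE_ID"

snippets <- character(0)
for (start in seq(1, 41, by = 10)) {                    # page through the results
  resp   <- GET("https://www.googleapis.com/customsearch/v1",
                query = list(key = api_key, cx = cx, q = search_query, start = start))
  parsed <- fromJSON(content(resp, as = "text"), flatten = TRUE)
  snippets <- c(snippets, parsed$items$snippet)
  Sys.sleep(1)                                          # stay under the rate limits
}

# Turn the snippets into mine-able text and count the top words
words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", snippets)), "\\s+"))
words <- words[nchar(words) > 3]
head(sort(table(words), decreasing = TRUE), 10)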



DATA RESULTS

So what are the results of googling Candidate Names + Body Measurements?  I googled four candidates, two men, two women.  My observations:
  • Men: The men's results were generally about the campaign, with each returning a few references to BMI (Body Mass Index).
  • Women: The women's results were heavily focused on the size of their bodies.  In fact, the top four words for each woman were the same: size, weight, height, and bra.  


This table shows the top 10 words returned for each candidate.  This is obviously on a small sample size (four candidates, only top 44 google results for each) but is interesting nonetheless.  


And because I know everyone likes wordclouds (sigh) I created wordclouds for each candidate at the bottom of this post, below conclusion.

CONCLUSION

Some final takeaways from this analysis:
  • It's definitely possible to text mine Google results in order to measure the prevalence of ideas on the internet.  I probably need to refine my methodology in the future, and obviously implement more sophisticated techniques, but the basic scraping method is complete.  
  • There exists relatively little information on the internet regarding the body measurements of male candidates.  And I really wanted to know Ben Carson's waist to hip ratio!
  • Female candidates are talked about online a lot more in terms of their body.  I'm not an expert in feminist discourse analysis, or even really qualified to give an opinion here, but I have certainly measured a difference in the way candidates are talked about online.




BEN CARSON

HILLARY CLINTON


CARLY FIORINA
BERNIE SANDERS


Friday, November 20, 2015

Corrected Polling Numbers

A few weeks ago I posted a fairly hefty critique of a survey conducted by Fort Hays State University researchers on the political climate in Kansas.  The survey claimed a lot of things, but the issue receiving the most press was that Kansas Governor Brownback had an 18% approval rate.  I took issue with that number for various reasons, largely due to demographic skews in the data hinting at sampling or response bias.

ACTUAL APPROVAL RATE?

Sometime later a Twitter user asked me: if not 18%, what do I really think Brownback's approval rating might be?  I looked again at the skews and did some quick math, adjusting for prior demographic distributions and likely errors, and came up with a range.  This was really just me trying to back into a number from bad polling data.  Here's my response on Twitter:

WARNING: "I TOLD YOU SO" coming. 

This week another survey was published that reviews the approval rates of all governors in the US.  You can find that study here.  I haven't fully vetted the methodology, but it indicates the authors at least tried to deal with demographic issues.

What did that study tell us?
Brownback's approval rate is 26%.  LOOK THAT'S IN MY RANGE!
But that dataset also provides information on other governors' approval ratings; what can those tell us?

COMPARISON TO OTHER GOVERNORS

While I was correct that Brownback's likely approval rate is above 18%, his approval rate is still dismal compared to other governors.  In fact, Brownback is 9 percentage points below any other governor, a huge outlier.  I could bore you with p-values and z-scores (-2.8) and other statistical nerdery, but two charts can easily describe how bad his approval rate is (Brownback in red).
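
If you do want the nerdery, the z-score is just Brownback's number expressed in standard deviations from the mean of the other governors' ratings.  A toy version with made-up approval numbers:

# Toy z-score calculation; these approval numbers are made up for illustration,
# not the actual values from the governors survey.
other_governors <- c(62, 58, 55, 54, 51, 50, 48, 47, 45, 44, 42, 40, 38, 35)
brownback       <- 26

(brownback - mean(other_governors)) / sd(other_governors)   # standard deviations below the mean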




CONCLUSION

Takeaways bullets:
  • Brownback's approval rate is likely above 18%, closer to 26% (read: I was right).
  • Brownback has the lowest approval rate among US governors.
  • Brownback's approval rating is an extreme low outlier.  

Tuesday, November 3, 2015

Testing Opinion Polls: Do they really measure what they say they do?

**edited 2015-11-05 to include additional demographic information

Generally, I am not a fan of survey research and prefer economic numbers or other data measured not by "calling people and asking them how they feel."  Polls can bring in a lot of bias, not just the normal sampling error that some statisticians are obsessed with measuring and testing against, but also response bias, sampling bias, biases from the way you ask questions etc.  That's not to say that opinion polls and surveys are all worthless (if you want to do one, I know a guy, his name is Ivan, he's great with this stuff).

This is why, when developing political models, I only partially rely on recent opinion polls and also heavily weight historic voting trends.  Remember how I used a model with additional data to predict the Brownback re-election within the margin?  (I'll be bringing this up for at least another three years.)

A new poll has been making the rounds in Kansas and national media, making claims such as "Obama is more popular in Kansas than Brownback."  Keep in mind that Obama lost Kansas by 21 percentage points in 2012, and Brownback just won re-election in Kansas by about four percentage points.  This is obviously quite a claim, but how seriously should we take it?  Moreover, are there some basic steps we can use to vet how good opinion surveys are?

BACKGROUND: TYPES OF BIAS

So what makes a survey accurate versus inaccurate? The truth is, there are a lot of good ways to mess up a survey.  Here are the general ways surveys are incorrect:
  • Sampling error.  Many statisticians spend a majority of their careers measuring sampling error (this is part of the frequentist versus Bayesian debate, a topic for another post).  Sampling error is the error caused simply by using a sample smaller than the entire population.  Assuming the sample is randomly selected from the population, there will still be a small amount of error.  This is the (+/-) 3% you see in most public opinion polls, though it varies with the size of the sample.
  • Sampling bias. Sampling bias is different than sampling error, though this is an issue that is somewhat difficult to understand.  This is the bias introduced through problems with the process of choosing a random sample of a population.  How does this kind of bias crop up?
    • Bad random number generators.
    • Bad randomization strategies (just choosing top 20% of a list).
    • Bad definition of population (list of phone numbers with people systematically missing)
  • Response Bias.  Response bias is what occurs when certain groups of recipients of a survey respond at different rates than others.  This occurs due to varying "propensity to respond" by demographic or opinion groups within a population.  Examples of how that occurs:
    • Women can be more willing to respond 
    • Older people (retirees) can have more time to respond
    • Minority groups can be less trusting of authorities, and less willing to respond
    • Certain political groups may not trust polling institutions and be less willing to respond
Once again, this is just a starter list of what can go wrong with taking samples within surveys, and I may add to this list as we go, but this is a good primer.

DATA: THE FHSU SURVEY


Let's jump right into the study at hand.  The study was conducted by the Docking Institute of Public Affairs at Fort Hays State University.  The study can be found here.

Did the authors in this study consider the error and bias issues?  Absolutely, and they reference it in the study.  Here's a snapshot from their methodology.  



A couple of things stand out from this first section.

  • First, they're referencing a sampling error of (+/-) 3.9% for the sample.  That means any number in the survey can be considered accurate to within 3.9 percentage points, assuming the sample is otherwise unbiased.  
  • Second, they make a passing reference to response bias, assuming it away.  But how can we test whether there is response bias?  Elsewhere in the paper they say that they contacted 1,252 Kansans, and 638 responded.  That means if the roughly half that responded are "different" demographically than the half that didn't respond, the conclusions of the survey could be misleading.
  • Third, there's no reference here to sampling bias, but they de facto address it elsewhere, talking about how they pulled the sample.  The report says: "The survey sample consists of random Kansas landline telephone numbers and cellphone numbers. From September 14th to October 5th, a total of 1,252 Kansas residents were contacted through either landline telephone or cellphone."

Looking at these three sources of potential bias: sampling error is simple math (based on sample size), response bias is assumed away by the researchers, and it's impossible to know whether the list of phone numbers used can create an unbiased sample of Kansans.  So can we be sure that this sample is an accurate representation of Kansans?

We can never be certain that this is a good sample free of response and sampling bias, but we can do some due diligence to determine if everything looks correct, specifically through fixed values testing.  In essence, there are some numbers that we know about the population through other sources (census data, population metrics, etc) that we can test to make sure the sample matches up.  Let's start with gender.

In the paper, on page 39, there's a summary of respondents by gender, both for the population and the sample.  Keep in mind that the margin of error for this sample is 3.9%, so we would expect the gender ratios to fall within this margin.  They do not (a 5-point variance), meaning that the gender differences in this survey cannot be attributed to random sampling error.



Also on page 39 is a summary of sample and population by income bracket.  Reported income brackets are a bit fuzzier than reported gender, but the chart below shows how those line up.  Because there are multiple categories here, we can't do a simple (+/- 3.9%) calculation (technically a binomial test of proportions).  Instead we rely on a chi-squared goodness of fit test to determine whether the differences are due to sampling error or an underlying bias.  If the values were statistically similar, we would expect a chi-square value under 14.1.  The test finds that the results exceed the limits of sampling error, and indicate an underlying bias in the results.  
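
Both checks are quick to run in R.  The counts below are hypothetical stand-ins for the report's page 39 tables, but they show the shape of each test: a binomial test of proportions for gender, and a chi-squared goodness of fit test for the income brackets:

# Hypothetical counts standing in for the report's page-39 tables.
# Binomial test of proportions: is the share of women in the sample consistent with the population share?
n_sample      <- 638
women_sample  <- 352          # hypothetical respondent count
women_pop_pct <- 0.503        # hypothetical population share
prop.test(women_sample, n_sample, p = women_pop_pct)

# Chi-squared goodness of fit: do sample income brackets match the population proportions?
sample_counts <- c(60, 85, 110, 120, 105, 80, 50, 28)                    # hypothetical, 8 brackets
pop_props     <- c(0.14, 0.15, 0.15, 0.14, 0.13, 0.12, 0.10, 0.07)       # hypothetical, sums to 1
chisq.test(x = sample_counts, p = pop_props)
# With 8 brackets (7 degrees of freedom), the 5% critical value is ~14.1,
# which is where the 14.1 threshold above comes from.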


We also have fixed numbers for party affiliation in Kansas per the most recent registration numbers from the Secretary of State.  Those numbers are shown on the left side of the chart below; about 45% of Kansans are currently registered as Republican.  On page 39 of the survey we see the reported party affiliations of survey takers.  This analysis is a bit fuzzier because the way people identify doesn't always match their actual party affiliation, but we wouldn't expect that to cause the level of deviation observed in the chart.  As shown below, more of the sample respondents identified as unaffiliated, about 12 percentage points fewer as Republican, and 5 percentage points fewer as Democrat.  This also suggests the sample was significantly less conservative than the registered voters of Kansas.




CONCLUSION


All of the data above speaks to how different the sample was than the general population of Kansas, but what are the takeaways from that?

  • The significant differences in population versus sample demographics undermine the 3.9% margin of error, making the true error unknown, and potentially much larger.  More concerning, the direction of the issues makes it appear that the survey was biased in a way that favored Democrats.
  • Significant differences between sample and population on measured values can be an indication of other underlying problems with the sample on unmeasured values.  We know that the sample was more female, more affluent, and more left-leaning than the population; could that mean the sample was also biased in a way that made it more urban?  Unknowable with the available data, but certainly problematic.
  • The researchers released the paper with metrics outside the margin of error and didn't address it.  This is the most troubling part, because statistical issues like this crop up in research all the time, but they can often be tested away or have their impact on the margin of error quantified.
My last thought on this: I agree that Sam Brownback likely has a low approval rate; however, the 18% approval rate, as well as the other numbers reported in the survey, is likely an understatement of his true approval rate, given the bias presented.

**Added 2015-11-05 
After seeing more people cite this survey, I realized it has been conducted in Kansas since 2010, so I thought I would see if the demographic trends were consistent.  They were, and there were some additional demographics reported, specifically age, race, and education level.  I'm not going to do a full write-up on these, but they were also significantly inconsistent with population demographics.


Oh, and one last thing: I didn't talk about how the way questions are framed can impact results, and this survey had one really wonky question in it that I'm not a fan of.  Specifically:

Thinking about what you paid in sales tax, property tax and state income tax together, compared to two years ago, the amount you pay in state taxes has increased, remained the same or decreased?

This question has received some press time, with the byline being that 74% of Kansans now pay more in taxes under Brownback's policies.  Because of the term "Amount" versus "Rate" in the question, I would count myself part of the 74%, but not because of Brownback's policy changes.  I pay more now because I make more money and live in a bigger house, actually an indication of success over the past four years.  I certainly don't think this is what the researchers or the press are purporting to measure.

Friday, October 23, 2015

Peer Group Determination: Library Peer Groups

Libraries are an institution generally associated with books and reading, and not necessarily math and data.  But as an avid reader who is married to a librarian, this data guy has an interest in the data behind libraries.  When my wife made me aware of some library datasets that might be interesting, I dug in and looked at the numbers.

I realized this dataset gave me the opportunity to potentially help libraries and to tackle a subject I have wanted to cover on this blog for a while: peer group identification.  Specifically, the research question here is: can we use a data-driven methodology to identify peer groups for individual libraries?  If we can, librarians can use these peer groups for benchmarking, setting best practices, and networking with those in similar situations.

THE PEER GROUP

Mixing it up a bit by putting results before detailed methodology.  The methodology is down below if you're interested, in both math-intensive and non-math-intensive formats.  For data, I used the publicly available 2013 IMLS dataset.

I used a "nearest neighbor" methodology to find peer libraries for my home library system, Johnson County Library (JCL).  The nearest neighbor method is widely used across many fields; here's an example from medical research.  The factors I matched on were population served, branches, funding per capita, and visits per population.  

The result is the peer group in the chart below, with libraries that have between 11 and 13 branches and similar funding levels, populations, and visit counts.  There is one extremely close neighbor, the Saint Charles City-County Library District.  This library is similar to JCL in the data, but also in serving affluent suburban areas near midwestern cities.  

I ran this list by my wife and she liked it.  So, success?  In the short term at least, but there may be room for refinement (see conclusion).






SIMPLE METHODOLOGY

The "nearest neighbor" methodology for determining peer groups is fairly easy to understand at a basic level.  If we wanted to determine a peer group for Johnson County Library without using advanced analytics, we might start by simply looking at all libraries that serve populations between 400K and 500K.  

That might give us a good start, but upon diving in we would learn that many of those libraries face different challenges and experiences.  Some would be less affluent with lower funding levels, while others may see far different use patterns.  So we would add in a second variable, let's say funding per population, which would look like this:  


In this case we would choose the libraries closest to JCL, roughly the circle in the above graph.  But once again, there's a lot more to the attributes of a library than funding and population served.  What about use patterns, and number of branches?  

This is where I lose most people in the math.  Using this methodology, we can use as many dimensions as we want, and simply calculate the nearest neighbors on all variables simultaneously.  The best way to imagine it is as an extension of the above graph into 3-dimensional, 4-dimensional, and n-dimensional space.

NERD METHODOLOGY

This is a methodology I have been using since early in my career to choose peer groups, in the form of the k-NN algorithm.  Computationally, this is similar to the k-NN algorithm, especially in the first phases.  Some generally nerdy notes:

  • Computation Method: This is easy, really; it's just a minimization of euclidean distance in multi-dimensional space.  Effectively, a minimization of d, where d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2) across the n (normalized) attributes of the two libraries x and y.
  • Computation Strategy: The k-NN machine learning algorithm is both computationally elegant and costly.  It is elegant because it is simple to write: we just compute euclidean distance in n-dimensional space and find the "closest" k neighbors to the point we're interested in.  Simple.  I didn't even use an R package for this; I just rewrote the algorithm in about 10 minutes of R coding so I could have more control (a rough sketch of the approach follows this list).  It is costly because, in its predictive form, it requires distances calculated between every point in a dataset (which, you can imagine, can be slow for million+ row data tables).  Luckily, in this case I'm only interested in the distances between Johnson County Library and the others, so it's computationally cheaper.
  • Variable Normalization: If you input raw data into the nearest neighbors algorithm, attributes will establish their importance in the equation by variance (because we're simply measuring the space raw).  I take three steps:
    • I conduct some attribute transformations.  Most importantly, I take logarithms of any variables showing a power-law distribution to reduce variance.
    • I Z-score each attribute, such that we are now dealing with equivalent variance units.
    • I (sometimes) multiply the Z-score by a weighting factor, when I want one factor to matter more than others.  In this case, I don't have a good a priori reason to weight factors, but could reconsider if Librarians think other factors matter.
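
Here's a rough sketch of that computation, with hypothetical column names standing in for the actual IMLS fields (population served, branches, funding, and visits):

# Nearest-neighbor sketch: transform, z-score, then rank by euclidean distance to JCL.
# Column names below are hypothetical placeholders for the 2013 IMLS fields.
libs <- read.csv("imls_2013.csv", stringsAsFactors = FALSE)

features <- data.frame(
  pop_log     = log(libs$population_served),          # log of a power-law-ish variable
  branches    = libs$branches,
  funding_cap = libs$funding / libs$population_served,
  visits_cap  = libs$visits  / libs$population_served
)

z <- scale(features)                                   # z-score each attribute

target <- which(libs$name == "JOHNSON COUNTY LIBRARY") # hypothetical name field
dist_to_jcl <- sqrt(rowSums(sweep(z, 2, z[target, ], "-")^2))

# The k nearest neighbors (excluding JCL itself, which is at distance 0)
k <- 10
peers <- libs$name[order(dist_to_jcl)][2:(k + 1)]
peers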

CONCLUSION

In this post I have covered the methodology for determining peer groups, and created peer groups for the local Johnson County Library.  I hope that it is both demonstrative of methodology that could be implemented in many fields, and also adds value to the field of Librarianship.

If any librarians are interested in this analysis, or details of it feel free to reach out to me at datasciencenotes1@gmail.com.  I would be happy to provide you a custom peer group for your library.  Also would be interested in any thoughts on improving the peer groupings either by:
  • Using additional factors or variables.
  • Weight certain factors as more important than others (is funding or # of branches more important than # served?)
I'll leave you with a final view of the JCL data.  Below is a graph of libraries by visits and population, with the Johnson County peer group highlighted in blue.  Note that there are some orange dots intermixed with the peer group; these are libraries that were good matches on these two factors (visits/population) but not on our other two factors.
Johnson County Peer Group Versus All Libraries

Wednesday, October 21, 2015

Mapping Kansas Jobs

When researching my post from earlier this week, I downloaded some data that wasn't immediately helpful: by-county total employment statistics from 2001 to 2014.  Yesterday, though, I created some maps from it that are fairly interesting when looking at job patterns in Kansas over 2001-2014.  A few takeaways I found:
  • Johnson County (+) The biggest positive growth over the period, at +35,000, which is actually greater than the net total for all counties combined.  That means, if we remove Johnson County, all other counties see net-negative job growth from 2001 to 2014.
  • Sedgwick and Shawnee Counties (-) These two counties have seen the worst aggregate job reductions, losing from 3,500 to 5,000 each.
  • Smaller Counties (+/-) Some good, some bad.  Positives for southwest Kansas, near-suburban counties (e.g. Butler), and those along I-70 (e.g. Ellis); negatives for southeast Kansas and those along the Nebraska border (e.g. Jewell).
  • Commuter Counties (?) A weird data thing here, but if I divide workers in a county by total population, I can get a sense of commuting patterns.  For instance, the ratio is .870 for Saline County and .396 for neighboring Ottawa County.  That's a huge difference, but having grown up there, I know there's a net inbound commuter flow into Saline County and out of Ottawa County. 

TOTAL WORKERS BY COUNTY

To start off, I mapped total workers by county, to get a sense of where people worked.  First by 50K buckets.  




That's not as meaningful as I thought it might be; only four of 105 counties have more than 50,000 workers.  Johnson and Sedgwick are dominant.  Let's break that down into more meaningful buckets; here, each bucket has an approximately equal number of counties in it.



That's a bit more meaningful, and the combination of the two maps gives us an idea of where most jobs in Kansas are: Johnson and Sedgwick Counties, then the rest of eastern Kansas, then a few hot spots in western Kansas (Ellis, Finney, Ford).

EMPLOYMENT CHANGE


First I let QGIS break down the counties as it saw them, in 10,000-worker change buckets.  Below, dark red indicates a net loss of jobs, and orange is a gain of between 0 and 10,000 jobs.  The bright green is Johnson County at +35,000.



This gives us an idea that Johnson County is an outlier in jobs growth, but by how much?  A summary chart below makes the point in numbers, but essentially: without Johnson County, Kansas counties are net-negative in Jobs since 2001.

The above map shows aggregate job change, but a better measure of impact to individual counties is percent change.  Here's a map of job growth and loss mapped by percent change. There's also a chart below that shows the biggest winners and losers in jobs over this time period.



And below is the same information in chart form.




COMMUTER POPULATIONS

Because the data was available in my GIS layer, I calculated the ratio of total workers to population aged 18-65.   Here's what that looks like:


The range was wider than I expected, but two things seemed to pop out.

  1. Moderately low employment rates overall in *extremely rural*, older, agriculture-based populations (e.g. Decatur, Jewell, Hodgeman, and Edwards Counties).
  2. Near urban "commuter" communities have extremely low employment rates.  These communities are fairly obvious if you're familiar with Kansas demographics. Some county examples:
    • Ottawa commutes to Saline
    • Butler commutes to Sedgwick
    • Miami commutes to Johnson

CONCLUSION

I just pulled these maps together from data already on my desktop, but here are a few summary points:
  • If you remove Johnson County from Kansas, there has been essentially no job growth over the past 14 years.
  • Sedgwick, Shawnee, and a few rural counties have seen especially negative results over this time frame.
  • By looking at ratios of jobs to population, we can detect net commuter/commutee counties, though there's probably a more direct way to detect this.


Wednesday, October 14, 2015

Democratic Debate #1 Summary


After the Democratic debate last night, my wife asked me who I thought won.  I had no clue, actually.  I could name three people who didn't win, but the debate wasn't a clear, decisive victory for either front-runner, Hillary Clinton or Bernie Sanders.  So why not try to derive who won the debate from social media, like I did for the Republicans?

TWO SENTENCE SUMMARIES

I'll just start out with my take from the debate, somewhat sarcastic, and limited to two sentences each.
  • Hillary Clinton - She'll likely win the nomination, so a good part of her debate strategy is like the four corners offense in basketball or "prevent" defense in football.  She's just running out the clock and trying not to mess it up.
  • Bernie Sanders - He made some points indicating willingness to work with people of differing opinions, weird for a debate (gun control).  Came off only half as crazy as portrayed in the media.
  • Jim Webb - Seems to be angling for an appointment as Secretary of Defense.  Here's his picture as Assistant Secretary of Defense, 30 years ago.
  • Martin O'Malley - Still had to google him this morning to figure out anything about him.  I guess he was governor of Maryland?
  • Lincoln Chafee - Is he the one that looks kind of like a bird?  Yes, both a real bird AND Larry Bird.

VOLUME AND POLARITY

Same methodology as normal: I downloaded a sample of tweets from after the debate and analyzed them.  Then I calculated the number of tweets related to each candidate and the positive percentage.  Clinton and Sanders had a similar number of tweets, but tweets mentioning Clinton were a bit more positive.  The rest of the field went Webb, O'Malley, and Chafee, with Chafee getting only slightly more attention than Barack Obama (not running, FYI).
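
The counting step is roughly the sketch below; the tweet file, candidate keywords, and the tiny sentiment word lists are placeholders for whatever lexicon you prefer:

# Rough sketch of the volume/polarity counts. "tweets.csv" and the word lists are
# placeholders; a real published positive/negative lexicon would be better.
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)$text

candidates <- c("clinton", "sanders", "webb", "omalley", "chafee")
positive   <- c("win", "won", "great", "strong", "best")
negative   <- c("lost", "weak", "boring", "worst", "snore")

score_tweet <- function(txt) {
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", txt)), "\\s+"))
  sum(words %in% positive) - sum(words %in% negative)
}

scores <- sapply(tweets, score_tweet)

for (cand in candidates) {
  hits <- grepl(cand, tolower(tweets))
  cat(cand, ": ", sum(hits), " tweets, ",
      round(100 * mean(scores[hits] > 0), 1), "% positive\n", sep = "")
}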



Oh, and a wordcloud demonstrating talk after the debate.  Note that only Clinton, Sanders, and Webb make the cloud.


TOPIC MODEL

So what topics were talked about following the debate?  I ran a quick Correlated Topic Model to determine the topics.
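
Fitting a correlated topic model takes only a few lines with the tm and topicmodels packages; this is a generic sketch rather than my exact preprocessing:

# Generic CTM sketch using tm + topicmodels; preprocessing is simplified, and
# "tweets.csv" is a placeholder for the actual tweet sample.
library(tm)
library(topicmodels)

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)$text

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]      # drop documents left empty by cleaning

ctm_fit <- CTM(dtm, k = 4)                     # four correlated topics
terms(ctm_fit, 8)                              # top terms per topic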




TOPIC 1: People talking about facts of the debate generally the morning after, focusing on who "won" the debate.  
TOPIC 2: Topic focused on people talking about Bernie Sanders saying he was tired of hearing about Hillary Clinton's emails.
TOPIC 3: This is a topic focused on what is being referred to as the "Warren Wing" (Elizabeth Warren) of the party (warrenw is the stemmed hashtag).  Bernie Sanders is largely seen as the most Warren-esque candidate.
TOPIC 4: This is the Hillary focused topic, centered around Hillary's performance, and CNN all-but declaring her the winner of last night's debate.


WHO WON?


Then I looked to see if any candidate names were disproportionately associated with the words "won" and "winner," to declare a winner in this debate (obviously this isn't a great predictive strategy, but it's at least an amusing way to see who is most associated with winning the debate in social media).  Here's what I got:
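
Checking those associations against the document-term matrix from the topic model sketch above is a single tm call; the correlation cutoff is arbitrary and worth tuning:

# Word associations against the dtm built in the topic model sketch above.
library(tm)
findAssocs(dtm, terms = c("won", "winner"), corlimit = 0.05)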



Senator Warren's name comes up here once again; many people are pointing out, using similar language, that the ideas presented were largely Elizabeth Warren's.  But she's not running, so do any actual candidates come up?  The name Bernie does, but the word "snore" precedes it in the ranking.  For the word "winner," "unclear" is a top word, which matches a fairly broad consensus in the media as well: it's unclear who the actual winner of last night's debate was.