Tuesday, September 29, 2015

Voter Suspension Demographics

I have blogged a bit about election issues in Kansas, specifically Beth Clarkson's claims of election fraud.  There's another issue in the news though, a move to purge voters who haven't provided proof of citizenship from the registration lists.  The move would begin purging voters from the 30,000+ voter suspense list if they haven't provided proof of citizenship within 90 days.  

I won't provide a lot of context on this issue (the article I linked to, plus a lot of commentary on the internet, does this for me), but politically it is seen as favoring Republicans.  That's because it's believed to disproportionately impact younger, poorer, more heavily minority, and generally more Democratic voters.  But is that true?  And how big is the impact?

We received a copy of the list for analysis this morning.  In this post I am going to create some quick graphics, maps and calculations, with more promised at a later date.


One early question I had about the voter suspense list was: did it impact voters differently in different parts of the state?  I created a metric called suspense rate, which is effectively the number of voters in suspense divided by total registered voters in each county.  Here's the map of that metric:

The map shows a high level of variation in the metric, with some counties having three times the suspense rate as other counties.  There's not a clear pattern that I am picking up on, though some urban and near-urban areas have higher suspense rates.  I ran some quick correlations (will post tomorrow) and found that racial and age demographics correlate with suspense rate, though somewhat weakly.  I suspect that many factors can drive this, including local election policies and procedures.
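The suspense-rate metric described above is simple division; here is a minimal sketch in Python, with made-up county figures purely for illustration:

```python
# Suspense rate: voters in suspense / total registered voters, per county.
# All county figures below are hypothetical, for illustration only.
counties = {
    "Sedgwick": {"suspense": 4200, "registered": 260000},
    "Johnson":  {"suspense": 5100, "registered": 380000},
    "Douglas":  {"suspense": 1300, "registered": 70000},
}

def suspense_rate(county):
    return county["suspense"] / county["registered"]

# Sort counties from highest to lowest suspense rate.
for name, c in sorted(counties.items(), key=lambda kv: -suspense_rate(kv[1])):
    print(f"{name}: {suspense_rate(c):.2%}")
```

Sorting by the rate makes the county-to-county variation easy to eyeball before mapping it.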


From the Wichita Eagle we know that younger voters are impacted more, but what does the age distribution of suspended voters look like?  Here it is by party.

*Happy moment: Excel auto-assigned Democrats as blue and Republicans as red.

That's a steep curve, and there are a lot of people under 25 on the suspense list.   But how does that compare to general population demographics?

In the above graph I broke the suspense list and census data down by age bucket.  Essentially, 2.2% of Kansans between 20 and 24 are on the suspense list, while only 0.2% of those over 65 are on the list.  Impact?  If you're between 20 and 24, you are more than 10 times as likely to be on the suspense list as a senior citizen.
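The comparison above is just two rates and a ratio; a quick sanity check in Python:

```python
# Share of each age bucket on the suspense list (figures cited above).
rate_20_24 = 0.022   # 2.2% of Kansans aged 20-24
rate_65_up = 0.002   # 0.2% of Kansans 65 and older

relative_risk = rate_20_24 / rate_65_up
print(f"A 20-24 year old is {relative_risk:.0f}x as likely to be on the list")
```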


We know that younger people tend to be more liberal, and that opponents of citizenship verification laws tend to be Democrats.  But what can we infer from the actual impact?  I acquired total voter registration numbers from the Secretary of State's office and compared them to the voter suspense list.  Here are the numbers:

The thing that sticks out is that unaffiliated voters make up a larger portion of the suspense list than of all registered voters, likely because younger voters have weaker political ties.  But as far as party impact, there are far fewer Republican voters on the suspense list than on the total registration list.  Once again, this could be largely driven by age demographics, but the net is a 6,000-voter differential favoring Republicans in elections.  Oh, and here's a "pretty" version:


Just a couple of points here:
  • Voters on the suspense list are generally younger, more unaffiliated (moderate?) than the general population, and less Republican.
  • Tomorrow I will look at racial demographic trends, as well as investigate if age is all that is driving the party differences in the suspense list.

Monday, September 28, 2015

Will University of Kansas Football Win Any Games This Season?

My dual personalities have two responses to the question in our title.

Levi the KU Football Fan: Absolutely not. They will lose every conference game, most by 3+ touchdowns, and be the embarrassment of the Big 12 this year. 95% sure of this.

Levi the Statistician: 47% chance they will win a conference game.
Why the different responses?  Two real reasons: the cynicism of a KU Football fan and the way low-probability aggregation works.


I became a KU Football fan after arriving on campus to start graduate work in Fall 2003.  KU Football wasn't very good compared to my undergraduate alma mater Kansas State University, but the games were more fun to watch, for a few reasons:

  • There's a hill on campus overlooking the stadium, so you don't have to pay to get into games (beer allowed).
  • The team wasn't expected to be very good, so everyone just had a good time and good food no matter what was happening in the game.
  • They celebrated "small victories."  In one of the first games I watched, they tore down the goal posts and threw them in a lake because they beat the 23rd-ranked team in the country.
Don't get me wrong, I still follow and enjoy K-State Football, but I also see why KU Football is fun: low expectations.  Recently though, expectations have hit rock bottom.

A quick overview of KU Football over the past few years.  In the mid-2000s (2006-2008) the team got really good.  As in Orange Bowl champions, finishing the season ranked in the top 10, sending pro-bowl caliber players to the NFL good.  Then the team had a couple of rough (read: normal KU Football) years, and the coach was fired for supposedly being abusive to players.

This leads us to the current day in KU Football.  The last five years have seen three coaches, two seasons with no conference wins, and generally embarrassing play.  Rather than relive that nightmare with description here, I'll just post the conference results of the past five years.

Yeah. 3-41.  Ouch.


To think about the probabilities of a winless season, we have to look at the mechanics of a college football season.  The above chart is just the conference schedule, but there's also a non-conference schedule.  In prior years, KU would win one of their three non-conference games, which are generally easier, often against opponents outside Division 1 FBS.  The problem is, this year KU has already lost all of its non-conference games, so we can focus on the conference schedule now.

To calculate the actual probability of a winless season, all we really need to do is calculate the probability of losing the last nine games consecutively.  But to know that, we need to know the probability of KU winning each game.  There's not a great way to do that, because the team wins so few games that the probability of winning any individual game is unknown.  The team is more likely to beat Iowa State than Oklahoma, but how much more likely?  At this point, KU winning a game is more a matter of random luck, or the other team having a tragically bad day.

The futility of the KU Football program over the past five seasons makes conference wins look like matters of random luck.  What would be a reasonable per-game win probability estimate?  Hard to say, but we have the historical win percentage in recent conference games to use: 3/44 = .068 (see chart above).


Now that we have a good estimate of per-game probabilities, how do we extrapolate that out over the rest of the season?  Probability aggregation.  Here's how that works.  The probability of the opponent winning any single game is 1 - .068 = 93.2%, so the probability of KU losing EVERY remaining game is the product of those nine probabilities: (.932) ^ 9, or approximately 53%.  The chart below shows how that probability will increase over the season if they continue losing each game.
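The aggregation is a one-line calculation; a quick check in Python, using the 3/44 historical win rate:

```python
# Probability of KU losing all nine remaining conference games, assuming
# independent games with a constant per-game win probability of 3/44 (~.068).
p_win = 3 / 44
games_left = 9
p_lose_out = (1 - p_win) ** games_left
print(f"P(winless in conference): {p_lose_out:.1%}")      # ~53%
print(f"P(at least one win):      {1 - p_lose_out:.1%}")  # ~47%
```

The independence assumption is doing a lot of work here, but for a back-of-the-envelope estimate it is good enough.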

But wait: KU has a 47% chance of winning a game this season, but they will be underdogs by roughly 93%-7% in each game?  How can that be?  This is a part of aggregate sports probabilities that gives some people trouble.  It's difficult to imagine that individual events with such low odds (KU Football victories) will ever "occur" in real life, but over enough repetitions, probability theory shows us they will.  An extreme non-sports example would go something like this:

Set up an experiment where people walk across a field and you record whether they get struck by lightning.  The odds of any one person being struck will be impossibly low, let's say 1 in 500,000.  So the odds of "winning" a single trial would be impossibly low, but repeat that experiment 500,000 times, and the probability of at least one "successful" lightning strike increases to 63%.
Here's what that looks like.
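The lightning numbers can be verified the same way:

```python
# 500,000 independent walks across the field, each with a 1-in-500,000
# chance of a strike.
n = 500_000
p_strike = 1 / n
p_at_least_one = 1 - (1 - p_strike) ** n   # approaches 1 - 1/e as n grows
print(f"P(at least one strike): {p_at_least_one:.1%}")   # ~63.2%
```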


Looking back this is a lot of weird work to do on a college football team.  But a couple of takeaways:

  • The University of Kansas football team is so bad that we are now only able to talk about them winning in low-probability aggregation terms.
  • The University of Kansas football team has a 53% chance of going winless this year.

Friday, September 25, 2015

Kansas Election Fraud: Part 7

Big news, Beth Clarkson finally engaged with me in debating her statistical analysis.  Oddly enough, the discussion occurred on Esquire.com's forums.  Yes, in the forums of a "men's" magazine. Weird.

I would describe Clarkson's argument online as follows: 
  • a bizarre focus on a null hypothesis test.
  • an admission that she hasn't looked that deeply into the statistics that are the basis of a lawsuit.
  • a refocus on principles of open government, and that she needs access.
More on the Esquire conversation later, but first, some updates on data.


Sometimes I blog about data issues and don't adequately explain them to my lay audience.  One Twitter commentator has even said I should start a series on basic statistics so people can understand these types of things.  Here's an example of recent commentary of mine that possibly wasn't the best explanation:

One quick side note.  There's something else that increases correlation when we aggregate results.  Because the majority of super large precincts are in Sedgwick County, it gives leverage to some of these precincts.  And because all-in Wichita is a more conservative region than Johnson County, that leverage serves to increase the correlation, though due to no nefarious or unexplained phenomena.    
The biggest problem here is that the concept of leverage, as well as "mix" (i.e., the mix of counties), needs to be shown graphically.  Luckily, this is easy in the ggplot2 R library.  Here is a scatter of Clarkson's correlation with counties color-coded.  In this you notice the more liberal counties tend to have mid-to-large precincts (Wyandotte, Douglas, even Johnson County) while more conservative counties (Sedgwick, other rural) make up a preponderance of the largest precincts.  This enhances Clarkson's correlation when counties are combined, simply due to the mix of counties, not in-county nefarious action by voting machines.


I'm still working on this data.  To be honest the OCR worked horribly, and I received no adequate response from the County.  I sent a followup email to the Secretary of State's office this morning.  I will post their response here.


I am not the only person working on this.  Some of the better conversation regarding this topic has occurred over on DailyKos (I know, I know), with several "diaries" devoted to the subject.  One of the better ones recently was by user HudsonValleyMark, exploring the same correlations I reviewed.  Specifically, his work draws the correlation back to the original voter registration data.

What does that mean?  It means that party affiliation at registration is also correlated to total number of voters, long before we get to the voting machine.  I suggest reading his work, but I also validated his work using Johnson County data in 2004, see chart below, first validating Clarkson's correlation, then replicating HudsonValleyMark's work.


Remember my three points from earlier in this post on Beth Clarkson's arguments in the Esquire forum?  Let's revisit and go through those one by one.  Here's her final response to me, which encompasses all three arguments:

We seem to be in agreement that the null in my case isn't true. I disagree that it invalidates my work because I feel the cause is what is under debate. Your suggestion of assuming a particular prior distribution may or may not be appropriate. I haven't looked at it deeply enough to know for sure. In short, I'm agreeing that you could well be right about that. 
That our electronic voting machines are eminately hackable and have no post-election audit procedures in place are established facts and are equally concerning to me. Do you diagree about that aspect? Are you satisfied with assuming a distribution that fits the pattern? Or do you agree that our voting system should be (but isn't) transparent enough for citizens to feel confident that the results are accurate?
Here's my response one by one:
  1. On the NULL case not being true.  I agree with Clarkson that we can "reject the null hypothesis"; in fact, in my first post on the subject (and above in this post) I replicated her results.  But all Clarkson is saying by claiming the null case is false is that she found a non-zero correlation.  I agree, there is a non-zero positive correlation, but if we dive deeper, why are we testing a null hypothesis?  And if we can reject it, have we done the research to say that there aren't reasonable alternate explanations (I have, and there are)?  Keep in mind my prior work on this subject, which showed that demographic and precinct-creation factors create this correlation.  In essence, rejecting the null hypothesis here is in no way meaningful, because it only tests the false assumption that there should be no correlation.  That has been the point of this blog's work on the subject: the null hypothesis is irrelevant.  For more information on the flaws in null hypothesis testing, see here from Nate Silver.
  2. On her admission that she hasn't looked deeply into this.  She concedes that I may be right.  A lot of thoughts here.  Essentially, she has been threatening lawsuits and doing newspaper interviews over something she hasn't deeply reviewed.  She also said earlier in her comments that she hasn't had access to demographic or mapping data.  I have been able to compile that data, usually in a matter of minutes, whenever I have wanted to look at it.  Access to data is easy, and it's the job of a modern statistician or data scientist to acquire it and test their work; that's due diligence.  Effectively, she admits she's done less work on the subject than I have, and that I may be right.
  3. On open government concerns.  I have always agreed with her on this concern, on this blog, and publicly, multiple times.  I have also offered to help, if I can, should she get access to that data.
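To make point 1 concrete, here is a small simulation sketch (all numbers invented for illustration): when precinct size and Republican vote share are both driven by a common suburbanization factor, a strong positive correlation appears with no voting-machine tampering anywhere in the data.

```python
import random

random.seed(1)

# Simulate precincts where size and Republican share are BOTH driven by a
# common factor ("suburban-ness"); no tampering anywhere in the data.
precincts = []
for _ in range(500):
    suburban = random.random()                     # 0 = urban core, 1 = far suburb
    size = 300 + 1200 * suburban + random.gauss(0, 100)
    rep_share = 0.35 + 0.25 * suburban + random.gauss(0, 0.05)
    precincts.append((size, rep_share))

def pearson_r(pairs):
    """Plain-Python Pearson correlation coefficient."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

r = pearson_r(precincts)
print(f"r = {r:.2f}  (strongly positive, with zero fraud in the data)")
```

Rejecting "r = 0" on data like this tells you nothing about fraud; it only tells you the two variables share a driver.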


Quick summary of what we've talked about in this blog post:
  • I gave you a little better view into the "mix" and leverage issues that enhance Clarkson's correlation, though they are not indicative of fraud.
  • Shawnee County is still living in the data dark ages.
  • HudsonValleyMark's work over on DailyKos (which I validated) demonstrates another way to disprove that the Clarkson correlation is related to fraud.
  • Finally, in my argument with Clarkson over on Esquire.com, she admits she hasn't looked into this issue deeply, and that my analysis may be correct.

Tuesday, September 22, 2015

Daraprim: Price Increase or Leveraged Financial System?

Going a bit off topic for this site today, but it is generally business-finance related, so still relevant to this blog, generally speaking. 

Yesterday, on every social media site, I saw one dominant story.  Specifically this one: 

Always interested in the reasons behind business decisions, I watched a video of the CEO.  Generally, I had a bit different reaction.  To get started, here's the video:


Rather than just react from the hip regarding the terms "Hedge Fund Manager" and 5,500% price increase and AIDS (or a name that starts with four consecutive consonants) I thought I would look a little more into the CEO's argument.  Here's what it breaks down to:

  1. The drug is a low-demand, rare-disease drug, historically under-priced.  It was priced far below its peers (other rare-disease drugs) and was effectively not profitable, largely due to licensing and other back-end non-production costs.  Key comment: it was only $1,000 to save your life, which is a lot more valuable than that.
  2. We are changing the service model.  Here he's making some claims that they will provide a higher level service, and better serve the needs of the customer.  Uses term "dedicated patient services."  Also more R&D to help patients have access to a better drug. (read: blah blah blah, I have a business plan, but we want to make this profitable now to access that)
  3. We also offer pro-bono services of the drug.  This is key. Here he argues that the drug will be offered for free to patients who can't afford to pay.  Also mentions co-pay assistance programs for people who can't afford to pay insurance co-pays, and "even if we're having a disagreement with the insurer, we'll send them drug for free until that ..."


I'll just address the CEO's points one by one.
  1. It's completely fathomable that the drug was under-priced to the point of being unprofitable.  Especially if it was owned by large pharmaceutical companies where it was an ignored net-neutral accounting line. It's also completely reasonable to want to test price increases (or decreases) to a profit maximizing standard. I actually do this quite regularly, the simplified assumption being that you're balancing price sensitivity and profit and finding a profit maximizing equilibrium.  But in normal circumstances slow, incremental pricing changes are more telling for a model, and also safer from a revenue perspective.  This is a giant pricing change, why?  
  2. This sounds like CEO BS.  So there's this scene in Halt and Catch Fire where John Bosworth just got out of prison and Cameron re-hires him, and basically says "I don't know just do CEO stuff."  CEO stuff is what this sounds like.  He's trying to talk about his plan for the business, and he may have every intention of R&D and future "dedicated patient services" but at this point, this company is probably just selling the same old product: daraprim.
  3. The pro-bono services are telling.  What's most telling is "working with patients" through insurance problems and copay assistance.  Effectively, they are communicating that this price increase is designed AROUND an insurance system (and rich people, but mostly the insurance system).  Once something becomes part of an easily financeable system (like insurance, or financing generally; think mattresses over the past 20 years, or easy mortgages 2002-2007), you increase customers' ability to pay and thus change the functional demand curve/price sensitivity of customers.  Under this view, my final point: the CEO's rhetoric tells us that he is leveraging higher pricing against the financial-insurance system, and effectively betting on the ability to extract large mid-term profits from it.  The insurance system, as it exists, enables this type of cost increase by giving *ordinary* people *extraordinary* ability to pay for effectively one-time services.  A fairly classic perverse-incentives/moral-hazard problem.

Side note.  At what point did the term "Hedge Fund Manager" become derogatory?  Like, this person was trusted with millions (potentially billions) in assets from high-net-worth individuals, and that somehow is an indictment of character?  Are we getting "Hedge Funds" confused with "Trust Funds"?  Certainly it would be negative if it were a Ponzi scheme or other financial scam, but at this point we're only dealing with a pricing change.

Monday, September 14, 2015

Introduction to Creating GIS Maps

I'm not a huge data visualization guy, but a few of my recent posts have used mapping technology for illustrative purposes. I've received some questions on how to create the maps, or how to get started with mapping technology.  The process is actually quite straightforward, so I thought I would provide a simple tutorial to get us started.


  • First acquire and install the GIS software you want to use.  I've posted on this before, but I like QGIS, which can be freely downloaded here.
  • Next acquire the map data that you want to use.  I suggest starting with Shapefiles, which are an industry standard format.  These provide the geographical data that can create the map as well as some attributes about each unit on the map.  Some examples of where to acquire this data include the Census Bureau, DASC, or local governments.  
  • Unpack the shapefiles.  You can open the folder and take a look at the files inside.  The most interesting is the *.dbf file, which can be opened in Excel and gives the additional attributes for each unit.  So, for instance, if you downloaded a shapefile of Kansas counties, the dbf file would include attributes such as census population and demographic breakdowns for each county.
  • Import the data into QGIS.  Here you use the "add a vector layer" function in QGIS.  See screenshot below, and click button.
  • Choose Browse, and navigate to the shapefile directory.  In the directory, choose the *.shp file, then click Open, and Open again.

  • Your shapefile should import, and look something like this.  By default mine imported as hot pink.  Not my choice.

  • The shapefile now shows up in the "layers" window, as something looking like the image below.  Right click on the name of the newly created layer in this box, and click on properties.

  • On the dialogue box, click on Style.  This allows you to play around with the styling of the data shown on the map.  You can choose different types of scales, a column from the shapefile to vary on, the number of classes, and a variety of color ranges.  I chose to use a graduated scale, on the column "white", using the color range "blues" and five classes.  When you've chosen what you want to do, click "Classify", "Apply", and "OK."

  • And your finished map.

This tutorial is designed to get you started using GIS technology through an easy interface, and simple first map.  There are virtually limitless possibilities of other things that you can do, from custom calculated fields to selecting specific elements to overlaying additional layers and geospatial queries.  There's even a module for advanced statistical and predictive analysis.  Once you get the basics of this tutorial down, you can explore forums and help pages on the internet, or simply reply to this post with additional questions.

Wednesday, September 2, 2015

Kansas Election Fraud: Part 6 Sedgwick County Suburbs

And yet another post in my continuing series on Kansas election fraud.  Why do I keep posting on this?  First, there is still a lot of interest in the media: each time I open social media, and occasionally when looking at local news, this story seems to pop up.  Second, because no one else is doing due diligence on the numbers, and this type of strategic trend information may be useful in understanding both how our democracy works and what it takes to win elections.  Today I will cover three subjects:
  • Data Availability (I have a huge gripe here)
  • Sedgwick County Mapping
  • Bucketization Analysis


Someone needs to call Shawnee County, Kansas and let them know it's freaking 2015.  I have a device on my wrist that tracks my steps and sleep and syncs that data to my phone, which I can then dump to a MySQL database via API to analyze my activity level hourly.  I analyze a data warehouse with 50+ terabytes of data.  I have code that can download tweets, turn text data into numeric analyzable data, model that data, and return topics, visualizations, and sentiments, all in about 20 seconds.  I can take precinct results data, join it to geospatial map data (freely available online), and create visualizations of spatial voting patterns.  Big data, numbers, are everywhere.  Except:

Shawnee County Kansas can't provide me a numeric format of their 2014 by-precinct election results.  
I have the results from every other county, either in an Excel document from the Secretary of State's office or from a digital online format (SG) or PDF's with text meta data that allows me to easily scrape the underlying data (WY, JO). Yesterday I contacted the Shawnee County elections office to ask for some kind of numeric format (excel, pdf with selectable text, anything) of the precinct data.

No dice.  The response I received back was that the only existing form of this data is PDF (no selectable text meta data) or paper. Nothing in excel, nothing analyzable.  Yes I know I could OCR the PDF, and I've started doing that, though it's not a high quality PDF, so it produces a lot of errors.

While I don't agree with Beth Clarkson's conclusions, I can see where she and the people who agree with her are coming from.  It feels as though the system was not designed to be analyzed after elections.


In my last blog post, I found that Sedgwick County also demonstrated "Clarkson's Correlation," where larger precincts trended more Republican.  I wondered if the same visualization technique applied to Johnson County could be applied to Sedgwick County.  The answer was yes.

First, a look at Sedgwick County voting patterns by precinct.  Blue (Davis-favoring) precincts are in the center city, while the suburbs and outer rural areas tend more Republican, as expected.

Now on to our overlay of precincts by size.  There are a lot of 500+ voter precincts in Sedgwick County, but the largest of those are not in the center city; they are in the suburban ring.  This is an area we know to be, overall, whiter, more affluent, and more Republican-leaning than the center city.

All of this is additional complementary evidence for my prior posts on Clarkson's theory: that it is effectively based on a broken a priori notion, that past 500 voters there should be no correlation between precinct size and % of vote Republican.  The specific reason it is broken is that precinct creation was not random; in fact, suburbanization caused the largest of the precincts to be in whiter, richer, and more Republican-leaning areas.

But I have only demonstrated this for Sedgwick and Johnson Counties, so how much do those two counties actually matter?


Let's take a deeper look into large precincts.  An easy way is to break precincts into buckets by size, and talk about them in this way.  Here are the size buckets I am using:

  • Regular Precincts: 0-500 voters (Clarkson Ignored These)
  • Large Precincts: 500-1000 voters
  • Super-Large Precincts: 1000+ voters
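As a sketch, the bucketing can be written as a simple function (treating 500 and 1000 as the lower bounds of the next bucket up, an assumption since the ranges above touch at the boundaries):

```python
# Bucket precincts by registered-voter count, using the cutoffs above.
def size_bucket(voters):
    if voters < 500:
        return "Regular"
    if voters < 1000:
        return "Large"
    return "Super-Large"

# Hypothetical precinct sizes, for illustration.
for s in [120, 480, 505, 999, 1000, 2400]:
    print(s, size_bucket(s))
```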

So, first, how did Brownback do by each size-grouping of precincts?  Here's a chart:

This chart actually backs up Clarkson's correlation.  Effectively Brownback did best in regular and super-large precincts.  The fact that he did better in super-large precincts than large precincts is the exact correlation that Clarkson is talking about.  This is just another validation that the correlation exists.

But how much do suburbanization patterns in JoCo and Sedgwick County matter in this?  A lot.  A series of pie charts.  First, JO/SG make up only 14% of the regular sized precincts.
But they make up almost two thirds of large precincts.  
And they make up 97% of super-large precincts, with 66% of those being in Sedgwick County.

If we look at Clarkson's analysis, over two-thirds of the sample can be attributed to Johnson or Sedgwick County, where we know her a priori assertion is broken.  Moreover, when we run the correlation on the other third, we see no correlation.  The effect is only observable in urban/suburban counties.  Effectively: Sedgwick and Johnson Counties are all that matter to the observed correlation.  Here's an R output for the other 101 counties:

One quick side note.  There's something else that increases correlation when we aggregate results.  Because the majority of super large precincts are in Sedgwick County, it gives leverage to some of these precincts.  And because all-in Wichita is a more conservative region than Johnson County, that leverage serves to increase the correlation, though due to no nefarious or unexplained phenomena.  


  • Shawnee County: GET. WITH. THE. PROGRAM.
  • Sedgwick County: Though much different from Johnson County, its suburbanization produced a similar result: the largest precincts are in the suburbs.  This subverts Clarkson's a priori assumption of stochastic creation of precincts.
  • Bucketization: An interesting illustration of how Brownback did well in very large precincts, which are mostly located in Johnson and Sedgwick Counties.