Tuesday, September 29, 2015

Voter Suspension Demographics

I have blogged a bit about election issues in Kansas, specifically Beth Clarkson's claims of election fraud.  There's another issue in the news though: a move to purge voters who haven't provided proof of citizenship from the registration lists.  Under the proposed rule, voters on the 30,000+ name suspense list would be purged if they don't provide proof of citizenship within 90 days.

I won't provide a lot of context on this issue (the article I linked to, plus a lot of commentary on the internet, does this for me), but politically it is seen as favoring Republicans.  That's because it's believed to disproportionately impact younger, minority, poorer, and generally more Democratic voters.  But is that true?  And how big is the impact?

We received a copy of the list for analysis this morning.  In this post I am going to create some quick graphics, maps and calculations, with more promised at a later date.


One early question I had about the voter suspense list was: does it impact voters differently in different parts of the state?  I created a metric called suspense rate, which is simply the number of voters in suspense divided by the total registered voters in each county.  Here's the map of that metric:

The map shows a high level of variation in the metric, with some counties having three times the suspense rate of others.  There's not a clear pattern that I am picking up on, though some urban and near-urban areas have higher suspense rates.  I ran some quick correlations (will post tomorrow) and found that racial and age demographics correlate with suspense rate, though somewhat weakly.  I suspect that many factors can drive this, including local election policies and procedures.
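For the curious, the suspense-rate metric is simple to compute.  Here's a minimal Python sketch; the county names and counts are made up for illustration, since the real calculation ran against the actual suspense and registration lists:

```python
# Suspense rate = voters in suspense / total registered voters, per county.
# County names and counts below are hypothetical.
counties = {
    "CountyA": {"suspense": 1200, "registered": 60000},
    "CountyB": {"suspense": 300, "registered": 45000},
}

rates = {name: c["suspense"] / c["registered"] for name, c in counties.items()}

# Print counties from highest to lowest suspense rate.
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.1%}")
```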


From the Wichita Eagle we know that younger voters are impacted more, but what does the age distribution of suspended voters look like?  Here it is by party.

*Happy moment: Excel auto-assigned Democrats as blue and Republicans as red.

That's a steep curve, and there are a lot of people under 25 on the suspense list.   But how does that compare to general population demographics?

In the above graph I broke the suspense list and census data down by age bucket.  Essentially, 2.2% of Kansans between 20 and 24 are on the suspense list, while only 0.2% of those over 65 are.  The impact?  If you're between 20 and 24, you are roughly 10 times more likely to be on the suspense list than a senior citizen.
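The quick arithmetic behind that claim, using the rates from the graph:

```python
# Relative likelihood of being on the suspense list, by age bucket.
rate_20_24 = 0.022  # 2.2% of Kansans aged 20-24 are on the list
rate_65_up = 0.002  # 0.2% of Kansans 65 and over are on the list

relative_risk = rate_20_24 / rate_65_up
print(round(relative_risk))  # roughly 10-11x more likely
```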


We know that younger people tend to be more liberal, and that opponents of citizenship verification laws tend to be Democrats.  But what can we infer from the actual impact?  I acquired total voter registration numbers from the Secretary of State's office and compared them to the voter suspense list.  Here are the numbers:

The thing that sticks out is that unaffiliated voters make up a larger portion of the suspense list than of all registered voters, likely because younger voters have weaker political ties.  As for party impact, Republicans make up a far smaller share of the suspense list than of the total registration list.  Once again, this could be largely driven by age demographics, but the net is a 6,000-voter differential favoring Republicans in elections.  Oh, and here's a "pretty" version:


Just a couple of points here:
  • Voters on the suspense list are generally younger, more unaffiliated (moderate?) than the general population, and less Republican.
  • Tomorrow I will look at racial demographic trends, as well as investigate if age is all that is driving the party differences in the suspense list.

Monday, September 28, 2015

Will University of Kansas Football Win Any Games This Season?

My dual personalities have two responses to the question in our title.

Levi the KU Football Fan: Absolutely not. They will lose every conference game, most by 3+ touchdowns, and be the embarrassment of the Big 12 this year. 95% sure of this.

Levi the Statistician: 47% chance they will win a conference game.
Why the different responses?  Two real reasons: the cynicism of a KU Football fan and the way low-probability aggregation works.


I became a KU Football fan after arriving on campus to start graduate work in Fall 2003.  KU Football wasn't very good compared to my undergraduate alma mater Kansas State University, but the games were more fun to watch, for a few reasons:

  • There's a hill on campus overlooking the stadium, so you don't have to pay to get into games (beer allowed).
  • The team wasn't expected to be very good, so everyone just had a good time and good food no matter what was happening in the game.
  • They celebrated "small victories."  In one of the first games I watched, they tore down the goal posts and threw them in a lake because they beat a 23rd ranked team in the country.  
Don't get me wrong, I still follow and enjoy K-State Football, but I also see why KU Football is fun: low expectations.  Recently though, expectations have hit rock bottom.

A quick overview of KU Football over the past few years: in the mid-2000s (2006-2008) the team got really good.  As in Orange Bowl champions, finishing the season ranked in the top 10, sending Pro Bowl-caliber players to the NFL good.  Then the team had a couple of rough (read: normal KU Football) years, and the coach was fired for supposedly being abusive to players.

This leads us to the current day in KU Football.  The last five years have seen three coaches, two seasons with no conference wins, and generally embarrassing play.  Rather than relive that nightmare with description here, I'll just post the conference results of the past five years.

Yeah. 3-41.  Ouch.


To think about the probabilities of a winless season, we have to look at the mechanics of a college football season.  The above chart is just the conference schedule, but there's also a non-conference schedule.  In prior years, KU would win one of their three non-conference games, which are generally easier, against non-Division 1 FBS opponents.  The problem is that this year KU has already lost all of its non-conference games, so we can focus on the conference schedule.

To calculate the actual probability of a winless season, all we really need to do is calculate the probability of losing the last nine games consecutively.  But to know that, we need to know the probability of KU winning each game.  There's not a great way to do that: the team wins so few games that the probability of winning any individual game is essentially unknown.  The team is more likely to beat Iowa State than Oklahoma, but how much more likely?  At this point, KU winning a game is more a matter of random luck, or the other team having a tragically bad day.

The futility of the KU Football program over the past five seasons makes conference wins look like matters of random luck.  What would be a reasonable per-game probability estimate?  Hard to say, but we have the historical win percentage in recent conference games to use: .068, or 3 wins in 44 games (see chart above).


Now that we have an estimate of per-game probabilities, how do we extrapolate that out over the rest of the season?  Probability aggregation.  Here's how that works: the probability of KU losing EVERY remaining game is the product of the probabilities of their opponents winning each game.  Each opponent wins with probability 1 - .068 = 93.2%, so the aggregate probability of KU losing out is (.932) ^ 9, or approximately 53%.  The chart below shows how that probability will increase over the season if they continue losing each game.
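The whole calculation fits in a few lines of Python.  The .068 comes from the 3-41 conference record above, and independence between games is the (strong) simplifying assumption:

```python
# Chance KU loses all 9 remaining conference games, assuming each game is
# independent with a constant historical win probability.
p_win = 3 / 44               # ~.068, from the 3-41 conference record
p_lose_one = 1 - p_win       # ~.932, the opponent's per-game win probability
p_winless = p_lose_one ** 9  # probability of losing nine straight

print(f"lose out: {p_winless:.0%}")              # ~53%
print(f"win at least one: {1 - p_winless:.0%}")  # ~47%
```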

But wait, KU has a 47% chance of winning a game this season, but they will be underdogs by roughly 93%-7% in each game?  How can that be?  This is a part of aggregate sports probabilities that gives some people trouble.  It's difficult to imagine that individual events with such low odds (KU Football Victories) will ever "occur" in real life, but over time, probability theory shows us it will.  An extreme non-sports case example would go something like this:

Set up an experiment where people walk across a field and you record whether they get struck by lightning.  The odds of each person being struck by lightning will be impossibly low, let's say 1 in 500,000.  So the odds of "winning" any single trial would be impossibly low, but repeat that experiment 500,000 times, and the probability of at least one "successful" lightning strike increases to about 63%.
Here's what that looks like.
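And the lightning arithmetic, as a sketch:

```python
# Probability of at least one strike across 500,000 independent trials,
# each with a 1-in-500,000 chance.
p_single = 1 / 500_000
trials = 500_000

p_at_least_one = 1 - (1 - p_single) ** trials
print(f"{p_at_least_one:.0%}")  # about 63%, approaching 1 - 1/e
```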


Looking back this is a lot of weird work to do on a college football team.  But a couple of takeaways:

  • The University of Kansas football team is so bad that we are now only able to talk about them winning in low-probability aggregation terms.
  • The University of Kansas football team has a 53% chance of going winless this year.

Friday, September 25, 2015

Kansas Election Fraud: Part 7

Big news: Beth Clarkson has finally engaged with me in debating her statistical analysis.  Oddly enough, the discussion occurred on Esquire.com's forums.  Yes, in the forums of a "men's" magazine. Weird.

I would describe Clarkson's argument online as follows: 
  • a bizarre focus on a null hypothesis test.
  • an admission that she hasn't looked that deeply into the statistics that are the basis of a lawsuit.
  • a refocus on principles of open government, and that she needs access.
More on the Esquire conversation later, but first, some updates on data.


Sometimes I blog about data issues and don't adequately explain them to my lay audience.  One twitter commentator has even said I should start a series on basic statistics so people can understand these types of things.  Here's an example of my recent commentary that possibly wasn't the best explanation:

One quick side note.  There's something else that increases correlation when we aggregate results.  Because the majority of super-large precincts are in Sedgwick County, those precincts get leverage.  And because, all in, Wichita is a more conservative region than Johnson County, that leverage serves to increase the correlation, though due to no nefarious or unexplained phenomena.
The biggest problem here is that the concept of leverage, as well as "mix" (i.e., the mix of counties), needs to be shown graphically.  Luckily, this is easy with the ggplot2 R library.  Here is a scatter of Clarkson's correlation with counties color-coded.  Notice that the more liberal counties tend to have mid-to-large precincts (Wyandotte, Douglas, even Johnson County) while more conservative counties (Sedgwick, other rural) make up a preponderance of the largest precincts.  This enhances Clarkson's correlation when counties are combined, simply due to the mix of counties, not in-county nefarious action by voting machines.
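If the mix effect is hard to picture, here's a toy Python sketch (all numbers made up) showing how pooling two county groups with different baselines can produce a strong combined correlation even when the within-group correlations are weak:

```python
# Toy demonstration of the "mix" effect: within each county group the
# correlation between precinct size and vote share is weak, but pooling
# counties with different baselines inflates the combined correlation.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# "Liberal" county: mid-sized precincts, lower Republican vote share.
lib_size = [200, 250, 300, 350, 400]
lib_share = [0.40, 0.42, 0.39, 0.41, 0.40]
# "Conservative" county: large precincts, higher Republican vote share.
con_size = [600, 650, 700, 750, 800]
con_share = [0.60, 0.58, 0.61, 0.59, 0.60]

within_lib = pearson(lib_size, lib_share)
within_con = pearson(con_size, con_share)
pooled = pearson(lib_size + con_size, lib_share + con_share)
print(within_lib, within_con, pooled)  # pooled is far larger than either
```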


I'm still working on this data.  To be honest the OCR worked horribly, and I received no adequate response from the County.  I sent a followup email to the Secretary of State's office this morning.  I will post their response here.


I am not the only person working on this.  Some of the better conversation regarding this topic has occurred over on DailyKos (I know, I know), with several "diaries" devoted to the subject.  One of the better ones recently was by user HudsonValleyMark, exploring the same correlations I reviewed.  Specifically, his work draws the correlation back to original voter registration data.

What does that mean?  It means that party affiliation at registration is also correlated to total number of voters, long before we get to the voting machine.  I suggest reading his work, but I also validated his work using Johnson County data in 2004, see chart below, first validating Clarkson's correlation, then replicating HudsonValleyMark's work.


Remember my three points from earlier in this post on Beth Clarkson's arguments in the Esquire forum?  Let's revisit and go through those one by one.  Here's her final response to me, which encompasses all three arguments:

We seem to be in agreement that the null in my case isn't true. I disagree that it invalidates my work because I feel the cause is what is under debate. Your suggestion of assuming a particular prior distribution may or may not be appropriate. I haven't looked at it deeply enough to know for sure. In short, I'm agreeing that you could well be right about that. 
That our electronic voting machines are eminently hackable and have no post-election audit procedures in place are established facts and are equally concerning to me. Do you disagree about that aspect? Are you satisfied with assuming a distribution that fits the pattern? Or do you agree that our voting system should be (but isn't) transparent enough for citizens to feel confident that the results are accurate?
Here's my response one by one:
  1. On the NULL case not being true.  I agree with Clarkson that we can "reject the NULL hypothesis"; in fact, in my first post on the subject (and above in this post) I replicated her results.  But all Clarkson is saying by claiming the null case is false is that she found a non-zero correlation.  I agree, there is a non-zero positive correlation, but if we dive deeper, why are we testing a null hypothesis at all?  And if we can reject it, have we done the research to say there aren't reasonable alternate explanations (I have, and there are)?  Keep in mind my prior work on this subject, which shows that demographic and precinct-creation factors create this correlation.  In essence, rejecting the null hypothesis here is in no way meaningful, because it only tests the false assumption that there should be no correlation.  That has been the point of this blog's work on the subject: the null hypothesis is irrelevant.  For more information on the flaws of null hypothesis testing, see here from Nate Silver.
  2. On her admission that she hasn't looked deeply into this, and her concession that I may be right.  A lot of thoughts here.  Essentially, she has been threatening lawsuits and doing newspaper interviews over something she hasn't deeply reviewed.  She also said earlier in her comments that she hasn't had access to demographic or mapping data.  I have been able to compile that data, usually in a matter of minutes, whenever I have wanted to look at it.  Access to data is easy, and it's the job of a modern statistician or data scientist to acquire it and test their work; that's due diligence.  Effectively, she admits she's done less work on the subject than I have, and admits I may be right.
  3. On open government concerns.  I have always agreed with her on this concern, on this blog, and publicly, multiple times.  I have also offered to help, if I can, should she get access to that data.


Quick summary of what we've talked about in this blog post:
  • I gave you a little better view into the "mix" and leverage issues that enhance Clarkson's correlation, though they are not indicative of fraud.
  • Shawnee County is still living in the data dark ages.
  • HudsonValleyMark's work over on DailyKos (which I validated) demonstrates another way to disprove that the Clarkson correlation is related to fraud.
  • Finally, in my argument with Clarkson over on Esquire.com, she admits she hasn't looked into this issue deeply, and that my analysis may be correct.

Thursday, September 24, 2015

2015 NFL Picks: Week 3

My predictions got killed this week.  Luckily the pundits suffered the same fate.  It was a weird week in the NFL that saw quite a few upsets and just weird games.  It also saw some more significant injuries, including my favorite fantasy football quarterback, Tony Romo, going down.


Last week was bad, really bad.  I went 7-9; historically, the model performed that poorly only twice in the 7 years of data it was trained on.  Luckily though, the pundits picked the same way I did in a few of the major upsets, and my 7-9 was actually above the pundit average at nflpickwatch.com.

That means that while I am now only 17-15 on the season, I actually moved up in the pundit ratings.  Instead of 57 of 135 pundits ahead of me, my model is now only behind 50 pundits.  #winning  To further make my point, my model, developed in May, is outperforming the picks of several NFL greats who are now pundits (looking at you, Boomer Esiason and Tony Gonzalez), who have knowledge of all the personnel moves and performance changes that have occurred since May. Again, #winning.

Anyways,  here's what the pundit results look like after two weeks.


I promised an in-season model this week, but I got a little off-topic with a former hedge-fund manager, so maybe next week.  Here are our picks from our pre-season model:

The Patriots are hugely favored, as well as the 0-2 Seahawks.  As for my Chiefs?  Probably going to lose at Green Bay, even with Jordy Nelson out.  

Tuesday, September 22, 2015

Daraprim: Price Increase or Leveraged Financial System?

Going a bit off topic for this site today, but it is business-finance related, so still generally relevant to this blog.

Yesterday, on every social media site, I saw one dominant story.  Specifically this one: 

Always interested in the reasons behind business decisions, I watched a video of the CEO.  Generally, I had a bit different reaction.  To get started, here's the video:


Rather than just react from the hip regarding the terms "Hedge Fund Manager" and 5,500% price increase and AIDS (or a name that starts with four consecutive consonants) I thought I would look a little more into the CEO's argument.  Here's what it breaks down to:

  1. The drug is a low-demand, rare-disease drug, historically under-priced.  It was priced far below its peers (other rare-disease drugs) and was effectively not profitable, largely due to licensing and other back-end non-production costs.  Key comment: it was only $1,000 to save your life, which is worth a lot more than that.
  2. We are changing the service model.  Here he's making some claims that they will provide a higher level service, and better serve the needs of the customer.  Uses term "dedicated patient services."  Also more R&D to help patients have access to a better drug. (read: blah blah blah, I have a business plan, but we want to make this profitable now to access that)
  3. We also offer pro-bono services of the drug.  This is key. Here he argues that the drug will be offered for free to patients who can't afford to pay.  Also mentions co-pay assistance programs for people who can't afford to pay insurance co-pays, and "even if we're having a disagreement with the insurer, we'll send them drug for free until that ..."


I'll just address the CEO's points one by one.
  1. It's completely fathomable that the drug was under-priced to the point of being unprofitable, especially if it was owned by large pharmaceutical companies where it was an ignored net-neutral accounting line.  It's also completely reasonable to want to test price increases (or decreases) against a profit-maximizing standard.  I actually do this quite regularly, the simplified assumption being that you're balancing price sensitivity and profit and finding a profit-maximizing equilibrium.  But in normal circumstances slow, incremental pricing changes are more telling for a model, and also safer from a revenue perspective.  This is a giant pricing change. Why?
  2. This sounds like CEO BS.  So there's this scene in Halt and Catch Fire where John Bosworth just got out of prison and Cameron re-hires him, and basically says "I don't know just do CEO stuff."  CEO stuff is what this sounds like.  He's trying to talk about his plan for the business, and he may have every intention of R&D and future "dedicated patient services" but at this point, this company is probably just selling the same old product: daraprim.
  3. The pro-bono services are telling.  What's most telling is "working with patients" through insurance problems and copay assistance.  Effectively, they are communicating that this price increase is designed AROUND an insurance system (and rich people, but mostly the insurance system). Once something becomes part of an easily financeable system (like insurance or financing; think mattresses over the past 20 years, or easy mortgages 2002-2007), you increase the customer's ability to pay and thus change the functional demand curve/price sensitivity of customers.  Under this view, my final point here: the CEO's rhetoric tells us that he is leveraging higher pricing against the financial-insurance system, and effectively betting on the ability to extract large mid-term profits from it.  The insurance system, as it exists, enables this type of cost increase by giving *ordinary* people *extraordinary* ability to pay for effectively one-time services.  A fairly classic perverse incentives/moral hazard problem.

Side note.  At what point did the term "Hedge Fund Manager" become derogatory?  This person was trusted with millions (potentially billions) in assets from high-net-worth individuals, and that somehow is an indictment of character?  Are we getting "Hedge Funds" confused with "Trust Funds?"  Certainly it would be negative if it were a Ponzi scheme or other financial scam, but at this point we're only dealing with a pricing change.

Friday, September 18, 2015

GOP Primary Debate Number Two: CNN Summary And Text Mining

Three hour debate?  Seriously?  After my wife and I realized it was going to be three hours we started recording the debate and watching something else.  I intended to watch the rest of the debate later, but why would I do that when I can just run an algorithm and review the results?  Here's a summary take-away of what I found:
  • The candidates drawing the most attention in tweet volume in order were Trump, Fiorina, Bush (with others coming in much lower).
  • The winner of the debate was likely Fiorina, with twitter most associating the term "loser" with Kasich.
  • Donald Trump dominated the social media conversations following the debate, but conversations about him were focused on how "funny" he was and his insults of other candidates, not issues or policy stances.


I downloaded about 100,000 tweets with the hashtag #GOPDebate the morning after the debate.  I used normal text scrubbing, stemming, and data-izing methodologies.  Then I went to work analyzing the data, including a wordcloud, topic modeling, some basic sentiment modeling, and summarizing of data. 
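For anyone curious what "normal text scrubbing" looks like in practice, here's a minimal Python sketch; the example tweets are made up, and the real pipeline also included stemming and stopword removal:

```python
import re
from collections import Counter

def clean_tweet(text):
    """Lowercase, strip URLs, @mentions, and punctuation; return tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"[^a-z#\s]", " ", text)     # keep words and hashtags
    return text.split()

# Hypothetical example tweets, not real data.
tweets = [
    "Trump dominated the #GOPDebate! http://example.com",
    "@someone Fiorina won the #GOPDebate, clearly.",
]

counts = Counter(tok for t in tweets for tok in clean_tweet(t))
print(counts.most_common(3))
```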

What everyone wants: here's our resultant wordcloud.  Note that Trump dominated the text, with "Trump" being even more common than the word "debate."

I also created a few topic models to try to parse out what topics were being discussed during the debate.

It turned out the topic models didn't converge as well as I would have hoped.  Why?  A few reasons:

  • Almost every topic we pulled out was dominated by Trump, because Trump dominated the conversation.  A corpus this focused on a single subject doesn't lead to very good topic models.
  • Even the non-Trump conversations were dominated by Trump's interactions with other candidates.  For instance, Topic 4 focuses on Jeb Bush, but still has "Donald" and "Trump" in its top 5 associated terms.  Same with topic two and Carly Fiorina.


If a debate is a competition to generate twitter traffic, then Trump clearly won.  Here's a graph of top candidates and their respective twitter volume, with some sentiment data overlaid.  Trump clearly wins with nearly double the twitter volume of any other candidate.  He also had a lower negative-tweet percentage than any other candidate, but there's a reason for that, which I will address later.

What about a more objective view of who won, given that debates aren't really Twitter volume competitions?  We can look at the terms that are most associated with the words "winner" and "loser".

Notice a lot of reporter names and other references end up in this list of most associated terms, largely because they are the ones discussing winner/loser outcomes.  However, in our winner list Fiorina is the only candidate name to show.  In our loser list we see John Kasich's name.  From the little bit of the debate I saw, Kasich wasn't necessarily a loser; it was just very unclear why he is still in the race.  This isn't a perfect methodology, but it does speak to a twitter consensus: when discussing winners of the CNN GOP Debate, Carly Fiorina was the most frequent topic of conversation.
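The winner/loser association works roughly like findAssocs in R's tm package: correlate a term's presence across tweets with every other term's presence.  A toy Python sketch, with made-up tweets:

```python
# Toy term-association: correlate presence/absence of a target term across
# documents with the presence/absence of every other term.

def pearson(xs, ys):
    """Pearson correlation, returning 0.0 for constant vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

# Hypothetical tweets, not real data.
docs = [
    "fiorina clear winner of the debate",
    "fiorina winner tonight",
    "kasich loser why is he running",
    "trump funny as always",
    "winner fiorina strong night",
]
tokenized = [set(d.split()) for d in docs]
vocab = set().union(*tokenized)

def assocs(target, top=3):
    """Terms most correlated with the target term across documents."""
    tvec = [1 if target in d else 0 for d in tokenized]
    scores = {w: pearson(tvec, [1 if w in d else 0 for d in tokenized])
              for w in vocab if w != target}
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(assocs("winner"))  # "fiorina" should rank at the top
```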


Looking at the issues and topics statistically associated with candidates has a couple of applications.

  • We can determine each candidate's signature issues.  For instance, last debate we recognized that Huckabee is effectively a social-conservative issues candidate, as his top associations were abortion and gay marriage.
  • We can identify the most prevalent negative issue for each candidate.  For instance, Trump's comment on Fiorina's face comes through as his top issue.
Here are the related terms for some top candidates:

And some quick analysis:

  • Fiorina: Focuses on her comments on Planned Parenthood, related videos, and relationship to other candidates.  
  • Trump: Focuses on words like "funniest", "reaction", "msm", and "drudge", a lot of terms related to the spectacle of Trump's candidacy.  This also explains Trump's low negative-tweet ratio: the conversation regarding his candidacy is not issue-based; it's largely talking about how hilarious he is.
  • Huckabee: So apparently there's an actor with that last name on The Walking Dead.  Not good that that association is popping to the top, meaning that Huckabee is not a high-popularity candidate.  The weird words that keep popping up for Huckabee are "soulless" , "hell","heathen", and "atheist".  Also "epicfail."
  • Bush: Focuses on his family (George, Barbara) and whether his brother kept us safe.  Also a reference to pot (likely regarding Fiorina's joke about Jeb smoking weed).


A few bullet points to close this out:
  • Trump is still dominating the social media and conversation around the Republican Presidential primary, though more out of spectacle than actual issues.
  • Fiorina appears to have won the debate from analysis of twitter conversations.  
  • Bush, who is still favored by many real analysts, is being talked about in the larger context of his family and his brother's presidency.

Thursday, September 17, 2015

2015 NFL Picks: Week 2

And it's time for week two!  Last week my home town Kansas City Chiefs won despite my pre-season model picking them to lose.  This week my model chose the Chiefs to beat the Denver Broncos, though I think that might be a crazy pick.


Last week, I went 10-6, and was correct on both of my "high certainty" picks.  But how good is 10-6 (62.5% correct)?  Here's how it performs against other "null" statistical models:
  • It's better than a null model where we pick random teams (50% correct).
  • It's better than a "home" model where we pick the home team (55% correct).
  • It's better than choosing the team who won more games last year (59% correct).
It's great that it outperforms other naive models, but how does it do against professional football analysts?  For this analysis, I checked against the site NFLPickWatch, which aggregates and compares the game picks made by experts and sportswriters.  I've included data below, but in essence, for the first week, my model performed as well as the median pundit (10 correct predictions).  I will continue to track this comparison metric in future weeks.
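Scoring a set of picks against the naive baselines is straightforward.  Here's a Python sketch with simulated game results; the percentages quoted above come from historical NFL data, not from this toy:

```python
# Compare pick accuracy against naive baseline models on simulated results.
import random

random.seed(0)
n_games = 16
results = [random.random() < 0.55 for _ in range(n_games)]  # True = home team won

home_picks = [True] * n_games  # "home" model: always take the home team
coin_picks = [random.random() < 0.5 for _ in range(n_games)]  # random model

def accuracy(picks, results):
    """Fraction of games where the pick matched the result."""
    return sum(p == r for p, r in zip(picks, results)) / len(results)

print(f"home model:   {accuracy(home_picks, results):.1%}")
print(f"random model: {accuracy(coin_picks, results):.1%}")
```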


Stay tuned for this next week: the in-season model is not significantly better than the pre-season model for week two, because we don't have enough new data for this season.


And for this week, our pre-season model predictions for week two.  I generally agree with these predictions, based on my not-so-exhaustive knowledge of football, but have some concerns:
  • The team my model finds most certain to lose is the New York Jets, despite the team having a coach with the last name Bowles (no relation).
  • I don't have faith that the Chiefs will beat the Broncos, based on recent history, and the existence of Peyton Manning.
I will update this again next week, let's hope these picks work out!  

Monday, September 14, 2015

Introduction to Creating GIS Maps

I'm not a huge data visualization guy, but a few of my recent posts have used mapping technology for illustrative purposes. I've received some questions on how to create the maps, or how to get started with mapping technology.  The process is actually quite straightforward, so I thought I would provide a simple tutorial to get us started.


  • First acquire and install the GIS software you want to use.  I've posted on this before, but I like QGIS, which can be freely downloaded here.
  • Next acquire the map data that you want to use.  I suggest starting with Shapefiles, which are an industry standard format.  These provide the geographical data that can create the map as well as some attributes about each unit on the map.  Some examples of where to acquire this data include the Census Bureau, DASC, or local governments.  
  • Unpack the shapefiles.  You can get in the folder and take a look at these.  The most interesting is the *.dbf file, which can be opened in Excel, and gives the additional attributes for each unit.  So, for instance, if you downloaded a shapefile of Kansas counties, the dbf file would include attributes such as census population attributes and demographic breakdowns for each county.
  • Import the data into QGIS.  Here you use the "add a vector layer" function in QGIS.  See screenshot below, and click button.
  • Choose browse, and navigate to the shapefile directory.  In the directory, select the *.shp file, then click Open, and Open again.

  • Your shapefile should import, and look something like this.  By default mine imported as hot pink.  Not my choice.

  • The shapefile now shows up in the "layers" window, as something looking like the image below.  Right click on the name of the newly created layer in this box, and click on properties.

  • On the dialogue box click on style.  This allows you to play around with the styling of the data shown on the map.  You can choose different types of scales, a column from the shapefile to vary on, a number of classes, and a variety of color ranges.  I chose to use a graduated scale on the column "white", using the color range "blues" and five classes.  When you've chosen what you want to do, click "classify", "apply" and "OK."

  • And your finished map.

This tutorial is designed to get you started using GIS technology through an easy interface, and simple first map.  There are virtually limitless possibilities of other things that you can do, from custom calculated fields to selecting specific elements to overlaying additional layers and geospatial queries.  There's even a module for advanced statistical and predictive analysis.  Once you get the basics of this tutorial down, you can explore forums and help pages on the internet, or simply reply to this post with additional questions.

Saturday, September 12, 2015

#ksleg Power Rankings September 12th

I spent last week on vacation out of state, not watching news reports or reading local papers, so I'm fairly clueless about what happened in Kansas while I was gone.  Is there a better way to remedy that than another out-of-session round of #ksleg power rankings and topic mining?  Probably.  Is there a funner/nerdier way? Probably not.


So first the most important (sarcasm) part.  A wordcloud.

Not too much surprising there, we are generally seeing the normal top words, like court, teacher, school, and fund.  On to the topic mining, what topics did our topic mining algorithm find (correlated topic models)?

The model found five individual topics within the data.  Here's what the tweets most associated with those topics referenced:
So that was a quick rundown of topics, and by quickly googling around, these appear to cover the biggest stories in Kansas over the past seven days.


Interesting topics, but what about how people feel? Our sentiment mining algorithm helps with that.   First we mined the general polarity of tweets, and found that the tweets were slightly more negative than the last time we looked.  Specifically, whereas in late August the negative:positive ratio was 2:3, it is now approximately 3:4.
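The negative:positive ratio above is just a count of tweets on each side of neutral. A minimal sketch, assuming each tweet has a polarity score (negative below zero, positive above); the scores here are made up:

```python
# Sketch of the negative:positive polarity ratio, using hypothetical
# per-tweet polarity scores (negative < 0 < positive).
from fractions import Fraction

scores = [-0.4, 0.2, 0.7, -0.1, 0.3, -0.6, 0.5]  # hypothetical polarities
negative = sum(1 for s in scores if s < 0)
positive = sum(1 for s in scores if s > 0)
ratio = Fraction(negative, positive)  # reduces to lowest terms
print(f"{ratio.numerator}:{ratio.denominator}")
```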

We can also award the @MichaelofAustin memorial award for the most emotionally-negative reply tweets.  These are the accounts most often replied to with the emotions of fear, sadness, and anger.  These are, effectively, this week's most hated #ksleg accounts, per our sentiment mining algorithm.  Looks like the governor caught the most negativity.  (I was going to make this a top-three list, but when jrclaeys, who I attended high school with, popped up at number four, I decided to expand.)


And now, the moment that no one really cares about, our #ksleg power rankings.  This time, I went to the trouble of actually developing a working index, based on tweets, retweets, and favorites, that correlates to actual reach.  The top tweeter is indexed to 1.000, and other tweeters get scores that are effectively ratios of that.  Ratios between tweeters in this index are also meaningful: a tweeter with a .400 index has roughly double the reach of one with a .200 index.
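An index like that can be sketched as follows. The accounts, counts, and especially the weights on retweets versus favorites are my own hypothetical choices, not the actual index formula:

```python
# Sketch of a reach index: score each account from tweets, retweets,
# and favorites, then normalize so the top account is 1.000.
# All names, counts, and weights are hypothetical.
def reach(tweets, retweets, favorites, w_rt=2.0, w_fav=1.0):
    """Raw reach score; retweets weighted above favorites (assumed weights)."""
    return tweets + w_rt * retweets + w_fav * favorites

accounts = {
    "alpha": reach(tweets=50, retweets=120, favorites=200),
    "beta":  reach(tweets=80, retweets=40, favorites=90),
    "gamma": reach(tweets=30, retweets=20, favorites=35),
}
top_score = max(accounts.values())
index = {name: round(score / top_score, 3) for name, score in accounts.items()}
print(index)
```

Because every score is divided by the same top score, the ratio property holds: an account at .400 has double the raw reach of one at .200.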

Thursday, September 10, 2015

2015 NFL Picks: Week 1

Someone tell my wife I will see her in February, the NFL regular season starts tonight! (not really... see you about 5:30)


If you read this blog regularly, you know that back in early summer I developed a model to predict playoff teams and end-of-season team records. (first post, final post)  I assumed I would be posting about its performance during the season, and I will in fact, in three specific areas:

  • Preseason Model Picks:  The preseason model I created can also be used to predict individual games, though because it relies on last season's data, the predictions are not great.  It should be fun to watch, though, and its predictive capability will decay as the season progresses and more things happen that the model doesn't know about (injuries, over-performing players, etc.).  This model effectively asks: if we just looked at last season's performance, how would this game turn out?
  • In-Season Model Picks:  This is another model I've created that predicts the results of weekly games based on prior-season AND in-season results.  This model should be more predictive than the preseason model, and increasingly predictive as the season progresses.
  • Performance Evaluation: Each week I will evaluate the performance of the picks compared to a "null" model (random picks) and to how well certain sportswriters do in weekly team picks. I am considering Tom Keegan from the Lawrence Journal-World as a benchmark pundit, but I'm not sure yet.
Ok, on to the real picks


Let's get this started, my first week's picks:

A couple of notes on terminology and form.

  • The yellow highlighted teams are my picks for the week. 
  • The odds ratio is the odds ratio of the home team winning the game.  (e.g. the Patriots are 2.01:1 favorites effectively)
  • The certainty level speaks to how certain we are about the prediction.  Keep in mind that NFL games are still relatively unpredictable, and even highly probable favorites will lose a quarter of the time or more.
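For reference, an odds ratio like the 2.01:1 Patriots figure above converts to a win probability with the standard formula p = odds / (1 + odds):

```python
# Convert a home-team odds ratio (odds:1) into an implied win probability.
def odds_to_probability(odds):
    """Win probability implied by an odds ratio of odds:1."""
    return odds / (1 + odds)

p = odds_to_probability(2.01)
print(round(p, 3))  # → 0.668
```

So a 2.01:1 favorite wins roughly two times in three, which is why even strong favorites lose fairly often.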
This is the first week of the season, so there is no in-season or review of prior performance, but that will come next week.  At this point, we have an idea from the pre-season model, but no real empirical evidence of how NFL teams will perform this year. Now to sit back and see how this all pans out.

Wednesday, September 2, 2015

Kansas Election Fraud: Part 6 Sedgwick County Suburbs

And yet another post in my continuing series on Kansas election fraud.  Why do I keep posting on this?  First, there is still a lot of interest in the media.  Each time I open social media, and occasionally when looking at local news, this story seems to pop up. Second, because no one else is doing due diligence on the numbers, and this type of strategic trend information may be useful in understanding both how our democracy works and what it takes to win elections.  Today I will cover three subjects:
  • Data Availability (I have a huge gripe here)
  • Sedgwick County Mapping
  • Bucketization Analysis


Someone needs to call Shawnee County, Kansas and let them know it's freaking 2015.  I have a device on my wrist that tracks my steps and sleep and syncs that data to my phone, which I can then dump to a MySQL database via API and analyze my activity level hourly.  I analyze a data warehouse with 50+ terabytes of data.  I have code that can download tweets, turn text into numeric, analyzable data, model that data, and return topics, visualizations, and sentiments, all in about 20 seconds.  I can take precinct results data, join it to geospatial map data (freely available online), and create visualizations of spatial voting patterns.  Big data, numbers, are everywhere.  Except:

Shawnee County Kansas can't provide me a numeric format of their 2014 by-precinct election results.  
I have the results from every other county, either in an Excel document from the Secretary of State's office, from a digital online format (SG), or in PDFs with text metadata that allows me to easily scrape the underlying data (WY, JO). Yesterday I contacted the Shawnee County elections office to ask for some kind of numeric format (Excel, a PDF with selectable text, anything) of the precinct data.

No dice.  The response I received was that the only existing forms of this data are PDF (no selectable text metadata) or paper. Nothing in Excel, nothing analyzable.  Yes, I know I could OCR the PDF, and I've started doing that, though it's not a high-quality PDF, so it produces a lot of errors.

While I don't agree with Beth Clarkson's conclusions, I can see where she and the people who agree with her are coming from.  It feels as though the system was not designed to be analyzed after elections.  


In my last blog post, I found that Sedgwick County also demonstrated "Clarkson's Correlation," where larger precincts trended more Republican.  I wondered if the same visualization technique I applied to Johnson County could be applied to Sedgwick County.  The answer was yes. 

First, a look at Sedgwick County voting patterns by precinct.  Blue (Davis-favoring) precincts sit in the center city, while the suburbs and outer rural areas trend more Republican, as expected.  

Now on to our overlay of precincts by size. There are a lot of 500+ voter precincts in Sedgwick County, but the largest of those are not in the center city but in the suburban ring.  This is an area we know to be, overall, whiter, more affluent, and more Republican-leaning than the center city.

All of this is additional complementary evidence for my prior posts on Clarkson's theory: that it is effectively based on a broken a priori notion, namely that after 500 voters there should be no correlation between precinct size and % of vote Republican.  The specific reason it is broken is that precinct creation was not random; in fact, suburbanization caused the largest precincts to be in whiter, richer, and more Republican-leaning areas.  

But I have only demonstrated this for Sedgwick and Johnson Counties.  How much do those two counties actually matter?


Let's take a deeper look into large precincts.  An easy way is to break precincts into buckets by size, and talk about them in this way.  Here are the size buckets I am using:

  • Regular Precincts: 0-500 voters (Clarkson Ignored These)
  • Large Precincts: 500-1000 voters
  • Super-Large Precincts: 1000+ voters
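The buckets above can be written as a small function. How to assign a precinct that falls exactly on a boundary (500 or 1000 voters) is my assumption, since the ranges as listed overlap at the edges:

```python
# The precinct size buckets as a function.  Boundary handling at exactly
# 500 and 1000 voters is an assumption; the post's ranges overlap there.
def size_bucket(voters):
    if voters < 500:
        return "regular"       # 0-500 voters (Clarkson ignored these)
    elif voters < 1000:
        return "large"         # 500-1000 voters
    return "super-large"       # 1000+ voters

precincts = [120, 499, 500, 870, 1000, 2400]  # hypothetical voter counts
print([size_bucket(v) for v in precincts])
# → ['regular', 'regular', 'large', 'large', 'super-large', 'super-large']
```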

So, first, how did Brownback do by each size-grouping of precincts?  Here's a chart:

This chart actually backs up Clarkson's correlation: effectively, Brownback did best in regular and super-large precincts.  The fact that he did better in super-large precincts than in large precincts is the exact correlation Clarkson is talking about.  This is just another validation that the correlation exists.

But how much do suburbanization patterns in JoCo and Sedgwick County matter in this?  A lot, as a series of pie charts shows.  First, JO/SG make up only 14% of the regular-sized precincts.
But they make up almost two thirds of large precincts.  
And they make up 97% of super-large precincts, with 66% of those being in Sedgwick County.

If we look at Clarkson's analysis, over two thirds of the sample can be attributed to JoCo or Sedgwick County, where we know that her a priori assertion is broken.  Moreover, when we run the correlation on the remaining third, we see no correlation.  The effect is only observable in urban/suburban counties.  Effectively, Sedgwick and Johnson Counties are all that matter to the observed correlation.  Here's an R output for the other 101 counties:

One quick side note.  There's something else that increases the correlation when we aggregate results.  Because the majority of super-large precincts are in Sedgwick County, those precincts get extra leverage.  And because Wichita is, all in, a more conservative region than Johnson County, that leverage serves to increase the correlation, though due to no nefarious or unexplained phenomenon.  
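For readers without R, the Pearson correlation behind output like that can be computed in pure Python. The precinct sizes and Republican vote shares below are made-up illustrative values, not the actual county results:

```python
# Pure-Python Pearson correlation (the statistic behind R's cor/cor.test).
# Input data here is hypothetical, not the real precinct results.
import math

def pearson(xs, ys):
    """Pearson's r between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sizes = [520, 610, 800, 950, 1200]          # hypothetical precinct sizes
pct_rep = [0.55, 0.52, 0.58, 0.54, 0.56]    # hypothetical GOP vote share
r = pearson(sizes, pct_rep)
print(round(r, 3))
```

An r near zero on the non-JoCo/Sedgwick counties is what "no correlation" in the text refers to.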


  • Shawnee County: GET. WITH. THE. PROGRAM.
  • Sedgwick County: Though Sedgwick County is much different from Johnson County, suburbanization created a similar pattern: the largest precincts are in the suburbs. This subverts Clarkson's a priori assumption of stochastic precinct creation.  
  • Bucketization: An interesting illustration of how Brownback did well in very large precincts, which are mostly located in Johnson and Sedgwick Counties.