Sunday, September 10, 2017

So You Want To Be A Data Scientist

Quite often, someone I know asks me a question that I don't have a great answer for:

How would I go about becoming a data scientist?

This is always a tough place to start a conversation, especially if data science is not a great fit for the individual I'm talking to, but there are generally two types of people who ask me this question:

  • Young professionals:  I get the joy of working with quite a few interns and "first jobbers," who, by the way, generally give me a reason to be hopeful about the future of America. (Ironically, most of them aren't Americans, but whatever...) Most of these people are in computer science or some kind of analytical program and want to know what they should do to become a real "data scientist."
  • People my age: I also get this question from people in their mid-30s, many of whom have limited relevant educational background.  For certain mid-career professionals this could be a great option, especially if they have both computer science and math in their background, but this often isn't the case.  They seem to be drawn to data science because they've seen the paycheck, or because it just sounds mysterious and sexy.  Often these people say "I love data, I'd be great at data science" (though this claim is somewhat dubious, and by it they often mean that they like USA Today infographics). 

I'm writing this blog post as a place to point both of these groups, in order to give a fair, full-breadth look at the skills I would expect from data scientists.  I break these skills down into three general areas (with some bonus material at the end):

  • Math Skills
  • Computer Science Skills
  • Business Skills
Fictionalized Portrayal of a Data Scientist (that looks somewhat like my work setup)


Math is the language of data science, and it's pretty difficult to make it 10 minutes in my day without some form of higher math coming into play.  Point being: if you struggle with and/or dislike math, this isn't the career for you.  And if the highest-level math you've taken is college algebra, you're also in trouble.  Knowledge of algebra is an absolute assumption in data science, and most of the real work is done with higher-order math.  I would consider four types of math to be requirements:

  • Calculus (differential + integral):  I use calculus daily in my job, when calculating equilibria, optimization points, or instantaneous rates of change.  Three semesters of college-level calculus is a must for data scientists.
  • Matrix/Linear Algebra: The algorithms that we use to extract information from large data sets are written in the language of matrix and vector algebra.  There are many reasons for this, but it allows data scientists to express large-scale computations very compactly without having to manually code thousands of individual operations.
  • Differential Equations: This is an extension of calculus, but is extremely helpful in calculating complex variable interactions and change-based relationships.
  • Statistics:  Don't just take the stats class that is offered as part of your major, which tends to be a bare-necessities look.  Take something that focuses on the mathematics underlying statistics; I suggest a stats class at your university that requires calculus as a prerequisite.
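To make the matrix algebra point concrete, here's a small sketch in Python with NumPy (the data and coefficients are randomly generated for illustration): a loop over thousands of individual multiply-adds collapses into one line of matrix notation.

```python
import numpy as np

# Illustrative only: 10,000 observations scored against a 20-coefficient
# linear model.  The data and coefficients are randomly generated.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))   # data matrix: 10,000 rows, 20 features
beta = rng.normal(size=20)          # model coefficients

# Element by element, this is 200,000 individual multiply-adds:
slow = np.array([sum(X[i, j] * beta[j] for j in range(20))
                 for i in range(10_000)])

# In matrix notation, the same computation is one line (and far faster):
fast = X @ beta

assert np.allclose(slow, fast)
```

This is exactly why the algorithms are written in the language of linear algebra: the one-line version both reads like the math and runs orders of magnitude faster.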

If that level of math is intimidating to you, then data science is likely not a great option.


Here are the guidelines I give young data scientists: the correct level of computer science skill is such that you could get a job as a mid-level developer (or DBA) at a major company.  This may seem like a weird metric, but it plays into the multi-faceted role of data scientists: we design new algorithms and process data, which involves writing the programs that analyze that data.  Being able to write code as dynamic programs allows for automated analysis and model builds that take minutes rather than weeks. Here are some courses/skills to pick up before becoming a data scientist:

  • Introduction to Programming: Simply knowing how computer programming works, including the basics of functional and object-oriented programming.
  • Introduction to Database Theory: Most of the data we access is stored or housed in some kind of database.  In fact, Hadoop is just a different type of database, but it's good to start with the fundamentals.  As part of this course, it's vital to learn the basics of SQL, which is still (despite claims and attempts to the contrary) the primary language of data manipulation for business.
  • Python: Python is becoming the language of data science, and it is also a great utility language, which has available packages and add-ons for most computing purposes.  It's good to have a utility language in your toolkit as many data wrangling and automation tasks don't exclusively require the tools of data manipulation (e.g.: audio to text conversion).
  • R: R is my primary computing language, though I work in Python and SQL in equal proportions these days (and sometimes SAS). R has extensive statistical and data science computing packages, so it's a great language to know.  The question I get most often is: should I learn R, Python or SAS?  My answer: have a functional understanding and ability to write code in all three, be highly proficient in at least one.


When asking about business skills, the question I most often receive is: should I get an MBA?  In a word, no.  But it is helpful to understand business concepts and goals, especially so you can understand and explain concepts to coworkers fluently.  You don't have to go deep into business theory, but a few courses are helpful:

  • Accounting: Often data scientists are asked to look at accounting data in order to create financial analyses, or to merge financial data with other interesting areas of a business.  Understanding the basics of the meaning of accounting data, accounting strategies, and how data is entered into financial systems can be helpful.
  • Marketing: Much of the use of data science over the past five years has dealt with targeted marketing both online and through other channels.  Understanding the basics of targeted marketing, meaning of lift, acquisition versus retention, and the financials underlying these concepts is also helpful.
  • Micro-Econ: Though technically an economics class, micro-econ teaches you the basics of micro theory and allows you to analyze a business more wholly.  Some relevant analyses are demand and pricing elasticity, market saturation modeling, and consumer preference models.  It also helps you with personally valuable analysis, like evaluating the viability of a start-up you might be thinking about joining.
Supply-demand relationships are relevant to many data science business applications.


Though the above set of skills is a necessity for data science, there are a few "honorable mention" classes that are helpful:

  • Social Sciences: When modeling aggregate consumer behavior, it's important to understand why people do the things they do.  Social sciences are designed to analyze this; I recommend classes in economics, political science (political behavior or institutional classes), and behavioral psychology.
  • Econometrics: Econometrics is a blending of economics and statistical modeling, but the focus on time-series and panel analysis is especially helpful in solving certain business problems.  
  • Communication: One of the most common complaints I hear about data scientists is "yeah, _____'s smart, but can't talk to people."  A business communication class can help remedy this before it becomes a serious issue.


The road to data science is not fixed, and there are many options.  This road map gives you the skills you will need to be a modern data scientist.  People who want to become data scientists should focus on three major skillsets: math, computer science, and business.  Some may notice that I omitted artificial intelligence and machine learning, but the statistics, math, and computer science courses on this list more than give one a head start on those skills.

Friday, March 17, 2017

College Basketball Analysis: Big 12 Home Court Advantages

A few weeks ago my grad school alma mater, the University of Kansas (KU), won its thirteenth consecutive Big 12 conference championship (I wasn't watching the game; I have better things to do).  Much has been made of how large an outlier this streak is: if performance were random, the odds of winning thirteen straight would be about 1 in a trillion (not hyperbole, actual probability). 

Along with this streak, there have also been accusations that the University of Kansas receives preferential treatment in Big 12 basketball, has an unfair home advantage, or outright cheats to win. The home-court advantage is staggering: KU is 75-3 in conference home games over the past nine years, about a 96% win rate.

Half joking, I shot off a quick tweet commenting on both the conference win streak and the accusations.  People quickly reacted: KU fans called me names, while Kansas State University fans generally agreed, though they were more willing to charge KU with cheating. The accusations and arguments raise an interesting question: do certain teams have statistically different home-court advantages, and is the University of Kansas one of those teams?


The main issue in calculating home-win bias is that different teams perform better or worse over time, so we can't just look at simple home win rates over a series of years.  We need a robust methodology to set expectations for home win percentages and compare those expectations to actual performance.  As such, I devised a method to set expectations based on road wins, and applied that information to each team for analysis. 

The underlying premise of this analysis is to look at ratios of home win percent to road win percent over time, calculating the advantage of playing at home for each team and how it differs from other teams over multiple seasons.  In detail:

  • The theory here is that some home advantages (KU's, specifically) are higher than others, due either to natural advantages, outright cheating, or bush-league behaviors.
  • In order to test whether home advantages differ, we need a methodology that controls for quality of team independent of home performance, and computes home performance in relation to that absolute advantage. Enter predicting home wins using road wins.
  • In aggregate, we would expect to be able to predict a team's home win percentage by looking at their road win percentage, as better teams should perform better in both venues.  If a team has a systematic advantage on their home court, we would expect their home win percentage to over-perform the predictive model developed from road wins.
  • I build a predictive model of home wins based on road wins for each team-year.  The models are developed for each Big 12 school as a hold-out model, to remove each school's self-bias in the numbers.  I then apply the model to the held-out school, calculate the residuals on the hold-out, and move to the next school.
  • The residuals here represent a Wins Above Expectation metric. We can do two things with the residual data: 
    1. Calculate the mean residual and distribution over time which indicates the overall home bias of the school (which schools systematically over-perform at home) 
    2. Determine the best and worst performances at home for individual schools.
The initial models performed well, and show that road wins are fairly predictive of home wins, with a .52 R-squared value (variance accounted for) and a 0.4 elasticity in the log-log specification of the model. 
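For the curious, the hold-one-school-out scoring described above can be sketched in a few lines of Python (the school labels and win percentages below are invented stand-ins for the actual Big 12 data, and the real analysis was done separately per season):

```python
import numpy as np

# Hypothetical data: five schools, ten seasons of home/road win percentages.
rng = np.random.default_rng(1)
schools = ["A", "B", "C", "D", "E"]
road = rng.uniform(0.1, 0.8, size=(5, 10))                       # road win %
home = np.clip(road + 0.3 + rng.normal(0, 0.05, (5, 10)), 0, 1)  # home win %

wins_above_expectation = {}
for i, school in enumerate(schools):
    mask = np.arange(5) != i                   # hold this school out
    # fit home ~ road on every other school's seasons
    slope, intercept = np.polyfit(road[mask].ravel(), home[mask].ravel(), 1)
    # score the held-out school: residual = actual home % - predicted home %
    resid = home[i] - (slope * road[i] + intercept)
    wins_above_expectation[school] = resid.mean()  # mean residual = home bias
```

The mean residual per school is the Wins Above Expectation metric, and the per-season residuals give the distributions behind the boxplots.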


Starting with a visual inspection of the data, we can get an idea of the relationships between teams, home and away games, and seasons.  First, a data point: teams perform far better at home (65.6% average win rate) than on the road (34.4%), winning nearly twice as often on their home court. But let's go back to our initial question: does KU win more often than other Big 12 schools at home?  The answer here is yes. 

KU outperforms all schools, with the closest neighbor being Missouri (who has a limited sample as they left the conference a few years ago).  We then see a cluster of schools with about 70% home win percentages, and a few bad schools at the end of the distribution (TCU, notably).  This indicates that KU is an outlier in terms of home performance, but is that because KU is a much better team, or indicative of other issues?

Road win percentage helps us answer this question: KU is the best road team in the conference, by a large margin.  Kansas is in fact the only team in the conference with a winning road record over the past decade, winning close to 75% of games on the road.  Even consistent KU rival and NCAA tournament qualifier Iowa State regularly wins fewer than 40% of their road games. 

We know that KU wins a lot of games at home and on the road, but is there a way to determine whether their home wins exceed a logical expectation?  Before moving on to the modeling that can answer our question, we should prove out an underlying premise: that road wins and home wins correlate with one another.

The chart and a basic model provide some basic answers:

  • Road wins are highly correlated to home wins at a correlation coefficient of .68.
  • Few teams (3%) finish a season with more road wins than home wins.


With the initial knowledge that KU performs highly both at home and on the road, we can start our model building process. If you're interested in the detailed model, look at the methodology section above.

Using the model to calculate how teams perform relative to peers in terms of home and road wins, I calculated the average home-court-boost, or the number of wins above road-based-expectations, shown below:

Oklahoma State has the largest home-court advantage in the conference, followed by Iowa State, Kansas, and Oklahoma.  Each of these schools receives about one full extra win per season over expectations due to their home-court advantage. TCU has the worst home performance, followed by West Virginia and Baylor.  

A further interesting (and nerdy) way to view the data is a boxplot for each school representing the last ten years of wins-over performance.  This shows that some schools like Kansas and Iowa State have fairly tight distributions representing consistent performance above road expectation.  Other schools, like Kansas State and Baylor, have a wide distribution representing inconsistent home performance related to road expectations.

Using the same scoring method we can score individual year performances, and determine which teams have the best and worst home versus road years.

Most interesting here is that K-State's home-court advantage was pretty amazing over the 2014 and 2015 seasons.  During those years, Kansas State was 15-3 at home and 3-15 on the road.  At that time at least, it appears Kansas State's Octagon of Doom (I don't remember what it's really called, even though it's where I received my Bachelor's degree) was a far greater advantage than KU's Allen Field House.


From the models developed we can reach several conclusions about the types of home advantages held by Big 12 teams:

  • The home advantage for the University of Kansas at Allen Field House is high (about +1 game a year) but in line with several other top-tier Big 12 teams.  This doesn't necessarily fit the story-line that KU cheats at home, but it doesn't rule out another theory given by Kansas State fans: that KU cheats/gets unfair preferential treatment both at home AND on the road.  
  • The top home advantages in the Big 12 are: Oklahoma State, Iowa State, Kansas, and Oklahoma.  In fact, both Oklahoma State ("Madison Square Garden of the Plains" .. seriously?) and Iowa State ("Hilton Magic"...) hold moderately larger home advantages than Kansas at Allen Field House.
  • The worst home advantages in the Big 12 are: TCU, West Virginia, and Baylor.
  • Some individual team-years show volatile performance, specifically Kansas State through 2014 - 2015.

Tuesday, February 7, 2017

Data Science Trolling at the Airport: Using R to Look Like the Matrix

I've been quite busy at work lately, but I am working on some new, serious posts (one on Support Vector Machines, another on Trump's claims of voter fraud).  For today, though, just a fun post about trolling people who like to stare at programmers in public.  

Occasionally, I am stranded somewhere in public (like an airport or a conference) when I need to do some serious coding (side note: one large company has in production a significant piece of software architecture that I wrote on a bench in Las Vegas).  In these situations I open my laptop and start coding, but often notice that people are staring at my computer screen. It doesn't bother me that much, but over time it does become a bit annoying.  A few theories for the stares:

  • It's novel: They've never seen a programmer at work before.
  • It's me specifically: I tend to fidget, and be generally annoying when coding.
  • It's evil: When normal people see programmers writing code in movies, they're always doing something fun and exciting, or evil. They're hackers. 

A Solution:

Anyway, I eventually get annoyed with people staring at me and, based on my third bullet point, decide to give them a show and troll them just a bit.  I wrote this very short piece of code a few years ago in an airport; it's in R and actually quite simple:

It uses a loop, runif(), and some rounding to create 70,000,000 random numbers and print them to the console.  It basically makes your computer screen look like something from a hacker or Matrix movie.  The Sys.sleep() call is something to parameterize based on your system settings, but its point is to make the console animate as though you are in a movie.  Here's a video of the code running: 
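If you'd rather not reconstruct the R snippet, here's a rough Python analogue of the same idea (the function name and defaults are mine; the original used a loop over runif() with Sys.sleep()):

```python
import random
import sys
import time

def matrix_rain(n_numbers=70_000_000, per_line=10, delay=0.01):
    """Print rounded random numbers to the console, pausing briefly per
    line so the terminal scrolls like a movie hacking scene."""
    for _ in range(n_numbers // per_line):
        line = " ".join(f"{random.random():.4f}" for _ in range(per_line))
        sys.stdout.write(line + "\n")
        time.sleep(delay)  # tune per system, like Sys.sleep() in the R original

# matrix_rain()  # uncomment at the airport gate of your choice
```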

Recommendations for maximum silliness: 

  • When you start the code, maximize the command line portion of your computer.
  • Make sure you're set up with a black background and white or green text, for maximum appearance of evil.
  • For bonus points try to give the appearance that you are really up to something:
    • Wringing of hands or nervous fingers help with this appearance.
    • Rocking back and forth slightly while intently staring at the screen.
    • Mumbling under breath things like "it's working, it's working!" or "almost in... almost in."

Wednesday, January 18, 2017

Data Science Method: MARS Regression

People often ask which data science methods I use most often on the job or in exploring data in my free time.  This is the beginning of a series in which I describe some of those methods, and how they are used to explore, model and extrapolate large data sets.

Today I will cover MARS regression (Multivariate Adaptive Regression Splines), a regression methodology that automates variable selection, detects interactions, and accounts for non-linearities.  This methodology has at times become my hammer (from the saying: when you have a hammer in your hand, sometimes everything looks like a nail) due to its usefulness, ease of implementation, and accurate predictive capabilities.

The MARS algorithm was introduced in 1991 by Jerome Friedman, and I suggest reading his original article for a full understanding.  By the way, because "MARS" is trademarked, the packages in many statistical programs (including R) are called "earth."  Essentially, though, the algorithm boils down to this:

  1. The Basics: The basic mechanics of MARS involve simple linear regression using the ordinary least squares (OLS) method. But there are a few twists. 
  2. Variable Selection: MARS self-selects variables, first using a forward stepwise method (greedy algorithm based on variables with highest squared-error reduction) followed by a backward (in this case, truly back-out) method to remove over-fit coefficients from the model.
  3. Non-Linearity: MARS uses multiple "splines" or hinge functions inside of OLS to account for potentially non-linear data.  Piecewise linear regression is a rough analog to the hinge functions, except in the case of MARS, the locations of the hinges are auto-detected through multiple iterations. That is to say, through the stepwise process the algorithm iteratively tries different break-points in the linearity of the model, and selects any breakpoints that fit the data well.  (Side note: sometimes when describing these models to non-data scientists, I humorously refer to the hinges as "bendies."  It goes over much better than "splines" or "hinges.")
  4. Regularization: The regularization strategy for MARS models uses Generalized Cross Validation (GCV) complexity-versus-accuracy tradeoffs during the backwards pass of the model.  GCV involves a user-set "penalty factor," so there is room for some manipulation if you run into overfit issues.  The dynamic hinge functions give MARS the flexibility to conform to complex functions (intuitively, they eat degrees of freedom as more effective factors enter the equation), which increases the probability of overfitting.  As such, it is very important to pay attention to regularization procedures.
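To make steps 1 through 3 concrete, here's a toy single-variable sketch of the forward pass in Python (my own simplified illustration, not the earth package: one predictor, no backward prune, no GCV):

```python
import numpy as np

def hinges(x, knot):
    """The paired MARS hinge functions ("bendies") at a candidate knot."""
    return np.maximum(0, x - knot), np.maximum(0, knot - x)

def forward_pass(x, y, n_pairs=3):
    """Greedy forward step: repeatedly add the hinge pair whose OLS fit
    most reduces squared error."""
    X = np.ones((len(x), 1))                           # start with an intercept
    for _ in range(n_pairs):
        best_sse, best_X = np.inf, None
        for knot in np.quantile(x, np.linspace(0.05, 0.95, 19)):
            h1, h2 = hinges(x, knot)
            cand = np.column_stack([X, h1, h2])        # try this hinge pair
            coef, *_ = np.linalg.lstsq(cand, y, rcond=None)
            sse = np.sum((y - cand @ coef) ** 2)
            if sse < best_sse:
                best_sse, best_X = sse, cand
        X = best_X                                     # keep the best pair
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, coef

# Toy data with a genuine kink at x = 3; the forward pass should find it.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.where(x < 3, 3 - x, 0.0) + rng.normal(0, 0.1, 300)
X_basis, coef = forward_pass(x, y)
```

The real algorithm does this across many variables at once and then prunes terms on the backward pass using GCV; this sketch only shows why the hinges let an OLS fit bend at data-chosen breakpoints.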

The hinge function takes the form max(0, x − c) or max(0, c − x) in the equation, allowing the regression splines to adapt to the data across the x-axis.


  • Ease of Fit: Two factors drive the ease of fitting MARS models: variable selection and hinge functions. A while back I was faced with a task where I needed to fit about 120 models (all different dependent variables) in two weeks. Due to the power of the MARS algorithm in variable selection and non-linearity detection, I was able to create these models quite easily without a lot of additional data preparation or a priori knowledge.  I still tested, validated, and pulled additional information from each model, but the initial model build was highly optimized.
  • Ease of Understanding: Because the basic fit (once you get past hinge functions) is OLS, most data scientists can easily understand the coefficient fitting process.  Also, even if your final model will involve a different method (simple linear regression for instance) MARS can provide a powerful initial understanding of function shapes, from which you may decide to use related transformation (quadratic, log) in your final model form.
  • Hinge Optimization: One question I often receive from business users takes the form "what is the value of x that maximizes y?"  In many of these cases, depending on the form of the data, that can be calculated directly by determining the hinge point from a MARS output, much like a local maximum or other calculus-based optimization strategy.


  • Can Be Overfit: Some people get overly confident about the internal regularization of MARS and forget that normal data science procedures are still necessary.  Especially in high-dimensional and highly-orthogonal space, MARS regression will create a badly overfit model. Point being: ALWAYS USE A HOLDOUT TEST/VALIDATION SET. I have seen more of these types of models overfit in the past year than all other algorithms combined.
  • Hinge Functions Can Be Intimidating: Right now, if I went to a business user (or another data scientist) and said that the coefficient on an elastic equation was 0.8, we would have an easy shared understanding of what that meant.  However, if I give that same business user a set of three hinge functions, that's more difficult to understand.  I recommend always using the "plotmo" package in R to show business users partial dependence plots when building MARS models. This provides a simple and straightforward way to describe non-linear relationships.


And finally, a quick example from real-world data.  The Kansas education data set I've used before on this blog can be modeled using the MARS algorithm.  In this case, suppose I wanted to understand the relationship between FTE (the size of the school) and spending per pupil.  From an economics perspective, very small schools should have higher costs because they lack economies of scale.  I created a model in R, including a few known covariates for good measure.  Here's what the output with hinge functions looks like:

That's all a bit difficult to read, what if we use a partial dependency plot to describe the line fit to the FTE to Spending relationship?  Here's what that looks like:

The green dots represent data points; the black line represents the line fit to the data per MARS regression.  The extreme left side of the graph looks appropriate, fitting an economy-of-scale curve, and the flat right side of the graph appears to be an appropriate flat line.  The "dip" between the two curves is concerning and warrants further analysis. (On further analysis, this appears to be a case of omitted variable bias: that category of districts contains many low-cost-of-living mid-rural districts, whereas larger districts tend to be in higher-cost areas, so prices (e.g., teacher wages) are higher.)

Sunday, November 20, 2016

Supreme Court Death-Loss Simulations

Since the election of Donald Trump as President on Nov. 8th, the media has featured many narratives on the negative impacts of the future administration.  These stories vary widely in scope and impact, and while there are likely some legitimate fears given the rhetoric of the Trump campaign, there's also a fair chance that some narratives are actually fear mongering.  One fear that seems valid is Trump's impact on the Supreme Court.  It goes something like this:

In the next four to eight years, there is a reasonable chance that Donald Trump will be able to replace at least one liberal or moderate Justice of the Supreme Court due to death.  If Trump decides to replace that Justice with an ultra-conservative, it could change the majority that has held in recent decisions (e.g. gay marriage; abortion) and impact case law for generations to come.

When my friends (both liberal and conservative), especially those who have a vested interest in gay marriage, abortion, or the Affordable Care Act, hear this, they respond in very emotional ways.  I certainly understand this, as these issues cut to the heart of people's identities, livelihoods, and health. As a data scientist though, my reaction is to simply ask the question: What likely outcome is indicated by data, and how might that impact the future political landscape of the Supreme Court?


The political background of this situation is complex, but I will stick to derivation of assumptions for this analysis:
  • Relevant Cases: For people in my generation, two very recent cases seem to have the most impact on their outlook on the Supreme Court.  
    • Whole Woman's Health v. Hellerstedt: Case regarding what kind of additional restrictions states can place on women seeking an abortion.  The court found 5-3 that states cannot place restrictions that create an "undue burden."
    • Obergefell v. Hodges: This is the infamous gay marriage case, in which the court held that same-sex couples have a right to marry.  This was a 5-4 decision.
  • Current Court Dynamics: Obviously justices vote differently on different issues, but for these two key cases, majority was the same.  Here's how it lays out:
    • Liberals: Kagan, Ginsburg, Sotomayor
    • Moderates (joined liberals in the majority): Breyer, Kennedy
    • Conservatives: Thomas, Alito, Roberts, Scalia (now deceased)
  • Presidential Politics: The progressive theory that underlies the fear of a Trump Court is that Donald Trump will nominate ultra-conservatives to the court, and that they will be confirmed.  There are two main issues with this theory:
    • Trump Conservative?:  There is still an open question regarding how conservative Trump will govern, and what his *real* opinion of social issues like abortion and gay marriage may be.  Beyond this, there is an open question of how much influence Vice President Mike Pence will have, and we're slightly more certain of Pence's conservative agenda.
    • Merrick Garland Stall: Following Scalia's death earlier this year, President Obama nominated Merrick Garland as a replacement.  Congressional Republicans have stalled on confirming that nomination, with the assumption that Trump will replace Scalia in January with another conservative.  This raises a further question for Trump's nominees and this analysis: If a liberal justice dies during the last year of a Trump presidency, will congressional democrats consider this Garland incident a precedent?

What does all of this mean?  For the remainder of this analysis we'll refer to Breyer and Kennedy as liberals, because of their impact on these two social cases.  And obviously the Supreme Court is complex, but generally if Trump can replace one member of the liberal wing of the court with an ultra-conservative, we may see very different future court decisions on social issues like gay marriage and abortion.


General Assumptions: This methodology makes the (likely safe) assumption that no member of the liberal wing of the court will resign their position to be replaced by a Trump nominee.  Given the politics and recent history of the court, this seems reasonable.  As such, the calculations and simulation engine I derive below are based upon actuarial probabilities of death.

Nerdy Methodology (feel free to skip): I use annualized mortality risk by age and gender for US citizens, then use a Kaplan-Meier estimator to determine survival probabilities over assumed four-year and eight-year administrations. A parametric Weibull-based survival model might have been more elegant, but the sample on which the mortality estimates are based is sufficiently large, and curve fitting in Weibull may introduce other types of error.  For straightforward probability estimates this methodology is sufficient; however, some complex scenarios require a simulation-based solution.  I created a Supreme Court Survival Simulation Engine (SCSSE, I guess), which simulates who will survive the next eight years, to answer these more complex questions.  
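The core of the calculation is just a running product of annual survival probabilities, S(T) = (1 − q₁)(1 − q₂)···(1 − q_T).  A minimal sketch in Python, using an assumed flat 7% annual mortality rate rather than the actual actuarial table:

```python
def survival_curve(annual_mortality):
    """Cumulative survival over time: S(T) = product of (1 - q_t)."""
    curve, s = [], 1.0
    for q in annual_mortality:
        s *= 1.0 - q          # survive this year with probability 1 - q
        curve.append(s)
    return curve

# Hypothetical justice with a flat 7% annual mortality risk:
curve = survival_curve([0.07] * 8)
one_term, two_terms = curve[3], curve[7]   # survival through years 4 and 8
```

In the real analysis the q values come from the actuarial tables and rise with age each year, which is what bends the per-justice curves downward.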


Before I go into estimates of mortality/survival, we should first cover what makes an individual more likely to die.  Without knowing health risk from detailed medical histories, the three most predictive factors in short-term mortality are age (older = more likely to die), gender (male = more likely to die), and affluence, measured in various ways (we'll think in terms of income percentiles).  This means we can get general survival odds by looking at four general things:
  • Age
  • Gender
  • Any Known Public Health History
  • Affluence
I will ignore affluence for now, because mid-life affluence seems to be the most predictive and all Supreme Court Justices are similarly affluent. But here's a summary of what we know about current Supreme Court Justices' risk:

To summarize:
  • Conservatives: Rather boring, three men, all in their 60s, and only Roberts has even a rumored health issue.
  • Liberals: The news is relatively bad for liberals.  First is an 83-year-old woman who has had cancer twice, which is risky enough.  Then add two men (aged 78 and 80) with similarly high mortality risk levels.  

I used this data to create cumulative mortality risk estimates for each Supreme Court Justice, for the court as a whole (what is the probability that all 8 current justices will survive?), and for liberals and conservatives separately.  

First, a chart for each justice separately.  The numbers and line on the chart represent the probability of surviving through each year of the administration.  You will see that relatively healthy, young justices (e.g. Kagan) have relatively low chances of dying even over a second Trump term (flatter lines).  Older justices (e.g. Kennedy, Ginsburg), however, have a lower shot at surviving, some with less than a 50% chance of making it through a second Trump term.

We can then aggregate these yearly by-justice probabilities into survivability-as-whole numbers for the entire supreme court.  Below I've created annual survival probabilities for three groups of Supreme Court Justices: 1. All Supreme Court Justices, 2. Conservative justices (defined as three living dissenters in two above cases), 3. Liberal justices (defined as those in majority in above cases).  The Y-axis is the probability that all members of each sub-group will survive through each year of the Trump administration (Years represented on X-axis).  Here's what that looks like:

Survival charts can be a bit complex to read; here's a summary:

  • First Term: At the end of the first term, there is only a 34% chance that all justices will survive, a 42% chance that all five liberal justices will survive, and an 80% chance that all conservative justices will survive.  This means there is a 58% chance that at least one liberal justice will die, to be replaced by Trump.
  • Second Term: At the end of a potential Trump second term, there is only a 6% chance that all justices will survive, an 11% chance that all five liberal justices will survive, and a 68% chance that all conservative justices will survive (i.e. an 89% chance that Trump will have the opportunity to replace a liberal on the court).
In essence, the probability is better than 50/50 that Trump will get to replace at least one liberal justice in his first term, and nearly 90% that he will be able to shift the balance of power by the end of his second term (if he so chooses).  But what about more complex scenarios: what are the odds that Trump gets to replace two liberal justices in four or eight years?  Enter a simulation engine.


My prior analysis showed a high probability that Trump may be able to shift the current balance of power in the Supreme Court, but can we predict the odds of replacing more than one liberal justice?  To do this we need a simulation engine with some robust matrix-algebra/storage capabilities, which I designed in R.  The simulation engine is somewhat novel in its ability to calculate the number of survivors for heterogeneous groups of people, and could, theoretically, be applied to any group of people, including families.  Moving on.

The first simulation involved all eight current Supreme Court Justices and calculated the number (independent of ideology) that would survive the first and (potential) second term of the Trump administration.  I ran one million simulations and output the graph below (two terms on left, one term on right).  The Y-axis and bar labels represent the number of simulations that ended in each outcome; the X-axis is the number of justices surviving in the simulation.  Essentially: the height of the bar divided by 1,000,000 is the probability of each outcome.
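Conceptually, the engine is just a Monte Carlo loop: draw a death/survival outcome for each justice in each year, count the survivors, repeat.  A stripped-down sketch of the idea (mine was written in R with matrix storage; this Python version, with invented hazard numbers, is only illustrative):

```python
import random

# Count surviving justices per simulation. Each justice gets a list of
# annual death probabilities; the numbers below are invented placeholders.
def simulate_survivors(justice_hazards, n_sims=100_000, seed=42):
    """Return {number_of_survivors: count_of_simulations}."""
    rng = random.Random(seed)
    outcomes = {}
    for _ in range(n_sims):
        survivors = 0
        for hazards in justice_hazards:
            # a justice survives only by dodging death in every year
            alive = all(rng.random() >= q for q in hazards)
            survivors += alive
        outcomes[survivors] = outcomes.get(survivors, 0) + 1
    return outcomes

# two hypothetical justices over a four-year term
hazards = [[0.06, 0.065, 0.07, 0.08], [0.02, 0.022, 0.024, 0.026]]
print(simulate_survivors(hazards, n_sims=10_000))
```

Dividing each outcome count by the number of simulations gives the bar heights described above.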

What do these simulations reveal?
  • First term: about a 35% chance that all justices survive, a 41% chance that all but one survive, and a 20% chance that two die. 
  • Second term: only a 6% chance that all justices survive, a 20% chance that all but one survive, a 35% chance of two dying, and a 25% chance that three die in that time.  And a marginal, yet real, 3-in-1,000,000 chance that all eight Supreme Court Justices die in the next 8 years.
That's interesting, but shifting the balance of power involves specifically replacing liberal justices.  Let's re-simulate and only analyze the five liberal justices.  Here's what that looks like:

And these simulations show:
  • First Term: about a 40% chance of all justices surviving, a 42% chance of one liberal justice dying, and a 15% chance of two liberal justices dying in that time.
  • Second Term: a 10% chance of all liberal justices surviving, a 34% chance of one liberal justice dying, a 38% chance of two dying, and a 16% chance of three dying.  Also, again, a minute but real 8-in-10,000 chance of Trump being able to replace all liberal justices by the end of 8 years.
Another interesting scenario is looking at just the three most liberal justices (coincidentally? all female).  Here's an analysis of the death probabilities for just Kagan, Sotomayor, and Ginsburg:

  • First Term: There is a 68% chance of all three liberal justices surviving the first term, and a 30% chance of one dying (most likely Ginsburg).
  • Second Term: There is a 39% chance of all three justices surviving a second term, a 53% chance of one dying, and an 8% chance of losing two of the three most liberal justices.  There's also a roughly 0.3% chance of all three female justices dying by the end of Trump's second term.


I've taken reasonable steps to create accurate estimates of mortality risk for each Supreme Court Justice and for pools of justices.  There are a couple of potential sources of bias, which may or may not be adequately controlled for:

  • Health - Generally speaking, Supreme Court Justices are fairly healthy despite their age.  The types of illnesses we see among them seem fairly normal for a cohort of their age, if not slightly healthier than comparable American groups (the Supreme Court is, after all, largely a group of still-able-to-work senior citizens).  The outlier here is Ruth Bader Ginsburg, who has survived cancer.  Twice.  Those cancers (colon, 1999, and pancreatic, 2009) both carry very high mortality risk, so it's difficult to acquire accurate post-seven-year mortality multipliers.  Since she has survived seven years, I make the likely assumption that her pancreatic cancer was caught in time and is no longer a risk.

  • Affluence - We know that affluence, and specifically income level at middle age, tends to impact mortality risk later in life.  Supreme Court Justices are likely in the top 2-3 percentiles of Americans in terms of income and education (they have law degrees and make $244K annually).  This means that our mortality estimates may over-estimate the death probabilities for Justices, who may, ceteris paribus, live longer due to income, affluence, and privilege.  Reviewing relevant data, and available information on the relationship between mortality rates at the median versus the top percentiles, it is likely that annualized mortality risk for Supreme Court Justices is 25-40% lower than the median. 
As I pointed out, there isn't a good case for making in-line adjustments to mortality estimates based on health, but affluence seems a different matter.  Using the estimates developed above, I re-ran the annualized probability of all cohort justices surviving; results below.
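As a sketch of that adjustment (the multiplier and rates here are illustrative; the actual analysis used a reduction in the 25-40% range), scaling each annual mortality rate before recomputing the all-survive probability looks like this:

```python
from math import prod

# Recompute P(all justices survive) after scaling annual mortality rates
# down by an affluence multiplier. Rates below are invented placeholders.
def all_survive(qs_by_justice, adjustment=0.7):
    """adjustment=0.7 means 30% lower mortality than the median."""
    return prod(
        prod(1 - q * adjustment for q in qs)
        for qs in qs_by_justice
    )

rates = [[0.06, 0.065], [0.02, 0.022]]  # hypothetical two-justice cohort
print(round(all_survive(rates, adjustment=1.0), 3))  # unadjusted
print(round(all_survive(rates, adjustment=0.7), 3))  # affluence-adjusted
```

The adjusted all-survive probability is always at least as high as the unadjusted one, which is why the replace-a-justice risk falls.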

To summarize: if we account for the impacts of income and affluence, the four- and eight-year risk of replacing at least one Supreme Court Justice falls to 46% and 81% (from 58% and 89% respectively).  If we make an affluence assumption these values may be more accurate; however, it's difficult to definitively know the impact that affluence has had on each Supreme Court Justice.


The results of this analysis are fairly straightforward:
  • There is a 58% chance that at least one liberal justice will die during Trump's first term, and an 89% chance of the same if Trump is elected to a second term.
  • If Trump is elected to two terms, it is likely (57%) that two liberal justices will die during his presidency. 
  • These estimates may slightly overstate the risk of death for liberal justices, due to economic affluence; as such, the risk of death of at least one liberal justice may be reduced to 46% (four-year) and 81% (eight-year).
  • These are simply the death risks; to assume that any of these deaths would shift the balance of power also assumes that Trump would appoint conservatives, that they would be confirmed by the Senate, and that the Merrick Garland precedent is not used by Democrats.

This analysis may seem somewhat cold, robotic, and mathematical to some readers.  And it is.  Mortality is difficult to discuss, and placing hard mathematical numbers on the odds of surviving past a certain date is a bit frightening.  But it is also necessary at times to look at the world in these terms, so that smooth and open succession planning can occur.   There are a couple of obvious implications to this analysis in my mind:
  • I (or someone) should have conducted this analysis prior to the election, so everyone could have better understood the full implications of the Trump presidency on a few issues important to both conservatives and liberals.
  • There's a greater question here on how we look at our own mortality, and how we manage risk around it.  I'll close with a question: if the liberals on the court had known these odds of survival (and, conversely, the odds of having their replacement chosen by Trump), would any have resigned in 2014?

Monday, November 14, 2016

Quick and Easy Geographic Maps in R

Over 10 years working in analytics and data science, I have found that policy makers and business executives gravitate towards analytical maps in order to understand business, social, and demographic relationships.  Though I have created these types of maps in one form or another for over a decade, I've found that many young analysts have trouble understanding the data and code that are at the base of geographic analysis.  The purpose of this post is to demonstrate the mapping capabilities now available in R, and to describe how to quickly create attractive graphical representations of spatial data.


A few months ago on Twitter, I was criticizing R's ability to create quick, functional, and attractive maps. My essential criticism was this:

In order to create good map visualizations, I often have to pull my data out of the R statistical engine and merge it with a shapefile inside of a GIS system like QGIS.  QGIS is great, and I can create awesome visualizations in that system.  

Other R users jumped in and encouraged me to check out some newer functionality, specifically that found in the ggmap package.  This package is related to the popular ggplot2 package that I often use for creating graphical representations of data and models at work and on this blog.  Using this new(ish) functionality, I was able to code the following map in just a few minutes, and with just a few lines of code.

African American % by Precinct, Sedgwick County Kansas


The code to create this map was straightforward; here are my comments on the capabilities of dealing with GIS data in R:

  1. readShapeSpatial is a function that allows us to ingest shapefiles into R.  Shapefiles are a standard data type for geographical data, for more information see here.  
  2. fortify is a function that we can run against a shapefile to transform the geospatial data into an R data.frame.  I would recommend analyzing the output of this process, it is informative about your dataset, as well as how geospatial data "works."
  3. @data  is the classic data element of the shapefile (holds demographics, generally), which we can reference as (shapefile@data) and treat like a data.frame in R (see code below).
  4. qmap  is the analog to ggplot2's qplot.  It is a way to quickly create maps, without requiring much syntax or handling.  A few things of note:
    1. The function allows us to underlay google maps against our shapefile, the first parameter here is the text "search" of google maps on which to center our map.
    2. Zoom: we also pick a zoom level, which tells qmap how zoomed in on our search area the map should be.  I recommend just playing with this until it looks good.
    3. geom_polygon is a function that tells qmap what to do with the shapefile.  You'll notice that the rest of the syntax looks much like that in ggplot2.  If you need help with that type of syntax I recommend this cheat sheet.


 #load the libraries we need: maptools (readShapeSpatial),  
 #ggplot2 (fortify), ggmap (qmap), and plyr (join)  
 library(maptools)  
 library(ggplot2)  
 library(ggmap)  
 library(plyr)  
 #grab my shapefile  
 shapefile <- readShapeSpatial("KLRD_2012VotingDistricts.shp")  
 #create id from rownames   
 shapefile$id <- rownames(shapefile@data)  
 #fortify shapefile, creates a data.frame of the polygon coordinates  
 data <- fortify(shapefile)  
 #join coordinates to @data, the attribute table (dbf element) of the shapefile  
 data <- join(data, shapefile@data, by = "id")  
 #subset by FIPS for county-level data  
 data <- subset(data, substr(data$VTD_2012, 3, 5) == "173")  
 #calculate % African American  
 data$AA_PERC <- data$BLACK / data$POPULATION  
 #run qmap for Sedgwick County Kansas  
 qmap('Sedgwick County Kansas', zoom = 10) +  
      geom_polygon(aes(x = long, y = lat, group = group, fill = AA_PERC), data = data,  
      colour = 'white', alpha = .6, size = .3) +   
      scale_fill_gradient(low = "green", high = "red")  

Sunday, October 23, 2016

A Response to Election Fraud Claims and "Electoral Systems in Crisis"

The world is complex.  As statisticians and data scientists, our job is to create models that describe that complexity, and reduce it to simplified equations from which we can understand and predict how the world acts and reacts.  This is a story of what occurs when we use models that oversimplify the world, and fail to adequately describe its complexity.  It is also a response to a paper (found here) that mentioned my prior work on election fraud.  The paper contains an attempted but insubstantial critique of my work, which I will address later in this post.     

Early this summer I was approached online by a journalist claiming ABC News credentials, stating that she was doing a story on "Election Fraud," and asking for my involvement in the project.  It was quite a busy summer for me, but I said I might be interested in participating.  That interest quickly waned as I learned three things:
  • The Press Credentials Were Not Relevant.  While the journalist used her ABC affiliation to get me on board, after further questioning on the details of the project, I learned it had nothing to do with ABC.
  • The Talent For the Problem Seemed Incorrect.  Early conversations with the journalist demonstrated a poor understanding of high-dimensional statistical problems, both by her and the other team members.  There was also no demographer on the team, though what we were dealing with certainly has demographic elements.
  • The Authors Weren't Interested in Legitimate Falsification.  After interacting with the lead author, journalist Lulu Fries’dat, and observing the nature of the work, I found the authors were not interested in actual scientific falsification of their theory.  They were already convinced by the evidence, and real steps at falsification were futile.
As a result, I opted out of the project.  Then, in late July, the resulting paper started to be circulated with a half-attempted critique of my work.  I read the paper, reacted, sent it to a few friends with the general question: should I respond?  

The consensus from friends was that I should not respond, for a few reasons including the poor quality of the science, the fact that the statistical work was far too shoddy to be published in any serious academic journal, and the general belief that I shouldn't waste my time.

I let the paper go for a few months, but it kept bothering me.  And, with context in the run-up to the US presidential election, I felt I needed to respond, for two reasons:
  • Rhetoric of Systemic Rigging.  Throughout this election cycle, especially from the Bernie Sanders campaign, but also from the Donald Trump campaign, there has been a substantial dialogue around the idea that the "system is rigged."  This has ranged from the economic system to the electoral system.  My worst nightmare here is that we wake up November 9th, Trump has lost the election, but massive social unrest occurs due to his followers believing he was robbed.  This paper (Fries’dat and Sampietro) would certainly feed that sentiment, such that this response is necessary.
  • Involvement of Fritz Scheuren: Fritz is a former President of the American Statistical Association, and is at least being name-dropped in this paper.  The list of authors has him as "with," so he is not a primary author on the piece, but at least we should view him as a contributor to the general concepts in the paper.  Though it's disappointing that Scheuren would sign on to poor research, he's certainly free to do so.  My concern here is that his name will lend credibility to otherwise implausible analytical techniques.


In April 2015 I became aware of a Kansas statistician named Beth Clarkson who was making some fairly astounding claims.  She essentially claimed that she had found evidence of massive fraud during a statistical analysis of voting records.  I was intrigued, dug deep into the data, and posted a series of three posts in April 2015.  For this background section, I have pulled out and edited those three posts, though the originals are still available on this blog.  


A Wichita State University statistician filed a lawsuit in Sedgwick County Court regarding the Kansas 2014 election.  She is trying to gain access to voting machine tallies to rule out the potential of voter fraud. 

Clarkson, a statistician who works as a QA engineer, has found some "voting anomalies"... essentially that Republicans receive larger than expected vote shares in larger precincts.  Keep in mind that QA (Quality Assurance) engineers are trained to look for anomalies, things you wouldn't expect in data, and focus on them so that systems don't fail. 

Of note from her original article:
“This is not just an anomaly that occurred in one place,” Clarkson said. “It is a pattern that has occurred repeatedly in elections across the United States.”

"The pattern could be voter fraud or a demographic trend that has not been picked up by extensive polling, she said."

When we take these comments at face value, considering they come from a research statistician, they don't seem out of the ordinary.  Just a researcher looking into anomalies, searching for potential causes.  I did think that putting "voter fraud" out there as a possibility seemed a little aggressive at this point, but I didn't see this as an issue that would get a lot of attention.  But consider the political climate of Kansas in early 2015:

  1. The political climate in Kansas right now is tense, largely due to a highly contested gubernatorial election in 2014.  In that election, many polls and sources projected Democratic challenger Paul Davis to win by a narrow margin; however, unpopular Republican incumbent Sam Brownback won by three points. 
  2. Progressives are especially upset because they just lost an election, which their leaders told them would be a fairly easy win.
  3. I've seen the Clarkson article above posted by many progressive friends as fodder, evidence, and proof that, statistically speaking, Brownback was probably re-elected due to election fraud.
This seems to be a massive conspiracy theory heating up.  But let's be calm and analyze:

What we know

From Clarkson's comments, she has no evidence of voter fraud.  She has found a small statistical anomaly that exists nationwide and wants to use Kansas to verify that it isn't due to fraud or voting machine issues.

But what is that anomaly? 

The anomaly is that after a certain size threshold of voters in a precinct (500), there is a positive correlation between precinct size and percent Republican votes.  

Why is that an issue?

Clarkson is a QA engineer in anomaly detection mode.  She's starting from an a priori premise that precinct size should not determine results, and thus, statistically significant correlations should not exist.

Is Clarkson's analysis of the data correct?

Though I disagree with her conclusions, I tend to view the mechanics of her analysis as correct.  In fact, I was able to replicate it using the 2010 Kansas Gubernatorial Election.  The relationship is weak, statistically speaking, but statistically significant.  This indicates that something non-random is happening in the data.  Regression statistics and visual plot below. (For help interpreting regression outputs, click here.)

And a scatter plot of the analysis.
The line indicates the line of best fit for the data, demonstrating Clarkson's correlation
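For anyone who wants to replicate the mechanics of that check, the core of it is an ordinary least squares slope with a t-test against a zero slope.  A stdlib-only sketch on toy data (the real analysis ran on Kansas precinct returns, which I'm not reproducing here):

```python
import math

# OLS slope of y on x and the t-statistic for H0: slope = 0.
def slope_t_stat(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                        # slope estimate
    a = my - b * mx                      # intercept
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return b, b / se

# toy data: a weak positive relationship between precinct size and vote share
size  = [200, 400, 600, 800, 1000, 1200, 1400, 1600]
share = [0.48, 0.51, 0.50, 0.53, 0.52, 0.55, 0.54, 0.56]
b, t = slope_t_stat(size, share)
print(round(b * 1000, 3), round(t, 1))  # slope per 1,000 voters, t-statistic
```

A t-statistic well above ~2 is what "statistically significant" means here; note that significance says nothing about *why* the slope is non-zero.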

The correlations exist, so does that mean people must be acting nefariously in those large districts?

Here we go.  Absolutely not.  And here's why: covariates.  In the real world, multiple variables often correlate with one another, causing us to find relationships that are really measuring something else.  Clarkson's comments allude to this when she talks about potentially underlying and undetermined demographic factors. 

There are many what-ifs here.  What if other variables also correlate with precinct size?  Age, race, wealth, urbanity, etc.  In these cases we could be latently measuring other factors through measuring precinct size.   Some may speculate that conservative populaces somehow push for fewer, larger precincts and less division, though there is little evidence to back up that type of conclusion.  Specifically for the demographic factors, we should keep in mind that this is a weak relationship, so slight and correlated latent groups can play a relatively large role.

The author of this analysis already drops the smallest precincts as a whole, because they tend to be more rural, and thus more Republican.  By doing this, she tacitly admits that underlying demographic factors can impact the correlation between precinct size and voting behavior.  We haven't gone through the steps to exclude all other demographic factors, so why are we making vague accusations of fraud and garnering media attention?  


I posted twice later that month, again responding to Clarkson’s analysis and looking at specific factors that might be amiss in her data.  It still bothered me that Clarkson had posted essentially no work attempting to falsify demographic explanations (she has not posted any attempted falsification work to date).  I set out to look into alternate causes of Clarkson’s correlation.  Here’s a synopsis of my analysis:

  1. I changed data sets to the 2008 Presidential election.  This is valid because the correlation of interest (Republicans doing better in larger precincts) still holds up.
  2. Nearly every demographic covariate I threw at the equation was statistically significant and more important than the "number of voters" variable.  What does this mean? Demographics certainly play a role in vote shares (somewhat obviously). 
  3. If I create a large predictive model, using other variables such as population density, county size, and relationships between local precincts, the number of voters voting in the precinct becomes statistically insignificant.  In essence, the underlying cause of Clarkson's correlation is a demographic profile that tends to covary with precinct size.

This means: The original analysis was looking at a very small relationship in a world where much more important relationships exist.  And if we look at the data in a way where we simultaneously account for multiple factors, the correlation from Clarkson's original commentary is simply non-existent.
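This point can be illustrated with a toy example: when a latent demographic factor drives both precinct size and Republican share, the raw size-share correlation looks strong, but it vanishes once you control for (residualize on) that factor.  The sketch below uses simulated data, not the Kansas files:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def residuals(y, x):
    """Residuals of y after removing its linear dependence on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return [c - (my + b * (a - mx)) for a, c in zip(x, y)]

rng = random.Random(0)
suburb = [rng.random() for _ in range(2000)]  # latent demographic factor
size = [500 + 800 * s + rng.gauss(0, 50) for s in suburb]     # factor drives size
share = [0.45 + 0.20 * s + rng.gauss(0, 0.02) for s in suburb]  # and vote share

print(round(pearson(size, share), 2))  # raw correlation: strongly positive
print(round(pearson(residuals(size, suburb), residuals(share, suburb)), 2))
```

The second correlation collapses toward zero: once the shared factor is accounted for, precinct size carries no independent information about vote share, exactly the pattern in the multi-factor model above.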

First for validation of Clarkson's initial simple correlation, did it hold up?  Absolutely and here is the evidence.

Correlation between total precinct votes in 2008 and Republican share is positive and statistically significant.

But what else correlates to the percentage that votes Republican in an election?  A lot of our variables, it turns out.  Here's a correlation matrix.  Notice that size of precinct is actually the lowest absolute correlation value.  Also of note, many of them are cross-correlated with size of precinct, pointing towards multi-collinearity.

What happens if I add some of these demographic variables that a priori make sense in a model?  Number of voters in the precinct is no longer a significant predictor of vote share, but other variables end up highly significant. One variable worth noting is the percent voting.  This variable is important, because as the percent of voters voting increases, the percent Republican increases significantly.  This is partial verification of a prior concern I had regarding higher turnout in Republican districts being an underlying cause of Clarkson's correlation.

Multiple factor model at precinct level, with total votes, percent voting, land area, percent voting age, area ratio to county, and county population.

Statistical impact: The key takeaway is that the small correlation found by Clarkson and previous authors is most likely due to other correlated variables.  These variables generally measure demographic factors and precinct design concerns (and correlate conceptually with the ideas from commenters on this blog and elsewhere).  We have no good a priori reason to believe that precinct size should be uncorrelated with vote share %.


Clarkson’s theory is that most precincts under 500 voters are rural, Republican-leaning districts, whereas over 500 voters she would expect the relationship between precinct size and Republican percentage to level out, or become more Democrat-leaning, because the larger precincts would be at the urban core.  I have explained the problems with this logic in several different blog posts: effectively, precinct creation is not a randomized process, thus many covariates, demographic and otherwise, come into play.  I even demonstrated how Clarkson's analysis withers when we expose the data to demographic covariates.

Let's look at this through a few visual examples.

Focusing on Johnson County and Sedgwick County, I highlighted all of the precincts with 500 or more voters in the 2014 Governor's election, and then classified each one into larger buckets.  Above 500 voters per precinct, the smallest precincts are the ones closest to the urban core, while the largest are in the outer-ring suburbs.  This map demonstrates that relationship:

Johnson County KS, Center City to Upper Right
Additionally, for Sedgwick County KS:

Sedgwick County Kansas, Center City in Center-Right

On its face, we don't necessarily know that the gentrified areas closest to the urban core are going to be the most liberal.  So I mapped this as well, validating that the areas closest to the urban core tend to be the most liberal, with the most conservative areas outside of the 435 loop, to the south and west of the inner urban core. 
Johnson County KS, Vote % by Party

Sedgwick County also shows a very obvious pattern, with the most democratic areas in the central city area, moving more Republican towards the suburbs.

Sedgwick County Kansas, Vote % by Party

What does this mean?  Two concepts are validated:
  • Precinct creation is not random: the larger precincts within Johnson County and Sedgwick County do not lie closest to the "Democrat" urban core, or randomly throughout the region, but instead in rural and near-urban suburbs, in direct opposition to Clarkson's hypothesis.
  • Those suburban areas (outside of the 435 loop in JoCo, suburban non-core areas in SGCO) tend to also be more conservative.

In essence, the primary hypothesis of Clarkson's analysis is flawed, because these a priori relationships exist.  In this case, it appears that the patterns of suburbanization over time have led to this scenario: in large cities, the largest precincts are not at the urban core, but instead lie in the ring suburban areas which have gradually developed over the past 50 years.  (In Wichita, a large portion of new-development ring-suburban area lies inside city limits). 

Clarkson's theory is based on a broken a priori notion:  after 500 voters, there should be no correlation between precinct size and percent that vote Republican.  The specific reason her theory is broken is that the precinct creation was not random, and in fact suburbanization caused the largest of the precincts to be in whiter, richer, and more Republican leaning areas.  


I have now demonstrated this for Sedgwick and Johnson Counties, but how much do those two counties actually matter to overall Kansas results?


Let's take a deeper look into large precincts.  An easy way is to break precincts into buckets by size, and talk about them in this way.  Here are the size buckets I am using:

  • Regular Precincts: 0-500 voters (Clarkson ignored these)
  • Large Precincts: 500-1000 voters
  • Super-Large Precincts: 1000+ voters

First, how does Brownback perform by each size-grouping of precincts?  Here's a chart:

This chart actually verifies Clarkson's correlation.  Effectively, Brownback did best in regular and super-large precincts.  The fact that he did better in super-large precincts than in large precincts is the exact correlation that Clarkson cited in her original work.  This is just another validation that the correlation exists, but NOT of Clarkson's interpretation of this finding as evidence of fraud.

But how much do suburbanization patterns in Johnson and Sedgwick County matter in this?  A lot.   First, Johnson and Sedgwick counties make up only 12% of the regular-sized precincts, but they make up almost two thirds of large precincts and 70% of the super-large precincts.  Sedgwick County alone has more than 2/3rds of the super-large precincts statewide, and a higher ratio of super-large to large precincts.  

A quick aside: because the independent variable in Clarkson's correlation is "precinct size," Sedgwick County can create a correlation simply by (1) being more conservative than Johnson County and (2) having larger precincts than Johnson County.  More on that below.

If we look at Clarkson's analysis, over 2/3rds of the sample can be attributed to Johnson or Sedgwick counties, where we know that her a priori assertion is broken.  Moreover, when we run the vote-count-to-Republican-percent correlation on the other 1/3rd, we see no correlation.  The effect is only observable in urban/suburban counties (where demographically significant suburbanization processes have occurred).  Thus Sedgwick and Johnson counties are all that matter to the observed correlation, and we don't need to explore additional counties.  Here's an R output for the other 101 counties:

One more thing: there's another factor that increases the correlation when we aggregate results.  Because the majority of super-large precincts are in Sedgwick County, Sedgwick County has leverage over Johnson County.  And because Wichita is generally a more conservative region than Johnson County, that leverage serves to increase the correlation, though due to no nefarious or unexplained phenomena.  

The concept of leverage, or "mix" (i.e., the mix of counties), can easily be shown graphically.  Luckily, this is easy with the ggplot2 R library.  Here is a scatter of Clarkson's correlation with counties color-coded.  In this, you'll notice the more liberal counties tend to have mid-to-large precincts (Wyandotte, Douglas, even Johnson County) while more conservative counties (Sedgwick, other rural) make up a majority of the largest precincts (the bar chart above also demonstrates this).  This enhances Clarkson's correlation when counties are combined, simply due to the mix of counties, not in-county nefarious action by voting machines or bad actors at precincts.
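The mix effect is easy to demonstrate with a toy example: two counties with zero internal correlation between precinct size and Republican share can still produce a strongly positive correlation when pooled, provided the more conservative county also has the larger precincts.  A simulated sketch (invented numbers, not the real county data):

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

rng = random.Random(1)
# County A: more liberal, mid-sized precincts (no internal size-share link)
a_size  = [rng.gauss(800, 60) for _ in range(500)]
a_share = [rng.gauss(0.45, 0.03) for _ in range(500)]
# County B: more conservative, larger precincts (also no internal link)
b_size  = [rng.gauss(1200, 60) for _ in range(500)]
b_share = [rng.gauss(0.60, 0.03) for _ in range(500)]

print(round(pearson(a_size, a_share), 2))  # within County A: ~0 by construction
print(round(pearson(b_size, b_share), 2))  # within County B: ~0 by construction
print(round(pearson(a_size + b_size, a_share + b_share), 2))  # pooled: positive
```

This is the Simpson's-paradox-flavored version of the leverage argument: the pooled correlation is a between-county artifact, not evidence of anything happening inside any precinct.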

This analysis demonstrates that a portion of the correlation at the statewide level for Kansas is due to the relative conservatism of separate urban areas.  In essence, it's not surprising that we would see the correlation strengthen when counties are combined.


I am not the only person working on this issue.  A political scientist named Mark Lindeman (also mentioned in the Fries'dat and Sampietro paper) explored the same correlations Clarkson and I have reviewed, but analyzed different dimensions of the data.  Specifically, his work draws the correlation back to original voter registration data, and demonstrates that the correlation doesn't start at the ballot box (link to Lindeman's most recent paper).

What does that mean?  Party affiliation at registration is also correlated with the total number of voters, long before we get to the voting machine.  That means that Clarkson’s correlation exists prior to the voting system; the correlation between large precincts and Republican registration occurs, effectively, in nature.  What could cause this to exist in nature?  A few things, demographic effects and patterns of suburbanization included.  I suggest reading his work.  I also validated it using Johnson County data from 2004; see the charts below, first validating Clarkson's correlation, then replicating Lindeman's correlation to registration patterns. 


A few months after my initial post, Esquire magazine’s online version ran a quick piece on Clarkson’s work.   I wrote some text with links to my work in the comments section of the article.  Beth Clarkson herself later responded, and we had an engaged conversation about her work.  Here is Clarkson's final statement to me on the Esquire forums:

We seem to be in agreement that the null in my case isn't true. I disagree that it invalidates my work because I feel the cause is what is under debate. Your suggestion of assuming a particular prior distribution may or may not be appropriate. I haven't looked at it deeply enough to know for sure. In short, I'm agreeing that you could well be right about that. That our electronic voting machines are eminately hackable and have no post-election audit procedures in place are established facts and are equally concerning to me. Do you diagree about that aspect? Are you satisfied with assuming a distribution that fits the pattern? Or do you agree that our voting system should be (but isn't) transparent enough for citizens to feel confident that the results are accurate?

Here's my response one by one:
  1. On the NULL case not being true.  I agree with Clarkson that we can "reject the null hypothesis"; in fact, in my first post on the subject (and above in this post) I replicated her results.  But all Clarkson is saying by claiming the null case is false is that she found a non-zero correlation.  I agree there is a non-zero positive correlation, but if we dive deeper, why are we testing this null hypothesis at all?  And if we can reject it, have we done the research to rule out reasonable alternate explanations (I have, and there are some)?  Keep in mind my prior work on this subject, which shows that demographic factors and the way precincts are created produce this correlation.  Rejecting the null hypothesis here is not meaningful, because it only tests the false assumption that there should be no correlation at all.  That has been the point of this blog's work on the subject: the null hypothesis is irrelevant.
  2. On her admission that she hasn't looked deeply into this.  She concedes that I may be right.  A lot of thoughts here.  She has been threatening lawsuits and giving newspaper interviews over something she hasn't deeply reviewed.  She also said earlier in her comments that she hasn't had access to demographic or mapping data.  I have been able to compile that data, usually in a matter of minutes, whenever I have wanted to look at it.  Access to data is easy, and it's the job of a modern statistician or data scientist to acquire it and test their own work; that is due diligence.  Effectively, she admits she has done less work on the subject than I have, and concedes I may be right.
  3. On open government concerns.  I have always agreed with her on this concern, on this blog and publicly, multiple times.  I have also offered to help, if I can, should she get access to that data.  At this point, though, this is a rhetorical ploy: having conceded on intellectual grounds that she hasn't done due diligence on her own work, Clarkson has started making vague allusions to things that sound appealing to everyone.  It is largely a diversion from the real issue at hand, the failure of her theories.
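Point 1 above can be made concrete with a toy sketch (synthetic numbers, not Kansas data): when a lurking variable drives both precinct size and vote share, any standard significance test will reject the zero-correlation null, and yet the test says nothing about cause.

```python
import math
import random

random.seed(1)

# Synthetic example: a hidden "suburban-ness" variable drives BOTH precinct
# size and Republican share, so the two are correlated with no fraud involved.
n = 1000
suburban = [random.random() for _ in range(n)]
size = [500 + 1500 * s + random.gauss(0, 300) for s in suburban]
share = [0.35 + 0.30 * s + random.gauss(0, 0.10) for s in suburban]

# Pearson r and its t statistic under H0: true correlation = 0
ms, mh = sum(size) / n, sum(share) / n
r = sum((a - ms) * (b - mh) for a, b in zip(size, share)) / (
    math.sqrt(sum((a - ms) ** 2 for a in size))
    * math.sqrt(sum((b - mh) ** 2 for b in share))
)
t = r * math.sqrt((n - 2) / (1 - r * r))
print(round(r, 2), round(t, 1))  # t lands far past any critical value
```

Rejecting the null here is trivial, but the mechanism is the hidden variable, which is exactly why the hypothesis test alone cannot distinguish demographics from tampering.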


In late July of this summer, a paper was published that cited Clarkson's work and served as an attempted critique of my work and Mark Lindeman's.  As reviewers of the paper I spoke to put it, this is not a publishable paper, and the statistical work is overly simplistic, especially for complex elections in a demographically influenced, multi-dimensional world.  As I stated earlier, responding to this paper is probably not worth my time.  However, in the context of a highly contested national election, I feel obliged to respond.

Point 1: The authors punt on actually critiquing the models.
Outside of Clarkson's attempt to invalidate my model (which I rebut below), the criticism of both me and Mark Lindeman is entirely rhetorical, never pointed at specific issues with the models.  One outright untruth in the paper is the claim that Lindeman fails to deliver any data; he has delivered it repeatedly in other public venues (in fact, the report itself links to Lindeman's data and code).  The authors also suggest I don't use the right statistical model, but fail to explain why mine is statistically invalid.

Effectively, the authors deliver the debate equivalent of standing up and yelling *wrong*, without critiquing the actual data, statistics, methods, or underlying theory.  The critique doesn't work largely because it's non-existent, and this paper would struggle to stand up to the peer review required by reputable publications.  The paper also never explains why the data is weak, and since I have responded, line by line, to each of their actual concerns, I consider Clarkson, Fries'dat, and Sampietro to be thoroughly disproven.

But the claim that Lindeman doesn't deliver any substantive data is still troubling to me, because he certainly does, and I demonstrated as much earlier in this post.  If anything, Lindeman's work makes my own somewhat redundant, as it identifies the nearest root cause of Clarkson's correlation.  Let's revisit his theory: that Clarkson's relationship exists prior to the voting machines, and in fact that more Republicans register in larger precincts than in smaller ones.  Can this be demonstrated once again?  Yes, this time in the 2010 Kansas Governor's election.

First, beyond 500 voters, Republican registration correlates with precinct size:

Second, once we account for voter registration patterns in our model that relationship disappears:

To summarize, the paper's (Fries'dat and Sampietro's) critique is insufficient and fails to address any of the work of Lindeman and Bowles.  Here I have once again replicated Lindeman's work, which is a statistical proof of his hypothesis: Clarkson's correlation isn't unexpected, but in fact follows voter registration patterns.
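The control-for-registration logic can be sketched on synthetic data (none of the numbers below are real Kansas estimates): when registration share drives both precinct size and vote share, size "predicts" votes on its own, but its coefficient collapses toward zero once registration enters the model.

```python
import random

random.seed(7)

n = 400
# Synthetic precincts: Republican registration share drives BOTH precinct
# size and vote share, creating a spurious size -> vote relationship.
reg = [random.uniform(0.2, 0.7) for _ in range(n)]
size = [400 + 2000 * g + random.gauss(0, 200) for g in reg]
vote = [0.05 + 0.90 * g + random.gauss(0, 0.03) for g in reg]

def ols(y, cols):
    """Least-squares coefficients for y ~ 1 + cols, via the normal equations."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(X[0])
    A = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    b = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    for p in range(k):                      # forward elimination
        for q in range(p + 1, k):
            f = A[q][p] / A[p][p]
            for c in range(k):
                A[q][c] -= f * A[p][c]
            b[q] -= f * b[p]
    beta = [0.0] * k
    for p in reversed(range(k)):            # back substitution
        beta[p] = (b[p] - sum(A[p][q] * beta[q] for q in range(p + 1, k))) / A[p][p]
    return beta

naive = ols(vote, [size])            # [intercept, size]
controlled = ols(vote, [size, reg])  # [intercept, size, registration]
print(naive[1], controlled[1])       # size coefficient shrinks toward zero
```

This is the shape of Lindeman's result: once registration is in the model, precinct size carries essentially no independent information about vote share.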

Point 2: The Concept of Suburbanization Over Time

To quote the authors:
"Bowles’ critique does not provide an explanation for the appearance of the pattern since the year 2000. Precincts have never been randomly created districts. So why wasn’t this pattern present in earlier elections?"

A few reasons why this is also false:
  1. Their evidence that the pattern didn't exist prior to 2000 is, simply, "Phil Evans told us."  In the section of the paper addressing my work, the authors state "didn't exist prior to 2000" as fact.  The claim is repeated throughout the paper as fact, but there is only one apparent citation, on page 24, where they declare that Phil Evans told them so.  Citing what someone said (hearsay, effectively) rather than actual statistical evidence is not sufficient, especially when you're trying to prove that something wasn't observed before a given date, i.e., proving an absence.  That requires a fairly exhaustive analysis across multiple regions and time horizons; traditionally, a large meta-analysis would be required before this type of absence claim should be made in an academic setting.
  2. The pattern DID appear before the year 2000.  In fact, I found it in the first pre-2000 race I analyzed: Dole/Kemp v. Clinton/Gore, 1996, Sedgwick County (here I replicated the graphing method from the Fries'dat and Sampietro paper).
  3. The theory accounts for over-time effects.  The authors also seem to miss the "time" in "suburbanization over time."  The point of my overall theory is that suburbanization patterns occurring *over time* lead to a scenario where more-Republican suburbs have large, high-turnout precincts, whereas inner cities, built out earlier, have smaller, older precincts.  Because suburbanization has unfolded *over time* (roughly, from the 1950s to the present), we would expect two things:
    1. At some point in time, during the process of suburbanization, the relationship would become statistically significant.
    2. The relationship would generally increase in strength over time.
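Those two expectations can be sketched with a toy simulation (all parameters invented for illustration): start from an older urban core and add, decade by decade, suburban precincts that are both larger and more Republican, and the size/share correlation strengthens over time.

```python
import random
import statistics

random.seed(3)

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Older urban core: mid-size precincts, more Democratic.
precincts = [(random.gauss(500, 80), random.gauss(0.42, 0.05)) for _ in range(150)]

correlations = []
for decade in range(5):
    # Each decade adds suburban precincts that are larger AND more Republican.
    size_mu = 700 + 150 * decade
    share_mu = 0.50 + 0.02 * decade
    precincts += [(random.gauss(size_mu, 80), random.gauss(share_mu, 0.05))
                  for _ in range(50)]
    s, v = zip(*precincts)
    correlations.append(round(pearson_r(s, v), 2))

print(correlations)  # the correlation generally strengthens over the decades
```

Under this process, there is some point at which the correlation first becomes statistically detectable, after which it grows, exactly as the theory predicts.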

Point 3: Wichita is not 100% inner-city.

The authors of the study include a graph whose caption reads:
“Fig. 19 — 2014 Kansas Senate race - the increase of a candidate’s percentage in the large precincts is only seen in the inner city precincts, not the suburban precincts”  

This claim is problematic for several reasons:
  • They misunderstand the concept of suburbanization over time.  Here Clarkson uses an overly simplistic model that binarizes the world into "urban" versus "rural."  Because suburbanization has been a long, gradual process, splitting the data into two lines doesn't appropriately account for this long-time effect.  In fact, we would expect the correlation to exist within subunits, especially in large cities with newly developing suburban areas inside city limits (like Wichita, Kansas).  The two new maps below demonstrate this graduated pattern: the effect of large precincts and prior suburbanization exists inside city limits.
  • Once again, in misunderstanding the theory, Clarkson has validated our analysis.  Her test is actually further validation of our suburbanization patterns, because the city of Wichita includes large suburban outer-ring precincts (still fully inside city limits) where we would expect the correlation to exist.  Her other test, the "rural" line on her chart, actually becomes a test of rural countryside versus rural towns, i.e., towns that sit inside vast rural areas.  That second line thus correlates with population density, and with the traditional finding that less-rural people tend to be more liberal (this is also why precincts under 500 voters are generally excluded).

Let's dive deeper, shall we?  Having addressed Clarkson's critique on prima facie theoretical grounds, we can now build a deeper understanding of the theories of suburbanization.  Let's start with Wichita, Kansas, in the 2008 presidential election.

The maps demonstrate the trends at hand in precinct size.  The first map covers Wichita, Kansas only (mirroring Clarkson's analysis) and shows a significant increase in precinct size as we move away from the city center, marked with a black dot.  Because Wichita grew outward from this center, higher-vote-total precincts tend to sit farther from the center city, validating the first part of my suburban-precinct-creation theory.

But is this a statistically significant relationship?  We can test it with a regression analysis on large precincts.  The analysis demonstrates that distance from the center city (calculated via the haversine method) is a positive, statistically significant predictor of the number of voters.  This is important because it confirms my initial theory: precinct creation in near-suburban areas is key to understanding the functional basis of Clarkson's correlation.
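For reference, the haversine calculation used here is the standard great-circle distance formula; a minimal version (with approximate Wichita-center coordinates and a made-up outer precinct centroid standing in as hypothetical inputs) looks like this:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    radius = 3958.8  # mean Earth radius, miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# Hypothetical example: an approximate Wichita city center vs. a made-up
# outer precinct centroid a few miles to the northeast.
d = haversine_miles(37.687, -97.330, 37.750, -97.250)
print(round(d, 1))  # distance in miles, usable as a regression feature
```

Each precinct's distance to the center dot becomes one predictor column in the regression described above.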

But distance from the center city is also a very good predictor of Republican vote share, validating the original hypothesis that, ceteris paribus, inside a large city larger precincts would relate strongly to larger Republican vote shares.  We prove this out in what is actually a quite predictive model (R-squared = .45).  This is important as well, because it demonstrates (though logical) that, inside a city, suburban areas tend to vote substantially more Republican than inner-city districts.

This creates an initial framework for our theoretical basis: precincts become larger as we move from the inner city to the suburbs, and suburban areas tend to vote more Republican.  But how does this relate to Clarkson's correlation?

The same analysis can be conducted for Johnson County, Kansas, which shows a very strong "Clarkson's" correlation across multiple consecutive elections.  Though Johnson County didn't develop around a single city center the way Wichita did, we can still estimate a model that accounts for these factors.  In this case we have to consider both the root of growth (the area closest to Kansas City, Missouri) and indicators of the old micro-city centers within Johnson County (Lenexa, Shawnee, Overland Park, Olathe), for which the best demographic proxy for small-precinct old-center-cities is percent Hispanic.  In the map below, the black dot represents the county's center of growth:

Validation of the existence and relative strength of Clarkson's correlation in Johnson County for 2008 is below.

The output below validates that we can predict precinct size from distance to the center city plus percentage Hispanic, showing that the number of voters in a precinct is correlated with underlying demographics.

A corollary point: the relationship holds not just at the voter level; distance to the center city is also predictive of the census population of a precinct.  This means we aren't just talking about who shows up to vote, or about votes potentially being added to some districts; the precincts are genuinely larger from a census-population perspective.

We then run a secondary analysis, estimating Clarkson's correlation while controlling for known demographic patterns that impact voting.  Clarkson's correlation fully disappears after accounting for these two factors (distance to the center city and percent Hispanic).
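That control step can be sketched on synthetic data (all coefficients below are invented, not the Johnson County estimates): residualize both precinct size and vote share on the two controls, and the leftover correlation is roughly zero.

```python
import random
import statistics

random.seed(11)

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def residuals(y, x):
    """Residuals from a one-variable least-squares fit of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [c - (my + b * (a - mx)) for a, c in zip(x, y)]

n = 500
# Synthetic precincts: distance to the center city and percent Hispanic
# jointly drive both precinct size and Republican vote share.
dist = [random.uniform(0, 12) for _ in range(n)]               # miles
hisp = [0.40 - 0.03 * d + random.gauss(0, 0.05) for d in dist]
size = [300 + 80 * d - 400 * h + random.gauss(0, 100) for d, h in zip(dist, hisp)]
vote = [0.35 + 0.02 * d - 0.30 * h + random.gauss(0, 0.04) for d, h in zip(dist, hisp)]

raw = pearson_r(size, vote)                # the raw "Clarkson" correlation

# Frisch-Waugh style: orthogonalize the second control against the first,
# then partial both controls out of size and vote before correlating.
hisp_o = residuals(hisp, dist)
size_p = residuals(residuals(size, dist), hisp_o)
vote_p = residuals(residuals(vote, dist), hisp_o)
partial = pearson_r(size_p, vote_p)

print(round(raw, 2), round(partial, 2))    # strong raw, near-zero partial
```

The raw correlation is large, but once the two demographic drivers are partialled out, nothing is left for a fraud hypothesis to explain.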

To bring this full circle (a bit for my own amusement), we can also predict Lindeman's voter registration percentages from distance to the center city and Hispanic percentage.

Point 4: State variation in models demonstrates the same underlying process.

The authors contend that precinct creation varies by state, each having its own system of rules and strategies.  This may be true, but all precinct systems react to underlying population growth trends, so the underlying theory of precinct interaction with population demographics holds up.  This isn't so much an argument as the authors conceding my Kansas analysis.  Similar analyses have found similar results in other states regarding the relative demographics of suburbs and precinct construction.  The argument also ignores the confirmatory analysis from Mark Lindeman (which he has conducted for many other states), showing that the Clarkson correlation begins at registration, NOT at the voting machine.


Throughout this post I've outlined Clarkson's original analysis, its drawbacks, and my critique.  Furthermore, I've created a detailed response to the Fries'dat and Sampietro paper that serves as a basis going forward.  In summary:
  • Clarkson offers a correlation that she claims may indicate voter fraud, without providing any falsifying demographic or alternate-cause analysis.
  • Lindeman and I have spent a good deal of time and effort analyzing the underlying cause of the correlation; demographic and initial-registration factors easily explain it on two grounds:
    • The correlation is traceable back to the initial registration.
    • The underlying reason for the correlation relates to underlying demographics, specifically the impact of suburban/urban divides in large cities.
  • The Fries'dat and Sampietro paper is an insufficient critique of the theory: there are few substantive critical points in the paper, and those that do exist are easily refuted through a deeper look at the underlying theoretical constructs and data.

A final thought on the methods and issues at hand.  There is a further effort to validate the upcoming election, and Clarkson is leading the charge.  I was sent the image below, seeking volunteers for a "citizen exit poll."  Putting aside the validity of "citizen exit polls" as a mechanism for validating election results (they aren't valid for that), the image is a clinic in how to bias a study, mainly by soliciting volunteers who will be motivated by the "rigged" headline.  This poster reiterates why it was important for me to respond: there will likely be information coming from dubious sources after this election claiming that the result was rigged.  Because social and demographic data is highly dimensional and *soft*, we will need to be very rigorous in how we analyze and accept those sources.  That said, I would welcome a statistically valid and rigorous post-election audit to validate the results of the 2016 presidential election.