Yesterday's post on election fraud issues in Kansas got quite a bit of response, so I thought I would follow up with an additional analysis. Also, R was still open, with my data frame loaded, when I got up this morning, so... what the hell.
Oh, and this analysis moves quickly into some fairly technical areas, but I think most of them can be understood in general terms. If you have any methodological questions please post in the comments.
My biggest contention from yesterday's model (which was really my implementation of Clarkson's model) is that underlying, unmeasured demographic terms were likely causing the correlation. So, my goal here (and over a series of posts in the future, theoretically) is to systematically look at other possibilities.
Also, the R-squared metric (a common measure of how much of the variation in the outcome a regression model explains) was VERY low (about 0.02). The model is still statistically significant; it's just not very predictive.
I only had one additional variable that I could use in my data frame, which was the county each precinct was located in. Because election results are highly variable by county, and counties are also not homogeneous in demographic factors, county can be used as a proxy for these demographic and regional variations.
In this case, if demographic terms that vary significantly by county are responsible for the correlation between precinct size and Republican vote share, we would expect that introducing county into the equation would decrease the importance of our precinct size variable. That's not what happened.
Methodology: I created the same model as yesterday, but added in a "fixed effect" for the county of each precinct. That's it. And by the way, I shouldn't use the term "fixed effect" because it's confusing and every statistician uses it differently.
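For anyone who wants to see the mechanics, here is a minimal sketch of the two models in Python with NumPy (not the R code I actually ran). Everything in it is an assumption for illustration: the data are synthetic, and the names `size`, `rep_share`, `county` and `fit_ols` are made up. It only shows how adding one dummy column per county changes the regression; it does not reproduce my results.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; returns (coefficients, R-squared)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic data (assumption): 20 counties with 50 precincts each.
rng = np.random.default_rng(0)
n_counties, per_county = 20, 50
county = np.repeat(np.arange(n_counties), per_county)
county_shift = rng.normal(0.0, 0.10, n_counties)   # made-up county baselines
size = rng.uniform(0.0, 1.0, county.size)          # made-up, rescaled precinct size
noise = rng.normal(0.0, 0.08, county.size)
rep_share = 0.5 + county_shift[county] + 0.03 * size + noise

# Model 1 (yesterday's form): share ~ intercept + size
X1 = np.column_stack([np.ones(size.size), size])
beta1, r2_pooled = fit_ols(X1, rep_share)

# Model 2: share ~ size + one dummy per county.
# The dummies together play the role of the intercept, so none is added.
dummies = (county[:, None] == np.arange(n_counties)).astype(float)
X2 = np.column_stack([size, dummies])
beta2, r2_county = fit_ols(X2, rep_share)

print("size coefficient, no county terms:", beta1[1])
print("size coefficient, with county terms:", beta2[0])
print("R-squared:", r2_pooled, "->", r2_county)
```

Because the second model contains everything the first one does plus the county terms, its R-squared can never be lower. What the extra terms do to the size coefficient depends entirely on the data, which is the interesting part.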
Here's a summary of what happened in the models:
- The R-squared shot through the roof (this is expected) from 2% to 46%.
- The effect (parameter estimate) of the precinct size variable increased significantly (not expected).
- The statistical significance of the precinct size variable increased significantly (not expected).
What does this mean?
The weird correlation is still there, and is stronger when we clear up some other exogenous factors. I'll need to dive into additional data to figure out the real root cause.
Also, I thought of an additional possibility, though it's still in the early phases of ideation. Here are the basics:
- There's a potentially endogenous relationship between % Republican voters and number of voters.
- If precincts are formed based on census/geographical tracts, then there's a key intervening variable: turnout.
- If conservative turnout is statistically "better", then having a more Republican precinct could cause our independent variable (voters in precinct) to rise significantly.
Still thinking through this one though, any input is appreciated.
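To make the turnout idea concrete, here is a toy simulation in Python with NumPy. It is a sketch under loud assumptions, not evidence: every precinct gets the same number of registered voters, the two turnout rates are invented, and "precinct size" is taken to mean ballots cast. None of these numbers come from the Kansas data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_precincts = 2000

# Assumption: precincts are drawn so that each has 1000 registered voters.
registered = np.full(n_precincts, 1000)
# Assumption: the Republican-leaning share of registered voters varies by precinct.
rep_lean = rng.uniform(0.2, 0.8, n_precincts)
rep_registered = (registered * rep_lean).astype(int)
other_registered = registered - rep_registered

# Assumption: made-up turnout rates, higher for Republican-leaning voters.
turnout_rep, turnout_other = 0.70, 0.55
rep_votes = rng.binomial(rep_registered, turnout_rep)
other_votes = rng.binomial(other_registered, turnout_other)

votes_cast = rep_votes + other_votes   # "precinct size" measured as ballots cast
rep_share = rep_votes / votes_cast

# Correlation between ballots cast and Republican vote share.
corr = np.corrcoef(votes_cast, rep_share)[0, 1]
print("correlation between votes cast and Republican share:", corr)
```

In this setup, precincts that lean more Republican cast more ballots purely because of the turnout gap, so size and Republican share end up positively correlated with no fraud involved. Whether real turnout gaps are anywhere near large enough to matter is exactly what I'd need more data to check.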