Friday, April 15, 2016

Fun Friday Data: Campaign Donations by Occupation

A few days ago, I stumbled upon another data set that was... well, a bit interesting.  It's a set of campaign finance filings made available by the Sunlight Foundation that includes detailed information about the top 0.01% of donors to political campaigns.


The data isn't necessarily interesting from a big-data, predictive analytics perspective, but it's one of those data sets that can be cut and sliced all day long. It has some rich data fields, like "occupation" and percentages to Democrat and Republican.

It also has names, so I can figure out which rare Hollywood actors are Republicans (hint: Scott Baio).  The occupation data is self-reported, so there are some consistency issues (some reported "CEO", some "Chief Executive Officer").  I didn't go and fix those because this is just a fun project.

The two pieces of data below both focus on the "Republican Ratio" (ratio of Republican to Democrat contributions) by occupation.  I calculated the ratio two ways: a weighted average (ratio of total contributions by profession) and an individual average (average of each contributor's contribution ratio by profession).

Both ratios were necessary because some very large donors had a tendency to skew the total contribution number.  Take, for instance, the "business executive" category below: the weighted average shows this as a blue (Democrat-leaning) category, but we know most business people tend to contribute to Republicans.  When I dug in, I found that most individuals in this category gave more money to Republicans, but the overall number was influenced by multi-million dollar Democratic donations from one person: George Soros.
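To make the difference between the two measures concrete, here is a minimal Python sketch.  The donor amounts are invented for illustration (they are not from the Sunlight Foundation file), the function names are hypothetical, and it assumes the ratio is expressed as the Republican share of two-party giving.

```python
# Hypothetical donors in one occupation: (republican_dollars, democrat_dollars).
# These amounts are invented to illustrate the skew; they are not real filings.
donors = [
    (9_000, 1_000),
    (8_000, 2_000),
    (7_000, 3_000),
    (0, 1_000_000),  # one very large Democratic donor
]


def weighted_ratio(rows):
    """Republican share of all dollars in the group (large donors dominate)."""
    rep = sum(r for r, _ in rows)
    dem = sum(d for _, d in rows)
    return rep / (rep + dem)


def individual_ratio(rows):
    """Average of each donor's own Republican share (each donor counts once)."""
    shares = [r / (r + d) for r, d in rows]
    return sum(shares) / len(shares)


print(f"weighted:   {weighted_ratio(donors):.3f}")
print(f"individual: {individual_ratio(donors):.3f}")
```

With these made-up numbers, three of the four donors lean Republican, so the individual average lands at about 60% Republican, while the single large donor pulls the weighted figure down to roughly 2%.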

Some amusing findings:
  • "Homemakers" are the second largest group of donors and appear to be very Republican.  (Though this is likely many executives funneling money through their wives.  Federal election laws limit individual contributions, but a second person allows double the contribution.) 
  • C.E.O.s are much more conservative than CEOs by either measure.
  • There are apparently some very generous "students" out there.  When I reviewed the list, the big donors here seem to be obviously college-aged children of rich people.  Again, federal campaign contribution limits are based on the individual (See Homemakers).
  • The most liberal groups with a high number of donors were: Professors, Artists, Writers and (cue the giggles from Republicans) "Not Employed."

That was the serious group: the top professions by number of contributors.  Now for the group of occupations that are amusing to me.  No data scientists were on the list, or else I would be forced to include them.  Also, the actor donations to Republicans are, once again, Scott Baio.


I find this data highly amusing, and it's fun to cut and play with.  If anyone has ideas on different ways to cut this data, I can run them fairly quickly.  However, I probably won't put names on this blog, to avoid doxing anyone.  Except Scott Baio, because he's a ridiculously bad actor.

Thursday, April 7, 2016

The Incremental Cost of a High School Education in Kansas

A few years ago I was at a conference where a vendor presented a product for adults to get a high school diploma (not a GED).  The cost of the product was about $10,000, but this was *justified* because high school graduates make about $5,000 a year more, so the investment *obviously* pays off.   There are logical challenges to this argument (that will be dealt with in the second half of this piece), but the presentation of data caught my attention and raised some questions in my mind around the cost and benefits of a high school diploma.

Yesterday, the Kansas Center for Economic Growth (KCEG) released a study that again caught my attention relating to the value of a high school graduation.  The first part of the study is especially interesting because it relates the incurred additional costs of lack of education (what is the cost to social service system/prisons of failing to educate) and future salary costs of the individual.  KCEG is largely (as I see it) a center-left (for Kansas) think tank, so the piece provides great information in quantifying education value, but I think there are some more sophisticated and better ways to attack the problem.


Two key findings of the KCEG piece caught my eye:
  • Kansas high school graduates earn far more than their peers who do not graduate ($5,400 annually)
  • High school graduates save the state money (by saving on social service, prisons, etc.) ($80,000 lifetime)
I don't question that these are factually true in direction, and in fact the direct causal links can be imagined (many jobs require a high school diploma, not having a job can lead to more social service cost).  But I do have a few areas of improvement and extension for the study:
  1. Bright-lining graduation may not be accurate.  Even dropouts often get the benefit of 10 or 11 years of public education, so comparing the benefits of having a high school diploma versus not is not comparing the value of education versus no education.  Instead, it's simply comparing education level 12 to education level 10.  This creates bias in an ROI calculation, because dropouts would likely be much worse off if they had received no education at all.
  2. Public education isn't going away.  The KCEG study is interesting in that it alludes to education being beneficial, then compares back to total education costs (state level).  The problem here is that public education is an institution that will continue to provide education (certainly education is generally beneficial), and the real policy question seems to be how much we should fund public education.  If we're looking at the benefit of graduating high school as a metric, then our cost should be the incremental cost of making borderline students graduate, assuming public education continues to exist as an institution at similar funding levels.
  3. The initial logic of the data is problematic.  The logic in the benefit studies takes the differentials in outcomes between graduates and non-graduates and states this as a result of graduating, which isn't quite true.  I expound upon this in the second part of this blog entry.


The initial work by KCEG points out that the value of a high school education is high, and I will go with that assumption (specifically the $80,000 future State cost) for the purposes of this section.  My two initial critiques of the work come down to incrementality, or the concept that the differential in cost between graduates and non-graduates isn't the cost of a full 12-year education.  The KCEG study seems to look at the entire cost of education but only looks at benefits on the binary output of: did you finish the last couple of years (graduate)?

I will attempt to correct this basis difference by estimating the actual cost of an incremental graduate.  Because most dropouts go to school only a couple of years fewer than graduates, it may seem like annualized costs for two years of school would be a good proxy.  That is not quite right, however: although the last two years of high school are how the degree is attained, it may be more costly to provide the supports and environment that will keep borderline dropout kids in school.  In essence, the cost to create the behavior of staying in school may not be equivalent to the actual cost of two years of school.  With that background, I had two additional research questions:
  1. Can we improve graduation rates by spending more money on education?
  2. How much is the cost of an incremental graduate (with other values staying the same)?
Whereas KCEG took an axe to the problem of cost, I looked to use a statistical scalpel to create very accurate estimates of the impact of spending more or less money. I realized I had some data on my desktop that relates spending, various school district attributes, and also contains graduation rates.  The data was substantial enough and contained enough covariates to estimate a spending:graduation elasticity, so I went to model building.

The model relates per-pupil spending, district size, fixed effects for years, and free lunch percent (a poverty-level stand-in) to graduation rates per district.  Fairly simple model, and it ends up looking like this:

The model makes a priori sense.  The two most interesting variables are free lunch (poverty) and spending per pupil.  Poverty appears to play the biggest role in graduation rates, which matches empirical research.

I do have a methodological concern though: there is a high level of multicollinearity between spending, district size (FTE), and dropout rate, which makes it difficult to attribute explained variance to any one variable in a stable way.  A concrete example: we know that rural schools have higher graduation rates, partially due to demographics, but they also spend more money due to a lack of economies of scale.  As an additional coefficient test, I repeated the analysis using only districts above the 500 FTE mark, and again above the 1,200 FTE mark, to check that the coefficients are robust.

For all these tests my coefficient of interest stayed statistically significant and in the 2.0-3.0 range.  As a central estimate I will use 2.5, which seems reasonable going forward.  But how much does that mean an incremental student costs?  Well a 1 point change in the natural logarithm of stude.... screw it, this is why I learned calculus.

A quick calculus solution is obvious.  We'll assume we're holding all other variables constant, so we can fold them into the constant, which drops out in the first derivative anyway.  Here's our solution:
Y = Graduation Rate
B0 = Constant (including all other resolved attributes in equation)
2.5 = Estimate of our coefficient of interest.
X = Per Pupil Spending
Our equation from the regression is thus:
Y = B0 + 2.5 * ln(X)
First derivative:
dY = 2.5 * dX / X
Simple interpretation: the change in Y is 2.5 times the change in X divided by X, OR the change in Y is 2.5 times the percent change in X.  Now for some math.

Using a reasonable set of estimates we can calculate the cost of an incremental graduate to be $70,000.  This uses 30,000 as an estimate of future graduates per year, $6.1 billion in annual State + Local + Fed education spending (pulled from KPI website, if someone has a better source please let me know), and a current graduation rate of about 86%.
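The arithmetic behind that figure isn't spelled out above, so here is a hedged Python sketch of one reading that lands near $70,000.  It assumes the graduation rate Y is measured as a fraction (so a 1% spending increase moves the rate by 0.025, or 2.5 points) and that the 30,000 graduates imply a cohort of 30,000 / 0.86 students.  Both are assumptions, and the function name is hypothetical.

```python
def incremental_graduate_cost(total_spending, graduates_per_year,
                              graduation_rate, coefficient,
                              spending_increase=0.01):
    """Cost per additional graduate implied by dY = coefficient * dX / X.

    Assumes graduation_rate is a fraction (0.86, not 86) and that the
    coefficient is on the same fractional scale.
    """
    cohort = graduates_per_year / graduation_rate    # students who could graduate
    added_cost = total_spending * spending_increase  # dX, in dollars
    rate_change = coefficient * spending_increase    # dY = coefficient * dX / X
    added_graduates = cohort * rate_change
    return added_cost / added_graduates


# Central estimate (2.5) plus the ends of the 2.0-3.0 coefficient range.
for b in (2.0, 2.5, 3.0):
    cost = incremental_graduate_cost(6.1e9, 30_000, 0.86, b)
    print(f"coefficient {b}: ${cost:,.0f} per incremental graduate")
```

Under these assumptions the size of the spending increase cancels out, so the result depends only on total spending, cohort size, and the coefficient.  The central estimate comes out just under $70,000, and the 2.0-3.0 range brackets it on either side.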

Before we go to extrapolation, there are some caveats, specifically assumptions one would have to make:

  1. Current relationships between spending and graduation would hold up when more money is applied.
  2. The current relationship isn't measuring other outside factors that impact graduation rates (communities that see more potential spend more and also expect more out of their kids in graduating).
  3. The kids who are currently not graduating in the difficult districts (KCK for instance) are *inside the model* like the ones we see graduating from better districts.

In other words: Does Levi think we could just add $70,000 for each kid that isn't graduating currently and defer KCEG's future estimate of $80,000 in costs?  Maybe, but it might fail in many regions as well, due to outside effects.  On the other hand, I think this is an interesting starting point to look at how money impacts incremental education outcomes.


The other issue with the KCEG study is the notion that benefits can be measured by the difference in outcomes. The problem here can be viewed as an issue of statistical sampling: the difference between high school graduates and non-high school graduates isn't simply a diploma, but there are non-random processes that push certain kids to get a degree or not a degree including talent, socioeconomic and structural factors.

The additional problem is that those factors are also correlated with outcomes beyond the degree, so the numbers in the KCEG study don't only represent the difference between graduating and not graduating, but also those pre-existing factors (in this way, graduation can be seen as intervening).  What evidence do we have that graduation tends to covary with other factors that impact outcomes?  A few pieces, but our regression from above demonstrates that graduation rates are substantially correlated with poverty and district size (here likely a proxy for urbanity).  Here's how poverty relates to graduation:

What does this all mean?  That we know that graduation from high school is not a random process, and that the factors that impact graduation would also likely impact post-graduation outcomes, making estimates of the value of education somewhat biased.  Also, because we know other things impact both graduation rates and long-term outcomes, it may be more effective to attack those other causes than education itself (e.g.: attack childhood poverty rather than spend more on education).

Partial Conclusion: Further research needed, not from me.


All in all, I like the KCEG study, especially in its attempt to quantify the benefit to education. However, I have two major improvements that could be made to these numbers:
  • Quantifying the cost of incremental high school graduates. This post is an initial attempt at quantifying that cost, at around $70,000, which is under KCEG's estimate of $80,000 in social service costs.  Potentially education is an effective way to decrease future government costs (though there is a time-value issue and a full distributional impact issue here for further analysis), but more work needs to be done to prove the cost-benefit.
  • A better estimate of the value of high school graduation.  Current graduation is not the result of a random process; it has many outside causes that also impact downstream outcomes, and poverty plays such a large role that it may be more effective to solve poverty first.

Monday, April 4, 2016

Can Bernie Still Win? Pluses and Minuses?

Updating my Democratic Primary numbers over the weekend in preparation for what appears to be a big showdown in Wisconsin, I noticed something a bit... unusual.  Bernie's chances to win the election significantly changed, without winning additional primaries.  What happened?  His polling somewhat significantly improved over my last model in two key states: Wisconsin and New York.  I normally wouldn't publish anything the day before a primary, but this is interesting.  Let's take a look.


Here is a current status of delegate count as it sits now:  

Clinton still has a significant lead, and Bernie needs to win 56.5% of remaining delegates to win the "pledged delegate" nomination.  Due to the previously mentioned polling shifts, this means Bernie needs to create a logit shift of .53, or beat polls by an average of 13%, in order to win.  This may seem like a huge number, but it's actually a moderate improvement over our past projections.
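For readers unfamiliar with a logit shift, here is a minimal Python sketch of the general idea: move a polled vote share by a fixed amount on the log-odds scale and convert it back.  This illustrates the concept only, not the exact model used for these projections; the function names and the sample poll values are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a share p, where 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Convert log-odds back to a share between 0 and 1."""
    return 1 / (1 + math.exp(-x))


def shift_poll(p, shift):
    """Apply a uniform shift on the log-odds scale to a polled share."""
    return inv_logit(logit(p) + shift)


# Hypothetical polled shares, each shifted by 0.53 on the logit scale.
for polled in (0.40, 0.45, 0.50):
    shifted = shift_poll(polled, 0.53)
    print(f"polled {polled:.0%} -> shifted {shifted:.1%}")
```

A shift of .53 applied to a poll near 50% works out to roughly 13 points, which lines up with the figure quoted above.  The same shift moves polls further from 50% by a bit less in raw points, which is the usual reason for working on the logit scale.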

Here's a look at our normal breakdown.  By the way, in order to stay "on pace" Bernie needs to win 64% of the vote tomorrow night in Wisconsin (methodology found here).

Now a quick rundown of his chances:


There are a couple of things that are currently in Bernie's favor:
  • His polling numbers seem to be improving, so he still has a bit of momentum.
  • The most important states with remaining primaries, notably California and New Jersey, have fairly sparse polling so far, so there may be more Bernie-favorable variance in those polls.


A few unfavorable points about Bernie's current position:
  • He's still far behind, and has to massively out-perform polls in order to win.
  • Though there is some history of Bernie over-performing, most of those over-performances (save for Michigan) have occurred in caucus states where things behave differently. Now for a couple of charts to demonstrate this.  First, a chart depicting polling of delegates received for Bernie, demonstrating Bernie's over-performance over polling in caucuses (straight line represents polling 1:1 with performance):
Now a distribution of Bernie-favorable polling error, showing Bernie over-performing polls massively in caucuses, slightly in primaries.

  • So there's a history of Bernie over-performing, but we know that occurs more heavily in caucus states.  The real problem for Bernie: less than 10% of remaining delegates will be awarded through a caucus rather than a primary.


Although Bernie has seen a polling bump recently, as well as a few other positives, he still has to massively and consistently over-perform polling to win.  In addition to polling challenges, his greatest issue going forward may be that about 90% of remaining delegates will be assigned through a true primary process.  On the other hand, momentum seems to be in his favor at this point.