Monday, February 29, 2016

Kansas Education Policy: Building a Funding Formula, Pt2: Poverty

A couple of weeks ago I posted the first part in a series on rebuilding the Kansas education funding formula.  Along with that post I made a promise to analyze these issues each week until we get something figured out.  Here we are two weeks later, and I'm just getting around to posting again... but hey, I'm making more progress than anyone else on this.

WHERE WE ARE

Last time we started putting together a regression model for our new State funding formula.  We're building this thing slowly, and putting some thought into it; hopefully we'll be done by the time the issue is legislatively solved.  

The issue we analyzed last week dealt with economies of scale for different school district sizes.  Essentially: smaller school districts are more costly per student due to the lack of economies of scale in procurement and general operations.  We estimated what that curve may look like, and came up with this general chart for 2015 data:


A quick note: economies of scale curves like this always bring up the argument of why we don't force some of the smaller districts to consolidate, thus moving them down the cost curve.  Obviously some economies of scale could be created this way.  On the other hand, some of these districts are located in areas so rural that structural factors prohibit some of the economies of scale available to an urban district.  We will deal with this in a future post.


ADDITIONALLY THIS WEEK

This week we're looking at the impact that poverty has on education outcomes, current education spending, and how it's used in current policy.  So why do we look at poverty?
  • From an a priori perspective, current education literature points to kids in poverty having more issues in the education system. A reasonable review of that is found here.  There is a question of whether those difficulties can be solved with more money for public schools; however, there are some proven ways to address these issues (such as stronger preschools and early intervention) that cost money.  As such, it's plausible that creating programs to deal with childhood poverty would increase necessary spending for high-poverty districts.
  • From a Kansas perspective, we know that Kansas schools with higher poverty rates perform more poorly; many researchers have shown this in the past.  Actually, I'll do it again just for fun:


This is a regression against average 11th grade performance measures (weighted by log(FTE) to smooth out any low-end volatility, for all of you nerds out there).  The negative coefficient on the percent_free_lunch variable is the important part here, because it demonstrates that the more kids a district has in poverty (eligible for free lunches through federal programs), the lower its education outcomes.
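
If you want to reproduce this kind of check, here's a minimal R sketch of that weighted regression; the data frame and column names below are assumptions for illustration, not the actual KSDE file layout.

 # regress 11th grade performance on poverty, weighting by log(FTE)
 # to damp volatility in the smallest districts
 # (districts, pct_11th_proficient, percent_free_lunch, fte are assumed names)
 perf_model <- lm(pct_11th_proficient ~ percent_free_lunch,
                  data = districts, weights = log(fte))
 summary(perf_model)  # negative percent_free_lunch coefficient = worse outcomes as poverty rises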

This free lunch variable is the traditional way to measure poverty in schools, largely due to ease of measurement.  There is some argument that this allows school districts to commit fraud and "game" the system by pushing the program harder on students, but for the purposes of these estimates "% free lunch" should be directionally correct.

Following this analysis, we jumped back to our original equation from the prior post.  We tried a couple of different transformations of the free lunch variable, and found that the best fit came from entering the variable raw (not logged).  Here's what that looks like:


The positive coefficient on the percent_free_lunch variable tells us that costs increase as the share of students on free lunch increases.  Because this is a log-linear equation the interpretation isn't intuitive, but it works out to roughly a 0.2% increase in costs for every 1 percentage point increase in free lunch eligibility.

Here's what that looks like graphed:


What does this mean for our funding formula?  For every 10 percentage point increase in poverty, school districts will spend between $250 and $350 more per FTE.
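
As a rough check on that dollar figure, here's the back-of-the-envelope arithmetic in R; the 0.002 coefficient and the base spending range are illustrative values consistent with the numbers above, not output pulled from the actual model.

 # log-linear coefficient -> dollars per FTE (illustrative values)
 b <- 0.002                      # ~0.2% cost increase per 1 point of free lunch eligibility
 pct_change <- exp(b * 10) - 1   # effect of a 10 point increase in poverty, about 2%
 base_spend <- c(12500, 17500)   # assumed range of per-FTE spending
 round(base_spend * pct_change)  # roughly $250 - $350 per FTE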

(Yes, we are still concerned about the ipso facto nature of this variable being included in the old funding formula, thus driving the old spending rates from which our empirical data are derived.  We will deal with the complications of that in a future post.)

CONCLUSION AND WHAT'S LEFT

In our work on the Kansas education funding formula we have covered these basic attributes, and believe that any functioning funding formula will need a way to control for them:

  • Economies of scale: The formula will need to account for the lack of economies of scale in running rural, low enrollment school districts.  We have established a fairly good picture of what that cost curve looks like.
  • Poverty: Education literature demonstrates that poverty negatively impacts education outcomes, and we demonstrated that Kansas schools with more kids in poverty perform worse.  Currently districts with higher poverty rates are also spending more, so any future funding formula likely needs to account for varying levels of poverty in schools.
In addition to this, I started keeping a list of elements I need to address in the future (contact me if you would like anything added to the list):
  • Transportation Funding (geographical size modifier).
  • Performance Measures.
  • Special Education.
  • Teacher Salaries.
  • Consolidation Equilibrium.
  • Ability to Pay, Equality, and Efficient Spending
  • Avoiding the Ipso Facto: making sure our regression equation doesn't simply mirror the old funding formula.


Wednesday, February 24, 2016

Statistical and R Tools for Data Exploration

Yesterday an old friend who works in a *much* different field than I do (still data, just... different data) contacted me regarding dealing with new data. He apparently runs into the same issue that I do from time to time: looking at a new data set and effectively exploring the data on its face, without bringing pre-existing knowledge to the table. I started writing what I thought would be a couple of bullet points; it turned into something more, and I thought, hey, I should post this. So, here are my tips for on-face evaluations of new-to-you data sets.

BTW, this is somewhat of a followup post to this one.

1. Correlation Matrix: This is simple, but gives you pretty quick insight into a data set: it creates a matrix of Pearson correlations for every pair of variables. Syntax is cor(data.frame) in base R. There are also a few graphical ways to view it, including the corrplot library. Here's an example of the raw and visualized output.
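
A minimal version of both the raw and visualized output, using a built-in data set as a stand-in for your data frame:

 # correlation matrix of all numeric variables (mtcars standing in for your data)
 library(corrplot)
 cm <- cor(mtcars)   # Pearson correlations for every pair of columns
 round(cm, 2)        # raw matrix
 corrplot(cm)        # visualized matrix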

2. Two Way Plot Viz: This is similar in function to a correlation matrix, but gives you more nuance about the actual distributional relationships. It is basically a way to plot pairwise variables inside your data set. In base R the syntax is pairs(data.frame). Here's an example from the data above:
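
The same idea in a couple of lines, again with a built-in data set standing in:

 # pairwise scatter plots of every two-variable combination
 pairs(mtcars[, c("mpg", "wt", "hp", "disp")])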

3. Decision Tree: Especially when trying to predict, or find relationships back to a single variable, decision trees are fairly helpful. They can be used to predict both categorical and continuous data, and can also accept both types as predictors. They are also easily visualized and explained to non-data people. They have all kinds of issues if you try to use them in actual predictions (use random forests instead), but are very helpful in trying to describe data. Use the rpart library in R to create a decision tree; syntax is rpart(depvar ~ indyvar + ..., data = data.frame). Also install the rattle library so that you can use the function fancyRpartPlot to visualize. Here's an example from some education data:
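
A minimal sketch of those calls, with the built-in iris data standing in for the education data:

 # fit a descriptive decision tree and visualize it
 library(rpart)
 library(rattle)                          # provides fancyRpartPlot()
 tree <- rpart(Species ~ ., data = iris)  # depvar ~ predictors
 fancyRpartPlot(tree)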

4. Stepwise Regression: This is another often mis-used statistical technique that is good for data description. It's basically multiple regression that automatically figures out which variables are most correlated to the dependent variable and includes them in that order. I use the MASS package in R, and use the function stepAIC(). I guess you could use this for actual predictions, if you don't care about the scientific method.
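
One way to run it, starting from a full model and letting stepAIC() add and drop terms by AIC (built-in data as a stand-in):

 # stepwise selection by AIC, starting from a full linear model
 library(MASS)
 full <- lm(mpg ~ ., data = mtcars)
 step_model <- stepAIC(full, direction = "both", trace = FALSE)
 summary(step_model)   # the variables that survived the stepping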
5. Plotting: Generally speaking I run new data through the wringer by plotting it in a myriad of ways: histograms for distributional analysis, scatter plots for two-way correlations. Most of all, plotting longitudinal change over time helps view trending (or over-time cycling, if it exists). I generally use ggplot2, here's an example using education data:
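
A couple of quick ggplot2 calls of the kind I mean (a built-in data set standing in for the education data):

 # quick distributional and two-way plots with ggplot2
 library(ggplot2)
 ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 15)            # distribution
 ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth() # two-way relationship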

6. Clustering and Descriptives: So I assume you already ran descriptives on your data by running summary(data.frame), but that's for the entire data set. There's an additional question: are there natural subsets of observations in the data that we may be interested in? K-means is the easiest statistical method (and available in base R); DBSCAN is more appropriate for some data (available in the dbscan package). After clusters are fit you can calculate summary stats using aggregate and the cluster assignments from the fit.
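
A minimal k-means version of that workflow (built-in data standing in; DBSCAN would slot into the same place):

 # cluster observations, then run descriptives within each cluster
 scaled <- scale(mtcars)                   # k-means is distance based, so scale first
 set.seed(42)
 km <- kmeans(scaled, centers = 3)
 mtcars$cluster <- km$cluster
 aggregate(. ~ cluster, data = mtcars, FUN = mean)   # per-cluster summary stats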

Thursday, February 18, 2016

Kansas Education Policy: Building a Funding Formula

I haven't posted recently on Kansas education policy, mainly because I have been busy with other projects, and just busy generally.  But education policy was pushed back into my interest this last week with the Kansas Supreme Court's striking down of the way Kansas currently funds its schools. I'll get to statistics in a second, but let's start with some background.

BACKGROUND

The recent history of education funding in Kansas is a tale of multiple lawsuits, funding formulas, and adequate funding research projects.  The entire background is too much for this blog, however the Wichita Eagle has a great timeline, found here.  Here's the short of very recent history:
In 2015, the Kansas legislature replaced an old "funding formula" with a new block grant system.  The old system used variables describing a school district's attributes (poverty rates, student FTE, transportation needs, special education, etc.) and dynamically calculated funding each year.  The new system allocated a certain amount to each district based on legislative allocation.  In essence, the new system replaced a dynamic mathematical equation with a direct political process for funding schools.
A week ago, the Kansas Supreme Court invalidated that block grant formula for reasons you can read about here.  Now the Kansas legislature has to figure out a new way to fund schools (there's another ruling coming later this spring that might actually cost the legislature more money).
That's where we are now: the Kansas Legislature needs to figure out a new way to fund schools.  But I haven't heard too much about how they're going to do it.  I thought, what the hell, why don't I get a head start on them anyway, and help out the government when they're down?

FORMULA OF THESE POSTS

To create a new funding formula, we will need to identify and evaluate factors that should be included.  In essence these will be the factors that cause certain districts to be more costly than others.  Some of those can be evaluated through a statistical process, others are more difficult to analyze, so we may look at them in other ways. Each week, I'll look at a new factor, and see how it plays into the puzzle.  Please feel free to comment on this blog with your ideas of where we might look next.

Quick general methodology: We downloaded twelve years of education data (2004-2015) and cleaned it for missing data and outliers (e.g. districts on military bases).  Then we regressed.

TODAY'S FACTOR: DISTRICT SIZE

One of the primary factors determining how much school districts need to spend is the size of the district.  We're already calculating our output variable as $/FTE, so we're already compensating for the fact that more kids = more total dollars to the district. (Why do Wichita schools get more total money than Salina schools? More kids.)

But there's still another way that the size of the district impacts costs: efficiencies and economies of scale.  Very small school districts operate at higher costs per student due to lack of economies of scale.

From a methodological standpoint, we can estimate this cost/economy of scale curve by regressing cost per FTE on district size.  Here's what that looks like out of R, for the nerds.



A few notes: we're using a log-log equation here, so the coefficients work out to percentage cost differences relative to the "missing case," which is districts of 0-100 students.
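
For the curious, here's a hedged sketch of how a regression like that can be set up in R; the column names and the bucket cut points are illustrative, not the exact specification used here.

 # assumed columns: spend_per_fte (dollars) and fte (enrollment) for each district-year
 districts$size_bucket <- cut(districts$fte,
                              breaks = c(0, 100, 300, 600, 1100, 1700, 2500, 10000, Inf),
                              include.lowest = TRUE)
 scale_model <- lm(log(spend_per_fte) ~ size_bucket, data = districts)
 # coefficients are measured against the omitted 0-100 FTE bucket;
 # convert a coefficient b to a percentage cost difference with exp(b) - 1:
 exp(coef(scale_model)) - 1   # e.g. exp(-0.626) - 1 is roughly -46.5%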

So what does this mean for all of the non-nerds: schools of 1,700-2,500 FTE are 46.5% less costly per student than the smallest schools (0-100 FTE) due to economies of scale. Here's what the entire cost curve looks like relative to 2015 spending.  (Keep in mind this is a basic model, and we need to add more factors to make accurate predictions for individual school districts.)

Quick note: if I'm in optimization mode here, I would be thinking, hey, let's consolidate some of these smaller school districts and move them further down the cost curve.  That may work for some, but for many districts structural factors will prevent true economies of scale (additional transportation costs and still having to run multiple buildings).  Luckily, there are some statistical ways to look at that coming in the next few weeks.


DRAWBACKS AND METHODOLOGICAL ISSUES

We need to address some methodological issues here that we'll deal with down the road; two in particular:
  • Size of district varies with a lot of other factors; how are we accounting for those? Answer: we'll get to that in the next few weeks.  There's actually evidence of other things at play in the above regression.  For instance, why do the very largest districts actually cost MORE than mid-sized districts?  The classic story is that these largest districts come in only two forms, both of which are very high cost: affluent suburban (high demands, high cost of living for teachers) and high poverty (it costs more to get similar results).  We will evaluate this in the next few weeks.
  • Data is based on a prior deterministic formula! Answer: One of the drawbacks is that we are measuring outcomes (prior spending) that were mostly determined by a prior funding formula.  That means our dependent variable (spending per FTE) could be seen as just measuring that prior formula, and how do we know the actual spending that formula created wasn't artificially inefficient or austere?  This is a huge concern when statistically measuring something that was, until recently, determined by a formula.  It's a concern we will also be dealing with conceptually next week.

CONCLUSION

  • Over the next few weeks, we will be doing the government's work for them, and identifying the parts of a new funding formula.
  • We can successfully estimate economies of scale, and get an idea of how funding needs to be adjusted for small school districts.
  • There are some methodological issues, including additional exogenous factors and deterministic funding formula, that we will be working on in a future analysis.

Monday, February 15, 2016

Fake Tax Policy Experts

Last election cycle I saw an ad for a local politician who was running for the State House of Representatives.  The main tag of the politician's campaign was this: "If you send me to Topeka, I will fix Kansas' broken tax system."

A tax wonk candidate?  You have my attention.  So I checked out the background of the candidate.  What I found shocked me a bit, as the candidate's résumé was not close to a typical tax wonk résumé.  I shot off an email to the politician along these lines (this is not something I would normally do, but hey, I was bored):
Hey (insert name here), I saw you're running for office AND that you're interested in tax policy-good for you.  No need for lengthy details, but could you give me a sense of what you think is broken and what a new tax structure would look like? Thanks, LB
The response I received was.  Well.  Here:
Levi-Thank you for your response, I actually don't know that much about the details of Kansas tax policy, I just know that our current system is broken and we need to fix it.  I would look forward to hearing feedback from voters like you on where we should go from here.
Let me get this straight:  You are a candidate for office, running on a platform that "Tax policy is broken, and I'm the one to fix it," and you tell me you don't know that much about it (cue the profanity storm).

I could go off on this obviously, but that's not the purpose of this post.  The point here is that there are a lot of people, including politicians and people who are otherwise assumed to be "in the know," who know nothing about tax policy.  The purpose of this post is twofold:
  • To give you the skills (questions) to identify when someone is parroting talking points versus actually knowing what they are talking about.
  • To know a bit about those talking points, so that you don't get sucked into false flag arguments.
Some common talking points, and followup questions to ask.


Talking point: Hedge Fund managers pay less tax than teachers, that's not right! We have to fix that!

Question: What part of current tax law creates this issue; are there any cases in which that lower rate should still apply?

Response: If you don't hear the term "capital gains" in the response then you are probably not talking to a tax scholar. There are good reasons for taxing general capital gains at a lower rate (making money from investments simply functions differently than making money through wages), but (long story short here) some earnings are being mis-classified as capital gains, causing this issue.

Suggested reading (I don't endorse these positions, this is just for background):

http://www.businessinsider.com/hedge-fund-tax-loophole-is-outrageous-2012-1



Talking point: The rich don't pay enough!  We need to raise income taxes on the rich!

Question: How much do you think a person earning $200,000 in wages, without significant write-offs, actually pays? How much should they pay? (hand them paper and pencil)

Response: So, I'm not a great tax scholar, but I can get to a rough estimate of what someone pays in tax by income in a few seconds just from what I know of the tax code.  You should expect anyone who says "this person" pays too little in taxes to at least have an idea of how much that person pays.  There's an urban legend (that likely descends from the capital gains issue) that high-wage earners don't pay much in taxes.  This is simply untrue.
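
As a sanity check on that claim, here's a rough R sketch using approximate 2016 single-filer federal brackets; the bracket numbers are from memory and the calculation ignores deductions, credits, and state and payroll taxes, so treat it as illustrative only.

 # rough federal income tax on wages, approximate 2016 single-filer brackets
 brackets <- c(0, 9275, 37650, 91150, 190150, 413350, 415050)
 rates    <- c(0.10, 0.15, 0.25, 0.28, 0.33, 0.35, 0.396)
 rough_tax <- function(income) {
   in_bracket <- pmax(pmin(income, c(brackets[-1], Inf)) - brackets, 0)  # wages falling in each bracket
   sum(in_bracket * rates)
 }
 rough_tax(200000)   # roughly $50k, an effective rate near 25%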

Suggested reading:
http://www.bankrate.com/finance/taxes/tax-brackets.aspx


Talking point: If we lower income tax rates, especially on businesses, the economy will grow faster and all will be better off in the long-run.

Questions: Economic theory says this is true; however, the growth occurs over time, not immediately, and there could be budgetary and spending issues as a result.  Show us your growth models, and...
  • What year will be the break-even year in terms of tax generation (what year does the economic growth lead us to collect the same revenue as before the cuts (adjusted))?
  • Prior to break-even, what are some interim metrics we should watch as early indicators of growth?
Response: You should expect anyone who plans on implementing this strategy to have long-term plans for growth, and likely budgetary outcomes (what is the range of options?  Will we have to cut spending in the short term to create long-term growth?  Are parties on board with that?)
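
If you want to pin a candidate down on the break-even question in the first bullet above, the arithmetic is easy enough to sketch; every number below is a hypothetical input for illustration, not a forecast for Kansas.

 # toy break-even calculation for a rate cut (all inputs hypothetical)
 base_revenue <- 6.0e9    # current annual collections
 cut_share    <- 0.10     # revenue given up immediately by the cut
 g_baseline   <- 0.020    # assumed growth without the cut
 g_with_cut   <- 0.035    # claimed growth with the cut
 years <- 1:30
 old <- base_revenue * (1 + g_baseline)^years
 new <- base_revenue * (1 - cut_share) * (1 + g_with_cut)^years
 min(years[new >= old])   # first year the cut "pays for itself" under these assumptions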

Suggested reading:
Search Twitter for "#ksleg tax" and enjoy the read.


Talking point: Tax policy is broken and unfair, we have to change it!

Question: How?  To both parts of your statement.

Response: I agree that there are some issues with tax policy, but coming out guns-blazing with *the system is broken* is pandering to groups that think they should be better off.  In this case you have to force politicians to give real answers. Why do you think it's broken?  What would you do to fix it?  What does someone earning the median income pay now?  What is the economic effect of your change?


Talking point: That tax policy is regressive!

Question: Relatively or absolutely regressive? (Regressive relative to current policy, or do poor people actually pay a larger share of their income than rich people?) Who actually pays more than they currently do, and who pays less?

Response: It's really easy to claim that a tax or a tax change is regressive, as "regressive" has become a slur in recent debate, but understanding the ins and outs of a policy change generally requires some math.  Before you let someone off the hook for calling a tax regressive, at least make sure they can explain their position and who is impacted.

Thursday, February 11, 2016

Understanding "margin of error" on Opinion Polls

Over the weekend, I saw a somewhat frustrating comment on Facebook.  Irritation at Facebook happens all the time for me actually, but this time it had a statistical polling slant.  Here's what was said:
Bernie is only behind 3%, and the margin of error on the poll is 4%!  It's a statistical dead heat, he's totally going to win!
I'm generally a pretty calm person, but not when people use the term "statistical dead heat." (or the word impactful.  Or the word guesstimate.  Or the word mathemagician. Or the term "correlation is not causation...")  But anyways..

There are a couple of issues with the Facebook poster's logic, but underlying it all is a general misunderstanding of what that margin of error means.  This blog seems like an ideal place to explore what polling margin of error actually is, and how we should interpret it.

"MARGIN OF ERROR"

What people are actually talking about when they talk about "Margin of Error" is the statistical concept of "sampling error."  Sampling error is a bit difficult to explain, but it's effectively this:
The error that arises from trying to determine the attributes of a Population (all Americans) by talking to only a sample (1000 Americans). 
That's pretty straightforward, but many people still misunderstand it; here are a few detailed points:
  • Sampling error doesn't include sampling bias: The +/- 3% that you see on most opinion polls represents only the error due to looking at a number smaller than the entire population.  That means it has an intrinsic assumption that the sample was randomly and appropriately selected.  There's an additional issue called sampling bias (in essence, the group from which the random sub-sample was selected systematically excluded certain groups).  An example: if we sampled randomly from the phone book, we would exclude a large number of millennials who only have cell phones, and thus our sample would be biased.  Error arising from sampling bias occurs above and beyond the +/- 3% of sampling error.
  • Sampling error doesn't include poor survey methodology: Another reason that polls can be incorrect is poor survey methodology, which is once again not included in the +/-3%. A few ways that poor survey methodology can contribute to additional errors:
    • Poor Screening: Most opinion polls involve reducing population to "likely" voters by asking screening questions.  If these screening questions work poorly, or incentives are provided, the poll results will not accurately reflect the population of likely voters.
    • Poor Question Methodology: Asking questions in ways that make it more likely for a voter to answer one way or another can also create additional error.  This is especially true in non-candidate questions where it may be shameful or embarrassing to hold one opinion or another, making the words used in the question very important.  Another poor question example: questions that contain unclear language (e.g. double negatives) or are long and winding may confuse voters.
This may all seem like overly detailed statistics information, but in reality, those other forms of bias are alive and well in the primary polling system.  If those other errors were not occurring: 1. most polls would essentially agree with each other (they don't) and 2. polls would be extremely predictive of actual outcomes (not really true either).  A statistic from the Iowa Caucuses:

The last seven polls leading into the Iowa Caucuses gave Trump an average lead of 4.7%, with a margin of error on each poll of 4%.  Trump lost the Iowa Caucuses by 4%, a swing of 8.7 points.

STATISTICS

Let's pretend that we ran the perfect poll with a perfect sample and perfect questions; how, then, do we calculate an accurate margin of error?  The statistics side of opinion polling is actually a bit boring.  Calculating a margin of error on an opinion poll is generally done using what's called a binomial confidence interval.  That calculation is relatively (in stats terms) simple, and only uses the sample size, the proportion of votes a candidate is receiving, and a measure of confidence (e.g. we want to be 95% sure the value will fall within +/- 4%).   Here's the normal calculator:


That calculator is great, but if you play around with it a little, or if you tend to do derivatives in your head of any equation you see (ahem), you realize something:  That +/- 3% that you see on opinion polls is completely bogus.  That's because the margin of error varies significantly by what percentage a candidate is receiving, and generally that 3% is only valid for a candidate currently standing at 50%.  The margins of error compress as a candidate's share of the vote approaches 0% or 100%.  So for a candidate like Rick Santorum, habitually at 1-2%, we aren't at +/-3%, we're actually at +/- 1%.  Here's a graph showing how that compression at the margins works:



A quick note on this statistical calculation: Our Facebook poster from earlier said that the race was a "statistical dead heat" due to the margin of error.  In a perfect poll, that's not true, especially with a 3% lead in a 4% margin of error poll.  The 4% margin is calculated at 95% confidence, but with a 3% lead we're roughly 85% certain that Clinton is leading.  85% certainty of a Clinton lead is not exactly a "dead heat."
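
Here's the basic sampling-error calculation described above as a couple of lines of R (the standard normal-approximation version); it also shows the compression at low vote shares.

 # margin of error for a single poll proportion (normal approximation)
 moe <- function(p, n, conf = 0.95) {
   z <- qnorm(1 - (1 - conf) / 2)
   z * sqrt(p * (1 - p) / n)
 }
 moe(0.50, 1000)   # about +/- 3.1% for a candidate at 50%
 moe(0.02, 1000)   # about +/- 0.9% for a candidate at 2%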


And just to show what horrible people statisticians are, I want to point out one last thing.  You know how I told you how easy it is to calculate the margin of error?  That's still true, but know that arguing statisticians have created eleven total ways to calculate that statistic, all of which create nearly identical results.  They also regularly argue about the appropriateness of these methods.  No joke.

Here's a demonstration of the similarity of the methods, at 50% and 1.5% on a 1000 person sample.
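
A few of those methods ship with base R, so here's a small sketch reproducing part of that comparison (normal approximation, Wilson via prop.test, and exact Clopper-Pearson via binom.test):

 # three of the many binomial interval methods, n = 1000
 compare <- function(p, n = 1000) {
   x <- round(p * n)
   wald   <- p + c(-1, 1) * qnorm(0.975) * sqrt(p * (1 - p) / n)
   wilson <- prop.test(x, n)$conf.int    # Wilson, with continuity correction
   exact  <- binom.test(x, n)$conf.int   # Clopper-Pearson
   rbind(wald = wald, wilson = wilson, exact = exact)
 }
 compare(0.50)    # intervals nearly identical at 50%
 compare(0.015)   # small differences appear near the extremes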



CONCLUSION 

A few takeaway points from our look at margin of error:
  • The +/- 3% on most opinion polls doesn't account for all the types of error a poll could have.  In fact, it seems likely that other forms of error are pushing total polling error upward in modern American political polling.
  • The margin of error stated on opinion polls is valid for a candidate receiving 50% of the vote.  It compresses at very high and very low vote shares.  
  • Statisticians are nerds who use 11 distinct ways to get to essentially the same results.

Tuesday, February 9, 2016

New Hampshire Primary 2016 Predictions

Last week's predictions of the Iowa caucus results were quite popular, especially for people searching for explanations of Cruz's win, so we thought another post was due for the New Hampshire Primary this Tuesday.  First, our thoughts on last week's Iowa Caucus.
  

IOWA SUMMARY

Last week we picked Clinton to win the Democratic caucus (she did) and Trump to win the Republican caucus (he lost by 4% to Ted Cruz), but with major caveats for each party.  Some thoughts on actual performance:
  • Democrats: This race was quite a bit closer than expected, with Hillary winning by 0.2%.  It seems that was largely due to Bernie's voters showing up in higher numbers than expected; as we pointed out in our blog entry, that was the biggest question in the Democratic race. This is probably the most positive news for the Sanders campaign so far, almost winning a Midwestern state.
  • Republicans: Last week we mentioned that there were a lot of questions around turnout of Trump voters.  We've seen some data this week that demonstrates Trump supporters just didn't show up, especially when compared to Cruz supporters.  That's bad for Trump, as the biggest concern for his campaign is that Trump voters might end up being a bit... well, flaky.  Another impressive element was Cruz's ground game,* which seemed to work hard in the last week of the campaign, possibly tipping the race in Cruz's favor.

NEW HAMPSHIRE

And on to New Hampshire, here are our picks (with probabilities):



Other sites are a bit more bullish in their odds for Sanders and Trump winning (in excess of 80% at times), as both candidates have nearly double-digit leads in recent polling.  I'm not convinced that it's as cut and dried, for a few reasons:
  1. Polling in Iowa was only accurate to a level of about +/- 8.5%, meaning we shouldn't have a ton of confidence in even 10% leads in primaries.  We'll do a future post on polling certainty.
  2. Trump's leading, but his voters may not turn out.  We have at least some evidence from Iowa that the Trump people may not be the greatest at showing up.
  3. Cruz's ground game may give him an edge.  See note below.
  4. Sanders' lead in polling looked like it was closing in the latest poll.

*Ted Cruz's impressive ground game: we're using this term of art for the efficacy of the campaign in the last few days, including two controversial tactics (considered by some commentators unethical, or even fraudulent).


Thursday, February 4, 2016

R Statistical Tools For Dealing With New Data Sets

As I have shared on this blog, I recently started a new job, a very positive move for me.  The biggest challenge in starting a new job in data science or any analytics field is learning the business model as well as the new data structures.

The experience of learning new data structures and using other people's data files has allowed me to reach back into my R statistical skill set for functions I don't use regularly.  Over the past weeks, I have worked a lot with new data, and here are the tools I am finding most useful for that task (code found below the list):
  1. Cast:  Cast is a function in the "reshape" package that is essentially pivot tables for R. I'm pulling data from a normalized Teradata warehouse and that normalization means that my variables come in "vertically" when I need them horizontally (e.g. a column for each month's totals).  Cast allows me to quickly create multiple columns. 
  2. tolower(names(df)): One of the more difficult things I have to deal with is irregularly named columns, or columns with irregular capitalization patterns I'm not familiar with.  One quick way to eliminate capitalization is to lower case all variable names. This function is especially helpful after a Cast, when you have dynamically created variable names from data.  (Also, before a cast, on the character data itself)
  3. Merge: Because I'm still using other people's data (OPD), and that involves pulling together disparate data sources, I find myself needing to combine datasets.  In prior jobs, I've had large data warehouse staging areas, so much of this "data wrangling" has occurred in SQL pre-processing before I'd get into the stats engine.  Now I'm less comfortable with the staging environment, and I'm dealing with a lot of large file-based data, so the merge function works well. The most important part of the code below is all.x = TRUE, which is the R equivalent of a "left outer join".
  4. Summary: This may seem like a dumb one, but the usage is important in new organizations for a few reasons.  First, you can point it at almost any object and return top level information, including data frames.  The descriptive statistics returned both give you an idea of the nature of the data distribution and a hint of data type, in the case of import issues.  Second, you can pull model statistics from the summary function of a model-this may not make sense now, but check out number five.
  5. Automated model building:  This is a tool that is useful in a new organization where you don't know how variables correlate, and just want to get a base idea.  I created an "auto-generate me a model" algorithm a few years ago, and can alter the code in various ways to incrementally add variables, test different lags for time series, and very quickly test several model specifications.  I've included the *base* code for this functionality in the image below to give you an idea of how I do it.
Code examples from above steps:



 #1 CAST  
 library(reshape)   # cast() comes from the reshape package  
 mdsp <- cast(md, acct ~ year, value = 'avg_num')  
 #2 TOLOWER(NAMES)  
 names(md) <- tolower(names(md))  
 #3 MERGE  
 finale <- merge(x = dt1,y = dt3,by = "acct", all.x = TRUE)  
 #4 SUMMARY  
 summary(model)  
 summary(df)  
 #5 AUTO MODEL GENERATION  
 #setup dependent and dataset  
 initial <- ("lm(change~")  
 dat <- "indyx"  
 #setup a general specification and lag set to loop over  
 specs <- c("paste(i,sep='')", "paste(i,'+',i,'_slope',sep='')")  
 month <- c("february","march","april","june","july","august","september","october","november")  
 #setup two matrices to catch summary stats  
 jj <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(jj) <- month  
 colnames(jj) <- specs  
 rsq <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(rsq) <- month  
 colnames(rsq) <- specs  
 mods <- NULL  
 #loop through models  
 for(j in specs){  
 for(i in month) {  
      model <- paste(initial,eval(parse(text = j)),",data=",dat,")")  
      print(model)  
      temp <-summary(eval(parse(text = model)))  
      jj[[i,j]] <- mean(abs(temp$residuals))  
      rsq[[i,j]] <- temp$r.squared  
 }  
 }  
 #choose best model (can use other metrics too, or dump anything to the matrices)  
 which(rsq == max(rsq), arr.ind = TRUE)  

Monday, February 1, 2016

Iowa Caucus Day-Of Predictions

The Presidential primaries this year have been so weird that I have delayed putting out any type of by-state projections until I had more information. I have kind of run out of time now, haven't I? (Iowa Caucuses are today) As I see it, there are two main political questions outstanding:
  • Republican: Is Donald Trump a legit candidate, and will Republican voters continue to support him after he is more thoroughly vetted?
  • Democrat: Is Bernie Sanders a legit candidate and do democratic voters believe he can win?
These questions are largely open, however both candidates are still being taken seriously enough to poll highly going into Iowa, so on with projections.  

WIN PROJECTIONS

I created a quick model based on prior Iowa data and recent polling results.  The polls have been especially volatile in Iowa, and for other reasons that I will get to in a bit, things could turn out much differently than this.  Anyways, here are our quick projections, with probability to win Iowa:


Generally, I think Trump and Clinton will win.  But there are still a lot of questions out there, putting Cruz and Sanders firmly still in the hunt.

QUESTIONS OUTSTANDING

Going into the Iowa Caucuses, and 2016 elections in general there are still many outstanding data questions:  
  • Political Polls: Political polls have been less reliable in the past two elections than in prior years, first leaning too Republican, then too Democratic.  Are current polls accurately reflecting potential outcomes?  There are many reasons political polls can be inaccurate, from samples that aren't representative to turnout issues (covered in the next bullet).
  • Turnout: One of the better explanations for the poor polling predictions is that pollsters aren't doing a good job vetting who is and isn't a likely voter, or modeling people's propensity to show up at the polls.  Because voter turnout is often less than 50%, and those people who show up aren't a random subset of Americans, having inaccurate turnout models can significantly bias polling outcomes.
  • Trump/Sanders Viability: One of the general theories on why Trump and Sanders are doing well is that they appeal to politically disaffected people:  Trump to white conservatives who dislike Obama and the direction of the country, Sanders to young people who see little future in the current US economy.  Disaffected groups have a tendency to turn out poorly on election day depending on motivation; will these groups even show up? Combine this with doubt in current polling and turnout models, and the truth is, we just don't know.

CONCLUSION

Due to questions remaining from the past two elections regarding the accuracy of polling, we are still unsure of the results of the Iowa caucuses. That said, our best guess for tonight is a win for Trump and Clinton.  Or maybe Sanders and Cruz, depending on the accuracy of political polling and pollsters' ability to determine who may turn out.