Wednesday, February 24, 2016

Statistical and R Tools for Data Exploration

Yesterday an old friend who works in a *much* different field than I do (still data, just... different data) contacted me about dealing with new data. He apparently runs into the same issue that I do from time to time: looking at a new data set and effectively exploring it on its face, without bringing pre-existing knowledge to the table. I started writing what I thought would be a couple of bullet points, it turned into something more, and I thought, hey, I should post this. So, here are my tips for on-face evaluations of new-to-you data sets.

BTW, this is somewhat of a follow-up post to this one.

1. Correlation Matrix: This is simple, but gives you pretty quick insight into a data set: it creates a matrix of Pearson correlations for every pair of variables. Syntax is cor(data.frame) in base R. There are also a few graphical ways to view it, including the corrplot library. Here's an example of the raw and visualized output:
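A minimal sketch of both the raw and visualized versions, assuming a data frame named df whose numeric columns are what we want to correlate (df is a hypothetical name):

 #raw correlation matrix on the numeric columns
 nums <- sapply(df, is.numeric)
 cors <- cor(df[, nums], use = "pairwise.complete.obs")
 round(cors, 2)
 #visualize the same matrix with corrplot
 library(corrplot)
 corrplot(cors, method = "circle")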

2. Two Way Plot Viz: This is similar in function to a correlation matrix, but gives you more nuance about the actual distributional relationships. It's basically a way to plot every pairwise combination of variables in your data set. In base R the syntax is pairs(data.frame). Here's an example from the data above:
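A quick sketch on the same hypothetical df:

 #pairwise scatterplots of all numeric columns
 pairs(df[, sapply(df, is.numeric)], pch = 19, cex = 0.5)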

3. Decision Tree: Especially when trying to predict, or find relationships back to a single variable, decision trees are fairly helpful. They can be used to predict both categorical and continuous data, and can also accept both types as predictors. They are also easily visualized and explained to non-data people. They have all kinds of issues if you try to use them for actual predictions (use random forests instead), but they are very helpful in trying to describe data. Use the rpart library in R to create a decision tree; the syntax is rpart(depvar ~ indyvar + ..., data = data.frame). Also install the rattle library so that you can use the function fancyRpartPlot to visualize. Here's an example from some education data:
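The education data isn't included here, but a minimal sketch looks like this (edu and score are hypothetical names for the data frame and outcome):

 #fit and visualize a regression tree
 library(rpart)
 library(rattle)
 fit <- rpart(score ~ ., data = edu, method = "anova")   #use method = "class" for a categorical outcome
 fancyRpartPlot(fit)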

4. Stepwise Regression: This is another often misused statistical technique that is good for data description. It's basically multiple regression that automatically figures out which variables are most correlated with the dependent variable and includes them in that order. I use the MASS package in R and the function stepAIC(). I guess you could use this for actual predictions, if you don't care about the scientific method. A minimal sketch follows.
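Assuming a data frame df with a dependent variable named depvar (both hypothetical names):

 #stepwise selection starting from the full model
 library(MASS)
 full <- lm(depvar ~ ., data = df)
 step_fit <- stepAIC(full, direction = "both", trace = FALSE)
 summary(step_fit)
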
5. Plotting: Generally speaking I run new data through the wringer by plotting it in a myriad of ways: histograms for distributional analysis, scatter plots for two-way correlations. Most of all, plotting longitudinal change over time helps reveal trends (or cycling over time, if it exists). I generally use ggplot2; here's an example using education data:
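A sketch of the two basic views, assuming an edu data frame with year, district, and enrollment columns (all hypothetical names):

 library(ggplot2)
 #distribution of a single variable
 ggplot(edu, aes(x = enrollment)) + geom_histogram(bins = 30)
 #longitudinal change over time, one line per district
 ggplot(edu, aes(x = year, y = enrollment, group = district)) + geom_line(alpha = 0.3)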

6. Clustering and Descriptives: I assume you already ran descriptives on your data by running summary(data.frame), but that's for the entire data set. There's an additional question: are there natural subsets of observations in the data that we may be interested in? K-means is the easiest statistical method (and available in base R); DBSCAN is more appropriate for some data (available in the dbscan package). After clusters are fit you can calculate summary stats using aggregate and the cluster assignments from the fit.
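A minimal sketch of the k-means route, again on a hypothetical df (four clusters is an arbitrary choice here):

 #cluster on scaled numeric columns, then get per-cluster descriptives
 nums <- df[, sapply(df, is.numeric)]
 fit <- kmeans(scale(nums), centers = 4, nstart = 25)
 aggregate(nums, by = list(cluster = fit$cluster), FUN = mean)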

Thursday, February 11, 2016

Understanding "margin of error" on Opinion Polls

Over the weekend, I saw a somewhat frustrating comment on Facebook.  Irritation at Facebook happens all the time for me, actually, but this time it had a statistical polling slant.  Here's what was said:
Bernie is only behind 3%, and the margin of error on the poll is 4%!  It's a statistical dead heat, he's totally going to win!
I'm generally a pretty calm person, but not when people use the term "statistical dead heat." (Or the word impactful.  Or the word guesstimate.  Or the word mathemagician.  Or the phrase "correlation is not causation.")  But anyways...

There are a couple of issues with the Facebook poster's logic, but underlying it all is a general misunderstanding of what that margin of error means.  This blog seems like an ideal place to explore what polling margin of error actually is, and how we should interpret it.

"MARGIN OF ERROR"

What people are actually talking about when they talk about "Margin of Error" is the statistical concept of "sampling error."  Sampling error is a bit difficult to explain, but it's effectively this:
The error that arises from trying to determine the attributes of a Population (all Americans) by talking to only a sample (1000 Americans). 
That's pretty straightforward, but many people still misunderstand it; here are a few detailed points:
  • Sampling error doesn't include sampling bias: The +/- 3% that you see on most opinion polls represents only the error due to looking at a sample smaller than the entire population.  That means it carries an intrinsic assumption that the sample was randomly and appropriately selected.  There's an additional issue called sampling bias (in essence, the group from which the random sub-sample was drawn systematically excluded certain groups).  An example: if we sample randomly from the phone book, we exclude a large number of millennials who only have cell phones, and thus our sample would be biased.  Error arising from sampling bias occurs above and beyond the +/- 3% of sampling error.
  • Sampling error doesn't include poor survey methodology: Another reason that polls can be incorrect is poor survey methodology, which is once again not included in the +/-3%. A few ways that poor survey methodology can contribute to additional errors:
    • Poor Screening: Most opinion polls involve reducing population to "likely" voters by asking screening questions.  If these screening questions work poorly, or incentives are provided, the poll results will not accurately reflect the population of likely voters.
    • Poor Question Methodology: Asking questions in ways that make it more likely for a voter to answer one way or another can also create additional error.  This is especially true in non-candidate questions where it may be shameful or embarrassing to hold one opinion or another, making the words used in the question very important.  Another poor question example: questions that contain unclear language (e.g. double negatives) or are long and winding may confuse voters.
This may all seem like overly detailed statistics information, but in reality, those other forms of bias are alive and well in the primary polling system.  If those other errors were not occurring: 1. most polls would essentially agree with each other (they don't), and 2. polls would be extremely predictive of actual outcomes (not really true either).  A statistic from the Iowa Caucuses:

The last seven polls leading into the Iowa Caucuses gave Trump an average lead of 4.7%, with a margin of error of 4% on each poll.  Trump lost the Iowa Caucuses by 4%, a swing of 8.7%.

STATISTICS

Let's pretend that we ran the perfect poll with a perfect sample and perfect questions; how do we then calculate an accurate margin of error?  The statistics side of opinion polling is actually a bit boring.  Calculating a margin of error on an opinion poll is generally done using what's called a binomial confidence interval.  That calculation is relatively (in stats terms) simple, and only uses the sample size, the proportion of votes a candidate is receiving, and a measure of confidence (e.g. we want to be 95% sure the value will fall within +/- 4%).  Here's the normal calculator:
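For the R-inclined, the normal-approximation version of that calculation is only a couple of lines (a sketch of the idea, not the calculator itself):

 #normal-approximation margin of error for a proportion p in a sample of n
 moe <- function(p, n, conf = 0.95) {
   z <- qnorm(1 - (1 - conf) / 2)
   z * sqrt(p * (1 - p) / n)
 }
 moe(0.50, 1000)   #roughly 0.031, the familiar "+/- 3%"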


That calculator is great, but if you play around with it a little, or if you tend to do derivatives in your head of any equation you see (ahem), you realize something: that +/- 3% you see on opinion polls is completely bogus.  That's because the margin of error varies significantly with the percentage a candidate is receiving, and generally that 3% is only valid for a candidate currently standing at 50%.  The margins of error compress as a candidate's share of the vote approaches 0% or 100%.  So for a candidate like Rick Santorum, habitually at 1-2%, we aren't at +/- 3%, we're actually at +/- 1%.  Here's a graph showing how that compression at the margins works:
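Numerically, the compression looks like this on a 1000-person sample, using the moe() helper sketched above:

 #margin of error at different vote shares, n = 1000
 shares <- c(0.01, 0.02, 0.05, 0.10, 0.25, 0.50)
 round(sapply(shares, moe, n = 1000), 3)
 #roughly 0.006 0.009 0.014 0.019 0.027 0.031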



A quick note on this statistical calculation: our Facebook poster from earlier said that the race was a "statistical dead heat" due to the margin of error.  In a perfect poll, that's not true, especially with a 3% lead in a poll with a 4% margin of error.  The 4% margin is calculated at 95% confidence, but with a 3% lead we're about 85% certain that Clinton is leading.  85% certainty of a Clinton lead is not exactly a "dead heat."
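For the curious, one way to land near that 85% figure is to back the standard error out of the stated 4% margin and treat the two candidates' shares as independent, so the lead's standard error is sqrt(2) times a single share's (that's an assumption on my part, a sketch rather than the exact calculation):

 #rough sketch of P(the 3-point leader is actually ahead)
 se_share <- 0.04 / qnorm(0.975)    #standard error implied by a 4% margin at 95% confidence
 se_lead <- sqrt(2) * se_share
 pnorm(0.03 / se_lead)              #roughly 0.85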


And just to show what horrible people statisticians are, I want to point out one last thing.  You know how I told you how easy it is to calculate the margin of error?  That's still true, but know that arguing statisticians have created eleven total ways to calculate that statistic, all of which create nearly identical results.  They also regularly argue about the appropriateness of these methods.  No joke.

Here's a demonstration of the similarity of the methods, at 50% and 1.5% on a 1000 person sample.
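One way to reproduce that comparison is the binom package, whose binom.confint function reports all of its interval methods at once:

 #compare binomial confidence interval methods at 50% and 1.5% on a 1000-person sample
 library(binom)
 binom.confint(x = 500, n = 1000, conf.level = 0.95, methods = "all")
 binom.confint(x = 15, n = 1000, conf.level = 0.95, methods = "all")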



CONCLUSION 

A few takeaway points from our look at margin of error:
  • The +/- 3% on most opinion polls doesn't account for all the types of error a poll could have.  In fact, it seems likely that other forms of error are pushing polling error upwards in modern American political polling.
  • The margin of error stated on opinion polls is valid for a candidate receiving 50% of the vote.  It compresses at very high and very low vote shares.  
  • Statisticians are nerds who use 11 distinct ways to get to essentially the same results.

Thursday, February 4, 2016

R Statistical Tools For Dealing With New Data Sets

As I have shared on this blog, I recently started a new job, a very positive move for me.  The biggest challenge in starting a new job in data science or any analytics field is learning the business model as well as the new data structures.

The experience of learning new data structures and using other people's data files has allowed me to reach back into my R statistical skill set for functions I don't use regularly.  Over the past weeks, I have worked a lot with new data, and here are the tools I am finding most useful for that task (code found below the list):
  1. Cast:  Cast is a function in the "reshape" package that is essentially pivot tables for R. I'm pulling data from a normalized Teradata warehouse, and that normalization means that my variables come in "vertically" when I need them horizontally (e.g. a column for each month's totals).  Cast allows me to quickly create multiple columns.
  2. tolower(names(df)): One of the more difficult things I have to deal with is irregularly named columns, or columns with capitalization patterns I'm not familiar with.  One quick way to eliminate capitalization issues is to lowercase all variable names. This function is especially helpful after a cast, when you have dynamically created variable names from data (and also before a cast, on the character data itself).
  3. Merge: Because I'm still using other people's data (OPD), and that involves pulling together disparate data sources, I find myself needing to combine datasets.  In prior jobs, I've had large data warehouse staging areas, so much of this "data wrangling" occurred in SQL pre-processing before I got into the stats engine.  Now I'm less comfortable with the staging environment, and I'm dealing with a lot of large file-based data, so the merge function works well. The most important part of the code below is all.x = TRUE, which is the R equivalent of a "left outer join".
  4. Summary: This may seem like a dumb one, but the usage is important in new organizations for a few reasons.  First, you can point it at almost any object and return top-level information, including data frames.  The descriptive statistics returned give you both an idea of the nature of the data distribution and a hint of data type, in case of import issues.  Second, you can pull model statistics from the summary function of a model - this may not make sense now, but check out number five.
  5. Automated model building:  This is a tool that is useful in a new organization where you don't know how variables correlate, and just want to get a base idea.  I created an "auto-generate me a model" algorithm a few years ago, and can alter the code in various ways to incrementally add variables, test different lags for time series, and very quickly test several model specifications.  I've included the *base* code for this functionality below to give you an idea of how I do it.
Code examples from above steps:



 #1 CAST (cast comes from the reshape package)  
 library(reshape)  
 mdsp <- cast(md, acct ~ year, value = 'avg_num')  
 #2 TOLOWER(NAMES)  
 names(md) <- tolower(names(md))  
 #3 MERGE  
 finale <- merge(x = dt1,y = dt3,by = "acct", all.x = TRUE)  
 #4 SUMMARY  
 summary(model)  
 summary(df)  
 #5 AUTO MODEL GENERATION  
 #setup dependent and dataset  
 initial <- ("lm(change~")  
 dat <- "indyx"  
 #setup a general specification and lag set to loop over  
 specs <- c("paste(i,sep='')", "paste(i,'+',i,'_slope',sep='')")  
 month <- c("february","march","april","june","july","august","september","october","november")  
 #setup two matrices to catch summary stats  
 jj <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(jj) <- month  
 colnames(jj) <- specs  
 rsq <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(rsq) <- month  
 colnames(rsq) <- specs  
 mods <- NULL  
 #loop through models  
 for(j in specs){  
 for(i in month) {  
      model <- paste(initial,eval(parse(text = j)),",data=",dat,")")  
      print(model)  
      temp <-summary(eval(parse(text = model)))  
      jj[[i,j]] <- mean(abs(temp$residuals))  
      rsq[[i,j]] <- temp$r.squared  
 }  
 }  
 #choose best model (can use other metrics too, or dump anything to the matrices)  
 which(rsq == max(rsq), arr.ind = TRUE)  

Monday, February 1, 2016

Iowa Caucus Day-Of Predictions

The Presidential primaries this year have been so weird that I have delayed putting out any type of by-state projections until I had more information. I have kind of run out of time now, haven't I? (The Iowa Caucuses are today.) As I see it, there are two main political questions outstanding:
  • Republican: Is Donald Trump a legit candidate, and will Republican voters continue to support him after he is more thoroughly vetted?
  • Democrat: Is Bernie Sanders a legit candidate and do democratic voters believe he can win?
These questions are largely open; however, both candidates are still being taken seriously enough to poll highly going into Iowa, so on with the projections.

WIN PROJECTIONS

I created a quick model based on prior Iowa data and recent polling results.  The polls have been especially volatile in Iowa, and for other reasons that I will get to in a bit, things could turn out much differently than this.  Anyways, here are our quick projections, with probability to win Iowa:


Generally, I think Trump and Clinton will win.  But there are still a lot of questions out there, leaving Cruz and Sanders still firmly in the hunt.

QUESTIONS OUTSTANDING

Going into the Iowa Caucuses, and 2016 elections in general there are still many outstanding data questions:  
  • Political Polls: Political polls have been less reliable in the past two elections than in prior years, first leaning too Republican then too Democrat.  Are current polls accurately reflecting potential outcomes?  There are many reasons political polls can be inaccurate, from samples that aren't representative to turnout issues (covered in next bullet).
  • Turnout: One of the better explanations for the poor polling predictions is that pollsters aren't doing a good job vetting who is and isn't a likely voter, or modeling people's propensity to show up at the polls.  Because voter turnout is often less than 50%, and those people who show up aren't a random subset of Americans, having inaccurate turnout models can significantly bias polling outcomes.
  • Trump/Sanders Viability: One of the general theories on why Trump and Sanders are doing well is that they appeal to politically disaffected people:  Trump to white conservatives who dislike Obama and the direction of the country, Sanders to young people who see little future in the current US economy.  Disaffected groups have a tendency to turn out poorly on election day depending on motivation; will these groups even show up? Combine this with doubt in current polling and turnout models, and the truth is, we just don't know.

CONCLUSION

Due to questions remaining from the past two elections regarding the accuracy of polling, we are still unsure of the results of the Iowa caucuses. That said, our best guess for tonight is a win for Trump and Clinton.  Or maybe Sanders and Cruz, depending on the accuracy of political polling and pollsters' ability to determine who may turn out.