Wednesday, February 24, 2016

Statistical and R Tools for Data Exploration

Yesterday an old friend who works in a *much* different field than I (still data, just .. different data) contacted me about dealing with new data. He apparently runs into the same issue that I do from time to time: looking at a new data set and effectively exploring the data on its face, without bringing pre-existing knowledge to the table. I started writing what I thought would be a couple of bullet points, it turned into something more, and I thought, hey, I should post this. So, here are my tips for on-face evaluations of new-to-you data sets.

BTW, this is somewhat of a follow-up post to this one.

1. Correlation Matrix: This is simple, but gives you pretty quick insight into a data set: it creates a matrix of Pearson correlations for each pair of variables. The syntax is cor(data.frame) in base R. There are also a few graphical ways to view it, including the corrplot library. Here's an example of the raw and visualized output.
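As a minimal sketch, using R's built-in mtcars data as a stand-in for your own data frame:

```r
# Pearson correlation matrix on a few numeric columns; mtcars is a
# built-in stand-in for your own data.frame
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
round(cm, 2)

# to visualize, if the corrplot package is installed:
# library(corrplot); corrplot(cm)
```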

2. Two Way Plot Viz: This is similar in function to a correlation matrix, but gives you more nuance about the actual distributional relationships. This is basically a way to plot pairwise variables inside your data set. In base R the syntax is pairs(data.frame). Here's an example from the data above:
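A quick sketch, again using mtcars as a stand-in:

```r
# scatterplots for every pairwise combination of these columns
# (mtcars is a built-in stand-in for your own data.frame)
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
pairs(vars, main = "Pairwise relationships")
```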

3. Decision Tree: Especially when trying to predict, or find relationships back to, a single variable, decision trees are fairly helpful. They can be used to predict both categorical and continuous data, and can also accept both types as predictors. They are also easily visualized and explained to non-data people. They have all kinds of issues if you try to use them in actual predictions (use random forests instead), but are very helpful in trying to describe data. Use the rpart library in R to create a decision tree; the syntax is rpart(depvar ~ indyvar + ..., data = data.frame). Also install the rattle library so that you can use the function fancyRpartPlot to visualize. Here's an example from some education data:
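A minimal sketch using the built-in iris data as a stand-in (rpart ships with R; the rattle visualization is commented out since it's an extra install):

```r
library(rpart)   # ships with standard R installs
# classification tree on the built-in iris data (a stand-in for your
# own data; the target and predictors can be categorical or continuous)
fit <- rpart(Species ~ ., data = iris)
print(fit)

# to visualize, if the rattle package is installed:
# library(rattle); fancyRpartPlot(fit)
```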

4. Stepwise Regression: This is another often misused statistical technique that is good for data description. It's basically multiple regression that automatically figures out which variables are most correlated to the dependent variable and includes them in that order. I use the MASS package in R and its stepAIC() function. I guess you could use this for actual predictions, if you don't care about the scientific method.
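A sketch of the basic pattern, with mtcars standing in for real data (MASS ships with R):

```r
library(MASS)   # ships with standard R installs
# fit a full model, then let stepAIC() add/drop terms by AIC
full <- lm(mpg ~ ., data = mtcars)            # mtcars as a stand-in
step_fit <- stepAIC(full, direction = "both", trace = FALSE)
summary(step_fit)
```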
5. Plotting: Generally speaking I run new data through the wringer by plotting it in a myriad of ways: histograms for distributional analysis, scatter plots for two-way correlations. Most of all, plotting longitudinal change over time helps view trending (or over-time cycling, if it exists). I generally use ggplot2; here's an example using education data:
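Since the education data isn't reproduced here, a base-graphics sketch of the same three plot types on built-in data (the ggplot2 equivalents are one-liners too):

```r
# histogram for a distribution, scatter plot for a two-way relationship,
# line plot for longitudinal change; built-in data stands in for your own
hist(mtcars$mpg, breaks = 10, main = "Distribution of mpg")
plot(mpg ~ wt, data = mtcars, main = "mpg vs. weight")
plot(uspop, main = "US population over time")   # built-in time series
```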

6. Clustering and Descriptives: So I assumed you already ran descriptives on your data by running summary(data.frame), but that's for the entire data set. There's an additional question: are there natural subsets of observations in the data that we may be interested in? K-means is the easiest statistical method (and available in base R); DBSCAN is more appropriate for some data (available in the dbscan package). After clusters are fit you can calculate summary stats using aggregate and the cluster assignments from the fit.
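A minimal k-means sketch on the built-in iris data (a stand-in for your own numeric columns):

```r
# k-means on the numeric columns of iris, then per-cluster summary
# statistics via aggregate() on the fitted cluster assignments
set.seed(42)                      # k-means starts are random
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
aggregate(iris[, 1:4], by = list(cluster = km$cluster), FUN = mean)
```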

Thursday, February 11, 2016

Understanding "margin of error" on Opinion Polls

Over the weekend, I saw a somewhat frustrating comment on Facebook.  Irritation at Facebook happens all the time for me actually, but this time it had a statistical polling slant.  Here's what was said:
Bernie is only behind 3%, and the margin of error on the poll is 4%!  It's a statistical dead heat, he's totally going to win!
I'm generally a pretty calm person, but not when people use the term "statistical dead heat." (Or the word impactful. Or the word guesstimate. Or the word mathemagician. Or the term "correlation is not causation...") But anyways..

There are a couple of issues with the Facebook poster's logic, but underlying it all is a general misunderstanding of what that margin of error means.  This blog seems like an ideal place to explore what polling margin of error actually is, and how we should interpret it.


What people are actually talking about when they talk about "Margin of Error" is the statistical concept of "sampling error."  Sampling error is a bit difficult to explain, but it's effectively this:
The error that arises from trying to determine the attributes of a Population (all Americans) by talking to only a sample (1000 Americans). 
That's pretty straightforward, but many people still misunderstand it. Here are a few detailed points:
  • Sampling error doesn't include sampling bias: The +/- 3% that you see on most opinion polls represents only the error due to looking at a number smaller than the entire population.  That means it has an intrinsic assumption that the sample was randomly and appropriately selected.  There's an additional issue called sampling bias (in essence, the group from which the random sub-sample was drawn systematically excluded certain groups).  An example: if we sample randomly from the phone book, we exclude a large number of millennials who have cell phones only, and thus our sample is biased.  Error arising from sampling bias occurs above and beyond the +/- 3% of sampling error.
  • Sampling error doesn't include poor survey methodology: Another reason that polls can be incorrect is poor survey methodology, which is once again not included in the +/-3%. A few ways that poor survey methodology can contribute to additional errors:
    • Poor Screening: Most opinion polls involve reducing population to "likely" voters by asking screening questions.  If these screening questions work poorly, or incentives are provided, the poll results will not accurately reflect the population of likely voters.
    • Poor Question Methodology: Asking questions in ways that make it more likely for a voter to answer one way or another can also create additional error.  This is especially true in non-candidate questions where it may be shameful or embarrassing to hold one opinion or another, making the words used in the question very important.  Another poor question example: questions that contain unclear language (e.g. double negatives) or are long and winding may confuse voters.
This may all seem like overly detailed statistics information, but in reality, those other forms of bias are alive and well in the primary polling system.  If those other errors were not occurring: 1. most polls would essentially agree with each other (they don't) and 2. polls would be extremely predictive of actual outcomes (not really true either).  A statistic from the Iowa Caucuses:

The last seven polls leading into the Iowa Caucuses gave Trump an average 4.7% lead, with each poll's margin of error at 4%.   Trump lost the Iowa Caucuses by 4%, a swing of 8.7%.  


Let's pretend that we ran the perfect poll with a perfect sample and perfect questions; how do we then calculate an accurate margin of error?  The statistics side of opinion polling is actually a bit boring.  Calculating a margin of error on an opinion poll is generally done using what's called a binomial confidence interval.  That calculation is relatively (in stats terms) simple, and only uses the sample size, the proportion of votes a candidate is receiving, and a measure of confidence (e.g. we want to be 95% sure the value will fall within +/-4%).   Here's the normal calculator:
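A minimal R sketch of the normal-approximation version of that calculation, assuming a simple random sample:

```r
# normal-approximation margin of error for a polled proportion:
# z * sqrt(p * (1 - p) / n)
margin_of_error <- function(p, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # ~1.96 at 95% confidence
  z * sqrt(p * (1 - p) / n)
}
margin_of_error(0.50, 1000)   # about 0.031, the familiar +/- 3%
```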

That calculator is great, but if you play around with it a little, or if you tend to do derivatives in your head of any equation you see (ahem), you realize something: that +/- 3% you see on opinion polls is completely bogus.  That's because the margin of error varies significantly with the percentage a candidate is receiving, and generally that 3% is only valid for a candidate currently standing at 50%.  The margins of error compress as a candidate's share of the vote approaches 0% or 100%.  So for a candidate like Rick Santorum, habitually at 1-2%, we aren't at +/-3%, we're actually at +/- 1%.  Here's a graph showing how that compression at the margins works:
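You can see the compression numerically by evaluating the same normal-approximation formula at different vote shares (redefined here so the snippet stands alone):

```r
# margin of error at several vote shares, 1000-person sample
moe <- function(p, n) qnorm(0.975) * sqrt(p * (1 - p) / n)
round(100 * sapply(c(0.02, 0.10, 0.25, 0.50), moe, n = 1000), 2)
# the margin shrinks from ~3.1 points at 50% to under 1 point at 2%
```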

A quick note on this statistical calculation: Our Facebook poster from earlier said that the race was a "statistical dead heat" due to the margin of error.  In a perfect poll, that's not true, especially with a 3% lead in a 4% margin of error poll.  The 4% margin is calculated at 95% confidence, but at 3% we're 85% certain that Clinton is leading.  85% certainty of a Clinton lead is not exactly a "dead heat."
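A back-of-the-envelope version of that 85% figure: back out the standard error implied by the stated margin of error, inflate it by sqrt(2) because a lead is the difference between two polled proportions, and ask how likely a 3-point observed lead is to reflect a real one. (This is a rough sketch; the exact answer depends on how the two candidates' shares covary.)

```r
# SE of the lead, implied by a 4% margin of error at 95% confidence
se_lead <- sqrt(2) * (0.04 / qnorm(0.975))
pnorm(0.03 / se_lead)   # roughly 0.85: ~85% certain the lead is real
```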

And just to show what horrible people statisticians are, I want to point out one last thing.  You know how I told you how easy it is to calculate the margin of error?  That's still true, but know that arguing statisticians have created eleven total ways to calculate that statistic, all of which create nearly identical results.  They also regularly argue about the appropriateness of these methods.  No joke.

Here's a demonstration of the similarity of the methods, at 50% and 1.5% on a 1000 person sample.
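Two of those competing interval methods ship with base R, if you want to compare them yourself, for example at 50% and 1.5% on a 1,000-person sample:

```r
# Clopper-Pearson "exact" interval, candidate at 50% (500 of 1000)
binom.test(500, 1000)$conf.int
# Wilson score interval for the same data (continuity correction off)
prop.test(500, 1000, correct = FALSE)$conf.int
# both land very close to 0.50 +/- 0.031
# and at 1.5% (15 of 1000), a much tighter interval:
binom.test(15, 1000)$conf.int
```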


A few takeaway points from our look at margin of error:
  • The +/-3% on most opinion polls doesn't account for all the types of error a poll could have.  In fact, it seems likely that other forms of error are pushing polling error upwards in modern American political polling.
  • The margin of error stated on opinion polls is valid for a candidate receiving 50% of the vote.  It compresses at very high and very low vote shares.  
  • Statisticians are nerds who use 11 distinct ways to get to essentially the same results.

Thursday, February 4, 2016

R Statistical Tools For Dealing With New Data Sets

As I have shared on this blog, I recently started a new job, a very positive move for me.  The biggest challenge in starting a new job in data science or any analytics field is learning the business model as well as the new data structures.

The experience of learning new data structures and using other people's data files has allowed me to reach back into my R statistical skill set for functions I don't use regularly.  Over the past weeks, I have worked a lot with new data, and here are the tools I am finding most useful for that task (code found below the list):
  1. Cast:  Cast is a function in the "reshape" package that is essentially pivot tables for R. I'm pulling data from a normalized Teradata warehouse, and that normalization means that my variables come in "vertically" when I need them horizontally (e.g. a column for each month's totals).  Cast allows me to quickly create multiple columns. 
  2. tolower(names(df)): One of the more difficult things I have to deal with is irregularly named columns, or columns with irregular capitalization patterns I'm not familiar with.  One quick way to eliminate capitalization is to lower-case all variable names. This function is especially helpful after a cast, when you have dynamically created variable names from data.  (Also, before a cast, on the character data itself.)
  3. Merge: Because I'm still using other people's data (OPD), and that involves pulling together disparate data sources, I find myself needing to combine datasets.  In prior jobs, I've had large data warehouse staging areas, so much of this "data wrangling" occurred in SQL pre-processing before I'd get into the stats engine.  Now I'm less comfortable with the staging environment, and I'm dealing with a lot of large file-based data, so the merge function works well. The most important part of the code below is all.x = TRUE, which is the R equivalent of a "left outer join".
  4. Summary: This may seem like a dumb one, but the usage is important in new organizations for a few reasons.  First, you can point it at almost any object and return top level information, including data frames.  The descriptive statistics returned both give you an idea of the nature of the data distribution and a hint of data type, in the case of import issues.  Second, you can pull model statistics from the summary function of a model-this may not make sense now, but check out number five.
  5. Automated model building:  This is a tool that is useful in a new organization where you don't know how variables correlate, and just want to get a base idea.  I created an "auto-generate me a model" algorithm a few years ago, and can alter the code in various ways to incrementally add variables, test different lags for time series, and very quickly test several model specifications.  I've included the *base* code for this functionality in the image below to give you an idea of how I do it.
Code examples from above steps:

 #1 CAST
 mdsp <- cast(md, acct ~ year, value = 'avg_num')
 #2 TOLOWER ON VARIABLE NAMES
 names(md) <- tolower(names(md))
 #3 MERGE (all.x = TRUE is the R equivalent of a left outer join)
 finale <- merge(x = dt1, y = dt3, by = "acct", all.x = TRUE)
 #4 SUMMARY
 summary(finale)
 #5 AUTOMATED MODEL BUILDING
 #set up the dependent variable and dataset
 initial <- "lm(change ~"
 dat <- "indyx"
 #set up a general specification and lag set to loop over
 specs <- c("paste(i, sep = '')", "paste(i, '+', i, '_slope', sep = '')")
 month <- c("february","march","april","june","july","august","september","october","november")
 #set up two matrices to catch summary stats
 jj <- matrix(nrow = length(month), ncol = length(specs))
 rownames(jj) <- month
 colnames(jj) <- specs
 rsq <- matrix(nrow = length(month), ncol = length(specs))
 rownames(rsq) <- month
 colnames(rsq) <- specs
 #loop through models
 for (j in specs) {
   for (i in month) {
     #build the model formula as a string, then fit and summarize it
     model <- paste(initial, eval(parse(text = j)), ", data =", dat, ")")
     temp <- summary(eval(parse(text = model)))
     jj[i, j] <- mean(abs(temp$residuals))
     rsq[i, j] <- temp$r.squared
   }
 }
 #choose best model (can use other metrics too, or dump anything into the matrices)
 which(rsq == max(rsq), arr.ind = TRUE)