Wednesday, February 24, 2016

Statistical and R Tools for Data Exploration

Yesterday an old friend who works in a *much* different field than I do (still data, just ... different data) contacted me about dealing with new data. He apparently runs into the same issue that I do from time to time: looking at a new data set and effectively exploring the data on its face, without bringing pre-existing knowledge to the table. I started writing what I thought would be a couple of bullet points, it turned into something more, and I thought, hey, I should post this. So, here are my tips for on-face evaluations of new-to-you data sets.

BTW, this is somewhat of a follow-up post to this one.

1. Correlation Matrix: This is simple, but gives you pretty quick insight into a data set. It creates a matrix of Pearson correlations for every pair of numeric variables. Syntax is cor(data.frame) in base R. There are also a few graphical ways to view it, including the corrplot library. Here's an example of the raw and visualized output.
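
If you want to try this yourself, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in (it is not the data set from this post), and the corrplot call only runs if you have that package installed.

```r
# Stand-in data: mtcars ships with R (an assumption, not the post's data).
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# cor() computes Pearson correlations by default.
cor_mat <- cor(num_vars)
print(round(cor_mat, 2))

# Optional visual version, only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

Note that cor() will error on non-numeric columns, so subset to numeric variables first.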

2. Two Way Plot Viz: This is similar in function to a correlation matrix, but gives you more nuance about the actual distributional relationships. It's basically a way to plot every pair of variables in your data set against each other. In base R the syntax is pairs(data.frame). Here's an example from the data above:
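
A rough sketch of the call, again using mtcars as an assumed stand-in. The panel.smooth argument is my own addition; it just draws a smoothed line through each scatter plot.

```r
# Stand-in data: mtcars ships with R (an assumption, not the post's data).
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# One scatter plot per pair of variables, with a smoother in each panel.
pairs(num_vars, panel = panel.smooth, main = "Pairwise scatter plots")
```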

3. Decision Tree: Especially when trying to predict, or find relationships back to, a single variable, decision trees are fairly helpful. They can be used to predict both categorical and continuous data, and can also accept both types as predictors. They are also easily visualized and explained to non-data people. They have all kinds of issues if you try to use them in actual predictions (use random forests instead), but they are very helpful in trying to describe data. Use the rpart library in R to create a decision tree; syntax is rpart(depvar ~ indyvar + ..., data = data.frame). Also install the rattle library so that you can use the function fancyRpartPlot to visualize. Here's an example from some education data:
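
I can't share the education data here, so this sketch assumes the built-in iris data instead. It falls back to the plain rpart plot if rattle isn't installed.

```r
library(rpart)

# Stand-in data: iris ships with R (an assumption, not the education data).
# Species is the dependent variable; the four measurements are predictors.
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             data = iris, method = "class")
print(fit)

# Prettier plot if rattle is available, base rpart plot otherwise.
if (requireNamespace("rattle", quietly = TRUE)) {
  rattle::fancyRpartPlot(fit)
} else {
  plot(fit, margin = 0.1)
  text(fit, use.n = TRUE)
}
```

For a continuous dependent variable, method = "anova" is the usual choice.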

4. Stepwise Regression: This is another often misused statistical technique that is good for data description. It's basically multiple regression that automatically adds or drops variables one step at a time, keeping whichever change most improves a fit criterion (AIC, for the function I use). I use the MASS package in R, and use the function stepAIC(). I guess you could use this for actual predictions, if you don't care about the scientific method.
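
Here is a minimal sketch of how I'd run it, assuming mtcars as stand-in data and mpg as the dependent variable (both are just for illustration).

```r
library(MASS)

# Stand-in data: mtcars ships with R (an assumption, not the post's data).
# Start from the full model with every other column as a predictor.
full_model <- lm(mpg ~ ., data = mtcars)

# Add/drop terms in both directions, choosing each step by AIC.
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)

summary(step_model)   # the model that survived
step_model$anova      # the sequence of steps taken
```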

5. Plotting: Generally speaking I run new data through the wringer by plotting it in a myriad of ways: histograms for distributional analysis, scatter plots for two-way correlations. Most of all, plotting longitudinal change over time helps you see trends (or cycles, if they exist). I generally use ggplot2; here's an example using education data:
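
As a rough sketch of the kinds of plots I mean, this assumes the economics data set that ships with ggplot2 (not the education data), with unemploy as the variable of interest.

```r
library(ggplot2)

# Stand-in data: economics ships with ggplot2 (an assumption).
# Histogram for a quick look at the distribution.
p_hist <- ggplot(economics, aes(x = unemploy)) +
  geom_histogram(bins = 30)

# Line plot over time, with a loess smoother to show the trend.
p_trend <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

print(p_hist)
print(p_trend)
```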

6. Clustering and Descriptives: I assume you already ran descriptives on your data, just running summary(data.frame), but that's for the entire data set. There's an additional question: are there natural subsets of observations in the data that we may be interested in? K-means is the easiest statistical method (and available in base R); DBSCAN is more appropriate for some data (available in the dbscan package). After the clusters are fit, you can calculate summary stats by cluster using aggregate and the cluster assignments from the fit.
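
A minimal sketch of that workflow, assuming the built-in iris measurements as stand-in data. The choice of three clusters and the DBSCAN eps value are assumptions for illustration, not recommendations.

```r
# Stand-in data: the four numeric iris columns (an assumption).
set.seed(42)                       # k-means starts from random centers
features <- scale(iris[, 1:4])     # put variables on a common scale

# Fit k-means; centers = 3 is an illustrative choice.
km <- kmeans(features, centers = 3, nstart = 25)

# Attach cluster assignments and summarize each cluster.
iris_clustered <- cbind(iris[, 1:4], cluster = factor(km$cluster))
cluster_means <- aggregate(. ~ cluster, data = iris_clustered, FUN = mean)
print(cluster_means)
print(table(km$cluster))

# Optional DBSCAN fit, only if the dbscan package is installed.
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(features, eps = 0.5, minPts = 5)
  print(table(db$cluster))         # cluster 0 holds the noise points
}
```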