Thursday, February 4, 2016

R Statistical Tools For Dealing With New Data Sets

As I have shared on this blog, I recently started a new job, a very positive move for me.  The biggest challenge in starting a new job in data science or any analytics field is learning the business model as well as the new data structures.

The experience of learning new data structures and using other people's data files has allowed me to reach back into my R statistical skill set for functions I don't use regularly.  Over the past weeks, I have worked a lot with new data, and here are the tools I am finding most useful for that task (code found below list):
  1. Cast:  Cast is a function in the "shape" package that is essentially-pivot tables for R. I'm pulling data from a normalized Teradata warehouse and that normalization means that my variables come in "vertically" when I need them horizontally (e.g. a column for each month's totals).  Cast allows me to quickly create multiple columns. 
  2. tolower(names(df)): One of the more difficult things I have to deal with is irregularly named columns, or columns with irregular capitalization patterns I'm not familiar with.  One quick way to eliminate capitalization is to lower case all variable names. This function is especially helpful after a Cast, when you have dynamically created variable names from data.  (Also, before a cast, on the character data itself)
  3. Merge: Because I'm still using other people's data (OPD), and that involves pulling together disparate data sources, I find myself needing to combine datasets.  In prior jobs, I've had large data warehouse staging areas, so much of this "data wrangling" has occurred in SQL pre-processing before I'd get into the stats engine.  Now I'm less comfortable with the staging environment, and I'm dealing with a lot of large file-based data, so merge function works well. Most important part of the below is all.x = TRUE which is the R equivalent of "left outer join".
  4. Summary: This may seem like a dumb one, but the usage is important in new organizations for a few reasons.  First, you can point it at almost any object and return top level information, including data frames.  The descriptive statistics returned both give you an idea of the nature of the data distribution and a hint of data type, in the case of import issues.  Second, you can pull model statistics from the summary function of a model-this may not make sense now, but check out number five.
  5. Automated model building:  This is a tool that is useful in a new organization where you don't know how variables correlate, and just want to get a base idea.  I created an "auto-generate me a model" algorithm a few years ago, and can alter the code in various ways to incrementally add variables, test different lags for time series, and very quickly test several model specifications.  I've included the *base* code for this functionality in the image below to give you an idea of how I do it.
Code examples from above steps:

 #1 CAST  
 mdsp <- cast(md, acct ~ year, value = 'avg_num')  
 names(md) <- tolower(names(md))  
 #3 MERGE  
 finale <- merge(x = dt1,y = dt3,by = "acct", all.x = TRUE)  
 #setup dependent and dataset  
 initial <- ("lm(change~")  
 dat <- "indyx"  
 #setup a general specificatoin and lagset to loop over  
 specs <- c("paste(i,sep='')", "paste(i,'+',i,'_slope',sep='')")  
 month <- c("february","march","april","june","july","august","september","october","november")  
 #setup two matrices to catch summary stats  
 jj <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(jj) <- month  
 colnames(jj) <- specs  
 rsq <- matrix(nrow = length(month), ncol = length(specs))  
 rownames(rsq) <- month  
 colnames(rsq) <- specs  
 mods <- NULL  
 #loop through models  
 for(j in specs){  
 for(i in month) {  
      model <- paste(initial,eval(parse(text = j)),",data=",dat,")")  
      temp <-summary(eval(parse(text = model)))  
      jj[[i,j]] <- mean(abs(temp$residuals))  
      rsq[[i,j]] <- temp$r.squared  
 #choose best model (can use other metrics too, or dump anything ot the matrices)  
 which(rsq == max(rsq), arr.ind = TRUE)