Thursday, December 18, 2014

R in Production Diaries: The GLM Trap

So.  Time for an admission.  I run R in the production environment.  Our R Linux box serves as a "decision engine," getting hit by live web requests thousands of times a day.  This isn't an optimal solution, but it was the quick one, and performance isn't horrible.  More on R in production in later posts...

One of the biggest "tricks" to running R for online processing is that it wasn't built for quick, one-off jobs.  The R service, the language itself, and most libraries are built for scientists analyzing large data sets and drawing conclusions about scientific hypotheses.  This creates some behavior that isn't ideal for online processing, but it can often be worked around.

Specifically, if you use the glm() function (I use it early in our decision engine as part of an ensemble), you'll notice that saved models can be huge.  Early in our production deployment, I found that it was taking 8-10 seconds on average just to load the GLM model, which was 95% of overall processing time.
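If you want to see where the time goes, a quick check like this works (the file name here is just a stand-in for wherever your fitted model gets saved):

# Time the model load and check how big the serialized object is.
system.time(model <- readRDS("glm_model_full.rds"))
print(object.size(model), units = "Mb")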

Further inspection of the glm() code showed that when the model is saved, it keeps a copy of the training data, as well as summary statistics, none of which are needed to generate predictions.  I tried what seemed like too easy a solution to shrink the fitted model object (called model here):

model$data <- NULL   # drop the embedded copy of the training data

This worked: it reduced the disk size of the model by 98% and cut load time to under half a second.  Here's a fuller example of what this looks like:
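(The mtcars data set and the file names below are just stand-ins for our real training data and paths; the fields stripped are the ones predict() doesn't need when you score with newdata.)

# Fit a model -- mtcars is just a stand-in for the real training data.
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Strip the pieces predict() doesn't need when scoring with newdata.
# Coefficients, terms, xlevels, contrasts, family, qr, and rank are kept.
model$data              <- NULL   # full copy of the training data
model$model             <- NULL   # the model frame (another copy of the data)
model$y                 <- NULL   # response vector
model$residuals         <- NULL
model$fitted.values     <- NULL
model$effects           <- NULL
model$weights           <- NULL
model$prior.weights     <- NULL
model$linear.predictors <- NULL

# Save the slimmed-down object for the decision engine to load.
saveRDS(model, "glm_model.rds")

# Later, in the live process: load once, then score each incoming request.
model <- readRDS("glm_model.rds")
predict(model, newdata = data.frame(hp = 120, wt = 2.8), type = "response")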
