Tuesday, December 23, 2014

R in Production Diaries: The GLM Trap, Part 2

This week, following last week's post on bloated GLMs and their impact on an R production environment, I hit an even bigger snag.  A GLM I built on an even larger data set was still huge after I set its data component to NULL.  So I went hunting.

1. First, get a list of the components of the GLM object using the ls() command.
2. Use the object.size() function to find the bloated components.
3. Use sg$XXX <- NULL to get rid of the problematic parts.

My code is included below. I was able to reduce the size from 96MB to 3MB.  That is still larger than it needs to be for predictions, but the qr component appears to be necessary for prediction to work.  So I did some more searching. It turns out that the large object inside sg$qr is sg$qr$qr, which is not necessary for prediction.  With that set to NULL as well, the model comes in at around 80KB.  I can live with that.
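To illustrate the qr point, here is a minimal sketch on a small simulated model. The data, the name `fit`, and the other variable names are made up for illustration and are not the model from this post. It compares dropping the whole qr component with dropping only the matrix inside it.

```r
# Toy logistic regression on simulated data (an assumption, not the post's model).
set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = binomial)

# Size in bytes of each piece inside the qr component.
qr_sizes <- vapply(unclass(fit$qr),
                   function(part) as.numeric(object.size(part)),
                   numeric(1))
print(qr_sizes)

newdata <- data.frame(x = c(-1, 0, 1))

# Case 1: drop the whole qr component and try to predict.
no_qr <- fit
no_qr$qr <- NULL
whole_qr_result <- tryCatch(predict(no_qr, newdata = newdata),
                            error = function(e) "error")

# Case 2: drop only the matrix inside qr, keeping pivot, rank, etc.
no_qr_matrix <- fit
no_qr_matrix$qr$qr <- NULL
matrix_only_result <- predict(no_qr_matrix, newdata = newdata)
```

In this sketch the first case should end in an error and the second should return three predictions, which matches what I saw with the real model: the qr component has to be there, but its big matrix does not.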



> as.data.frame(ls(sg))
              ls(sg)
1                aic
2           boundary
3               call
4       coefficients
5          contrasts
6            control
7          converged
8               data
9           deviance
10           df.null
11       df.residual
12           effects
13            family
14     fitted.values
15           formula
16              iter
17 linear.predictors
18            method
19         na.action
20     null.deviance
21            offset
22     prior.weights
23                qr
24                 R
25              rank
26         residuals
27             terms
28           weights
29           xlevels





> object.size(sg$aic)
32 bytes
> object.size(sg$boundary)
32 bytes
> object.size(sg$call)
1368 bytes
> object.size(sg$coefficients)
464 bytes
> object.size(sg$contrasts)
0 bytes
> object.size(sg$control)
328 bytes
> object.size(sg$converged)
32 bytes
> object.size(sg$data)
82537928 bytes
> object.size(sg$deviance)
32 bytes
> object.size(sg$df.null)
32 bytes
> object.size(sg$df.residual)
32 bytes
> object.size(sg$effects)
444352 bytes
> object.size(sg$family)
68720 bytes
> object.size(sg$fitted.values)
1627928 bytes
> object.size(sg$formula)
1072 bytes
> object.size(sg$iter)
32 bytes
> object.size(sg$linear.predictors)
1627928 bytes
> object.size(sg$method)
64 bytes
> object.size(sg$na.action)
1893264 bytes
> object.size(sg$null.deviance)
32 bytes
> object.size(sg$offset)
0 bytes
> object.size(sg$prior.weights)
1627928 bytes
> object.size(sg$qr)
3404832 bytes
> object.size(sg$R)
1232 bytes
> object.size(sg$rank)
32 bytes
> object.size(sg$residuals)
1627928 bytes
> object.size(sg$terms)
4264 bytes
> object.size(sg$weights)
1627928 bytes
> object.size(sg$xlevels)
104 bytes
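The one-call-per-component listing above can probably be collapsed into a single pass. A minimal sketch, assuming a toy model fit on simulated data (the `sg` below is a stand-in, not the model from this post):

```r
# Toy stand-in for the post's model; the data here is simulated.
set.seed(1)
d <- data.frame(x = rnorm(1000),
                g = factor(sample(letters[1:3], 1000, replace = TRUE)))
d$y <- rbinom(1000, 1, plogis(0.5 * d$x))
sg <- glm(y ~ x + g, data = d, family = binomial)

# Size in bytes of every top-level component, largest first.
sizes <- vapply(unclass(sg),
                function(part) as.numeric(object.size(part)),
                numeric(1))
sizes <- sort(sizes, decreasing = TRUE)
print(head(sizes, 10))
```

Sorting largest first puts the candidates for removal at the top of the output.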


> object.size(sg)
96499472 bytes

> sg$residuals <- NULL
> sg$weights <- NULL
> sg$fitted.values <- NULL
> sg$prior.weights <- NULL
> sg$na.action <- NULL
> sg$linear.predictors <- NULL
> sg$effects <- NULL
> sg$data <- NULL

> object.size(sg)
3483976 bytes

> sg$qr$qr <- NULL

> object.size(sg)
79736 bytes
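The removals above could be wrapped in a small helper and checked against the full model before anything ships. This is only a sketch: `strip_glm` is a hypothetical name, the data is simulated, and the check covers point predictions on new data only (not standard errors, summary(), or anything else that may read the removed parts).

```r
# Hypothetical helper (not from the post): drop the components NULLed above.
strip_glm <- function(model) {
  drop <- c("residuals", "weights", "fitted.values", "prior.weights",
            "na.action", "linear.predictors", "effects", "data")
  for (part in drop) model[[part]] <- NULL
  # Remove only the matrix inside qr; the rest of qr stays in place.
  model$qr$qr <- NULL
  model
}

# Toy model on simulated data (an assumption, not the post's model).
# A default glm() fit also keeps `model` and `y`; they are left alone here
# because the post's component listing does not include them.
set.seed(1)
d <- data.frame(x = rnorm(1000))
d$y <- rbinom(1000, 1, plogis(0.5 * d$x))
full <- glm(y ~ x, data = d, family = binomial)
slim <- strip_glm(full)

# Compare predictions on new data before trusting the stripped model.
newdata <- data.frame(x = c(-1, 0, 1))
p_full <- predict(full, newdata = newdata, type = "response")
p_slim <- predict(slim, newdata = newdata, type = "response")
stopifnot(isTRUE(all.equal(p_full, p_slim)))

print(object.size(full))
print(object.size(slim))
```

Comparing predictions from the full and stripped objects is a cheap guard against removing something predict() turns out to need.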

