Wednesday, December 31, 2014

Explaining what I "do"

Being from central Kansas, I always find it a challenge to explain to relatives what I do for a living.  There aren't a lot of businesses in rural KS, and certainly not many using high-level analytics.  Generally the conversation goes like this:

(please read bolded sections with a strong central KS accent)

What do you do for a living?

I write programs and code that analyzes data and makes decisions like who we should market to, and who we should make loans to.  

So you're a computer programmer?

Kind of.. but that's just a window into the analysis side of....

(eyes roll back into head)

Yeah, sure, I'm a computer programmer.

This used to be worse because I worked at  (doing similar work to what I do now) .. and would receive questions about whether I helped load boxes onto trucks (of course, I didn't).  In fact, relatives reacted more positively to the idea of me loading boxes, because it's physical labor (which they respect more) and something they can easily understand.

The interesting thing is that my relatives regularly interact with a number of companies that use data science.  They all use Google, Facebook, and a variety of other modern businesses.  But it seems that the absence of those businesses (and professionals) in their communities makes it difficult to understand the types of jobs that allow those businesses to function.

Explaining what I do doesn't bother me much anymore, but I find it funny, and it may be part of why there aren't a ton of small-town kids going into data science or STEM in general.  When you don't have exposure to large businesses and their demand for weird combinations of math, statistics, software engineering, and programming skills, it's hard to imagine why you would ever try to develop those skills.

Tuesday, December 23, 2014

R in Production Diaries: The GLM trap.. Part 2

So this week, following last week's post regarding bloated GLMs and their impact on the R-in-production environment, I hit an even bigger snag.  A GLM I built with an even larger data set was still huge after I NULLed the data.  So.. I went hunting.

1. First, get a list of the objects stored in the GLM using the ls() command.
2. Use the object.size() function to find the bloated components.
3. Use sg$XXX <- NULL to get rid of the problematic parts.

My code is included below; basically, I was able to reduce the size from 96MB to 3MB.  This is still larger than it needs to be for predictions, but the qr component appears to be necessary for the prediction to work.  So I did some more searching... it turns out the huge object inside sg$qr is sg$qr$qr, which is not necessary for prediction.  So I NULLed that out too and came in at around 80KB.  I can live with that.

1                aic
2           boundary
3               call
4       coefficients
5          contrasts
6            control
7          converged
8               data
9           deviance
10           df.null
11       df.residual
12           effects
13            family
14     fitted.values
15           formula
16              iter
17 linear.predictors
18            method
19         na.action
20     null.deviance
21            offset
22     prior.weights
23                qr
24                 R
25              rank
26         residuals
27             terms
28           weights
29           xlevels

> object.size(sg$aic)
32 bytes
> object.size(sg$boundary)
32 bytes
> object.size(sg$call)
1368 bytes
> object.size(sg$coefficients)
464 bytes
> object.size(sg$contrasts)
0 bytes
> object.size(sg$control)
328 bytes
> object.size(sg$converged)
32 bytes
> object.size(sg$data)
82537928 bytes
> object.size(sg$deviance)
32 bytes
> object.size(sg$df.null)
32 bytes
> object.size(sg$df.residual)
32 bytes
> object.size(sg$effects)
444352 bytes
> object.size(sg$family)
68720 bytes
> object.size(sg$fitted.values)
1627928 bytes
> object.size(sg$formula)
1072 bytes
> object.size(sg$iter)
32 bytes
> object.size(sg$linear.predictors)
1627928 bytes
> object.size(sg$method)
64 bytes
> object.size(sg$na.action)
1893264 bytes
> object.size(sg$null.deviance)
32 bytes
> object.size(sg$offset)
0 bytes
> object.size(sg$prior.weights)
1627928 bytes
> object.size(sg$qr)
3404832 bytes
> object.size(sg$R)
1232 bytes
> object.size(sg$rank)
32 bytes
> object.size(sg$residuals)
1627928 bytes
> object.size(sg$terms)
4264 bytes
> object.size(sg$weights)
1627928 bytes
> object.size(sg$xlevels)
104 bytes

> object.size(sg)
96499472 bytes

> sg$residuals <- NULL
> sg$weights <- NULL
> sg$fitted.values <- NULL
> sg$prior.weights <- NULL
> sg$na.action<- NULL
> sg$linear.predictors <- NULL
> sg$effects <-NULL
> sg$data <- NULL

> object.size(sg)
3483976 bytes

> sg$qr$qr <- NULL

> object.size(sg)
79736 bytes
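The cleanup above can be wrapped into a reusable helper.  A minimal sketch (the function name strip_glm is mine, not from the session above; the components dropped are exactly the ones NULLed there, and the dummy model is fit with model = FALSE, y = FALSE since the ls() output shows no model or y components):

```r
# Drop the components of a fitted glm that aren't needed for predict().
# The list mirrors the NULL-ing session above, including the qr$qr matrix.
strip_glm <- function(model) {
  heavy <- c("residuals", "weights", "fitted.values", "prior.weights",
             "na.action", "linear.predictors", "effects", "data")
  for (part in heavy) model[[part]] <- NULL
  model$qr$qr <- NULL   # the last big piece; not used by predict()
  model
}

# Quick check on a dummy logistic model
d   <- data.frame(x = rnorm(10000), y = rbinom(10000, 1, 0.5))
fit <- glm(y ~ x, data = d, family = binomial, model = FALSE, y = FALSE)
sm  <- strip_glm(fit)
print(object.size(fit))
print(object.size(sm))
```

Worth verifying in your own session that predict(sm, newdata, type = "response") still matches the full model; it does for the components listed here, but se.fit = TRUE needs more of the qr slot.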

Monday, December 22, 2014

Director of Simple Math

So, we hear a lot these days about Americans being bad at math, and this is fairly well documented in articles like this.  There are a lot of theories on this, from schools that don't focus on math to letting kids off the hook with the excuse that "math is hard."

Usually I experience this when I try to hire someone, and can find about 1 American for every 10 immigrants with the math skills the job requires.  But what I'm experiencing more and more is a general ineptitude in simple math among relatively high-level directors at big companies.  A couple of examples:

  • Just last week, I spent well over thirty minutes (while beer-buzzed, mind you) explaining to a director how to multiply 0.4% X 100K. (a bit of hyperbole, but the number work behind this was just too much) 
  • The week before that, I had to explain how interest amortization works to a high-level finance employee. 
This math problem is something we need to address long-term, likely through our education system.  But possibly, in the short term, we need to hire a "Director of Simple Math" to come in and do all the math for high-level employees, especially when the problems exceed adding whole numbers.  

(This post is largely hyperbole, but come on America, we have to do something about our math problem.)

Thursday, December 18, 2014

R in Production Diaries: The GLM Trap

So.  Time for an admission.  I run R in the production environment.  Our R Linux box serves as a "decision engine," getting hit by live web requests thousands of times a day.  This isn't an optimal solution, but it was the quick solution, and performance isn't horrible.  But I'll post more on R in production later...

One of the biggest "tricks" about running R in an online processing environment is that it wasn't built for quick one-time processing jobs.  The R service, the language itself, and most libraries are built for scientists analyzing large data sets and drawing conclusions about scientific hypotheses.  This creates some behavior that is not ideal for online processing, but it can often be fixed.  

Specifically, if you use the GLM function (I use it early in our decision engine as part of an ensemble process), you'll notice that saved models can be huge.  Early in our production deployment, I found that it was taking on average 8-10 seconds just to load the GLM model (which was 95% of overall processing time).  

Further inspection of the GLM function code showed that when the model is saved, it saves a copy of the training data as well as summary statistics, none of which are needed to create predictions.  I tried what seemed like too easy a solution to reduce the size of the model:

glm$data <- NULL

This worked: it reduced the disk size of the model by 98% and cut load time to less than half a second.  Here's what the full code looks like:
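The snippet itself appears to have been lost from the post; here's a minimal reconstruction under stated assumptions (dummy data, a model object I've simply called m, and a tempfile() stand-in for the production model path):

```r
# Fit a model, NULL the embedded copy of the training data, and save the
# slimmed object so the decision engine can load it quickly.
d <- data.frame(x = rnorm(5000), y = rbinom(5000, 1, 0.5))
m <- glm(y ~ x, data = d, family = binomial)

before <- as.numeric(object.size(m))
m$data <- NULL                       # drop the stored data frame
after  <- as.numeric(object.size(m))

f <- tempfile(fileext = ".rds")      # stand-in for the production path
saveRDS(m, f)                        # smaller file -> faster readRDS()
m2 <- readRDS(f)
p  <- predict(m2, newdata = d[1:3, ], type = "response")
```

Predictions on new data still work, since predict() builds its model matrix from the terms and coefficients rather than the stored data.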

Wednesday, December 17, 2014

Big Data

I was at a presentation a few months ago where executives from various organizations were being sold a "big data" product.  Well, that's what they were told they were getting.  The presenter used all the "right" cliche jokes, such as

"Big data is like teenage sex, everyone is talking about it, 
everyone thinks everyone else is doing it, but no one really is"

Wow, these guys must be creating a new, innovative analytics product if they're willing to talk about other analysts and the rest of the industry that way.

So.  What were they selling?  

A desktop-based GIS-type database that appeared to hold about 50MB of data.  No. Shit.  It's a product I could put together using publicly available data, a QuantumGIS backend, and a junior analyst for a couple of days.

What's my point?  

The term "Big Data" is powerful branding, especially to executives who have been led to think they need it, yet don't understand it, and react positively to colorful GIS maps.  Part of the analyst's job is to explain what we can do easily, what is difficult, and what the potential lift is.  Essentially, a meta-analysis has to occur: the business case and potential profit of the analysis or analytics product itself.

Blog Start

So.. this is my data science blog.. just a place for me to put together thoughts on what I do for a living, code I write, challenges, and the avoidance of the term "big data." 

Hopefully I provide useful information here, or at least something for other people's amusement.