Wednesday, January 28, 2015

Now Hiring!

So, I'm hiring again.  I just went through the office budgetary process, and it was decided that my department is creating enough value to justify adding more people and better serve the business.

Good news! Good news for the field, and good news for me personally.

Except now I have to actually hire someone.  

The last hire I made is working out great, but I still dread this process.  There will be a seemingly endless string of interviews, offers, negotiations, and hopefully, a few really good candidates.

The problem with hiring data scientists is fairly well documented around the internet.  Being in the Kansas City area makes it even more difficult.  Data scientists can choose to live anywhere they want... why would they choose Kansas City over Denver or the Bay Area?

What am I looking for? 

Well, I'm looking for a well-rounded data analyst... much like what is described in this article on Pandora's first data scientist.  Basically I'm looking for someone who is part analyst, part software engineer, and has a good background in machine learning and statistical methods.  In Kansas City.  (Comment on this article and I'll help you apply.)

The next few months should be interesting.


Side note: From the article I posted above, I had assumed that Pandora had extremely advanced algorithms long before this guy started... but apparently they were using some other kind of playlist strategy?


Monday, January 26, 2015

The simplest solution...

Often people from different parts of the business come to me when they're in trouble.  This trouble generally stems from some program not working as they expected, or from a revenue shortfall.

A couple of things define these requests:

  1. They're looking for some "data magic"... essentially some large-scale statistical model to figure out the problem, define a path to a solution, implement some kind of decisioning, or, at the very least, explain why they shouldn't be fired. 
  2. By the time they come to me, the situation is already really bad.


The second point here is interesting, because of a correlation I've observed:

The worse a data problem seems, the simpler and more fundamental the likely cause.

This means I field quite a few requests where the solution is quite simple and, generally, quite upsetting to the business.  Here are a few examples of data science requests, matched with their solutions:

Q: Can you put a model together to determine why our paid search campaign isn't working?
A: The "apply now" button has been broke for six months.

Q: Why is our web conversion rate low?
A: Because your website looks a lot like a spam site.

Q: Why are final revenue numbers 20% lower than the initial numbers? Is accounting wrong?
A: Because you have a 20% cancellation rate.

Q: Why aren't sales people doing what we want?
A: Because your incentive plan creates a perverse incentive not to work hard.

I realize that these types of problems will likely continue to come my way, but they form the basis for an important lesson learned:
When analyzing seemingly horrible and counter-intuitive business results, start from the beginning, and look for the simplest solutions.

Friday, January 23, 2015

Un-Scaling

The company I work for is trying to start a new online brand, one that's heavily dependent on analytics and data science work.  I see this as a huge and exciting opportunity, but also one that needs scale.

The biggest issue I run into is that online analytics do not "unscale" well: you almost have to go big or go home.

We're starting small: we're not even throwing a full analyst at the problem, and we're not able to throw a ton of marketing resources in either, which means low traffic and low volume at this point.

The low volume is a HUGE issue in analytics, especially in the online space.  Here are the main issues I've identified:


1. Low Volume = Low N.  It's a simple statistical concept: I can't develop highly accurate predictive algorithms and models without a large population and experience.  

2. Low Volume means it takes a while to figure out issues.  We have seen several issues, both in web-side technical matters and in the first set of derived analytics rules we set for the website.  Simply not seeing many transactions means it takes longer to diagnose problems.

3. Low Volume means we're resource constrained.  In essence, we aren't spending a lot of money here, and we aren't making a lot of money in the near future, so business decision makers don't want us to spend a lot of time here.  This makes the business ramp-up curve even longer.  

4. Low Volume creates bad models.  This is specific to our business.  High fraud rates (which grow quickly out of the box and aren't dependent on marketing) combined with low legitimate traffic volume make the total fraud rate approach 100%.  This creates an issue where predictive models (based on experience) become overly tight, even in spaces where legitimate business exists.  Think in terms of a K-NN or an SVM: in a world where 99% of what the algorithm sees is fraud, it's difficult to find regions of the vector space where nearby points would vote "not-fraud" (see the sketch below).
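
Here's a minimal sketch of that last point, using simulated data and the knn() function from the class package; the cluster locations and the 99-to-1 class mix are made up purely for illustration:

# Simulated data: 990 "fraud" points around (0,0), only 10 legitimate points around (1.5,1.5)
library(class)  # provides knn()
set.seed(42)
train <- data.frame(
  x1 = c(rnorm(990, 0, 1), rnorm(10, 1.5, 0.5)),
  x2 = c(rnorm(990, 0, 1), rnorm(10, 1.5, 0.5))
)
labels <- factor(c(rep("fraud", 990), rep("legit", 10)))

# Score new points sitting right in the middle of the legitimate cluster
test <- data.frame(x1 = rnorm(50, 1.5, 0.5), x2 = rnorm(50, 1.5, 0.5))
table(knn(train, test, cl = labels, k = 25))
# Nearly every vote still comes back "fraud": legitimate neighbors are simply too rare
# for the vote to flip, even at the center of the legitimate region.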

Wednesday, January 21, 2015

Space Shuttle Certainty


So... I've thought about making this required reading for my team:
http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt

The link is to Richard Feynman's appendix to the Rogers Commission report on why the Challenger Space Shuttle failed.  Though it isn't the best read on statistical uncertainty, the point is simple: an inaccurate accounting of statistical uncertainty led to seven deaths and a major setback in the US space program.  I think people serious about creating mathematical models should read the whole thing, but the important part is this:

There were a cloud of points some twice above, and some twice below the fitted curve, so erosions twice predicted were reasonable from that cause alone. Similar uncertainties surrounded the other constants in the formula, etc., etc. When using a mathematical model careful attention must be given to uncertainties in the model.
Effectively, the space shuttle failed largely because a statistical estimate of wear/tolerance was treated as a deterministic mathematical equation, and no one accounted for the uncertainty in multiple dependent parameter estimates.

Though most models aren't matters of life and death, the assessment of uncertainty is something we can't ignore if we are to make accurate predictions.
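
To make the point concrete, here's a tiny sketch in R with made-up numbers (the temperatures and erosion values are simulated, not Challenger data): the fitted curve gives a single deterministic-looking answer, while the prediction interval shows how wide the honest answer actually is.

# Simulated example: a fitted curve vs. the uncertainty around it
set.seed(1)
temp    <- runif(30, 30, 80)                          # made-up launch temperatures
erosion <- 0.25 - 0.002 * temp + rnorm(30, sd = 0.04) # made-up, noisy erosion measurements

fit <- lm(erosion ~ temp)

# A point prediction at 31 degrees looks precise; the 95% prediction interval does not
predict(fit, newdata = data.frame(temp = 31), interval = "prediction", level = 0.95)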

Feynman ends his appendix with a slap at NASA management... it's not directly relevant to this blog, but it's worth remembering:
For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.


Tuesday, January 20, 2015

Data Vis for Algorithms: plotmo and fancyRpartPlot

One thing data scientists often struggle with is quickly describing how their predictive models work.  When the model directly impacts the business, executives want to know things like which factors drive the prediction, how significant each factor is, and whether there's a point at which a factor stops mattering.

Because the models are complex, and model coefficient matrices and parameters aren't directly interpretable for decision makers, we need a high-level way to describe what factors matter, when they don't matter, and the "direction" of them mattering.  

I've also found that when someone on my team needs to quickly describe how their model works to the team (beyond ROC, MSE, and other performance metrics), it helps to have a quick visual reference.



Plotmo works for many model types (we use it for neural nets, SVMs, and spline regression (earth) models).  It is described as a "poor man's partial dependence plot" and gives a view of the ceteris paribus relationship of each predictor to the response variable.


Two-dimensional and three-dimensional plotmo outputs
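
Plots like the ones above come from a single call.  Here's a minimal sketch using the built-in trees data and an earth model; the same plotmo() call works for the other model types mentioned above:

library(earth)
library(plotmo)

mod <- earth(Volume ~ Girth + Height, data = trees, degree = 2)
plotmo(mod)  # one 2D panel per predictor, plus 3D surfaces for interaction terms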




fancyRpartPlot works for simple decision trees.  We don't use these for predictions often, but I sometimes use them to demonstrate basic data partitioning to executives.  For quite a while, I didn't have a good way to quickly make a "pretty" and easily readable output plot.  With a single command wrapper, I can easily create a plot that looks like:
fancyRpartPlot output
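
The call itself is a one-liner.  A minimal sketch using the built-in iris data (any rpart tree can be passed in):

library(rpart)
library(rattle)   # provides fancyRpartPlot()

tree <- rpart(Species ~ ., data = iris)
fancyRpartPlot(tree, main = "Simple decision tree")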



Monday, January 19, 2015

R in Production Diaries: Timeouts

On Friday I received 150+ timeout notifications from our production Linux R box.

The timeout is set at a monstrous 15 seconds, and so far we are only running 20% of all transactions through the box, so the timeouts are a material, yet sporadic issue.

The situation is even weirder in that the timeouts aren't occurring at the R box, but in an internal service layer that executes well before R is ever called.  In essence, the underlying issue has nothing to do with R in production.  

I knew the issue wasn't in the R box two months ago, but was only recently able to convince IT to put logging on their service layer that would prove it (I had logging in R from the beginning).  To quote one IT director:
We know what your logs say, but still there's 99.9% this HAS to be in your process.
I feel like I should have given a lesson in statistical certainty, but that's for another time.
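
For what it's worth, the R-side logging is nothing fancy; something along these lines (the score_transaction() function and the log path are hypothetical stand-ins) is enough to show each call finishing well inside the 15-second timeout:

# Wrap the scoring call with timestamps so every request leaves an audit trail
score_with_log <- function(request, logfile = "/var/log/r_scoring.log") {
  start  <- Sys.time()
  result <- score_transaction(request)   # hypothetical scoring function
  secs   <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  cat(sprintf("%s request=%s elapsed=%.3fs\n",
              format(start, "%Y-%m-%d %H:%M:%OS3"), request$id, secs),
      file = logfile, append = TRUE)
  result
}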

So.  Why did IT think it HAD to be a problem with R?  I have a few thoughts:

  1. They are all Microsoft/C# developers, and seem skeptical of the one Linux box in production (The OS doesn't even have a GUI, and was FREE!)
  2. R is a new product to them, and as such, it's the first place they look for problems.
  3. There seemed to be an assumption that each call to the R box was conducting massive statistical computations (kind of true, but nothing that modern hardware can't handle in <1 second).
Reasons 1 and 2 are just how developers (especially Microsoft developers) view new, open source products.  Reason 3 is actually a misconception about the process, and I think it's a great lesson learned:

It's important to educate IT and developers on the kind of computations the server is actually performing, especially if they are going to have to interface with our processes.




Thursday, January 15, 2015

On audit-ability in machine learning

Again, a moment of honesty: I started my career at the Kansas Auditor's office (known as the Kansas Legislative Division of Post Audit), working on school funding, government efficiency, and fraud.  I held the absurd title of "Principal Data Mining Auditor"... but that was a long time ago (five years).  I don't regret the experience, though after leaving I swore I'd never work with auditors again; I just like being more creative than that.

Fast forward five years, and I'm working in financial services, and suddenly auditing is key again.  This is partially due to financial managers wanting to understand the "innards" of models, but also due to outside auditors and the government (CFPB) wanting to audit our decision-making algorithms.
As a result, I sometimes have to make a decision about which algorithms to use not based on performance, but based on the ability of government auditors to understand what I do.

For instance, I use an ensemble learning process (multiple algorithms) for part of our decision making, part of which uses an SVM.  I can only train the SVM on a truncated data set, missing several variables including some employment and age information, because, in simple terms, I have to be able to prove that the output function has a continuous first derivative that never changes direction, i.e., that the score is monotonic in each input.
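
As a rough illustration of what I mean (not the formal proof an auditor actually requires), here's a sketch of an empirical spot-check: sweep one input over its observed range, hold the other numeric predictors at their medians, and verify the predicted score never changes direction.  The fit object and column names here are hypothetical:

# Empirical monotonicity spot-check for a fitted model with numeric predictors
check_monotone <- function(fit, data, var, n = 100) {
  grid <- as.data.frame(lapply(data, function(col) rep(median(col), n)))  # hold others at medians
  grid[[var]] <- seq(min(data[[var]]), max(data[[var]]), length.out = n)  # sweep the variable of interest
  score <- as.numeric(predict(fit, newdata = grid))
  diffs <- diff(score)
  all(diffs >= 0) || all(diffs <= 0)   # TRUE if predictions never change direction
}

# e.g., check_monotone(svm_fit, training_data, "loan_amount")  -- names are hypothetical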

So, this is my short list (a generalization, really) of auditable versus non-auditable methods:

Can be audited:
Multivariate Regression
GLM (logistic)
Spline Regression
Simple Decision Trees
Naive Bayes


Difficult to audit:
Artificial Neural Networks
Support Vector Machines
Relevance Vector Machines
Random Forest