As background, here are some of the issues I've encountered with decision trees:
- They can have massive over-fitting problems (this is well documented).
- The simplest versions (think the default rpart() in R) tend to predict in big buckets of homogeneous probabilities that don't fit the real world, and they don't do a great job of discriminating between outcomes in a meaningful way.
- They don't do well at bucketizing continuous variables, especially when the probability implications of the continuous variable are also... continuous.
- They can perform poorly on newly encountered attribute combinations.
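To make the bucketing complaint concrete, here is a minimal sketch in plain Python. It is not rpart and not our model: fit_stump is a hypothetical helper that fits a single-split regression tree directly to made-up "true" probabilities that rise smoothly with a continuous variable.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean on each side."""
    best = None
    for t in sorted(set(xs))[1:]:  # every candidate split point
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

# Toy data (made up): the true probability rises smoothly from 0% to 9.9%.
xs = list(range(100))
true_p = [x / 1000 for x in xs]

threshold, left_mean, right_mean = fit_stump(xs, true_p)
predicted = [left_mean if x < threshold else right_mean for x in xs]

print("split at x =", threshold)
print("distinct true probabilities:", len(set(true_p)))
print("distinct predicted probabilities:", len(set(predicted)))
```

A deeper tree adds more steps, but every record that lands in the same leaf still gets the same flat probability.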
We saw a combination of these four things come together to create extremely bad predictions in one of our new processes. Here's a description of our process:
- 500K+ records to be evaluated nightly.
- Training set is 100 million+ records.
- Our dependent variable is "success".
- Most records (90%+) will have a very low probability of success (<1%).
- The business payoff is in evaluating a smallish set of records with a >4% chance of success.
So, here's a chart I put together to evaluate the process. Horizontal axis is our scoring method (which is a tree + a couple of other methodologies) and vertical axis is actual outcome. The vertical bars represent volume in each bucket.
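For anyone who wants to build the same kind of check, the table behind a chart like this can be sketched in a few lines of Python. The function name calibration_table, the 0.01 bucket width, and the data are all assumptions for illustration; none of it is our production code or our real numbers.

```python
from collections import defaultdict

def calibration_table(scores, outcomes, width=0.01):
    """Return (bucket_start, volume, actual_success_rate) per score bucket."""
    volume = defaultdict(int)
    successes = defaultdict(int)
    for score, outcome in zip(scores, outcomes):
        # The small nudge guards against float error at bucket edges.
        bucket = int(score / width + 1e-9)
        volume[bucket] += 1
        successes[bucket] += outcome
    return [(round(b * width, 10), volume[b], successes[b] / volume[b])
            for b in sorted(volume)]

# Toy data (made up): predicted score and 0/1 outcome per record.
scores = [0.001] * 6 + [0.045] * 3 + [0.085] * 2
outcomes = [0] * 6 + [1, 0, 0] + [0, 0]

table = calibration_table(scores, outcomes)
for start, n, rate in table:
    print(f"score >= {start:.2f}: volume={n}, actual rate={rate:.3f}")
```

A bucket whose actual rate sits far away from its score range is the thing to look for.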
So, the scoring appears to predict well given our assumptions. Most predictions are in the lowest probability bucket, and that bucket succeeds at approximately 0%. Then we see a gradual shift up as the score increases until... CRAP WHAT HAPPENED AT .08?
I had a feeling I knew what was going on. We dug in, and found that there was a somewhat major issue with one of the buckets of the decision tree... for a combination of my four reasons above. Mainly, the new data encountered involved some attributes not accounted for by the original tree, causing some predictions to be scored much higher than their actual propensity towards success.
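One cheap guard against this kind of surprise is to check nightly records for attribute combinations the training set never contained. The sketch below is an assumption about how that could look, not what we actually run; unseen_combinations and the attribute names "channel" and "region" are hypothetical.

```python
def unseen_combinations(training_rows, scoring_rows, attributes):
    """Return scoring rows whose attribute combination never appears in training."""
    seen = {tuple(row[a] for a in attributes) for row in training_rows}
    return [row for row in scoring_rows
            if tuple(row[a] for a in attributes) not in seen]

# Toy data (made up); "channel" and "region" are hypothetical attribute names.
training = [
    {"channel": "web", "region": "east"},
    {"channel": "web", "region": "west"},
    {"channel": "phone", "region": "east"},
]
tonight = [
    {"channel": "web", "region": "east"},
    {"channel": "phone", "region": "west"},    # both values seen, but never together
    {"channel": "partner", "region": "east"},  # brand-new value
]

flagged = unseen_combinations(training, tonight, ["channel", "region"])
print(len(flagged), "of", len(tonight), "records fall outside the training data")
```

Flagged records can then be held back or scored by a different model instead of being trusted blindly.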
We are still working through this issue, but we think switching to a random forest methodology should overcome it. In the worst-case scenario, we can just remove the tree and use our other trained models.
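As a rough illustration of why averaging many trees can help with the big-bucket problem, here is a sketch of bagging, the averaging idea behind random forests. With a single input variable there are no features to subsample, so this is plain bootstrap averaging of one-split trees and not a full random forest. The data is made up, and fit_stump and predict are hypothetical helpers.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean on each side."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = list(range(100))
true_p = [x / 1000 for x in xs]  # toy data (made up): smooth 0% to 9.9%

# Fit each stump on a bootstrap sample (drawn with replacement).
stumps = []
for _ in range(25):
    idx = [rng.randrange(len(xs)) for _ in xs]
    stumps.append(fit_stump([xs[i] for i in idx], [true_p[i] for i in idx]))

# The ensemble prediction is the average over all stumps.
averaged = [sum(predict(s, x) for s in stumps) / len(stumps) for x in xs]
print("distinct averaged predictions:", len(set(averaged)))
```

A single stump gives two flat buckets. The average gives more, because the bootstrap samples put their splits in slightly different places. This only shows the smoothing effect; it is not evidence that a forest handles unseen attribute combinations.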
Two takeaways:
- Test everything. Always.
- Be wary of decision trees, especially the simple ones. We are switching to a random forest methodology, which we believe will fix our problem.