Tuesday, January 20, 2015

Data Vis for Algorithms: plotmo and fancyRpartPlot

One thing data scientists often struggle with is quickly describing how their predictive models work. When a model directly affects the business, executives want to know which factors drive the prediction, how significant each factor is, and whether there is a point at which a factor no longer matters.

Because the models are complex, and coefficient matrices and parameters aren't directly interpretable for decision makers, we need a high-level way to describe which factors matter, when they stop mattering, and the "direction" of their effect.

I've also found that when someone on my team needs to quickly describe how their model works to the rest of the team (beyond ROC, MSE, and other performance metrics), it helps to have a quick visual reference.

Plotmo works for many model types (we use it for neural nets, SVMs, and spline regression models built with earth). It is described as a "poor man’s partial dependence plot" and gives a ceteris paribus view of how each predictor relates to the response variable: one predictor is varied while the others are held fixed.
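To make the idea concrete, here is a minimal sketch of the computation behind that kind of plot. It is written in Python purely for illustration (plotmo itself is an R package, and this is not its implementation). The `toy_model` function and `toy_rows` data are hypothetical stand-ins for a fitted model and its training data. One predictor is swept across its observed range while every other predictor is held at its median, and the prediction is recorded at each grid point.

```python
from statistics import median


def ceteris_paribus_curve(predict, rows, var_index, n_points=5):
    """Sweep one predictor over its observed range, holding the others at their medians."""
    n_vars = len(rows[0])
    # Baseline observation: the median of every predictor.
    baseline = [median(r[j] for r in rows) for j in range(n_vars)]
    lo = min(r[var_index] for r in rows)
    hi = max(r[var_index] for r in rows)
    curve = []
    for k in range(n_points):
        x = lo + (hi - lo) * k / (n_points - 1)
        point = list(baseline)
        point[var_index] = x  # only this predictor changes
        curve.append((x, predict(point)))
    return curve


# Hypothetical stand-in for a fitted model: the second predictor
# has no effect until it passes 3.0.
def toy_model(x):
    return 2.0 * x[0] + max(0.0, x[1] - 3.0)


# Hypothetical toy data: five observations of two predictors.
toy_rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 5.0], [3.0, 4.0], [4.0, 3.0]]
curve_x1 = ceteris_paribus_curve(toy_model, toy_rows, var_index=1)

for x, y in curve_x1:
    print(f"x1={x:.1f} -> prediction={y:.1f}")
```

Plotting the pairs in `curve_x1` gives a flat line that starts rising after 3.0, which is the kind of "this factor doesn't matter below a threshold" story that is easy to show an executive.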

Two- and three-dimensional plotmo outputs

fancyRpartPlot works for simple decision trees.  We don't use these for predictions often, but I sometimes use these to demonstrate basic data partitioning to executives.  For quite a while, I didn't have a good way to quickly make a "pretty" and easily readable output plot.  With a single command wrapper, I can easily create a plot that looks like:
fancyRpartPlot output


  1. Hello, for the fancyRpartPlot I get a result similar to yours, with n=16e+3 and 45%, for example. What does this mean? I don't understand it at all, because the examples on many websites look different and a bit more logical to me. For example, like this: http://media.tumblr.com/a9f482ff88b0b9cfaffca7ffd46c6a8e/tumblr_inline_mz7pyuaYJQ1s5wtly.png
    Can you please explain what the n=16e+3 part and the percentage mean? And why it's different from the example I showed you?
    I would be very grateful.

    1. Great question; the numbers aren't clearly labeled. In my case, the tree is trained on a simple 0/1 binary dependent variable. The number at the top of each node is the density (average) of the dependent variable for that node. The number in exponential notation at lower left is the number of observations that fall into that node (16e+3 is roughly 16,000). The percent at lower right is the percent of the total population that falls into that node. So starting at my top node, you see that my entire population involves 38k observations, with a 0.39 dependent average, which accounts for 100% of the total population.
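As a rough illustration of how those three node labels relate to the data, here is a small Python sketch. It is not how fancyRpartPlot computes them internally; the `node_labels` helper and the toy 0/1 responses are hypothetical.

```python
def node_labels(node_responses, total_n):
    """Return the three numbers shown on a node for a 0/1 dependent variable."""
    n = len(node_responses)
    return {
        "mean": round(sum(node_responses) / n, 2),  # top of node: average of the 0/1 response
        "n": n,                                     # lower left: observations in the node
        "pct": round(100 * n / total_n),            # lower right: share of the total population
    }


# Hypothetical toy data: 10 observations at the root, 4 of which fall into one child node.
root = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
child = [1, 1, 1, 0]

root_labels = node_labels(root, len(root))
child_labels = node_labels(child, len(root))
print(root_labels)
print(child_labels)
```

The root node always covers 100% of the population, and each child reports its own average and its share of the root's observations.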
