Wednesday, January 18, 2017

Data Science Method: MARS Regression

People often ask which data science methods I use most often on the job or in exploring data in my free time.  This is the beginning of a series in which I describe some of those methods, and how they are used to explore, model and extrapolate large data sets.

Today I will cover MARS regression (Multi-Adaptive Regression Splines), a regression methodology that automates variable selection, detection of interactions, and accounts for non-linearities.  This methodology at times has become my hammer (from the saying, when you have a hammer in your hand sometimes everything looks like a nail) due to it's usefulness, ease of implementation, and accurate predictive capabilities.

The algorithm for MARS regression originated in 1991 by Jerome Friedman, and I suggest reading his original article for a full understanding of the algorithm.  BTW, because MARS is a proprietary method, the packages in many statistical programs (including R) is called "earth."  Essentially though the algorithm boils down to this:

1. The Basics: The basic mechanics to MARS involves simple linear regression using ordinary least squares (OLS) method. But there are a few twists.
2. Variable Selection: MARS self-selects variables, first using a forward stepwise method (greedy algorithm based on variables with highest squared-error reduction) followed by a backward (in this case, truly back-out) method to remove over-fit coefficients from the model.
3. Non-Linearity: MARS uses multiple "splines" or hinge functions inside of OLS to account for potentially non-linear data.  Piecewise-linear-regression is a rough analog to the hinge functions, except in the case of MARS, the location of hinges are auto detected through multiple iterations. That is to say, through the stepwise process the algorithm iteratively tries different break-points in the linearity of the model, and selects any breakpoints that fit the data well.  (Side note: sometimes when describing these models to non-data scientists, I refer to the hinges humorously as "bendies."  Goes over much better than "splines" or "hinges.")
4. Regularization: The regularization strategy for MARS models uses Generalized Cross Validation (GCV) complexity versus accuracy tradeoffs during the backwards pass of the model.  GCV involves a user set "penalty factor," so there is room for some manipulation if you run into overfit issues.  As dynamic hinge functions give MARS flexibility to conform to complex functions (intuitively eats degrees of freedom with more effective factors considered in the equation), it increases probability of overfitting.  As such, it is very important to pay attention to regularization procedures.

The hinge function takes this type of form in the equation, allowing the regression splines to adapt to the data across the x axis.

• Ease of Fit: Two factors impact MARS models ease of fit: variable selection and hinge functions. A while back I was faced with a task where I needed to fit about 120 models (all different dependent variables) in two weeks. Due to the power of the MARS algorithm in variable selection and non-linearity detection, I was able to create these models quite easily without a lot of additional data preparation or a priori knowledge.  I still tested, validated, and pulled additional information from each model, however the initial model build was highly optimized.
• Ease of Understanding: Because the basic fit (once you get past hinge functions) is OLS, most data scientists can easily understand the coefficient fitting process.  Also, even if your final model will involve a different method (simple linear regression for instance) MARS can provide a powerful initial understanding of function shapes, from which you may decide to use related transformation (quadratic, log) in your final model form.
• Hinge Optimization: One question I often receive from business users takes the form "what is the value at which x maximizes it's value with y."  In many of these cases, depending on data form, that can be calculated directly by determining the hinge point from a MARS output, much like a local maximum point or other calculus-based optimization strategy.