Wednesday, January 18, 2017

Data Science Method: MARS Regression

People often ask which data science methods I use most often on the job or in exploring data in my free time.  This is the beginning of a series in which I describe some of those methods, and how they are used to explore, model and extrapolate large data sets.

Today I will cover MARS regression (Multivariate Adaptive Regression Splines), a regression methodology that automates variable selection and interaction detection and accounts for non-linearities.  This methodology at times has become my hammer (from the saying, when you have a hammer in your hand sometimes everything looks like a nail) due to its usefulness, ease of implementation, and accurate predictive capabilities.

The algorithm for MARS regression was introduced by Jerome Friedman in 1991, and I suggest reading his original article for a full understanding of the algorithm.  BTW, because "MARS" is a trademarked name, open-source implementations (including the R package) are called "earth."  Essentially, though, the algorithm boils down to this:

  1. The Basics: The basic mechanics of MARS involve linear regression fit by the ordinary least squares (OLS) method. But there are a few twists.
  2. Variable Selection: MARS self-selects variables, first using a forward stepwise pass (a greedy algorithm that adds the terms giving the largest reduction in squared error) followed by a backward (in this case, truly back-out) pass that removes over-fit terms from the model.
  3. Non-Linearity: MARS uses multiple "splines" or hinge functions inside of OLS to account for potentially non-linear data.  Piecewise linear regression is a rough analog to the hinge functions, except that in the case of MARS, the locations of the hinges are detected automatically through multiple iterations. That is to say, through the stepwise process the algorithm iteratively tries different breakpoints in the linearity of the model, and selects any breakpoints that fit the data well.  (Side note: sometimes when describing these models to non-data scientists, I refer to the hinges humorously as "bendies."  Goes over much better than "splines" or "hinges.")
  4. Regularization: The regularization strategy for MARS models uses Generalized Cross Validation (GCV), which trades off complexity against accuracy, during the backward pass of the model.  GCV involves a user-set "penalty factor," so there is room for some manipulation if you run into overfit issues.  Because dynamic hinge functions give MARS the flexibility to conform to complex functions (intuitively, each hinge eats degrees of freedom, since more effective factors are considered in the equation), they also increase the probability of overfitting.  As such, it is very important to pay attention to regularization procedures.
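
The complexity-versus-accuracy tradeoff in step 4 can be sketched in a few lines of Python. This assumes the GCV formula as it is usually stated for earth-style implementations: RSS / (N * (1 - C/N)^2), with an effective parameter count C = terms + penalty * (terms - 1) / 2. The function name `gcv`, the default penalty of 2, and the RSS figures are illustrative assumptions, not output from a real model.

```python
def gcv(rss, n_obs, n_terms, penalty=2.0):
    """Generalized Cross Validation score (assumed formula); lower is better."""
    # Effective parameter count: every term beyond the intercept is charged
    # extra, via the penalty, for the knot search that produced it.
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    if effective_params >= n_obs:
        return float("inf")  # more effective parameters than observations
    return rss / (n_obs * (1.0 - effective_params / n_obs) ** 2)


# Hypothetical candidates fit to the same 50 observations: the larger model
# has a slightly lower RSS but uses three times as many terms.
small_model = gcv(rss=120.0, n_obs=50, n_terms=3)
large_model = gcv(rss=110.0, n_obs=50, n_terms=9)
preferred = "small" if small_model < large_model else "large"
```

In this toy comparison the small reduction in RSS does not pay for the extra terms, so the backward pass would keep the smaller model. Raising `penalty` makes each additional term more expensive, which is the knob to reach for when a model looks overfit.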

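A hinge function takes the form max(0, x - c) or max(0, c - x), where c is the breakpoint (or "knot"). Each term is zero on one side of c and linear on the other, which allows the regression splines to adapt to the data across the x axis.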
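
To make the hinge idea concrete, here is a minimal Python sketch of a single forward step on one predictor: try each interior data value as a knot, fit OLS on the mirrored hinge pair max(0, x - c) and max(0, c - x), and keep the knot with the lowest residual sum of squares. This is not the earth implementation, which repeats the search across all variables and interactions. The names `hinge`, `solve`, `ols_rss`, and `best_knot`, and the toy data, are made up for illustration.

```python
def hinge(x, knot, sign=1):
    """max(0, x - knot) for sign=+1, max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - known) / m[i][i]
    return coef


def ols_rss(rows, y):
    """Fit OLS through the normal equations; return (coefficients, RSS)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    return coef, sum((t - f) ** 2 for t, f in zip(y, fitted))


def best_knot(x, y):
    """Return (knot, rss, coef) for the best single hinge pair."""
    # Interior values only, so both hinge columns are non-zero somewhere.
    candidates = sorted(set(x))[1:-1]
    best = None
    for knot in candidates:
        rows = [[1.0, hinge(v, knot), hinge(v, knot, -1)] for v in x]
        coef, rss = ols_rss(rows, y)
        if best is None or rss < best[1]:
            best = (knot, rss, coef)
    return best


# Toy data with a true bend at x = 3.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.0 + 2.0 * hinge(v, 3.0) + 1.0 * hinge(v, 3.0, -1) for v in x]
knot, rss, coef = best_knot(x, y)
```

On this noise-free toy data the search recovers the bend at x = 3. On real data the winning knot is simply the one with the largest squared-error reduction, which is why the backward pass and a holdout set still matter.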


What I like about MARS:

  • Ease of Fit: Two factors drive how easy MARS models are to fit: variable selection and hinge functions. A while back I was faced with a task where I needed to fit about 120 models (all different dependent variables) in two weeks. Due to the power of the MARS algorithm in variable selection and non-linearity detection, I was able to create these models quite easily without a lot of additional data preparation or a priori knowledge.  I still tested, validated, and pulled additional information from each model; however, the initial model build was highly optimized.
  • Ease of Understanding: Because the basic fit (once you get past hinge functions) is OLS, most data scientists can easily understand the coefficient fitting process.  Also, even if your final model will involve a different method (simple linear regression, for instance), MARS can provide a powerful initial understanding of function shapes, from which you may decide to use related transformations (quadratic, log) in your final model form.
  • Hinge Optimization: One question I often receive from business users takes the form "what is the value at which x maximizes its value with y."  In many of these cases, depending on data form, that can be calculated directly by determining the hinge point from a MARS output, much like finding a local maximum or using another calculus-based optimization strategy.
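
As a rough illustration of the hinge optimization point: a single-variable MARS fit is piecewise linear, so over any interval its maximum has to sit at a knot or at one of the interval's endpoints. The sketch below just evaluates those candidates. The intercept, coefficients, and knots are hypothetical numbers, not taken from any real model, and `predict` and `argmax_on_interval` are illustrative helper names.

```python
def predict(x, intercept, terms):
    """Evaluate intercept + sum(coef * max(0, sign * (x - knot)))."""
    return intercept + sum(
        coef * max(0.0, sign * (x - knot)) for coef, knot, sign in terms
    )


def argmax_on_interval(intercept, terms, lo, hi):
    """Return the x in [lo, hi] where the piecewise-linear fit peaks."""
    # Only the endpoints and the knots inside the interval can be maxima.
    candidates = [lo, hi] + [knot for _, knot, _ in terms if lo < knot < hi]
    return max(candidates, key=lambda x: predict(x, intercept, terms))


# Hypothetical fitted terms as (coefficient, knot, sign):
#   y = 10 - 2.0 * max(0, 4 - x) - 0.5 * max(0, x - 4)
intercept = 10.0
terms = [(-2.0, 4.0, -1), (-0.5, 4.0, 1)]

best_x = argmax_on_interval(intercept, terms, lo=0.0, hi=10.0)
best_y = predict(best_x, intercept, terms)
```

Here the curve rises until the knot at x = 4 and falls afterward, so the knot is the answer to the business question. This shortcut only holds for the additive, single-variable case; interaction terms need a proper search.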


What to watch out for:

  • Can be Overfit: Some people get overly confident in the internal regularization of MARS and forget that normal data science procedures are still necessary.  Especially in high-dimensional and highly orthogonal space, MARS regression can create a badly overfit model. Point being: ALWAYS USE A HOLDOUT TEST/VALIDATION SET. I have seen more of these types of models overfit in the past year than all other algorithms combined.
  • Hinge Functions can be Intimidating: Right now, if I went to a business user (or other data scientist) and said that a coefficient in an elasticity equation was 0.8, we would have an easy shared understanding of what that meant.  However, if I give that same business user a set of three hinge functions, that's more difficult to understand.  I recommend always using the "plotmo" package in R to show business users partial dependence plots when building MARS models. This provides a simple and straightforward way to describe piecewise-linear relationships.
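
The holdout advice can be sketched generically in Python. Nothing here is MARS-specific: the "model" is a deliberately overfit stand-in that memorizes the training points and predicts the nearest one, and the data are simulated. The helper names `rmse` and `train_holdout_split` are illustrative. The point is only the comparison of training error against holdout error.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def train_holdout_split(n, holdout_fraction=0.25, seed=0):
    """Shuffle row indices reproducibly and split into train and holdout."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - holdout_fraction)))
    return idx[:cut], idx[cut:]


# Simulated data: a linear trend plus Gaussian noise.
rng = random.Random(1)
x = [float(i) for i in range(40)]
y = [2.0 * v + rng.gauss(0.0, 3.0) for v in x]

train_idx, hold_idx = train_holdout_split(len(x))

# Overfit stand-in: memorize training points, predict the nearest one's y.
train_pairs = [(x[i], y[i]) for i in train_idx]


def predict(v):
    return min(train_pairs, key=lambda pair: abs(pair[0] - v))[1]


train_rmse = rmse([y[i] for i in train_idx], [predict(x[i]) for i in train_idx])
holdout_rmse = rmse([y[i] for i in hold_idx], [predict(x[i]) for i in hold_idx])
```

The memorizing model scores a perfect zero error on the rows it was trained on and a clearly worse error on the held-out rows. A large gap like that is the signal to look for in a MARS model before trusting it.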


And finally, a quick example from real world data.  The Kansas education data set I've used before on this blog can be modeled using a MARS algorithm.  In this case I pretended I wanted to understand the relationship between FTE (the size of the school) and spending per pupil.  From an economics perspective, very small schools should have higher costs because they lack economies of scale.  I created a model in R, including a few known covariates for good measure.  Here's what the output with hinge functions looks like:

That's all a bit difficult to read. What if we use a partial dependence plot to describe the line fit to the FTE-to-spending relationship?  Here's what that looks like:

The green dots represent data points, and the black line represents the line fit to the data by the MARS regression.  The extreme left side of the graph looks appropriate, fitting an economy of scale curve, and the flat right side of the graph appears to be an appropriate flat line.  The "dip" between the two curves is concerning and calls for further analysis. (On further analysis, this appears to be a case of omitted variable bias: that category of districts contains many low-cost-of-living mid-rural districts, whereas larger districts tend to be in higher-cost areas, so prices (e.g. teacher wages) are higher.)