Sunday, December 3, 2017

Twitter Data Related to the Passing of the Senate Tax Bill

Over the past few weeks I have been heavily analyzing Twitter data: looking for methods to find mass-blocking patterns (I've tweeted about this extensively at @leviabx and may write that analysis up in the future).  A few nights ago I saw some users on Twitter asking people to save tweets from Senators leading into the passage of the controversial tax legislation. I thought: hey, that's actually super easy.

I downloaded that data and plan on making it available here on the blog for other analysts to look at.  If this is popular, I will consider posting more generated datasets to this site. I also think this could give some readers of this blog a taste of what social media data looks like, and what it's like to work with it.


METHOD

I used the Twitter API to pull down the most recent 500 tweets for each current sitting US Senator on Saturday night, December 2nd 2017.  For this I used a list of Twitter handles.  A couple of notes:
  1. I found this list on the internet, and made the obvious changes, so if you find any errors please let me know.
  2. Senators often have multiple Twitter handles, so I'm hoping I have the right one for this type of policy discussion.
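For those curious about the mechanics, I'm not reproducing my exact script, but a minimal sketch of this kind of pull using the rtweet package looks something like the following (the handle vector and output file name are stand-ins, and exact field names vary by package version):

    # Sketch: pull the most recent 500 tweets per Senate account via the Twitter API.
    # Assumes a registered Twitter app and rtweet handling authentication.
    library(rtweet)

    senator_handles <- c("SenatorA", "SenatorB")   # stand-ins for the full list of handles

    tweets <- do.call(rbind, lapply(senator_handles, function(h) {
      get_timeline(h, n = 500)                     # most recent 500 tweets for this account
    }))

    # Flatten and save for later analysis / publishing
    write_as_csv(tweets, "senator_tweets.csv")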

If you have any issues, you can either comment on this blog (comments are open) or hit me up directly on Twitter at @leviabx.

DATA

The data I pulled down is a list of tweets found here.  The data was pulled on December 2nd, so it includes tweets from immediately after the tax bill's passage. A few notes on (some of) the data fields:
  1. text: This is a cleaned version of the text from the original tweet.
  2. original_text: this is the original text from the tweet.
  3. created_at: the UTC timestamp of when the tweet was sent.  UTC is a standardized time zone that runs hours ahead of US Eastern time; subtract 5 hours to convert it to DC time.
  4. emotions (anger, anticipation, positive, negative, etc.): for the user's convenience I ran the tweet text through a sentiment algorithm based on Plutchik's 8 emotions (a scoring sketch appears below the word cloud).
  5. tax: a TRUE/FALSE indicator of whether "tax" was mentioned in the tweet.
  6. tl: this is a link to view the tweet directly in the browser (just copypasta it into your browser). I wrote this piece of code 3 years ago and have no clue what I meant by "tl".
  7. screen_name: the screen name of the senator who sent the tweet.
  8. geo fields: There are a ton of geolocation fields in Twitter data; these can mostly be ignored, as they are only filled in when the user opts in.
  9. retweet/fav counts: the number of times an individual tweet was retweeted or favorited.
Word cloud of tax-related tweets from Senators during the week of tax reform.
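For the emotion fields, the NRC lexicon behind Plutchik's 8 emotions is available in R through the syuzhet package. This isn't necessarily the exact code I ran, but a minimal scoring sketch (column names assumed to match the file above) looks like this:

    # Sketch: score tweet text against the NRC emotion lexicon (Plutchik's 8 emotions
    # plus positive/negative), then flag tweets that mention "tax".
    library(syuzhet)

    tweets <- read.csv("senator_tweets.csv", stringsAsFactors = FALSE)

    emotions <- get_nrc_sentiment(tweets$text)   # anger, anticipation, ..., negative, positive
    tweets   <- cbind(tweets, emotions)

    tweets$tax <- grepl("tax", tweets$text, ignore.case = TRUE)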


USE CASE

Playing with this data can be interesting and somewhat fun.  Here are some things you can do with it, from least to most technical:
  1. Find your Senators and see what they Tweeted this week.
  2. Sort the spreadsheet by "created_at" and follow the tweets along the timeline leading up to and after the bill's passage.
  3. Find tweets you like/dislike (search, emotion, names), then use the "tl" field to go to the Tweet directly and react.
  4. By sorting the emotion fields, find the Senators who were the most happy (trust, joy) versus least happy (anger, disgust) about the bill.
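If you'd rather do that last one in R than in a spreadsheet, here is a quick sketch, assuming you've downloaded the published file (the file name here is a stand-in) and that the emotion columns are named as described above:

    # Sketch: rank tax-related tweets by emotion score in the published data set.
    tweets     <- read.csv("senate_tax_tweets.csv", stringsAsFactors = FALSE)  # stand-in name
    tax_tweets <- tweets[tweets$tax == TRUE, ]

    # Angriest and happiest tax tweets, by a simple sort on the emotion scores
    angriest <- tax_tweets[order(-tax_tweets$anger), c("screen_name", "text", "anger")]
    happiest <- tax_tweets[order(-tax_tweets$joy),   c("screen_name", "text", "joy")]

    head(angriest, 5)
    head(happiest, 5)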


EXAMPLES

I'm not going to work on this dataset extensively, but I did pull together the happiest and angriest tweets regarding tax reform:

First angriest:







Now happiest:






Friday, November 24, 2017

Python versus R: A Porting Project

I have been programming in R for over a decade, and during that time, especially more recently, I have built robust pipelines to create large numbers of machine learning and statistical classification models at a time.  The purpose of these pipelines is to evaluate multiple model types against a single dependent variable (usually in a high-dimensional space), quickly determine which works best, and automatically move on to the next variable to be modeled.  Like many data science projects, the pipelines include five steps:

  1. Data Cleaning,
  2. Feature Selection,
  3. Running Models,
  4. Model Evaluation, and
  5. Report Production (create a PDF for review by business owners, if they so choose).
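My production pipelines are far too large to reproduce here, but a toy, runnable version of the five steps (using base R and the built-in mtcars data purely for illustration) gives a feel for the structure:

    # Miniature illustration of the five-step pipeline; the real version evaluates
    # many model types per dependent variable and writes a PDF report.
    run_pipeline <- function(dat, dv) {
      # 1. Data cleaning: drop rows with missing values
      dat <- na.omit(dat)

      # 2. Feature selection: keep predictors reasonably correlated with the DV
      preds <- setdiff(names(dat), dv)
      keep  <- preds[abs(sapply(dat[preds], cor, y = dat[[dv]])) > 0.3]

      # 3. Running models: two candidate formulas fit with lm() stand in for many model types
      f_full  <- reformulate(keep, response = dv)
      f_small <- reformulate(keep[seq_len(min(3, length(keep)))], response = dv)
      models  <- list(full = lm(f_full, dat), small = lm(f_small, dat))

      # 4. Model evaluation: keep the lower-AIC model (a holdout metric in the real pipeline)
      best <- models[[which.min(sapply(models, AIC))]]

      # 5. Report production: printing the summary stands in for the PDF step
      print(summary(best))
      invisible(best)
    }

    run_pipeline(mtcars, "mpg")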

I can write all five of these steps easily in R, and haven't really had problems with this type of modeling. But I also know Python, which has similar machine learning and analytical packages and has been referred to as the "future of data science."  I had used these packages before (pandas, numpy, sklearn), but I most often use Python for non-modeling tasks or to access frameworks like Spark.

Two weeks ago, while testing one of my pipelines, I had the idea to port my primary model building pipeline into Python.  The reasons for this were two-fold:
  1. To test and make use of Python's different data science methods and packages
  2. To make use of Python's flexibility as a programming language (as well as its status as a "real" all-purpose language).
The coding was more difficult at first as I figured out some details of pandas data frames and numpy arrays.  I'm mostly finished now, just fixing small breaks as I find them.  Generally the models created are nearly identical in quality, with Python maybe showing a slight edge.  Other than that, my initial thoughts:

Python versus R

  1. Categorical handling: If I have a categorical variable in an R data frame and I want to pass it to an R model, I can pass the variable directly to the algorithm, and R efficiently creates the numerical encoding on the fly without user intervention.  Python, however, generally requires a preprocessing step to map the categorical into per-level binary columns.  There are drawbacks to both methods (see the short R illustration after this list):
    1. R is like an "automatic transmission": it is less work for the user and keeps the data frame in memory easier to manipulate.  On the other hand, some R methods force all levels of a categorical variable (minus one) into the algorithm, when an optimal model would sometimes feature-select down to far fewer (some models handle this, some don't).
    2. Python is more of a "manual transmission" situation, where the user has to intervene and decide on a categorical encoding strategy (e.g. pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder()). This means more work for the user and much larger data frames, but it allows more control over feature selection (in some algorithms) at run time. (This is actually a problem I've seen in R for quite some time; by being less developed in this space, Python has "solved" it.)
  2. Different algorithms:  Python is not a primary language for statisticians and research data scientists (Python is new to the game), which leaves it a bit behind the curve in algorithm availability.  One example of a missing method is the shrunken centroids model, which I had found useful for a few specific types of classification.
  3. Some models run faster: When I run a model in R versus Python I get similar results within tolerance, except that the Python models tend to train on my hardware much faster.  As a test I ran XGBoost in both systems.  The models were substantially similar (AUC = .713 vs AUC = .716), but the Python version finished in 3 seconds versus 32 seconds in R.  Both were still under a minute, and this may not seem substantial; however, inside an analytics pipeline where you may be building a few thousand models, the timing difference multiplies into something substantial.
  4. More consistency between models: R is a bit of the "wild west" in terms of consistency, both in model parameters and in model object outputs.  For direct comparison of models (or to run different model types under similar parameters) one often has to rely on third-party packages like "caret" or "broom."  This makes R's advantage in packages and model types less than ideal, in that traversing those model types is not straightforward.  In Python's sklearn I can generally count on classification estimators of similar types to give me similar output objects and methods.
  5. Some things don't work at all: I've had more issues in Python with certain functions not working *out of the box* as stated in the documentation; many of these seem to be fixed in later bug-fix releases.  I *think* this is likely because sklearn is still very much a package under active development.
  6. Plotting: To be honest, I'm still figuring this one out.  Matplotlib appears to be the preferred plotting strategy in Python (though there is a Python version of ggplot), but honestly rewriting all my diagnostic plotting routines (and getting labels, titles, axes, and legends correct) has been one of the biggest pains in this entire process.  It's difficult to determine whether Python is actually more difficult, or if it's just painful because I've spent several years developing my own plots in R.
  7. Object orientation: Python has a bit more straightforward syntax as a programming language, and my code for the Python pipeline is more object oriented and, quite honestly, better written than what I have in R.  That said, the whitespace and syntax requirements in Python took some getting used to versus my "I do what I want" attitude when coding in R.
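To make point 1 concrete, here is a small, self-contained R illustration of the "automatic" factor handling versus building the dummy columns yourself (the toy data frame is invented for the example; the explicit model.matrix() step is roughly what pandas.get_dummies() forces on the Python side):

    # "Automatic transmission": a factor can be passed straight into the model;
    # R expands it to dummy columns internally at fit time.
    df <- data.frame(
      y     = c(10, 12, 9, 14, 11, 15),
      color = factor(c("red", "blue", "red", "green", "blue", "green"))
    )
    fit <- lm(y ~ color, data = df)

    # "Manual transmission": build the design matrix explicitly yourself.
    X <- model.matrix(~ color, data = df)
    head(X)   # intercept plus k-1 dummy columns, all levels (minus one) forced in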


Overall, both platforms have advantages and disadvantages.  My takeaways are these:
  • R is likely better (in the short-term at least) for data exploration and manual or "academic" model builds due to relative ease of coding and availability of models and methods.
  • Python may be better for large-scale model builds where speed and consistency between models are necessary (and also if you have an aversion to hearing the term "tidy").

Sunday, September 10, 2017

So You Want To Be A Data Scientist

Quite often, someone I know asks me a question that I don't have a great answer for:

How would I go about becoming a data scientist?

This is always a tough place to start a conversation, especially if data science is not a great fit for the individual I'm talking to, but there are generally two types of people who ask me this question:

  • Young professionals:  I get the joy of working with quite a few interns and "first jobbers," who, BTW, generally give me reason to be hopeful about the future of America. (Ironically, most of them aren't Americans, but whatever...) Most of these people are in computer science or some kind of analytical program and want to know what they should do to become a real "data scientist."
  • People my age: I also get this question from people in their mid-30s, many of whom have limited relevant educational background.  For certain mid-career professionals this could be a great option, especially if they have both computer science and math in their background, but this often isn't the case.  They seem to be drawn to data science because they've seen the paycheck, or because it just sounds mysterious and sexy.  Often these people say "I love data, I'd be great at data science" (though this claim is somewhat dubious, and by it they often mean that they like USA Today infographics).

I'm writing this blog post as a place to point both of these groups, in order to give a fair, full-breadth look at the skills I would expect from data scientists.  I break these skills down into three general areas (with some bonus material at the end):

  • Math Skills
  • Computer Science Skills
  • Business Skills
Fictionalized Portrayal of a Data Scientist (that looks somewhat like my work setup)


MATH SKILLS


Math is the language of data science, and it's pretty difficult to make it 10 minutes in my day without some form of higher math coming into play.  Point being: if you struggle with and/or dislike math, this isn't the career for you.  And if the highest-level math you've taken is college algebra, you're also in trouble.  Knowledge of algebra is an absolute assumption in data science, and most of the real work is done with material from higher-order math classes.  I would consider four types of math requirements:

  • Calculus (differential + integral):  I use calculus daily in my job, whether calculating equilibria, optimization points, or instantaneous rates of change.  Three semesters of college-level calculus is a must for data scientists.
  • Matrix/Linear Algebra: The algorithms that we use to extract information from large data sets are written in the language of matrix and vector algebra.  There are many reasons for this, but chief among them is that it allows data scientists to write large-scale computations very quickly without having to manually code thousands of individual operations (see the small example after this list).
  • Differential Equations: This is an extension of calculus, but is extremely helpful in calculating complex variable interactions and change-based relationships.
  • Statistics:  Don't just take the stats class that is offered as part of your major, which tends to be a bare-necessities look.  Take something that focuses on the mathematics underlying statistics; I suggest a stats class at your university that requires calculus as a prerequisite.
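As a small illustration of why the matrix formulation matters in practice, here is the same computation written as one matrix product versus tens of thousands of hand-coded operations in R:

    # One matrix product versus manually looping over 50,000 multiply-adds.
    set.seed(1)
    X <- matrix(rnorm(1000 * 50), nrow = 1000)   # 1,000 observations, 50 features
    b <- rnorm(50)                               # coefficient vector

    yhat_matrix <- X %*% b                       # matrix algebra: one optimized line

    yhat_loop <- numeric(1000)                   # the equivalent element-by-element version
    for (i in 1:1000) {
      for (j in 1:50) {
        yhat_loop[i] <- yhat_loop[i] + X[i, j] * b[j]
      }
    }

    all.equal(as.vector(yhat_matrix), yhat_loop) # TRUE, but the loop is far more work to write and run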

If the math behind these classes is intimidating to you, then data science is likely not a great option.

COMPUTER SCIENCE SKILLS


Here's the guideline I give young data scientists: the correct level of computer science skill is such that you could get a job as a mid-level developer (or DBA) at a major company.  This may seem like a weird metric, but it plays into the multi-faceted role of data scientists: we design new algorithms and process data, which involves designing the programs that analyze that data.  Being able to write your analyses as dynamic programs allows for automated analysis and model builds that take minutes rather than weeks. Here are some courses/skills to pick up before becoming a data scientist:

  • Introduction to Programming: Simply knowing how computer programming works, plus the keys to functional and object-oriented programming.
  • Introduction to Database Theory: Most of the data we access is stored or housed in some kind of database.  In fact, Hadoop is just a different type of database, but it's good to start with the basic elementals.  As part of this course, it's vital to learn the basics of SQL, which is still (despite claims and attempts to the contrary) the primary language of data manipulation for business.
  • Python: Python is becoming the language of data science, and it is also a great utility language, with available packages and add-ons for most computing purposes.  It's good to have a utility language in your toolkit, as many data wrangling and automation tasks don't exclusively require the tools of data manipulation (e.g., audio-to-text conversion).
  • R: R is my primary computing language, though I work in Python and SQL in equal proportions these days (and sometimes SAS). R has extensive statistical and data science computing packages, so it's a great language to know.  The question I get most often is: should I learn R, Python, or SAS?  My answer: have a functional understanding and ability to write code in all three, and be highly proficient in at least one.


BUSINESS SKILLS


When asking about business skills, the question I most often receive is: should I get an MBA?  In a word, no.  But it is helpful to understand business concepts and goals, especially so you can understand and explain concepts to coworkers fluently.  You don't have to go deep into business theory, but a few courses are helpful:

  • Accounting: Often data scientists are asked to look at accounting data in order to create financial analyses, or to merge financial data with other interesting areas of a business.  Understanding the basics of what accounting data means, common accounting strategies, and how data is entered into financial systems can be helpful.
  • Marketing: Much of the use of data science over the past five years has dealt with targeted marketing, both online and through other channels.  Understanding the basics of targeted marketing, the meaning of lift, acquisition versus retention, and the financials underlying these concepts is also helpful.
  • Micro-Econ: Though technically an economics class, knowing the basics of micro theory allows you to analyze a business more holistically.  Relevant analyses include demand and pricing elasticity, market saturation modeling, and consumer preference models.  It also helps with personally valuable analysis, like evaluating the viability of a start-up you might be thinking about joining.
Supply-demand relationships are relevant to many data science business applications.


OTHER

Though the above skills are necessities for data science, there are a few "honorable mention" classes that are helpful:

  • Social Sciences: When modeling aggregate consumer behavior, it's important to understand why people do the things they do.  Social sciences are designed to analyze this; I recommend classes in economics, political science (political behavior or institutional classes), and behavioral psychology.
  • Econometrics: Econometrics is a blending of economics and statistical modeling, but the focus on time-series and panel analysis is especially helpful in solving certain business problems.  
  • Communication: One of the most common complaints I hear about data scientists is "yeah, _____'s smart, but can't talk to people."  A business communication class can help remedy this before it becomes a serious issue.


CONCLUSION

There are many options, as the road to data science is not fixed, but this road map gives you the skills you will need to be a modern data scientist.  People who want to become data scientists should focus on three major skill sets: math, computer science, and business.  Some may notice that I omitted artificial intelligence and machine learning, but the statistics, math, and computer science courses on this list more than give you a head start on those skills.


Friday, March 17, 2017

College Basketball Analysis: Big 12 Home Court Advantages


A few weeks ago my grad school alma mater, the University of Kansas (KU), won its thirteenth consecutive Big 12 conference championship (I wasn't watching the game; I have better things to do).  Much has been made of how large an outlier this streak is: if performance were random, the odds would be about 1 in a trillion to win thirteen straight (not hyperbole, actual probability).

Along with this streak, there have also been accusations that the University of Kansas receives preferential treatment in Big 12 basketball, has an unfair home advantage, or outright cheats to win. The home-court advantage is genuinely staggering: KU is 75-3 in conference home games over the past nine years, about a 96% win rate.

Half joking, I shot off a quick tweet commenting on both the conference win streak and the accusations.  People quickly reacted: KU fans called me names, while Kansas State University fans generally agreed, though they were more willing to charge KU with cheating. The accusations and arguments raise an interesting question: do certain teams have statistically different home-court advantages, and is the University of Kansas one of those teams?

METHODOLOGICAL PREMISE


The main issue in calculating home-win bias is that different teams perform better or worse over time, so we can't just look at simple home win rates over a series of years.  We need a robust methodology to set expectations for home win percentages and compare those expectations to actual performance.  As such, I devised a method to set expectations based on road wins, and applied that information to each team for analysis.

The underlying premise of this analysis is to look at the ratio of home win percent to road win percent over time, calculate the advantage of playing at home for each team, and see how that advantage differs from other teams over multiple seasons.  In detail:

  • The theory here is that some "home advantages" (KU's, specifically) are higher than others, either due to natural advantages, outright cheating, or bush-league behaviors.
  • In order to test whether home advantages differ, we need a methodology that controls for the quality of a team independent of home performance, and computes home performance relative to that baseline. Enter predicting home wins using road wins.
  • In aggregate, we would expect to be able to predict a team's home win percentage by looking at their road win percentage, as better teams should perform better in both venues.  If a team has a systematic advantage on their home court, we would expect their home win percentage to over-perform the predictive model developed from road wins.
  • I build a predictive model of home wins based on road wins for each team-year.  The models are developed for each Big 12 school as a hold-out model, to remove each school's self-bias from the numbers.  I then score the held-out school with its model, calculate the residuals on the hold-out, and move to the next school (a stripped-down sketch of this step follows below the list).
  • The residuals here represent a Wins Above Expectation metric. We can do two things with the residual data: 
    1. Calculate the mean residual and distribution over time which indicates the overall home bias of the school (which schools systematically over-perform at home) 
    2. Determine the best and worst performances at home for individual schools.
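I'm not publishing the full scraping and modeling code, but a stripped-down sketch of the hold-one-school-out step looks like the following. It assumes a data frame with one row per school-season and columns school, home_win_pct, and road_win_pct (names invented for illustration); the log-log specification and the conversion from percentage points to wins are omitted:

    # Hold-one-school-out: fit home ~ road win percentage on every other school,
    # predict the held-out school, and keep the residuals as its
    # wins-above-expectation style metric.
    wins_above_expectation <- function(seasons) {
      seasons$resid <- NA_real_
      for (s in unique(seasons$school)) {
        train <- seasons[seasons$school != s, ]            # every school except this one
        test  <- seasons[seasons$school == s, ]            # the held-out school
        fit   <- lm(home_win_pct ~ road_win_pct, data = train)
        seasons$resid[seasons$school == s] <- test$home_win_pct - predict(fit, newdata = test)
      }
      seasons
    }

    # Mean residual per school = systematic home-court boost:
    # aggregate(resid ~ school, data = wins_above_expectation(big12_seasons), FUN = mean)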
The initial models performed well, and show that road wins are fairly predictive of home wins, with a .52 R-squared value (variance accounted for) and a 0.4 elasticity in the log-log specification of the model. 


INITIAL DATA


Starting with a visual inspection of the data, we can get an idea of the relationships between teams, home and away games, and seasons.  First, a data point: teams perform far better at home (65.6% average win rate) than on the road (34.4%), winning nearly twice as often on their home court as on the road. But let's go back to our initial question: does KU win more often than other Big 12 schools at home?  The answer here is yes.


KU outperforms all schools, with the closest neighbor being Missouri (who has a limited sample, as they left the conference a few years ago).  We then see a cluster of schools with home win percentages around 70%, and a few bad schools at the end of the distribution (TCU, notably).  This indicates that KU is an outlier in terms of home performance, but is that because KU is a much better team, or is it indicative of other issues?

Road win percentage helps us answer this question: KU is the best road team in the conference, by a large margin.  Kansas is in fact the only team in the conference with a winning road record over the past decade, winning close to 75% of its games on the road.  Even consistent KU rival and NCAA tournament qualifier Iowa State regularly wins fewer than 40% of its road games.



We know that KU wins a lot of games at home and on the road, but is there a way to determine if their home wins exceed a logical expectation?  Before moving on to the modeling that can answer our question, we should confirm an underlying assumption: that road wins and home wins correlate with one another:



The chart and a basic model provide some basic answers:

  • Road wins are highly correlated to home wins at a correlation coefficient of .68.
  • Few teams (3%) finish a season with more road wins than home wins.


MODEL RESULTS


With the initial knowledge that KU performs well both at home and on the road, we can start our model-building process. If you're interested in the detailed model, look at the methodology section above.

Using the model to calculate how teams perform relative to peers in terms of home and road wins, I calculated the average home-court boost, or the number of wins above road-based expectations, shown below:




Oklahoma State has the largest home-court advantage in the conference, followed by Iowa State, Kansas, and Oklahoma.  Each of these schools receives about one full extra win per season over expectations due to its home-court advantage. TCU has the worst home performance, followed by West Virginia and Baylor.

A further interesting (and nerdy) way to view the data is a boxplot for each school representing the last ten years of wins-over-expectation performance.  This shows that some schools, like Kansas and Iowa State, have fairly tight distributions, representing consistent performance above road expectation.  Other schools, like Kansas State and Baylor, have wide distributions, representing home performance that is inconsistent relative to road expectations.
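For the curious, the boxplot itself is a one-liner once the per-season residuals exist (continuing the hypothetical data frame and column names from the methodology sketch above):

    # Distribution of per-season wins-above-expectation residuals for each school.
    scored <- wins_above_expectation(big12_seasons)    # big12_seasons is the hypothetical data frame
    boxplot(resid ~ school, data = scored, las = 2,
            ylab = "Home performance above road-based expectation")
    abline(h = 0, lty = 2)   # zero = performing exactly as the road record predicts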



Using the same scoring method we can score individual year performances, and determine which teams have the best and worst home versus road years.


Most interesting here is that K-State's home-court advantage was pretty amazing over the 2014 and 2015 seasons.  During those years, Kansas State was 15-3 at home and 3-15 on the road.  At that time at least, it appears Kansas State's Octagon of Doom (I don't remember what it's really called, even though it's where I received my Bachelor's degree) was a far greater advantage than KU's Allen Field House.

CONCLUSION

From the models developed we can reach several conclusions about the types of home advantages held by Big 12 teams:

  • The home advantage for the University of Kansas at Allen Field House is high (about +1 game a year) but in line with several other top-tier Big 12 teams.  This doesn't necessarily fit the story-line that KU cheats at home, but it doesn't rule out other theories given by Kansas State fans: that KU cheats or gets unfair preferential treatment both at home AND on the road.
  • The top home advantages in the Big 12 belong to Oklahoma State, Iowa State, Kansas, and Oklahoma.  In fact, both Oklahoma State ("Madison Square Garden of the Plains"... seriously?) and Iowa State ("Hilton Magic"...) hold moderately larger home advantages than Kansas at Allen Field House.
  • The worst home advantages in the Big 12 are: TCU, West Virginia, and Baylor.
  • Some individual team-years show volatile performance, specifically Kansas State through 2014 - 2015.



Tuesday, February 7, 2017

Data Science Trolling at the Airport: Using R to Look Like the Matrix

I've been quite busy at work lately, but I am working on some new, serious posts (one on support vector machines, another on Trump's claims of voter fraud).  For today though, just a fun post about trolling people who like to stare at programmers in public.

Occasionally, I am stranded somewhere in public (like an airport or a conference) when I need to do some serious coding (side note: one large company has a significant piece of its software architecture running in production that I wrote on a bench in Las Vegas).  In these situations I open my laptop and start coding, but often notice that people are staring at my computer screen. It doesn't bother me that much, but over time it does become a bit annoying.  A few theories for the stares:

  • It's novel: They've never seen a programmer at work before.
  • It's me specifically: I tend to fidget, and be generally annoying when coding.
  • It's evil: When normal people see programmers writing code in movies, the programmers are always doing something fun and exciting, or evil. They're hackers.




A Solution:

Anyway, I eventually get annoyed with people staring at me and, based on my third bullet point, decide to give them a show and troll them just a bit.  I wrote this very short piece of code a few years ago in an airport; it's for R and actually quite simple:




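(The snippet below is a reconstruction along the lines described next, not necessarily character-for-character what I originally wrote.)

    # Print roughly 70,000,000 random numbers to the console, movie-hacker style.
    # Sys.sleep() just paces the scrolling; tune it to your machine.
    for (i in 1:70000) {
      x <- round(runif(1000), 4)   # a batch of rounded random numbers per pass
      cat(x, "\n")
      Sys.sleep(0.01)
    }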
It uses a loop, runif(), and some rounding to create 70,000,000 random numbers and print them to the console.  It basically makes your computer screen look like something from a hacker or Matrix movie.  The Sys.sleep() call is something to parameterize based on your system settings, but its point is to make the console animate as though you are in a movie.  Here's a video of the code running:




Recommendations for maximum silliness: 

  • When you start the code, maximize the command line portion of your computer.
  • Make sure you're set up with a black background and white or green text, for maximum appearance of evil.
  • For bonus points try to give the appearance that you are really up to something:
    • Wringing of hands or nervous fingers helps with this appearance.
    • Rocking back and forth slightly while staring intently at the screen.
    • Mumbling under your breath things like "it's working, it's working!" or "almost in... almost in."






Wednesday, January 18, 2017

Data Science Method: MARS Regression

People often ask which data science methods I use most often on the job or in exploring data in my free time.  This is the beginning of a series in which I describe some of those methods, and how they are used to explore, model and extrapolate large data sets.

Today I will cover MARS regression (Multivariate Adaptive Regression Splines), a regression methodology that automates variable selection, detects interactions, and accounts for non-linearities.  This methodology has at times become my hammer (from the saying: when you have a hammer in your hand, sometimes everything looks like a nail) due to its usefulness, ease of implementation, and accurate predictive capabilities.

The MARS algorithm was introduced in 1991 by Jerome Friedman, and I suggest reading his original article for a full understanding of it.  BTW, because the term MARS is proprietary, the package in many statistical programs (including R) is called "earth."  Essentially, though, the algorithm boils down to this:

  1. The Basics: The basic mechanics of MARS involve simple linear regression using the ordinary least squares (OLS) method, but with a few twists.
  2. Variable Selection: MARS self-selects variables, first using a forward stepwise method (a greedy algorithm that adds the variables with the highest squared-error reduction), followed by a backward (in this case, truly backing-out) pass to remove over-fit coefficients from the model.
  3. Non-Linearity: MARS uses multiple "splines" or hinge functions inside of OLS to account for potentially non-linear data.  Piecewise linear regression is a rough analog to the hinge functions, except that in MARS the locations of the hinges are auto-detected through multiple iterations. That is to say, through the stepwise process the algorithm iteratively tries different break-points in the linearity of the model, and selects any breakpoints that fit the data well.  (Side note: sometimes when describing these models to non-data scientists, I refer to the hinges humorously as "bendies."  It goes over much better than "splines" or "hinges.")
  4. Regularization: The regularization strategy for MARS models uses a Generalized Cross Validation (GCV) complexity-versus-accuracy tradeoff during the backward pass of the model.  GCV involves a user-set "penalty factor," so there is room for some tuning if you run into overfit issues.  Because dynamic hinge functions give MARS the flexibility to conform to complex functions (intuitively, they eat degrees of freedom as more effective terms enter the equation), they increase the probability of overfitting.  As such, it is very important to pay attention to regularization (a minimal fitting example follows below).

Each hinge function takes the form max(0, x - c), or its mirror max(0, c - x), allowing the regression splines to adapt to the data across the x axis.
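To make the pieces above concrete, here is a minimal fit with the earth package on a built-in R data set (the data and formula are placeholders; degree maps to the interaction detection and penalty to the GCV penalty factor discussed above):

    # Minimal MARS fit via the earth package: the forward pass adds hinge terms,
    # the backward pass prunes them using GCV.
    library(earth)

    fit <- earth(Volume ~ Girth + Height, data = trees,
                 degree  = 2,   # allow two-way interactions between hinge terms
                 penalty = 3)   # GCV penalty per knot; raise it if you see overfitting

    summary(fit)   # coefficients on hinge terms of the form h(x - c)
    evimp(fit)     # variable importance from the forward/backward passes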

ADVANTAGES

  • Ease of Fit: Two factors make MARS models easy to fit: variable selection and hinge functions. A while back I was faced with a task where I needed to fit about 120 models (all with different dependent variables) in two weeks. Due to the power of the MARS algorithm in variable selection and non-linearity detection, I was able to create these models quite easily without a lot of additional data preparation or a priori knowledge.  I still tested, validated, and pulled additional information from each model, but the initial model build was highly streamlined.
  • Ease of Understanding: Because the basic fit (once you get past hinge functions) is OLS, most data scientists can easily understand the coefficient-fitting process.  Also, even if your final model will use a different method (simple linear regression, for instance), MARS can provide a powerful initial understanding of function shapes, from which you may decide to use a related transformation (quadratic, log) in your final model form.
  • Hinge Optimization: One question I often receive from business users takes the form "what is the value of x at which y is maximized?"  In many of these cases, depending on the form of the data, that can be calculated directly by determining the hinge point from the MARS output, much like a local maximum or other calculus-based optimization strategy.

DISADVANTAGES

  • Can be Overfit: Some people get overly confident in the internal regularization of MARS and forget that normal data science procedures are still necessary.  Especially in high-dimensional and highly orthogonal space, MARS regression will create a badly overfit model. Point being: ALWAYS USE A HOLDOUT TEST/VALIDATION SET. I have seen more of these models overfit in the past year than all other algorithms combined.
  • Hinge Functions Can Be Intimidating: Right now, if I went to a business user (or another data scientist) and said that the coefficient in an elasticity equation was 0.8, we would have an easy shared understanding of what that meant.  However, if I give that same business user a set of three hinge functions, that's more difficult to understand.  I recommend always using the "plotmo" package in R to show business users partial dependence plots when building MARS models. This provides a simple and straightforward way to describe the fitted relationships.

AN EXAMPLE

And finally, a quick example from real-world data.  The Kansas education data set I've used before on this blog can be modeled using the MARS algorithm.  In this case I pretended I wanted to understand the relationship between FTE (the size of the school district) and spending per pupil.  From an economics perspective, very small districts should have higher costs due to lacking economies of scale.  I created a model in R, including a few known covariates for good measure.  Here's a sketch of the fit and what the output with hinge functions looks like:
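(File and column names below are placeholders rather than the actual fields in the Kansas data set.)

    # Sketch: spending per pupil as a function of district size (FTE) plus covariates.
    library(earth)
    library(plotmo)

    kansas <- read.csv("kansas_districts.csv")   # placeholder file name

    fit <- earth(spend_per_pupil ~ fte + pct_free_lunch + urban_flag,
                 data = kansas, degree = 1)

    summary(fit)   # hinge terms on fte show where the spending curve bends
    plotmo(fit)    # partial dependence plots, one panel per predictor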



That's all a bit difficult to read; what if we use a partial dependence plot to describe the line fit to the FTE-to-spending relationship?  Here's what that looks like:

The green dots represent data points, and the black line represents the line fit to the data by the MARS regression.  The extreme left side of the graph looks appropriate, fitting an economy-of-scale curve, and the flat right side of the graph appears to be an appropriate flat line.  The "dip" between the two curves is concerning and warrants further analysis. (On further analysis, this appears to be a case of omitted variable bias: that category of districts contains many low-cost-of-living mid-rural districts, whereas larger districts tend to be in higher-cost areas, so prices (e.g. teacher wages) are higher.)