Wednesday, October 19, 2016

A Layperson's Guide to Multivariate Regression Outputs

Multivariate regression is a common technique used to predict a single outcome (dependent variable) using many predictors (independent variables).  For data scientists, multivariate regression is often the first statistical predictive technique to learn, and the easiest to understand and describe mathematically.  

For a layperson, the math of multivariate regression can seem daunting.  However, the general concepts can be described using knowledge of high school algebra and geometry.  Essentially, multivariate regression is the process of determining the line that best fits a set of data across multiple factors.  We can easily visualize this when we think about a two-dimensional problem, such as pace in the Boston Marathon by age:

The blue line represents the mathematically determined line of best fit.  As this looks like a coordinate plane from high school algebra, we can describe this line (and thus, the mean relationship between age and pace) using a y=mx+b equation.  Here that relationship looks like:

pace = 0.029 × [age] + 8.06
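To see what the fitted line does, we can plug an age into it directly (the age of 40 here is just an illustrative value, not from the original data):

```python
age = 40                   # hypothetical runner age
pace = 0.029 * age + 8.06  # the fitted line from the text
print(round(pace, 2))      # → 9.22 (predicted pace, minutes per mile)
```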
This concept is easy to understand in two dimensions, but the "multi" in multivariate regression refers to multiple predictive variables.  In this case, the line now extends into multiple dimensions, and instead of our simple high school y=mx+b we now have something like:

 y = m1x1 + m2x2 + m3x3 + m4x4 + ... + b
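The general equation above can be sketched as a small function.  The coefficients and inputs below are made up purely for illustration (e.g. an age and a 0/1 gender flag), not taken from the actual model:

```python
def predict(slopes, xs, intercept):
    """y = m1*x1 + m2*x2 + ... + b"""
    return sum(m * x for m, x in zip(slopes, xs)) + intercept

# Hypothetical example: two predictors with made-up coefficients.
pace = predict([0.029, -1.1], [40, 0], 8.06)
```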
This sounds complex, and it certainly can be. The complex outputs that authors sometimes post in conjunction with the analyses do not help a non-statistical reader understand what the analysis means.  Below I will describe how to interpret multivariate regression outputs and what they mean in terms of describing the world, without having to understand advanced statistics.  I created a simple regression, predicting Boston Marathon pace by age and gender, and printed the output below:  

Here's what each element means one-by-one (elements 4, 7, and 11 are all you really need to understand):
  1. Formula: shows what predictive elements went into the equation, and what is being predicted, separated by the ~ symbol.
  2. Residuals: tells us statistics about the "error" in the equation, essentially how far each of our data points is from our final predicted line.
  3. Variables: a list of all the predictive variables placed in the equation plus the "intercept," or the high school "b" from our predictive equation.
  4. Coefficient Estimate: this is the "m" from our high school equation, and tells us the slope of the line between each independent variable and the dependent variable.  For instance, the data above tells us that pace tends to increase with age (coefficient is positive) at a rate of about 0.03 minutes per year.
  5. Standard Error: we can think about this as the standard deviation of data around our coefficient estimates.  Essentially, how much variation do we believe could exist in all of our "m" elements in the equation?
  6. t-value: this is the t statistical value, which is calculated by dividing the coefficient (4) by the standard error (5).  The interpretation isn't straightforward, so as a non-statistical person you can largely ignore this.
  7. P: the p-value is calculated from our t value (you can look it up in a table).  There are two definitions we can use:
    1. (dumb statistical definition, skip) The value represents the probability of observing a coefficient at least this extreme, assuming that the null hypothesis is correct. 
    2. This tells you if an independent variable is statistically significant, meaning, if it is likely to have a non-random predictive relationship with the dependent variable.  In essence: does the independent variable have a meaningful relationship with the dependent variable?  The lower this is the more likely the relationship is significant, with values below 0.05 being "significant" under common statistical standards.
  8. Asterisks: the asterisks represent at what level of significance each variable is significant in the model, redundant to (7).
  9. Sig Codes: these serve as a key for (8) above, and are derived from the P-value, (7).
  10. Residual error: this can largely be ignored, though the "degrees of freedom" tells us roughly how many rows were used in the dataset (specifically, the number of rows minus the number of estimated coefficients).
  11. Multiple R-squared: this is a measure of the quality of the model, expressed in how much variance is "explained" in the dependent variable. Using the prior example, if we were going to predict the pace of runners in the Boston Marathon, and knew nothing except average times for past finishers, our best predictive strategy would be to just guess the average for each competitor (R-squared = 0).  But if we know the age and gender of the competitor, we can create a linear model to represent their relationships to pace, and use those to inform our predictions (model above, R-squared is 0.086).  If we had prior paces for each finisher, we may be able to improve our predictive model, and could measure that improvement by the R-squared value. This value runs from 0 to 1, and gets continuously "better" as it increases. 
  12. F-statistic: This one can also largely be ignored, though the p-value tells whether the regression as a whole is statistically significant.
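The key quantities above (elements 4, 5, 6, 7, 10, and 11) can all be computed by hand for a single-predictor regression.  The sketch below uses a small made-up dataset (not the actual Boston Marathon data) and, to stay dependency-free, approximates the t-distribution p-value with a normal distribution, which is only reasonable for larger degrees of freedom:

```python
import math

# Toy data (hypothetical): x = age, y = pace in minutes per mile.
x = [25, 30, 35, 40, 45, 50, 55, 60]
y = [8.7, 8.9, 9.0, 9.3, 9.2, 9.6, 9.5, 9.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope = sxy / sxx                # element 4: coefficient estimate ("m")
intercept = my - slope * mx      # the "b" (intercept) from the equation

residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
df = n - 2                       # element 10: residual degrees of freedom
ss_res = sum(r ** 2 for r in residuals)
se_slope = math.sqrt((ss_res / df) / sxx)  # element 5: standard error

t = slope / se_slope             # element 6: t-value
# element 7: two-sided p-value via a normal approximation to the
# t distribution (scipy's t.sf would be exact, but needs a dependency).
p = math.erfc(abs(t) / math.sqrt(2))

ss_tot = sum((yi - my) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot  # element 11: R-squared

print(round(slope, 3), round(r_squared, 3), p < 0.05)
```

With this toy data the slope comes out near 0.03 minutes per year, echoing the coefficient in the text, and the relationship is significant at the 0.05 level.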

And a test!  For the regression below I've added a single factor to the equation.  The variable e_a is a categorical indicator (it takes the value 0 or 1) and indicates whether or not the runner is part of some group.  Mail in your answers to these questions (no prizes, only knowing you're right):

  1. What group of runners do you think the variable e_a represents? (hint: e_a is an acronym)
  2. How did the addition of this variable impact the model in terms of quality and other variables?