So, cue XKCD reference.
Randall Munroe is making fun of varying levels of technical knowledge and rigor in different fields. It's a funny cartoon, and hopefully not too offensive to any readers of this blog involved with Sociology or Literary Criticism (I doubt there are many).
The irony here is in the first panel, which plays off the seemingly ignorant question, "Have you tried Logarithms?"
In analytics, I actually say "have you tried logarithms?" quite a bit. The reason is simple: to emulate different shapes of relationships that occur in nature, sometimes variable transformations are necessary.
Although logarithmic transformations are used across many modeling types, the most common is linear regression. If you understand why we log variables in linear regression, it's easy to apply the same logic to other models. There are many ways to transform variables, but here's a quick primer focusing on logarithmic transformations.
- linear-linear: Neither the dependent nor the independent variables are logged. A one-unit change in X produces a coefficient-sized change in Y. This is your normal straight-line relationship.
- linear-log: Only the independent variables are logged. A one-unit change in log(X) produces a coefficient-sized change in Y. This looks like a normal log curve, and it works well for many diminishing-returns problems.
- log-linear: Only the dependent variable is logged. A one-unit change in X produces a coefficient-sized change in log(Y). This creates an exponential curve, appropriate for exponential-growth relationships.
- log-log: Both dependent and independent variables are logged. With a natural logarithm, the coefficient can be read as an elasticity: a 1% change in X is associated with roughly a coefficient-% change in Y. (Think calculus; the key fact is that the derivative of ln(X) is 1/X.) This is used often in econometrics to represent elastic relationships.
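The four forms above can be sketched quickly in code. This is a minimal illustration with synthetic data and plain NumPy (the data, seed, and variable names are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)

# Simulate a diminishing-returns relationship: y grows with log(x)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0, 0.1, x.size)

# linear-linear: fit y on x directly (a straight line, a poor fit here)
b1, b0 = np.polyfit(x, y, 1)

# linear-log: fit y on log(x) -- this matches the curve we simulated
c1, c0 = np.polyfit(np.log(x), y, 1)

# log-linear would be np.polyfit(x, np.log(y), 1)   (exponential growth)
# log-log would be np.polyfit(np.log(x), np.log(y), 1)  (elasticity)

print(round(c1, 1), round(c0, 1))  # recovers roughly the true 2.0 and 3.0
```

The only thing that changes between the four model forms is which variables get wrapped in a log before fitting; the fitting routine itself stays the same.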
Real Life Example

Remember my fitness tracker data from earlier in the week? It struck me after that analysis that my activity didn't follow linear patterns. For instance, my increased activity on the weekend seemed to vary in a non-linear fashion, and there appeared to be an elastic relationship between yesterday's steps and today's.
So I logged some variables and re-ran my regression. Here's what I got:
Does it fit better? R-squared has its issues as a metric, but we can use it as a rough measure of fit. My original model had an R-squared of 68%, whereas this model sits at 71%. A small improvement, but there's one problem: log transformation of a variable changes its variance, so it's not valid to compare the R-squared of the logged dependent variable directly to the unlogged one. So what do we do?
I didn't document this step, but in R you can generate predictions from the logged equation, un-log those predictions, and then calculate an R-squared on the original scale for the logged model. (If you have any questions on this or want code, ask in the comments.) That unlogged R-squared for the logged equation sits at 70%.
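The trick described above can be sketched as follows. The author worked in R; this is an equivalent minimal sketch in Python with synthetic data, so the numbers and names here are assumptions, not the actual fitness-tracker model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 300)
# Simulate exponential-growth data, so a log-linear model is appropriate
y = np.exp(0.5 + 0.4 * x + rng.normal(0, 0.2, x.size))

# Fit the log-linear model: log(y) ~ x
b1, b0 = np.polyfit(x, np.log(y), 1)

# Predictions come out in log space; un-log them with exp().
# (Strictly, exp() of the prediction estimates the conditional median
# of y, not the mean, but it's fine for a rough fit comparison.)
pred = np.exp(b0 + b1 * x)

# R-squared computed on the ORIGINAL (unlogged) scale, so it is
# comparable to the R-squared of a model that never logged y
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))
```

Because both R-squared values are now computed against the same unlogged y, the comparison between the logged and unlogged models is apples to apples.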
Getting back to Munroe's original joke: sometimes "Have you considered logarithms?" is exactly the right question to ask. In this case I got a 2-point boost to R-squared, which isn't huge, but it amounts to about 6% of the previously unexplained variance. And I've seen entire news stories and lawsuits built on weaker correlations than that.
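That "share of unexplained variance" arithmetic is simple enough to check for yourself:

```python
# Original model vs. the unlogged R-squared of the logged model
old_r2, new_r2 = 0.68, 0.70

gain = new_r2 - old_r2                    # 2 percentage points
share_of_unexplained = gain / (1 - old_r2)  # fraction of the variance the old model left unexplained
print(round(share_of_unexplained * 100))  # roughly 6 (percent)
```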