Sunday, March 4, 2018

A Common R Mistake: R Factor-Numeric Conversions

For the most part the R statistical system is a robust and fast way to quickly execute statistical analyses. Other times the annoyances and "tricks" it contains for more junior analysts on the system, leads me to encourage new analysts to opt for Python instead.

One of the biggest tricks inside of R for junior analysts involves a specific data type called "factors," attempted type conversion, and a sometimes difficult to detect programming issue.

WHAT ARE FACTORS

Factors are a data type specific to R that helps statistician deal with categorical data.  In CS terms, factors help statisticians deal with non-numeric low-cardinality variables.  In most statistical processes this type of variable will be converted to binary dummies, so their storage in situ is less important.

Here is an official description from Berkley's R documentation regarding the storage of factors.

What does this actually mean?  When storing a factor, R strips out all of the actual text and replaces it with index numbers correlated to the textual values and stores the index numbers instead.  This both saves space in data frame storage and logically makes sense in the way these are used by statisticians.

And this process is mostly invisible to user for *most* processes...

That is, until you try to convert a factor to something else.

HOW DOES THE PROBLEM START?

This system works fine, until you need to convert that data to something else.  And here's the key instance where I've seen that occur:  Let's say that you're importing some data that you're not entirely familiar with, So you run something like this to import and inspect your data:

We see a data frame with 3 columns  "x" appears to be an index, "b" is just a simple numeric field.  But "a" is weird.  It looks like numbers, but for some reason R thought it was a factor.  This is where the mistake starts:

• Junior analyst converts this value directly to number (as.numeric() which works in many other programming languages and the SQL that is often use by data scientists.)-and continues on with their day.
• Three hours later the junior analyst (who may be a bit unfamiliar with the business problem to be solved) turns in a work product that has completely bizarre results and is confusing to the business-they must be wrong. So what happened?

WHAT ACTUALLY HAPPENED

Let's split the process apart and see what actually happens when you as.numeric() a factor.
If we create a new column in our data set containing the type-converted data we see:

Wait.. what?  This now seems to be *correlated* to but with completely different values than our original column.  Here's the trick:
'When factors are converted to numeric using as.numeric() it pulls the underlying index numbers and not the actual values, even if that actual value appears to be a number.'
Essentially: Even though column 'a' looks like numbers, R ignores that and pulls an internal ID number R uses as backend lookup.  This can be deceptive, especially when your level of missingness is relatively low after the type conversion.  Confusing this a bit, is that expected correlations generally hold up after the conversion, because the index numbers are ordered-it's simply the magnitude + variance that changes.

FIXING THE PROBLEM

Fixing the problem is easy, you simply convert to character (as.character()) before converting to numeric. This conversion uses the actual data values, gets rid of our index numbers. But what if you want to know why your variable was converted to factor in the first place by read.csv(). I've written the following function for which to check the values that came in that natively fail numeric conversion: The function finds that your numeric column of data also includes values 'a' and 'b' which are preventing numeric conversion.  Let's say now you realize the issue, and are aware that 'a' and 'b' should be converted to 0.  You can easily make this conversion after forcing the values to numeric-but first converting to character, as so:

Now we see the column 'a_better' seems to directly represent the original values in 'a'.

The combination of these functions make it easy to:
• Avoid our initial type conversion issue.
• Discover why our data that was assumed numeric is not all numeric, and DO SOMETHING about it.

TIMES I'VE ALMOST BEEN BURNT BY FACTOR CONVERSIONS

To finish this up I thought I would give two examples of times when I've almost been burnt by this functional weirdness in R.

Scenario One
I was analyzing a dataset that had an interesting distribution-it was monetary data, but rounded to the nearest dollar, and involved integer values from -1 to 250-with some higher outliers.  Remember that as.numeric() replaces a factor scale with an integer index starting at 1.  The dataset also included some NULL values, represented by the word NULL (this is how the Python-Spark export created the data).

When I downloaded and imported the data it initially came in as factor, and (not thinking) I simply forced the type conversion.  This had the effect of creating NA from the prior NULLs which I knew were assumed 0's and fixed with a simple df[is.na(df)] = 0 statement.  The problem was that now my scale was shifted approximately two values higher due to the initial distribution-but the variance was still the same, the percent of 0's were reasonable, and generally the data was still reasonable.

After about an hour of working with the data, I noticed that I was a bit too far off of control totals I had run in PySpark, and backed into my problem, fixed and moved on.

This speaks to a major risk in the factor conversion problem: when the dataset is made up of integers very near zero, the error is difficult to detect.

Scenario Two
In scenario two I was dealing with geospatial data, a polygon shapefile at the zip code level (what our external vendor could handle).  I had crossed it with a few massive 'points layers' and was creating an analysis of output zips using some fairly massive distance and customer travel pattern analytics.  At one point I needed to link the zip codes up to some additional zip code based data, but the join failed because the zip codes were factors.

Knowing I was only dealing with zip codes in the United States, I quickly used the as.numeric() without thinking.  In this case (if you know about zip codes you can imagine what happened) the new factor levels lead to effectively a scramble join.  I would have missed this completely, except that my last step involved visualizing the zips in a nationwide map-which looked completely random.

The point of this anecdote: as usual, visualizing data can be a powerful check against otherwise undetectable coding mistakes.

CONCLUSION

Factors in R can be a powerful statistical tool, but under a few scenarios in type conversion, they can cause issues.  This blog post provided:
• A general description of the issue.
• A couple of methods including a function to find non-numeric values in a factor.
• Some warnings of difficult-to-detect errors.