Sunday, March 4, 2018

A Common R Mistake: R Factor-Numeric Conversions

For the most part the R statistical system is a robust and fast way to quickly execute statistical analyses. Other times the annoyances and "tricks" it contains for more junior analysts on the system, leads me to encourage new analysts to opt for Python instead.

One of the biggest tricks inside of R for junior analysts involves a specific data type called "factors," attempted type conversion, and a sometimes difficult to detect programming issue.


Factors are a data type specific to R that helps statistician deal with categorical data.  In CS terms, factors help statisticians deal with non-numeric low-cardinality variables.  In most statistical processes this type of variable will be converted to binary dummies, so their storage in situ is less important.

Here is an official description from Berkley's R documentation regarding the storage of factors.
What does this actually mean?  When storing a factor, R strips out all of the actual text and replaces it with index numbers correlated to the textual values and stores the index numbers instead.  This both saves space in data frame storage and logically makes sense in the way these are used by statisticians.

And this process is mostly invisible to user for *most* processes...

That is, until you try to convert a factor to something else.


This system works fine, until you need to convert that data to something else.  And here's the key instance where I've seen that occur:  Let's say that you're importing some data that you're not entirely familiar with, So you run something like this to import and inspect your data:

We see a data frame with 3 columns  "x" appears to be an index, "b" is just a simple numeric field.  But "a" is weird.  It looks like numbers, but for some reason R thought it was a factor.  This is where the mistake starts:

  • Junior analyst converts this value directly to number (as.numeric() which works in many other programming languages and the SQL that is often use by data scientists.)-and continues on with their day.
  • Three hours later the junior analyst (who may be a bit unfamiliar with the business problem to be solved) turns in a work product that has completely bizarre results and is confusing to the business-they must be wrong. So what happened?


Let's split the process apart and see what actually happens when you as.numeric() a factor.
If we create a new column in our data set containing the type-converted data we see:

Wait.. what?  This now seems to be *correlated* to but with completely different values than our original column.  Here's the trick:
'When factors are converted to numeric using as.numeric() it pulls the underlying index numbers and not the actual values, even if that actual value appears to be a number.'
Essentially: Even though column 'a' looks like numbers, R ignores that and pulls an internal ID number R uses as backend lookup.  This can be deceptive, especially when your level of missingness is relatively low after the type conversion.  Confusing this a bit, is that expected correlations generally hold up after the conversion, because the index numbers are ordered-it's simply the magnitude + variance that changes.


Fixing the problem is easy, you simply convert to character (as.character()) before converting to numeric. This conversion uses the actual data values, gets rid of our index numbers. But what if you want to know why your variable was converted to factor in the first place by read.csv(). I've written the following function for which to check the values that came in that natively fail numeric conversion:

The function finds that your numeric column of data also includes values 'a' and 'b' which are preventing numeric conversion.  Let's say now you realize the issue, and are aware that 'a' and 'b' should be converted to 0.  You can easily make this conversion after forcing the values to numeric-but first converting to character, as so:

Now we see the column 'a_better' seems to directly represent the original values in 'a'.

The combination of these functions make it easy to:
  • Avoid our initial type conversion issue.
  • Discover why our data that was assumed numeric is not all numeric, and DO SOMETHING about it.


To finish this up I thought I would give two examples of times when I've almost been burnt by this functional weirdness in R.

Scenario One
I was analyzing a dataset that had an interesting distribution-it was monetary data, but rounded to the nearest dollar, and involved integer values from -1 to 250-with some higher outliers.  Remember that as.numeric() replaces a factor scale with an integer index starting at 1.  The dataset also included some NULL values, represented by the word NULL (this is how the Python-Spark export created the data).

When I downloaded and imported the data it initially came in as factor, and (not thinking) I simply forced the type conversion.  This had the effect of creating NA from the prior NULLs which I knew were assumed 0's and fixed with a simple df[] = 0 statement.  The problem was that now my scale was shifted approximately two values higher due to the initial distribution-but the variance was still the same, the percent of 0's were reasonable, and generally the data was still reasonable.

After about an hour of working with the data, I noticed that I was a bit too far off of control totals I had run in PySpark, and backed into my problem, fixed and moved on.

This speaks to a major risk in the factor conversion problem: when the dataset is made up of integers very near zero, the error is difficult to detect.  

Scenario Two
In scenario two I was dealing with geospatial data, a polygon shapefile at the zip code level (what our external vendor could handle).  I had crossed it with a few massive 'points layers' and was creating an analysis of output zips using some fairly massive distance and customer travel pattern analytics.  At one point I needed to link the zip codes up to some additional zip code based data, but the join failed because the zip codes were factors. 

Knowing I was only dealing with zip codes in the United States, I quickly used the as.numeric() without thinking.  In this case (if you know about zip codes you can imagine what happened) the new factor levels lead to effectively a scramble join.  I would have missed this completely, except that my last step involved visualizing the zips in a nationwide map-which looked completely random.

The point of this anecdote: as usual, visualizing data can be a powerful check against otherwise undetectable coding mistakes.


Factors in R can be a powerful statistical tool, but under a few scenarios in type conversion, they can cause issues.  This blog post provided:
  • A general description of the issue.
  • A couple of methods including a function to find non-numeric values in a factor.
  • Some warnings of difficult-to-detect errors.


  1. Needed to compose one little word yet thanks for the suggestions that you are contributed here, would like to read this blog regularly to get more important stuff...
    Best Online Software Training Institute | Big Data Analytics Training

  2. Great presentation of Data Science form of blog and Data Science tutorial. Very helpful for beginners like us to understand Data Science course. if you're interested to have an insight on Data Science training do watch this amazing tutorial.

  3. Data Science has taken the world by storm in the recent days and there are several aspiring software professionals looking to master this platform. There are several institutes in India especially in Hyderabad. Get the best Data Science Training in Hyderabad.

  4. Very interesting blog post.Quite informative and very helpful.This indeed is one of the recommended blog for learners.Thank you for providing such nice piece of article. I'm glad to leave a comment. Expect more articles in future. You too can check this R Programming tutorial for updated knowledge on R Programming.

  5. Very interesting blog post.Quite informative and very helpful.This indeed is one of the recommended blog for learners.Thank you for providing such nice piece of article. I'm glad to leave a comment. Expect more articles in future. You too can check this Data Science tutorial for updated knowledge on Data Science

  6. Hi, I like to Comments what you given about Data Science has taken the world by storm in the recent days and there are several aspiring software professionals looking to master this platform. There are several institutes in India especially in Hyderabad.
    Python training in Chennai

  7. [The future in 2019] Trending Technologies to learn in 2019:

  8. Awesome post. Keep it up. Much thanks to you such an incredible sum for sharing your beneficial blog. CFA Audit | Fixed Assets Audit | Warehouse Audit

  9. myTectra Placement Portal is a Web based portal brings Potentials Employers and myTectra Candidates on a common platform for placement assistance

  10. This comment has been removed by the author.

  11. Very helpful post. Very clear commentary and suggested phrasing are most impressive, and your generosity in sharing this explanation and example. Keep it up. Duplicate Payment Audit
    Duplicate Invoice Audit
    AP Vendor Helpdesk

  12. Hello! This is my first visit to your blog! We are a team of volunteers and starting a new initiative in a community in the same niche. Your blog provided us useful information to work on. You have done an outstanding job.

    Advanced AWS Training in Marathahalli |No.1 AWS Training in Marathahalli
    Best AWS Amazon Web Services Training Institute in Chennai | No.1 AWS Training Institutes for Solution Architect in Chennai | Advanced AWS Certification Training in Chennai

  13. This is such a good post. One of the best posts that I\'ve read in my whole life. I am so happy that you chose this day to give me this. Please, continue to give me such valuable posts. Cheers!

    Python training in bangalore
    Python course in pune
    Python training in bangalore

  14. This blog is the general information for the feature. You got a good work for these blog.We have a developing our creative content of this mind.Thank you for this blog. This for very interesting and useful.
    Best Devops online Training
    Online DevOps Certification Course - Gangboard

  15. Whoa! I’m enjoying the template/theme of this website. It’s simple, yet effective. A lot of times it’s very hard to get that “perfect balance” between superb usability and visual appeal. I must say you’ve done a very good job with this.
    aws training in bangalore
    RPA Training in bangalore
    Python Training in bangalore
    Selenium Training in bangalore
    Hadoop Training in bangalore

  16. Thinking how to win? Play BGAOC with us perfec slot Do not abuse a casino or go.


  17. The post good and really helpful for more stuff click on the link below.

    shriram earth

  18. Лучшый профиль для светодиодной ленты в СНГ вы можете купить у нас в Ekodio


  19. Thank you so much for posting this. I really appreciate your work. Keep it up. Great work!Best software training company with placement in Hyderabad

  20. Such an informative and helpful, Thank you for sharing this wonderful post.

    Data Science Courses in Bangalore

  21. It’s very informative and you are obviously very knowledgeable in this area. You have opened my eyes to varying views on this topic with interesting and solid content.
    date analytics certification training courses
    data science courses training

  22. I can see that you are an expert at your field! I am launching a website soon, and your information will be very useful for me.. Thanks for all your help and wishing you all the success in your business.
    AI learning course malaysia

  23. Welcome to the party of my life here you will learn everything about me.
    Data Science Course in Pune

  24. I finally found great post here.I will get back here. I just added your blog to my bookmark sites. thanks.Quality posts is the crucial to invite the visitors to visit the web page, that's what this web page is providing.

    data science course malaysia

  25. Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work!.
    Data Science Courses

  26. Excellent Blog! I would like to thank for the efforts you have made in writing this post. I am hoping the same best work from you in the future as well. I wanted to thank you for this websites! Thanks for sharing. Great websites!digital marketing course in singapore

  27. thanks for sharing nice information and nice article and very useful information.....
    MORE :