## Sunday, March 4, 2018

### A Common R Mistake: R Factor-Numeric Conversions

For the most part, the R statistical system is a robust way to quickly execute statistical analyses. At other times, the annoyances and "tricks" it holds for more junior analysts lead me to encourage new analysts to opt for Python instead.

One of the biggest tricks inside of R for junior analysts involves a specific data type called "factors," an attempted type conversion, and a sometimes difficult-to-detect programming issue.

## WHAT ARE FACTORS

Factors are a data type specific to R that helps statisticians deal with categorical data. In CS terms, factors help statisticians deal with non-numeric, low-cardinality variables. In most statistical processes this type of variable will be converted to binary dummies, so its storage in situ is less important.

Here is an official description from Berkeley's R documentation regarding the storage of factors.

What does this actually mean?  When storing a factor, R strips out all of the actual text and replaces it with index numbers correlated to the textual values and stores the index numbers instead.  This both saves space in data frame storage and logically makes sense in the way these are used by statisticians.
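The storage scheme described above can be inspected directly. This is a minimal sketch using a made-up `colors` vector (a hypothetical example, not data from this post): `levels()` shows the distinct text values and `as.integer()` shows the index numbers that R actually stores.

```r
# Hypothetical example values; any low-cardinality text column behaves the same way.
colors <- factor(c("red", "blue", "red", "green", "blue"))

# The distinct text values, kept once and sorted as text.
levels(colors)      # "blue" "green" "red"

# The index numbers stored for each observation.
as.integer(colors)  # 3 1 3 2 1

# Looking the index numbers up in the levels recovers the original text.
levels(colors)[as.integer(colors)]
```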

And this process is mostly invisible to the user for *most* processes...

That is, until you try to convert a factor to something else.

## HOW DOES THE PROBLEM START?

This system works fine, until you need to convert that data to something else. And here's the key instance where I've seen that occur: let's say that you're importing some data that you're not entirely familiar with, so you run something like this to import and inspect your data:
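As a stand-in for that import step, here is a minimal sketch. The data values are invented for illustration (only the column names "x", "a" and "b" come from this post), and the CSV is read from an inline string rather than a file. Note that since R 4.0.0 `read.csv()` defaults to `stringsAsFactors = FALSE`, so the argument is set explicitly here to reproduce the older behavior this post describes.

```r
# Hypothetical data: column "a" is mostly numbers but contains two stray text values.
csv_text <- "x,a,b
1,10,0.5
2,250,1.2
3,a,3.4
4,37,0.1
5,b,2.2
6,5,0.9"

# stringsAsFactors = TRUE reproduces the pre-4.0 default, where any column
# that cannot be read as numeric is imported as a factor.
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the column types and the first rows.
str(df)
head(df)
```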

We see a data frame with 3 columns: "x" appears to be an index, and "b" is just a simple numeric field. But "a" is weird. It looks like numbers, but for some reason R thought it was a factor. This is where the mistake starts:

• The junior analyst converts this value directly to a number (using as.numeric(), the kind of direct cast that works in many other programming languages and in the SQL that is often used by data scientists) and continues on with their day.
• Three hours later the junior analyst (who may be a bit unfamiliar with the business problem to be solved) turns in a work product with completely bizarre results that confuse the business; they must be wrong. So what happened?

## WHAT ACTUALLY HAPPENED

Let's split the process apart and see what actually happens when you as.numeric() a factor.
If we create a new column in our data set containing the type-converted data we see:
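A minimal sketch of that step, using an invented data frame with the same column names as in this post (the values are assumptions for illustration):

```r
# Hypothetical data frame; "a" is a factor that mostly looks numeric.
df <- data.frame(
  x = 1:6,
  a = factor(c("10", "250", "a", "37", "b", "5")),
  b = c(0.5, 1.2, 3.4, 0.1, 2.2, 0.9)
)

# The naive conversion: this returns each value's position in levels(df$a).
df$a_numeric <- as.numeric(df$a)

# Compare the original values with the "converted" ones side by side.
df[, c("a", "a_numeric")]

# The levels are sorted as text, which determines the index numbers.
levels(df$a)
```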

Wait... what? The new column seems to be *correlated* with our original column, but its values are completely different. Here's the trick:
'When a factor is converted with as.numeric(), R returns the underlying index numbers and not the actual values, even if those values appear to be numbers.'
Essentially: even though column 'a' looks like numbers, R ignores that and pulls the internal ID number it uses as a backend lookup. This can be deceptive, especially when your level of missingness is relatively low after the type conversion. Adding to the confusion, expected correlations often hold up after the conversion because the index numbers follow the sort order of the levels (which are sorted as text, so "10" comes before "9"); it's mainly the magnitude and variance that change.

## FIXING THE PROBLEM

Fixing the problem is easy: you simply convert to character (as.character()) before converting to numeric. This conversion uses the actual data values and discards the index numbers. But what if you want to know why read.csv() converted your variable to a factor in the first place? I've written the following function to check which incoming values fail numeric conversion:
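The original function listing is not reproduced in this text, so what follows is only a sketch of what such a check might look like. The name `find_non_numeric` and the sample values are hypothetical.

```r
# Hypothetical helper: return the distinct values that cannot be read as numbers.
find_non_numeric <- function(f) {
  # Work on the actual text values, not the factor's index numbers.
  vals <- as.character(f)
  # as.numeric() warns "NAs introduced by coercion" for non-numeric text;
  # the warning is expected here, so it is suppressed.
  converted <- suppressWarnings(as.numeric(vals))
  # Keep values that were present but failed conversion (ignore true NAs).
  bad <- vals[is.na(converted) & !is.na(vals)]
  sort(unique(bad))
}

# Example with invented values.
a <- factor(c("10", "250", "a", "37", "b", "5"))
find_non_numeric(a)                        # "a" "b"

# The safe conversion: character first, then numeric.
suppressWarnings(as.numeric(as.character(a)))  # 10 250 NA 37 NA 5
```

R's own help page for `factor` also mentions `as.numeric(levels(f))[f]` as a slightly more efficient equivalent of `as.numeric(as.character(f))`.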

The function finds that your supposedly numeric column also includes the values 'a' and 'b', which are preventing numeric conversion. Let's say you now understand the issue and know that 'a' and 'b' should be converted to 0. You can easily make this replacement after forcing the values to numeric (converting to character first), like so:
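A minimal sketch of that replacement, again on invented values. Only the column names 'a' and 'a_better' come from this post.

```r
# Hypothetical data: a factor column that mostly looks numeric.
df <- data.frame(a = factor(c("10", "250", "a", "37", "b", "5")))

# Convert via character so the actual values are used; 'a' and 'b' become NA.
# The coercion warning is expected, so it is suppressed.
df$a_better <- suppressWarnings(as.numeric(as.character(df$a)))

# Replace the failed conversions with 0.
# Caution: this also zeroes any values that were genuinely missing.
df$a_better[is.na(df$a_better)] <- 0

df
```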

Now we see the column 'a_better' seems to directly represent the original values in 'a'.

The combination of these functions makes it easy to:
• Avoid our initial type conversion issue.
• Discover why our data that was assumed numeric is not all numeric, and DO SOMETHING about it.

## TIMES I'VE ALMOST BEEN BURNT BY FACTOR CONVERSIONS

To finish this up I thought I would give two examples of times when I've almost been burnt by this functional weirdness in R.

Scenario One
I was analyzing a dataset that had an interesting distribution: it was monetary data, rounded to the nearest dollar, with integer values from -1 to 250 and some higher outliers. Remember that as.numeric() replaces a factor's values with an integer index starting at 1. The dataset also included some NULL values, represented by the word NULL (this is how the Python-Spark export created the data).

When I downloaded and imported the data it initially came in as a factor, and (not thinking) I simply forced the type conversion. This had the effect of creating NAs from the prior NULLs, which I knew were assumed to be 0's and fixed with a simple df[is.na(df)] = 0 statement. The problem was that my scale was now shifted approximately two values higher due to the initial distribution, but the variance was still the same, the percentage of 0's was reasonable, and the data generally still looked plausible.

After about an hour of working with the data, I noticed that I was a bit too far off the control totals I had run in PySpark, backed into my problem, fixed it, and moved on.

This speaks to a major risk in the factor conversion problem: when the dataset is made up of integers very near zero, the error is difficult to detect.

Scenario Two
In scenario two I was dealing with geospatial data, a polygon shapefile at the zip code level (what our external vendor could handle).  I had crossed it with a few massive 'points layers' and was creating an analysis of output zips using some fairly massive distance and customer travel pattern analytics.  At one point I needed to link the zip codes up to some additional zip code based data, but the join failed because the zip codes were factors.

Knowing I was only dealing with zip codes in the United States, I quickly used as.numeric() without thinking. In this case (if you know about zip codes you can imagine what happened) the factor level indexes led to what was effectively a scrambled join. I would have missed this completely, except that my last step involved visualizing the zips on a nationwide map, which looked completely random.

The point of this anecdote: as usual, visualizing data can be a powerful check against otherwise undetectable coding mistakes.
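The zip code failure is easy to reproduce on a small scale. This sketch uses a handful of sample zip codes and an invented `lookup` table (both assumptions for illustration, not the data from the anecdote). Converting to character rather than numeric also preserves leading zeros.

```r
# Hypothetical zip codes stored as a factor, as they might arrive from an import.
zips <- factor(c("94105", "02139", "60601", "10001"))

# The mistake: this returns level positions, not zip codes.
as.numeric(zips)  # 4 1 3 2

# For join keys, converting to character keeps the real values,
# including the leading zero that as.numeric("02139") would drop.
zip_chr <- as.character(zips)

# Hypothetical lookup table keyed by zip code as text.
lookup <- data.frame(
  zip = c("02139", "10001", "60601", "94105"),
  region = c("Northeast", "Northeast", "Midwest", "West"),
  stringsAsFactors = FALSE
)

# Joining on the character key matches every row.
merged <- merge(data.frame(zip = zip_chr, stringsAsFactors = FALSE), lookup, by = "zip")
merged
```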

## CONCLUSION

Factors in R can be a powerful statistical tool, but in a few type-conversion scenarios they can cause issues. This blog post provided:
• A general description of the issue.
• A couple of methods including a function to find non-numeric values in a factor.
• Some warnings of difficult-to-detect errors.
