One of the biggest traps in R for junior analysts involves a specific data type called "factors," an attempted type conversion, and a programming issue that is sometimes difficult to detect.
WHAT ARE FACTORS
Factors are a data type specific to R that helps statisticians deal with categorical data. In CS terms, factors help statisticians deal with non-numeric, low-cardinality variables. In most statistical processes this type of variable will be converted to binary dummies, so their storage in situ is less important.
Here is an official description from Berkeley's R documentation regarding the storage of factors.
What does this actually mean? When storing a factor, R strips out all of the actual text and replaces it with index numbers correlated to the textual values and stores the index numbers instead. This both saves space in data frame storage and logically makes sense in the way these are used by statisticians.
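To make the storage idea concrete, here is a minimal sketch. The vector name and its values are made up for illustration; the functions are all base R.

```r
# Hypothetical categorical data (not from the original post)
colour <- factor(c("red", "blue", "red", "green", "blue"))

# The distinct text values are stored once, in sorted order
levels(colour)      # "blue" "green" "red"

# Each element is stored as an integer code pointing into the levels
as.integer(colour)  # 3 1 3 2 1

# Under the hood a factor is an integer vector with a "levels" attribute
typeof(colour)      # "integer"
```

Printing the factor shows the text labels, which is why the integer codes usually stay out of sight.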
And this process is mostly invisible to the user for *most* processes...
That is, until you try to convert a factor to something else.
HOW DOES THE PROBLEM START?
This system works fine until you need to convert that data to something else. Here's the key instance where I've seen that occur: let's say you're importing some data that you're not entirely familiar with, so you run something like this to import and inspect your data:
We see a data frame with 3 columns: "x" appears to be an index, and "b" is just a simple numeric field. But "a" is weird. It looks like numbers, but for some reason R thought it was a factor. This is where the mistake starts:
- Junior analyst converts this value directly to a number (using as.numeric(); this kind of direct cast works in many other programming languages and in the SQL often used by data scientists) and continues on with their day.
- Three hours later the junior analyst (who may be a bit unfamiliar with the business problem to be solved) turns in a work product with completely bizarre results that confuse the business: they must be wrong. So what happened?
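For readers who want to reproduce the setup, the import step might look like the sketch below. The file contents are invented for illustration (only the column names "x", "b", and "a" come from the post). Note also the assumption that stringsAsFactors = TRUE needs to be passed explicitly: since R 4.0.0 the default is FALSE, so on a current R version text columns arrive as character unless you ask for factors.

```r
# Hypothetical CSV contents standing in for the unfamiliar file
csv_text <- "x,b,a
1,0.5,10
2,1.5,20
3,2.5,a
4,3.5,50
5,4.5,b
6,5.5,20"

# Before R 4.0.0 text columns became factors by default;
# here we request that behaviour explicitly.
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the column types: "a" comes in as a factor, not a number
str(df)
head(df)
```

The two stray text values are enough to stop R from treating the whole column as numeric.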
WHAT ACTUALLY HAPPENED
Let's split the process apart and see what actually happens when you as.numeric() a factor.
If we create a new column in our data set containing the type-converted data we see:
Wait... what? The new column seems to be *correlated* with our original column, but its values are completely different. Here's the trick:
'When factors are converted to numeric using as.numeric(), R pulls the underlying index numbers and not the actual values, even if the actual values appear to be numbers.'
Essentially: even though column 'a' looks like numbers, R ignores that and pulls the internal ID numbers it uses as a backend lookup. This can be deceptive, especially when your level of missingness is relatively low after the type conversion. Confusing matters further, expected correlations generally hold up after the conversion because the index numbers follow the sorted order of the levels; it's simply the magnitude and variance that change. (The levels are sorted as text, so this holds best when the values have similar widths.)
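A small sketch of the behaviour described above. The data frame and the column name a_numeric are assumptions made for the example.

```r
# Hypothetical column of number-like text stored as a factor
df <- data.frame(a = factor(c("10", "20", "20", "50", "10")))

# The levels hold the real values, as text
levels(df$a)                       # "10" "20" "50"

# The naive conversion returns the level codes, not the values
df$a_numeric <- as.numeric(df$a)
df$a_numeric                       # 1 2 2 3 1

df
```

The new column rises and falls with the original, which is exactly why the mistake is easy to miss.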
FIXING THE PROBLEM
Fixing the problem is easy: convert to character (as.character()) before converting to numeric. This conversion uses the actual data values and discards the index numbers.
But what if you want to know why read.csv() converted your variable to a factor in the first place? I've written the following function to check for the values that fail numeric conversion:
The function finds that your numeric column also includes the values 'a' and 'b', which are preventing numeric conversion. Let's say you now realize the issue and know that 'a' and 'b' should be converted to 0. You can easily make this replacement after forcing the values to numeric (converting to character first), like so:
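The check and the conversion described above can be sketched roughly as follows. This is not the original function from the post; the helper name find_non_numeric and the sample data are assumptions for illustration.

```r
# Hypothetical helper (name and body are assumptions, not the author's code):
# returns the distinct values of x that cannot be parsed as numbers.
find_non_numeric <- function(x) {
  values <- as.character(x)                        # use the labels, not the codes
  parsed <- suppressWarnings(as.numeric(values))   # unparseable text becomes NA
  unique(values[is.na(parsed) & !is.na(values)])   # ignore values already missing
}

# Hypothetical data matching the description in the text
df <- data.frame(a = factor(c("10", "20", "a", "50", "b", "20")))

find_non_numeric(df$a)   # "a" "b"

# Convert via character so the real values are used, then set the
# values that failed conversion to 0
df$a_better <- suppressWarnings(as.numeric(as.character(df$a)))
df$a_better[is.na(df$a_better)] <- 0
df$a_better              # 10 20 0 50 0 20
```

One caveat with this sketch: the blanket replacement also turns genuinely missing values into 0, so it only makes sense when that is what you want. R's documentation for factor also suggests as.numeric(levels(f))[f] as a slightly more efficient equivalent of as.numeric(as.character(f)).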
Now we see the column 'a_better' seems to directly represent the original values in 'a'.
The combination of these functions makes it easy to:
- Avoid our initial type conversion issue.
- Discover why our data that was assumed numeric is not all numeric, and DO SOMETHING about it.
TIMES I'VE ALMOST BEEN BURNT BY FACTOR CONVERSIONS
To finish this up I thought I would give two examples of times when I've almost been burnt by this functional weirdness in R.
Scenario One
I was analyzing a dataset that had an interesting distribution: it was monetary data, rounded to the nearest dollar, with integer values from -1 to 250 and some higher outliers. Remember that as.numeric() replaces a factor's values with an integer index starting at 1. The dataset also included some NULL values, represented by the word NULL (this is how the Python-Spark export created the data).
When I downloaded and imported the data it initially came in as a factor, and (not thinking) I simply forced the type conversion. This had the effect of creating NA from the prior NULLs, which I knew were assumed 0's and fixed with a simple df[is.na(df)] = 0 statement. The problem was that my scale was now shifted approximately two values higher due to the initial distribution, but the variance was still the same, the percentage of 0's was reasonable, and generally the data still looked reasonable.
After about an hour of working with the data, I noticed that I was a bit too far off the control totals I had run in PySpark, backed into my problem, fixed it, and moved on.
This speaks to a major risk in the factor conversion problem: when the dataset is made up of integers very near zero, the error is difficult to detect.
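A simplified sketch of why this is hard to spot, and how a control total catches it. The values below are invented (small non-negative integers only, without the -1 and NULL values from the real dataset), and control_total stands in for a figure computed independently in another system.

```r
# Hypothetical small-integer amounts that arrived as a factor
amount <- factor(c("0", "1", "2", "3", "0", "2", "5"))

wrong <- as.numeric(amount)                 # level codes
right <- as.numeric(as.character(amount))   # actual values

wrong   # 1 2 3 4 1 3 5
right   # 0 1 2 3 0 2 5

# The wrong values look plausible: same shape, slightly shifted
summary(wrong)
summary(right)

# Comparing against an independently computed total exposes the problem
control_total <- 13                         # assumed known from another system
isTRUE(all.equal(sum(wrong), control_total))  # FALSE
isTRUE(all.equal(sum(right), control_total))  # TRUE
```

Nothing in the wrong column looks out of place on its own; only the comparison with an outside number gives it away.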
Scenario Two
In scenario two I was dealing with geospatial data, a polygon shapefile at the zip code level (what our external vendor could handle). I had crossed it with a few massive 'points layers' and was creating an analysis of output zips using some fairly massive distance and customer travel pattern analytics. At one point I needed to link the zip codes up to some additional zip code based data, but the join failed because the zip codes were factors.
Knowing I was only dealing with zip codes in the United States, I quickly used as.numeric() without thinking. In this case (if you know about zip codes you can imagine what happened) the new factor levels led to what was effectively a scrambled join. I would have missed this completely, except that my last step involved visualizing the zips on a nationwide map, which looked completely random.
The point of this anecdote: as usual, visualizing data can be a powerful check against otherwise undetectable coding mistakes.
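A minimal sketch of the zip code case, with made-up zip codes, column names, and values. It assumes the join key should stay as text, since zip codes are identifiers rather than quantities.

```r
# Hypothetical zip-level data where the key arrived as a factor
shapes <- data.frame(
  zip  = factor(c("90210", "10001", "60614", "02139")),
  area = c(5.7, 0.6, 3.2, 1.1)
)

# The naive conversion yields level codes, which match nothing real
as.numeric(shapes$zip)                 # 4 2 3 1

# Going through character recovers the digits but drops the leading zero
as.numeric(as.character(shapes$zip))   # 90210 10001 60614 2139

# Keeping the key as character preserves the zip codes exactly
shapes$zip <- as.character(shapes$zip)

# Hypothetical second table keyed by zip code as text
extra <- data.frame(
  zip       = c("02139", "10001", "60614", "90210"),
  customers = c(120, 340, 210, 95),
  stringsAsFactors = FALSE
)

joined <- merge(shapes, extra, by = "zip")
joined
```

For identifiers like these, as.character() alone is usually the safer conversion; going on to numeric only helps when the values are real quantities.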
CONCLUSION
Factors in R can be a powerful statistical tool, but in a few type-conversion scenarios they can cause issues. This blog post provided:
- A general description of the issue.
- A couple of methods including a function to find non-numeric values in a factor.
- Some warnings of difficult-to-detect errors.