- Data Cleaning,
- Feature Selection,
- Running Models,
- Model Evaluation, and
- Report Production (create a PDF for review by business owners, if they so choose).
I can write all five of these steps easily in R, and haven't really had problems with this type of modeling. But I also know Python, which has similar machine learning and analytical packages and has been referred to as the "future of Data Science." I had used these packages before (pandas, numpy, sklearn), but I most often use Python for non-modeling tasks or to access frameworks like Spark.
Two weeks ago, while testing one of my pipelines, I had the idea to port my primary model-building pipeline into Python. The reasons for this were twofold:
- To use and test Python's different data science methods and packages
- To make use of Python's flexibility as a programming language (as well as its status as a "real" all-purpose language).

- Categorical handling: If I have a categorical variable in an R data frame and I want to pass it to an R model, I can pass the variable directly to an algorithm, and R efficiently creates numerical data on the fly without user intervention. Python, however, generally requires a preprocessing step to map the categorical variable into one binary column per level. Both methods have drawbacks:
- R is like an "automatic transmission": it is less work for the user and makes the data frame in memory easier to manipulate. On the other hand, with this method some R functions force all levels of a categorical variable (minus one) into an algorithm, when an optimal model would sometimes feature-select down to far fewer (some models handle this, some don't).
- Python is more of a "manual transmission" situation, where the user has to intervene to decide on a categorical encoding strategy (e.g. pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder()). This means more work for the user and massive data frames, but allows for more control of feature selection (in some algorithms) at run time. (This is actually a problem I've seen in R for quite some time, and through being less developed in this space, Python has "solved" the problem.)
- Different algorithms: Python is not a primary language for statisticians and research data scientists (Python is new to the game), which puts it a bit behind the curve for algorithm availability. One example of a missing case is a shrunken centroids model, which I had found useful in a few specific types of classification.
- Some models run faster: When I run a model in R versus Python I get similar results within tolerance, except that the Python models tend to run much faster on my hardware. As a test I ran XGBoost in both systems. The models were substantially similar (AUC = 0.713 vs. AUC = 0.716), but the Python version finished in 3 seconds versus 32 seconds in R. Both were still under a minute, so this may not seem substantial. Inside an analytics pipeline where you may be building a few thousand models, however, the timing difference multiplies into something substantial.
- More consistency between models: R is a bit of the "wild west" in terms of consistency, both in model parameters and in model object outputs. For direct comparison of models (or to run different model types under similar parameters) one often has to rely on third-party packages like "caret" or "broom." This makes R's advantage in packages and model types less than ideal, in that moving between those model types is not straightforward. In Python's sklearn I can generally count on classifiers of similar types to give me similar output objects and methods.
- Some things don't work at all: I've had more issues in Python with certain functions not working *out of the box* as stated in the documentation, though many of these seem to be fixed in later bug-fix releases. I *think* this is likely because sklearn is still mainly a package under development.
- Plotting: To be honest, I'm still figuring this one out. Matplotlib appears to be the preferred plotting strategy in Python (though there is a Python version of ggplot), but honestly, rewriting all my diagnostic plotting strategies (and getting labels, titles, axes, and legends correct) has been one of the biggest pains in this entire process. It's difficult to determine whether Python is actually more difficult, or if it's just painful because I've spent several years developing my own plots in R.
- Object Oriented: Python has a somewhat more straightforward syntax as a programming language, and my code for the Python pipeline is more object oriented and, quite honestly, better coded than what I have in R. That said, the whitespace and syntax requirements in Python took some getting used to versus my "I do what I want" attitude of coding in R.
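To make the "manual transmission" encoding step and the consistent sklearn interface described above concrete, here is a minimal sketch. The toy data, the column names (region, spend, churned), and the choice of two classifiers are assumptions made up for illustration; they are not taken from the actual pipeline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical column, one numeric column,
# and a binary target. None of this comes from the real pipeline.
df = pd.DataFrame({
    "region": ["east", "west", "south", "east", "west", "south", "east", "west"],
    "spend": [10.0, 25.0, 7.5, 12.0, 30.0, 5.0, 11.0, 28.0],
    "churned": [0, 1, 0, 0, 1, 0, 0, 1],
})

# Manual encoding step: every level of "region" becomes its own 0/1 column.
# R's modeling functions typically do the equivalent behind the scenes.
X = pd.get_dummies(df[["region", "spend"]], columns=["region"], dtype=int)
y = df["churned"]

# Two different model types, driven through the same fit / predict_proba
# interface, so the loop body does not care which model it is handling.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
probabilities = {}
for name, model in models.items():
    model.fit(X, y)
    # Column 1 of predict_proba is the probability of the positive class.
    probabilities[name] = model.predict_proba(X)[:, 1]

print(list(X.columns))
print({name: p.round(2) for name, p in probabilities.items()})
```

Because the encoded columns are ordinary data frame columns, individual levels can be dropped or selected before fitting, which is the extra control over feature selection mentioned in the categorical handling point.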
Overall, both platforms have advantages and disadvantages. My takeaways are these:
- R is likely better (in the short term at least) for data exploration and manual or "academic" model builds due to relative ease of coding and availability of models and methods.
- Python may be better for large-scale model builds where speed and consistency between models are necessary (and also if you have an aversion to hearing the term "tidy").