Sunday, September 6, 2020

R versus Python: A Practical Paradigm for Choosing Your Production Language

As a Data Scientist fluent in both Python and R (and to a lesser degree, a few other languages), I'm often tasked with deciding which to use for a specific project or script.  That decision can be tough, but it can generally be made pretty quickly by asking three questions:

  1. Do the specific outcomes of this task make one option obviously better?  Example: if the outcome is a plot, I'll choose R for the ease of the ggplot interface.   
  2. Do the players on this task make one option obviously better?  Example: the data engineer involved is a Python person, let's keep it all in Python.
  3. Does the data science work (modeling) tend towards one language?  Example: if the solution requires heavy econometric modeling, panel modeling, or time-series analysis, I'll generally choose R.

But one use case that often doesn't offer a clear decision is the productionization of Data Science processes into microservices.  Both R and Python have tools to handle this, but how do we decide the correct option for each case?  Here I offer a quick survey of the tools in each language, and my current paradigm for making the final choice.

Python Vs R


I have blogged a few times about pushing R into production to solve a variety of problems.  Is it possible? Absolutely: it has been done multiple times, mainly by data scientists.  What does that environment look like?

  1. Application code: Normal R applications using functions, classes and code from the roughly 10,000 packages on CRAN.  I recommend avoiding Tidyverse and pipes due to issues with debugging in production.
  2. Database connectivity: R has several packages for database connections. I generally use RODBC or RJDBC depending on the use case, but DBI can also be used in prod.
  3. Serving the application: Depending on what you want to do, there are a couple of options.  To easily spin up a quick REST API, you can use the plumber package.  If you want a more robust solution, Rserve provides a binary R server.

General thoughts:  As with most things R, building models and running predict() on them is straightforward.  plumber is a very simple system to set up and run a quick REST API, but that API is usually single-threaded and so isn't appropriate for production workloads.  Rserve solves this problem on Unix systems with forked processes, but it is esoteric to implement, and getting from it to a REST system is more difficult.



I have set up production systems in Python a few times, solving a variety of problems, mainly data engineering tasks.  It is possible, and it actually happens far more broadly than in data science alone.  Many microservices for general web development are also written in Python, and a lot of people work on them, which gives us a much broader set of tools.

  1. Application code: In this case we would use standard Python Data Science application tools, including but not limited to pandas, numpy, and scikit-learn.  Though this tool set is somewhat more limited than the 10,000 packages on CRAN, it gives us a solid base to complete 99% of Data Science tasks.
  2. Database connectivity: Python also has several options to connect to a database server in production, including pyodbc for ODBC connections and PyMySQL, a pure-Python client, for MySQL.
  3. Serving the application: As I stated before, because Python is used for many other application microservices, we have a rich toolset to choose from.  While there are other options, we can use a combination of Flask (API), nginx (web server) and gunicorn (the WSGI server that manages the worker processes).

General thoughts: While Python's data science toolset is a bit limiting, especially if tending towards econometric or traditional statistical models, there are other big advantages.  Specifically, with a rich microservice framework, we can very simply set up REST APIs, serve them with robust web servers, and manage heavy and embarrassingly parallel workloads.
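
To make the Flask piece of that stack concrete, here is a minimal sketch of a prediction endpoint.  It is an illustration only, not a production service: the /predict route, the "features" field, and the hard-coded linear "model" are all assumptions of mine, standing in for whatever fitted model you would actually load at startup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for a trained model: a fixed linear scorer.
# In a real service this would be a fitted estimator loaded once at startup.
WEIGHTS = [0.5, -1.0, 2.0]
INTERCEPT = 0.1


def predict(features):
    """Score one observation with the placeholder linear model."""
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))


def is_number(value):
    """True for ints and floats, but not booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # silent=True returns None instead of raising when the body is not JSON
    payload = request.get_json(silent=True)
    features = payload.get("features") if isinstance(payload, dict) else None
    valid = (
        isinstance(features, list)
        and len(features) == len(WEIGHTS)
        and all(is_number(x) for x in features)
    )
    if not valid:
        message = f"'features' must be a list of {len(WEIGHTS)} numbers"
        return jsonify(error=message), 400
    return jsonify(prediction=predict(features))


if __name__ == "__main__":
    # Flask's built-in server is for local development only.
    app.run(port=8000)
```

Assuming the file is saved as app.py, something like `gunicorn --workers 4 app:app` would run it across several worker processes, with nginx sitting in front as a reverse proxy.  The worker count here is a placeholder, not a recommendation.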


Making the Decision

When weighing the relative feature sets of R and Python for production, we end up in a prioritization wash.  If ease of setting up robust APIs is a priority, then Python wins easily (yes, plumber is easy, but it has issues with workloads).  If access to the thousands of algorithms and packages in CRAN is a priority, then a move to R is more appropriate, while biting the bullet on Rserve.

But, as with most things, there are some external consequences we need to consider, and I suggest a new paradigm for choosing a production language:  supportability.  If I personally can support both R and Python packages, that's great, but in truth, the Data Science team is not the group supporting prod.

More often, it falls to a DevOps team to quickly figure out what's gone wrong, sometimes at odd hours.  In my experience, 90+% of DevOps teams have Python experience, and approximately 3% have R experience.  If R produced readable error messages, that would be less of a problem, but its errors are often nonsense to non-R users.  As a result, my quick paradigm for Data Science production:

While R and Python come with similarly rich feature sets for production code, in cases where either can be used, Python is usually the better choice because it eases support from DevOps and other technical resources.