Sunday, September 6, 2020

R versus Python: A Practical Paradigm for Choosing Your Production Language

As a Data Scientist fluent in both Python and R (and, to a lesser degree, a few other languages), I'm often tasked with deciding which to use for a specific project or script.  The decision can be tough, but it can generally be made pretty quickly:

  1. Do the specific outcomes of this task make one option obviously better?  Example: if the outcome is a plot, I'll choose R for the ease of the ggplot interface.   
  2. Do the players on this task make one option obviously better?  Example: the data engineer involved is a Python person, let's keep it all in Python.
  3. Does the data science work (modeling) tend towards one language?  Example: if the solution requires heavy econometric modeling, panel modeling, or time-series analysis, I'll generally choose R.

But one use case that often doesn't offer a clear decision is the productionization of Data Science processes into microservices.  Both R and Python have tools to handle this, but how do we decide the correct option in each case?  Here I offer a quick survey of the tools in each language, and my current paradigm for making the final choice.


I have blogged a few times about pushing R into production to solve a variety of problems.  Is it possible? Absolutely--it has been done multiple times, mainly by data scientists.  What does that environment look like?

  1. Application code: Normal R applications using functions, classes, and code from the ~10,000 CRAN packages.  I recommend avoiding the Tidyverse and pipes due to issues with debugging in production.
  2. Database connectivity: R has several packages for database connections--I generally use RODBC or RJDBC depending on the use case, but DBI can also be implemented in prod.
  3. Serving the application: Depending on what you want to do, there are a couple of options.  To easily spin up a quick REST API, you can use the plumber package.  If you want a more robust solution, Rserve provides a production-grade binary R server.

General thoughts:  As with most things R, building and running .predict() functions on models is straightforward.  plumber makes it very simple to set up and run a quick REST API, but that API is single-threaded and isn't appropriate for production.  Rserve solves this problem on Unix systems with forked processes, but it is esoteric to implement, and getting to a REST system with it is more difficult.


I have set up production systems in Python a few times, solving a variety of problems--mainly data engineering tasks.  It is possible, and it actually happens in a much broader sense than data science.  Many of the microservices built for general web development are also written in Python, and a lot of people work on them, which gives us a much broader set of tools.

  1. Application code: In this case we would use standard Python Data Science tools, including but not limited to pandas, numpy, and scikit-learn.  Though this tool set is somewhat inferior to the 10,000 packages on CRAN, it gives us a solid base to complete 99% of Data Science tasks.
  2. Database connectivity: Python also has several options to connect to a database server in production, including pyodbc for ODBC connections and pymysql for a pure-Python MySQL client.
  3. Serving the application: As I stated before, because Python is used for many other application microservices, we have a rich toolset to choose from.  While there are many options, a common combination is Flask (API), nginx (webserver), and gunicorn (workload handler).
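As a quick sketch of the database connectivity step, here is roughly what pyodbc usage looks like.  The server, database, credentials, and table names below are hypothetical placeholders, not a real configuration:

```python
# Sketch of production DB access with pyodbc.  All names below (server,
# database, credentials, table) are hypothetical placeholders.

def build_conn_str(server, database, user, password):
    """Assemble an ODBC connection string (SQL Server driver shown)."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID={user};PWD={password}"
    )

def fetch_model_inputs(conn_str):
    """Pull model inputs; in production, wrap this in retry/error handling."""
    import pyodbc  # imported lazily; requires the pyodbc package and an ODBC driver
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        cur.execute("SELECT id, feature_1, feature_2 FROM model_inputs")
        return cur.fetchall()
```

The same shape works with pymysql by swapping the connection call; the DB-API cursor interface is nearly identical across the two.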

General thoughts: While Python's data science toolset is a bit limiting, especially if you tend towards econometric or traditional statistical models, there are other big advantages.  Specifically, with a rich microservice framework, we can very simply set up REST APIs, serve them with robust webservers, and manage heavy and embarrassingly parallel workloads.
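To make the serving step concrete, here is a minimal sketch of the Flask piece of that stack, assuming Flask is installed.  The linear "model" and the route/coefficient names are stand-ins for a real fitted model (e.g., an unpickled scikit-learn estimator), not an actual production service:

```python
# Minimal sketch of a model-serving REST API with Flask.  The "model" here
# is a stand-in linear scorer; in practice you would load a fitted
# scikit-learn model and call its .predict() instead.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical coefficients standing in for a trained model
COEFS = {"feature_1": 0.5, "feature_2": -1.2}
INTERCEPT = 0.1

def predict(features):
    """Linear score; with scikit-learn this would be model.predict(X)[0]."""
    return INTERCEPT + sum(COEFS[k] * float(features.get(k, 0.0)) for k in COEFS)

@app.route("/predict", methods=["POST"])
def predict_route():
    # Accept a JSON payload of feature values and return the score
    payload = request.get_json(force=True)
    return jsonify({"prediction": predict(payload)})

# In production this app would not run via app.run(); instead gunicorn
# serves it behind nginx, e.g.:  gunicorn -w 4 -b 127.0.0.1:8000 app:app
```

The gunicorn worker count (`-w 4`) is where the "workload handler" piece comes in: multiple worker processes sidestep the single-threaded limitation that plumber suffers from out of the box.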


Making the Decision

When weighing the relative feature sets of R and Python for production, we end up in a prioritization wash.  If ease of setting up robust APIs is the priority, then Python wins easily (yes, plumber is easy, but it has issues with workloads).  If access to the thousands of algorithms and packages in CRAN is the priority, then a move to R is more appropriate--while biting the bullet on Rserve.

But, as with most things, there are some external consequences to consider, and I suggest a new paradigm for choosing a production language: supportability.  If I personally can support R and Python packages, that's great, but in truth, the Data Science team is not the group supporting prod.

More often, that falls to a DevOps team that has to quickly figure out what's gone wrong, sometimes at odd hours.  From my experience, 90+% of DevOps teams have Python experience, and approximately 3% have R experience.  If R produced reasonable error messages, that would be less of a problem--but those errors are often nonsense to non-R users.  As a result, my quick paradigm for Data Science production:

While R and Python come with similarly rich feature sets for production code, in cases where either can be used, Python is often preferable, as it eases support from DevOps and other technical resources.

Thursday, January 2, 2020

Data Science Job Market

A few weeks ago I was thinking about the state of the current Data Science job market--and, a bit frustrated by inquiries from job seekers on LinkedIn, I hastily sent out a tweet with my thoughts.

As the weeks have gone by (and my employment situation has become a bit weird--in good ways), I've often thought about that tweet.  Essentially, since the day I said this, I've noticed it to be more true than I originally thought--especially on the Shingy front.

I thought it might be good to put together some more in-depth thoughts on the different types of Data Science candidates I see on the market and how they relate to real roles in companies.  As a result, this blog can serve as a helpful guide to building a data science team at your organization.  I've gone into more detail on the five types of candidates below.

Directly Out of School

The trademark of these candidates is recently completed grad school with no or limited work experience.  A few resumes will have a history of internships, or will try to pass off class projects as "jobs they worked."  My general thoughts:
  • There are a lot of these candidates--a lot.
  • They harass Data Science managers regularly on social media, LinkedIn and email.
  • Many of them are not yet talented, and are going to take a lot of work before they're tackling projects on their own.
  • Some of them have not been screened for what I call "employability" and are unlikely to survive the rigor, rules and norms of the workplace.
  • Be careful of buzzword slingers.
Advice: You can take on a few of these candidates, and they can add value to your team over time.  But you can and should screen heavily before choosing.  The sheer number of these candidates on the market gives the employer the luxury of being picky--and because many of these candidates come untested and without serious references (professors are not real references), you need to use your best BS detector.

Couple of Years Experience

These candidates have been in the job market for one to five years, generally with one or two employers.  Resumes will generally list real projects they worked on, though because they were junior-level employees, you don't know what their actual role involved.  My thoughts on this group:
  • There are quite a few in this group as well.
  • The loudest ones are usually in the process of failing out of a first job.
  • A lot of them are well on their way to a great career, though.
  • Their skills may be largely defined by the experience of their first job--so there will still be a big training job ahead.
  • It is very important to determine *exactly* what their roles were on teams and projects; beware of Tableau jockeys and spreadsheet analysts pumping up their resumes.
Advice: These employees are generally a great investment, though you have to be very careful not to pick up another organization's castoffs.  They can begin to take on larger projects on their own, or serve as junior mentors to the first category of employees.  This group should be seen as your bridge to the future Data Science team--the group that will be your Senior and Principal data scientists within 2-5 years.

Shingy Clones

First, who's Shingy?  This guy.

These candidates are full of hot air, very much hyped on Data Science as a concept--and on other hyped concepts as well; have fun getting them to talk about blockchain.  The dark side, of course, is that they have no Data Science abilities and are just low-rent hype people.
  • They will come with a lot of energy and enthusiasm, which can be hypnotizing, especially for executives.
  • These people are the definition of why the interview process is critical.
  • They are completely destructive if you hire one; they will always be hyping, saying we need new technology or to do "x," but they don't have the skills or knowledge to understand what they are suggesting or how to deliver.
  • They have no clue what they are talking about.
Advice: You can't hire these people.  They are going to be a high cost with zero deliverables.  To avoid this, put some matrix algebra or simple calculus questions in your interview, ask for coefficient interpretation, and ask questions about coding in un-sexy languages (SQL).  On difficult questions this group will break.

(As an aside, I've had a few of these people try to gaslight me and then end up yelling at me in an interview.  It's not fun to be yelled at, but when it happens, I know I've dodged a bullet by calling someone out.)

Experienced Statisticians

These candidates are more advanced in their careers and often shun the term Data Scientist.  They may be less striking at first, and certainly have less flash than a 25-year-old machine learning expert, but they add a ton of value to your organization.
  • These candidates generally build models, often outperforming data scientists' models while using simpler, more elegant methods.
  • They can be great mentors to young data scientists--if junior staff are willing to listen.
  • They often lack machine learning or big data systems experience (e.g., Hadoop).
  • They will also lack some more modern coding/computing skills (e.g., containerization, cloud, etc.).
  • One successful tactic is to use this type of employee on a project team with a machine learning expert and a data engineer.  The data engineer will bring the technical coding skills, the machine learning expert will bring modern methods, and the more experienced employee will bring research design and rigor.
Advice: These candidates are some of the best deals on the market, mainly because they can mentor and "fix" a lot of the missing knowledge of young data scientists.  Younger data scientists tend to have bad habits, or in some cases massive holes in their skillsets and intuition around research design, rigor, probability, and statistics.  More simply put: these employees can often do better work using older methods, and they are great mentors.


Unicorns

These are the classic Data Scientists that many organizations are looking for.  Their traits:
  • 15+ years in building machine learning models.
  • 15+ years building econometric models.
  • Production-level developer who has built massive productionized ML systems.
  • Hadoop/Spark developer.
  • 100x ROI.
  • Virtually non-existent.
I'm being a bit hyperbolic, but these candidates essentially don't exist.  Well, some do, but you may not want to pay the premium involved (it's high).  If you can find one, by all means hire.  On the other hand, you can build an all-star team fairly well by focusing on building blocks.


A few months ago a recruiter called me and asked if I had 15 years of experience in Hadoop.  This is an absurd question, given that Hadoop's first release was in 2006, but it speaks to an underlying truth: many organizations are looking for a Data Science candidate pool that simply does not exist.  I hope the essential takeaway is that you can build a reasonable Data Science team with building blocks based on the talent actually available.  A reasonable team might involve:
  • Directly out of School: 1-2 FTE
  • Couple Years Experience: 2 FTE
  • Shingy Clones: 0 FTE
  • Experienced Statisticians: 1 FTE
  • Unicorns: 1 - if you can find one, but not necessary
As an aside, a team without a "Unicorn" candidate will likely require substantial help from data engineers and some developers to get data into systems and models into production.  This does create some inefficiencies, but is often less expensive than finding a unicorn candidate.