Friday, November 24, 2017

Python versus R: A Porting Project

I have been programming in R for over a decade, and during that time, especially more recently, I have built robust pipelines to create large numbers of machine learning and statistical classification models at a time.  The purpose of these pipelines is to evaluate multiple model types against a single dependent variable (usually in high-dimensional space), quickly determine which works best, and automatically move to the next variable to be modeled.  Like many data science projects, the pipelines include five steps:

  1. Data Cleaning,
  2. Feature Selection,
  3. Running Models,
  4. Model Evaluation, and
  5. Report Production (create a PDF for review by business owners, if they so choose).
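
The "run several model types against one dependent variable, keep the best, move to the next variable" loop can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn and synthetic data; the function name best_model_for_target, the two candidate models, and the targets dictionary are hypothetical stand-ins and are not taken from the actual pipeline described here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def best_model_for_target(X, y, candidates, cv=3):
    """Return (name, mean cross-validated AUC) of the best candidate for one target."""
    scores = {}
    for name, model in candidates.items():
        # Every sklearn classifier is scored through the same call.
        auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        scores[name] = auc.mean()
    best = max(scores, key=scores.get)
    return best, scores[best]


# Synthetic stand-in for cleaned, feature-selected data (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Hypothetical candidate model types to compare.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# A real pipeline would loop over many dependent variables; two toy ones here.
targets = {"target_a": y, "target_b": 1 - y}
results = {
    name: best_model_for_target(X, target, candidates)
    for name, target in targets.items()
}
```

Each entry in results holds the winning model name and its mean AUC for that target, which is the kind of summary a report step could pick up.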

I can write all five of these steps easily in R, and haven't really had problems with this type of modeling. But I also know Python, which has similar machine learning and analytical packages and has been referred to as the "future of Data Science."  I had used these packages before (pandas, numpy, sklearn), but I most often use Python for non-modeling tasks or to access frameworks like Spark.

Two weeks ago, while testing one of my pipelines, I had the idea to port my primary model-building pipeline into Python.  The reasons for this were twofold:
  1. To use and test Python's different data science methods and packages
  2. To make use of Python's flexibility as a programming language (as well as its status as a "real" all-purpose language).

The coding was more difficult at first as I figured out some details of pandas data frames and numpy arrays.  I'm mostly finished now and am just fixing small breaks as I find them.  Generally the models created are nearly identical in quality, with Python maybe showing a slight edge.  Other than that, my initial thoughts:


  1. Categorical handling: If I have a categorical variable in an R data frame and I want to pass it to an R model, I can pass the variable directly to an algorithm, and R efficiently creates the numerical encoding on the fly without user intervention.  Python, however, generally requires a preprocessing step to map the categorical into one binary column per level.  There are drawbacks to both methods:
    1. R is like an "automatic transmission": it is less work for the user and makes the data frame in memory easier to manipulate.  On the other hand, when using this method, some R methods force all levels of a categorical variable (minus one) into an algorithm, when sometimes optimal models would feature-select to far fewer (some models handle this, some don't).
    2. Python is more of a "manual transmission" situation, where the user has to intervene to decide on a categorical encoding strategy (e.g. pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder()).  This ends in more work for the user and massive data frames, but allows for more control of feature selection (in some algorithms) at run time.  (This is actually a problem I've seen in R for quite some time, and through being less developed in this space, Python has "solved" the problem.)
  2. Different algorithms:  Python is not a primary language for statisticians and research data scientists (it is newer to the game), which leaves it a bit behind the curve for algorithm availability.  One example of a case I found missing is the shrunken centroids model, which I have found useful in a few specific types of classification.
  3. Some models run faster: When I run a model in R versus Python I get similar results within tolerance, except that the Python models tend to fit much faster on my hardware.  As a test I ran XGBoost in both systems.  The models were substantially similar (AUC = .713 vs. AUC = .716); however, the Python version finished in 3 seconds versus 32 seconds in R.  Both were still under a minute, and this may not seem substantial, but inside an analytics pipeline where you may be building a few thousand models, the timing difference, multiplied across all of those models, becomes substantial.
  4. More consistency between models: R is a bit of the "wild west" in terms of consistency, both in model parameters and in model object outputs.  For direct comparison of models (or to run different model types under similar parameters) one often has to rely on third-party packages like "caret" or "broom."  This makes R's advantage in packages and model types less than ideal, in that traversing those model types is not straightforward.  Generally in Python's sklearn I can count on classifiers of similar types to give me similar output objects and methods.
  5. Some things don't work at all: I've had more issues in Python with certain functions not working *out of the box* as stated in the documentation, though many of these seem to be fixed in later bug-fix releases.  I *think* this is likely because sklearn is still a package under active development.
  6. Plotting: To be honest, I'm still figuring this one out.  Matplotlib appears to be the preferred plotting strategy in Python (though there is a Python version of ggplot), but honestly rewriting all my diagnostic plotting strategies (and getting labels, titles, axes, and legends correct) has been one of the biggest pains in this entire process.  It's difficult to determine whether Python is actually more difficult, or if it's just painful because I've spent several years developing my own plots in R.
  7. Object Oriented: Python has a somewhat more straightforward syntax as a programming language, and my code for the Python pipeline is more object oriented and, quite honestly, better coded than what I have in R.  That said, the whitespace and syntax requirements in Python took some getting used to versus my "I do what I want" attitude of coding in R.
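
To make the "manual transmission" point from item 1 concrete, here is a small sketch of the two encoding routes named above, pandas.get_dummies() and sklearn.preprocessing.OneHotEncoder(). The toy data frame and its column names (color, size) are made up for illustration. The drop_first=True variant is included only because it roughly mirrors the "all levels minus one" behavior described for R; treat that comparison as an approximation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one categorical column, one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
})

# Route 1: pandas expands the categorical into one binary column per level.
dummies = pd.get_dummies(df, columns=["color"])

# Route 1b: drop the first level, similar in spirit to "levels minus one".
dummies_minus_one = pd.get_dummies(df, columns=["color"], drop_first=True)

# Route 2: sklearn's encoder learns the levels in fit() and can reuse them
# later on new data; handle_unknown="ignore" encodes unseen levels as zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["color"]]).toarray()

print(list(dummies.columns))
print(list(dummies_minus_one.columns))
print(encoder.categories_[0])
```

The wider frames this produces are the cost mentioned above; the benefit is that each level is now its own column, so a feature-selection step can keep or drop levels individually.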


Overall, both platforms have advantages and disadvantages.  My takeaways are these:
  • R is likely better (in the short-term at least) for data exploration and manual or "academic" model builds due to relative ease of coding and availability of models and methods.
  • Python may be better for large-scale model builds where speed and consistency between models are necessary (and also if you have an aversion to hearing the term "tidy").
