Sunday, September 6, 2020

R versus Python: A Practical Paradigm for Choosing Your Production Language

As a Data Scientist fluent in both Python and R (and to a lesser degree, a few other languages), I'm often tasked with deciding which to use for a specific project or script.  That decision can be tough, but it can generally be made pretty quickly:

  1. Do the specific outcomes of this task make one option obviously better?  Example: if the outcome is a plot, I'll choose R for the ease of the ggplot interface.   
  2. Do the players on this task make one option obviously better?  Example: the data engineer involved is a Python person, let's keep it all in Python.
  3. Does the data science work (modeling) tend towards one language?  Example: the solution requires heavy econometric modeling, panel modeling, or time-series analysis, I'll generally choose R.

But one use case that often doesn't offer a clear decision is the productionization of Data Science processes into microservices.  Both R and Python have tools to handle this, but how do we decide the correct option for each case?  Here I offer a quick survey of the tools in each language, and my current paradigm for making the final choice.



I have blogged a few times about pushing R into production to solve a variety of problems.  Is it possible? Absolutely. It has been done multiple times, mainly by data scientists.  What does that environment look like?

  1. Application code: Normal R applications using functions, classes and code from the 10K CRAN packages.  I recommend avoiding Tidyverse and pipes due to issues with debugging in production.
  2. Database connectivity: R has several packages for db connections. I generally use RODBC or RJDBC depending on the use case, but DBI can also be implemented in prod.
  3. Serving the application: Depending on what you want to do there are a couple of options.  To easily spin up a quick REST API, you can use the plumber package.  If you want a sturdier solution, Rserve provides a robust binary R server.

General thoughts:  As with most things R, building and running predict() functions on models is straightforward.  plumber is a very simple system to set up and run a quick REST API, but that API is usually single-threaded, which isn't appropriate for production.  Rserve solves this problem on unix systems with forked processes, but is also esoteric in implementation, and getting to a REST system is more difficult.



I have set up production systems in Python a few times, solving a variety of problems, mainly data engineering tasks.  It is possible, and actually happens far more broadly than in data science alone.  Many microservices for general web development are also built in Python and a lot of people work on them, which gives us a much broader set of tools.

  1. Application code: In this case we would use standard Python Data Science application tools, including but not limited to pandas, numpy, and scikit-learn.  Though this tool set is somewhat inferior to the 10,000 packages on CRAN, it does give us a solid base to complete 99% of Data Science tasks.
  2. Database connectivity: Python also has several options to connect to a database server in production, including pyodbc for ODBC connections and PyMySQL, a pure-Python MySQL client.
  3. Serving the application: As I stated before, because Python is used for many other potential application microservices, we have a rich toolset to choose from.  While there are other options, we can use a combination of Flask (API), nginx (webserver) and gunicorn (WSGI server that manages the worker processes).

General thoughts: While Python's data science toolset is a bit limiting, especially if tending towards econometric or traditional statistical models, there are other big advantages.  Specifically, with a rich microservice framework, we can very simply set up REST APIs, serve them with robust webservers, and manage heavy and embarrassingly parallel workloads.
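
To make the serving layer concrete, here is a minimal sketch of the kind of callable that gunicorn runs. It uses a bare WSGI function from the standard library as a stand-in for a Flask app (Flask apps expose the same WSGI interface). The predict() function, the JSON payload shape, and the module name are all assumptions made up for illustration, not part of any real service.

```python
import json


def predict(features):
    """Hypothetical stand-in for a trained model's scoring function."""
    return sum(features) / len(features) if features else 0.0


def app(environ, start_response):
    """WSGI callable: expects a JSON body such as {"features": [1, 2, 3]}."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
        body = json.dumps({"prediction": predict(payload.get("features", []))})
        status = "200 OK"
    except (ValueError, TypeError, AttributeError) as exc:
        # Malformed JSON or an unexpected payload shape becomes a 400.
        body = json.dumps({"error": str(exc)})
        status = "400 Bad Request"
    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

Saved as a hypothetical service.py, this could be started with something like `gunicorn --workers 4 service:app`, with nginx proxying requests to it. The multiple worker processes are what handle the parallel workloads mentioned above.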


Making the Decision

When weighing the relative feature sets of R and Python for production, we end up in a prioritization wash.  If ease of setting up robust APIs is a priority, then Python wins easily (yes, plumber is easy, but it has issues with workloads).  If access to the thousands of algorithms and packages in CRAN is a priority, then a move to R is more appropriate, while biting the bullet on Rserve.

But, as with most things, there are some external consequences we need to consider, and I suggest a new paradigm for choosing a production language:  supportability.  If I personally can support R and Python packages that's great, but in truth, the Data Science team is not the group supporting prod.

More often, that falls to a DevOps team to quickly figure out what's gone wrong, sometimes at odd hours.  From my experience, 90+% of DevOps teams have Python experience, and approximately 3% have R experience.  If R produced reasonable error messages that would be less of a problem, but its errors are often nonsense to non-R users.  As a result, my quick paradigm for Data Science production:

While R and Python come with similarly rich feature sets for production code, in cases when either can be used, Python is often to be preferred as it will ease support from DevOps and other technical resources.

Thursday, January 2, 2020

Data Science Job Market

A few weeks ago I was thinking about the state of the current Data Science job market, and, a bit frustrated about inquiries by job seekers on LinkedIn, I hastily sent out a tweet with my thoughts.

As the weeks have gone by (and my employment situation has become a bit weird, in good ways), I've often thought about that tweet.  Essentially, since the day I said this, I've noticed it to be more true than I originally thought, especially on the Shingy front.

I thought it might be good to put together some more in-depth thoughts on the different types of Data Science candidates I see on the market and how they relate to real roles in companies. As a result, this blog can serve as a helpful guide in building a data science team at your organization.  I've gone into more detail on the five types of candidates below.

Directly Out of School

The trademark of these candidates is having recently completed grad school with no or limited work experience.  A few resumes will have a history of internships, or will try to pass off class projects as "jobs they worked." My general thoughts:
  • There are a lot of these candidates-a lot.
  • They harass Data Science managers regularly on social media, LinkedIn and email.
  • Many of them are not yet talented, and are going to take a lot of work before they're tackling projects on their own.
  • Some of them have not been screened for what I call "employability" and are unlikely to survive the rigor, rules and norms of the workplace.
  • Be careful of buzzword slingers.
Advice: You can take on a few of these and they can add value for your team over time.  But you can and should screen heavily before choosing your candidate.  The sheer number of these candidates on the market gives the employer the luxury of being picky-and because many of these candidates come untested and without serious references (professors are not real references) you need to use your best BS detecting.

Couple of Years Experience

These candidates have been out in the job market for one to five years and generally with one or two employers.  Resumes will generally have real projects worked on listed, though being a junior-level employee, you don't know what their actual role involved.  My thoughts on this group:
  • There are quite a few in this group as well.
  • The loudest ones are usually in the process of failing out of a first job.
  • A lot of them are well on their way to a great career, though.
  • Their skills may be largely defined by the experience of their first job-so there will still be a big training job ahead. 
  • It is very important to determine *exactly* what their roles are on teams and projects. Beware of Tableau jockeys and spreadsheet analysts pumping their resume.
Advice: These employees are generally a great investment, though you have to be very careful not to be picking up another organization's castoffs.  They can begin to take on larger projects on their own, or serve as junior mentors to the first category of employees.  This group should be seen as your bridge to the future Data Science team, the group that will be your Senior and Principal data scientists within 2-5 years.

Shingy Clones

First, who's Shingy?  David Shing, AOL's self-described "Digital Prophet."

These candidates are full of hot air, very much hyped on Data Science as a concept, and on other hyped concepts too (have fun getting them to talk about blockchain).  The dark side of course is that they have no Data Science abilities and are just low-rent hype people.
  • They will come with a lot of energy and enthusiasm, which can be hypnotizing, especially for executives.
  • These people are the definition of why the interview process is critical.
  • They are completely destructive if you hire one. They will always be hyping and saying we need new technology or to do "x".  However, they don't have the skills or knowledge to understand what they are suggesting or how to deliver.
  • They have no clue what they are talking about.
Advice: You can't hire these people.  They are going to be a high cost with zero deliverables.  To avoid this, put some matrix algebra, simple calculus, or coefficient interpretation questions in your interview.  Ask them questions about coding in un-sexy languages (SQL).  On difficult questions this group will break.

(As an aside, I've had a few of these people try to gaslight me, and then end up yelling at me in an interview.  It's not fun to be yelled at, but when this happens, I know I've dodged a bullet in calling someone out.)

Experienced Statisticians

These candidates are more advanced in their career, and often will shun the term Data Scientist.  They may be less striking at first, and certainly with less flash than a 25-year-old machine learning expert, but add a ton of value to your organization.
  • These candidates generally build models, often outperforming data scientists' models while using simpler, more elegant methods.
  • They can be great mentors to young data scientists, if junior staff are willing to listen.
  • They often lack experience with machine learning or big data systems (e.g. Hadoop).
  • They will also lack some more modern coding/computing skills (e.g. containerization, cloud, etc).
  • One successful tactic is to use this type of employee on a project team with a machine learning expert and a data engineer.  The data engineer will bring the technical coding skills, the machine learning expert will bring modern methods, and the more experienced employee will bring research design and rigor.
Advice: These candidates are some of the best deals on the market, mainly because they can mentor and "fix" a lot of the missing knowledge of young data scientists.  Younger data scientists tend to have bad habits, or in some cases massive holes in their skillsets and intuition around research design, rigor, probability, and statistics.  More simply put, these employees can often do better work using older methods, and they are great mentors.

Unicorns


These are the classic Data Scientists that many organizations are looking for.  Their traits:
  • 15+ years in building machine learning models.
  • 15+ years building econometric models.
  • Production-level developer who has built massive productionized ML systems.
  • Hadoop/Spark developer.
  • 100x ROI.
  • Virtually non-existent.
I'm being a bit hyperbolic, but these candidates essentially don't exist.  Well, some do, but you may not want to pay the premium involved (it's high).  If you can find one, by all means hire.  On the other hand, you can build an all-star team fairly well by focusing on building blocks.


A few months ago a recruiter called me and asked if I had 15 years of experience in Hadoop.  This is an absurd question given that Hadoop's first release was in 2006, but it speaks to an underlying truth:  many organizations are looking for a Data Science candidate pool that simply does not exist.  I hope that the essential takeaway allows you to build a reasonable Data Science team with building blocks based on the talent actually available. A reasonable team might involve:
  • Directly out of School: 1-2 FTE
  • Couple Years Experience: 2 FTE
  • Shingy Clones: 0 FTE
  • Experienced Statisticians: 1 FTE
  • Unicorns: 1 - if you can find one, but not necessary
As an aside, the model without "Unicorn" candidates will likely require substantial help from data engineers and some developers in order to get data into systems, and models into production.  This does create some inefficiencies, but is often less expensive than finding a unicorn candidate.  

Saturday, July 7, 2018

A Suggested Curriculum for Undergrad Data Science

It's intern season which basically means two things for me:
  1. I feel really old in the office.
  2. I take a lot of meetings with young people wanting to be data scientists.

The meetings generally have this feel:

Hi, I have interest in becoming a data scientist and I want to get your perspective on what that might take, let's meet for coffee.
The interns who set up these meetings come from a wide range of backgrounds and skill levels.  Many times they are interested in which classes they should take to be competitive for data science roles after they graduate.  I thought it would be helpful to put the advice I give them into a blog post so more people can read it.

I reviewed the current course offerings at a few schools, and settled on a list of 14 basic classes (and some electives at the end) that I think should be part of every data science curriculum.  Here is my list:


  • Calculus I, II, III
  • Differential Equations
  • Linear Algebra
Some data scientists say you can skimp on these requirements, but that's universally false.  If you don't understand matrix algebra or differentials you simply can't understand the algorithms we implement.  The data scientists I see who struggle with higher level math fail in understanding the operations of complex algorithms, which leads to failure in implementation.


  • Intro to Stats
  • Calculus Based Stats
  • Generalized Research Method Class 
  • Econometrics
Although many modern data science algorithms are not based purely in statistics, the concepts of risk, certainty, and confidence in these classes are key to understanding predictive modeling in general. Having worked with data scientists whose background is computer science only, I now see a background in stats as key to the fundamentals of making predictions.

Econometrics may seem like an outlier, but there are concepts of predictive modeling such as time-series analysis, dealing with collinearity, endogeneity and autocorrelation which are best taught in the context of econometrics.


  • Programming I, II
  • Data Structures
  • Fundamentals of Computer Algorithms
  • Introduction to Database Systems
Sometimes I question the value of formal education in coding, since some of the best programmers I know have degrees in non-computer fields.  That said, computer science is still a core skillset for data scientists, and is required knowledge to be hired by someone like me (if you have the skills from another source that's great, just figure out a way to demonstrate them in an application or an interview).


As for electives in the data science space, these should be modeled towards what specifically you want to do.
  • If you want to go into business, take classes in economics, business operations, and accounting.
  • If you want to go into algorithm development, focus more time in advanced computer science classes.
  • If you want to go into academic research, focus on whichever academic discipline you are most interested in.


This is intended to be a reasonable list of classes for young people interested in data science.  It serves two purposes really:
  • Provide a framework for undergrads looking to become data scientists.
  • Prevent me from saying things I later regret when confronted by students who want to be data scientists without taking math.

Sunday, March 4, 2018

A Common R Mistake: R Factor-Numeric Conversions

For the most part the R statistical system is a robust and fast way to quickly execute statistical analyses. Other times, the annoyances and "tricks" it holds for more junior analysts lead me to encourage new analysts to opt for Python instead.

One of the biggest tricks inside of R for junior analysts involves a specific data type called "factors," attempted type conversion, and a sometimes difficult to detect programming issue.


Factors are a data type specific to R that helps statisticians deal with categorical data.  In CS terms, factors help statisticians deal with non-numeric low-cardinality variables.  In most statistical processes this type of variable will be converted to binary dummies, so their storage in situ is less important.

Here is an official description from Berkeley's R documentation regarding the storage of factors.
What does this actually mean?  When storing a factor, R strips out all of the actual text, replaces it with index numbers corresponding to the textual values, and stores the index numbers instead.  This both saves space in data frame storage and logically makes sense in the way these are used by statisticians.
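
For readers more at home in Python, the storage scheme can be mimicked in a few lines. This is only an emulation of the idea; make_factor is a made-up helper, not how R implements factors internally.

```python
def make_factor(values):
    """Mimic R's factor(): sorted unique labels plus 1-based integer codes."""
    levels = sorted(set(values))  # R also sorts the levels by default
    index = {label: i + 1 for i, label in enumerate(levels)}
    codes = [index[v] for v in values]
    return levels, codes


levels, codes = make_factor(["low", "high", "mid", "low", "high"])
print(levels)  # ['high', 'low', 'mid']
print(codes)   # [2, 1, 3, 2, 1]
```

The text lives once in levels, and each row only carries a small integer pointing into it.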

And this process is mostly invisible to the user for *most* processes...

That is, until you try to convert a factor to something else.


This system works fine, until you need to convert that data to something else.  And here's the key instance where I've seen that occur:  Let's say that you're importing some data that you're not entirely familiar with, so you run something like this to import and inspect your data:

We see a data frame with 3 columns: "x" appears to be an index, and "b" is just a simple numeric field.  But "a" is weird.  It looks like numbers, but for some reason R thought it was a factor.  This is where the mistake starts:

  • Junior analyst converts this value directly to a number with as.numeric() (the equivalent cast works in many other programming languages, and in the SQL that is often used by data scientists) and continues on with their day.
  • Three hours later the junior analyst (who may be a bit unfamiliar with the business problem to be solved) turns in a work product that has completely bizarre results and is confusing to the business-they must be wrong. So what happened?


Let's split the process apart and see what actually happens when you as.numeric() a factor.
If we create a new column in our data set containing the type-converted data we see:

Wait... what?  This now seems to be *correlated* with our original column, but with completely different values.  Here's the trick:
'When factors are converted to numeric using as.numeric() it pulls the underlying index numbers and not the actual values, even if that actual value appears to be a number.'
Essentially: Even though column 'a' looks like numbers, R ignores that and pulls an internal ID number that R uses as a backend lookup.  This can be deceptive, especially when your level of missingness is relatively low after the type conversion.  Adding to the confusion, expected correlations generally hold up after the conversion because the index numbers are ordered; it's simply the magnitude and variance that change.
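
The same trap can be emulated in Python to see both conversions side by side. The two helper functions are hypothetical stand-ins that copy what R's as.numeric(f) and as.numeric(as.character(f)) do; the sample values are made up.

```python
def as_numeric_factor(levels, codes):
    """What as.numeric(f) gives in R: the index codes, not the labels."""
    return [float(c) for c in codes]


def as_numeric_via_character(levels, codes):
    """What as.numeric(as.character(f)) gives: the labels, parsed as numbers."""
    out = []
    for code in codes:
        try:
            out.append(float(levels[code - 1]))
        except ValueError:
            out.append(None)  # R would produce NA here, with a warning
    return out


raw = ["10", "2", "33", "2", "a"]           # one stray 'a' makes the column a factor
levels = sorted(set(raw))                   # ['10', '2', '33', 'a'], sorted as text
codes = [levels.index(v) + 1 for v in raw]  # [1, 2, 3, 2, 4]

print(as_numeric_factor(levels, codes))         # [1.0, 2.0, 3.0, 2.0, 4.0]
print(as_numeric_via_character(levels, codes))  # [10.0, 2.0, 33.0, 2.0, None]
```

Note that the levels are sorted as text, so "10" sorts ahead of "2". The codes follow label order, which only sometimes matches numeric order.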


Fixing the problem is easy: you simply convert to character (as.character()) before converting to numeric. This conversion uses the actual data values and gets rid of our index numbers. But what if you want to know why your variable was converted to factor in the first place by read.csv()? I've written the following function to check which incoming values natively fail numeric conversion:

The function finds that your numeric column of data also includes values 'a' and 'b' which are preventing numeric conversion.  Let's say now you realize the issue, and are aware that 'a' and 'b' should be converted to 0.  You can easily make this conversion after forcing the values to numeric, but first converting to character, like so:

Now we see the column 'a_better' seems to directly represent the original values in 'a'.

The combination of these functions makes it easy to:
  • Avoid our initial type conversion issue.
  • Discover why our data that was assumed numeric is not all numeric, and DO SOMETHING about it.
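
The same two steps carry over to other languages. A minimal Python sketch, assuming the column arrives as text: the function names are made up for illustration, and this is not the R function described above.

```python
from collections import Counter


def _is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def non_numeric_values(column):
    """Count the values that would block a clean numeric conversion."""
    return Counter(v for v in column if not _is_number(v))


def to_numeric(column, default=0.0):
    """Convert text to floats, replacing unparseable values with a default."""
    return [float(v) if _is_number(v) else default for v in column]


col = ["10", "2", "a", "33", "b", "a"]
print(non_numeric_values(col))  # Counter({'a': 2, 'b': 1})
print(to_numeric(col))          # [10.0, 2.0, 0.0, 33.0, 0.0, 0.0]
```

Replacing with 0 is only right when, as in the example above, you know that is what the stray values mean.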


To finish this up I thought I would give two examples of times when I've almost been burnt by this functional weirdness in R.

Scenario One
I was analyzing a dataset that had an interesting distribution-it was monetary data, but rounded to the nearest dollar, and involved integer values from -1 to 250-with some higher outliers.  Remember that as.numeric() replaces a factor scale with an integer index starting at 1.  The dataset also included some NULL values, represented by the word NULL (this is how the Python-Spark export created the data).

When I downloaded and imported the data it initially came in as factor, and (not thinking) I simply forced the type conversion.  This had the effect of creating NA from the prior NULLs, which I knew were assumed 0's and fixed with a simple df[] = 0 statement.  The problem was that now my scale was shifted approximately two values higher due to the initial distribution, but the variance was still the same, the percent of 0's was reasonable, and generally the data still looked reasonable.

After about an hour of working with the data, I noticed that I was a bit too far off of control totals I had run in PySpark, and backed into my problem, fixed and moved on.

This speaks to a major risk in the factor conversion problem: when the dataset is made up of integers very near zero, the error is difficult to detect.  

Scenario Two
In scenario two I was dealing with geospatial data, a polygon shapefile at the zip code level (what our external vendor could handle).  I had crossed it with a few massive 'points layers' and was creating an analysis of output zips using some fairly massive distance and customer travel pattern analytics.  At one point I needed to link the zip codes up to some additional zip code based data, but the join failed because the zip codes were factors. 

Knowing I was only dealing with zip codes in the United States, I quickly used as.numeric() without thinking.  In this case (if you know about zip codes you can imagine what happened) the new factor levels led to effectively a scramble join.  I would have missed this completely, except that my last step involved visualizing the zips in a nationwide map, which looked completely random.

The point of this anecdote: as usual, visualizing data can be a powerful check against otherwise undetectable coding mistakes.


Factors in R can be a powerful statistical tool, but under a few scenarios in type conversion, they can cause issues.  This blog post provided:
  • A general description of the issue.
  • A couple of methods including a function to find non-numeric values in a factor.
  • Some warnings of difficult-to-detect errors.

Saturday, February 3, 2018

Twitter Blocking: Fast-Scan Social Algorithms

Summary Points:
  • Over 3100 people block my personal account on Twitter.
  • I created an algorithm to crawl Twitter and find people who block me.
  • Those people appear to be mainly aspiring authors and members of the anti-Trump "resistance."

When I go on Twitter, I'm often confronted with this view:

Trying to see the unavailable tweet, I click through and see this:

I am blocked by thousands of people on Twitter.  When I tell people about this publicly they often react by thinking I must be the world's most massive troll (but I'm not).  Most of these "blockers" are accounts that I've never interacted with. They have essentially blocked me either categorically or because I'm part of a massive block list.

Blocking has become a major part of the user experience on Twitter for many reasons, and that's largely out of the scope of this blog. To understand how someone ends up blocked like me, you should understand two products:

  • Block Together:  A program that gives users the ability to share block lists and otherwise categorically ban accounts.
  • Twitter Block Chain: This (poorly named) Chrome extension is used to block all followers of a specific account.  For instance, it could be used to block anyone who follows @realDonaldTrump.
After a day when I found a couple of random accounts blocking me, I realized there was a Data Science angle, specifically:
  • Can I use an algorithm to scan Twitter and find accounts that block me?
  • Can I optimize the algorithm with Machine Learning to predict accounts likely to block me, and so make my initial algorithm find blockers more quickly?
The answer to these questions ended up being "yes" and "yes."  Here I'll first describe the results, and then the data science method behind them.


To date, I've used an algorithm to find about 3100 people who block me on Twitter.  Releasing the full list would seem to be doxxing. However, the public nature of which "verified" accounts block me seems less problematic.  Here's a listing with evidence of some "celebrity blocks":

TL;DR version: A guy who invented the term "vice signaling," a liberal writer, another liberal writer, a standup comedian, and a former Star Trek actor.

Digging into the data around who blocks me, we can generally describe the nature of people who tend to block me by comparing the words on their profiles to the words on the profiles of people who *don't* block me.  Below is a list of those words and their indication of risk.

A score of 1.0 means the word neither increases nor decreases the probability that a user will block me; a score of 2.0 means users with that word are twice as likely to block, whereas a score of 0.5 means half as likely.  There are two effective dimensions to the majority of people who block me:
  • People who are part of the liberal "resistance" to Donald Trump (which is a bit bizarre because I'm not a Trump supporter-though I was very anti-Bernie Sanders).
  • "Geeky" authors and writers. This speaks to the Wil Wheaton theory of my epic blocked status (see below).
One outlier word is the term "lyme" which turned out on inspection to relate to people who believe they have "Chronic Lyme Disease," a condition covered by Evidence Based Medicine proponent Orac over on his blog.
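
One way to compute a score with that interpretation is a simple relative risk: the block rate among profiles containing a word, divided by the overall block rate. The sketch below assumes that formula, which may differ from the exact calculation behind the list above. The function name and sample profiles are made up.

```python
from collections import defaultdict


def word_risk_scores(profiles, min_profiles=2):
    """profiles: list of (description, blocked) pairs -> {word: relative risk}."""
    base_rate = sum(bool(blocked) for _, blocked in profiles) / len(profiles)
    if base_rate == 0:
        return {}  # nobody blocks: relative risk is undefined
    seen = defaultdict(int)          # profiles containing the word
    blocked_with = defaultdict(int)  # of those, how many block
    for text, blocked in profiles:
        for word in set(text.lower().split()):
            seen[word] += 1
            blocked_with[word] += int(bool(blocked))
    return {w: (blocked_with[w] / seen[w]) / base_rate
            for w in seen if seen[w] >= min_profiles}


profiles = [("geek writer", True), ("geek dad", False),
            ("dog lover", False), ("dog dad", False)]
scores = word_risk_scores(profiles)
print(scores["geek"], scores["dog"])  # 2.0 0.0
```

The min_profiles floor keeps rare words, which produce wild ratios from one or two profiles, out of the list.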

And because I know everyone loves wordclouds (sarcasm), on a recent long social scan, I created wordclouds of people who block me.  Here's what that looks like, first looking at those who block me:

And those who do not:

Apparently, unbeknownst to me, I am loved by dog/pet lovers (potentially an artifact of sampling, rescan method-but few of the dog-lover accounts blocked me).


To begin my methodology, I started with a bit of a priori theory, specifically: what are the incidents that led to my blocking?  I'm blocked by people across the political spectrum, from conservative politicians in my own state to "resistance" members that are famous nationwide, and an odd number of science fiction writers (I'm not a fan, don't care).  Here were my best theories of my own blocking:

  1. I was added to Wil Wheaton's block list after telling him to "Shutup Wesley" one too many times.
  2. I got into a few arguments that led to individual blocks with Bernie bros when I was pointing out some misconceptions they had about tax policy.

Wheaton's block list, together with the Block Together app, seemed the most likely explanation for widespread blocking.


Figuring out that someone blocks you via the Twitter API is actually super easy. My initial scanner ran all inside of R, then I ported it to Python using the tweepy package.  The basic steps:

  1. Try to download a user timeline for your target user.
  2. Catch all errors. 
  3. Check to see if the returned error is code 136.

Here's what that looks like in general Python:
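
A minimal sketch of those three steps follows. The timeline call and the exception type are stand-ins: in a real script fetch_timeline would wrap tweepy's user timeline call, and the exception class and the way it exposes the error code depend on the tweepy version, so ApiError here is hypothetical.

```python
BLOCKED_CODE = 136  # the error code named in step 3


class ApiError(Exception):
    """Hypothetical stand-in for the client library's exception type."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def is_blocking_me(fetch_timeline, user_id):
    """True if blocked, False if the timeline loads, None if it failed some other way."""
    try:
        fetch_timeline(user_id)            # step 1: try to pull the timeline
    except ApiError as err:                # step 2: catch the error
        if err.code == BLOCKED_CODE:       # step 3: is it the "blocked" code?
            return True
        return None                        # protected, suspended, rate limited, ...
    return False
```

Returning None for other failures keeps protected or suspended accounts from being counted as non-blockers.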

Note: This code is simply for a single user.  The entire application I wrote is a full iterative loop in Python (search for users, check block status, recursively search for more users, model, repeat) with a Microsoft SQL Server backend. I am not publishing it in whole at this time, for various reasons.


Knowing that our end goal here is to optimize an algorithm which will tell us which users to check for blocking, we first need to gather an informed set of accounts that will tell us which types of users block us.  Using the a priori theory of why users may be blocking me, I turned to two places:
  1. The followers of the account for the app "BlockTogether"-this may create an artificially high incidence rate-but hey at least I know they follow me.
  2. Followers of Wil Wheaton-because I know he blocks me, and also shares his block list, so there's a fair chance that his followers would buy into this.
I used this strategy to get an initial subset of users to figure out what types of users might block me.  So I started there, using the Twitter API to download all the Twitter ID's associated with Wheaton's followers (reduced Python below-I'm showing the basic blocks in segments here for ease of presentation). 
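
As a hedged sketch of the download loop: Twitter's follower ID endpoint is cursor-paginated, starting at cursor -1 and finishing when the returned next cursor is 0. The fetch_page callable below is a hypothetical wrapper around that API call, assumed to return a (list of ids, next cursor) pair. Rate-limit handling is left out.

```python
def collect_follower_ids(fetch_page, max_pages=None):
    """Walk a cursor-paginated follower list and return every ID seen."""
    ids, cursor, pages = [], -1, 0
    while cursor != 0:                      # 0 means there are no more pages
        page_ids, cursor = fetch_page(cursor)
        ids.extend(page_ids)
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break                           # stop early, e.g. to respect a rate limit
    return ids


# Fake two-page response standing in for the API.
fake_pages = {-1: ([101, 102, 103], 10), 10: ([104, 105], 0)}
print(collect_follower_ids(lambda cursor: fake_pages[cursor]))
# [101, 102, 103, 104, 105]
```

For an account with millions of followers, max_pages is what lets the crawl stop and resume around the rate limit.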


Why would I even need to model my blockers? A few reasons:
  1. I don't have a list of all Twitter accounts, so I need to predict whose followers might block me at high rates.
  2. These checks rely on the Twitter API, which is rate limited, so it's of great advantage to rank-order accounts in terms of likelihood to block.
  3. Understanding which people block me gives me insight into the 'why'.
As such I began creating a model against Twitter profile data to determine which accounts are likely to block me.  First, what data elements will I have available?  The basic elements available are what is pulled from a profile scan of the Twitter API: 

  • Verified Account?
  • # of followers
  • # you follow
  • date created
  • # of favorites
  • # of statuses
  • Location
  • Name
  • Description (text description on your profile)
Most of these are fairly easy to analyze, with the exception of the text description (of course, the richest source of actual attitudinal data, which is likely to predict who you may block). I can't push a text description directly into a ML model, so I took two strategies in creating numeric data from textual data:
  1. Compute relative frequency between blockers and non-blockers of high incidence words, and then create binary dummies for all words with a skew significant at p < .05 (measured by binomial test).
  2. Use a type of Natural Language Processing (NLP) to compute correlated topic models across the description data; create a variable for each observed topic, and assign the probabilistic association for each user description to fill in the data for each topic.
Data was clean at this point and essentially ready for modeling.  Dependent variable was whether or not I was blocked, N was all accounts "checked" to date, independent variables are as listed above.
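
The first text strategy can be sketched with the standard library alone. This assumes an exact two-sided binomial test of each word's block count against the overall block rate; the exact test setup, thresholds, and function names here are assumptions made for illustration.

```python
from math import comb


def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: sum of outcomes no likelier than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


def skewed_words(profiles, alpha=0.05, min_profiles=5):
    """Words whose block rate differs from the base rate at the given alpha."""
    base = sum(bool(b) for _, b in profiles) / len(profiles)
    seen, hits = {}, {}
    for text, blocked in profiles:
        for w in set(text.lower().split()):
            seen[w] = seen.get(w, 0) + 1
            hits[w] = hits.get(w, 0) + int(bool(blocked))
    return sorted(w for w, n in seen.items()
                  if n >= min_profiles
                  and binom_two_sided_p(hits[w], n, base) < alpha)


def dummies(text, words):
    """One 0/1 indicator per selected word for a single profile description."""
    tokens = set(text.lower().split())
    return [int(w in tokens) for w in words]
```

Each profile description then becomes a row of 0/1 columns that a tree model can consume directly.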

I tested several types of models (read: ran and evaluated them using R's caret package).  Extreme gradient boosted trees (caret's xgbTree) ended up winning (though there wasn't a huge gain).  AUC = .85.
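
For readers unfamiliar with the metric, AUC is the probability that a randomly chosen blocker gets a higher model score than a randomly chosen non-blocker, with ties counting as half. A small pairwise sketch of that definition follows; it is fine for illustration but too slow for large samples, and the toy scores are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of .85 therefore means the model ranks a true blocker above a non-blocker about 85% of the time.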

Variable importance was interesting:

As expected, textual data was most predictive with several other items also providing important information to the prediction.  Note: many of these variables have complex interactive effects inside our XGBTree, so it's not a simple matter of "more followers==more likely to block."


What is the value of modeling this data?  Because the Twitter API is rate limited I can only check a certain number of accounts per day, so it's good for me to use my rate limit wisely.  By predicting which accounts are most likely to block me, I can simply prioritize those and thus increase my find rate.  Using this prioritization at first increased my positive block rate by about 3x, from about 0.1% to 0.3%.
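
In code, the prioritization is little more than a sort and a cutoff. A minimal sketch, assuming each candidate account already has a predicted block probability and that daily_budget is however many checks the rate limit allows; both function names and the sample numbers are hypothetical.

```python
def pick_accounts_to_check(predicted, daily_budget):
    """Return the accounts with the highest predicted block probability."""
    ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
    return [account for account, _ in ranked[:daily_budget]]


def find_rate(checked, blockers):
    """Share of checked accounts that turned out to block."""
    if not checked:
        return 0.0
    return sum(account in blockers for account in checked) / len(checked)


predicted = {"a": 0.02, "b": 0.40, "c": 0.10, "d": 0.01}
todays_checks = pick_accounts_to_check(predicted, daily_budget=2)
print(todays_checks)                    # ['b', 'c']
print(find_rate(todays_checks, {"b"}))  # 0.5
```

Comparing find_rate on model-ranked batches against unranked batches gives the kind of lift figure quoted above.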


Far above I covered where to start with a large number of accounts to check, deciding on followers of Wil Wheaton and the BlockTogether follower list.  But once I've checked all of these accounts, where should I go?  Here I implemented three strategies:

  • I found a relationship between an individual's propensity to block and their followers' propensity to block.  A quirk of the Twitter API is that while blocked from seeing a user's timeline, I can still pull all of their followers.  So, I began iteratively pulling all the followers for each block I found.
  • I also found a relationship between the probability that you block me and the incidence at which your followers will block.  As such, I began pulling the followers for all high-probability blocking accounts (as determined by the model above), regardless of whether they actually blocked me.
  • The Twitter API can find users by the words on their profile.  Because I already know which words (Lyme, Geek) are associated with blocking me, I simply search Twitter for those words.
This iterative method can lead to a biased sample over time, so it's important that you also include some random accounts in your sample OR some that deliberately deviate from your current pull.
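
One way to guard against that bias is to reserve a fixed share of every batch for accounts drawn at random. The sketch below is a hedged illustration in Python: the 20% `explore_share`, the function name, and the assumption that the targeted and background pools do not overlap are all mine, not part of the original process.

```python
import random


def next_batch(targeted, background, batch_size, explore_share=0.2, seed=None):
    """Build the next batch of accounts to check: mostly model-targeted
    accounts (assumed already ranked), plus a random slice of background
    accounts so the sample does not drift toward what the model expects."""
    rng = random.Random(seed)
    n_explore = min(len(background), round(batch_size * explore_share))
    n_target = min(len(targeted), batch_size - n_explore)
    return list(targeted[:n_target]) + rng.sample(list(background), n_explore)


# Hypothetical pools of account ids.
targeted = [f"t{i}" for i in range(50)]
background = [f"r{i}" for i in range(50)]
batch = next_batch(targeted, background, batch_size=10, seed=1)
```

The random slice also yields an unbiased estimate of the background block rate, which is the baseline any prioritization should be compared against.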


In the end a few points to take away:

  1. The culture of Twitter has led to a situation where blocking is pervasive, especially for certain people who end up on block lists.
  2. I am blocked by many people on Twitter, and by gathering data about these individuals, I can certainly create a general profile of my blockers.
  3. Using Machine Learning, the Twitter API, Natural Language Processing, and other Data Science Technology, I can pull together a list of people who block me on Twitter.

As a personal aside, I don't care a lot that I'm blocked on Twitter.  It doesn't appear to affect my user experience in a material way, as I am not blocked by any accounts I would normally follow, and it's not an emotional issue for me.  I do find it troubling that the pervasive use of block lists is allowing many Americans to insulate themselves from conflicting points of view, primarily by blocking people with whom they have never interacted.

Friday, November 24, 2017

Python versus R: A Porting Project

I have been programming in R for over a decade, and during that time, especially more recently, I have built robust pipelines to create large numbers of machine learning and statistical classification models at a time.  The purpose of these pipelines is to evaluate multiple model types against a single dependent variable (usually in high-dimensional space), quickly determine which works best, and automatically move to the next variable to be modeled.  Like many data science projects, the pipelines include five steps:

  1. Data Cleaning,
  2. Feature Selection,
  3. Running Models,
  4. Model Evaluation, and
  5. Report Production (create a PDF for review by business owners, if they so choose).
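
The outer loop of such a pipeline can be sketched in Python as follows. Everything here is hypothetical scaffolding rather than my actual pipeline: `fit_and_score` stands in for steps 1 through 4 for a single model type and target, the model names and scores are made up, and the report step is left out.

```python
def run_pipeline(targets, candidates, fit_and_score):
    """For each dependent variable, score every candidate model type and
    keep the best one before moving on to the next variable.

    fit_and_score(model_name, target) is a caller-supplied function that
    cleans, selects features, fits, and returns one evaluation score
    (higher is better)."""
    best = {}
    for target in targets:
        results = {name: fit_and_score(name, target) for name in candidates}
        winner = max(results, key=results.get)
        best[target] = (winner, results[winner])
    return best


# Stand-in scores so the sketch runs without any modeling library.
fake_scores = {
    ("glm", "y1"): 0.70, ("xgb", "y1"): 0.74,
    ("glm", "y2"): 0.81, ("xgb", "y2"): 0.79,
}
winners = run_pipeline(["y1", "y2"], ["glm", "xgb"],
                       lambda model, target: fake_scores[(model, target)])
```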

I can write all five of these steps easily in R, and haven't really had problems with this type of modeling. But I also know Python, which has similar machine learning and analytical packages and has been referred to as the "future of Data Science."  I had used these packages before (pandas, numpy, sklearn), but I most often use Python for non-modeling tasks or to access frameworks like Spark.  

Two weeks ago, while testing one of my pipelines, I had the idea to port my primary model building pipeline into Python.  The reasons for this were two-fold:
  1. To use and test Python's different data science methods and packages
  2. To make use of Python's flexibility as a programming language (as well as its status as a "real" all-purpose language).  
The coding was more difficult at first as I figured out some details of pandas data frames and numpy arrays.  I'm mostly finished now, and am just fixing small breaks as I find them.  Generally the models created are nearly identical in quality, with Python maybe showing a slight edge.  Other than that, my initial thoughts:

Python Versus R

  1. Categorical handling: If I have a categorical variable in an R data frame, and I want to pass that to an R model, I can pass the variable directly to an algorithm, and R efficiently creates numerical data on the fly without user intervention.  Python, however, generally requires a preprocessing step to map the categorical variable into one binary column per level.  There are drawbacks to both methods:
    1. R is like an "automatic transmission": it is less work for the user and makes the data frame in memory easier to manipulate.  On the other hand, when using this method, some R methods force all levels of a categorical variable (minus one) into an algorithm, when sometimes optimal models would feature-select to far fewer (some models handle this, some don't). 
    2. Python is more of a "manual transmission" situation, where the user has to intervene to decide on a categorical encoding strategy (e.g. pandas.get_dummies() or sklearn.preprocessing.OneHotEncoder()).  This means more work for the user and massive data frames, but allows for more control of feature selection (in some algorithms) at run time.  (This is actually a problem I've seen in R for quite some time, and by being less developed in this space, Python has "solved" the problem.)
  2. Different algorithms:  Python is not a primary language for statisticians and research data scientists (Python is new to the game), which puts it a bit behind the curve on algorithm availability.  One example of such a missing case is a shrunken centroids model, which I had found useful in a few specific types of classification.
  3. Some models run faster: When I run a model in R versus Python I get similar results within tolerance, except that the Python models tend to fit much faster on my hardware.  As a test I ran XGBoost in both systems.  The models were substantially similar (AUC = .713 v AUC = .716); however, the Python version finished in 3 seconds versus 32 seconds in R.  Both were still under a minute, and this may not seem substantial, but inside an analytics pipeline where you may be building a few thousand models, the difference multiplies into something substantial.
  4. More consistency between models: R is a bit of the "wild west" in terms of consistency both in model parameters and model object outputs.  For direct comparison of models (or to run different model types under similar parameters) one often has to rely on third-party packages like "caret" or "broom."  This makes R's advantage in packages and model types less than ideal, in that traversing those model types is not straightforward.  Generally in Python's sklearn I can count on classifiers of similar types to give me similar output objects and methods.
  5. Some things don't work at all: I've had more issues in Python with certain functions not working *out of the box* as stated in the documentation; many of these seem to be fixed in later bug-fix releases.  I *think* this is likely because sklearn is still a package under active development.
  6. Plotting: To be honest, I'm still figuring this one out.  Matplotlib appears to be the preferred plotting strategy in Python (though there is a Python version of ggplot), but honestly rewriting all my diagnostic plotting strategies (and getting labels, titles, axes, and legends correct) has been one of the biggest pains in this entire process.  It's difficult to determine whether Python is actually more difficult, or if it's just painful because I've spent several years developing my own plots in R.
  7. Object Oriented: Python has a bit more straightforward syntax as a programming language, and my code for the Python pipeline is more object oriented and, quite honestly, better coded than what I have in R.  That said, the whitespace and syntax requirements in Python took some getting used to versus my "I do what I want" attitude of coding in R.
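
To make the categorical handling point (item 1) concrete, here is a minimal pure-Python sketch of what the "manual" encoding step produces. It only mimics the idea behind pandas.get_dummies(); the function name, the sorted level order, and the `drop_first` flag are my own illustrative choices. Dropping the first level mirrors the "all levels minus one" coding that R applies automatically.

```python
def one_hot(values, drop_first=False):
    """Map a categorical column to one 0/1 column per level.

    With drop_first=True the first level (in sorted order) becomes the
    implicit baseline, leaving one fewer column than there are levels."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return {level: [int(v == level) for v in values] for level in levels}


colors = ["red", "blue", "red", "green"]
full = one_hot(colors)                      # columns: blue, green, red
reduced = one_hot(colors, drop_first=True)  # columns: green, red
```

A variable with hundreds of levels becomes hundreds of columns this way, which is where the "massive data frames" come from, but each column can then be kept or dropped on its own during feature selection.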

Overall, both platforms have advantages and disadvantages.  My takeaways are these:
  • R is likely better (in the short-term at least) for data exploration and manual or "academic" model builds due to relative ease of coding and availability of models and methods.
  • Python may be better for large-scale model builds where speed and consistency between models is necessary (and also if you have an aversion to hearing the term "tidy").
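
To put the speed point in rough numbers, here is a back-of-the-envelope calculation using the 3 second and 32 second XGBoost timings from above. It assumes 3,000 models ("a few thousand"), that every model shows the same timings as my single test, and that models are built one after another; all three are simplifying assumptions.

```python
def total_hours(seconds_per_model, n_models):
    """Wall-clock hours to build n_models serially at a fixed per-model time."""
    return seconds_per_model * n_models / 3600


n_models = 3000  # assumption: "a few thousand" models in one pipeline run

python_hours = total_hours(3, n_models)  # 2.5 hours
r_hours = total_hours(32, n_models)      # roughly 26.7 hours
```

Under those assumptions the same run is an afternoon in one language and more than a day in the other.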

Sunday, September 10, 2017

So You Want To Be A Data Scientist

Quite often, someone I know asks me a question that I don't have a great answer for:

How would I go about becoming a data scientist?

This is always a tough place to start a conversation, especially if data science is not a great fit for the individual I'm talking to, but there are generally two types of people who ask me this question:

  • Young professionals:  I get the joy of working with quite a few interns and "first jobbers," who, BTW generally give me a reason to be hopeful about the future of America. (Ironically most of them aren't Americans, but whatever...) Most of these people are in computer science or some kind of analytical program and want to know what they should do to become a real "data scientist."
  • People my age: I also get this question from people in their mid-30s, many of whom have limited relevant educational background.  For certain mid-career professionals this could be a great option, especially if they have both computer science and math in their background, but this often isn't the case.  They seem to be drawn to data science because they've seen the paycheck, or it just sounds mysterious and sexy.  Often these people say "I love data, I'd be great at data science" (though this claim is somewhat dubious, and by this they often mean that they like USA Today infographics). 

I'm writing this blog post as a place to point both of these groups, in order to give a fair full-breadth look at the skills that I would expect from data scientists.  I break these skills down into three general areas (with some bonus at the end):

  • Math Skills
  • Computer Science Skills
  • Business Skills
Fictionalized Portrayal of a Data Scientist (that looks somewhat like my work setup)


Math is the language of data science, and it's pretty difficult to make it 10 minutes in my day without some form of higher math coming into play.  Point being: if you struggle with and/or dislike math, this isn't the career for you.  And if the highest level math you've taken is college algebra, you're also in trouble.  Knowledge of algebra is an absolute assumption in data science, and most of the real work is done with the material from higher-order math classes.  I would consider four types of requirements:

  • Calculus (differential + integral):  I use calculus daily in my job, when calculating equilibrium, optimization points, or spot change.  Three semesters of college-level calculus is a must for data scientists.
  • Matrix/Linear Algebra: The algorithms that we use to extract information from large data sets are written in the language of matrix and vector algebra.  This is for many reasons, but it allows data scientists to write large scale computations very quickly without having to manually code thousands of individual operations.
  • Differential Equations: This is an extension of calculus, but is extremely helpful in calculating complex variable interactions and change-based relationships.
  • Statistics:  Don't just take the stats class that is offered as part of your major, which tends to be a bare-necessities look.  Take something that focuses on the mathematics underlying statistics. I suggest a stats class at your university that requires calculus as prerequisite.

If this equation is intimidating to you, then data science is likely not a great option.


Here's the guideline I give young data scientists: the correct level of computer science skill is such that you could get a job as a mid-level developer (or DBA) at a major company.  This may seem like a weird metric, but it plays into the multi-faceted role of data scientists: we design new algorithms and process data, which involves designing the programs that analyze that data.  Being able to write code as dynamic programs allows for automated analysis and model builds that take minutes rather than weeks.  Here are some courses/skills to pick up before becoming a data scientist:

  • Introduction to Programming: Simply knowing how computer programming works, and the keys to functional and object-oriented programming.
  • Introduction to Database Theory: Most of the data we access is stored or housed in some kind of database.  Even Hadoop can be thought of as just a different type of data store, but it's good to start with the basics.  As part of this course, it's vital to learn the basics of SQL, which is still (despite claims and attempts to the contrary) the primary language of data manipulation for business.
  • Python: Python is becoming the language of data science, and it is also a great utility language, which has available packages and add-ons for most computing purposes.  It's good to have a utility language in your toolkit as many data wrangling and automation tasks don't exclusively require the tools of data manipulation (e.g.: audio to text conversion).
  • R: R is my primary computing language, though I work in Python and SQL in equal proportions these days (and sometimes SAS). R has extensive statistical and data science computing packages, so it's a great language to know.  The question I get most often is: should I learn R, Python or SAS?  My answer: have a functional understanding and ability to write code in all three, be highly proficient in at least one.


When asking about business skills, the question I most often receive is: Should I get an MBA?  In a word, no.  But it is helpful to understand business concepts and goals, especially to understand and explain concepts to coworkers fluently.  You don't have to go deep into business theory, but a few helpful courses:

  • Accounting: Often data scientists are asked to look at accounting data in order to create financial analyses, or to merge financial data with other interesting areas of a business.  Understanding the basics of the meaning of accounting data, accounting strategies, and how data is entered into financial systems can be helpful.
  • Marketing: Much of the use of data science over the past five years has dealt with targeted marketing both online and through other channels.  Understanding the basics of targeted marketing, meaning of lift, acquisition versus retention, and the financials underlying these concepts is also helpful.
  • Micro-Econ: Though technically an economics class, knowing the basics of micro theory allows you to analyze a business more wholly.  Some relevant analyses may be demand and pricing elasticity, market saturation modeling, and consumer preference models.  It also helps you with personally valuable analysis, like evaluating the viability of a start-up you might be thinking about joining.
Supply-demand relationships are relevant to many data science business applications.


Though the skills above are necessities for data science, there are a few "honorable mention" classes that are helpful:

  • Social Sciences: When modeling aggregate consumer behavior, it's important to understand why people do the things they do.  Social sciences are designed to analyze this; I recommend classes in economics, political science (political behavior or institutional classes), and behavioral psychology.
  • Econometrics: Econometrics is a blending of economics and statistical modeling, but the focus on time-series and panel analysis is especially helpful in solving certain business problems.  
  • Communication: One of the most common complaints I hear about data scientists is "yeah _____'s smart, but can't talk to people."  A business communication class can help remedy this before it becomes a serious issue.


There are many options as the road to data science is not fixed.  This road map gives you all the skills you will need to be a modern data scientist.  People who want to become data scientists should focus on three major skillsets: math, computer science, and business.  Some may notice that I omitted artificial intelligence and machine learning, but the statistics, math, and computer science courses on this list more than give one a head start on those skills.