How would I go about becoming a data scientist?
This is always a tough place to start a conversation, especially if data science is not a great fit for the individual I'm talking to, but there are generally two types of people who ask me this question:
- Young professionals: I get the joy of working with quite a few interns and "first jobbers," who, BTW, generally give me a reason to be hopeful about the future of America. (Ironically, most of them aren't Americans, but whatever...) Most of these people are in computer science or some kind of analytical program and want to know what they should do to become a real "data scientist."
- People my age: I also get this question from people in their mid-30s, many of whom have a limited relevant educational background. For certain mid-career professionals this could be a great option, especially if they have both computer science and math in their background, but this often isn't the case. They seem to be drawn to data science because they've seen the paycheck, or because it just sounds mysterious and sexy. Often these people say "I love data, I'd be great at data science" (though this claim is somewhat dubious, and by this they often mean that they like USA Today infographics).
I'm writing this blog post as a place to point both of these groups, in order to give a fair, full-breadth look at the skills that I would expect from data scientists. I break these skills down into three general areas (with some bonus areas at the end):
- Math Skills
- Computer Science Skills
- Business Skills
MATH SKILLS
Math is the language of data science, and it's pretty difficult to make it 10 minutes in my day without some form of higher math coming into play. Point being: if you struggle with and/or dislike math, this isn't the career for you. And if the highest-level math you've taken is college algebra, you're also in trouble. Knowledge of algebra is an absolute assumption in data science, and most of the real work relies on higher-level math. I would consider four types of courses requirements:
- Calculus (differential + integral): I use calculus daily in my job when calculating equilibria, optimization points, or spot (instantaneous) rates of change. Three semesters of college-level calculus are a must for data scientists.
- Matrix/Linear Algebra: The algorithms that we use to extract information from large data sets are written in the language of matrix and vector algebra. There are many reasons for this, but a key one is that it allows data scientists to write large-scale computations very quickly without having to manually code thousands of individual operations.
- Differential Equations: This is an extension of calculus, but it is extremely helpful in calculating complex variable interactions and change-based relationships.
- Statistics: Don't just take the stats class that is offered as part of your major, which tends to be a bare-necessities look. Take something that focuses on the mathematics underlying statistics. I suggest a stats class at your university that requires calculus as a prerequisite.
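To make the calculus bullet concrete, here is a minimal sketch of finding an optimization point numerically. The `profit` curve and every number in it are invented for illustration, and Newton's method on finite-difference derivatives is just one reasonable way to do this, not a description of any particular production workflow.

```python
def profit(price):
    """Hypothetical profit curve; every number here is invented for illustration."""
    units_sold = 1000 - 20 * price   # assumed linear demand
    unit_cost = 10                   # assumed cost per unit
    return (price - unit_cost) * units_sold


def first_derivative(f, x, h=1e-4):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def find_stationary_point(f, x0, tol=1e-6, max_iter=50):
    """Newton's method on f'(x) = 0, starting from the guess x0."""
    x = x0
    for _ in range(max_iter):
        curvature = second_derivative(f, x)
        if curvature == 0:
            break  # flat curvature: the Newton step is undefined
        step = first_derivative(f, x) / curvature
        x -= step
        if abs(step) < tol:
            break
    return x


best_price = find_stationary_point(profit, x0=15.0)
is_maximum = second_derivative(profit, best_price) < 0  # concave down => maximum
print(round(best_price, 3), is_maximum)
```

For this made-up curve the derivative can also be solved by hand (the profit is a downward-opening parabola that peaks at a price of 30), which gives a handy check on the numerical answer.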
COMPUTER SCIENCE SKILLS
Here's the guideline I give young data scientists: the correct level of computer science skill is such that you could get a job as a mid-level developer (or DBA) at a major company. This may seem like a weird metric, but it plays into the multi-faceted role of data scientists: we design new algorithms and process data, which involves designing the programs that analyze that data. Being able to write code as dynamic programs allows for automated analysis and model builds that take minutes rather than weeks. Here are some courses/skills to pick up before becoming a data scientist:
- Introduction to Programming: Simply knowing how computer programming works, including the key ideas of functional and object-oriented programming.
- Introduction to Database Theory: Most of the data we access is stored or housed in some kind of database. Even Hadoop, which is strictly a distributed storage and processing framework rather than a database, fills a similar data-store role in practice, but it's good to start with the fundamentals. As part of this course, it's vital to learn the basics of SQL, which is still (despite claims and attempts to the contrary) the primary language of data manipulation for business.
- Python: Python is becoming the language of data science, and it is also a great utility language, with packages and add-ons available for most computing purposes. It's good to have a utility language in your toolkit, as many data wrangling and automation tasks don't exclusively require the tools of data manipulation (e.g., audio-to-text conversion).
- R: R is my primary computing language, though I work in Python and SQL in equal proportions these days (and sometimes SAS). R has extensive statistical and data science computing packages, so it's a great language to know. The question I get most often is: should I learn R, Python, or SAS? My answer: have a functional understanding of, and the ability to write code in, all three, and be highly proficient in at least one.
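As a small illustration of how SQL and a utility language fit together, the sketch below runs a basic aggregation query from Python using the standard-library `sqlite3` module. The `orders` table, its columns, and the sample rows are all hypothetical; a real business database would have its own schema and would usually be reached through a different driver.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into an in-memory table and total them per region."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total, COUNT(*) AS n_orders "
            "FROM orders GROUP BY region ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data, purely for illustration.
sample_orders = [
    ("East", 120.0),
    ("West", 80.0),
    ("East", 30.0),
    ("West", 45.5),
    ("South", 200.0),
]

summary = revenue_by_region(sample_orders)
for region, total, n_orders in summary:
    print(region, total, n_orders)
```

The `GROUP BY` plus aggregate pattern shown here is the sort of thing an introductory database course covers early, and it carries over with little change to most SQL dialects.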
BUSINESS SKILLS
When people ask about business skills, the question I most often receive is: Should I get an MBA? In a word, no. But it is helpful to understand business concepts and goals, especially so that you can discuss them fluently with coworkers. You don't have to go deep into business theory, but here are a few helpful courses:
- Accounting: Often data scientists are asked to look at accounting data in order to create financial analyses, or to merge financial data with other interesting areas of a business. Understanding the basics of the meaning of accounting data, accounting strategies, and how data is entered into financial systems can be helpful.
- Marketing: Much of the use of data science over the past five years has dealt with targeted marketing, both online and through other channels. Understanding the basics of targeted marketing, the meaning of lift, acquisition versus retention, and the financials underlying these concepts is also helpful.
- Micro-Econ: Though technically an economics class, knowing the basics of micro theory allows you to analyze a business more wholly. Some relevant analyses may be demand and pricing elasticity, market saturation modeling, and consumer preference models. It also helps you with personally valuable analysis, like evaluating the viability of a start-up you might be thinking about joining.
Figure: Supply-demand relationships are relevant to many data science business applications.
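Two of the quantities named in the list above, lift and price elasticity, reduce to short formulas. The sketch below uses common textbook definitions (lift as the ratio of a targeted group's response rate to a baseline rate, and the midpoint or "arc" form of elasticity); the function names and all of the numbers are assumptions chosen for illustration, and a given team may define these metrics differently.

```python
def lift(target_responders, target_size, baseline_responders, baseline_size):
    """Ratio of the targeted group's response rate to the baseline response rate."""
    target_rate = target_responders / target_size
    baseline_rate = baseline_responders / baseline_size
    return target_rate / baseline_rate


def arc_elasticity(price_before, quantity_before, price_after, quantity_after):
    """Midpoint (arc) price elasticity of demand between two observations."""
    pct_quantity = (quantity_after - quantity_before) / (
        (quantity_after + quantity_before) / 2
    )
    pct_price = (price_after - price_before) / ((price_after + price_before) / 2)
    return pct_quantity / pct_price


# Invented figures: 50 of 1,000 targeted customers responded,
# versus 200 of 10,000 in the untargeted baseline.
campaign_lift = lift(50, 1000, 200, 10000)

# Invented figures: raising the price from 10 to 12 cut sales from 100 to 80 units.
elasticity = arc_elasticity(10.0, 100, 12.0, 80)

print(campaign_lift, elasticity)
```

A lift above 1 means the targeted group responded at a higher rate than the baseline, and an elasticity whose magnitude exceeds 1 indicates demand that is relatively sensitive to price.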
Though the skills above are necessities for data science, there are a few "honorable mention" classes that are helpful:
- Social Sciences: When modeling aggregate consumer behavior, it's important to understand why people do the things they do. Social sciences are designed to analyze this; I recommend classes in economics, political science (political behavior or institutional classes), and behavioral psychology.
- Econometrics: Econometrics is a blending of economics and statistical modeling, and its focus on time-series and panel analysis is especially helpful in solving certain business problems.
- Communication: One of the most common complaints I hear about data scientists is "yeah _____'s smart, but can't talk to people." A business communication class can help remedy this before it becomes a serious issue.
There are many options as the road to data science is not fixed. This road map gives you all the skills you will need to be a modern data scientist. People who want to become data scientists should focus on three major skillsets: math, computer science, and business. Some may notice that I omitted artificial intelligence and machine learning, but the statistics, math, and computer science courses on this list more than give one a head start on those skills.