Saturday, July 7, 2018

A Suggested Curriculum for Undergrad Data Science

It's intern season, which basically means two things for me:
  1. I feel really old in the office.
  2. I take a lot of meetings with young people who want to be data scientists.

The meetings generally have this feel:

Hi, I have interest in becoming a data scientist and I want to get your perspective on what that might take, let's meet for coffee.
The interns who set up these meetings come from a wide range of backgrounds and skill levels.  Often they want to know which classes they should take to be competitive for data science roles after they graduate.  I thought it would be helpful to put the advice I give them into a blog post so more people can read it.

I reviewed the current course offerings at a few schools, and settled on a list of 14 basic classes (and some electives at the end) that I think should be part of every data science curriculum.  Here is my list:

MATH

  • Calculus I, II, III
  • Differential Equations
  • Linear Algebra
Some data scientists say you can skimp on these requirements, but in my experience that's simply not true.  If you don't understand matrix algebra or differentials, you can't understand the algorithms we implement.  The data scientists I see struggle with higher-level math fail to understand how complex algorithms operate, which leads to failures in implementation.
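To make that concrete, here is a minimal sketch of the matrix algebra hiding inside even the simplest model, a straight-line fit.  It solves the least-squares normal equations (X^T X) beta = X^T y by hand for the two-parameter case.  The function name `fit_line` and the sample data are made up for illustration; they are not from any particular library or course.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x using the normal equations.

    For a design matrix X with a column of ones and a column of x values:
        X^T X = [[n,  sx ],      X^T y = [sy,
                 [sx, sxx]]               sxy]
    The 2x2 system is solved directly with Cramer's rule.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of X^T X; zero means the columns of X are linearly dependent.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")

    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Illustrative data that lies exactly on y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)
```

The singular-matrix check is the kind of thing that is hard to reason about without linear algebra: when the determinant is zero, the fit has no unique solution, and no amount of code fixes that.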

STATISTICS

  • Intro to Stats
  • Calculus Based Stats
  • Generalized Research Method Class 
  • Econometrics
Although many modern data science algorithms are not based purely in statistics, the concepts of risk, certainty, and confidence taught in these classes are key to understanding predictive modeling in general.  Having worked with data scientists whose training was in computer science alone, I now see a background in stats as fundamental to making predictions.

Econometrics may seem like an outlier, but some concepts in predictive modeling, such as time-series analysis and dealing with collinearity, endogeneity, and autocorrelation, are best taught in the context of econometrics.
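As a small taste of one of those concepts, the sketch below computes a lag-1 autocorrelation, meaning how strongly each value in a series tracks the value just before it.  This uses one common textbook estimator; the function name `lag1_autocorrelation` and both sample series are made up for illustration.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a sequence of numbers.

    Uses the common estimator: the sum of products of consecutive
    deviations from the mean, divided by the total sum of squared deviations.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / n
    deviations = [value - mean for value in series]

    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")

    numerator = sum(deviations[t] * deviations[t - 1] for t in range(1, n))
    return numerator / denominator


# A steadily trending series: neighbors move together, so the value is positive.
print(lag1_autocorrelation([1, 2, 3, 4, 5]))

# An alternating series: neighbors move in opposite directions, so it is negative.
print(lag1_autocorrelation([1, -1, 1, -1]))
```

When residuals from a model show strong autocorrelation like the first series, the usual standard errors tend to be unreliable, which is exactly the sort of problem an econometrics class teaches you to spot.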


COMPUTER SCIENCE

  • Programming I, II
  • Data Structures
  • Fundamentals of Computer Algorithms
  • Introduction to Database Systems
Sometimes I question the value of formal education in coding; some of the best programmers I know have degrees in non-computer fields.  That said, computer science is still a core skill set for data scientists, and it is required knowledge to be hired by someone like me.  If you have the skills from another source, that's great; just figure out a way to demonstrate them in an application or an interview.

ELECTIVES

As for electives in the data science space, these should be tailored to what specifically you want to do.
  • If you want to go into business, take classes in economics, business operations, and accounting.
  • If you want to go into algorithm development, focus more time in advanced computer science classes.
  • If you want to go into academic research, focus on whichever academic discipline you are most interested in.

CONCLUSION


This is intended to be a reasonable list of classes for young people interested in data science.  It serves two purposes really:
  • Provide a framework for undergrads looking to become data scientists.
  • Prevent me from saying things I later regret when confronted by students who want to be data scientists without taking math.