I want to share some thoughts on the math required for a data scientist (or at least, a data science undergraduate degree). The discussion can really be boiled down to one question: “Discrete Mathematics or Calculus 2?” Let’s first take a look at the outcomes from an in-progress and two completed working groups on outlining data science education.
An ACM organized workshop in 2015 included participants from ACM, ASA, IEEE-CS, AMS and more. That workshop’s report does not explicitly state any math requirements, but does make clear the need for sufficient supporting statistics courses. The clearest recommendations come from a group of faculty at the Park City Mathematical Institute in the summer of 2016. Their report gives suggestions on how to make a data science degree from existing courses and ideas for new integrated courses (this is the real gold in the report). If constructing a curriculum from existing courses, the group recommends three mathematics courses: Calculus 1, 2, and Linear Algebra. Last, a series of round-table discussions is currently underway by the National Academy of Science to discuss Data Science Post-Secondary Education. While all three NAS round tables are interesting, only the first is relevant to this discussion. At that meeting, there was a presentation on the underlying mathematics of data science. Their list of mathematics supporting data science included linear algebra, numerical analysis and graph theory.
In summary, all three groups clearly support the need for linear algebra to be a part of any data science curriculum. I doubt you’ll find many objections to this idea since linear algebra forms the mathematical foundation for manipulating data contained in tables or arrays as rows/columns. If nothing else, simply learning the notation is vitally important for anyone wanting to extend algorithms for data science. All three also clearly support at least two traditional statistics courses, up through regression analysis. A little less clearly, I would argue that all three support the requirement of a Calc 1 course. The NAS round-table discussed needing numerical analysis, which is traditionally based on calculus concepts. The ACM workshop supported disciplinary knowledge and just about all science disciplines require at least one semester of calculus.
Alright, on to the differences. The PCMI group included Calculus 2 in their “minimum” courses needed for data science. In my opinion, the suggestion that Calc-2 be included in the bare minimum courses for data science is indicative of the dominance of mathematicians (many applied) and statisticians in the group (there were a FEW computer scientists). While I think overall they are quite good, I think the inclusion of Calc 2 over discrete mathematics (as well as the odd location of data mining) clearly reflect this make-up. The presentation on mathematics (from two mathematicians) at the first NAS however included graph theory as one of the three main supporting mathematical areas. So, perhaps the question from these two groups is: “Calculus 2 or Discrete Mathematics?”
Here’s an alternative way to build an answer to this question. Instead of just focusing on the topics covered, what about the requirements for the other supporting disciplines that make up data science? Computer Science is pretty easy. Almost all programs require Calculus 1 and discrete mathematics, and the ACM 2013 guidelines include a list of core topics (set theory, graph theory and logic) that are traditionally covered in either a discrete mathematics course, or a combination of several mathematics courses. They also articulate very clearly that for some areas of computer science (like visualization or data science) linear algebra and statistics will be required. We can contrast this with typical mathematics requirements for statistics curriculum. For many statistics programs, a minimum of Calc 2 is required to support advanced probability courses (with a preference for multivariable calculus). The ASA 2014 guidelines specify that statistics majors should have both differentiation and integration (typically covered by Calc 1 and 2), and linear algebra.
Development from supporting disciplines can leave us just as confused as to what to require. I think there is an answer, but it requires taking off the mathematician glasses, and thinking about jobs, applications, and where a student might be headed. First, a good portion of researchers and practitioners doing data science use graphs and networks, often doing mining on those graphs for information. Turns out graphs (the node/edge type, not the line/bar plot type) are also a great way to visualize a lot of information. Another key skill when doing data science is the ability to partition data. That is, to think of data as either meeting, or not meeting specific criteria. This is encompassed in set theory in mathematics, and is sometimes partially covered as part of logic. Together these topics provide two new ways of thinking about data that aren’t included in other mathematics courses. The need for this sort of knowledge, and a basic introduction to proofs is why discrete mathematics courses came into existence, to allow CS majors to get these topics without taking another 3 or 4 mathematics courses. To me, this is a far stronger case for including discrete mathematics than the (possible) need of Calculus 2 for advanced statistics courses. If you are requiring 4 math courses, by all means, include Calculus 2 next. Or, if a student is particularly interested in understanding the theoretical underpinnings of data science (by taking more statistics courses) then they should take Calc 2. But if we are really thinking about an undergraduate degree as a stand-alone, prepared to enter the work force degree, Calc 2 does not seem to add a lot of direct value to the student’s degree.