Friday morning I had the opportunity to chat with Kathryn (Katie) Kinnaird, currently an Applied Mathematics Post-Doc at Brown University and former director of the Data Science TRAIn Lab at Macalester College. Originally I had called to ask her about the methodology behind the TRAIn lab and materials related to it (it’s a neat approach and something I may partly adopt for my own research students), but we got to talking about the introduction to data science courses we taught this year and the textbooks we used. As a bit of a preview, I used two textbooks, Doing Data Science by Cathy O’Neil and Rachel Schutt, and Data Science from Scratch by Joel Grus. Katie used Analyzing Data with GraphPad Prism by Harvey Motulsky. I’m going to provide a short discussion of what I was looking for in my textbook(s), what I ended up getting, and ask for a bit of reader input.
When I was reviewing textbooks for Data 151, I had some fairly specific constraints in mind. First, I wanted a book that dealt with data science, not just data mining, machine learning, or statistics. That ruled out a lot of books, but there were still a few left to pick from. I also wanted a book that used either Python or R as its primary coding language, since I believe (based on conferences and literature) that these will be the languages of the Realm long-term (more on this in another post). This excluded a few more books, including the book Katie used, since it focuses on the Prism software package. It also excluded the textbook I previously used for a data mining class, Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank, Hall, and Pal, which uses Weka.
Alright, so general data science, R or Python. The last requirement I had was that the book didn’t assume the reader knew a lot of (or any) coding. Wait…no coding experience? Yep. Data 151 is targeted at freshmen and interdisciplinary students who may well have zero coding experience. Katie’s students were even less likely to have coding experience (she had a fall class instead of my spring class). There are a couple of books out there that seem to be targeted at upper-tier undergraduates and early graduate students, for example Introduction to Data Science by Igual and Seguí, but very few that are technical while still aimed at novices. That’s how I landed on Doing Data Science and Data Science from Scratch. They were basically the only books I could find that even sort of fit my criteria. And even Doing Data Science assumed a bit of programming. So how did they work out?
Doing Data Science – Straight Talk from the Frontline:
I really like the description that becomingadatascientist.com used (the whole review is worth reading):
“To make a metaphor, Rachel Schutt and Cathy O’Neil tell you about a great dish someone cooked, and give some general info about the process of making the dish, and what to watch out for when you attempt it yourself. They even include some quotes from the chef about the art of making this particular dish, and tips on preparing and presenting it.”
It’s true, they cover everything you might want to touch on in an intro class and generally do so in a very high-level, newbie-friendly way. There are a few chapters that get way too technical (for my purposes), but those can be glossed over. So, what’s the downside? First, something written in 2013 has every potential to be out of date in the data science world unless very carefully written. Read carefully, the book makes several comments that I don’t feel are entirely true any more. Second, the code in the book is all in R, and I had really planned to use Python exclusively. Third, while I wanted the book to not assume programming knowledge, I also didn’t want it to ignore acquiring (some) programming knowledge (it does). Last, while generally accessible, it is still clearly written for graduate students or independent learners. That is to say, not really for brand new students who aren’t comfortable pulling out the key information they need. It also was not written as a textbook per se. It doesn’t really have any true exercises included (a few are scattered throughout from the class the book was based on).
Data Science from Scratch – First Principles with Python:
Great, first principles, using Python, everything I wanted, right? Yep, it was, if I were only teaching data science majors or computer science majors. I could probably get away with no programming background (though probably not, or so my business students said). Really, if you are looking at implementing an Intro to Data Science course as a sophomore/junior level elective in a computer science program, or possibly a Math/Stat program with a programming prerequisite, this is a reasonable choice. It covers the minimum statistics and mathematics you need (including linear algebra!) and doesn’t really assume you know tons of Python.
So what’s wrong with this text? For a general course, it’s simply too technical. While it does cover several important ideas at a high level, the students lost those few comments in the nitty-gritty of code implementation and examples. Furthermore, while I really liked the idea behind the text, that you will implement everything you need to do data science from scratch (or from things earlier in the book), that turned out to actually feel like a giant step backwards. There are quite a few fantastic packages for data science, from the really well known, like pandas and scikit-learn, to the lesser known, like Bokeh. I understand wanting students to know from the ground up how things work, but you can understand things while still using well-tested packages (and avoid all your own errors).
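To make that trade-off concrete, here is a minimal sketch of my own (not code from either book). It computes a sample standard deviation by hand, in the spirit of the from-scratch approach, and then gets the same number from a well-tested library call. I am assuming Python's built-in statistics module as a stand-in for the bigger packages so the example runs without installing anything, and the scores are made up purely for illustration.

```python
import math
import statistics


def mean_from_scratch(values):
    """Arithmetic mean, written out by hand."""
    return sum(values) / len(values)


def stdev_from_scratch(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean_from_scratch(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


# Hypothetical quiz scores, invented for this illustration only.
scores = [72, 85, 90, 66, 78]

by_hand = stdev_from_scratch(scores)     # the "from scratch" route
by_library = statistics.stdev(scores)    # the "well-tested package" route

# Compare the two results, allowing for floating-point rounding.
print(math.isclose(by_hand, by_library))
```

Writing the first version once is a fine way to see what the formula does. After that, the one-line library call is shorter and much harder to get wrong, which is roughly the balance I would want an intro text to strike.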
So, bringing it back around to my conversation with Katie on Friday. She had also expressed frustration with her textbook in terms of reaching the intended audience (freshmen, new students). I wasn’t happy in the end with either of my textbooks. Next year I’ll probably cut Data Science from Scratch as it really didn’t do what I needed. A lot of the other intro courses I know about don’t even use a textbook. So, here are a few questions I’d like to pose to my readers:
- Are there other textbooks that you are using, or know of, that fit these needs?
- What are your “must-haves” or “must-avoids” for a textbook for freshman-targeted courses in data science?