From Data 151: Intro to Data Science

Intro to DS Assignment Sites

As an instructor, I want to provide high-quality assignments that are focused (so they achieve the learning objective), engaging (so students aren't bored), and well supported (so students don't end up frustrated). In an ideal world, I'd have time to write, test, debug, and administer all my own course-tailored assignments that meet these goals. I, however, do not live in an ideal world, nor do I have enough graduate/undergraduate minions to mimic this ideal world. Instead, I've turned to a few sites that already host assignments and resources, and even include auto-grading (without me needing to learn/set up the system).

Learn2Mine (L2M) is the first site I used, in conjunction with my Data Mining course and, more recently, my Introduction to Data Science course. Learn2Mine is a free, open-source platform developed at the College of Charleston (CoC). While I have only really made use of the contents already there and CoC's hosted site, you can contribute, or host your own version by getting the source directly from GitHub. Dr. Anderson is fairly responsive about keeping the site running and grading.

The positive features of L2M (beyond being totally free/open source) are that it includes a mix of introductory programming assignments and several more advanced machine learning/data mining lessons. It even has several search-algorithm lessons (which I tend not to use). All of the lessons include auto-graded response boxes, which also provide limited feedback on the errors generated when comparing submitted work to answers. There is also an interface for instructors to create their own 'courses,' which consist of a series of the lessons on L2M. This allows the instructor to see student progress through lessons and download a gradebook in spreadsheet format.

Downsides for L2M are in line with what you pay for (or invest time-wise). Even though there is feedback when students get answers wrong, it often just consists of identifying mismatched output lines (so, pretty sparse). Students often get very frustrated trying to figure out what they are missing. This is exacerbated by the fact that the instructions are often unclear or insufficient for students to simply do the lessons. Also, as might be expected from a locally built and maintained project, a lot of "polish" features are missing, such as being able to reorder assignments in a course or associate a name with an account. Accounts are tied to the email students log in with, so it can sometimes be challenging to connect records with students. Overall, I've been considering phasing L2M out of my normal assignment structure, though the possibility of hosting my own local version and implementing different, better-explained lessons has also been tempting.

The prime contender to replace L2M for me has been DataCamp. I've known about DataCamp for a while now but had my first chance to actually use it and make assignments from it this spring, when I was looking for data visualization lessons (see the visualization resources post). I've gone through a few lessons myself and found DataCamp to be basically exactly what I'd want/envision online coursework to be. Most courses consist of short videos (a best practice) followed by several guided coding exercises. DataCamp is not free (sort of), which turns out to be both a pro and a con.

If it's not free, why is DataCamp going to replace L2M for me? Great question. Because, for academic purposes, DataCamp IS free. If you are an instructor at an academic institution teaching a course with 10+ students enrolled, you can request free premium access for the students in your course(s). That access is limited (they give you 6 months), but hey, it's free. What else makes DataCamp a nicer replacement? First, the coding exercises are scaffolded; that is, early exercises have more prewritten code, while later exercises require you to remember and use what you've already learned. In addition, the coding exercises have reasonably helpful error messages, often allowing you to debug code more accurately. They've also got built-in hints/help available, so a student can't get permanently stuck. Using those, however, decreases the "exp" they gain, so you can still track how successful a student has been without help. The other major advantage is that DataCamp has a SIGNIFICANTLY larger set of lessons/courses to pull from.

There is no free lunch in data/computer science, though. DataCamp does have a few downsides. Perhaps the biggest is the granularity available in assignments. You have three choices: "collect xp," "complete chapter," or "complete course." Given that a chapter is really the smallest cohesive learning unit on DataCamp, this makes a lot of sense educationally. However, it also means DataCamp isn't exactly an alternative for giving individual lab/homework assignments. Instead, it serves best as a resource or major assignment related to learning how to program in Python/R, or to a bigger topic.

Finally, I want to mention Gradescope. Gradescope isn't a data science educational site. Instead, it's a jack-of-all-trades that can help ease the burden of assignments and grading. If DataCamp took L2M and removed granularity/options, Gradescope (in this context) goes the other direction. Lots of faculty use it for all kinds of courses, from computer science or mathematics to writing. Given its purpose, Gradescope doesn't have any specific assignments (maybe that was obvious). Instead, it can serve as an autograder or collection site for your own assignments. I've included it here for those who might already have assignments (or who get them from others) but still want a speedy, simple way to get feedback to students.

I'd be remiss if I didn't point out that there are some alternatives to DataCamp, depending on your goals. If all you need students to do is learn to program (not necessarily in a data-centric style), try Codecademy or similar sites. I also know there is an alternative to Gradescope (but I couldn't track down the name/site; if someone knows, please email me or leave a comment). What I recall is that the alternative is NOT free, but does provide better support and scaling. You might also consider what options are available or can be integrated with your learning management system (DataCamp can be… but maybe not by you…).

Hopefully you found this post informative. If you've got other suggestions of websites with assignments (particularly data-science related), please let me know or leave a comment.


Version Control and Reproducible Research/Data Science

A current hot topic in research, especially statistically driven research, is "reproducible research." In academia, the process of peer-reviewed publication is meant to assure that any findings are reproducible by other scientists. But those of us in the trenches, and especially on the data side of things, know that the reproducibility is a theoretical outcome, and far more rarely something actually tested. While academia is rightly under fire for this lack of actual reproducible research (see this great example from epidemiology), it is even more of a problem in industry. If an analysis can't be reproduced, it can't be applied to a new client base.

So why bring this up on an educational blog? I think it's important to embed the idea of reproducible work deep inside our teaching and assignment practices. While the idea of repeating a specific analysis once the data has changed isn't really novel, it becomes far more relevant when we begin talking about filtering or cleaning the input data. Just think about searching for outliers in a data set. First, we might plot a histogram of values/categories; then we go back, remove the data points we want ignored, and replot the histogram. BAM! We have a perfect opportunity to teach the value of reproducible work! We used exactly the same visualization technique (a histogram) on practically the same data (with outliers and without).
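In Python, for instance, the whole cycle fits in a few lines. This is just a sketch with made-up scores (the data and the over-100 cutoff are my inventions), and `np.histogram` stands in for an actual plotting call:

```python
import numpy as np

# Hypothetical data: ten exam-style scores with two data-entry errors (999s)
scores = np.array([72, 85, 90, 68, 74, 999, 81, 77, 999, 88])

# First pass: histogram of the raw data (the outliers stretch the bins badly)
counts_raw, _ = np.histogram(scores, bins=5)

# Remove the outliers with an explicit, recorded rule (here: anything over 100)
cleaned = scores[scores <= 100]

# Second pass: exactly the same technique, practically the same data
counts_clean, _ = np.histogram(cleaned, bins=5)

print(len(scores), len(cleaned))  # 10 values before cleaning, 8 after
```

Same function, same bin count; only the recorded filtering rule changed between the two passes.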

Where does the reproduction of the work fit in, though? Python and R both have histogram functions, so this is definitely a toy example (though the whole idea of functions can serve to emphasize reproducible/reusable work). Instead, I think this is where the instructor has an opportunity. This idea of cleaning outliers could easily be demonstrated in the command-line window of R or an interactive Python shell. And then you've lost your teaching moment. Instead, if this is embedded in an R script or a Python/R notebook, you can reuse the code, retrace whatever removal process you used, etc. In the courses I've taught, I've seen student after student complete these sorts of tasks in the command-line window, especially when told to do so as part of an active, in-class demo. But they never move the code into a script, so when they are left to their own devices they flounder and have to go look for help.
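As a sketch of the difference: the same cleaning rule typed into a shell vanishes when the session ends, while a tiny script like this (the file and function names are my own invention) preserves it for reuse and retracing:

```python
# outlier_demo.py
# Keeping the cleaning step in a script, instead of a throwaway shell session,
# records the exact removal rule so the analysis can be rerun and retraced.

def remove_outliers(values, low, high):
    """Return only the values inside [low, high]; the rule lives in one place."""
    return [v for v in values if low <= v <= high]

if __name__ == "__main__":
    data = [72, 85, 90, 68, 74, 999, 81]  # hypothetical data
    print(remove_outliers(data, 0, 100))  # the 999 entry is dropped
```

Change the bounds once, rerun the script, and every downstream plot reflects the new rule, which is exactly the reproducibility habit the demo is meant to build.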

I titled this post "Version Control and Reproducible Research"… you might be wondering what version control has to do with this topic. The ideas described above are great if you are the sole purveyor of your code/project. But if you have your students working in teams, or are trying to collaborate yourself, this might not be exactly ideal. But it's getting pretty close! Here's the last nugget you need to make this work… version control. Or in this case, I'm specifically talking about using GitHub. The short version of what could be an entire separate post (I'll probably try to do one eventually) is that git (and the cloud repository host GitHub) is the tool software developers designed to facilitate collaborative development without the desire to kill each other over broken code. It stores versions of code (or really any file) that can be jointly contributed to without breaking each other's work. For now, I'll point you to a few resources on this.

First, a bit more from an industry blog on workflows to promote reproduction using github — Stripe’s Notebooks and Github Post

Second, for using Git/GitHub with R — Jenny Bryan, Prof. University of British Columbia — Note that this is a really long, complete webpage/workshop resource!

Third, a template/package for Python to help structure your reproducible GitHub work — Cookiecutter Data Science — (heck, this could be an entire lesson itself in how to manage a project; more on that later)

Fourth, a template/package for R to help structure your reproducible GitHub/R work — ProjectTemplate


Intro to Data Science Books

Friday morning I had the opportunity to chat with Kathryn (Katie) Kinnaird, currently an Applied Mathematics Post-Doc at Brown University and former director of the Data Science TRAIn Lab at Macalester College. Originally I had called to ask her about the methodology behind the TRAIn Lab and materials related to it (it's a neat approach, and something I may adopt in part for my own research students), but we got to talking about the introduction to data science courses we taught this year and the textbooks we used. As a bit of a preview, I used two textbooks, Doing Data Science by Cathy O'Neil and Rachel Schutt and Data Science from Scratch by Joel Grus. Katie used Analyzing Data with GraphPad Prism by Harvey Motulsky. I'm going to provide a short discussion of what I was looking for in my textbook(s) and what I ended up getting, and then ask for a bit of reader input.


When I was reviewing textbooks for Data 151, I had some fairly specific constraints in mind. First, I wanted a book that dealt with data science, not just data mining, machine learning, or statistics. That ruled out a lot of books, but there were still a few left to pick from. I also wanted a book that used either Python or R as its primary coding language, since I believe (based on conferences and literature) that these will be the languages of the Realm long-term (more on this in another post). This excluded a few more books, including the book Katie used, since it focused on the Prism software package. It also excluded the textbook I previously used for a data mining class, Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank, Hall, and Pal, which uses Weka.

Alright, so: general data science, R or Python. The last requirement I had was that the book not assume the reader knew a lot of (or any) coding. Wait… no coding experience? Yep. Data 151 is targeted at freshmen and interdisciplinary students who may well have zero coding experience. Katie's class was even less likely to have coding experience (she had a fall class instead of my spring class). There are a couple of books out there that seem to be targeted at upper-tier undergraduates and early graduate students, for example Introduction to Data Science by Igual and Seguí, but very few that are technical while still aimed at novices. That's how I landed on Doing Data Science and Data Science from Scratch. They were basically the only books I could find that even sort of fit my criteria. And even Doing Data Science had a bit of an assumption of some programming. So how did they work out?

Doing Data Science – Straight Talk from the Frontline:

I really like the description one reviewer used (read the whole review here):
“To make a metaphor, Rachel Schutt and Cathy O’Neil tell you about a great dish someone cooked, and give some general info about the process of making the dish, and what to watch out for when you attempt it yourself. They even include some quotes from the chef about the art of making this particular dish, and tips on preparing and presenting it.”

It's true: they cover everything you might want to touch on in an intro class, and generally do so in a very high-level, newbie-friendly way. There are a few chapters that get way too technical (for my purposes), but those can be glossed over. So, what's the downside? Sadly, something written in 2013 has every potential to be out of date in the data science world unless very carefully written. Read carefully, and you'll find several comments that I don't feel are entirely true anymore. Second, the code in the book is all in R, and I had really planned to exclusively use Python. Third, while I wanted a book that didn't assume programming knowledge, I also didn't want it to ignore acquiring (some) programming knowledge (it does). Last, while generally accessible, it is still clearly written for graduate students or independent learners. That is to say, not really for brand-new students who aren't comfortable pulling out the key information they need. It also was not written as a textbook per se. It doesn't really have any true exercises (a few are scattered throughout, from the class the book was based on).

Data Science from Scratch – First Principles with Python:

Great: first principles, using Python, everything I wanted, right? Yep, it was, if I were only teaching data science majors or computer science majors. I could probably get away with no programming background (though probably not, or so my business students said). Really, if you are looking at implementing an Intro to Data Science course as a sophomore/junior-level elective in a computer science program, or possibly a Math/Stats program with a programming prerequisite, this is a reasonable choice. It covers the minimum statistics and mathematics you need (including linear algebra!) and doesn't really assume you know tons of Python.

So what's wrong with this text? For a general course, it's simply too technical. While it does cover several important ideas at a high level, the students lost those few comments in the nitty-gritty of code implementation and examples. Furthermore, while I really liked the idea behind the text, that you implement everything you need to do data science from scratch (or from things earlier in the book), in practice that felt like a giant step backwards. There are quite a few fantastic packages for data science, from the really well known, like pandas and scikit-learn, to the lesser known, like Bokeh. I understand wanting students to know from the ground up how things work, but you can understand things while still using well-tested packages (and avoid all your own errors).

Something New?

So, bringing it back around to my conversation with Katie on Friday: she had also expressed frustration with her textbook in terms of reaching the intended audience (freshmen, new students). I wasn't happy in the end with either textbook. Next year I'll probably cut Data Science from Scratch, as it really didn't do what I needed. A lot of the other intro courses I know about don't even use a textbook. So, here are a few questions I'd like to pose to my readers:

  • Are there other textbooks that you are using, or know of that fit these needs?
  • What are your “must-haves” or “must-avoids” for a textbook for freshman-targeted courses in data science?

Student Learning Objectives – Part 4

This post is part of a series on student learning objectives (SLO’s) for both curriculum and courses. The SLO’s in this post are course level, specifically topical objectives for an “Introduction to Data Science” (Data 151) class for new students. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

The last post focused on high-level learning objectives for the course “Introduction to Data Science” (I’ve repeated them below for reference). Those are certainly the big picture, but those four objectives are hardly enough to really design day-to-day lessons around. Data 151 also has seven topical objectives tied directly to those general objectives and modeled after Paul Anderson’s DISC 101 course objectives. I’ll tie each topical objective back to the course’s overall goals.

General Course Objectives:

A. Students understand the fundamental concepts of data science and knowledge discovery
B. Students can apply and perform the basic algorithmic and computational tasks for data science
C. Students develop and improve analytical thinking for problem formulation and solution validation, especially using technology
D. Students prepare for success in a world overflowing with data.

Topical Objectives:

  1. gain an overview of the field of knowledge discovery (A)
  2. learn introductory and state-of-the-art data mining algorithms (A,B)
  3. be able to distinguish and translate between data, information, and knowledge (A, C)
  4. apply algorithms for inductive and deductive reasoning (B,C)
  5. apply information filtering and validation on real world datasets (B,C)
  6. understand the social, ethical, and legal issues of informatics and data science (A,D)
  7. apply data mining, statistical inference, and machine learning algorithms to a variety of datasets including text, image, biological, and health (B,D)

Four of the topical objectives (1, 2, 3 & 6) focus on guiding students towards understanding the fundamental concepts behind data science. One can hardly call a course an "introduction" without giving an overall picture of the field (Obj. 1) or spending time understanding key tools that practitioners use (Obj. 2). While I fully anticipate that the state-of-the-art algorithms will change, basics like k-Nearest Neighbors, k-Means, and decision trees certainly will not. These algorithms provide a nice gateway into the idea of learning from a collection of data (Obj. A).
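To give a feel for why these algorithms make such a good gateway, here is a bare-bones k-Nearest Neighbors sketch (the toy data and function names are my inventions, not from any particular textbook): a handful of labeled points is enough to "learn" a label for a new one.

```python
from collections import Counter
import math

def knn_predict(k, labeled_points, new_point):
    """labeled_points: list of ((x, y), label) pairs.
    Returns the majority label among the k points nearest to new_point."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: math.dist(pair[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two clusters with labels "A" and "B"
train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(3, train, (2, 2)))  # → "A": two of the three nearest are A
```

The whole "model" is just distance plus a vote, which is exactly what makes it a friendly first look at learning from data.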

We also know in data science that what you can learn from a data set is limited by the quality of the input data (like a lot of other things in life: garbage in, garbage out). Objectives 5 & 7 articulate the sorts of data that will be used in the course, both real-world data and a mix of prepared/known data sets. These data sets provide a way to actually practice Objectives 2 & 4 in more than just an abstract way. I want students to walk away from this class knowing how practitioners actually make use of algorithms. Students need to get their hands dirty putting some of those algorithms to work (Obj. B/C).

Now, I think it’s important to note here that in their projects and general work, I’m not expecting a really deep understanding or application of the algorithms. That’s saved for two later courses, one explicitly on data mining and the other their capstone sequence. In Data 151 they should be learning enough to continue learning on their own, understand and interact with people who are really doing this work, and to grasp how the ideas can and are shaping the evolution of various disciplines or industries.

While Objectives 2, 4 & 5 articulate using data science skills, Objectives 2-5 have a second layer as well. These objectives aim to have students think about the implications and knowledge that come from the data science process. This course is about more than just data engineering or data mining; it's really about the questions and, well, science that can be done with data. It is only when students can understand the processes of both inductive and deductive reasoning for science, or transform raw data into actionable knowledge, that they become aware of the true power of the field (Obj. B/C).

Last, but certainly not least, Objective 6. As we know from Spider-Man (and some other great speeches), "With great power comes great responsibility." If you believe, like I do, that data science could dramatically change what we know and how industries and society are run… then I hope you are also a little nervous, perhaps occasionally terrified. Because if we DON'T talk about the social, ethical, and legal issues surrounding informatics and data science, we might well end up with something like Ultron (the artificial intelligence gone bad in Marvel's "Avengers: Age of Ultron"). More likely, we'll end up with biased learning algorithms that perpetuate injustice or inequality. Making sure students have at least started to think about these sorts of issues may not prevent them from happening, but it is one (in my mind, necessary) step towards that goal (Obj. D).

Together this is a pretty hefty set of things to accomplish in a semester. All in all, though, I think they serve as a great lead-in to the entire field and to the overall goals of Valpo's Data Science program (described in previous posts). Even if a student only takes Data 151 (as some certainly will), they will leave with a broad understanding of the field, enough knowledge to interact successfully with experts, and enough insight to see the real value that effective and intelligent use of data can provide. I hope my business students are now prepared to be the "data-savvy business managers" McKinsey & Co. described a few years ago, and that the rest (C.S., Math, and Stats) can work with, or become, true data scientists, engineers, or creators.

Student Learning Objectives – Part 3

This post is part of a series on student learning objectives (SLO’s) both for curriculum and courses. The SLO’s in this post are course level, specifically for an “Introduction to Data Science” (Data 151) class for new students. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

In many ways, the general course SLO's for Data 151 mirror the SLO's for the program as a whole. Students need to leave with an understanding of what data science is, know about the basic algorithms, and be made aware of the ethical and moral issues surrounding the use of data. Data 151 is intended to be a hook that draws in students from across our university to learn about data and then consider adding a major in Data Science. It also draws in juniors and seniors in less technical disciplines like business. This may in turn make Data 151 the only course where a student explicitly thinks about data. The major difference between the curricular and course SLO's is the depth of understanding I expect students to leave the course with (as opposed to the program). This is most clear in the first two SLO's below.

  1. Students understand the fundamental concepts of data science and knowledge discovery
  2. Students can apply and perform the basic algorithmic and computational tasks for data science

As noted, these are very close to the first two SLO's for the whole curriculum, relating both to students' ability to communicate data science concepts and to their ability to implement solutions, though in both cases with lower levels of expertise. Data 151 has two additional SLO's that target the broader (potential) audience for the course (in addition to continuing majors). These are:

  3. Students develop and improve analytical thinking for problem formulation and solution validation, especially using technology
  4. Students prepare for success in a world overflowing with data.

In many cases, students in Intro to Data Science are still gaining experience (aren’t we all?) with general problem solving skills. Perhaps (to my mind) one of the most under-taught skills in STEM courses is how to actually formulate and structure the process of solving a problem. In many, many cases, a significant amount of time can be saved in the execution of problem solving by carefully planning out how you are going to explore or solve a problem. Data science even has this explicitly built into several locations in a typical workflow, specifically performing exploratory data analysis and planning for solution validation.

Meanwhile, the final objective is meant to really be a catch-all. The field of data science is changing incredibly rapidly, as are the ways data is generated and used. I wanted Data 151 to be something that is capable of covering current, bleeding-edge topics. This SLO also nicely encompasses my plans to bring in alumni and current practitioners as speakers to give the students insight into what future jobs might look like. Bringing in these speakers also provides a chance for students to get an industry perspective on workflows and processes, something that can be very different from academia’s problem solving process.

These SLO’s are pretty high-level, but intentionally so. At Valpo, we’ve got both “course objectives” and also topical objectives. My next post will take a look at the specific, topical objectives for Data 151, which deal with the more nitty-gritty topics of what will actually get covered in Data 151.

Resources for Learning Data Science Visualization

A large component of a data science education is learning how to effectively visualize data. This could be part of exploratory data analysis, producing presentations for co-workers or bosses, or just because you want to show off something neat you found.

One of the content sections for Data 151: Introduction to Data Science tries to give students a really brief introduction to more complicated visualizations and functions, especially those available in R and Python. The books I've been using for the course (Doing Data Science by Cathy O'Neil and Rachel Schutt & Data Science from Scratch by Joel Grus) include sections on visualization. In fact, I initially started collecting these resources from the "extra resources" section of Doing Data Science's Chapter 9, specifically the references to FlowingData's tutorials and Michael Dubakov's visual encoding article.

Not all of FlowingData's tutorials are free, though, so this page details and links the FREE tutorials available (see end of post). Additionally, I struggled to find anything for Python of equal quality/type to FlowingData's tutorials in R. So, for now, I've resorted to sending the students to DataCamp's courses on Python visualization (which seem great so far, but are a bit different).

More generally, for a summary of some common/useful Python visualization libraries, you can read either Mode Analytics' article 10 Useful Python Data Visualization Libraries or KDnuggets' article Overview of Python Visualization Tools. There's significant overlap of information between the two; ironically, I think Mode Analytics' article does a better "overview," while KDnuggets' article actually shows some code.

There are also (perhaps obviously) entire books on this subject, but that seems like overkill for an introduction class. (I haven't really researched books yet, though when I get to planning our Scientific Visualization course there'll be more posts.) Similarly, even though D3 does amazing things, introducing another programming language seemed excessive. If you are interested in D3, Doing Data Science points to Scott Murray's D3 tutorials. There are also a few (free) tutorials on using D3 (and Python) on FlowingData (mentioned below).

Free Tutorials from FlowingData in R:

Good, basic, initial tutorial:

Several Basics of Plotting:

Some more intermediate chart-types:

A few really nifty advanced charts:

Other: How to Download and Use Online Data with Arduino (Uses R)

Free Tutorials from FlowingData (not in R):

Getting Data (and some visualizations):

Visualization Focused:


Other: How to Make an Interactive Area Graph with Flare (Uses Actionscript/Flash) allows academics to make free classes with full access to their premium content, but even if you can’t get free access, the first chapters are free for all and include several good visualization intros. I’ve only included the Python content here but they also have material on visualization in R.

As a quick primer: matplotlib is the "grandfather" of plotting in Python, so it is the most technical/intensive but also the most powerful. ggplot2 and Bokeh are far easier to use and are probably better places to start learning.
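For a taste of that trade-off, here is a minimal matplotlib histogram (made-up data; the filename is arbitrary). The explicit figure/axes/label calls are exactly the verbosity mentioned above, but they also give you control over every element:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# A first matplotlib chart: more verbose than ggplot2 or Bokeh, but every
# element (figure, axes, labels) is under your explicit control.
scores = [72, 85, 90, 68, 74, 81, 77, 88]  # hypothetical data
fig, ax = plt.subplots()
ax.hist(scores, bins=5)
ax.set_xlabel("score")
ax.set_ylabel("count")
fig.savefig("hist.png")
```

The higher-level libraries wrap much of this boilerplate into a single call, which is why they are gentler starting points for new students.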