Intro to DS Assignment Sites

As an instructor, I want to provide high-quality assignments that are focused (so they achieve the learning objective), engaging (so students aren’t bored), and well supported (so students don’t end up frustrated). In an ideal world, I’d have time to write, test, debug, and administer all my own course-tailored assignments that meet these goals. I, however, do not live in an ideal world, nor do I have enough graduate/undergraduate minions to mimic one. Instead, I’ve turned to a few sites that already host assignments and resources, and even include auto-grading (without me needing to learn or set up the system).

Learn2Mine (L2M) is the first site I used, in conjunction with my Data Mining course and, more recently, my Introduction to Data Science course. Learn2Mine is a free, open source platform developed at the College of Charleston (CoC). While I have only really made use of the content already there and CoC’s hosted site, you can contribute, or host your own version by getting the source directly from GitHub. Dr. Anderson is fairly responsive about keeping the site running and grading.

The positive features of L2M (beyond being totally free/open source) are that it includes a mix of introductory programming assignments and several more advanced machine learning/data mining lessons. It even has several search algorithm lessons (which I tend not to use). All of the lessons include auto-graded response boxes, which also provide limited feedback on the errors generated when comparing submitted work to answers. There is also an interface for instructors to create their own ‘courses’, which consist of a series of the lessons on L2M. This allows the instructor to see student progress through lessons and download a grade book in spreadsheet format.

The downsides of L2M are in line with what you pay for (or invest time-wise). Even though there is feedback when students get answers wrong, it often consists only of identifying the mismatched output lines (so pretty sparse). Students often get very frustrated trying to figure out what they are missing. This is exacerbated by the fact that the instructions are often unclear or insufficient to allow students to simply do the lessons. Also, as might be expected from a locally built/maintained project, a lot of “polish” features are missing, such as being able to reorder assignments in a course or associate a name with an account. Students have an account associated with the email they log in with, so it can sometimes be challenging to connect records with students. Overall, I’ve been considering phasing L2M out of my normal assignment structure, though the possibility of hosting my own local version and implementing different, better-explained lessons has also been tempting.

The prime contender to replace L2M for me has been DataCamp. I’ve known about DataCamp for a while now, but I had my first chance to actually use it and make assignments from it this spring when I was looking for data visualization lessons (see visualization resources post). I’ve gone through a few lessons myself and found DataCamp to be basically exactly what I’d want/envision online coursework to be. Most courses consist of short videos (a best practice) followed by several guided coding exercises. DataCamp is not (sort of) free, which turns out to be a pro and a con.

If it’s not free, why is DataCamp going to replace L2M for me? Great question. Because, for academic purposes, DataCamp IS free. If you are an instructor at an academic institution teaching a course with 10+ students, you can request free premium access for students enrolled in your course(s). That access is limited (they give you 6 months), but hey, it’s free. What else makes DataCamp a nicer replacement? First, the coding exercises are scaffolded; that is, early exercises have more prewritten code while later exercises require you to remember and use what you’ve already learned. In addition, the coding exercises have reasonably helpful error messages and help, often allowing you to debug code more accurately. They’ve also got built-in hints/help available, so a student can’t get permanently stuck. Using those, however, decreases the “exp” they gain, so you can still track how successful a student has been without help. The other major advantage is that DataCamp has a SIGNIFICANTLY larger set of lessons/courses available to pull from.

There is no free lunch in data/computer science though. DataCamp does have a few downsides. Perhaps the biggest is the granularity available in assignments. You have three choices: “collect xp”, “complete chapter”, or “complete course”. Given that a chapter is really the smallest cohesive learning unit on DataCamp, this makes a lot of sense educationally. However, it also means DataCamp is not exactly an alternative for giving individual lab/homework assignments. Instead, it would serve best as a resource/major assignment related to learning how to program in Python/R, or to a bigger topic.

Finally, I want to mention Gradescope. Gradescope isn’t a data science educational site. Instead, it’s a jack-of-all-trades that can help ease the burden of assignments and grading. If DataCamp took L2M and removed granularity/options, Gradescope (in this context) goes the other direction. Lots of faculty use it for all kinds of courses, from computer science or mathematics to writing. Given its purpose, Gradescope doesn’t have any specific assignments (maybe that was obvious). Instead, it can serve as an autograder or collection site for your assignments. I’ve included it here for those who might already have assignments (or who get them from others) but still want a speedy, simple way to get feedback to students.

I’d be remiss if I didn’t point out that there are some alternatives to DataCamp, depending on your goals. If all you need students to do is learn to program (not necessarily in a data-centric style), try Codecademy or explore Code.org. I also know there is an alternative to Gradescope, but I couldn’t track down the name/site (if someone knows, please email me or leave a comment). What I recall is that the alternative is NOT free, but does provide better support and scaling. You might also consider what options are available for, or can be integrated with, your learning management system (DataCamp IS… but maybe not by you).

Hopefully you found this post informative. If you’ve got other suggestions of websites with assignments (particularly data-science related), please let me know or leave a comment.

 

Version Control and Reproducible Research/Data Science

A current hot topic in research, especially in statistically driven research, is “reproducible research”. In academia, the process of peer-reviewed publication is meant to assure that any findings are reproducible by other scientists. But those of us in the trenches, and especially on the data side of things, know that reproducibility is a theoretical outcome and far more rarely something tested. While academia is rightly under fire for this lack of actual, reproducible research (see this great example from epidemiology), this is even more of a problem in industry. If the analysis can’t be reproduced, then it can’t be applied to a new client base.

So why bring this up on an educational blog? I think it’s important to embed the idea of reproducible work deep inside our teaching and assignment practices. While the idea of repeating a specific analysis once the data has changed isn’t really novel, it becomes far more relevant when we begin talking about filtering or cleaning the input data. Just think about searching for outliers in a data set. First, we might plot a histogram of values/categories; then we go back, remove the data points that we want ignored, and replot the histogram. BAM! Then we have a perfect opportunity to teach the value of reproducible work! We used exactly the same visualization technique (a histogram) on practically the same data (with outliers and without outliers).

Where does the reproduction of the work fit in though? Python and R both have histogram functions, so this is definitely a toy example (but the whole idea of functions can serve to emphasize the idea of reproducible/reusable work). Instead, I think this is where the instructor has an opportunity. This idea of cleaning outliers could easily be demonstrated in the command line window of R or an interactive Python shell. And then you’ve lost your teaching moment. Instead, if this is embedded in an R script or Python/R Notebook you can reuse the code, retrace whatever removal process you used, etc. In the courses I’ve taught, I’ve seen student after student complete these sorts of tasks in the command-line window, especially when told to do so as part of an active, in-class demo. But they never move the code into a script so when they are left to their own devices they flounder and have to go look for help.
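To make that concrete, here is a minimal sketch of what the script version of the outlier demo might look like. It is only an illustration under my own assumptions: the data values are made up, the function names (iqr_bounds, remove_outliers, histogram, show) are hypothetical, and the 1.5 × IQR cutoff is just one common rule of thumb for flagging outliers. It sticks to the Python standard library and prints a text histogram, so it should run without installing a plotting package.

```python
"""Outlier-cleaning sketch: the same histogram code runs before and after filtering."""
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) cutoffs using the 1.5 * IQR rule of thumb (an assumption)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def remove_outliers(values, k=1.5):
    """Keep only the points inside the cutoffs, so the removal rule is on record."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def histogram(values, bin_width):
    """Count values per bin; each key is the left edge of a bin."""
    counts = {}
    for v in values:
        left = (v // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def show(title, values, bin_width=5):
    """Print a text histogram, one row of '#' marks per bin."""
    print(title)
    for left, count in histogram(values, bin_width).items():
        print(f"{left:>5} | {'#' * count}")


# Made-up measurements with one obvious outlier (98).
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22, 98]
cleaned = remove_outliers(data)

# The same function draws both pictures; only the input changes.
show("With outliers", data)
show("Without outliers", cleaned)
```

Because the removal rule lives in remove_outliers rather than in someone’s shell history, changing the cutoff or swapping in new data means rerunning one file instead of retyping the whole demo.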

I titled this post “Version Control and Reproducible Research”… you might be wondering what version control has to do with this topic. The ideas described above are great if you are the sole purveyor of your code/project. But if you have your students working in teams, or are trying to collaborate yourself, this might not be exactly ideal. But it’s getting pretty close! Here’s the last nugget you need to make this work… version control. Or in this case, I’m specifically talking about using GitHub. The short version of what could be an entire separate post (I’ll probably try to do one eventually) is that Git (along with GitHub, the cloud host for Git repositories) is the tool that software developers designed to facilitate collaborative software development without the desire to kill each other over broken code. It stores versions of code (or really any file) that can be jointly contributed to without breaking each other’s work. For now, I’ll point you to a few resources on this.

First, a bit more from an industry blog on workflows to promote reproduction using GitHub: Stripe’s Notebooks and GitHub post

Second, for using Git/GitHub with R: Jenny Bryan, Prof. University of British Columbia. Note that this is a really long, complete webpage/workshop resource!

Third, a template/package for Python to help structure your reproducible GitHub work: Cookiecutter Data Science (heck, this could be an entire lesson itself in how to manage a project; more on that later)

Fourth, a template/package for R to help structure your reproducible GitHub/R work: ProjectTemplate