Tagged websites

Big Month in Data Education — October

October has been an incredibly busy month! I’ve been traveling a lot, taking part in a wide variety of activities around data science education. It’s been a pretty big month and I’m here to give you a very quick run-down of what’s been happening!

The month kicked off with the Midwest Big Data Innovation Hub’s “All-hands on Deck” Meeting. I was invited there as part of a planning grant the hub had received to develop a spoke proposal for the hub to create a “Resource Center for Non-R1 Universities”. The meeting was very interesting, and we got to hear about some really neat work on using data science to advance agriculture, smart cities and more. The most relevant part for data science education, though, was the final panel, “Education and Workforce Development.” Panelists included Jim Barkley, David Mongeau and Renata Rawlings-Goss. You can find their slides on the Midwest Big Data Hub (Barkley Slides, Mongeau Slides, Rawlings-Goss Slides). There is also a video recording of the panel here. The other important event that happened at the meeting was the afternoon grant-planning session. While I can’t share documents from that yet, I left very excited about the possibilities of establishing an important educational center for data science education that would help address the needs of non-R1 institutions. Some of the ideas that were shared included providing a clearing house for internships and project opportunities, connecting smaller institutions with interesting research projects, and helping institutions find instructional expertise for more esoteric courses.

Mid-month (October 20th), the National Academy of Sciences held their 4th roundtable on Data Science Education, “Alternative Institutional and Educational Mechanisms”. You can find the webcast and agenda webpage here. I attended as a member of the public and was able to contribute a few ideas and questions. There were several great presentations, including some perspectives on education I hadn’t considered. Eric Kolaczyk gave a great presentation that described a very nicely integrated learning pathway for building data expertise at the master’s level. The MS in Statistical Practice is one of the few programs I know of (now) that actually redesigned several of their courses to create a more effective data science education and a cohesive learning structure. It was also very informative to hear about Metis’s data science “bootcamps”. It’s pretty clear Metis is doing some excellent education work in data science, but it is very different from traditional, academic education. Additional talks worth listening to came from Andrew Bray, explaining the origin and evolution of the American Statistical Association’s DataFest events; Ron Brachman, describing Cornell Tech’s ‘entrepreneurial’ focused data science; and Catherine Cramer, discussing the New York Hall of Science’s Network Science education initiatives (I plan to use some of this material with my students who do network science research!).

Additionally, the National Academy of Sciences has released an interim report on the ongoing “Envisioning the Data Science Discipline” studies. The report is definitely worth reading and provides some very interesting views and findings. There’s also a strong call for community input, so send your ideas in!

The last activity I participated in during October was the South Big Data Hub’s workshop “Keeping Data Science Broad: Workshop on Negotiating the Digital and Data Divide”. This workshop was an incredible pleasure to join! I think the best part was that, with the entire room filled with people who have already been thinking about what data science and data science education might look like, we were able to frequently move beyond the “what is data science” discussion. It meant that we could really start discussing the roadblocks and opportunities inherent in data science. While I can’t share more of the actual outcomes/products from the workshop yet, we’ve got a really aggressive schedule to turn the output into a report (due Dec 6th!). I’m hopeful that something really powerful will come out. I know there was a lot of writing accomplished while there (I wrote 5-6 pages, and others did too), so look for another announcement of a report in early December.

Finally, while I haven’t been participating in or watching them much yet, I need to mention the ongoing webinar series being run by the National Academy of Sciences. You can find the entire webinar series here. October saw 4 webinars posted: “Communication Skills and Teamwork”, “Inter-Departmental Collaboration and Institutional Organization”, “Ethics”, and “Assessment and Evaluation for Data Science Programs”. I’m still hoping to watch these and provide summary posts… but that hasn’t happened yet. If any of my readers have been watching them and would like to write a guest post with a summary, please get in touch!

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive listing of sites that have curriculum available; instead, it’s intended to help reduce your search time for this kind of material.

If you are looking to find wholesale curriculums, including course materials, there are a few options available to start the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant funded with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc. To understand their vision, see the image below.

This idea of continual improvement was more than just a grant-seeking ploy, as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected in July 2017.

Overall, their curriculum structure (I haven’t evaluated their deeper content) has a fairly high emphasis on computational topics, with less on the statistical/mathematical underpinnings. You can experience their curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable, though, are their actual grant deliverables. These deliverables include details on the overall design principles in their structure with learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major, important topic. However, this should be done with proper attribution and license checking (of course).

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found a data science consulting firm and other initiatives. While not every link on the site is actually to a free resource (there are several books to buy, etc.), it does a pretty nice job of highlighting the topics that will need to be covered (if possible), and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is Python focused; however, the site also has a collection of R resources.

Corthell isn’t the only one, though, with an “open source” or “free” data science (masters) degree. Another collection of relatively similar material was put together by David Venturi, who’s now a content developer at Udacity (writing data science curriculum, of course). For those designing curriculums, both Corthell and Venturi provide excellent resources and starting points for framing your own. However, if you hit this page trying to get into data science, read this Quora post that I think accurately highlights the challenges of learning from/with these open source programs.

Another similar alternative, which I’d peg closer to an undergraduate degree, is the Open Source Society University’s data science curriculum. Their curriculum assumes a lot less pre-knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in line with the recommendations for curriculum from the Park’s paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required to learn from each course. You’ll see a large repetition of topics, but the OSS-Univ’s curriculum has a lot less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer duplicates of a significant subset of classes from these curriculums. While emulation might be the highest form of praise, we’ll each need our own, unique take on these courses while striving for sufficient similarity to have a semi-standardized knowledge base for practitioners. Good luck!

 

Intro to DS Assignment Sites

As an instructor, I want to provide high-quality assignments that are focused (so they achieve the learning objective), engaging (so students aren’t bored), and well supported (so students don’t end up frustrated). In an ideal world, I’d have time to write, test, debug, and administer all my own, course-tailored assignments that meet these goals. I, however, do not live in an ideal world, nor do I have enough graduate/undergraduate minions to mimic this ideal world. Instead, I’ve turned to using a few sites that already host assignments and resources, and even include auto-grading (without me needing to learn/set up the system).

Learn2Mine (L2M) is the first site I used in conjunction with my Data Mining course, and more recently my Introduction to Data Science course. Learn2Mine is a free, open source platform developed at the College of Charleston (CoC). While I have only really made use of the contents already there and CoC’s hosted site, you can contribute, or host your own version by getting the source directly from GitHub. Dr. Anderson is fairly responsive about keeping the site running and grading.

The positive features for L2M (beyond being totally free/open source) are that it includes a mix of both introductory programming assignments and several more advanced machine learning/data mining lessons. It even has several search algorithm lessons (which I tend not to use). All of the lessons include auto-graded response boxes which also provide limited feedback of the errors generated when comparing submitted work to answers. There is also an interface for instructors to create their own ‘courses’ which consist of a series of the lessons on L2M. This allows the instructor to see student progress through lessons and download a grade-book in spreadsheet format.

Downsides for L2M are in line with what you pay for (or invest in time-wise). Even though there is feedback when students get answers wrong, this often just consists of the identification of mismatched output lines (so pretty sparse). Students often get very frustrated trying to figure out what they are missing. This is exacerbated by the fact that the instructions are often unclear or insufficient to allow students to simply do the lessons. Also, as might be expected from a locally built/maintained project, there are a lot of “polish” features missing, such as being able to reorder assignments in a course, or associate a name with an account. Students have an account associated with the email they log in with, so it can sometimes be challenging to connect records with students. Overall, I’ve been considering phasing L2M out of my normal assignment structure, though the possibility of hosting my own local version and implementing different, better-explained lessons has also been tempting.
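To make that “mismatched output lines” style of feedback concrete, here is a minimal Python sketch of a line-by-line output comparison. This is not Learn2Mine’s actual grader (I haven’t looked at how it is implemented); the function name `mismatched_lines` and the sample outputs are made up purely for illustration.

```python
from itertools import zip_longest


def mismatched_lines(expected, submitted):
    """Return (line_number, expected_line, submitted_line) for each differing line.

    Hypothetical helper for illustration only; a line missing on either
    side is reported as None.
    """
    exp_lines = expected.strip().splitlines()
    sub_lines = submitted.strip().splitlines()
    mismatches = []
    # zip_longest pads the shorter output with None so extra/missing lines are caught.
    for number, (exp, sub) in enumerate(zip_longest(exp_lines, sub_lines), start=1):
        # Trailing whitespace is ignored so invisible differences do not fail a line.
        if exp is None or sub is None or exp.rstrip() != sub.rstrip():
            mismatches.append((number, exp, sub))
    return mismatches


if __name__ == "__main__":
    expected_output = "mean: 4.0\nmedian: 3.5\nmax: 9\n"
    student_output = "mean: 4.0\nmedian: 3\n"
    for number, exp, sub in mismatched_lines(expected_output, student_output):
        print(f"line {number}: expected {exp!r}, got {sub!r}")
```

Feedback of this kind tells a student which line is wrong but nothing about why, which is a big part of the frustration described above.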

The prime contender to replace L2M for me has been DataCamp. I’ve known about DataCamp for a while now but had my first chance to actually use it and make assignments from it this spring when I was looking for data visualization lessons (see visualization resources post). I’ve gone through a few lessons myself and found DataCamp to basically be exactly what I’d want/envision online course-work to be. Most courses consist of short videos (a best practice) followed by several guided coding exercises. DataCamp is not (sort of) free, which turns out to be a pro and a con.

If it’s not free, why is DataCamp going to replace L2M for me? Great question. Because, for academic purposes, DataCamp IS free. If you are an instructor for an academic institution teaching a course with 10+ students, you can request free, premium access for students enrolled in your course(s). That access is limited (they give you 6 months), but hey, it’s free. What else makes DataCamp a nicer replacement? First, the coding exercises are scaffolded; that is, early exercises have more prewritten code while later exercises require you to remember and use what you’ve already learned. In addition, the coding exercises have reasonably helpful error messages and help, often allowing you to more accurately debug code. They’ve also got built-in hints/help available, so a student can’t get permanently stuck. Using those, however, decreases the “exp” they gain, so you can still track how successful a student has been without help. The other major advantage is that DataCamp has a SIGNIFICANTLY larger set of lessons/courses available to pull from.

There is no free lunch in data/computer science though. DataCamp does have a few downsides. Perhaps the biggest is the granularity available in assignments. You have three choices: “collect xp”, “complete chapter”, or “complete course”. Given that a chapter is really the smallest cohesive learning unit on DataCamp, this makes a lot of sense educationally. However, that also means it’s not exactly an alternative for giving individual lab/homework assignments. Instead, it would serve best as a resource/major assignment related to learning how to program in Python/R, or a bigger topic.

Finally, I want to mention Gradescope. Gradescope isn’t a data science educational site. Instead, it’s a jack-of-all-trades which can help ease the burden of assignments and grading. If DataCamp took L2M and removed granularity/options, Gradescope (in this context) goes the other direction. Lots of faculty use it for all kinds of courses, from computer science or mathematics to writing. Given its purpose, Gradescope doesn’t have any specific assignments (maybe that was obvious). Instead, it can serve as an autograder or collection site for your assignments. I’ve included it here for those that might already have assignments (or who get them from others) but still want a speedy, simple way to get feedback to students.

I’d be remiss if I didn’t point out that there are some alternatives to DataCamp, depending on your goals. If all you need students to do is learn to program (not necessarily in a data-centric style), try Codecademy or explore Code.org. I also know there is an alternative to Gradescope, but I couldn’t track down the name/site (if someone knows, please email me or leave a comment). What I recall is that the alternative is NOT free, but does provide better support and scaling. You might also consider what options are available or integratable with your learning management system (DataCamp IS… but maybe not by you).

Hopefully you found this post informative. If you’ve got other suggestions of websites with assignments (particularly data-science related), please let me know or leave a comment.