
Announcement and Reflections on ACM’s Draft Data Science Curriculum

Last week, there was an announcement of the first draft of ACM’s “Computing Competencies for Undergraduate Data Science Curricula”, that is, ACM’s take on a data science curriculum recommendation. The full draft can be found here. The ACM Data Science task force is explicitly asking for community feedback on this draft by March 31st. I was able to attend their town-hall feedback session at the SIGCSE Technical Symposium, where there was both excitement and some concern about the scope of the curriculum recommendations. This post offers some reflections and thoughts on the draft; however, I strongly encourage anyone involved with data science curriculum design or implementation to read it for yourself!


Chapter 1: Introduction and Background Materials

First, I’m really glad to see this being produced. I’ve commented previously on this blog that the ‘computing’ perspective was often a bit under-represented in the other curriculum guidelines. I also need to praise the task force for not simply reinventing the wheel! Their first substantial section is a review of the existing, relevant curriculum recommendations related to data science. They’ve done a thorough job (the first such review I’ve seen publicly posted), with some valuable insights into each. If you haven’t had a chance to read some of my blog posts about the other recommendations (see: Related Curricula, EDISON, Park City), their summary is an excellent starting place. One curriculum they examine that has not been discussed on this blog is the Business Higher Education Framework (BHEF) Data Science and Analytics (DSA) Competency Map (2016). Their discussion of this material can be found on page 7.

Another important thing to catch in their discussion of the task force’s charge, and work, is that they are only trying to define the computing contribution to data science. This is in stark contrast to most of the other curriculum guidelines relating to data science, which include the full breadth of what a data science curriculum might entail. In talking with the chair of the task force, there really is a recognition that this is only the first stage in developing a community-recognized, full-fledged curriculum guide.


Chapter 2: The Competency Framework

The task force is taking a slightly different approach to developing the curriculum than ACM took with CS-2013. Instead of focusing exclusively on “Knowledge Areas” they are developing a competency framework. Given how much the field of data science leans on soft-skills, in addition to technical skills, this is certainly a reasonable approach. The main concern expressed by the task force chair, which I share, is that it is still important for the final guide to be highly usable to guide program development. While the current draft does not achieve the same level of usefulness that CS-2013 does, I have high hopes for their final product. The motivation for this switch is grounded heavily in current scholarship of teaching and learning alongside cognitive learning theory. This has a long-term potential to help transform educational settings from a passive learning environment to a more active, student-centered paradigm (which I am strongly in favor of!). However, it will require significantly more work to transform the current competencies into something usable for both student-centered design and programmatic design.

If you aren’t aware of the concepts of “Understanding by Design”, learning transfer theory, or how these interact on a ‘practical, operational level’ it would certainly be worth your time to read through this chapter carefully. It may provide you with many new ideas to consider when doing course planning or activity planning in general.


Appendix A: Draft of Competencies for Data Science

To begin with, this appendix is massive: at 23 pages, it makes up roughly 40% of the entire document. The task force is well aware that this section is too extensive to be truly useful, especially as currently presented. However, they will be forming several sub-committees to refine each of the competency areas in the next month or two. The target time-frame for a refined draft is late summer. The next sections of this post reflect on the various competencies as stated.


BTW: If you are interested in serving on one of these subcommittees, please email the task force co-chairs, Andrea Danyluk and Paul Leidig, ASAP.


  • Computing Fundamentals
    • Programming
    • Data Structures
    • Algorithms
    • Software Engineering

This competency and its sub-categories clearly demonstrate the break from CS2013. Where CS2013 organized content based on topical areas of computer science, here we see a smattering of ideas from several areas. It pulls several ideas from “Algorithms and Complexity”, with a strong focus on the algorithmic side and the data/programming structures that support algorithm implementations. The beautiful thing is that these fairly clearly express computing’s perspective on absolutely essential tasks that support the best use of statistical and data science ideas. Probably the most surprising thing for someone not from a CS background would be the inclusion of the ‘Software Engineering’ ideas. However, based on my experiences talking with industry practitioners, this is perhaps the most overlooked area in preparing future data scientists. It becomes especially critical when trying to move models and techniques into actual production code that produces value for a company.


  • Data Management
    • Data Acquisition
    • Data Governance
    • Data Maintenance & Delivery

I have actually merged two knowledge areas as defined by the task force here: “Data Acquisition and Governance” and “Data Management”. As described, these could be combined into one over-arching idea: how a data scientist actually deals with the “bytes” of data, regardless of the actual content. This grouping also covers ideas such as selecting data sources, storing the data, querying the databases, etc. This section clearly draws heavily from the “Information Sciences” or “Information Management” sector of computer science.

Something that might be missing (or might be buried in the IS language) is the idea of careful design of the actual collection of data. That is, does a survey, log, or other acquisition process actually collect information that is usable for the planned data science task or goal?


  • Data Protection and Sharing
    • Privacy
    • Security
    • Integrity

Again, I’ve re-named the higher-level category. The task-force originally called this group “Data Privacy, Security, and Integrity”. While highly descriptive, as it matched exactly the sub-categories, it seemed slightly redundant to have it as the meta-category as well. This is an interesting grouping also, as the “Privacy” competency clearly covers things that most faculty and practitioners I discuss data science with would agree should be included. However, the “Security” and “Integrity” competencies dive into highly technical areas of encryption and message authentication. They both seem to have been heavily drawn from the realm of Cybersecurity. I expect that most of the existing data science (undergraduate) programs would find it highly challenging to include more than a very superficial coverage of this content. Even graduate programs might not do more than touch upon the idea of mathematical encryption unless the students themselves sought out additional course work (such as a cryptography class).

Even though I’m not sure programs do, or even could, cover this section of content more deeply, this may be a clear area for program expansion. Perhaps as more courses are developed that exclusively serve data science programs, it will become possible to include more of these ideas.


  • Machine Learning
  • Data Mining

As could be expected, there are competencies related to actually learning something from data. The task force has (currently) chosen to split these ideas into two categories. The Machine Learning knowledge area is massive, and includes most of the details about algorithms, evaluation, processes, and more. The Data Mining knowledge area seems to try to provide competencies related to the overall usage and actual implementation of machine learning. I’ll let you pick through it yourself, but from my read-through it seems to cover the majority of ideas that would be expected, including recognition of bias and decisions on outcomes.

My feedback: ditch the separate knowledge areas, and provide some “sub” areas under Machine Learning.


  • Big Data
    • Problems of Scale
    • Complexity Theory
    • Sampling and Filtering
    • Concurrency and Parallelism

For perhaps the area that drove data science into the limelight, the task force has provided a nice breakdown of sub-areas and related competencies. While a “sexy” area to have a course in, in my mind this is actually a “nice to have”, not a necessary content coverage area. Reading through all the details, it really does deal with “big” issues (appropriately!). However, many of the data scientists we train at the undergraduate level are simply not going to be dealing with these problems. Their day-to-day will be consumed with fundamentals, data governance and maintenance, and maybe, if they are lucky, some machine learning.


  • Analysis and Presentation

The task force’s take on this section was from a more technical standpoint. Specifically, it draws from the area of human-computer interaction (HCI). In walking the line of defining computing-specific competencies without edging into statistics or graphic design, I think this is an excellent section. I am glad to see its inclusion and thoughtful consideration. CS students often forget the importance of thinking carefully about how a human will actually interact with a computer; instead, they typically focus just on what the computer will output.


  • Professionalism
    • Continuing Professional Development
    • Communication
    • Teamwork
    • Economic Considerations
    • Privacy and Confidentiality
    • Ethical Issues
    • Legal Considerations
    • Intellectual Property
    • Change Management
    • On Automation

While this competency area is framed as a “meta” area with sub-categories, it has nearly as many sub-categories as the entire rest of the framework. While I think most (perhaps even all) of these do belong in a curriculum/competency guide, this felt excessive as presented. This is especially true if we are considering the suggested content for an undergraduate curriculum. While I feel that all students should be aware of the idea of “intellectual property”, getting into the weeds of different regulations, IP concepts, etc. seems pretty excessive for most students. Most likely, I’d simply encourage them to know what falls under these ideas, and then tell them to talk to a lawyer. Similarly, discussing “Change Management” at length seems highly ambitious for most data science students, especially at the undergraduate level. While they might need to be aware that their work will foster change, and that someone should be managing it… it probably shouldn’t be them unless they get explicit training in it! And, given the scope of technical skills to cover in a data science curriculum, I sincerely doubt there will be space for much of this.


While I’ve tried to provide some quick reflections on the entire draft, you should definitely go read it yourself! Or keep an eye out for the subsequent drafts and processes. ACM has a history of assembling very interdisciplinary teams to generate consensus curriculum guidelines, so I expect over the next few years we’ll see a fairly substantial effort to bring more perspectives to the table and generate an inclusive curriculum guide.

 

NASEM Webinar 1: Data Acumen

This webinar aimed to discuss how to build undergraduates’ “data acumen”. If acumen isn’t a word you use regularly (I didn’t before last year), it means “the ability to make good judgments and quick decisions”. Data acumen, therefore, is the ability to make good judgments and quick decisions with data. Certainly a valuable and important skill for students to develop! The webinar’s presenters were Dr. Nicole Lazar of the University of Georgia and Dr. Mladen Vouk of North Carolina State University. Dr. Lazar is a professor of statistics, and Dr. Vouk is a distinguished professor of computer science and the Associate Vice Chancellor for Research Development and Administration.

Overall, this webinar seemed to be largely a waste of time if your goal was to understand what activities, curricular designs, and practices would help students develop data acumen (see my last paragraph for a suggested alternative). On the other hand, if you’d like a decent description of the design and implementation of a capstone course, and the process of scaling one, listen to Dr. Lazar’s portion. If you still need an overview of the state of data science, then Dr. Vouk’s portion provided reasonable context. The most valuable things in the entire webinar were slides 26 and 27 (about minute 48). Slide 26 shows an excellent diagram of an “End-to-End Data Science Curriculum” that reasonably well articulates how a student might mature (and thereby gain data acumen); see figure 1 below. Slide 27 provides well-articulated learning objectives for core, intermediate, and advanced data science courses (see table below).

From the NASEM Data Acumen Webinar: North Carolina State University’s Curriculum Vision
  • Core
    • Able to master individual core concepts within Bloom’s taxonomy:
      Knowledge, Comprehension, Application, Analysis, Evaluation, and Synthesis
    • Able to adapt previously seen solutions to data science problems for target domain-focused applications utilizing these core concepts
  • Intermediate Electives
    • Able to synthesize multiple concepts to solve, evaluate and validate the proposed data science problem from the end-to-end perspective
    • Able to identify and properly apply the textbook-level techniques suitable for solving each part of the complex data science problem pipeline
  • Advanced Electives
    • Able to formulate new domain-targeted data science problems, justify their business value, and make data-guided actionable decisions
    • Able to research the cutting edge technologies, compare them and create the optimal ones for solving the DS problems at hand
    • Able to lead a small team working on the end-to-end execution of DS projects

 

An Alternative to the NASEM Webinar

While I found this particular webinar largely a waste of time, I also attended the NASEM Roundtable on “Alternative Educational Pathways for Data Science”. While certainly not focused on data acumen, the first presentation given at that roundtable described an excellent overall curriculum structure that did build students’ data acumen. Eric Kolaczyk from Boston University described their non-traditional master’s program in Statistical Practice. By integrating their course work, practicum experiences, and more, students are forced to exercise and build their ability to make good judgments about data investigations, methods, and results. The talk is well worth your time if you’d like some ideas for non-standard ways to build student skills and abilities.

Keeping Data Science Broad Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies about data science degrees and curricula being developed by primarily undergraduate serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce but they can have different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

(Other) Official Curriculum Guides

Last week I discussed several places from which you could pull curriculum planning materials. This week will continue that theme, but with a bit more of an ‘official’ flavor, by discussing several professional societies’ curricular guides. While there is no (clear) leading data science professional society (and none with curricular guidelines to my knowledge), there are a few closely related societies with official guidelines. Depending on what path you took into data science, you may be more or less familiar with the following societies: the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the Mathematical Association of America (MAA), and the American Statistical Association (ASA). There are several other societies relevant to data science, but not as vital in terms of official curricular guidelines (SIAM, INFORMS, AMS, ASEE). All four major societies (ACM, IEEE, MAA, and ASA) have released curricular guidelines relevant to data science. This post will give a very high-level overview of those guidelines and why you might care about what’s in them.

ACM and IEEE jointly released Curriculum Guidelines for Undergraduate Programs in Computer Science in 2013 (CS2013). The most valuable component of CS2013 for me is the specification of ‘Knowledge Areas’ that are obviously related to Data Science, and being able to see the professional community’s consensus on central learning objectives in these areas. Some clearly important/relevant areas are:

  • Computational Science
  • Discrete Structures
  • Graphics and Visualization
  • Information Management
  • Parallel and Distributed Computing

Other areas such as Algorithms and Complexity, Information Assurance and Security, or Programming Languages probably include specific learning objectives that are relevant to data science, but may not be needed in their entirety. Additionally, CS2013 allows you to examine the suggested course hours expected to be devoted to these topics. From an industry perspective, this can provide valuable insight into whether a data scientist or computer scientist might be more knowledgeable about a particular subject. This differentiation in knowledge is important as data science strives to define itself independently of its founding disciplines. If you are interested in throwing your net a bit wider, ACM also has guides for other programs like Computer Engineering and Information Technology (coming in 2017) on their guidelines site.

The MAA’s 2015 Committee on the Undergraduate Program in Mathematics (CUPM) Curriculum Guide to Majors in the Mathematical Sciences (the CUPM Guide, for short) can serve in largely the same way the CS2013 guide does, but from a mathematical/statistical perspective. With more detailed reports on Applied Mathematics, Computational Science, Operations Research, and other areas of mathematics that data science often operates in, the CUPM Guide makes it possible to understand what exactly (from a mathematician’s or computational mathematician’s perspective) are the most relevant areas of mathematics to understand for success. This guide can also help clarify exactly what sorts of mathematics courses a data science curriculum should require, by explaining where in the course structure specific topics, like sets, relations, and functions, get covered. In addition to their extensive undergraduate guide, the MAA also provides a lot of interesting materials related to masters/Ph.D. preparation, etc. These might be particularly interesting as you consider what sorts of students to recruit or include in a master’s program.

Finally, the ASA has perhaps the most relevant and diverse, but in many ways least detailed, set of curriculum guides. The undergraduate guidelines and reports cover how to assess instruction, program guidelines for the statistical sciences, and even the Park 2016 Data Science guidelines (which I have commented on in other posts). They also have two sets of graduate guidelines, from 2009 and 2012, for statistics masters/Ph.D. programs. What the ASA guidelines provide are much bigger, sweeping statements about the sorts of skills and knowledge that a statistics major should have. They include side notes that give more details, such as encouraged programming languages and even file formats. In many ways, I think the majority of the ASA guidelines could simply replace “Statistics Major” with “Data Science Major” and remain nearly as applicable. The biggest difference might be in the level/depth required in “Statistical Methods and Theory” (less) and “Data Manipulation and Computation” (more). In a sense, this is at the heart of many statisticians’ argument that “Data Science” isn’t really its own field. In practice, though, I think the final implementation and mindset behind a statistics major and a data science major will be very different, and certainly heavily influenced by the ‘host’ department.

That covers the breadth of the major professional societies’ curricular recommendations. I wasn’t able to find any (official) guidelines for a “business analytics” major from a professional society (see my resource page for a few unofficial documents), so if you know of one, please let me know.

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive posting of sites that have curriculum available, instead it’s intended to help reduce your search time for this kind of material.

If you are looking to find wholesale curriculums, including course materials, there are a few options available to get the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant-funded, with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc. To understand their vision, see the image below.

This idea of continual improvement was more than just a grant-seeking ploy, as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected in July 2017.

Overall, their curriculum structure (I haven’t evaluated their deeper content) has a fairly high emphasis on computational topics, with less on the statistical/mathematical underpinnings. You can experience their curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable, though, are their actual grant deliverables. These include details on the overall design principles of their structure with learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content, and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major, important topics; however, this should be done with proper attribution and license checking (of course).

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found his own data science consulting firm and other initiatives. While not every link on the site is actually to a free resource (there are several books to buy, etc.), it does a pretty nice job of highlighting the topics that need to be covered, and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is Python-focused; however, he also has a collection of R resources.

Corthell isn’t the only one, though, with an “open source” or “free” data science (masters) degree. Another collection of relatively similar material was assembled by David Venturi, who’s now a content developer at Udacity (writing data science curriculum, of course). For those designing curriculums, both Corthell and Venturi provide excellent resources and places to frame your learning. However, if you found this page trying to get into data science, read this Quora post, which I think accurately highlights the challenges of learning from/with these open source programs.

Another similar alternative, which I’d peg closer to an undergraduate degree, is the Open Source Society University’s data science curriculum. Their curriculum assumes a lot less prior knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in line with the curriculum recommendations from the Park’s paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required to learn from each course. You’ll see a large repetition of topics, but the OSS-Univ’s curriculum has a lot less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer duplicates of a significant subset of classes from these curriculums. While emulation might be the highest form of praise, we’ll each need our own, unique take on these courses while striving for sufficient similarity to have a semi-standardized knowledge base for practitioners. Good luck!

 

Blog Intro and Information

Welcome to “From the Director’s Desk”, a blog about data science education and curriculum. If you are interested in receiving regular updates when new posts appear, you can use the RSS feed link above or subscribe to the Google group (read more for the link; you don’t need a Gmail account to subscribe!). You can find a bit more about me, Karl Schmitt, on the about page. If you are looking for full degree curriculum development materials, I’ve created a resource page and tracked posts with a Program Development category. Individual course materials are tracked either generally with the “Course Development” category, or individually by each course the post relates to. Please feel free to email me or leave comments if you have questions, thoughts, or something to share!

The original blog introduction, with a bit of why the blog exists and what it seeks to cover is here.