Last week, the first draft of ACM’s “Computing Competencies for Undergraduate Data Science Curricula” was announced (i.e., ACM’s take on a Data Science curriculum recommendation). The full draft can be found here. The ACM Data Science task force is explicitly asking for community feedback on this draft by March 31st. I was able to attend their town-hall feedback session at the SIGCSE Technical Symposium, where there was both excitement and some concern about the scope the curriculum recommendations take. This post offers some reflections and thoughts on the draft; however, I strongly encourage anyone involved with Data Science curriculum design or implementation to read it for yourself!
Chapter 1: Introduction and Background Materials
First, I’m really glad to see this being produced. I’ve commented previously on this blog on some of the other curriculum guidelines, emphasizing that the ‘computing’ perspective was often a bit under-represented. I also need to praise the task force for not simply reinventing the wheel! Their first substantial section is a review of the existing, relevant curriculum recommendations related to data science. They’ve done a thorough job (the first I’ve seen publicly posted), with some valuable insights into each. If you haven’t had a chance to read some of my blog posts about the other recommendations (See: Related Curricula, EDISON, Park City), their summary is an excellent starting place. One curriculum they examine that has not been discussed on this blog is the Business Higher Education Framework (BHEF) Data Science and Analytics (DSA) Competency Map (2016). Their discussion of this material can be found on page 7.
Another important thing to catch in their discussion of the task force’s charge and work is that they are only trying to define computing’s contribution to data science. This is in stark contrast to most of the other curriculum guidelines out there relating to data science, which all cover the full breadth of what a data science curriculum might entail. In talking with the chair of the task force, I heard a clear recognition that this is only the first stage in developing a community-recognized, full-fledged curriculum guide.
Chapter 2: The Competency Framework
The task force is taking a slightly different approach to developing the curriculum than ACM took with CS-2013. Instead of focusing exclusively on “Knowledge Areas”, they are developing a competency framework. Given how much the field of data science leans on soft skills in addition to technical skills, this is certainly a reasonable approach. The main concern expressed by the task force chair, which I share, is that the final guide still needs to be highly usable for program development. While the current draft does not achieve the same level of usefulness that CS-2013 does, I have high hopes for their final product. The motivation for this switch is grounded heavily in the current scholarship of teaching and learning, alongside cognitive learning theory. It has long-term potential to help transform educational settings from a passive learning environment to a more active, student-centered paradigm (which I am strongly in favor of!). However, it will require significantly more work to transform the current competencies into something usable for both student-centered design and programmatic design.
If you aren’t aware of the concepts of “Understanding by Design”, learning transfer theory, or how these interact on a ‘practical, operational level’, it would certainly be worth your time to read through this chapter carefully. It may provide you with many new ideas to consider when doing course planning or activity planning in general.
Appendix A: Draft of Competencies for Data Science
To begin with, this appendix is massive. It is 23 pages long, 40% of the entire document. The task force is well aware that this section is too extensive to be truly useful, especially as currently presented. However, they will be forming several sub-committees to work on refining each of the competency areas in the next month or two. The target time-frame for a refined draft is late summer. The next sections of this post reflect on the various competencies as stated.
- Computing Fundamentals
  - Data Structures
  - Software Engineering
This competency and its sub-categories clearly demonstrate the break from CS-2013. Where CS-2013 organized content based on topical areas of computer science, here we see a smattering of ideas from several areas. It pulls several ideas from the area of “Algorithms and Complexity”, with a strong focus on the algorithmic side and the data/programming structures that support algorithm implementations. The beautiful thing is that these fairly clearly express computing’s perspective on absolutely essential tasks that support the best use of statistical and data-science ideas. Probably the most surprising thing for someone not from a CS background would be the inclusion of the ‘Software Engineering’ ideas. However, based on my experiences talking with industry practitioners, this is perhaps the most overlooked area of preparing future data scientists. It becomes especially critical when trying to move their models and techniques into actual production code that produces value for a company.
- Data Management
  - Data Acquisition
  - Data Governance
  - Data Maintenance & Delivery
I have actually merged two knowledge areas as defined by the task force here. They had defined the knowledge areas of “Data Acquisition and Governance” and “Data Management”. As described, these could be merged into one more over-arching idea: how a data scientist actually deals with the “bytes” of data, regardless of the actual content of the data. It also covers ideas such as selecting data sources, storing the data, querying databases, etc. This section obviously comes strongly from the “Information Sciences” or “Information Management” sector of computer science.
Something that might be missing (or might be buried in the IS language) is the idea of careful design of the actual collection of data. That is, does a survey, log, or other acquisition process actually collect information that is usable for the planned data-science task or goal?
- Data Protection and Sharing
Again, I’ve renamed the higher-level category. The task force originally called this group “Data Privacy, Security, and Integrity”. While highly descriptive, since it exactly matched the sub-categories, it seemed slightly redundant to have it as the meta-category as well. This is also an interesting grouping, as the “Privacy” competency clearly covers things that most faculty and practitioners I discuss data science with would agree should be included. However, the “Security” and “Integrity” competencies dive into highly technical areas of encryption and message authentication. They both seem to have been heavily drawn from the realm of Cybersecurity. I expect that most of the existing undergraduate data science programs would find it highly challenging to include more than very superficial coverage of this content. Even graduate programs might not do more than touch upon the idea of mathematical encryption unless the students themselves sought out additional course work (such as a cryptography class).
Even though I’m not sure programs are covering, or even could cover, more of this content, this may be a clear area for program expansion. Perhaps as more courses are developed that exclusively serve data science programs, it will become possible to include more of these ideas.
- Machine Learning
- Data Mining
As could be expected, there are competencies related to actually learning something from data. The task force has (currently) chosen to split some of the ideas into two categories. The Machine Learning knowledge area is massive, and includes most of the details about algorithms, evaluation, processes and more. The Data Mining knowledge area seems to try to provide competencies related to overall usage and actual implementation of machine learning. I’ll let you pick through it yourself, but from my read-through it seems to cover the majority of ideas that would be expected, including recognition of bias and decisions on outcomes.
My feedback – Ditch the separate knowledge areas, and provide some “sub” areas under Machine Learning.
- Big Data
  - Problems of Scale
  - Complexity Theory
  - Sampling and Filtering
  - Concurrency and Parallelism
This is perhaps the area that drove data science into the limelight, and the task force has provided a nice breakdown of sub-areas and related competencies. While it is a “sexy” area to have a course in, in my mind this is a “nice to have”, not a necessary content coverage area. Reading through all the details, it really does deal with “big” issues (appropriately!). However, a great many of the data scientists that we train at the undergraduate level are simply not going to be dealing with these problems. Their day-to-day will be consumed with fundamentals, data governance and maintenance, and maybe, if they are lucky, some machine learning.
- Analysis and Presentation
The task force’s take on this section comes from a more technical standpoint. Specifically, it draws from the area of human-computer interaction, or HCI. In walking the line of defining computing-specific competencies without edging into statistics or graphic design, I think this is an excellent section. I am glad to see its inclusion and thoughtful consideration. Often CS students forget about the importance of thinking carefully about how a human will actually interact with a computer. Instead, they typically focus just on what the computer will output.
- Continuing Professional Development
- Economic Considerations
- Privacy and Confidentiality
- Ethical Issues
- Legal Considerations
- Intellectual Property
- Change Management
- On Automation
While this competency area is framed as a “meta” area with sub-categories, it has nearly as many sub-categories as the entire rest of the framework. While I think most (perhaps even all) of these do belong in a curriculum/competency guide, this felt excessive as presented. This is especially true if we are considering the suggested content for an undergraduate curriculum. While I feel that all students should be aware of the idea of “intellectual property”, getting into the weeds of different regulations, IP ideas, etc., seems pretty excessive for most students. Most likely, I’d simply encourage them to know what falls under these ideas, and then tell them to talk to a lawyer. Similarly, discussing “Change Management” at length seems highly ambitious for most data science students, especially at the undergraduate level. While they might need to be aware that their work will foster change, and that someone should be managing it… it probably shouldn’t be them unless they get explicit training in it! And, given the scope of technical skills to cover in a data-science curriculum, I sincerely doubt there will be space for much of this.
While I’ve tried to provide some quick reflections on the entire draft, you should definitely go read it yourself! Or, keep an eye out for the subsequent drafts and processes. ACM has a history of assembling very interdisciplinary teams to generate consensus curriculum guidelines, so I expect that over the next few years we’ll see a fairly substantial effort to bring more perspectives to the table and generate an inclusive curriculum guide.