From Program Development

Announcement and Reflections on ACM’s Draft Data Science Curriculum

Last week, there was an announcement of the first draft of ACM’s “Computing Competencies for Undergraduate Data Science Curricula” — i.e., ACM’s take on a data science curriculum recommendation. The full draft can be found here. The ACM Data Science task force is explicitly asking for community feedback on this draft by March 31st. I was able to attend their town-hall feedback session at the SIGCSE Technical Symposium, where there was both excitement and some concern about the scope the curriculum recommendations take. This post offers some reflections and thoughts on the draft; however, I strongly encourage anyone involved with data science curriculum design or implementation to read it for yourself!


Chapter 1: Introduction and Background Materials

First, I’m really glad to see this being produced. I’ve commented previously on this blog about some of the other curriculum guidelines, emphasizing that the ‘computing’ perspective was often a bit under-represented. I also need to praise the task force for not simply reinventing the wheel! Their first substantial section is a review of the existing, relevant curriculum recommendations related to data science. They’ve done a thorough job (the first I’ve seen publicly posted), with some valuable insights into each. If you haven’t had a chance to read some of my blog posts about the other recommendations (See: Related Curricula, EDISON, Park City), their summary is an excellent starting place. One curriculum they examine that has not been discussed on this blog is the Business-Higher Education Forum (BHEF) Data Science and Analytics (DSA) Competency Map (2016). Their discussion of this material can be found on page 7.

Another important thing to catch in their discussion of the task force’s charge, and work, is that they are only trying to define the computing contributions to data science. This is in stark contrast to most of the other curriculum guidelines out there relating to data science, which encompass the full breadth of what a data science curriculum might entail. In talking with the chair of the task force, there really is a recognition that this is only the first stage in developing a community-recognized, full-fledged curriculum guide.


Chapter 2: The Competency Framework

The task force is taking a slightly different approach to developing the curriculum than ACM took with CS2013. Instead of focusing exclusively on “Knowledge Areas”, they are developing a competency framework. Given how much the field of data science leans on soft skills, in addition to technical skills, this is certainly a reasonable approach. The main concern expressed by the task force chair, which I share, is that it is still important for the final guide to be highly usable for guiding program development. While the current draft does not achieve the same level of usability that CS2013 does, I have high hopes for their final product. The motivation for this switch is grounded heavily in current scholarship of teaching and learning alongside cognitive learning theory. This has long-term potential to help transform educational settings from a passive learning environment to a more active, student-centered paradigm (which I am strongly in favor of!). However, it will require significantly more work to transform the current competencies into something usable for both student-centered design and programmatic design.

If you aren’t aware of the concepts of “Understanding by Design”, learning transfer theory, or how these interact on a ‘practical, operational level’ it would certainly be worth your time to read through this chapter carefully. It may provide you with many new ideas to consider when doing course planning or activity planning in general.


Appendix A: Draft of Competencies for Data Science

To begin with, this appendix is massive: it is 23 pages long, 40% of the entire document. The task force is well aware that this section is too extensive to be truly useful, especially as currently presented. However, they will be forming several sub-committees to work on refining each of the competency areas in the next month or two. The target time-frame for a refined draft is late summer. The next sections of this post reflect on the various competencies as stated.


BTW: If you are interested in serving on one of these subcommittees, please email the task force co-chairs, Andrea Danyluk and Paul Leidig, ASAP.


  • Computing Fundamentals
    • Programming
    • Data Structures
    • Algorithms
    • Software Engineering

This competency and its sub-categories clearly demonstrate the break from CS2013. Where CS2013 organized content based on topical areas of computer science, here we see a smattering of ideas from several areas. It pulls several ideas from the area of “Algorithms and Complexity”, with a strong focus on the algorithmic side and the data/programming structures that support algorithm implementations. The beautiful thing is that these fairly clearly express computing’s perspective on the essential tasks that support the best use of statistical and data science ideas. Probably the most surprising thing for someone not from a CS background would be the inclusion of the ‘Software Engineering’ ideas. However, based on my experiences talking with industry practitioners, this is perhaps the most overlooked area in preparing future data scientists. It becomes especially critical when trying to move their models and techniques into actual production code that produces value for a company.


  • Data Management
    • Data Acquisition
    • Data Governance
    • Data Maintenance & Delivery

I have merged two knowledge areas defined by the task force here. They had defined the knowledge areas of “Data Acquisition and Governance” and “Data Management”. As described, these could be merged into one over-arching idea: how a data scientist actually deals with the “bytes” of data, regardless of the data’s actual content. It also covers ideas such as selecting data sources, storing the data, querying databases, etc. This section clearly draws from the “Information Sciences” or “Information Management” sector of computer science.

Something that might be missing (or might be buried in the IS language) is the idea of careful design of the actual collection of data. That is, does a survey, log, or other acquisition process actually collect information that is usable for the planned data science task or goal?


  • Data Protection and Sharing
    • Privacy
    • Security
    • Integrity

Again, I’ve re-named the higher-level category. The task-force originally called this group “Data Privacy, Security, and Integrity”. While highly descriptive, as it matched exactly the sub-categories, it seemed slightly redundant to have it as the meta-category as well. This is an interesting grouping also, as the “Privacy” competency clearly covers things that most faculty and practitioners I discuss data science with would agree should be included. However, the “Security” and “Integrity” competencies dive into highly technical areas of encryption and message authentication. They both seem to have been heavily drawn from the realm of Cybersecurity. I expect that most of the existing data science (undergraduate) programs would find it highly challenging to include more than a very superficial coverage of this content. Even graduate programs might not do more than touch upon the idea of mathematical encryption unless the students themselves sought out additional course work (such as a cryptography class).

Even though I’m not sure programs do, or even could, cover more of this content, this may be a clear area for program expansion. Perhaps as more courses are developed that exclusively serve data science programs, it will become possible to include more of these ideas.


  • Machine Learning
  • Data Mining

As could be expected, there are competencies related to actually learning something from data. The task force has (currently) chosen to split these ideas into two categories. The Machine Learning knowledge area is massive, and includes most of the details about algorithms, evaluation, processes and more. The Data Mining knowledge area seems to try to provide competencies related to the overall usage and actual implementation of machine learning. I’ll let you pick through it yourself, but from my read-through it seems to cover the majority of ideas that would be expected, including recognition of bias and decisions on outcomes.

My feedback – Ditch the separate knowledge areas, and provide some “sub” areas under Machine Learning.


  • Big Data
    • Problems of Scale
    • Complexity Theory
    • Sampling and Filtering
    • Concurrency and Parallelism

This is perhaps the area that drove data science into the limelight, and the task force has provided a nice break-down of sub-areas and related competencies. While it is a “sexy” area to have a course in, in my mind this is actually a “nice to have”, not a necessary content coverage area. Reading through all the details, it really does deal with “big” issues (appropriately!). However, many of the data scientists we train at the undergraduate level are simply not going to be dealing with these problems. Their day-to-day will be consumed with fundamentals, data governance and maintenance, and maybe, if they are lucky, some machine learning.


  • Analysis and Presentation

The task force’s take on this section was from a more technical standpoint. Specifically, it draws from the area of human-computer interaction (HCI). In walking the line of defining computing-specific competencies without edging into statistics or graphic design, I think this is an excellent section. I am glad to see its inclusion and thoughtful consideration. CS students often forget about the importance of thinking carefully about how a human will actually interact with a computer; instead they typically focus just on what the computer will output.


  • Professionalism
    • Continuing Professional Development
    • Communication
    • Teamwork
    • Economic Considerations
    • Privacy and Confidentiality
    • Ethical Issues
    • Legal Considerations
    • Intellectual Property
    • Change Management
    • On Automation

While this competency area is framed as a “meta” area with sub-categories, it has nearly as many sub-categories as the entire rest of the framework. While I think most (perhaps even all) of these do belong as part of a curriculum/competency guide, this felt excessive as presented. This is especially true if we are considering the suggested content for an undergraduate curriculum. While I feel that all students should be aware of the idea of “intellectual property”, getting into the weeds of different regulations, IP ideas, etc. seems pretty excessive for most students. Most likely, I’d simply encourage them to know what falls under these ideas, and then tell them to talk to a lawyer. Similarly, discussing “Change Management” at length seems highly ambitious for most data science students, especially at the undergraduate level. While they might need to be aware that their work will foster change, and that someone should be managing it… it probably shouldn’t be them unless they get explicit training in it! And, given the scope of technical skills to cover in a data science curriculum, I sincerely doubt there will be space for much of this.


While I’ve tried to provide some quick reflections on the entire draft, you should definitely go read it yourself! Or keep an eye out for the subsequent drafts and processes. ACM has a history of assembling very interdisciplinary teams to generate consensus curriculum guidelines, so I expect over the next few years we’ll see a fairly substantial effort to bring more perspectives to the table and generate an inclusive curriculum guide.

 

Guest Post: Open Sources and Data Science

Open source solutions are improving how students learn—and how instructors teach. Matt Mullenweg, the founder of WordPress, revealed his opinion on open source to TechCrunch a few years back: “When I first got into technology I didn’t really understand what open source was. Once I started writing software, I realized how important this would be.”

Open source software is almost everywhere and tons of modern-day proprietary applications are built on top of it. Most students reading through the intricacies of data science will already be fairly familiar with open source software because many popular data science tools are open-source.

There’s a popular perception that open-source tools are not as good as their proprietary peers. However, similar to Linux, just because the underlying code is open and free does not necessarily mean that it is of a poorer quality. Truth be told, open source is probably the best in its class when it comes to development and data science.

In this post, we’ll have a straightforward look at how open source contributes to data science. We’ll also cover open-source tools and repositories that are related to data science.

Why is Open Source Good for Data Science?

Why exactly do open source and data science go hand in glove?

Open Source spurs innovation

Perhaps the most significant benefit of open source tools is that they allow developers the freedom to modify and customize their tools. This allows for quick improvements and experimentation, which can, in turn, allow for extensive use of a package and its features.

As is the case with any other development that captures the interest of the technology sector as well as the public, the critical piece lies in the ability to bring the end product to the market as quickly and as thoroughly as possible. Open source tools prove to be a massive benefit to this end.

Faster problem solving

The open-source ecosystem helps you solve your data science problems faster. For instance, you can use tools like Jupyter for rapid prototyping and git for version control. There are other tools that you can add to your toolbelt like Docker to minimize dependency issues and make quick deployments.

Continuous contributions

A prominent example of a company that contributes to open source is Google, and TensorFlow is the best example of Google’s contributions. Google uses TensorFlow for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. By open sourcing tools like TensorFlow, Google gets the benefit of contributions from outside the core team. As TensorFlow has become popular, many new research ideas are implemented in it first, which makes it more efficient and robust.

Google explains this topic in-depth in their open-source documentation. 

While open source work may have benevolent results, it is not an act of charity. Releasing work as open source and the corresponding contribution process eventually result in a higher return on the initial investment than the alternative closed source process.

One of the significant benefits of open source, especially when it comes to data science breakthroughs, is the sheer membership of various open source communities that developers can tap into for problem-solving and debugging.

These forums have hundreds of answers to frequently asked questions, and given that open-source data science tools are poised to expand going forward, these communities and repositories of information will only continue to grow.

Contribute and give it back to the community

The best way to learn data science is to actively participate in the data science communities that you love. With open-source, that’s entirely possible because you can start by just following data science projects and repositories on GitHub. Take part in discussions and when you feel you’re ready, contribute to their code by volunteering to review code and submit patches to open-source security bugs.

This will help you get involved, gain exposure and learn details that might otherwise be impossible to learn from your degree curriculum.

Open Source Data Science Tools You Should Know About

KDnuggets recently published the results of a Data Science and Machine Learning poll they conducted earlier this year. The graph shows the tools with the strongest associations and each tool’s rank based on its popularity.

 

The weight of the bar indicates the association between the tools. The numbers indicate the percentage of association between these tools. As you can see in the figure, TensorFlow and Keras are the most popular combination, with a weight of 149%. Anaconda and scikit-learn are another popular combination of tools.

The number to the left indicates the rank of each tool based on popularity. The color is the value of the lift – green for more Python and red for more R.

We’ll be limiting our discussion to some of the open-source data science and machine learning tools. This list is not based on popularity, but instead usability from a learner’s perspective. Let’s get started.

TensorFlow

TensorFlow is an open source library for Python, built for numerical computation with the goal of making machine learning more accessible and efficient. Google’s TensorFlow eases the process of obtaining data, training models, serving predictions, and refining results.

Developed by the Google Brain team, TensorFlow is a library for large-scale machine learning and deep learning. It gathers together many different machine learning and deep learning algorithms under a common abstraction. TensorFlow uses Python as a convenient front-end API for building applications within the framework, while executing those applications in high-performance C++.

TensorFlow can train and execute deep neural networks and use them for image recognition, handwritten digit classification, recurrent neural networks, word embeddings, sequence models, natural language processing, and partial differential equation (PDE) based simulations. It also supports scalable production prediction using models similar to those used in training.
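To make the “numerical computation” point concrete, here is a minimal sketch (assuming TensorFlow 2.x; the data, learning rate, and step count are made up for illustration) that fits a simple linear model by gradient descent using TensorFlow’s automatic differentiation:

```python
# Illustrative only: fit y = w*x + b with gradient descent in TensorFlow 2.x.
import tensorflow as tf

# Toy data drawn from y = 3x + 2 plus a little noise
x = tf.random.normal([256])
y = 3.0 * x + 2.0 + tf.random.normal([256], stddev=0.1)

w = tf.Variable(0.0)
b = tf.Variable(0.0)

for step in range(200):
    with tf.GradientTape() as tape:
        y_pred = w * x + b                        # model output
        loss = tf.reduce_mean((y - y_pred) ** 2)  # mean squared error
    dw, db = tape.gradient(loss, [w, b])          # automatic differentiation
    w.assign_sub(0.1 * dw)                        # gradient-descent updates
    b.assign_sub(0.1 * db)

print(float(w), float(b))  # should end up close to 3 and 2
```

The same pattern (compute a loss, differentiate it, update the parameters) underlies the much larger networks described above.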

Keras

Keras is a minimalist Python library for deep learning that runs on top of TensorFlow or Theano. Keras was developed to help implement deep learning models quickly and efficiently for research and development.

Keras runs on Python 2.7 and 3.5 and executes on CPUs and GPUs, depending on the underlying framework.

Keras was developed by an engineer at Google and has four guiding principles –

  1. Modularity – A model is understood as a standalone sequence or graph of modules; the fundamental components of a deep learning model can be combined arbitrarily.
  2. Minimalism – The Keras library provides just enough information to help users achieve an outcome.
  3. Extensibility – New components are easy to add and implement within the framework. This is intentional, allowing developers and researchers the freedom to trial and experiment with new ideas.
  4. Python – There is no requirement for additional files with custom file specifications. When working in Keras, everything is native Python.

Keras’ deep learning process can be summarized as below –

  1. Define your Model – Create your sequence and add layers as needed
  2. Compile your Model – identify optimizers and loss functions
  3. Fit your Model – Train the model on your existing data
  4. Make Predictions – Use the trained model to generate predictions on new data
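As a hedged illustration of those four steps (the data, layer sizes, and epoch count below are made up), a minimal Keras workflow might look like this:

```python
# Illustrative sketch of the define / compile / fit / predict workflow.
import numpy as np
from tensorflow import keras

# Toy data: 1000 samples, 20 features, and a binary label
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

# 1. Define the model as a sequence of layers
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

# 2. Compile: choose an optimizer and a loss function
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 3. Fit: train the model on the existing data
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# 4. Predict: apply the trained model to new (here, reused) data
print(model.predict(X[:3]))
```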

H2O

H2O is a scalable, fast and distributed open source machine learning framework that provides many algorithms. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It supports smart applications including deep learning, random forests, gradient boosting, generalized linear modeling, etc.

H2O is a business focused AI tool that allows users to derive insights from data by way of faster and improved predictive modeling. The core code of H2O is written in Java.

H2O handles vast amounts of data, allowing enterprise users to make quick and accurate predictions. Additionally, H2O assists in extracting decision-making information from large amounts of data.
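As a rough sketch of how that looks from Python (the file name and column names here are hypothetical, and the H2O cluster is started locally), fitting one of the algorithm families mentioned above might go like this:

```python
# Hypothetical example: gradient boosting with H2O's Python API.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start or connect to a local H2O cluster (the Java core)

frame = h2o.import_file("customers.csv")        # hypothetical dataset
frame["churned"] = frame["churned"].asfactor()  # treat the target as categorical
train, test = frame.split_frame(ratios=[0.8])

model = H2OGradientBoostingEstimator(ntrees=100)
model.train(x=["age", "income", "tenure"],      # hypothetical predictor columns
            y="churned",                        # hypothetical target column
            training_frame=train)

print(model.model_performance(test_data=test))
```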

Apache Mahout

Apache Mahout is an open source framework that uses the Hadoop platform. It assists with building scalable ML applications and is comparable in purpose to Spark’s MLlib.

The three main features of Mahout are –

  1. A scalable and straightforward programming framework and environment
  2. A wide range of pre-packaged algorithms for Apache Spark + Scala, Apache Flink and H2O
  3. A vector math experimentation environment called Samsara, which has an R-like syntax dedicated to matrix calculation.

Anaconda

Anaconda is a fully open source data science platform that boasts a community of more than 6 million users. It is simple to download and install, and packages are available for macOS, Linux and Windows.

Anaconda comes with 1,000+ data packages in addition to the conda package and virtual environment manager. This eliminates the need to install each library independently.

The R and Python conda packages in the Anaconda Repository are curated and compiled within a secure environment, so users get the benefit of optimized binaries that work efficiently on their system.

scikit-learn

scikit-learn is a tool that enables machine learning in Python. It is efficient and straightforward to use for data mining and data analysis tasks. The package is reusable in many different contexts and is accessible to almost all users.

scikit-learn includes a number of different classification, clustering and regression algorithms, including –

  1. Support vector machines
  2. Random forests
  3. k-means
  4. gradient boosting, and
  5. DBSCAN
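For a sense of how uniform the interface is, here is a small sketch (using scikit-learn’s bundled iris dataset) that trains one of the algorithms listed above, a random forest, and scores it on held-out data:

```python
# Minimal scikit-learn example: fit and evaluate a random forest classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)    # every estimator exposes fit()
preds = clf.predict(X_test)  # ...and predict()
print(accuracy_score(y_test, preds))
```

Swapping in a support vector machine or a k-means model changes only the estimator line, which is a large part of why the library works so well for learners.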

Top 5 Open Source Repositories to Get Started with Data Science

For any data science student, GitHub is a great place to find useful resources for learning data science.

Here are some of the top resources and repositories on GitHub. There are lots of good libraries out there that we haven’t covered in this post. If you’re familiar with data science repositories that you’ve found useful, please share them in the comments.

Awesome Data Science Repo

The Awesome Data Science repository on GitHub is a go-to resource guide when it comes to data science. It has been developed over the years through multiple contributions with linked resources from getting-started guides, to infographics, to suggestions of experts you can follow on various social networking sites.

Here’s what you’ll find in that repo.

Machine Learning and Deep Learning Cheat Sheet

The cheatsheets-ai repository includes common techniques and tools put together in the form of cheatsheets. These range from simple tools like pandas to more complex procedures like deep learning.

Some of the common cheatsheets included here are – pandas, matplotlib, numpy, dplyr, scikit-learn, tidyr, ggplot, Neural Networks, and pySpark.

Oxford Deep Natural Language Processing Course Lectures

With the introduction of Deep Learning, NLP has seen significant progress, thanks to the capabilities of Deep Learning Architectures like Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN).

This repository is based on the Oxford NLP lectures and takes the study of Natural Language Processing to the next level. The lectures cover the terminology and techniques used in advanced material, such as using Recurrent Neural Networks for language modeling, text to speech, speech recognition, etc.

PyTorch

PyTorch is an open source machine learning library for Python, based on Torch, used for applications such as natural language processing. PyTorch has garnered a fair amount of attention from the deep learning community given the ease of Pythonic-style coding, faster prototyping, and dynamic computation.

The PyTorch tutorial repository includes code for deep learning tasks, from the basics of creating a neural network using PyTorch to coding Generative Adversarial Networks (GANs), RNNs, and neural style transfer. Most models are implemented using just 30 lines of code or less.
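In that spirit, here is a hedged, minimal example (random data and made-up layer sizes, purely illustrative) of a small fully connected network trained with PyTorch:

```python
# Illustrative only: a tiny classifier trained on random data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(64, 10)         # 64 samples, 10 features
y = torch.randint(0, 2, (64,))  # random binary labels

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # forward pass and loss
    loss.backward()              # gradients via the dynamic computation graph
    optimizer.step()

print(loss.item())
```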

Resources of the Neural Information Processing Systems (NIPS) 2017 Conference

This repository includes a list of resources and slides from most tutorials, invited talks, and workshops held during the NIPS 2017 conference. For the uninitiated, NIPS is an annual conference held specifically for machine learning and computational neuroscience.

Much of the recent breakthrough research within the data science industry is the result of work presented at these conferences.

Summary

Before starting a data science project, it is good to have a clear understanding of the technical requirements so that you can adapt resources and budgets accordingly. This is one of the main reasons an increasing number of organizations are choosing the flexibility of open source tools. The sheer variety of the open-source ecosystem has helped expand knowledge and bring more new technologies into this field than ever before.


This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Guest Post: Why Teach Machine Learning?

Guest Post by Limor Wainstein–

Why Teach Machine Learning?

Teaching machines to learn about the real world has been a goal in computer science since Alan Turing first showed how to mechanise logic. But it’s only recently that affordable hardware has gained enough speed and capacity to make the idea commercially feasible in many domains – and more than feasible, seemingly inevitable.

Machine learning, alongside its siblings in data analytics and big data, is not only fashionable, it’s where the money and jobs are, thus attracting ambitious, commercially minded students. It’s also an increasingly important tool for all sciences, promoting interest among those aiming at careers in research and academia. Andrew Ng, former chief scientist at Baidu, the giant Chinese search engine company, and adjunct professor at Stanford, has called AI and machine learning ‘the new electricity’ for its potential to apply to and revolutionize all sectors of the economy and society.

That has become apparent in the job market. Towards the end of 2017, the Financial Times noted that three out of four of the top-paying jobs in software were for expertise in “the new profession” of machine learning. Ng says that the two biggest challenges for machine learning are acquiring the vast amounts of data required and finding skilled workers. Of the two, he said, the skill shortage is the biggest problem. Some entire job sectors, such as high frequency trading, are now entirely dependent on machine learning, and financial technology as a whole is moving in that direction rapidly. For example, J. P. Morgan recently issued a 280-page report on data analysis and machine learning in finance, focusing on the skills it needs to hire in large numbers – numbers that don’t exist.

Additional, highly-prominent machine learning domains exist alongside financial technology, for example, autonomous vehicles and medical diagnosis. Overtly AI-dominated companies like Google, Tesla and IBM are adept at garnering publicity. Such high-profile efforts mask the huge number of more mundane machine learning tasks that exist in every industry. Amazon, for example, uses machine learning across its entire retail system (from web interface to warehousing, packaging and delivery). Every company that operates with data at scale in retail has to follow those examples to compete.

Energy companies use machine learning to predict and manage supply and demand. Airlines manage pricing and route loading through machine learning. New medicines are developed using machine learning, and health services marshal their resources in response to short and long-term trends in demand, tracked and predicted by machine learning. Agriculture, ditto. In fact, it’s hard to find any area untouched by machine learning – even theology is in on the trend, with Murdoch University in Perth using machine learning to analyze ancient Thai palm-leaf texts on Buddhist doctrines. The new electricity, indeed.

So, what is machine learning?

Machine learning is a subset of artificial intelligence, but is mercifully free of the philosophical and near-religious arguments of some AI research. Instead, machine learning is simple to define and has well-defined tools, techniques and goals, and an ever-expanding field of practical applications.

Machine learning is the application of algorithms and techniques to data sets in order to find whether certain patterns exist. Whether this includes data acquisition and cleaning before analysis, or decision-making afterwards, depends on how tightly you want to draw the definition. All of these things are important in practical machine learning-based applications but are usually domain specific. However, the core of machine learning isn’t domain specific and can be applied very widely. This has led it to be taught as a self-contained field.

Machine learning is inherently cross-disciplinary, and this is the greatest challenge in teaching the subject. There is a huge and unavoidable mathematical component, involving statistics, predicate calculus, linear algebra, and related concepts. This can come as a shock to computing students who have successfully minimized their exposure to such mathematical ideas until now. Computing skills are equally important, as machine learning involves the efficient manipulation of large and disparate data sets through complex transformations, often in highly parallel environments. With many practical machine learning applications bounded by hardware limitations, a deep understanding of system architecture and its practical consequences is also necessary. These facts will come as an equal shock to students in statistical machine learning courses who have avoided significant programming or hardware experience. A good machine learning practitioner needs to be fluent not only in programming but in systems architecture and data design. In addition, the practitioner needs to understand which of the many mathematical techniques to apply to a particular problem and how to apply them correctly.

In a real-life work environment, a data scientist or data engineer will typically find machine learning techniques useful. She may even require them to excel at her job. For example, she may need to create algorithmic patterns to search for data, use data patterns to make decisions and predictions, or use other techniques, such as smart sorting or fuzzy logic to prepare and manipulate data. These skills are at the heart of modern data science. It is clear, therefore, that a serious data science program should provide solid coverage of machine learning skills and techniques.

How should you teach it?

Picking the exact mix of tools, languages, and technologies for a course is to some extent a secondary issue, and can easily be based on what resources and skills are available to best match your choice of syllabus, project work and structure. Machine learning is a product of the Internet age and as such has a continuing evolution of best practice in its DNA. Checking out – and participating in – online machine learning communities such as Kaggle is one of the best ways to ensure alignment between machine learning teaching and actual student needs.

As with any subject, some students will have the skills, interest or previous experience to easily follow one or both of the two major prongs of machine learning. Most will not. But teachers of machine learning have an advantage over their mathematician or computer science colleagues: they can use each prong to illustrate and contextualise the other. Students who experience a curriculum where each is taught independently often have problems – and this has been unfortunately common. On discussion boards where experienced ML practitioners advise students, disheartening comments abound.

Calvin John, an autonomous vehicle researcher, warned on Quora of his experience with a “…horrible textbook… very little conceptual basis for the theorems… bunch of isolated problems which were crudely connected in a very disjointed way”. Modern machine learning teaching is developing rapidly. Like many new interdisciplinary subjects, machine learning may be taught by different faculties, where each faculty is led by its own approach without relating to the needs of the other disciplines involved.

Andy J. Ko, program chair of informatics at the University of Washington, also discusses the subject of teaching machine learning in his essay “We need to learn how to teach machine learning” (August 21, 2017). He says: “We still know little about what students need to know, how to teach it, and what knowledge teachers need to have to teach it successfully.” He also points out the wide range of student abilities and experience among those interested in machine learning – not only from previous undergraduate courses, but from MOOCs and burgeoning commercial self-teaching online products. He nevertheless advocates the adoption of good pedagogical tools – evolving analogies and practical examples that combine theory and practice. It’s important, he says, to understand which concepts will be particularly difficult, realizing what ideas, good and bad, students bring with them.

It’s in the practical examples where machine learning teachers have the greatest chance to equip students with a good, broad and deep understanding of the field. Machine learning’s expanding applicability offers many choices – machine vision, text mining, and natural language processing are popular examples. The topic chosen should suit the project work across a syllabus. A judicious introduction of new mathematical ideas alongside practical work examples, or practical problems that lead to theoretical insights, can reinforce students’ appreciation of the whole.

Here are some additional resources that discuss teaching machine learning:

A worked ML curriculum bringing together best-of-breed MOOC courses.

Another site that has several courses, including MOOCs and other deep-learning topics is fast.ai.

(They also have an interesting brief post on adding data science to a college curriculum)

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

NASEM Webinar 1: Data Acumen

This webinar aimed to discuss how to build undergraduates’ “data acumen”. If acumen isn’t a word you use regularly (I didn’t before last year), it means “the ability to make good judgments and quick decisions”. Data acumen, therefore, is the ability to make good judgments and quick decisions with data. Certainly a valuable and important skill for students to develop! The webinar’s presenters were Dr. Nicole Lazar, University of Georgia, and Dr. Mladen Vouk, North Carolina State University. Dr. Lazar is a professor of statistics at the University of Georgia. Dr. Vouk is a distinguished professor of computer science and the Associate Vice Chancellor for Research Development and Administration.

Overall, this webinar seemed to be largely a waste of time if your goal was to understand what activities, curricular designs, and practices would help students develop data acumen (see my last paragraph for a suggested alternative). On the other hand, if you’d like a decent description of the design and implementation of a capstone course, and the process of scaling a capstone course, listen to Dr. Lazar’s portion. If you still need an overview of the state of data science, then Dr. Vouk’s portion provided reasonable context. The most valuable things in the entire webinar were slides 26 and 27 (about minute 48). Slide 26 shows an excellent diagram for an “End-to-End Data Science Curriculum” that reasonably well articulates how a student might mature (and thereby gain data acumen); see figure 1 below. Slide 27 provides well-articulated learning objectives for core, intermediate and advanced data science courses (see table below).

From the NASEM Data Acumen Webinar: North Carolina State University’s Curriculum Vision
  • Core
    • Able to master individual core concepts within Bloom’s taxonomy:
      Knowledge, Comprehension, Application, Analysis, Evaluation, and Synthesis
    • Able to adapt previously seen solutions to data science problems for target domain-focused applications utilizing these core concepts
  • Intermediate Electives
    • Able to synthesize multiple concepts to solve, evaluate and validate the proposed data science problem from the end-to-end perspective
    • Able to identify and properly apply the textbook-level techniques suitable for solving each part of the complex data science problem pipeline
  • Advanced Electives
    • Able to formulate new domain-targeted data science problems, justify their business value, and make data-guided actionable decisions
    • Able to research the cutting edge technologies, compare them and create the optimal ones for solving the DS problems at hand
    • Able to lead a small team working on the end-to-end execution of DS projects

 

An Alternative to the NASEM Webinar

While I found this particular webinar largely a waste of time, I also attended the NASEM Roundtable on “Alternative Educational Pathways for Data Science”. While certainly not focused on data acumen, the first presentation given at that round-table described an excellent overall curriculum structure that does build students’ data acumen. Eric Kolaczyk from Boston University described their non-traditional master’s program in Statistical Practice. By integrating their course work, practicum experiences, and more, students are forced to exercise and build their ability to make good judgments about data investigations, methods, and results. The talk is well worth your time if you’d like some ideas for non-standard ways to build student skills and abilities.

Keeping Data Science Broad-Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies about data science degrees and curricula being developed by primarily undergraduate serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce but they can have different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

Python or R?

This week I want to discuss a potentially divisive issue: should a program (or course, etc.) be taught in Python or R? I think a reasonable case could be made for teaching either language. Pragmatically, if you want your program’s graduates to be truly competitive for the largest variety of jobs in the current market, students need to at least be familiar with both (and possibly SAS or SPSS). There is already a lot of information and blog posts addressing this question, and I’ve provided links to a few of my favorites at the end of this post. Rather than re-hashing those posts’ pros and cons, I’m going to focus on aspects of each language related to teaching (and learning).

Before considering each language, I want to frame the discussion by (re)stating a program-level student learning objective (SLO). In my first post about SLOs, objective 2 states: “Students will be able to implement solutions to mathematical and analytical questions in language(s) and tools appropriate for computer-based solutions, and do so with awareness of performance and design considerations“. Based on this objective, I’ll state three specific objectives for selecting a programming language:

  • A language which can implement (complete) solutions to data science questions
  • A language which allows good programming practices in terms of design
  • A language which allows solutions to be implemented with awareness of performance issues, and improved accordingly

Why Choose R?

As a programming language that originated in academia, particularly within the statistics community, R seems like a very natural choice for teaching data science. Much of the syntax, function naming, and even the thinking about how to construct a data pipeline/workflow comes naturally from a statistical analysis perspective. This makes it very easy to convert knowledge of statistical processes into code and analyses within R. The easy conversion between notation and code becomes even more valuable when trying to work with advanced or obscure statistical techniques. With R’s origins in academic statistics, there is a much broader range of packages for uncommon techniques than in most other languages. This makes R a strong candidate for the first requirement when working in statistical domains.

Other software/packages that make R appealing to teach with are RStudio, Jupyter Notebooks and R Markdown. RStudio provides a clean, user-friendly interface for R that makes interacting with plots and data easy. It even aids the transition from spreadsheet software (like Excel) by providing a similar, GUI-driven interaction with (simple) data-frames. With Jupyter Notebooks’ recent addition of an R kernel option, it is also easy to transition from mathematics focused software like Maple and Mathematica. See this DataCamp blog-post for more information on using Jupyter Notebooks (or similar software) with R. Notebooks also facilitate teaching good practices such as code-blocks and code annotation. Finally, R Markdown provides a (reasonably) simple way to convert executable code directly into final reports/outputs. That functionality further supports the teaching of (some) good programming and design practices.

Why Choose Python?

Python was originally developed to be an easy-to-learn programming language (see Wikipedia’s history of Python). This means the language’s syntax and style are easier to learn from scratch than most other languages (notably R). Python’s basic list data structure naturally resembles a mathematical set, while dictionaries closely match logical constructions for unstructured data. Together with the use of indentation to indicate control flow, it is natural, in any introduction to the language, to show how to make Python code (human-)readable, as in the small sketch below. These traits speak directly to teaching/achieving our second language-related objective, “allows good programming practices/design”.
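A tiny sketch of those features (the data here are made up): a list comprehension that reads like set-builder notation, a dictionary holding a loosely structured record, and indentation alone marking control flow.

```python
# {x^2 : x in 0..9, x even} written as a list comprehension
even_squares = [x ** 2 for x in range(10) if x % 2 == 0]

# A dictionary as a lightweight record for semi-structured data
record = {"user": "alice", "clicks": 42, "tags": ["ml", "python"]}

# Indentation alone delimits the loop and conditional bodies
for key, value in record.items():
    if isinstance(value, list):
        print(key, "has", len(value), "entries")
    else:
        print(key, "=", value)
```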

For teaching, Python starts with many of the same advantages as R. There is a long-standing Python kernel for Jupyter Notebooks, and several markdown packages are available for turning code directly into HTML-styled reports. What makes Python noticeably different from R is that it is a general-purpose programming language. In terms of teaching, this opens up some interesting options related to the first and third goals above. In terms of developing solutions to data science problems, Python easily allows a very broad range of both input and output. Specifically, it has high-quality packages designed to deal with streaming data and better techniques for unstructured or big data. Also, because Python is regularly used to develop full programs and deployed software solutions, the methods available for studying and improving performance are already well developed.

 

But What are People Actually Using?

There are way, way more Python users than R users (and probably will be for the foreseeable future) simply because Python is a general-purpose programming language. However, we are more concerned with users within the data science communities. That focus, however, doesn’t make the answer to our question any clearer. 2016 data from O’Reilly’s Data Science Salary Survey places R (57%) slightly ahead of Python (54%), which matches KDnuggets’ rankings of R being slightly ahead in 2016. However, the 2017 KDnuggets survey results now place Python slightly ahead. Burtch Works’ 2017 survey data, however, still has R significantly ahead, and in fact still gives a very large market share to SAS, which didn’t even make KDnuggets’ list. But Burtch also notes that Python has been gaining share each year. Remember, when considering these results, that these are all self-reported and self-selecting surveys! It is hard to tell whether these changes are actual changes in use, or just a changing definition/reach of who’s responding to the surveys. For example, when Burtch Works breaks down their results, at least one sub-group rarely used SAS and, similar to O’Reilly and KDnuggets, had Python ahead. More and more people identify with doing data science each year, but many of them have been doing similar things for a long time.

Some Undisguised Opinions

There is obviously value in either programming language, but from my perspective there is a really strong winner: Python. From a curriculum/planning perspective, since Python is a general-purpose language, it is entirely feasible to have standard, introductory programming courses from a computer science department taught in Python. This reduces (potentially wasteful) duplication of similar courses (does every discipline really need its own intro programming?). It also lets computer scientists take advantage of years of educational research into how to better teach programming! Not to mention that Python was intentionally designed to be easier to learn programming in.

Add to this that data science students don’t really experience any major disadvantages from having Python as the primary curricular language but do gain several benefits. Key benefits include longer-term skill viability and increased versatility in job options, etc. This versatility even plays out when considering including advanced CS courses in a data science curriculum. Most data science curriculums are already going to struggle to incorporate all the necessary foundational skills in a reasonable length undergraduate (or graduate) program. So why add programming courses beyond those already needed to meet typical CS prerequisites?

Finally, looking at the trends in language/tool use in data science just adds more validation to this idea. As companies move to working with unstructured or streaming data, Python becomes even more natural. All the surveys report increasing use of Python, without any signs of slowing down that increase. It is important for academic programs to not just react, but even anticipate trends and needs in the job market and industry.

Additional Resources

While I didn’t go into lots of detail on the pros and cons of R or Python (and didn’t even talk about SAS/SPSS), I have collected a few links that you might find valuable in making your own decision.

R vs. Python for Data Science: Summary of Modern Advances — EliteDataScience Dec 2016 — Does a nice job of highlighting the new things that make the languages pretty equal.

 

Python & R vs. SPSS & SAS — The Analytics Lab  – 2017 — This is nice because it also puts into perspective how SPSS and SAS play into the landscape as well as provides additional historic perspectives

Python vs. R: The battle for data scientist mind share — InfoWorld, 2017 — a fairly balanced perspective on the value of both

R vs. Python for Data Science — KDNuggets 2015 — A bit dated, but still provides some good comparisons.

(Other) Official Curriculum Guides

Last week I discussed several places from which you could pull curriculum planning materials. This week will continue that theme, but with a bit more of an ‘official’ flavor, by discussing several professional societies’ curricular guides. While there is no (clear) leading data science professional society (and none with curricular guidelines to my knowledge), there are a few closely related societies with official guidelines. Depending on what path you took into data science, you may be more or less familiar with the following societies: the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the Mathematical Association of America (MAA), and the American Statistical Association (ASA). There are several other societies relevant to data science, but not as vital in terms of official curricular guidelines (SIAM, INFORMS, AMS, ASEE). All four major societies (ACM, IEEE, MAA, and ASA) have released curricular guidelines relevant to data science. This post will give a very high-level overview of those guidelines and why you might care about what’s in them.

ACM and IEEE jointly released Curriculum Guidelines for Undergraduate Programs in Computer Science in 2013 (CS2013). The most valuable component of CS2013 for me is the specification of ‘Knowledge Areas’ that are obviously related to Data Science, and being able to see the professional community’s consensus on central learning objectives in these areas. Some clearly important/relevant areas are:

  • Computational Science
  • Discrete Structures
  • Graphics and Visualization
  • Information Management
  • Parallel and Distributed Computing

Other areas such as Algorithms and Complexity, Information Assurance and Security, or Programming Languages probably include specific learning objectives that are relevant to data science, but may not be needed in their entirety. Additionally, CS2013 allows you to examine the suggested course hours expected to be devoted to these topics. From an industry perspective, this can provide valuable insight into whether a data scientist or computer scientist might be more knowledgeable about a particular subject. This differentiation in knowledge is important as data science strives to define itself independently of its founding disciplines. If you are interested in throwing your net a bit wider, ACM also has guides for other programs like Computer Engineering and Information Technology (coming in 2017) on their guidelines site.

The MAA’s 2015 Committee on the Undergraduate Program in Mathematics (CUPM) Curriculum Guide to Majors in the Mathematical Sciences — the CUPM Guide for short — can serve in largely the same way the CS2013 guide does, but from a mathematical/statistical perspective. With more detailed reports on Applied Mathematics, Computational Science, Operations Research, and other areas of mathematics that data science often operates in, the CUPM Guide makes it possible to understand what exactly (from a mathematician’s or computational mathematician’s perspective) are the most relevant areas of mathematics to understand for success. This guide can also help clarify exactly what sorts of mathematics courses a data science curriculum should require, by explaining where in the course structure specific topics like sets, relations, and functions, or other ideas, get covered. In addition to their extensive undergraduate guide, the MAA also provides a lot of interesting materials related to masters/Ph.D. preparation, etc. These might be particularly interesting as you consider what sorts of students to recruit or include in a master’s program.

Finally, the ASA has perhaps the most relevant and diverse, but in many ways least detailed, set of curriculum guides. The set of undergraduate guidelines and reports includes how to assess instruction, program guidelines for the statistical sciences, and even the Park 2016 Data Science guidelines (which I have commented on in other posts). They also have two sets of graduate guidelines, from 2009 and 2012, for statistics masters/Ph.D. programs. What the ASA guidelines provide are much bigger, sweeping statements about the sorts of skills and knowledge that a statistics major should have. They include side notes that give more details, such as encouraged programming languages and even file formats. In many ways, I think the majority of the ASA guidelines could just replace “Statistics Major” with “Data Science Major” and remain nearly as applicable. The biggest difference might be in the level/depth required in “Statistical Methods and Theory” (less) and “Data Manipulation and Computation” (more). In a sense, this is at the heart of many statisticians’ argument that “Data Science” isn’t really its own field. In practice though, I think the final implementation and mindset behind a statistics major and a data science major will be very different, and certainly heavily influenced by the ‘host’ department.

That covers the breadth of the major professional societies’ curricular recommendations. I wasn’t able to find any (official) guidelines for a “business analytics” major from a professional society (see my resource page for a few unofficial documents), so if you know of one, please let me know.

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive posting of sites that have curriculum available; instead, it’s intended to help reduce your search time for this kind of material.

If you are looking to find wholesale curriculums, including course materials, there are a few options available to start the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant funded with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc. To understand their vision, see the image below.

This idea of continual improvement was more than just a grant-seeking ploy, as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected in July 2017.

Overall, their curriculum structure (I haven’t evaluated their deeper content) has a fairly high emphasis on computational topics, with less statistical/mathematical underpinning. You can experience their curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable, though, are their actual grant deliverables. These deliverables include details on the overall design principles in their structure with learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content, and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major, important topic; however, this should be done with proper attribution and license checking (of course).

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found his own data science consulting firm and other initiatives. While not every link on the site is actually to a free resource (there are several books to buy, etc.), it does a pretty nice job of highlighting the topics that will need to be covered (if possible), and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is Python-focused, however he also has a collection of R resources.

Corthell isn’t the only one with an “open source” or “free” data science (master’s) degree. Another collection of relatively similar material was assembled by David Venturi, who is now a content developer at Udacity (writing data science curriculum, of course). For those designing curricula, both Corthell and Venturi provide excellent resources and places to frame your learning. However, if you landed on this page trying to get into data science, read this Quora post, which I think accurately highlights the challenges of learning from/with these open-source programs.

Another similar alternative, which I’d peg closer to an undergraduate degree, is the Open Source Society University’s data science curriculum. Their curriculum assumes much less prior knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in line with the curriculum recommendations from the Park City paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required for each course. You’ll see a large repetition of topics, but their curriculum has much less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer its own versions of a significant subset of the classes in these curricula. While emulation might be the highest form of praise, we’ll each need our own unique take on these courses while striving for enough similarity to give practitioners a semi-standardized knowledge base. Good luck!

Math (Courses) for Data Science

I want to share some thoughts on the math required for a data scientist (or at least, for a data science undergraduate degree). The discussion really boils down to one question: “Discrete Mathematics or Calculus 2?” Let’s first take a look at the outcomes from one in-progress and two completed working groups on outlining data science education.

An ACM-organized workshop in 2015 included participants from ACM, ASA, IEEE-CS, AMS, and more. That workshop’s report does not explicitly state any math requirements, but it does make clear the need for sufficient supporting statistics courses. The clearest recommendations come from a group of faculty at the Park City Mathematics Institute in the summer of 2016. Their report gives suggestions on how to build a data science degree from existing courses, along with ideas for new integrated courses (this is the real gold in the report). If constructing a curriculum from existing courses, the group recommends three mathematics courses: Calculus 1, Calculus 2, and Linear Algebra. Last, a series of round-table discussions on Data Science Post-Secondary Education is currently underway at the National Academies of Sciences, Engineering, and Medicine. While all three NAS round tables are interesting, only the first is relevant to this discussion. At that meeting, there was a presentation on the underlying mathematics of data science; its list of mathematics supporting data science included linear algebra, numerical analysis, and graph theory.

In summary, all three groups clearly support the need for linear algebra in any data science curriculum. I doubt you’ll find many objections to this, since linear algebra forms the mathematical foundation for manipulating data laid out in tables or arrays as rows/columns. If nothing else, simply learning the notation is vitally important for anyone wanting to extend algorithms for data science. All three also clearly support at least two traditional statistics courses, up through regression analysis. A little less clearly, I would argue that all three support requiring a Calc 1 course: the NAS round table discussed needing numerical analysis, which is traditionally built on calculus concepts, and the ACM workshop supported disciplinary knowledge, where just about every science discipline requires at least one semester of calculus.
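To make the rows/columns point concrete, here is a minimal NumPy sketch (my own illustration, not taken from any of the guidelines): a data table is just a matrix, and common operations such as standardizing features or computing a linear model’s predictions are ordinary linear-algebra operations, written in exactly the notation a linear algebra course teaches. The weight vector below is purely hypothetical.

```python
import numpy as np

# A toy data table: 4 observations (rows) by 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
])

# Standardizing each column is element-wise matrix/vector arithmetic.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# A linear model's predictions are a single matrix-vector product: y_hat = X w.
w = np.array([0.2, -0.1, 0.5])   # hypothetical weight vector
y_hat = X_std @ w

print(y_hat.shape)   # (4,) -- one prediction per row/observation
```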

Alright, on to the differences. The PCMI group included Calculus 2 in their “minimum” set of courses needed for data science. In my opinion, the suggestion that Calc 2 belongs in the bare-minimum courses for data science is indicative of the dominance of mathematicians (many applied) and statisticians in the group (there were a FEW computer scientists). While I think their recommendations are quite good overall, the inclusion of Calc 2 over discrete mathematics (as well as the odd placement of data mining) clearly reflects this make-up. The presentation on mathematics (from two mathematicians) at the first NAS round table, however, included graph theory as one of the three main supporting mathematical areas. So perhaps the question from these two groups is: “Calculus 2 or Discrete Mathematics?”

Here’s an alternative way to build an answer to this question. Instead of focusing only on the topics covered, what about the requirements of the other supporting disciplines that make up data science? Computer science is pretty easy: almost all programs require Calculus 1 and discrete mathematics, and the ACM 2013 guidelines include a list of core topics (set theory, graph theory, and logic) that are traditionally covered in either a discrete mathematics course or a combination of several mathematics courses. They also articulate very clearly that some areas of computer science (like visualization or data science) will require linear algebra and statistics. We can contrast this with the typical mathematics requirements for a statistics curriculum. Many statistics programs require a minimum of Calc 2 to support advanced probability courses (with a preference for multivariable calculus). The ASA 2014 guidelines specify that statistics majors should have both differentiation and integration (typically covered by Calc 1 and 2), plus linear algebra.

Looking to the supporting disciplines can leave us just as confused about what to require. I think there is an answer, but it requires taking off the mathematician glasses and thinking about jobs, applications, and where a student might be headed. First, a good portion of researchers and practitioners doing data science use graphs and networks, often mining those graphs for information. It turns out graphs (the node/edge type, not the line/bar-plot type) are also a great way to visualize a lot of information. Another key skill in data science is the ability to partition data: to think of data as either meeting or not meeting specific criteria. This is encompassed by set theory in mathematics, and is sometimes partially covered as part of logic. Together these topics provide two new ways of thinking about data that aren’t included in other mathematics courses. The need for this sort of knowledge, along with a basic introduction to proofs, is why discrete mathematics courses came into existence: they let CS majors get these topics without taking another three or four mathematics courses. To me, this is a far stronger case for including discrete mathematics than the (possible) need for Calculus 2 in advanced statistics courses. If you are requiring four math courses, by all means include Calculus 2 next. Or, if a student is particularly interested in understanding the theoretical underpinnings of data science (by taking more statistics courses), then they should take Calc 2. But if we are really thinking about an undergraduate degree as a stand-alone, workforce-ready credential, Calc 2 does not seem to add much direct value to the student’s degree.
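As a concrete (and entirely hypothetical) illustration of that set-theoretic way of thinking, here is a short Python sketch: each criterion defines a set of record ids, and the partitions we care about fall straight out of ordinary set operations, exactly the material a discrete mathematics course covers.

```python
# Toy records, invented for illustration only.
records = [
    {"id": 1, "age": 34, "clicked": True},
    {"id": 2, "age": 51, "clicked": False},
    {"id": 3, "age": 28, "clicked": True},
    {"id": 4, "age": 62, "clicked": True},
]

# Each criterion defines a set of record ids.
over_40 = {r["id"] for r in records if r["age"] > 40}
clicked = {r["id"] for r in records if r["clicked"]}
all_ids = {r["id"] for r in records}

# Set operations express the partitions directly.
both       = over_40 & clicked          # met both criteria
only_click = clicked - over_40          # clicked, but 40 or younger
neither    = all_ids - (over_40 | clicked)

print(both, only_click, neither)        # {4} {1, 3} set()
```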

Student Learning Objectives – Part 3

This post is part of a series on student learning objectives (SLO’s), both for curricula and for courses. The SLO’s in this post are course-level, specifically for an “Introduction to Data Science” (Data 151) class for new students. Love them or hate them, student learning objectives are a part of higher education (I, for one, appreciate how they provide focus for curricula and courses).

In many ways, the general course SLO’s for Data 151 mirror the SLO’s for the program as a whole. Students need to leave with an understanding of what data science is, know the basic algorithms, and be aware of the ethical and moral issues surrounding the use of data. Data 151 is intended to be a hook that draws in students from across our university to learn about data and then consider adding a major in Data Science. It also draws in juniors and seniors from less technical disciplines like business, which may make Data 151 the only course where those students explicitly think about data. The major difference between the curricular and course SLO’s is the depth of understanding I expect students to leave with (for the course as opposed to the program). This is most clear in the first two SLO’s below.

  1. Students understand the fundamental concepts of data science and knowledge discovery
  2. Students can apply and perform the basic algorithmic and computational tasks for data science

As noted, these are very close to the first two SLO’s for the whole curriculum, relating both to students’ ability to communicate data science concepts and to their ability to implement solutions, though in both cases with a lower expected level of expertise. Data 151 has two additional SLO’s that target the broader (potential) audience for the course (in addition to continuing majors). These are:

  3. Students develop and improve analytical thinking for problem formulation and solution validation, especially using technology
  4. Students prepare for success in a world overflowing with data.

In many cases, students in Intro to Data Science are still gaining experience (aren’t we all?) with general problem-solving skills. Perhaps (to my mind) one of the most under-taught skills in STEM courses is how to actually formulate and structure the process of solving a problem. In many, many cases, a significant amount of time can be saved in execution by carefully planning how you are going to explore or solve a problem. Data science even has this explicitly built into several points in a typical workflow, specifically exploratory data analysis and planning for solution validation.
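To show what I mean by building validation planning into the workflow, here is a minimal sketch (the file name, columns, and model are placeholders I made up, not course material): the hold-out split is decided before any exploration, the exploratory analysis touches only the training data, and the held-out set is used exactly once at the end.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical data set; the file name and column names are placeholders.
df = pd.read_csv("students.csv")
X, y = df.drop(columns="outcome"), df["outcome"]

# Plan the validation strategy *before* exploring:
# hold out a test set now so exploration can't leak into the evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Exploratory data analysis happens only on the training portion.
print(X_train.describe())

# Fit, then evaluate exactly once on the held-out set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```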

Meanwhile, the final objective is meant to be a catch-all. The field of data science is changing incredibly rapidly, as are the ways data is generated and used, and I wanted Data 151 to be capable of covering current, bleeding-edge topics. This SLO also nicely encompasses my plans to bring in alumni and current practitioners as speakers to give students insight into what future jobs might look like. Bringing in these speakers also gives students an industry perspective on workflows and processes, something that can be very different from academia’s problem-solving process.

These SLO’s are pretty high-level, but intentionally so. At Valpo, we’ve got both “course objectives” and topical objectives. My next post will look at the specific topical objectives for Data 151, which deal with the nitty-gritty of what will actually get covered.