By Karl Schmitt

Announcement and Reflections on ACM’s Draft Data Science Curriculum

Last week brought the announcement of the first draft of ACM’s “Computing Competencies for Undergraduate Data Science Curricula,” i.e., ACM’s take on a data science curriculum recommendation. The full draft can be found here. The ACM Data Science Task Force is explicitly asking for community feedback on this draft by March 31st. I was able to attend their town-hall feedback session at the SIGCSE Technical Symposium, where there was both excitement and some concern about the scope the curriculum recommendations take. This post offers some reflections and thoughts on the draft; however, I strongly encourage anyone involved with data science curriculum design or implementation to read it for yourself!

Chapter 1: Introduction and Background Materials

First, I’m really glad to see this being produced. I’ve commented previously on this blog about some of the other curriculum guidelines, emphasizing that the ‘computing’ perspective was often a bit under-represented. I also need to praise the task force for not simply reinventing the wheel! Their first substantial section is a review of the existing curriculum recommendations relevant to data science. They’ve done a thorough job (the first I’ve seen publicly posted), with some valuable insights into each. If you haven’t had a chance to read some of my blog posts about the other recommendations (see: Related Curricula, EDISON, Park City), their summary is an excellent starting place. One curriculum they examine that has not been discussed on this blog is the Business Higher Education Framework (BHEF) Data Science and Analytics (DSA) Competency Map (2016). Their discussion of this material can be found on page 7.

Another important thing to catch in their discussion of the task force’s charge and work is that they are only trying to define the computing contribution to data science. This is in stark contrast to most of the other curriculum guidelines relating to data science, which include the full breadth of what a data science curriculum might entail. In talking with the chair of the task force, there really is a recognition that this is only the first stage in developing a community-recognized, full-fledged curriculum guide.

Chapter 2: The Competency Framework

The task force is taking a slightly different approach to developing the curriculum than ACM took with CS2013. Instead of focusing exclusively on “Knowledge Areas,” they are developing a competency framework. Given how much the field of data science leans on soft skills in addition to technical skills, this is certainly a reasonable approach. The main concern expressed by the task force chair, which I share, is that the final guide must still be highly usable for program development. While the current draft does not achieve the same level of usefulness as CS2013, I have high hopes for the final product. The motivation for this switch is grounded heavily in current scholarship of teaching and learning alongside cognitive learning theory. It has long-term potential to help transform educational settings from a passive learning environment to a more active, student-centered paradigm (which I am strongly in favor of!). However, it will require significantly more work to transform the current competencies into something usable for both student-centered design and programmatic design.

If you aren’t aware of the concepts of “Understanding by Design”, learning transfer theory, or how these interact on a ‘practical, operational level’ it would certainly be worth your time to read through this chapter carefully. It may provide you with many new ideas to consider when doing course planning or activity planning in general.

Appendix A: Draft of Competencies for Data Science

To begin with, this appendix is massive: 23 pages long, 40% of the entire document. The task force is well aware that this section is too extensive to be truly useful, especially as currently presented. However, they will be forming several sub-committees to refine each of the competency areas in the next month or two. The target time-frame for a refined draft is late summer. The next sections of this post reflect on the various competencies as stated.

BTW: If you are interested in serving on one of these subcommittees, please email the task force co-chairs, Andrea Danyluk and Paul Leidig, ASAP.

  • Computing Fundamentals
    • Programming
    • Data Structures
    • Algorithms
    • Software Engineering

This competency and its sub-categories clearly demonstrate the break from CS2013. Where CS2013 organized content based on topical areas of computer science, here we see a smattering of ideas from several areas. It pulls several ideas from “Algorithms and Complexity,” with a strong focus on the algorithmic side and the data/programming structures that support algorithm implementations. The beautiful thing is that these fairly clearly express computing’s perspective on absolutely essential tasks that support the best use of statistical and data science ideas. Probably the most surprising thing for someone not from a CS background is the inclusion of the ‘Software Engineering’ ideas. However, based on my experiences talking with industry practitioners, this is perhaps the most overlooked area in preparing future data scientists. It becomes especially critical when trying to move models and techniques into actual production code that produces value for a company.

  • Data Management
    • Data Acquisition
    • Data Governance
    • Data Maintenance & Delivery

I have actually merged two knowledge areas as defined by the task force here. They had defined the knowledge areas “Data Acquisition and Governance” and “Data Management.” As described, these could be merged into one over-arching idea: how a data scientist actually deals with the “bytes” of data, regardless of the data’s actual content. This section also covers ideas such as selecting data sources, storing the data, and querying the databases. It obviously draws strongly from the “Information Sciences” or “Information Management” sector of computer science.
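The store-and-query cycle described above can be made concrete with Python’s built-in sqlite3 module. This is a hypothetical toy example of my own (made-up sensors and readings), not material from the draft itself:

```python
import sqlite3

# Hypothetical toy table of sensor readings, illustrating the
# store-then-query cycle (in-memory database, made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 10.0)],
)

# Query the stored data: average reading per sensor.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 10.0)]
```

Real coursework would of course swap the in-memory database for a persistent store and messier data, but the select-store-query loop is the same.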

Something that might be missing (or might be buried in the IS language) is the idea of carefully designing the actual collection of data. That is, does a survey, log, or other acquisition process actually collect information that is usable for the planned data science task or goal?

  • Data Protection and Sharing
    • Privacy
    • Security
    • Integrity

Again, I’ve re-named the higher-level category. The task force originally called this group “Data Privacy, Security, and Integrity”. While highly descriptive, matching exactly the sub-categories, it seemed slightly redundant as the meta-category name as well. This is an interesting grouping too: the “Privacy” competency clearly covers things that most faculty and practitioners I discuss data science with would agree should be included. However, the “Security” and “Integrity” competencies dive into highly technical areas of encryption and message authentication, both heavily drawn from the realm of cybersecurity. I expect that most existing undergraduate data science programs would find it highly challenging to include more than very superficial coverage of this content. Even graduate programs might not do more than touch on mathematical encryption unless the students themselves sought out additional coursework (such as a cryptography class).
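To give a flavor of the “message authentication” content the Integrity competency points at, here is a small sketch using Python’s standard hmac module. The key and message are invented for the example:

```python
import hashlib
import hmac

# Invented shared secret and message, purely for illustration.
secret = b"shared-secret"
message = b"patient_id,measurement\n17,98.6"

# Sender computes a tag over the message with the shared key.
tag = hmac.new(secret, message, hashlib.sha256).hexdigest()

# Receiver recomputes the tag and compares in constant time;
# any change to the message produces a mismatched tag.
ok = hmac.compare_digest(
    tag, hmac.new(secret, message, hashlib.sha256).hexdigest()
)
tampered_ok = hmac.compare_digest(
    tag, hmac.new(secret, message + b"!", hashlib.sha256).hexdigest()
)
print(ok, tampered_ok)  # True False
```

Using a library like this is about as deep as most undergraduate programs could realistically go; the mathematics behind it is where the competencies get ambitious.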

Even though I’m not sure programs do, or even could, cover more of this content, this may be a clear area for program expansion. Perhaps as more courses are developed that exclusively serve data science programs, it will become possible to include more of these ideas.

  • Machine Learning
  • Data Mining

As could be expected, there are competencies related to actually learning something from data. The task force has (currently) chosen to split these ideas into two categories. The Machine Learning knowledge area is massive, and includes most of the details about algorithms, evaluation, processes, and more. The Data Mining knowledge area seems to provide competencies related to the overall usage and actual implementation of machine learning. I’ll let you pick through it yourself, but from my read-through it seems to cover the majority of ideas that would be expected, including recognition of bias and decisions on outcomes.

My feedback – Ditch the separate knowledge areas, and provide some “sub” areas under Machine Learning.

  • Big Data
    • Problems of Scale
    • Complexity Theory
    • Sampling and Filtering
    • Concurrency and Parallelism

This is perhaps the area that drove data science into the limelight, and the task force has provided a nice break-down of sub-areas and related competencies. While a “sexy” area to have a course in, in my mind this is actually a “nice to have,” not a necessary content coverage area. Reading through all the details, it really does deal with “big” issues (appropriately!). However, many of the data scientists we train at the undergraduate level are simply not going to be dealing with these problems. Their day-to-day will be consumed with fundamentals, data governance and maintenance, and maybe, if they are lucky, some machine learning.
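For readers curious what the “Sampling and Filtering” sub-area involves in practice, a classic instance is reservoir sampling: drawing a fixed-size uniform sample from a stream too large to hold in memory. A minimal plain-Python sketch (the function and its API are mine, not the task force’s):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)   # keep later items with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Sample 5 items from a million-element "stream" without storing it all.
print(reservoir_sample(range(1_000_000), 5, seed=42))
```

The point is the shape of the problem: one pass, constant memory, no random access, which is exactly what separates “big data” thinking from ordinary sampling.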

  • Analysis and Presentation

The task force’s take on this section was from a more technical standpoint. Specifically, it draws from the area of human-computer interaction (HCI). In walking the line of defining computing-specific competencies without edging into statistics or graphic design, I think this is an excellent section. I am glad to see its inclusion and thoughtful consideration. CS students often forget the importance of thinking carefully about how a human will actually interact with a computer, focusing instead on just what the computer will output.

  • Professionalism
    • Continuing Professional Development
    • Communication
    • Teamwork
    • Economic Considerations
    • Privacy and Confidentiality
    • Ethical Issues
    • Legal Considerations
    • Intellectual Property
    • Change Management
    • On Automation

While this competency area is framed as a “meta” area with sub-categories, it has nearly as many sub-categories as the entire rest of the framework. While I think most (perhaps even all) of these belong in a curriculum/competency guide, this felt excessive as presented, especially as suggested content for an undergraduate curriculum. While I feel all students should be aware of the idea of “intellectual property,” getting into the weeds of different regulations, IP ideas, etc. seems pretty excessive for most students. Most likely, I’d simply encourage them to know what falls under these ideas, and then tell them to talk to a lawyer. Similarly, discussing “Change Management” at length seems highly ambitious for most data science students, especially at the undergraduate level. While they might need to be aware that their work will foster change, and that someone should be managing it… it probably shouldn’t be them unless they get explicit training in it! And, given the scope of technical skills to cover in a data science curriculum, I sincerely doubt there will be space for much of this.

While I’ve tried to provide some quick reflections on the entire draft, you should definitely go read it yourself! Or keep an eye out for the subsequent drafts and processes. ACM has a history of assembling very interdisciplinary teams to generate consensus curriculum guidelines, so I expect that over the next few years we’ll see a fairly substantial effort to bring more perspectives to the table and generate an inclusive curriculum guide.


Guest Post: Open Sources and Data Science

Open source solutions are improving how students learn, and how instructors teach. Matt Mullenweg, the founder of WordPress, revealed his opinion on open source to TechCrunch a few years back: “When I first got into technology I didn’t really understand what open source was. Once I started writing software, I realized how important this would be.”

Open source software is almost everywhere and tons of modern-day proprietary applications are built on top of it. Most students reading through the intricacies of data science will already be fairly familiar with open source software because many popular data science tools are open-source.

There’s a popular perception that open-source tools are not as good as their proprietary peers. However, as Linux demonstrates, just because the underlying code is open and free does not necessarily mean that it is of poorer quality. Truth be told, open source is probably the best in its class when it comes to development and data science.

In this post, we’ll have a straightforward look at how open source contributes to data science. We’ll also cover open-source tools and repositories that are related to data science.

Why is Open Source Good for Data Science?

Why exactly do open source and data science go hand in glove?

Open Source spurs innovation

Perhaps the most significant benefit of open source tools is that they give developers the freedom to modify and customize their tools. This allows for quick improvements and experimentation, which can, in turn, allow for extensive use of a package and its features.

As is the case with any other development that captures the interest of the technology sector as well as the public, the critical piece lies in the ability to bring the end product to the market as quickly and as thoroughly as possible. Open source tools prove to be a massive benefit to this end.

Faster problem solving

The open-source ecosystem helps you solve your data science problems faster. For instance, you can use tools like Jupyter for rapid prototyping and git for version control. There are other tools that you can add to your toolbelt like Docker to minimize dependency issues and make quick deployments.

Continuous contributions

Google is a prime example of a company that contributes to open source, and TensorFlow is the best example of its contributions. Google uses TensorFlow for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. By open-sourcing tools like TensorFlow, Google gets the benefit of contributors outside the core team. As TF grows popular, many new research ideas are implemented in TF first, which makes it more efficient and robust.

Google explains this topic in-depth in their open-source documentation. 

While open source work may have benevolent results, it is not an act of charity. Releasing work as open source, and the corresponding contribution process, eventually results in a higher return on the initial investment than the alternative closed-source process.

One of the significant benefits of open source, especially when it comes to data science breakthroughs, is the sheer membership of various open source communities that developers can tap into for problem-solving and debugging.

These forums have hundreds of answers to frequently asked questions, and given that open-source data science tools are poised to expand going forward, these communities and repositories of information are only poised to grow.

Contribute and give back to the community

The best way to learn data science is to actively participate in the data science communities that you love. With open source, that’s entirely possible: you can start by just following data science projects and repositories on GitHub. Take part in discussions, and when you feel you’re ready, contribute by volunteering to review code and submitting patches for open-source security bugs.

This will help you get involved, gain exposure and learn details that might otherwise be impossible to learn from your degree curriculum.

Open Source Data Science Tools You Should Know About

KDnuggets recently published the results of a Data Science and Machine Learning poll conducted earlier this year. The graph shows the tools with the strongest association, and each tool’s rank based on popularity.


The weight of each bar indicates the association between the tools, and the numbers give the percentage of association. As you can see in the figure, TensorFlow and Keras are the most popular combination, with a weight of 149%. Anaconda and scikit-learn are another popular combination of tools.

The number to the left indicates the rank of each tool based on popularity. The color shows the value of the lift – green for more Python, red for more R.

We’ll be limiting our discussion to some of the open-source data science and machine learning tools. This list is not based on popularity, but on usability from a learner’s perspective. Let’s get started.


TensorFlow

TensorFlow is an open source library, built for Python with numerical computation in mind, with the goal of making machine learning more accessible and more efficient. Google’s TensorFlow eases the process of acquiring data, training models, serving predictions, and refining results.

Developed by the Google Brain team, TensorFlow is a library for large-scale machine learning and deep learning. It gathers together many different machine learning and deep learning algorithms under a common abstraction. TensorFlow uses Python as a convenient front-end API to build out applications within the framework, and executes those applications in high-performance C++.

TensorFlow can train and execute deep neural networks for image recognition, handwritten digit classification, recurrent neural networks, word embeddings, sequence models, natural language processing, and partial differential equation (PDE) based simulations. It also supports scalable production prediction using models similar to those used in training.
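The “dataflow programming” idea behind TensorFlow can be illustrated without the library itself. The sketch below is plain Python that mimics building a symbolic graph once and then evaluating it with different inputs; the Node class is invented for illustration and is not TensorFlow’s API:

```python
# Plain-Python sketch of dataflow-style programming -- not TensorFlow code.
# A graph of Node objects describes a computation symbolically; nothing is
# computed until inputs are fed in at evaluation time.

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self, feed):
        if self.op == "input":
            return feed[self.inputs[0]]  # look up a fed value by name
        vals = [n.eval(feed) for n in self.inputs]
        return vals[0] + vals[1] if self.op == "add" else vals[0] * vals[1]

# Build the graph y = (a + b) * c once...
a, b, c = Node("input", "a"), Node("input", "b"), Node("input", "c")
y = Node("mul", Node("add", a, b), c)

# ...then run it repeatedly with different data.
print(y.eval({"a": 1, "b": 2, "c": 3}))   # 9
print(y.eval({"a": 2, "b": 2, "c": 10}))  # 40
```

Separating graph construction from execution is what lets a real framework optimize, differentiate, and distribute the computation behind the scenes.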


Keras

Keras is a minimalist Python-based deep learning library that runs on top of TensorFlow or Theano. Keras was developed to help implement deep learning models quickly and efficiently, to aid research and development.

Keras runs on Python 2.7 and 3.5 and executes on CPUs and GPUs, depending on the underlying framework.

Keras was developed by an engineer at Google and has four guiding principles –

  1. Modularity – A model is understood as a standalone sequence or graph. The fundamental components of a deep learning model can be combined arbitrarily.
  2. Minimalism – The Keras library provides just enough information to help users achieve an outcome.
  3. Extensibility – Any new components are easy to add and implement within the framework. This is intentional allowing developers and researchers the freedom of trial and experimentation with new ideas.
  4. Python – There is no requirement for additional files with custom file specifications. When working in Keras, everything is native Python.

Keras’ deep learning workflow can be summarized as follows –

  1. Define your Model – Create your sequence and add layers as needed
  2. Compile your Model – Identify optimizers and loss functions
  3. Fit your Model – Train the model on your existing data
  4. Make Predictions – Use the trained model to generate predictions on new data
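To make the four steps concrete without requiring TensorFlow, here is a plain-Python toy that mirrors the define/compile/fit/predict workflow for a one-weight linear model. The TinyModel class and its methods are hypothetical stand-ins, not the real Keras API:

```python
# Toy model mirroring Keras's define/compile/fit/predict steps.
# Plain Python with a hypothetical API -- not real Keras code.

class TinyModel:
    def __init__(self):
        # 1. Define: a single trainable weight for y = w * x
        self.w = 0.0

    def compile(self, lr=0.01):
        # 2. Compile: choose optimizer settings (here, a learning rate)
        self.lr = lr

    def fit(self, xs, ys, epochs=200):
        # 3. Fit: per-sample gradient descent on squared error
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                grad = 2 * (self.w * x - y) * x
                self.w -= self.lr * grad

    def predict(self, xs):
        # 4. Predict with the trained weight
        return [self.w * x for x in xs]

model = TinyModel()
model.compile(lr=0.01)
model.fit([1, 2, 3], [2, 4, 6])  # underlying relationship: y = 2x
print([round(p, 2) for p in model.predict([4, 5])])  # [8.0, 10.0]
```

In real Keras, the same four calls configure layers, optimizers, and loss functions instead of a single hand-coded weight, but the rhythm of the workflow is identical.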


H2O

H2O is a scalable, fast, and distributed open source machine learning framework that provides many algorithms. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It supports smart applications including deep learning, random forests, gradient boosting, generalized linear modeling, etc.

H2O is a business focused AI tool that allows users to derive insights from data by way of faster and improved predictive modeling. The core code of H2O is written in Java.

H2O handles vast amounts of data, providing enterprise users with quick, accurate predictions. Additionally, H2O assists in extracting decision-making information from large amounts of data.

Apache Mahout

Apache Mahout is an open source framework built on the Hadoop platform. It assists with building scalable ML applications, and is comparable in purpose to Spark’s MLlib.

The three main features of Mahout are –

  1. A scalable and straightforward programming framework and environment
  2. A wide range of pre-packaged algorithms for Apache Spark + Scala, Apache Flink and H2O
  3. A vector math experimentation environment called Samsara, which has an R-like syntax and is dedicated to matrix calculation


Anaconda

Anaconda is a fully open source data science platform that boasts a community of more than 6 million users. It is simple to download and install, and packages are available for macOS, Linux, and Windows.

Anaconda comes with 1,000+ data packages in addition to the standard Conda package and virtual environment manager, eliminating the need to install each library independently.

The R and Python conda packages in the Anaconda Repository are curated and compiled within a secure environment, so users get the benefit of optimized binaries that work efficiently on their systems.

scikit-learn

scikit-learn is a machine learning library for Python. It is efficient and straightforward to use for data mining and data analysis tasks, is reusable in many different contexts, and is accessible to almost all users.

scikit-learn includes a number of different classification, clustering, and regression algorithms, including –

  1. Support vector machines
  2. Random forests
  3. k-means
  4. Gradient boosting
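As an illustration of what the library automates, here is a bare-bones plain-Python version of one algorithm from the list above, k-means. scikit-learn’s real implementation adds smarter initialization, vectorization, and convergence checks; this sketch only shows the core assign-then-update loop:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones k-means for small 2-D data: returns k centroids.

    No k-means++ initialization, vectorization, or early stopping --
    the refinements a library like scikit-learn provides.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centroids

# Two obvious blobs, around (0, 0) and (10, 10):
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

Seeing the loop written out makes it much easier to appreciate what a one-line `KMeans(n_clusters=2).fit(X)` call is doing for you.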

Top 5 Open Source Repositories to Get Started with Data Science

For any data science student, GitHub is a great place to find useful resources to learn data science better.

Here are some of the top resources and repositories on GitHub. There are lots of good libraries out there that we haven’t covered in this post, so if you’re familiar with data science repositories that you’ve found useful, please share them in the comments.

Awesome Data Science Repo

The Awesome Data Science repository on GitHub is a go-to resource guide when it comes to data science. It has been developed over the years through multiple contributions, with linked resources ranging from getting-started guides to infographics to suggestions of experts you can follow on various social networking sites.


Machine Learning and Deep Learning Cheat Sheet

The Cheatsheets-AI repository includes common techniques and tools put together in the form of cheatsheets. These range from simple tools like pandas to more complex procedures like deep learning.

Some of the common cheatsheets included here are – pandas, matplotlib, NumPy, dplyr, scikit-learn, tidyr, ggplot, neural networks, and PySpark.

Oxford Deep NLP Course Lectures

With the introduction of Deep Learning, NLP has seen significant progress, thanks to the capabilities of Deep Learning Architectures like Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN).

This repository is based on the Oxford NLP lectures and takes the study of Natural Language Processing to the next level. The lectures cover the terminology and techniques used in the field and advance to material such as using Recurrent Neural Networks for language modeling, text to speech, speech recognition, and more.


PyTorch

PyTorch is an open source machine learning library for Python, based on Torch, used for applications such as natural language processing. PyTorch has garnered a fair amount of attention from the deep learning community given the ease of Pythonic-style coding, faster prototyping, and dynamic computation.

The PyTorch tutorial repository includes code for deep learning tasks, from the basics of creating a neural network with PyTorch to coding Generative Adversarial Networks (GANs), RNNs, and neural style transfer. Most models are implemented in 30 lines of code or less.

Resources of the Neural Information Processing Systems (NIPS) 2017 Conference

This repository includes a list of resources and slides from most tutorials, invited talks, and workshops held during the NIPS 2017 conference. For the uninitiated, NIPS is an annual conference dedicated to machine learning and computational neuroscience.

Much of the recent breakthrough research in the data science industry is a result of work presented at these conferences.


Before starting a data science project, it is good to have a clear understanding of the technical requirements so that you can adapt resources and budgets accordingly. This is one of the main reasons an increasing number of organizations are choosing the flexibility of open source tools. The sheer variety of the open-source ecosystem has expanded knowledge and brought more new technologies into this field than ever before.

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

JMM Session: Technology and Resources for Teaching Statistics and Data Science

This blog post is a collection of the presentations from the session I chaired at the 2019 Joint Mathematics Meeting. The session was titled “Technology and Resources for Teaching Statistics and Data Science”. It was co-sponsored by the MAA Committee on Technology in Mathematics Education (CTiME) and the SIG-MAA: Statistics Education (Stat-Ed).

The abstract for the session was:

One of the five skill areas in the American Statistical Association’s curriculum guidelines is “Data Manipulation and Computation” (pg. 9), embracing the need for students to be competent with programming languages, simulation techniques, algorithmic thinking, data management and manipulation, as well as visualization techniques. Additionally, the emphasis on using real data and problems and their inherent complexity means that technology is often necessary outside of specifically prescribed computational courses. This session invites instructors to contribute talks exploring the use of any software or technology in statistics education. Talks may include effective instructional or pedagogical techniques for linking programming to statistics, interesting classroom problems and the use of technology to solve them, or more.

Abstracts for the talks can be found here.

Teaching a Technology-Rich Intro Stat Course in a Traditional Classroom, presented by Patti Frazer Lock, St. Lawrence University

Using the Islands in an Introductory Statistics Course, presented by Carl Clark, Indian River State College

StatPowers-A Simple Web-Based Statistics Suite for Introductory Statistics, presented by Brian R Powers, Arizona State University

Using R Programming to Enhance Mathematical and Statistical Learning, presented by Joseph McCollum, Siena College

Computational Experience for Linear Regression and Time Series using R, presented by Rasitha R. Jayasekare, Butler University

Statistics teaching and research with R, presented by Leon Kaganovskiy, Touro College

GAISEing into the Future with Fun, Flexible Mobile Data Collection and Analysis, presented by Adam F. Childers, Roanoke College

Written vs. Digital Feedback: Which Improves Student Learning?, presented by David R. Galbreath, United States Military Academy

Using Authentic Data in Spreadsheet Assignments and Quizzes to Improve Students’ Attitudes towards Elementary Statistics, presented by Daniel A. Showalter, Eastern Mennonite University

Democratizing Data: Expanding Opportunities for Students in Data Science, presented by Robin L. Angotti, University of Washington Bothell

I hope to have the recordings of the session posted in the future. Stay tuned for an updated post (I’ll also send an announcement).

Analytics Insight Interview

I was recently asked by Analytics Insight to provide responses to interview questions about the state of the industry and Valpo’s data science programs. I thought the readers of this blog might be interested in my responses. They also interviewed several other program directors, which collectively provides a nice snapshot of the data science and analytics programs across the country.

My responses are below, and match the full Valpo article found here:

The entire e-newsletter is available through this link, with interviews from:

Purdue – Krannert School of Management – Director Karthik Kannan
St. Mary’s College – Director Kristin Kuter
University of Iowa – College of Business – Dept. Head Barrett W. Thomas
University of San Francisco – Director David Uminsky

The industry is seeing a rising importance of Big Data Analytics and AI. How do you see these emerging technologies impact the business sector?

Just as previous manufacturing technologies changed the type of work the average employee must do, I expect that AI and analytics will drive major innovations in work-flow and process. We first saw factory-line workers improving the throughput, and often quality, of hand-made goods. Later, improvements in machine automation moved workers into supervisory and trouble-shooting roles over the machine processes. With AI, I think there will be another level of abstraction/distance between the produced goods and the oversight workers.

How is Valparaiso University’s Analytics and Data Science Program contributing to the growth and transformation of analytics and big data education?

Dr. Schmitt, the Director of Data Science Programs at Valparaiso, recently received a TRIPODS+EDU grant from the National Science Foundation (NSF) to investigate student difficulties in learning data science. Valpo is a prime site for doing this sort of education-focused research. We hope to continue to expand that work in the near future.

See the NSF News announcement:

See the Valparaiso University Press Release:

What is the edge Valparaiso University’s Analytics and Data Science Program has over other institutes in the industry?

Valpo’s programs have two features that I believe make them stand out from competitors, and the two are intimately tied together. First, students begin working with external clients and real, decision-level data as early as their second semester (that is, freshman year). Second, Valpo has extensive ties with non-profits and government agencies, due to its religious affiliations and history of social impact (1st in the Nation for Contribution to Public Good by the Washington Monthly). Together, this means that every year a student studies at Valpo, their classwork and time can contribute to making the world a better place.

Kindly brief us about your role at Valparaiso University’s Analytics and Data Science Program and your journey in this highly promising sector.

I took over directing the Analytics and Modeling (AMOD) program in 2014, one year after I was originally hired at Valpo. For me, this was incredibly exciting, as the ideas central to the program (applied mathematics with simulation, modeling, and statistical analysis) were the core of my scientific background and research areas. Since then I have taken a lead role in shaping both the existing graduate program (AMOD) and forming a new undergraduate major, Data Science. My primary role within both programs is to shape the student experience to develop analytical talent from recruitment to employment. I help with a variety of recruitment activities, advise all students enrolled in the programs, teach several of the key courses, and help many find internships or jobs after graduation.

What would you advise aspiring big data and analytics candidates?

Learn Python. Seriously, I’m convinced that Python will, over time, become the core language for data science and analytics, mostly because it has a far, far wider user base than other languages (maybe not in this specific field, but overall).

This is followed closely by: Learn GIT – and not just the desktop version!
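As a hedged illustration of what "not just the desktop version" might mean, here is a minimal command-line Git workflow (the repository URL and file names are placeholders, not real projects):

```shell
# Everyday command-line Git operations worth knowing beyond any GUI:
git clone https://github.com/user/repo.git   # copy a remote repository locally
cd repo
git checkout -b feature-analysis             # do new work on an isolated branch
git add analysis.py                          # stage a changed file
git commit -m "Add analysis script"          # record the change with a message
git push origin feature-analysis             # share the branch for review
```

Knowing these basics makes collaboration on analysis code far smoother than exchanging files by email.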

What are some of the challenges faced by the industry today?

I think the biggest challenges faced by the broader field and industry are fragmentation and a lack of vision across disciplinary boundaries. Turf wars between mathematics, statistics, computer science, business, or whoever aren’t really going to serve the further development of the field. Moreover, an unwillingness to see the value of others’ approaches can blind researchers and practitioners to major issues, central among them a lack of mathematical/statistical support for conclusions, the reproducibility of results, and inherent bias in any data or analysis.

Please share some major achievements of Valparaiso University’s Analytics and Data Science Program under your leadership.

Under my leadership, Valparaiso University has established a full undergraduate major in Data Science and developed two new undergraduate courses (new for Valpo), with two more coming next year. Enrollment in the graduate program (AMOD) has also increased by over 300%. We have also received a National Science Foundation grant for education research through the TRIPODS+X mechanism, one of only four education grants in that program.

Can you throw light on the latest employment trends in big data and analytics industry?

Sadly, I’m probably a bit out of touch here. I keep hearing about companies from nearly every sector wanting analytics skills, but the challenge is that each segment requires very different training to be on the edge of trends. Much of my focus is on identifying, and then providing to students, the skills that will let them ride today’s and tomorrow’s trends with equal skill.


Guest Post: Docker And the Data Scientist

Educational institutions and educators often face a problem when it comes to creating a common platform where teachers and students can view and share code. One university in Turkey had to find a way to address a common complaint from students: their compute environments differed from the testing machine used to grade their work.


A professor at Bilkent University in Ankara, Turkey, decided to use a technology called Docker to power a web platform that can create lab instances and grade assignments.


So, what is Docker? We will answer that in a while. But before Docker was available, the next best solution was to use virtual machines. However, these machines needed to be extremely powerful and consequently required expensive infrastructure, which most institutions couldn’t budget for. Students were instead forced to log on to a shared server, where their programs inadvertently interfered with one another or, worse, crashed the whole infrastructure. Needless to say, it was impracticable to assign a virtual machine to each student.


They used Docker to build a web-based application called the Programming Assignment Grading System (PAGS). A similar technique can be adopted by universities for creating lab instances and grading assignments for data science classes.


Although we haven’t formally defined what Docker is, the above example demonstrates what Docker can do. The rest of the article focuses on Docker and how it can transform the education and data science industry.


The article is divided into four sections. First, we’ll start with an introduction to Docker and Docker containers. Then, we’ll answer the question, “Who is Docker for?” The third part will give you an overview of how Docker is a useful tool for data scientists. In the final section, we’ll dive into a couple of interesting use cases for Docker in data science. Let’s get started!

What is Docker?


Docker is the leading software containerization platform, actively developed by Docker Inc. It is an open-source project designed to help you create, run, and deploy applications inside containers.


So, what is a container? A container comprises an application together with all the dependencies, libraries, and other related files it needs to run. Once you’ve created a container for your application, you can run it on any Linux machine, regardless of how the underlying machine is configured. If the machine at one end runs Ubuntu and the one at the other end runs Red Hat, fret not! Docker is precisely meant for situations like these.


You can create a snapshot of a container, and this snapshot is generally known as an image. Conversely, you can call a container an instance of a Docker image. Docker images are inert, immutable files. When someone asked about the difference between an image and a container on Stack Overflow, a web developer named Julian came up with a quick analogy: “The image is the recipe, the container is the cake,” which sums it up nicely.
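To make the recipe/cake analogy concrete, here is a hedged sketch of baking two containers from a single image (the `python:3` tag is a real public image on Docker Hub, but any image would do):

```shell
# Fetch the "recipe" (the image) once from Docker Hub...
docker pull python:3
# ...then bake as many "cakes" (containers) from it as you like;
# each run starts an independent, identical instance.
docker run --rm python:3 python -c "print('container A')"
docker run --rm python:3 python -c "print('container B')"
```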


You can store Docker images in a cloud registry like Docker Hub. There are numerous user-contributed Docker images covering almost all common use cases. You can also create private Docker images and share them with your co-workers and organization, or push them to a public repository to give back to the community.
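As a sketch, sharing an image through Docker Hub looks roughly like this (the account and image names below are placeholders, not real repositories):

```shell
docker login                                           # authenticate against Docker Hub
docker tag my-analysis:latest yourname/my-analysis:1.0 # name the image under your account
docker push yourname/my-analysis:1.0                   # publish it to the registry
docker pull yourname/my-analysis:1.0                   # colleagues fetch the exact same environment
```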


The concept of Docker is very similar to that of a virtual machine (VM). However, virtual machines are demanding beasts and run considerably slower on less powerful hardware. A hypervisor shares the host’s hardware among VMs, which lets you run one or more virtual operating systems inside your host operating system, but you might need to upgrade your processor if you’re seriously planning to run your software on a virtual machine.


Unlike a VM, Docker uses the host kernel instead of creating new kernel instances: the virtualization happens at the kernel level, not on top of a full guest operating system. A Docker container encapsulates everything required to run the application on the host machine, which tremendously improves the application’s performance and reduces its size. What gives Docker a significant lead is that it enables a separation of concerns between the infrastructure, IT operations, the developer, and the application. This creates a positive environment for enhanced innovation and collaboration.

Who is Docker for?

Docker is essentially a container platform largely aimed at businesses. It enables IT organizations to efficiently build and administer complete applications without the fear of infrastructure or architecture lock-in.


Enterprises use Docker for everything from setting up their development environments to deploying applications for testing and production. When you need to build more advanced systems, like a data warehouse comprising multiple modules, containers make a lot of sense: you can save days of work that you’d otherwise spend configuring each machine.


However, the Docker platform isn’t relevant to developers and enterprises alone. It’s actually a pretty useful tool for data scientists, analysts, and even schools and colleges. Many educational institutions and universities are keen to transform digitally but are held back by their existing infrastructure.

Docker and Data Science

Why should you use Docker if you’re a data scientist? Here are three reasons pointed out by Hamel Husain over at Towards Data Science:


Reproducibility

If you are a professional data scientist, it is imperative that your work can be reproduced. Reproducibility facilitates review by your peers and ensures that the analysis, model, or application you have built can run unhindered, which makes your deliverables both robust and time-tested.


As an example, suppose you have built a Python model. Running pip freeze and transferring the resulting file to a colleague is often not enough, because it captures only your Python packages, not the system-level dependencies surrounding them.


Imagine if you could avoid manually moving dependencies like the compiler, config files, drivers, etc. By simply bundling everything within a Docker container, you free colleagues from chasing down those dependencies. This not only spares others the work of recreating your environment, it also makes your work far more accessible.
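A minimal sketch of such a bundle might look like the following Dockerfile (the `model.py` and `requirements.txt` file names are hypothetical placeholders for your own project files):

```dockerfile
# Hypothetical example: package a Python model with its full environment.
FROM python:3                    # base image with Python and system libraries
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # install Python packages
COPY model.py .
CMD ["python", "model.py"]       # anyone can now `docker run` the model as-is
```

Built once with `docker build`, the resulting image carries the interpreter, libraries, and system dependencies together, so a colleague runs your model without reconstructing anything.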

Ability to Port Your Compute Environment

If you are a data scientist specializing in machine learning, the ability to frequently and efficiently change your computing environment has a considerable effect on your productivity.


Data science work often starts with prototyping, research, and exploration, which doesn’t necessarily require special computing power. But there often comes a stage where additional compute resources would dramatically speed up your workflow.


A number of data scientists find themselves limited to a local computing environment largely because re-creating that environment on a remote machine feels like too much of a hurdle. Here, Docker makes the difference: it allows you to port your work environment, including libraries, files, etc., in just a few commands. The ability to swiftly port your computing environment is also a substantial advantage in Kaggle competitions.

Enhance your Engineering Skills

Once you are comfortable using Docker, you can deploy models as containers, making your work readily accessible to other users. Additionally, various other applications you may need as part of your data science workflow may already be available as ready-made containers.

Use Cases for Docker in Data Science

By making your applications portable, cheaper to run, and more secure, Docker helps free up time and resources that can be spent on other important things. It can help transform IT without requiring you to re-code your existing applications, re-train your staff, or re-write your policies.


Here are just a few of the use cases of how Docker can help different organizations:

Docker for Education

Let’s revisit the Docker use case from the introduction. The faculty at the university used Docker to create a container to run their application, PAGS. It allowed students to have the same environment on their own machines and on the test machines, without needing a VM per student.


Docker provides a common environment that can run as a container on any given Linux machine, which virtually guarantees that the same container produces the same results on a different machine. Without Docker, this would have required more infrastructure and resources than the university had.


Another particularly interesting scenario is setting up lab instances. You can configure one machine exactly the way you want, take a snapshot of it as a Docker image, and then pull that image onto all the other lab instances, saving time and resources.
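A hedged sketch of that workflow using the Docker CLI (the container, image, and course names below are placeholders):

```shell
# Configure one container interactively (install course tools inside it)...
docker run -it --name lab-setup ubuntu bash
# ...snapshot the configured container as a reusable image...
docker commit lab-setup university/lab-image:fall2018
# ...then start an identical instance for each student seat.
docker run -it university/lab-image:fall2018 bash
```

In practice a Dockerfile is usually preferred over `docker commit` for repeatability, but the snapshot approach matches the "configure once, clone everywhere" scenario described above.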

Docker for Data Science Environment Set Up

Consider a scenario where you need to explore a few data science libraries in Python or R, but:

  1.  without spending a lot of time installing either language on your machine,
  2.  without hunting down which dependencies are essential, and
  3.  without having to identify what works best for your version of Windows/macOS/Linux.


This is where Docker can help.


Using Docker, you can get a Jupyter ‘Data Science’ stack installed and ready to execute in no time flat. It allows you to run a ‘plug and play’ version of a Jupyter data science stack from within a container.


To start, you would first install Docker Community Edition on your machine. Once done, restart your machine and set up your Jupyter container. Before running a container, you need to specify the base image for the container.


In most cases, the image you’re looking for has already been built by a prior user and includes everything needed for a fully loaded data science stack of your choice. All you need to do is specify a pre-defined image for Docker to start a container from.
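For example, the community-maintained jupyter/datascience-notebook image on Docker Hub bundles Python, R, and Julia with common data science libraries. A sketch of starting it:

```shell
# Pull and run a ready-made Jupyter data science stack,
# publishing the notebook server on port 8888.
docker run --rm -p 8888:8888 jupyter/datascience-notebook
# Then open the tokenized URL it prints (http://localhost:8888/...).
```

The `--rm` flag discards the container on exit; mount a local directory with `-v` if you want notebooks to persist.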


In this article, we have just scratched the surface of what can be done with Docker, focusing only on the areas a data scientist is most likely to encounter. Below are some further resources that can help you on your journey of using and implementing Docker.


  1. Basic Docker Terminologies
  2. Useful Docker Commands
  3. Dockerfile Reference
  4. Pushing and Pulling to and from Docker Hub

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Guest Post: Why Teach Machine Learning?

Guest Post by Limor Wainstein

Why Teach Machine Learning?

Teaching machines to learn about the real world has been a goal in computer science since Alan Turing first showed how to mechanise logic. But it’s only recently that affordable hardware has gained enough speed and capacity to make the idea commercially feasible in many domains, and more than feasible, seemingly inevitable.

Machine learning, alongside its siblings in data analytics and big data, is not only fashionable, it’s where the money and jobs are, thus attracting ambitious, commercially minded students. It’s also an increasingly important tool for all sciences, promoting interest among those aiming at careers in research and academia. Andrew Ng, former chief scientist at Baidu, the giant Chinese search engine company, and adjunct professor at Stanford, has called AI and machine learning ‘the new electricity’ for its potential to apply to and revolutionize all sectors of the economy and society.

That has become apparent in the job market. Towards the end of 2017, the Financial Times noted that three out of four of the top-paying jobs in software were for expertise in “the new profession” of machine learning. Ng says that the two biggest challenges for machine learning are acquiring the vast amounts of data required and finding skilled workers. Of the two, he said, the skill shortage is the biggest problem. Some entire job sectors, such as high frequency trading, are now entirely dependent on machine learning, and financial technology as a whole is moving in that direction rapidly. For example, J. P. Morgan recently issued a 280-page report on data analysis and machine learning in finance, focusing on the skills it needs to hire in large numbers – numbers that don’t exist.

Additional, highly-prominent machine learning domains exist alongside financial technology, for example, autonomous vehicles and medical diagnosis. Overtly AI-dominated companies like Google, Tesla and IBM are adept at garnering publicity. Such high-profile efforts mask the huge number of more mundane machine learning tasks that exist in every industry. Amazon, for example, uses machine learning across its entire retail system (from web interface to warehousing, packaging and delivery). Every company that operates with data at scale in retail has to follow those examples to compete.

Energy companies use machine learning to predict and manage supply and demand. Airlines manage pricing and route loading through machine learning. New medicines are developed using machine learning, and health services marshal their resources in response to short- and long-term trends in demand, tracked and predicted by machine learning. Agriculture, ditto. In fact, it’s hard to find any area untouched by machine learning – even theology is in on the trend, with Murdoch University in Perth using machine learning to analyze ancient Thai palm-leaf texts on Buddhist doctrines. The new electricity, indeed.

So, what is machine learning?

Machine learning is a subset of artificial intelligence, but is mercifully free of the philosophical and near-religious arguments of some AI research. Instead, machine learning is simple to define and has well-defined tools, techniques and goals, and an ever-expanding field of practical applications.

Machine learning is the application of algorithms and techniques to data sets in order to find whether certain patterns exist. Whether this includes data acquisition and cleaning before analysis, or decision-making afterwards, depends on how tightly you want to draw the definition. All of these things are important in practical machine learning-based applications but are usually domain specific. However, the core of machine learning isn’t domain specific and can be applied very widely. This has led it to be taught as a self-contained field.

Machine learning is inherently cross-disciplinary, and this is the greatest challenge in teaching the subject. There is a huge and unavoidable mathematical component, involving statistics, predicate calculus, linear algebra, and related concepts. This can come as a shock to computing students who have, until now, successfully minimized their exposure to such mathematical ideas. Computing skills are equally important, as machine learning involves the efficient manipulation of large and disparate data sets through complex transformations, often in highly parallel environments. With many practical machine learning applications bounded by hardware limitations, a deep understanding of system architecture and its practical consequences is also necessary. These facts will come as an equal shock to students in statistical machine learning courses who have avoided significant programming or hardware experience. A good machine learning practitioner needs to be fluent not only in programming but in systems architecture and data design. In addition, the practitioner needs to understand which of the many mathematical techniques to apply to a particular problem and how to apply them correctly.

In a real-life work environment, a data scientist or data engineer will typically find machine learning techniques useful. She may even require them to excel at her job. For example, she may need to create algorithmic patterns to search for data, use data patterns to make decisions and predictions, or use other techniques, such as smart sorting or fuzzy logic to prepare and manipulate data. These skills are at the heart of modern data science. It is clear, therefore, that a serious data science program should provide solid coverage of machine learning skills and techniques.

How should you teach it?

Picking the exact mix of tools, languages, and technologies for a course is to some extent a secondary issue, and can easily be based on what resources and skills are available to best match your choice of syllabus, project work and structure. Machine learning is a product of the Internet age and as such has a continuing evolution of best practice in its DNA. Checking out – and participating in – online machine learning communities such as Kaggle is one of the best ways to ensure alignment between machine learning teaching and actual student needs.

As with any subject, some students will have the skills, interest, or previous experience to easily follow one or both of the two major prongs of machine learning, the mathematical and the computational. Most will not. But teachers of machine learning have an advantage over their mathematician or computer science colleagues: they can use each prong to illustrate and contextualise the other. Students who experience a curriculum where each is taught independently often have problems, and this has unfortunately been common. On discussion boards where experienced ML practitioners advise students, disheartening comments abound.

Calvin John, an autonomous vehicle researcher, warned on Quora of his experience with a “…horrible textbook… very little conceptual basis for the theorems… bunch of isolated problems which were crudely connected in a very disjointed way”. Modern machine learning teaching is developing rapidly, but like many new interdisciplinary subjects, machine learning may be taught by different faculties, each led by its own approach without relating to the needs of the other disciplines involved.

Andy J. Ko, program chair of informatics at the University of Washington, also discusses the subject in his essay “We need to learn how to teach machine learning” (August 21, 2017). He writes: “We still know little about what students need to know, how to teach it, and what knowledge teachers need to have to teach it successfully.” He also points out the wide range of abilities and experience among students interested in machine learning, coming not only from previous undergraduate courses but from MOOCs and burgeoning commercial self-teaching online products. He nevertheless advocates the adoption of good pedagogical tools: evolving analogies and practical examples that combine theory and practice. It’s important, he says, to understand which concepts will be particularly difficult and to recognize what ideas, good and bad, students bring with them.

It’s in the practical examples that machine learning teachers have the greatest chance to equip students with a good, broad, and deep understanding of the field. Machine learning’s expanding applicability offers many choices; machine vision, text mining, and natural language processing are popular examples. The topic chosen should suit the project work running across the syllabus. A judicious introduction of new mathematical ideas alongside practical work, or practical problems that lead to theoretical insights, can reinforce students’ appreciation of the whole.

Here are some additional resources that discuss teaching machine learning:

A worked ML curriculum bringing together best-of-breed MOOC courses.

Another site that has several courses, including MOOCs and other deep-learning topics is

(They also have an interesting brief post on adding data science to a college curriculum)

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Advice for Project-Based Courses

This post will be part of a series on teaching a project-based “Introduction to Data Science” course. It is also part of my collection of resources and materials on teaching this type of course, which can be found on its own section of the blog here.

Here I will focus on summarizing some of the advice I received when designing the course; it should be generally applicable to ANY project-based course, not just an “Intro” course.

By far, the biggest, most important piece of planning/advice is:

Find clients who understand (or work to understand) working with students, and are willing to actually commit time to the project.

Based on advice, conversations, and experience, this commitment doesn’t need to be enormous: approximately one (1) hour per week of meeting/interacting through a CONSISTENT liaison with the students, plus reasonable responsiveness otherwise (to emails, phone calls, etc.). Remember when establishing this expectation that the client is receiving many times their investment in work-hours: in a 3-person group where each student puts in 3-5 hours per week, one hour of employee (and student) meeting time still yields 6-12 additional hours of cheap or free work. Why is this SO important?

  • For Project Success:
    1. Makes sure client will fulfill their end of agreements (data, etc) by personalizing it and identifying the responsible party at the client
    2. Makes sure the students are actually working on things the client wants to see happen
    3. Makes sure students feel obligated to keep working on project throughout semester (avoids last-minute crams)
  • For Learning Success:
    1. It gets students to practice talking about their work/data with a non-data science expert
    2. Provides regular check-ins and reporting so that project can’t derail (similar to above)
    3. Helps students stay out of the “weeds” of project improvement or code writing by reminding them regularly of the larger picture

Digging a little deeper, some valuable ideas came from a conversation with Joseph Mertz from Carnegie Mellon University (CMU), who has run or participated in CMU’s team-based and capstone project courses for nearly two decades. He suggested that commitment from clients can be secured in several ways: possibly a charge-per-project, but also by “formalizing” the whole project selection process, for example by requiring a kick-off event and a concluding celebration to which other students and faculty are invited. One truly interesting tidbit he offered was the suggestion to have the final project presentations given by the client, focusing on the value they received from the project. This can really increase the long-term impact for the students and your colleagues, and it might also increase the client’s long-term satisfaction.

This is getting long, so here’s a quick list of other advice (to be expanded into other posts later perhaps):

  • Start looking for projects early. Way earlier than you think (perhaps a full semester or more!)
  • Be sure to manage clearly the expectations the clients have for their projects (especially compared to your expectations). This also relates back to the idea above of having clients that understand what it means to work with a student group.
  • Consider carefully your learning objectives, and how that relates to projects/clients.
    • Do you want students to get the full experience of project scoping/design?
    • Do you want clients to arrive with a well-defined idea or specific question to be answered (simplifying the above)?
    • Should incoming data be clean already (emphasizing algorithm/presentation design), raw (the whole process), or even missing (generation/collection of data)?
  • When designing YOUR deadlines and expectations, remember that sometimes clients are hard to work with.
    • Will you (the students’ professor) be the ‘real’ client, with hard deadlines and clear project expectations?
    • Or is the client/contact the ‘real’ client, with big flexibility on specific deliverables, deadlines, etc. (within reason)?

NASEM Webinar 1: Data Acumen

This webinar aimed to discuss how to build undergraduates’ “data acumen”. If acumen isn’t a word you use regularly (I didn’t before last year), it means “the ability to make good judgments and quick decisions”. Data acumen, therefore, is the ability to make good judgments and quick decisions with data. Certainly a valuable and important skill for students to develop! The webinar’s presenters were Dr. Nicole Lazar of the University of Georgia and Dr. Mladen Vouk of North Carolina State University. Dr. Lazar is a professor of statistics; Dr. Vouk is a distinguished professor of computer science and the Associate Vice Chancellor for Research Development and Administration.

Overall, this webinar seemed largely a waste of time if your goal was to understand what activities, curricular designs, and practices help students develop data acumen (see my last paragraph for a suggested alternative). On the other hand, if you’d like a decent description of the design, implementation, and scaling of a capstone course, listen to Dr. Lazar’s portion. If you still need an overview of the state of data science, Dr. Vouk’s portion provided reasonable context. The most valuable things in the entire webinar were slides 26 and 27 (about minute 48). Slide 26 shows an excellent diagram of an “End-to-End Data Science Curriculum” that articulates reasonably well how a student might mature (and thereby gain data acumen); see figure 1 below. Slide 27 provides well-articulated learning objectives for core, intermediate, and advanced data science courses (see table below).

From the NASEM Data Acumen Webinar: North Carolina State University’s Curriculum Vision
  • Core
    • Able to master individual core concepts within Bloom’s taxonomy:
      Knowledge, Comprehension, Application, Analysis, Evaluation, and Synthesis
    • Able to adapt previously seen solutions to data science problems for target domain-focused applications utilizing these core concepts
  • Intermediate Electives
    • Able to synthesize multiple concepts to solve, evaluate and validate the proposed data science problem from the end-to-end perspective
    • Able to identify and properly apply the textbook-level techniques suitable for solving each part of the complex data science problem pipeline
  • Advanced Electives
    • Able to formulate new domain-targeted data science problems, justify their business value, and make data-guided actionable decisions
    • Able to research the cutting edge technologies, compare them and create the optimal ones for solving the DS problems at hand
    • Able to lead a small team working on the end-to-end execution of DS projects


An Alternative to the NASEM Webinar

While I found this particular webinar largely a waste of time, I also attended the NASEM Roundtable on “Alternative Educational Pathways for Data Science”. While certainly not focused on data acumen, the first presentation at that roundtable described an excellent overall curriculum structure that does build students’ data acumen. Eric Kolaczyk from Boston University described their non-traditional master’s program in Statistical Practice. By integrating course work, practicum experiences, and more, students are forced to exercise and build their ability to make good judgments about data investigations, methods, and results. The talk is well worth your time if you’d like ideas for non-standard ways to build student skills and abilities.

Spring 2018: What’s Happening

Spring semester is off to an intense start for me! I’m again teaching an “Introduction to Data Science” using a project-based methodology. We’ve got 6 awesome projects from for-profit, government, and internal clients. There’s also plenty going on in the data science world as faculty gear up for SIGCSE (which has several data science sessions) and other conferences over the summer.

I’m going to run a series of summaries of the National Academies of Sciences, Engineering, and Medicine “Webinar Series on [Envisioning] Data Science Undergraduate Education”. If you weren’t able to watch them this fall (I wasn’t!), I’ll be watching them, summarizing the general content, and pointing out useful highlights to take away. I’m hoping to get one out about every week (no promises though!)

You can find the summaries under the category “Webinar Summaries” and I’ll also tag them with NASEM Webinar. If there’s some pressing question you’d love to see a post on, let me know!