This post will have comments and ideas from my attendance at the 2019 Symposium on Data Science and Statistics in Bellevue, Washington. It will also have the final draft of my presentation slides.
Slides from my presentation:
Last week, ACM announced the first draft of its “Computing Competencies for Undergraduate Data Science Curricula”, i.e., ACM’s take on a Data Science curriculum recommendation. The full draft can be found here. The ACM Data Science task force is explicitly asking for community feedback on this draft by March 31st. I was able to attend their town-hall feedback session at the SIGCSE Technical Symposium, where there was both excitement and some concern about the scope the curriculum recommendations take. This post offers some reflections and thoughts on the draft; however, I strongly encourage anyone involved with Data Science curriculum design or implementation to read it for yourself!
Chapter 1: Introduction and Background Materials
First, I’m really glad to see this being produced. I’ve commented previously on this blog about some of the other curriculum guidelines, emphasizing that the ‘computing’ perspective was often a bit under-represented. I also need to praise the task force for not simply reinventing the wheel! Their first substantial section is a review of the existing, relevant curriculum recommendations related to data science. They’ve done a thorough job (the first I’ve seen publicly posted), with some valuable insights into each. If you haven’t had a chance to read some of my blog posts about the other recommendations (See: Related Curricula, EDISON, Park City), their summary is an excellent starting place. One curriculum they examine that has not been discussed on this blog is the Business-Higher Education Forum (BHEF) Data Science and Analytics (DSA) Competency Map (2016). Their discussion of this material can be found on page 7.
Another important thing to catch in their discussion of the task force’s charge and work is that they are only trying to define computing’s contribution to data science. This is in stark contrast to most of the other curriculum guidelines relating to data science, which include the full breadth of what a data science curriculum might entail. In talking with the chair of the task force, it is clear there is a recognition that this is only the first stage in developing a community-recognized, full-fledged curriculum guide.
Chapter 2: The Competency Framework
The task force is taking a slightly different approach to developing the curriculum than ACM took with CS-2013. Instead of focusing exclusively on “Knowledge Areas” they are developing a competency framework. Given how much the field of data science leans on soft-skills, in addition to technical skills, this is certainly a reasonable approach. The main concern expressed by the task force chair, which I share, is that it is still important for the final guide to be highly usable to guide program development. While the current draft does not achieve the same level of usefulness that CS-2013 does, I have high hopes for their final product. The motivation for this switch is grounded heavily in current scholarship of teaching and learning alongside cognitive learning theory. This has a long-term potential to help transform educational settings from a passive learning environment to a more active, student-centered paradigm (which I am strongly in favor of!). However, it will require significantly more work to transform the current competencies into something usable for both student-centered design and programmatic design.
If you aren’t aware of the concepts of “Understanding by Design”, learning transfer theory, or how these interact on a ‘practical, operational level’ it would certainly be worth your time to read through this chapter carefully. It may provide you with many new ideas to consider when doing course planning or activity planning in general.
Appendix A: Draft of Competencies for Data Science
To begin with, this appendix is massive. It is 23 pages long, 40% of the entire document. The task force is well aware that this section is currently too extensive to be truly useful, especially as presented. However, they will be forming several sub-committees to work on refining each of the competency areas in the next month or two. The target time-frame for a refined draft is late summer. The next sections of this post reflect on the various competencies as stated.
This competency and its sub-categories clearly demonstrate the break from CS-2013. Where CS-2013 organized content based on topical areas of computer science, here we see a smattering of ideas from several areas. It pulls several ideas from the area of “Algorithms and Complexity” with a strong focus on the algorithmic side, and the data/programming structures that support algorithm implementations. The beautiful thing is that these do fairly clearly express computing’s perspective on absolutely essential tasks that support best-usage of statistical and data-science ideas. Probably the most surprising thing for someone not from a CS background would be the inclusion of the ‘Software Engineering’ ideas. However, based on my experiences talking with industry practitioners, this is perhaps the most overlooked area of preparing future data scientists. It becomes especially critical when trying to move their models and techniques into actual production code that produces value for a company.
I have actually merged two knowledge areas as defined by the task force here: “Data Acquisition and Governance” and “Data Management”. As described, these could be merged into one more over-arching idea: how a data scientist actually deals with the “bytes” of data, regardless of the actual content of the data. It also covers ideas such as selecting data sources, storing the data, querying databases, etc. This section obviously comes strongly from the “Information Sciences” or “Information Management” sector of computer science.
Something that might be missing (or might be buried in the IS language) is the idea of careful design of the actual collection of data. That is, does a survey, log, or other acquisition process actually collect information that is usable for the planned data-science task or goal?
Again, I’ve re-named the higher-level category. The task-force originally called this group “Data Privacy, Security, and Integrity”. While highly descriptive, as it matched exactly the sub-categories, it seemed slightly redundant to have it as the meta-category as well. This is an interesting grouping also, as the “Privacy” competency clearly covers things that most faculty and practitioners I discuss data science with would agree should be included. However, the “Security” and “Integrity” competencies dive into highly technical areas of encryption and message authentication. They both seem to have been heavily drawn from the realm of Cybersecurity. I expect that most of the existing data science (undergraduate) programs would find it highly challenging to include more than a very superficial coverage of this content. Even graduate programs might not do more than touch upon the idea of mathematical encryption unless the students themselves sought out additional course work (such as a cryptography class).
Even though I’m not sure programs are covering, or even could cover, more of this content, this may be a clear area for program expansion. Perhaps as more courses are developed that exclusively serve data science programs, it will become possible to include more of these ideas.
As could be expected, there are competencies related to actually learning something from data. The task force has (currently) chosen to split some of the ideas into two categories. The Machine Learning knowledge area is massive, and includes most of the details about algorithms, evaluation, processes and more. The Data Mining knowledge area seems to try and provide competencies related to overall usage and actual implementation of machine learning. I’ll let you pick through it yourself, but from my read through it seems to cover the majority of ideas that would be expected, including recognition of bias and decisions on outcomes.
My feedback – Ditch the separate knowledge areas, and provide some “sub” areas under Machine Learning.
Perhaps the area that drove data science into the limelight, the task force has provided a nice break-down of sub-areas and related competencies. While a “sexy” area to have a course in, in my mind this is actually a “nice to have”, not a necessary content coverage area. Especially reading through all the details, it really does deal with “big” issues (appropriately!). However, lots and lots of data scientists that we train at the undergraduate level are simply not going to be dealing with these problems. Their day-to-day will be consumed with fundamentals, data governance and maintenance, and maybe, if they are lucky, some machine learning.
The task force’s take on this section was from a more technical standpoint. Specifically, it draws from the area of human-computer interaction (HCI). In walking the line of defining computing-specific competencies, without edging into statistics or graphic design, I think this is an excellent section. I am glad to see its inclusion and thoughtful consideration. Often CS students forget about the importance of thinking carefully about how a human will actually interact with a computer. Instead they typically focus just on what the computer will output.
While this competency area is framed as a “meta” area with sub-categories, it has nearly as many sub-categories as the entire rest of the framework. While I think most (perhaps even all) of these do belong as part of a curriculum/competency guide, this felt excessive as presented. This is especially true if we are considering the suggested content for an undergraduate curriculum. While I feel that all students should be aware of the idea of “intellectual property” getting into the weeds of different regulations, IP ideas, etc seems pretty excessive for most students. Most likely, I’d simply be encouraging them to know what falls under these ideas, and then tell them to talk to a lawyer. Similarly, discussing at length “Change Management” seems highly ambitious for most data science students, especially at the undergraduate level. While they might need to be aware that their work will foster change, and that someone should be managing it… it probably shouldn’t be them unless they get explicit training in it! And, given the scope of technical skills to cover in a data-science curriculum, I sincerely doubt there will be space for much of this.
While I’ve tried to provide some quick reflections on the entire draft, you should definitely go read it yourself! Or, keep your head up looking for the subsequent drafts and processes. ACM has a history of collecting very interdisciplinary teams for generating consensus curriculum guidelines, so I expect over the next few years we’ll see a fairly substantial effort to bring more perspectives to the table and generate an inclusive curriculum guide.
Open source solutions are improving how students learn and how instructors teach. Matt Mullenweg, the founder of WordPress, shared his opinion on open source with TechCrunch a few years back: “When I first got into technology I didn’t really understand what open source was. Once I started writing software, I realized how important this would be.”
Open source software is almost everywhere and tons of modern-day proprietary applications are built on top of it. Most students reading through the intricacies of data science will already be fairly familiar with open source software because many popular data science tools are open-source.
There’s a popular perception that open-source tools are not as good as their proprietary peers. However, similar to Linux, just because the underlying code is open and free does not necessarily mean that it is of a poorer quality. Truth be told, open source is probably the best in its class when it comes to development and data science.
In this post, we’ll have a straightforward look at how open source contributes to data science. We’ll also cover open-source tools and repositories that are related to data science.
Why exactly do Open Source and Data Science go hand-in-glove?
Perhaps the most significant benefit of open source tools is that they give developers the freedom to modify and customize them. This allows for quick improvements and experimentation, which in turn encourages extensive use of a package and its features.
As is the case with any other development that captures the interest of the technology sector as well as the public, the critical piece lies in the ability to bring the end product to the market as quickly and as thoroughly as possible. Open source tools prove to be a massive benefit to this end.
The open-source ecosystem helps you solve your data science problems faster. For instance, you can use tools like Jupyter for rapid prototyping and git for version control. There are other tools that you can add to your toolbelt like Docker to minimize dependency issues and make quick deployments.
Google is a good example of a company that contributes to open source, and TensorFlow is the best-known of its contributions. Google uses TensorFlow for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. By open-sourcing tools like TensorFlow, Google benefits from contributors outside the core team. As TensorFlow grows more popular, many new research ideas are implemented in it first, which makes it more efficient and robust.
Google explains this topic in-depth in their open-source documentation.
While open source work may have benevolent results it is not an act of charity. Releasing work as open source and the corresponding contribution process eventually result in a higher return on the initial investment made versus the alternative closed source process.
One of the significant benefits of open source, especially when it comes to data science breakthroughs, is the sheer size of the various open source communities that developers can tap into for problem-solving and debugging.
These forums have hundreds of answers to frequently asked questions, and as open-source data science tools continue to expand, these communities and repositories of information will only grow.
The best way to learn data science is to actively participate in the data science communities that you love. With open-source, that’s entirely possible because you can start by just following data science projects and repositories on GitHub. Take part in discussions and when you feel you’re ready, contribute to their code by volunteering to review code and submit patches to open-source security bugs.
This will help you get involved, gain exposure and learn details that might otherwise be impossible to learn from your degree curriculum.
KDnuggets recently published the results of a Data Science and Machine Learning poll they conducted earlier this year. The graph shows the tools with the strongest association and each tool’s rank based on popularity.
The weight of the bar indicates the association between the tools, and the numbers give the percentage of association between them. As you can see in the figure, TensorFlow and Keras are the most popular combination, with a weight of 149%. Anaconda and scikit-learn are another popular combination of tools.
The number to the left indicates the rank of each tool based on popularity. The color indicates the value of the lift: green for more Python and red for more R.
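This post does not reproduce the poll's formula, so the following is only a sketch under stated assumptions. It uses one common definition of lift, P(A and B) / (P(A) * P(B)), on a tiny made-up set of poll responses; the `responses` data and the `lift` helper are hypothetical and are not KDnuggets data or code. How a figure such as 149% is derived from the lift is not stated here, so the sketch stops at the raw ratio. A lift above 1 means two tools are chosen together more often than independence would predict.

```python
from typing import Iterable, Set


def lift(responses: Iterable[Set[str]], tool_a: str, tool_b: str) -> float:
    """Return P(a and b) / (P(a) * P(b)) over a collection of poll responses.

    Each response is the set of tools one (hypothetical) voter selected.
    """
    responses = list(responses)
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to analyse")

    count_a = sum(1 for r in responses if tool_a in r)
    count_b = sum(1 for r in responses if tool_b in r)
    count_both = sum(1 for r in responses if tool_a in r and tool_b in r)
    if count_a == 0 or count_b == 0:
        raise ValueError("each tool must appear in at least one response")

    p_a = count_a / total
    p_b = count_b / total
    p_both = count_both / total
    return p_both / (p_a * p_b)


# Made-up responses, for illustration only.
responses = [
    {"TensorFlow", "Keras", "Python"},
    {"TensorFlow", "Keras"},
    {"R", "ggplot2"},
    {"Python", "scikit-learn"},
]

tf_keras_lift = lift(responses, "TensorFlow", "Keras")
python_r_lift = lift(responses, "Python", "R")
print(f"TensorFlow & Keras lift: {tf_keras_lift:.2f}")
print(f"Python & R lift: {python_r_lift:.2f}")
```

In this toy data TensorFlow and Keras always appear together, so their lift is well above 1, while Python and R never co-occur and their lift is 0.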
We’ll be limiting our discussion to some of the open-source data science and machine learning tools. This list is not based on popularity, but instead usability from a learner’s perspective. Let’s get started.
TensorFlow is an open source library for numerical computation, built for Python with the goal of making machine learning more accessible and more efficient. Google’s TensorFlow eases the process of obtaining data, training models, serving predictions and refining results.
Developed by the Google Brain team, TensorFlow is a library for large-scale machine learning and deep learning. It bundles together many different machine learning and deep learning algorithms and makes them usable through a common metaphor. TensorFlow uses Python as a convenient front-end API for building applications within the framework, and executes those applications in high-performance C++.
TensorFlow can train and run deep neural networks for image recognition, handwritten digit classification, recurrent neural networks, word embeddings, sequence models for ML, natural language processing and partial differential equation (PDE) based simulations. It also supports production prediction at scale, using models similar to those used in training.
Keras is a minimalist Python library for deep learning that runs on top of TensorFlow or Theano. It was developed to help implement deep learning models quickly and efficiently, to aid research and development.
Keras runs on Python 2.7 and 3.5 and executes on CPUs and GPUs, depending on the underlying framework.
Keras was developed by an engineer at Google and has four guiding principles –
Keras’ deep learning process can be summarized as below –
H2O is a scalable, fast and distributed open source machine learning framework that provides many algorithms. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It supports smart applications including deep learning, random forests, gradient boosting, generalized linear modeling, etc.
H2O is a business focused AI tool that allows users to derive insights from data by way of faster and improved predictive modeling. The core code of H2O is written in Java.
H2O helps with vast amounts of data that allows enterprise users with accurate and quick prediction. Additionally, H2O assists in extracting decision making information from large amounts of data.
Apache Mahout is an open source framework built on the Hadoop platform. It assists with building scalable ML applications and plays a role comparable to MLlib, the machine learning library of Apache Spark.
The three main features of Mahout are –
Anaconda is an open source data science distribution that boasts a community of more than 6 million users. It is simple to download and install, and packages are available for macOS, Linux and Windows.
Anaconda comes with 1,000+ data packages in addition to the standard Conda package and the virtual environment manager. This eliminates the necessity to install each library independently.
The R and Python conda packages in the Anaconda Repository are curated and compiled within a secure environment so that users get optimized binaries that work efficiently on their system.
scikit-learn is a tool that enables machine learning in Python. It is efficient and straightforward to use for data mining and data analysis tasks. The package is reusable in many different contexts and is accessible to almost all users.
scikit-learn includes a number of different classification, clustering and regression algorithms including –
For any data science student, GitHub is a great place to find useful resources to learn data science better.
Here are some of the top resources and repositories on GitHub. There are lots of good libraries out there that we haven’t covered in this post. If you’re familiar with data science repositories that you’ve found useful, please share them in the comments.
The Awesome Data Science repository on GitHub is a go-to resource guide when it comes to data science. It has been developed over the years through multiple contributions with linked resources from getting-started guides, to infographics, to suggestions of experts you can follow on various social networking sites.
Here’s what you’ll find in that repo.
The Cheatsheets-Ai repository includes common techniques and tools put together in the form of cheatsheets. These range from simple tools like pandas to more complex procedures like Deep Learning.
Some of the common cheatsheets included here are – pandas, matplotlib, numpy, dplyr, scikit-learn, tidyr, ggplot, Neural Networks, and pySpark.
With the introduction of Deep Learning, NLP has seen significant progress, thanks to the capabilities of Deep Learning Architectures like Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN).
This repository is based on the Oxford NLP lectures and takes the study of Natural Language Processing to the next level. The lectures cover terminology and techniques ranging up to advanced material such as using Recurrent Neural Networks for Language Modeling, Text to Speech, Speech Recognition, etc.
PyTorch is an open source machine learning library for Python, based on Torch, used for applications such as natural language processing. PyTorch has garnered a fair amount of attention from the Deep Learning community given the ease of Pythonic-style coding, faster prototyping, and dynamic computation.
The PyTorch tutorial repository includes code for Deep Learning tasks, from the basics of creating a neural network using PyTorch to coding Generative Adversarial Networks (GANs), RNNs, and neural style transfer. Most models are implemented in 30 lines of code or fewer.
NIPS 2017 includes a list of resources and slides of most tutorials, invited talks, and workshops held during the NIPS 2017 conference. For the uninitiated, NIPS is an annual conference held specifically for Machine Learning and Computational Neuroscience.
Most recent breakthrough research within the data science industry is a result of research that has been presented at these conferences.
Before starting a data science project, it is good to have a clear understanding of what the technical requirements are so that you can adapt resources and budgets accordingly. This is one of the main reasons an increasing number of organizations are choosing the flexibility of open source tools. The sheer variety of the open-source environment has helped expand knowledge and bring more new technologies into this field than ever before.
This was a guest post by: Limor Wainstein
This blog post is a collection of the presentations from the session I chaired at the 2019 Joint Mathematics Meeting. The session was titled “Technology and Resources for Teaching Statistics and Data Science”. It was co-sponsored by the MAA Committee on Technology in Mathematics Education (CTiME) and the SIG-MAA: Statistics Education (Stat-Ed).
The abstract for the session was:
One of the five skill areas in the American Statistical Association’s curriculum guidelines is “Data Manipulation and Computation” (pg. 9), embracing the need for students to be competent with programming languages, simulation techniques, algorithmic thinking, data management and manipulation, as well as visualization techniques. Additionally, the emphasis on using real data and problems and their inherent complexity means that technology is often necessary outside of specifically prescribed computational courses. This session invites instructors to contribute talks exploring the use of any software or technology in statistics education. Talks may include effective instructional or pedagogical techniques for linking programming to statistics, interesting classroom problems and the use of technology to solve them, or more.
Teaching a Technology-Rich Intro Stat Course in a Traditional Classroom, presented by Patti Frazer Lock, St. Lawrence University
Using the Islands in an Introductory Statistics Course, presented by Carl Clark, Indian River State College
StatPowers-A Simple Web-Based Statistics Suite for Introductory Statistics, presented by Brian R Powers, Arizona State University
Using R Programming to Enhance Mathematical and Statistical Learning, presented by Joseph McCollum, Siena College
Computational Experience for Linear Regression and Time Series using R, presented by Rasitha R. Jayasekare, Butler University
Statistics teaching and research with R, presented by Leon Kaganovskiy, Touro College
GAISEing into the Future with Fun, Flexible Mobile Data Collection and Analysis, presented by Adam F. Childers, Roanoke College
Written Vs. Digital Feedback; Which improves Student Learning?, presented by David R. Galbreath, United States Military Academy
Using Authentic Data in Spreadsheet Assignments and Quizzes to Improve Students’ Attitudes towards Elementary Statistics, presented by Daniel A. Showalter, Eastern Mennonite University
Democratizing Data: Expanding Opportunities for Students in Data Science, presented by Robin L. Angotti, University of Washington Bothell
I hope to post recordings of the session in the future. Stay tuned for an updated post (I’ll also send an announcement).
I was recently asked by Analytics Insight to provide some responses to interview questions about the state of the industry and Valpo’s data science programs. I thought the readers of this blog might be interested in my responses. They also interviewed several other program directors which collectively provide a nice snap-shot of the data science and analytics programs that are going on across the country.
My responses are below, and match the full Valpo article found here:
The entire e-newsletter is available through this link, with interviews from:
Purdue – Krannert School of Management – Director Karthik Kannan
St. Mary’s College – Director Kristin Kuter
University of Iowa – College of Business – Dept. Head Barrett W. Thomas
University of San Francisco – Director David Uminsky
Just as previous manufacturing technologies have impacted the type of work that the average employee must do, I expect that AI and Analytics will also drive major innovations in work-flow and process. Previously we saw first factory line workers improving the throughput, and often quality, of hand-made goods. Later improvements in machine automation moved workers into supervisory and trouble-shooting roles of the machine processes. I think there will now be another level of abstraction/distance between the produced good(s) and the oversight workers with AI.
Dr. Schmitt, the Director of Data Science Programs at Valparaiso, recently received a TRIPODS+EDU grant from the National Science Foundation (NSF) to investigate student difficulties in learning data science. Valpo is a prime site for doing this sort of education-focused research. We hope to continue to expand that work in the near future.
See the NSF News announcement:
See the Valparaiso University Press Release:
Valpo’s programs have two features, intimately tied together, that I believe make them stand out from their competitors. First, students begin working with external clients and real, decision-level data as early as their second semester (that is, freshman year). Second, Valpo has extensive ties with non-profits and government agencies, due to its religious affiliations and history of social impact (1st in the Nation for Contribution to Public Good by the Washington Monthly). Together this means that every year a student studies at Valpo, their classwork and time can contribute to changing the world into a better place.
I took over directing the Analytics and Modeling (AMOD) program in 2014, one year after I was originally hired at Valpo. For me, this was incredibly exciting as the ideas central to the program, applied mathematics with simulation, modelling, and statistical analysis were the core of my scientific background and research areas. Since then I have taken a lead role in shaping both the existing graduate program (AMOD) and forming a new undergraduate major, Data Science. My primary role within both programs is to help shape the student experience to develop analytical talent from recruitment to employment. I help with a variety of recruitment activities, advise all students enrolled in the programs, teach several of the key courses, and help many find internships or jobs after graduation.
Learn Python. Seriously, I’m passionate that Python will over time become the core language for data science and analytics. Mostly because it’s simply got a far, far wider user-base than other languages (maybe not in this specific field, but overall).
This is followed closely by: Learn GIT – and not just the desktop version!
I think the biggest challenges faced by the broader field and industry are fragmentation and lack of vision across disciplinary boundaries. Turf wars between mathematics, statistics, computer science, business, or whoever, aren’t really going to serve the further development of the field. Moreover, I think a lack of willingness to see the value of others’ approaches can blind researchers and practitioners to major issues. Central among these are lack of mathematical/statistical support for conclusions, reproducibility of results, and inherent bias in any data or analysis.
Under my leadership Valparaiso University has established a full undergraduate major in Data Science and developed two new undergraduate courses (for Valpo) with two more coming next year. The enrolment in the graduate program (AMOD) has also increased by over 300%. We have also received a National Science Foundation grant for education research through the TRIPODS+X mechanism, one of only 4 educational grants.
Sadly, I’m probably a bit out of touch here. I keep hearing about companies from nearly every sector wanting analytics skills, but the challenge is that each segment requires very, very different training to be on the edge of trends. Much of my focus is on identifying, and then providing to students, the skills that will allow them to ride today’s and tomorrow’s trends with equal skill.
Educational institutions and educators often struggle to create a common platform where instructors and students can view and share code. One university in Turkey had to address a common complaint from students: their compute environments differed from the testing machine.
A professor at Bilkent University in Ankara, Turkey, decided to use a technology called Docker to power a web platform that can create lab instances and grade assignments.
So, what is Docker? We will answer that in a while. Before Docker was available, the next best solution was to use virtual machines. However, these machines needed to be extremely powerful and consequently required expensive infrastructure, which most institutions couldn’t budget for. Students were instead forced to log on to a shared server, where they inadvertently interfered with each other’s programs or, worse, crashed the whole infrastructure. Needless to say, it was impractical to assign a virtual machine to each student.
They used Docker to build a web-based application called Programming Assignment Grading System (PAGS). A similar technique can be adopted by universities for creating lab instances and grading assignments for data science classes.
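The article does not describe how PAGS is implemented, so the following is only a hedged sketch of the general idea: a grader could assemble a `docker run` command that executes a student's submission inside a throwaway container with no network access and a memory cap. The function name, the image name `course/grader:latest`, and the paths are all hypothetical; `--rm`, `--network none`, `--memory`, `-w`, and the read-only `-v` bind mount are standard `docker run` options.

```python
from pathlib import PurePosixPath
from typing import List, Sequence


def build_grading_command(
    image: str,
    submission_dir: str,
    test_command: Sequence[str],
    memory_limit: str = "256m",
) -> List[str]:
    """Build (but do not run) a `docker run` command for one submission.

    All names here are illustrative assumptions, not part of PAGS.
    """
    # Bind mounts need an absolute host path.
    if not PurePosixPath(submission_dir).is_absolute():
        raise ValueError("submission_dir must be an absolute path")

    return [
        "docker", "run",
        "--rm",                    # remove the container when it exits
        "--network", "none",       # no network access for student code
        "--memory", memory_limit,  # cap memory so one job cannot starve others
        "-v", f"{submission_dir}:/submission:ro",  # mount the code read-only
        "-w", "/submission",       # run the tests from the mounted directory
        image,
        *test_command,
    ]


command = build_grading_command(
    image="course/grader:latest",                  # hypothetical image name
    submission_dir="/srv/submissions/student42",   # hypothetical path
    test_command=["python", "-m", "unittest", "discover"],
)
print(" ".join(command))
```

On a machine with Docker installed, the resulting list could be handed to `subprocess.run(command, capture_output=True, timeout=60)`. The sketch only builds the command so that it can be inspected without Docker, and because every student's code runs in the same image, the "works on my machine" complaint largely goes away.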
Although we haven’t formally defined what Docker is, the above example demonstrates what Docker can do. The rest of the article focuses on Docker and how it can transform the education and data science industry.
The article is divided into 4 sections. First, we’ll start with an introduction on Docker and Docker containers. Then, we’ll answer the question, “Who is Docker for?” The third part will give you an overview of how Docker is a useful tool for data scientists. In the final section, we’ll dive into a couple of interesting use cases for Docker in data science. Let’s get started!
Docker is the leading software containerization platform that is being actively developed by Docker Inc. It is an open source project that is designed to help you create, run and deploy applications inside containers.
So, what is a container? A container, by definition, comprises all the dependencies, libraries and other related files required to run an application. Once you’ve created a container for your application, you can run it on any Linux machine regardless of the way your underlying machine is configured. If the machine that you’re using at one end is Ubuntu, and it’s Red Hat at the other end, fret not! Docker is precisely meant for situations like these.
You can create a snapshot of a container and this snapshot is generally known as an image. Conversely, you can call a container an instance of a Docker image. Docker images are inert and immutable files. When someone asked about the difference between an image and a container on StackOverflow, a web developer named Julian came up with a quick analogy. “The image is the recipe, the container is the cake”, he said, and that just about sums it up.
You can store Docker images in a cloud registry like Docker Hub. There are numerous user-contributed Docker images that should cover almost all the general use cases. You can also create and share your private Docker images with your co-workers and your organization. Alternatively, you can push them to a public repository to give them back to the community.
The concept of Docker is very similar to that of a Virtual Machine (VM). However, virtual machines are very demanding beasts and run considerably slower on less powerful hardware. A VM works by allowing a single piece of hardware to be shared between several VMs. This allows you to run one or more virtual operating systems inside your host operating system. But you might need to upgrade your processor if you’re seriously planning to run your software on a virtual machine.
Unlike a VM, Docker uses the host kernel instead of creating new kernel instances. The virtualization happens at the operating-system level rather than at the hardware level. Docker encapsulates everything that’s required for running the application on that host machine. This tremendously improves the performance of the application and reduces its size. What gives Docker a significant lead is the fact that it enables separation of concerns between the infrastructure, IT Ops, the developer and the application. This creates a positive environment for enhanced innovation and collaboration.
Docker is essentially a container platform largely aimed at businesses. It enables IT businesses to efficiently select and administer a complete application, without the fear of an infrastructure or architecture lock-in.
Enterprises use Docker for everything from setting up their development environment to deploying their application for production and testing. When you need to build more advanced systems, like a data warehouse comprising multiple modules, containers make a lot of sense. You can actually save several days of work that you’d otherwise have to spend configuring each machine.
However, the Docker platform isn’t just relevant to developers and enterprises alone. It’s actually a pretty useful tool for data scientists, analysts and even for schools and colleges. There are educational institutions and universities that are keen to transform digitally but are held back by their existing infrastructure.
Why should you use Docker if you’re a data scientist? Here are three reasons pointed out by Hamel Husain over at Towards Data Science:
If you are a professional data scientist, it is imperative that your work can be reproduced. Reproducibility facilitates review by your peers and ensures that the analysis, model and application you have built can run unhindered, which makes your deliverables both robust and time-tested.
As an example, let us assume that you have built a Python model. Running pip freeze and transferring the resulting file to a colleague often proves not to be enough. This is largely because pip freeze captures only Python-specific dependencies.
Imagine if you could find a way around manually moving the dependencies that live outside Python, such as the compiler, configuration files, drivers, etc. You can stop worrying about them by simply bundling everything within a Docker container. This not only spares others the task of recreating your environment, it also ensures that your work is much more accessible.
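To make the bundling idea concrete, here is a minimal sketch in Python that writes out the text of a Dockerfile pinning both system-level and Python-level dependencies. The base image tag (python:3.11-slim), the system packages (gcc, libgomp1) and the entry script name (model.py) are illustrative assumptions, not part of the original example, and `render_dockerfile` is a hypothetical helper rather than anything Docker ships.

```python
def render_dockerfile(base_image, system_packages, requirements_file="requirements.txt"):
    """Return Dockerfile text that captures system and Python dependencies together."""
    lines = [f"FROM {base_image}"]
    if system_packages:
        # Non-Python dependencies that pip freeze cannot record.
        pkgs = " ".join(sorted(system_packages))
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{pkgs} && rm -rf /var/lib/apt/lists/*"
        )
    lines += [
        "WORKDIR /app",
        # Python dependencies, e.g. the file produced by `pip freeze`.
        f"COPY {requirements_file} .",
        f"RUN pip install --no-cache-dir -r {requirements_file}",
        "COPY . .",
        # model.py is a hypothetical entry point for the analysis.
        'CMD ["python", "model.py"]',
    ]
    return "\n".join(lines) + "\n"


# Assumed, illustrative inputs: a Debian-based Python image and two system packages.
dockerfile_text = render_dockerfile("python:3.11-slim", ["gcc", "libgomp1"])
print(dockerfile_text)
```

Saving the printed text as a file named Dockerfile next to your code and running `docker build` on that folder would then give a colleague the same environment you used, assuming the package names above match what your model actually needs.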
If you are a data scientist who is specializing in Machine Learning, the ability to frequently and efficiently change your computing environment has a considerable effect on your productivity.
It is often the case that the work of data science starts with prototyping, research, and exploration. This doesn’t necessarily need special computing power to start. That said, there often comes a stage where additional compute resources can prove quite helpful in speeding up your workflow.
A number of data scientists find themselves limited to a local computing environment, largely because of the perceived difficulty of re-creating that local environment on a remote machine. Here, Docker makes the difference. It allows you to port your work environment, including libraries, files, etc., in just a few steps. Additionally, the ability to swiftly port your computing environment is a substantial advantage in Kaggle competitions.
Once you are comfortable with using Docker, you can deploy models as containers, which helps make your work readily accessible to other users. Additionally, various other applications that you may need as part of your data science workflow may already be available as Docker containers.
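As a rough illustration of what porting an environment can look like, the sketch below only assembles the Docker commands as text; it does not run them. The image name example-user/ds-env:1.0 is a made-up placeholder, and the sketch assumes you have a registry account you can push to.

```python
import shlex


def port_environment_commands(image_tag, build_context="."):
    """Return (local, remote) lists of docker commands for moving an environment."""
    local = [
        # Build an image from the Dockerfile in the current folder.
        ["docker", "build", "-t", image_tag, build_context],
        # Upload the image to a registry such as Docker Hub.
        ["docker", "push", image_tag],
    ]
    remote = [
        # On the more powerful machine: download the same image...
        ["docker", "pull", image_tag],
        # ...and start an interactive container from it.
        ["docker", "run", "--rm", "-it", image_tag],
    ]
    return local, remote


# "example-user/ds-env:1.0" is a hypothetical image name used for illustration.
local_cmds, remote_cmds = port_environment_commands("example-user/ds-env:1.0")
for cmd in local_cmds + remote_cmds:
    print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to hand them to `subprocess.run` later if you decide to automate the steps.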
By making your applications portable, cheaper and more secure, Docker helps to free up time as well as resources that can be spent on other important things. It can help transform IT without the need to re-tool, re-educate or re-code any of your existing applications, staff or policies.
Here are just a few of the use cases of how Docker can help different organizations:
Let’s revisit the Docker use case that we discussed in the introduction. The faculty at the university used Docker to create a container to run their application called PAGS. It allowed the students to have the same environment for their compute machines and test machines without the need of a VM.
Docker provides a common environment that can run in a container on any given Linux machine. This almost always guarantees similar results on a different machine using the same container. Without Docker, this would have required more infrastructure and resources that they didn’t have.
Another particularly interesting scenario is setting up lab instances. Depending on how you want a machine to be configured, you can take a snapshot of it to build a Docker image. You can then pull the snapshot into all other lab instances, saving you time and resources.
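One way the lab scenario could be sketched is shown below. The container name, image tag and archive file name are all hypothetical, and the sketch uses `docker save` and `docker load` to copy the image as a file, which is an alternative to pulling it from a registry. The code only builds the commands; nothing is executed.

```python
import shlex


def lab_snapshot_commands(container_name, image_tag, archive="lab-image.tar"):
    """Return the docker commands for snapshotting a configured container."""
    return {
        # Turn the configured container into a reusable image.
        "snapshot": ["docker", "commit", container_name, image_tag],
        # Write the image to a tar archive that can be copied to other machines.
        "export": ["docker", "save", "-o", archive, image_tag],
        # On each lab machine: load the image from the copied archive.
        "import": ["docker", "load", "-i", archive],
    }


# "configured-lab" and "lab-env:fall" are placeholder names for illustration.
lab_cmds = lab_snapshot_commands("configured-lab", "lab-env:fall")
for step, cmd in lab_cmds.items():
    print(f"{step}: {shlex.join(cmd)}")
```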
Consider a scenario where you need to explore a few data science libraries in Python or R, but;
This is where Docker can help.
Using Docker, you can get a Jupyter ‘Data Science’ stack installed and ready to execute in no time flat. It allows you to run a ‘plug and play’ version of a Jupyter data science stack from within a container.
To start, you would need to first install Docker Community Edition on your machine. Once done, restart your machine and get your Jupyter container set up. Prior to running a container, you would need to specify the base image for the container.
In most cases, the image that you’re looking for has already been built by a prior user and includes everything needed to use a fully loaded data science stack of your choice. All that needs to be done is to specify a pre-defined image that Docker can use to start a container.
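As a hedged example of specifying a pre-defined image, the sketch below assembles a `docker run` command for the community-maintained jupyter/datascience-notebook image. The choice of that image, the port 8888 and the host folder /tmp/notebooks are assumptions made for illustration; `jupyter_run_command` is a hypothetical helper, and the code prints the command rather than running it.

```python
import shlex


def jupyter_run_command(image="jupyter/datascience-notebook", host_port=8888, host_dir=None):
    """Return a docker run command that starts a Jupyter data science container."""
    # Map a port on your machine to the notebook server's port inside the container.
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:8888"]
    if host_dir is not None:
        # Mount a local folder so notebooks survive after the container stops.
        cmd += ["-v", f"{host_dir}:/home/jovyan/work"]
    cmd.append(image)
    return cmd


# /tmp/notebooks is a placeholder for wherever you keep your notebooks.
jupyter_cmd = jupyter_run_command(host_dir="/tmp/notebooks")
print(shlex.join(jupyter_cmd))
```

Running the printed command in a terminal should download the image on first use and then print a local URL for the notebook server.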
In this article, we have only touched the tip of the iceberg in terms of what can be done with Docker. We have focused only on specific areas of Docker that a data scientist may most often encounter. Below are some further sources that can help you during your journey of using and implementing Docker.
This was a guest post by: Limor Wainstein
Guest Post by Limor Wainstein
Why Teach Machine Learning?
Teaching machines to learn about the real world has been a goal in computer science since Alan Turing first showed how to mechanise logic. But it’s only recently that affordable hardware has evolved enough speed and capacity to make the idea commercially feasible in many domains – and more than feasible, seemingly inevitable.
Machine learning, alongside its siblings in data analytics and big data, is not only fashionable, it’s where the money and jobs are, thus attracting ambitious, commercially minded students. It’s also an increasingly important tool for all sciences, promoting interest among those aiming at careers in research and academia. Andrew Ng, former chief scientist at Baidu, the giant Chinese search engine company, and adjunct professor at Stanford, has called AI and machine learning ‘the new electricity’ for its potential to apply to and revolutionize all sectors of the economy and society.
That has become apparent in the job market. Towards the end of 2017, the Financial Times noted that three out of four of the top-paying jobs in software were for expertise in “the new profession” of machine learning. Ng says that the two biggest challenges for machine learning are acquiring the vast amounts of data required and finding skilled workers. Of the two, he said, the skill shortage is the bigger problem. Some job sectors, such as high frequency trading, are now entirely dependent on machine learning, and financial technology as a whole is moving in that direction rapidly. For example, J. P. Morgan recently issued a 280-page report on data analysis and machine learning in finance, focusing on the skills it needs to hire in large numbers – numbers that don’t exist.
Additional, highly-prominent machine learning domains exist alongside financial technology, for example, autonomous vehicles and medical diagnosis. Overtly AI-dominated companies like Google, Tesla and IBM are adept at garnering publicity. Such high-profile efforts mask the huge number of more mundane machine learning tasks that exist in every industry. Amazon, for example, uses machine learning across its entire retail system (from web interface to warehousing, packaging and delivery). Every company that operates with data at scale in retail has to follow those examples to compete.
Energy companies use machine learning to predict and manage supply and demand. Airlines manage pricing and route loading through machine learning. New medicines are developed using machine learning, and health services marshall their resources in response to short and long-term trends in demand, tracked and predicted by machine learning. Agriculture, ditto. In fact, it’s hard to find any area untouched by machine learning – even theology is in on the trend, with Murdoch University in Perth using machine learning to analyze ancient Thai palm-leaf texts on Buddhist doctrines. The new electricity, indeed.
So, what is machine learning?
Machine learning is a subset of artificial intelligence, but is mercifully free of the philosophical and near-religious arguments of some AI research. Instead, machine learning is simple to define and has well-defined tools, techniques and goals, and an ever-expanding field of practical applications.
Machine learning is the application of algorithms and techniques to data sets in order to find whether certain patterns exist. Whether this includes data acquisition and cleaning before analysis, or decision-making afterwards, depends on how tightly you want to draw the definition. All of these things are important in practical machine learning-based applications but are usually domain specific. However, the core of machine learning isn’t domain specific and can be applied very widely. This has led it to be taught as a self-contained field.
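As a very small illustration of "applying an algorithm to a data set to find a pattern", the sketch below fits a straight line to a handful of points using ordinary least squares. The data are invented for the example (hours studied versus test score), and a least-squares line is just one simple technique chosen for illustration, not the definition of machine learning.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: scores rise by 2 points for every extra hour studied.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # Use the learned pattern on an unseen input.
print(slope, intercept, predicted)
```

The "pattern" here is the pair of numbers the algorithm recovers from the data; real applications differ mainly in the size of the data and the flexibility of the model being fitted.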
Machine learning is inherently cross-disciplinary, and this is the greatest challenge in teaching the subject. There is a huge and unavoidable mathematical component, involving statistics, predicate calculus, linear algebra, and related concepts. This can come as a shock to computing students who have hitherto successfully minimized their exposure to such mathematical ideas. Computing skills are equally important, as machine learning involves the efficient manipulation of large and disparate data sets through complex transformations, often in highly parallel environments. With many practical machine learning applications bounded by hardware limitations, a deep understanding of system architecture and its practical consequences is also necessary. These facts will come as an equal shock to students in statistical machine learning courses who have avoided significant programming or hardware experience. A good machine learning practitioner needs to be fluent not only in programming but in systems architecture and data design. In addition, the practitioner needs to understand which of the many mathematical techniques to apply to a particular problem and how to apply them correctly.
In a real-life work environment, a data scientist or data engineer will typically find machine learning techniques useful. She may even require them to excel at her job. For example, she may need to create algorithmic patterns to search for data, use data patterns to make decisions and predictions, or use other techniques, such as smart sorting or fuzzy logic to prepare and manipulate data. These skills are at the heart of modern data science. It is clear, therefore, that a serious data science program should provide solid coverage of machine learning skills and techniques.
How should you teach it?
Picking the exact mix of tools, languages, and technologies for a course is to some extent a secondary issue, and can easily be based on what resources and skills are available to best match your choice of syllabus, project work and structure. Machine learning is a product of the Internet age and as such has a continuing evolution of best practice in its DNA. Checking out – and participating in – online machine learning communities such as Kaggle is one of the best ways to ensure alignment between machine learning teaching and actual student needs.
As with any subject, some students will have the skills, interest or previous experience to easily follow one or both of the two major prongs of machine learning. Most will not. But teachers of machine learning have an advantage over their mathematician or computer science colleagues: they can use each prong to illustrate and contextualise the other. Students who experience a curriculum where each is taught independently often have problems – and this has been unfortunately common. On discussion boards where experienced ML practitioners advise students, disheartening comments abound.
Calvin John, an autonomous vehicle researcher, warned on Quora of his experience with a “…horrible textbook… very little conceptual basis for the theorems… bunch of isolated problems which were crudely connected in a very disjointed way”. Modern machine learning teaching is developing rapidly. Like many new interdisciplinary subjects, machine learning may be taught by different faculties, where each faculty is led by its own approach without relating to the needs of the other disciplines involved.
Andy J. Koh, program chair of informatics at the University of Washington, also discusses the subject of teaching machine learning in his essay “We need to learn how to teach machine learning” (August 21, 2017). He says: “We still know little about what students need to know, how to teach it, and what knowledge teachers need to have to teach it successfully.” He also points out the wide range of student abilities and experience among those interested in machine learning – not only from previous undergraduate courses, but from MOOCs and burgeoning commercial self-teaching online products. He nevertheless advocates the adoption of good pedagogical tools – evolving analogies and practical examples that combine theory and practice. It’s important, he says, to understand which concepts will be particularly difficult, realizing what ideas, good and bad, students bring with them.
It’s in the practical examples where machine learning teachers have the greatest chance to equip students with a good, broad and deep understanding of the field. Machine learning’s expanding applicability offers many choices – machine vision, text mining, natural language processing are popular examples. The topic should be suited to the project work across a syllabus. A judicious introduction of new mathematical ideas alongside practical work examples, or practical problems that lead to theoretical insights can reinforce student appreciation of the whole.
Here are some additional resources that discuss teaching machine learning:
A worked ML curriculum bringing together best-of-breed MOOC courses.
Another site that has several courses, including MOOCs and other deep-learning topics is fast.ai.
(They also have an interesting brief post on adding data science to a college curriculum)
This was a guest post by: Limor Wainstein
Spring semester is off to an intense start for me! I’m again teaching an “Introduction to Data Science” course using a project-based methodology. We’ve got 6 awesome projects from for-profit, government, and internal clients. There’s also plenty going on in the data science world as faculty gear up for SIGCSE (which has several data science sessions) and other conferences over the summer.
I’m going to run a series of summaries for the National Academy of Sciences, Engineering and Medicine “Webinar Series on [Envisioning] Data Science Undergraduate Education”. If you weren’t able to watch them this fall (I wasn’t!), I’ll be watching them, summarizing the general content and pointing out useful highlights to take away. I’m hoping to get one out about every week (no promises though!).
You can find the summaries under the category “Webinar Summaries” and I’ll also tag them with NASEM Webinar. If there’s some pressing question you’d love to see a post on, let me know!
October has been an incredibly busy month! I’ve been traveling a lot, taking part in a wide variety of activities around data science education. It’s been a pretty big month and I’m here to give you a very quick run-down of what’s been happening!
The month kicked off with the Midwest Big Data Innovation Hub’s “All-hands on Deck” Meeting. I was invited there as part of a planning grant the hub had received to develop a spoke proposal for the hub to create a “Resource Center for Non-R1 Universities”. The meeting was very interesting, and we got to hear about some really neat work on using data science to advance agriculture, smart cities and more. The most relevant for data science education though was the final panel, “Education and Workforce Development.” Panelists included Jim Barkley, David Mongeau and Renata Rawlings-Goss. You can find their slides on the Midwest Big Data Hub (Barkley Slides, Mongeau Slides, Rawlings-Goss Slides). There is also a video recording of the panel here. The other important event that happened at the meeting was the afternoon grant-planning session. While I can’t share documents from that yet, I left very excited about the possibilities of establishing an important educational center for data science education that would help address the needs of non-R1 institutions. Some of the ideas that were shared included providing a clearing house for internships and project opportunities, connecting smaller institutions with interesting research projects and facilitating finding instructional expertise for most esoteric courses.
Mid-month (October 20th), the National Academy of Sciences held their 4th roundtable on Data Science Education, “Alternative Institutional and Educational Mechanisms”. You can find the webcast and agenda webpage here. I attended as a member of the public and was able to contribute a few ideas and questions. There were several great presentations, including some perspectives on education I hadn’t considered. Eric Kolaczyk gave a great presentation that described a very nicely integrated learning pathway for building data expertise at the master’s level. The MS in Statistical Practice is one of the few programs I know of (now) that actually redesigned several of its courses to make a more effective data science education and a cohesive learning structure. It was also very informative to hear about Metis’s data science “bootcamps”. It’s pretty clear Metis is doing some excellent education work in data science, but it is very different from traditional, academic education. Additional talks worth listening to were Andrew Bray explaining the origin and evolution of the American Statistical Association’s DataFest events, Ron Brachman describing Cornell Tech’s ‘entrepreneurial’ focused data science, and Catherine Cramer discussing the New York Hall of Science’s Network Science education initiatives (I plan to use some of this material with my students who do network science research!).
Additionally, the National Academy of Sciences has released an interim report on the “Envisioning the Data Science Discipline” studies going on. The report is definitely worth reading and provides some very interesting views and findings. There’s also a strong call for community input, so send your ideas in!
The last activity I participated in during October was the South Big Data Hub’s workshop “Keeping Data Science Broad: Workshop on Negotiating the Digital and Data Divide”. This workshop was an incredible pleasure to join! I think the best part was that with the entire room filled with people who have already been thinking about what data science and data science education might look like, we were able to frequently move beyond the “what is data science” discussion. It meant that we could really start discussing the roadblocks and opportunities inherent in data science. While I can’t share more of the actual outcomes/products from the workshop yet, we’ve got a really aggressive schedule to turn the output into a report (due Dec 6th!). I’m hopeful that something really powerful will come out. I know there was a lot of writing accomplished while there (I wrote 5-6 pages, and others did too) so look for another announcement of a report in early December.
Finally, while I haven’t been participating in or watching them much yet, I need to mention the ongoing webinar series being run by the National Academy of Sciences. You can find the entire webinar series here. October saw 4 webinars posted: “Communication Skills and Teamwork”, “Inter-Departmental Collaboration and Institutional Organization”, “Ethics”, and “Assessment and Evaluation for Data Science Programs”. I’m still hoping to watch these and provide summary posts… but that hasn’t happened yet. If any of my readers have been watching them and would like a guest-post with a summary, please get in touch!
This post is a summary and reflection on the webinar “Data Science Education in Traditional Contexts”. The webinar was hosted on Aug 28th by the South Big Data Innovation Hub as part of their Keeping Data Science Broad: Bridging the Data Divide series. You can watch the entire webinar here. The webinar consisted of 5 speakers and a discussion section. I’ve provided a short summary of each panelist’s presentation and the questions discussed at the end. The speakers, in order were:
The first speaker was Paul Anderson, Program Director for Data Science at the College of Charleston. His portion of the presentation runs from 0:01:50-0:13:45, and expands on three challenges he has experienced: (1) being an unknown entity, (2) recruiting, and (3) designing an effective capstone. His first point, being an unknown entity, impacts a broad range of activities related to implementing and running a data science program. It can cause a challenge when trying to convince administrators to support the program or new initiatives (such as external collaborations). It means that other disciplines may not be interested in developing joint course work (or approving your curricular changes). His second point discussed what he’s learned from several years of working on recruitment. His first observation here ties to his first overall point: if your colleagues don’t know what data science is, how are most high school students (or even your own students) to know? This has led him to have limited success with direct recruitment from high schools. Instead, he’s focused on retooling the program’s Introduction to Data Science course to be a microcosm of his entire program, both in terms of process and rigor. He’s also worked to make his program friendly to students switching majors or double majoring by having limited prerequisites. His final portion discussed the various forms of capstone experiences Charleston has experimented with, starting from an initial 1-to-1 student-faculty project pairing and moving to a more group-based approach with a general faculty mentorship model. If you are considering including a capstone experience (and you should!) it’s probably worth listening to this portion. However, not all colleges or universities will have sufficient students/faculty to move into their final model.
The second speaker was Mary Rudis, Associate Professor of Mathematics at Great Bay Community College. Her portion runs 0:14:25-0:19:19 and 0:20:46-0:29:08. A significant portion of her presentation outlines the large enrollment and performance gap of non-white and first generation college students. Dr. Rudis saw building both an Associate Degree in Analytics, and a Certificate in Data – Practical Data Science as the best way to combat these gaps. In researching the state of jobs/education she found that community college students were struggling to compete for the limited internships and entry-level job opportunities available in data science, compared to 4-yr college students (like local M.I.T. students). When hiring, most companies were looking for Master’s level education, or significant work experience in the field. To help her students succeed, she built an articulation program with UNH-Manchester so that upon final graduation, students originally enrolled at GBCC would be fully qualified for the current job market.
The third speaker was Karl Schmitt, Assistant Professor of Mathematics and Statistics, Affiliate Professor of Computing and Information Sciences, and the Director of Data Sciences at Valparaiso University. His presentation runs from 0:30:30 – 0:45:20. The core of the presentation expanded on Dr. Anderson’s first point about data science being an unknown entity. He sought to provide ideas about how to differentiate programs from other similar programs, both within a college/university and when compared with programs outside his own institution. Valparaiso has 6 data-focused programs:
His talk described how the programs can be differentiated in terms of the data user/professional that the program trains, and also in terms of course content and focus. He also talked about how Valpo is differentiating its program from other schools with a focus on Data Science for Social Good. This has been achieved in part by seeking industry partners from the government and non-profit sectors, rather than traditional industrial partners.
The fourth speaker was Pei Xu, Assistant Professor of Business Analytics, Auburn University. Her portion of the presentation runs from 0:46:05 – 0:57:55 and describes Auburn’s undergraduate Business Analytics Degree. Auburn’s curriculum is designed around the data science process of Problem Formulation -> Data Prep -> Modeling -> Analysis -> Presentation. Each of the core classes covers 1-2 stages of this process, with the specialized degree courses typically beginning in a student’s sophomore year. Their program also actively engages many businesses to visit and provide information sessions. Dr. Xu detailed 4 challenges she’s faced related to their program. First, she has found it hard to recruit qualified faculty for teaching courses, which she’s overcome by progressively hiring over the last few years. She has also found many students to be turned away by the highly quantitative and computational nature of the program. This has been addressed by building a stronger emphasis on project-based learning and more interpretation than innovative process development. Third, she discussed how many of the core courses in their program overlap significantly. For example, many courses in different areas all need to discuss data cleaning/preparation. Auburn’s faculty has spent significant curriculum development time discussing and planning exactly what content is duplicated and where. Finally, deciding between the various analytics tools for both the general curriculum and specific classes has proved challenging (you can see an extended discussion by me of Python/R and others here).
The fifth speaker was Herman “Gene” Ray, Associate Professor of Statistics and Director for the Center for Statistics and Analytics Research, Kennesaw State University. His presentation is from 0:58:36 – 1:07:35 and focuses on KSU’s Applied Statistics Minor. KSU’s program strongly focuses on domain areas, with most courses including a high level of applications along with experiential learning opportunities. Additionally, almost all their courses use SAS in addition to introducing their students to a full range of data science software/tools. The first experiential learning model KSU uses is an integration of corporate data-sets and guided tasks from business. The second model is a ‘sponsored research class’ with teams of undergraduates led by a graduate student on corporation provided problems or data. Gene provided extended examples about an epidemiology company and about Southron Power Company. The key benefits KSU has seen are that students receive real world exposure and practice interacting with companies, potentially even receiving awards, internships, and jobs. The largest challenge to this experiential learning model is that it requires a significant amount of time from both faculty and students: first to develop the relationships with companies, then to manage corporate expectations, and finally to actually execute the projects.
The additional discussion begins at 1:08:32. Rather than summarize all the responses (which were fairly short), I’m simply going to list the questions, in order as they were answered, and encourage interested readers to listen to that portion of the webinar or stay tuned for follow-up posts here.