Guest Post: Docker And the Data Scientist

Educational institutions and educators often struggle to create a common platform where instructors and students can view and share code. One university in Turkey had to address a common complaint from students, namely that their own computing environments differed from the testing machine.

A professor at Bilkent University in Ankara, Turkey, decided to use a technology called Docker to power a web platform that can create lab instances and grade assignments.

So, what is Docker? We will answer that in a while. Before Docker was available, the next best solution was to use virtual machines. However, these machines needed to be extremely powerful and consequently required expensive infrastructure, which most institutes couldn’t budget for. Students were instead forced to log on to a shared server, where their programs inadvertently interfered with one another or, worse, crashed the whole infrastructure. Needless to say, it was impractical to assign a virtual machine to each student.

The university used Docker to build a web-based application called Programming Assignment Grading System (PAGS). A similar technique can be adopted by other universities for creating lab instances and grading assignments for data science classes.

Although we haven’t formally defined what Docker is, the above example demonstrates what Docker can do. The rest of the article focuses on Docker and how it can transform education and data science.

The article is divided into four sections. First, we’ll start with an introduction to Docker and Docker containers. Then, we’ll answer the question, “Who is Docker for?” The third part will give you an overview of how Docker is a useful tool for data scientists. In the final section, we’ll dive into a couple of interesting use cases for Docker in data science. Let’s get started!

What is Docker?

 

Docker is the leading software containerization platform and is actively developed by Docker, Inc. It is an open source project designed to help you create, run and deploy applications inside containers.

So, what is a container? A container, by definition, comprises all the dependencies, libraries and other related files required to run an application. Once you’ve created a container for your application, you can run it on any Linux machine regardless of how the underlying machine is configured. If the machine at one end runs Ubuntu and the one at the other end runs Red Hat, fret not! Docker is meant for precisely these situations.

You can create a snapshot of a container, and this snapshot is generally known as an image. Conversely, you can call a container an instance of a Docker image. Docker images are inert and immutable files. When someone asked about the difference between an image and a container on StackOverflow, a web developer named Julian came up with a quick analogy. “The image is the recipe, the container is the cake”, he said, and that just sums it up.

You can store Docker images in a cloud registry like Docker Hub. There are numerous user-contributed Docker images that should cover almost all general use cases. You can also create private Docker images and share them with your co-workers and your organization. Alternatively, you can push them to a public repository to give back to the community.

The concept of Docker is very similar to that of a Virtual Machine (VM). However, virtual machines are very demanding beasts and run considerably slower on less powerful hardware. A VM works by allowing a single piece of hardware to be shared between several VMs, which lets you run one or more virtual operating systems inside your host operating system. But you might need to upgrade your processor if you’re seriously planning to run your software on a virtual machine.

Unlike a VM, Docker uses the host kernel instead of creating new kernel instances. The virtualization happens at the operating-system level rather than at the hardware level. Docker encapsulates everything that’s required for running the application on that host machine. This tremendously improves the performance of the application and reduces its size. What gives Docker a significant lead is that it enables separation of concerns between the infrastructure, IT Ops, the developer and the application. This creates a positive environment for enhanced innovation and collaboration.

Who is Docker for?

Docker is essentially a container platform aimed largely at businesses. It enables IT businesses to efficiently select and administer a complete application without the fear of infrastructure or architecture lock-in.

Enterprises use Docker for everything from setting up their development environment to deploying their application for testing and production. When you need to build more advanced systems, like a data warehouse comprising multiple modules, containers make a lot of sense. You can save several days of work that you’d otherwise have to spend configuring each machine.

However, the Docker platform isn’t relevant to developers and enterprises alone. It’s actually a pretty useful tool for data scientists, analysts and even schools and colleges. Many educational institutions and universities are keen to transform digitally but are held back by their existing infrastructure.

Docker and Data Science

Why should you use Docker if you’re a data scientist? Here are three reasons pointed out by Hamel Husain over at Towards Data Science:

Reproducibility

If you are a professional data scientist, it is imperative that your work can be reproduced. Reproducibility facilitates review by your peers and ensures that the analysis, model and application you have built can run unhindered, which makes your deliverables both robust and time-tested.

As an example, let us assume that you have built a Python model. It is often not enough to run pip freeze and transfer the resulting file to a colleague, largely because that file captures only Python-specific dependencies.

Imagine if you could avoid manually moving the non-Python dependencies, like the compiler, config files, drivers, etc. You can be free of this work by simply bundling everything within a Docker container. This not only spares others the task of recreating your environment, it also makes your work much more accessible.
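
As a rough sketch of what bundling everything might look like, the Python snippet below renders the text of a Dockerfile that pins a base image, installs system-level packages, and then installs the Python packages recorded by pip freeze. The function name render_dockerfile, the python:3.11-slim base image, the build-essential package and the train.py entry point are all illustrative assumptions rather than details from the example above.

```python
def render_dockerfile(base_image="python:3.11-slim",
                      system_packages=("build-essential",),
                      requirements="requirements.txt",
                      entrypoint="train.py"):
    """Return Dockerfile text covering both system and Python dependencies.

    Every default value here is a hypothetical placeholder.
    """
    lines = [f"FROM {base_image}"]
    if system_packages:
        # Non-Python dependencies (compilers, drivers, ...) that pip freeze cannot record.
        pkgs = " ".join(system_packages)
        lines.append(
            f"RUN apt-get update && apt-get install -y {pkgs} "
            "&& rm -rf /var/lib/apt/lists/*"
        )
    lines += [
        "WORKDIR /app",
        f"COPY {requirements} .",
        f"RUN pip install --no-cache-dir -r {requirements}",
        "COPY . .",
        f'CMD ["python", "{entrypoint}"]',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(), end="")
```

Saving the printed text as a file named Dockerfile next to requirements.txt and building it with docker build would give a colleague an image that already contains the whole environment.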

Ability to Port Your Compute Environment

If you are a data scientist specializing in Machine Learning, the ability to frequently and efficiently change your computing environment has a considerable effect on your productivity.

Data science work often starts with prototyping, research, and exploration, which doesn’t necessarily need special computing power. That said, there often comes a stage where additional compute resources can considerably speed up your workflow.

Many data scientists find themselves limited to a local computing environment, largely because of the perceived difficulty of re-creating that local environment on a remote machine. Here, Docker makes the difference. It allows you to port your work environment, including libraries, files, etc., in just a few steps. Additionally, the ability to swiftly port your computing environment is a substantial advantage in Kaggle competitions.

Enhance your Engineering Skills

Once you are comfortable with Docker, you can deploy models as containers, which makes your work readily accessible to other users. Additionally, various other applications that you may need as part of your data science workflow may already exist as Docker containers.

Use Cases for Docker in Data Science

By making your applications portable, cheaper and more secure, Docker helps to free up time as well as resources that can be spent on other important things. It can help transform IT without the need to re-tool, re-educate or re-code any of your existing applications, staff or policies.

 

Here are just a few of the use cases of how Docker can help different organizations:

Docker for Education

Let’s revisit the Docker use case that we discussed in the introduction. The faculty at the university used Docker to create a container to run their application, called PAGS. It gave students the same environment on their own machines and on the test machines without the need for a VM.

Docker provides a common environment that runs in a container on any given Linux machine. This almost always guarantees similar results on a different machine using the same container. Without Docker, this would have required more infrastructure and resources than the university had.

Another particularly interesting scenario is setting up lab instances. Depending on how you want a machine to be configured, you can take a snapshot of it to build a Docker image. You can then pull that image onto all the other lab instances, saving you time and resources.

Docker for Data Science Environment Set Up

Consider a scenario where you need to explore a few data science libraries in Python or R, but without:

  1.  spending a lot of time installing either language on your machine,
  2.  browsing and figuring out which dependencies are essential, and
  3.  finally getting down to identifying what works best for your version of Windows/OSX/Linux.

This is where Docker can help.

Using Docker, you can get a Jupyter ‘Data Science’ stack installed and ready to execute in no time flat. It allows you to run a ‘plug and play’ version of a Jupyter data science stack from within a container.

To start, you first need to install Docker Community Edition on your machine. Once done, restart your machine and get your Jupyter container set up. Before running a container, you need to specify the base image for the container.

In most cases, the image that you’re looking for has already been built by a prior user and includes everything needed to use a fully loaded data science stack of your choice. All you need to do is specify a pre-defined image that Docker can use to start a container.
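
To make the last step concrete, here is a minimal Python sketch that assembles a docker run command for a pre-built Jupyter image and, by default, only prints it. It assumes the jupyter/datascience-notebook image from the community-maintained Jupyter Docker Stacks, which serves notebooks on port 8888 and uses /home/jovyan as its home directory; the helper names jupyter_run_command and start_jupyter are made up for this illustration.

```python
import shlex
import subprocess


def jupyter_run_command(image="jupyter/datascience-notebook",
                        host_port=8888,
                        work_dir=None):
    """Build the argument list for starting a Jupyter container with docker run."""
    # --rm removes the container on exit; -p maps a host port to the notebook port.
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:8888"]
    if work_dir is not None:
        # Mount a local folder so notebooks survive after the container exits.
        cmd += ["-v", f"{work_dir}:/home/jovyan/work"]
    cmd.append(image)
    return cmd


def start_jupyter(dry_run=True, **kwargs):
    """Print the command and, unless dry_run is set, execute it (requires Docker)."""
    command = jupyter_run_command(**kwargs)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    start_jupyter()  # dry run: shows the command without starting a container
```

Calling start_jupyter(dry_run=False) on a machine with Docker installed would pull the image if needed and start the notebook server.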

Conclusion

In this article, we have only touched the tip of the iceberg in terms of what can be done with Docker. We have focused only on the specific areas of Docker that a data scientist is most likely to encounter. Below are some further resources that can help you on your journey of using and implementing Docker.

 

  1. Basic Docker Terminologies
  2. Useful Docker Commands
  3. Dockerfile Reference
  4. Pushing and Pulling to and from Docker Hub

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Python or R?

This week I want to discuss a potentially divisive issue: should a program (or course, etc.) be taught in Python or R? I think a reasonable case could be made for teaching either language. Pragmatically, if you want your program’s graduates to be truly competitive for the largest variety of jobs in the current market, students need to at least be familiar with both (and possibly SAS or SPSS). There is already a lot of information and many blog posts addressing this question, and I’ve provided links to a few of my favorites at the end of this post. Rather than re-hashing those posts’ pros and cons, I’m going to focus on aspects of each language related to teaching (and learning).

Before considering each language, I want to frame the discussion by (re)stating a program-level student learning objective (SLO). In my first post about SLOs, objective 2 states: “Students will be able to implement solutions to mathematical and analytical questions in language(s) and tools appropriate for computer-based solutions, and do so with awareness of performance and design considerations”. Based on this objective, I’ll state three specific objectives for selecting a programming language:

  • A language which can implement (complete) solutions to data science questions
  • A language which allows good programming practices in terms of design
  • A language which allows implementation of solutions that can be improved with performance in mind

Why Choose R?

As a programming language that originated in academia, particularly within the statistics community, R seems like a very natural choice for teaching data science. Much of the syntax, function naming and even thinking about how to construct a data pipeline/workflow comes naturally from a statistical analysis perspective. This makes it very easy to convert knowledge of statistical processes into code and analysis within R. The easy conversion between notation and code becomes even more valuable when trying to work with advanced/obscure statistical techniques. Given R’s origins in academic statistics, there is a much broader range of packages for uncommon techniques than in most other languages. This makes R a strong candidate for the first requirement when working in statistical domains.

Other software/packages that make R appealing to teach with are RStudio, Jupyter Notebooks and R Markdown. RStudio provides a clean, user-friendly interface for R that makes interacting with plots and data easy. It even aids the transition from spreadsheet software (like Excel) by providing a similar, GUI-driven interaction with (simple) data-frames. With Jupyter Notebooks’ recent addition of an R kernel option, it is also easy to transition from mathematics focused software like Maple and Mathematica. See this DataCamp blog-post for more information on using Jupyter Notebooks (or similar software) with R. Notebooks also facilitate teaching good practices such as code-blocks and code annotation. Finally, R Markdown provides a (reasonably) simple way to convert executable code directly into final reports/outputs. That functionality further supports the teaching of (some) good programming and design practices.

Why Choose Python?

Python was originally developed to be an easy-to-learn programming language (see Wikipedia’s history of Python). This means the whole language’s syntax and styling is easier to learn from scratch than most other languages (notably R). The basic Python data structure of lists naturally works like mathematical sets, while dictionaries closely match logical constructions for unstructured data. Together with the use of indentation to indicate control flow, this makes it natural, in any introduction to the language, to teach how to make Python code (human) readable. These traits speak directly to teaching/achieving our second language-related objective, “allows good programming practices/design”.

For teaching, Python starts with many of the same advantages as R. There is a long-standing Python kernel for Jupyter Notebooks and several markdown packages available for turning code directly into HTML-styled reports. What makes Python noticeably different from R is that it is a general-purpose programming language. In terms of teaching, this opens up some interesting options related to the first and third goals above. In terms of developing solutions to data science problems, Python easily allows a very broad range of both input and output. Specifically, it has high-quality packages designed to deal with streaming data and better techniques for unstructured or big data. Also, because Python is regularly used to develop full programs and deployed software solutions, the methods available to study and improve performance are already well developed.

But What are People Actually Using?

There are way, way more Python users than R users (and probably will be for the foreseeable future) simply because Python is a general-purpose programming language. However, we are more concerned with users within the data science communities. That focus, however, doesn’t make the answer to our question any clearer. 2016 data from O’Reilly’s Data Science Salary Survey places R (57%) slightly ahead of Python (54%), which matches KDnuggets’ rankings of R being slightly ahead in 2016. However, the 2017 KDnuggets survey results now place Python slightly ahead. Burtch Works’ 2017 survey data still has R significantly ahead, and in fact still gives a very large market share to SAS, which didn’t even make KDnuggets’ list. But Burtch also notes that Python has been gaining share each year. Remember when considering these results, however, that these are all self-reported and self-selecting surveys! It is hard to tell if these changes are actual changes in use, or just a changing definition/reach of who’s responding to the surveys. For example, when Burtch Works breaks down their results, at least one sub-group rarely used SAS and, similar to O’Reilly and KDnuggets, had Python ahead. More and more people are identifying with doing data science each year, but many of them have been doing similar things for a long time.

Some Undisguised Opinions

There is obviously value in either programming language, but from my perspective there is a really strong winner in Python. From a curriculum/planning perspective, since Python is a general-purpose language it is entirely feasible to have standard, introductory programming courses from a computer science department taught in Python. This reduces (potentially wasteful) duplication of similar courses (does every discipline really need its own intro programming?). It also lets computer scientists take advantage of years of educational research into how to better teach programming! Not to mention that Python was intentionally designed to be easier to learn programming in.

Add to this that data science students don’t really experience any major disadvantages from having Python as the primary curricular language but do gain several benefits. Key benefits include longer-term skill viability and increased versatility in job options, etc. This versatility even plays out when considering including advanced CS courses in a data science curriculum. Most data science curriculums are already going to struggle to incorporate all the necessary foundational skills in a reasonable length undergraduate (or graduate) program. So why add programming courses beyond those already needed to meet typical CS prerequisites?

Finally, looking at the trends in language/tool use in data science just adds more validation to this idea. As companies move to working with unstructured or streaming data, Python becomes even more natural. All the surveys report increasing use of Python, with no sign of that increase slowing down. It is important for academic programs not just to react to, but even to anticipate, trends and needs in the job market and industry.

Additional Resources

While I didn’t go into lots of detail on the pros and cons of R or Python (and didn’t even talk about SAS/SPSS), I have collected a few links that you might find valuable to read in making your own decision.

R vs. Python for Data Science: Summary of Modern Advances (EliteDataScience, Dec 2016). Does a nice job of highlighting the new things that make the languages pretty equal.

Python & R vs. SPSS & SAS (The Analytics Lab, 2017). This is nice because it also puts into perspective how SPSS and SAS play into the landscape, and it provides additional historical perspective.

Python vs. R: The battle for data scientist mind share (InfoWorld, 2017). A fairly balanced perspective on the value of both.

R vs. Python for Data Science (KDnuggets, 2015). A bit dated, but still provides some good comparisons.

Version Control and Reproducible Research/Data Science

A current hot topic in research, especially within statistically driven research, is “reproducible research”. In academia, the process of peer-reviewed publication is meant to assure that any findings are reproducible by other scientists. But those of us in the trenches, and especially on the data side of things, know that reproducibility is a theoretical outcome and far more rarely something tested. While academia is rightly under fire for this lack of actual, reproducible research (see this great example from epidemiology), this is even more of a problem in industry. If the analysis can’t be reproduced, then it can’t be applied to a new client base.

So why bring this up on an educational blog? I think it’s important to embed the idea of reproducible work deep inside our teaching and assignment practices. While the idea of repeating a specific analysis once the data has changed isn’t really novel, it becomes far more relevant when we begin talking about filtering or cleaning the input data. Just think about searching for outliers in a data set. First, we might plot a histogram of values/categories, then we go back, remove the data points that we want ignored, and replot the histogram. BAM! Then we have a perfect opportunity to teach the value of reproducible work! We used exactly the same visualization technique (a histogram) on practically the same data (with outliers and without outliers).

Where does the reproduction of the work fit in though? Python and R both have histogram functions, so this is definitely a toy example (but the whole idea of functions can serve to emphasize the idea of reproducible/reusable work). Instead, I think this is where the instructor has an opportunity. This idea of cleaning outliers could easily be demonstrated in the command line window of R or an interactive Python shell. And then you’ve lost your teaching moment. Instead, if this is embedded in an R script or Python/R Notebook you can reuse the code, retrace whatever removal process you used, etc. In the courses I’ve taught, I’ve seen student after student complete these sorts of tasks in the command-line window, especially when told to do so as part of an active, in-class demo. But they never move the code into a script so when they are left to their own devices they flounder and have to go look for help.
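
To illustrate the difference a script makes, here is a small sketch of the outlier exercise written as reusable functions instead of one-off shell commands. It uses only the Python standard library and a text histogram so it runs anywhere. The data is made up, and the removal rule (Tukey's 1.5 × IQR fence) is just one assumed choice, since the exercise above does not prescribe a particular removal process.

```python
from collections import Counter
from statistics import quantiles


def histogram(values, bin_width=10):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def remove_outliers(values, k=1.5):
    """Drop values more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def print_histogram(title, counts):
    """Render bin counts as a simple text histogram."""
    print(title)
    for start, count in counts.items():
        print(f"{start:>5} | {'#' * count}")


if __name__ == "__main__":
    scores = [52, 55, 58, 61, 63, 64, 67, 70, 72, 75, 78, 190]  # made-up data
    # The same histogram function is applied before and after cleaning.
    print_histogram("Raw data", histogram(scores))
    print_histogram("Outliers removed", histogram(remove_outliers(scores)))
```

Because the removal rule lives in remove_outliers rather than in someone's shell history, a student can rerun the script on new data, change k, or hand the file to a teammate and get the same steps every time.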

I titled this post “Version Control and Reproducible Research” … you might be wondering what version control has to do with this topic. The ideas described above are great if you are the sole purveyor of your code/project. But if you have your students working in teams, or are trying to collaborate yourself, this might not be exactly ideal. But it’s getting pretty close! Here’s the last nugget you need to make this work… version control. Or in this case, I’m specifically talking about using GitHub. The short version of what could be an entire separate post (I’ll probably try to do one eventually) is that Git (together with the cloud repository host GitHub) is the tool that software developers designed to facilitate collaborative software development without wanting to kill each other over broken code. It stores versions of code (or really any file) that can be jointly contributed to without breaking each other’s work. For now, I’ll point you to a few resources on this.

First, a bit more from an industry blog on workflows to promote reproduction using GitHub: Stripe’s Notebooks and Github Post.

Second, for using Git/GitHub with R: Jenny Bryan, Prof. University of British Columbia. Note that this is a really long, complete webpage/workshop resource!

Third, a template/package for Python to help structure your reproducible GitHub work: Cookiecutter Data Science (heck, this could be an entire lesson in itself on how to manage a project; more on that later).

Fourth, a template/package for R to help structure your reproducible GitHub/R work: ProjectTemplate.