From Other

Guest Post: Open Source and Data Science

Open source solutions are improving how students learn, and how instructors teach. Matt Mullenweg, the founder of WordPress, shared his opinion on open source with TechCrunch a few years back: “When I first got into technology I didn’t really understand what open source was. Once I started writing software, I realized how important this would be.”

Open source software is almost everywhere, and tons of modern proprietary applications are built on top of it. Most students working through the intricacies of data science will already be fairly familiar with open source software, because many popular data science tools are open source.

There’s a popular perception that open-source tools are not as good as their proprietary peers. However, as Linux shows, just because the underlying code is open and free does not mean it is of poorer quality. Truth be told, open source tooling is probably best in class when it comes to development and data science.

In this post, we’ll have a straightforward look at how open source contributes to data science. We’ll also cover open-source tools and repositories that are related to data science.

Why is Open Source Good for Data Science?

Why exactly do open source and data science go hand in glove?

Open Source spurs innovation

Perhaps the most significant benefit of open source tools is that they give developers the freedom to modify and customize them. This allows for quick improvements and experimentation, which in turn encourages extensive use of a package and its features.

As with any other development that captures the interest of the technology sector and the public, the critical piece is the ability to bring the end product to market as quickly and as thoroughly as possible. Open source tools prove to be a massive benefit to this end.

Faster problem solving

The open-source ecosystem helps you solve your data science problems faster. For instance, you can use tools like Jupyter for rapid prototyping and git for version control. There are other tools that you can add to your toolbelt like Docker to minimize dependency issues and make quick deployments.

Continuous contributions

Google is a prime example of a company that contributes to open source, and TensorFlow is the best example of its contributions. Google uses TensorFlow for dataflow programming across a range of tasks. It is a symbolic math library that is also used for machine learning applications such as neural networks. By open sourcing tools like TensorFlow, Google gets the benefit of contributors outside the core team. As TensorFlow has grown popular, many new research ideas are implemented in it first, which makes it more efficient and robust.

Google explains this topic in-depth in their open-source documentation. 

While open source work may have benevolent results, it is not an act of charity. Releasing work as open source and the corresponding contribution process eventually result in a higher return on the initial investment than the alternative closed source process.

One of the significant benefits of open source, especially when it comes to data science breakthroughs, is the sheer size of the various open source communities that developers can tap into for problem-solving and debugging.

These forums have hundreds of answers to frequently asked questions, and given that open-source data science tools are poised to expand going forward, these communities and repositories of information are only going to grow.

Contribute and give back to the community

The best way to learn data science is to actively participate in the data science communities that you love. With open source, that’s entirely possible: you can start by simply following data science projects and repositories on GitHub. Take part in discussions, and when you feel you’re ready, contribute by volunteering to review code and submitting patches for open-source security bugs.

This will help you get involved, gain exposure and learn details that might otherwise be impossible to learn from your degree curriculum.

Open Source Data Science Tools You Should Know About

KDnuggets recently published the results of a Data Science and Machine Learning poll it conducted earlier this year. The graph below shows the tools with the strongest associations, along with each tool’s rank based on popularity.

[Figure: KDnuggets poll graph showing associations between the top data science tools]

The weight of each bar indicates the strength of association between two tools, and the number gives that association as a percentage. As you can see in the figure, TensorFlow and Keras are the most popular combination, with a weight of 149%. Anaconda and scikit-learn are another popular combination.

The number to the left indicates each tool’s rank based on popularity. The color shows the value of the lift: green for more Python, red for more R.

We’ll limit our discussion to some of the open-source data science and machine learning tools. This list is not based on popularity, but on usability from a learner’s perspective. Let’s get started.

TensorFlow

TensorFlow is an open source Python library for numerical computation, built with the goal of making machine learning more accessible and more efficient. Google’s TensorFlow eases the process of acquiring data, training models, serving predictions, and refining results.

Developed by the Google Brain team, TensorFlow is a library for large-scale machine learning and deep learning. It brings together many different machine learning and deep learning algorithms under a common abstraction. TensorFlow uses Python as a convenient front-end API for building applications within the framework, while executing those applications in high-performance C++.

TensorFlow can train and execute deep neural networks for image recognition, handwritten digit classification, recurrent neural networks, word embeddings, sequence models, natural language processing, and partial differential equation (PDE) based simulations. It also supports scalable production prediction using the same models used in training.
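As a rough sketch of what that looks like in practice, here is a minimal handwritten-digit classifier written against TensorFlow’s Keras API. It is not taken from the post, and the layer sizes and training settings are arbitrary choices for illustration:

    import tensorflow as tf

    # Load the MNIST handwritten-digit dataset bundled with TensorFlow
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    # A small feed-forward network for digit classification
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test)

A few epochs on a laptop CPU are enough to reach a respectable accuracy on this toy task, which is exactly the low barrier to entry the library is praised for.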

Keras

Keras is a minimalist Python library for deep learning that runs on top of TensorFlow or Theano. Keras was developed to help implement deep learning models quickly and efficiently for research and development.

Keras runs on Python 2.7 and 3.5 and can execute on CPUs and GPUs, depending on the underlying framework.

Keras was developed by an engineer at Google and has four guiding principles:

  1. Modularity: a model is understood as a standalone sequence or graph, and the distinct components of a deep learning model can be combined arbitrarily.
  2. Minimalism: the Keras library provides just enough to help users achieve an outcome.
  3. Extensibility: new components are easy to add and implement within the framework, intentionally giving developers and researchers the freedom to experiment with new ideas.
  4. Python: there is no requirement for additional files with custom file formats. When working in Keras, everything is native Python.

Keras’ deep learning workflow can be summarized in four steps (sketched in code after the list):

  1. Define your model: create a sequence and add layers as needed
  2. Compile your model: specify loss functions and optimizers
  3. Fit your model: train the model on your existing data
  4. Make predictions: use the trained model to generate predictions on new data
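To illustrate those four steps, here is a minimal, self-contained sketch using the classic Keras API on toy data; the layer sizes, epoch count, and data are invented purely for illustration:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Toy data: 100 samples with 8 features and a binary label (illustrative only)
    X = np.random.rand(100, 8)
    y = np.random.randint(2, size=100)

    # 1. Define your model: a sequence of fully connected layers
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))

    # 2. Compile your model: choose a loss function and an optimizer
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    # 3. Fit your model on the existing data
    model.fit(X, y, epochs=10, batch_size=10, verbose=0)

    # 4. Make predictions with the trained model
    predictions = model.predict(X[:5])
    print(predictions)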

H2O

H2O is a scalable, fast, and distributed open source machine learning framework that provides many algorithms. H2O allows users to fit thousands of potential models while searching for patterns in data. It supports deep learning, random forests, gradient boosting, generalized linear modeling, and more.

H2O is a business-focused AI tool that lets users derive insights from data through faster and better predictive modeling. The core code of H2O is written in Java.

H2O handles vast amounts of data, giving enterprise users fast, accurate predictions and helping them extract decision-making information from large datasets.
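For a sense of how this looks from Python, here is a hedged sketch using H2O’s Python API; the file path and the “response” column name are placeholders, and the gradient boosting settings are arbitrary:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start (or connect to) a local H2O cluster
    h2o.init()

    # Load a CSV into an H2OFrame (path is a placeholder)
    data = h2o.import_file("path/to/your_data.csv")
    train, test = data.split_frame(ratios=[0.8])

    # Fit a gradient boosting model; "response" stands in for your target column
    features = [c for c in data.columns if c != "response"]
    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(x=features, y="response", training_frame=train)

    # Evaluate on the held-out split
    print(model.model_performance(test))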

Apache Mahout

Apache Mahout is an open source framework, originally built on the Hadoop platform, for building scalable machine learning applications; it is roughly comparable to Spark’s MLlib.

The three main features of Mahout are:

  1. A scalable and straightforward programming framework and environment
  2. A wide range of pre-packaged algorithms for Apache Spark + Scala, Apache Flink and H2O
  3. A vector math experimentation environment called Samsara, which has an R-like syntax and is dedicated to matrix calculations.

Anaconda

Anaconda is a fully open source data science distribution that boasts a community of more than 6 million users. It is simple to download and install, and packages are available for macOS, Linux, and Windows.

Anaconda comes with 1,000+ data science packages in addition to the Conda package and virtual environment manager, eliminating the need to install each library independently.

The R and Python packages in the Anaconda Repository are curated and compiled in a secure environment, so users get optimized binaries that work efficiently on their system.

scikit-learn

scikit-learn is a machine learning toolkit for Python. It is efficient and straightforward to use for data mining and data analysis tasks. The package is reusable in many different contexts and accessible to almost all users.

scikit-learn includes a number of different classification, clustering, and regression algorithms (a short example follows the list), including:

  1. Support vector machines
  2. Random forests
  3. k-means
  4. Gradient boosting
  5. DBSCAN
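As a quick illustration of the library in use (not from the original post), here is a minimal example that trains one of the algorithms listed above, a random forest, on the bundled iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the bundled iris dataset and hold out a test split
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    # Fit a random forest classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    # Evaluate on the held-out data
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))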

Top 5 Open Source Repositories to Get Started with Data Science

For any data science student, GitHub is a great place to find useful resources to learn data science better.

Here are some of the top resources and repositories on GitHub. There are lots of good libraries out there that we haven’t covered in this post; if you’re familiar with data science repositories you’ve found useful, please share them in the comments.

Awesome Data Science Repo

The Awesome Data Science repository on GitHub is a go-to resource guide when it comes to data science. It has been developed over the years through many contributions, with linked resources ranging from getting-started guides to infographics to suggestions of experts you can follow on various social networks.

Here’s what you’ll find in that repo.

Machine Learning and Deep Learning Cheat Sheet

The cheatsheets-ai repository gathers common techniques and tools in the form of cheatsheets, ranging from simple tools like pandas to more complex procedures like deep learning.

Some of the cheatsheets included are: pandas, matplotlib, NumPy, dplyr, scikit-learn, tidyr, ggplot, neural networks, and PySpark.

Oxford Deep Natural Language Processing Course Lectures

With the introduction of Deep Learning, NLP has seen significant progress, thanks to the capabilities of Deep Learning Architectures like Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN).

This repository is based on the Oxford NLP lectures and takes the study of Natural Language Processing to the next level. The lectures cover terminology and advanced techniques such as using Recurrent Neural Networks for language modeling, text-to-speech, speech recognition, and more.

PyTorch

PyTorch is an open source machine learning library for Python, based on Torch, used for applications such as natural language processing. PyTorch has garnered a fair amount of attention from the deep learning community thanks to its Pythonic style of coding, faster prototyping, and dynamic computation graphs.

The PyTorch tutorial repository includes code for deep learning tasks, from the basics of creating a neural network in PyTorch to coding Generative Adversarial Networks (GANs), RNNs, and neural style transfer. Most models are implemented in 30 lines of code or less.
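In the same spirit as those tutorials, here is a minimal sketch of a PyTorch training loop on toy data; the network shape, learning rate, and data are invented for illustration:

    import torch
    import torch.nn as nn

    # Toy regression data: 64 samples with 10 features (illustrative only)
    X = torch.randn(64, 10)
    y = torch.randn(64, 1)

    # A small feed-forward network defined with the Sequential container
    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.ReLU(),
        nn.Linear(32, 1),
    )

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # A basic training loop: forward pass, loss, backward pass, parameter update
    for epoch in range(100):
        optimizer.zero_grad()
        prediction = model(X)
        loss = loss_fn(prediction, y)
        loss.backward()
        optimizer.step()

    print("final loss:", loss.item())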

Resources for the Neural Information Processing Systems (NIPS) 2017 Conference

This repository collects resources and slides from most of the tutorials, invited talks, and workshops held during the NIPS 2017 conference. For the uninitiated, NIPS is an annual conference dedicated to machine learning and computational neuroscience.

Much of the recent breakthrough research in the data science industry has been presented at this conference.

Summary

Before starting a data science project, it is good to have a clear understanding of the technical requirements so that you can allocate resources and budget accordingly. This is one of the main reasons an increasing number of organizations are choosing the flexibility of open source tools. The sheer variety of the open-source ecosystem has expanded the field’s knowledge base and brought in more new technologies than ever before.


This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Guest Post: Docker And the Data Scientist

Educational institutions and educators often struggle to create a common platform where they and their students can view and share code. One university in Turkey had to find a way to address a common complaint from students: their compute environments differed from the testing machine.

 

A professor at Bilkent University in Ankara, Turkey, decided to use a technology called Docker to power a web platform that can create lab instances and grade assignments.

 

So, what is Docker? We will answer that in a while. Before Docker was available, the next best solution was to use virtual machines. However, these machines needed to be extremely powerful and consequently required expensive infrastructure, which most institutions couldn’t budget for. Students were forced to log on to a shared server, where they inadvertently affected each other’s programs or, worse, crashed the whole infrastructure. Needless to say, it was impractical to assign a virtual machine to each student.

 

They used Docker to build a web-based application called the Programming Assignment Grading System (PAGS). A similar technique can be adopted by universities to create lab instances and grade assignments for data science classes.

 

Although we haven’t formally defined what Docker is, the above example demonstrates what Docker can do. The rest of the article focuses on Docker and how it can transform the education and data science industry.

 

The article is divided into four sections. First, we’ll start with an introduction to Docker and Docker containers. Then, we’ll answer the question, “Who is Docker for?” The third part will give you an overview of how Docker is a useful tool for data scientists. In the final section, we’ll dive into a couple of interesting use cases for Docker in data science. Let’s get started!

What is Docker?

 

Docker is the leading software containerization platform, actively developed by Docker Inc. It is an open source project designed to help you create, run, and deploy applications inside containers.

 

So, what is a container? A container, by definition, comprises all the dependencies, libraries and other related files required to run an application. Once you’ve created a container for your application, you can run it on any Linux machine regardless of the way your underlying machine is configured. If the machine that you’re using at one end is Ubuntu, and it’s Red Hat at the other end, fret not! Docker is precisely meant for situations like these.

 

You can create a snapshot of a container, and this snapshot is known as an image. Conversely, you can think of a container as an instance of a Docker image. Docker images are inert, immutable files. When someone asked the difference between an image and a container on Stack Overflow, a web developer named Julian came up with a quick analogy: “The image is the recipe, the container is the cake.” That just about sums it up.

 

You can store Docker images in a cloud registry like Docker Hub. There are numerous user-contributed Docker images that cover almost all general use cases. You can also create private Docker images and share them with your co-workers and your organization. Alternatively, you can push them to a public repository to give back to the community.

 

The concept of Docker is similar to that of a virtual machine (VM). However, virtual machines are demanding beasts that run considerably slower on less powerful hardware. A hypervisor shares the underlying hardware between VMs, letting you run one or more guest operating systems inside your host operating system, but you might need to upgrade your processor if you’re seriously planning to run your software on a virtual machine.

 

Unlike a VM, Docker uses the host kernel instead of creating new kernel instances; virtualization happens at the kernel level rather than at the hardware level. Docker encapsulates everything required to run the application on the host machine, which greatly improves the application’s performance and reduces its size. What gives Docker a significant lead is that it enables separation of concerns between the infrastructure, IT ops, the developer, and the application, creating a positive environment for innovation and collaboration.

Who is Docker for?

Docker is essentially a container platform largely aimed at businesses. It enables IT organizations to efficiently build and administer a complete application without the fear of infrastructure or architecture lock-in.

 

Enterprises use Docker for everything from setting up development environments to deploying applications for testing and production. When you need to build more advanced systems, like a data warehouse comprising multiple modules, containers make a lot of sense. You can save several days of work that you’d otherwise spend configuring each machine.

 

However, the Docker platform isn’t relevant only to developers and enterprises. It’s a pretty useful tool for data scientists, analysts, and even schools and colleges. Many educational institutions and universities are keen to transform digitally but are held back by their existing infrastructure.

Docker and Data Science

Why should you use Docker if you’re a data scientist? Here are three reasons pointed out by Hamel Husain over at Towards Data Science:

Reproducibility

If you are a professional data scientist, it is imperative that your work can be reproduced. Reproducibility facilitates peer review and ensures that the analysis, model, and application you have built can run unhindered, which makes your deliverables both robust and time-tested.

 

As an example, suppose you have built a Python model. It is usually not enough to run pip freeze and hand the resulting file to a colleague, because that only captures the Python-specific dependencies.

 

Imagine if you could avoid manually moving the non-Python dependencies, such as the compiler, configuration files, drivers, and so on. You can be free of these issues by simply bundling everything within a Docker container. This not only spares others the task of recreating your environment, it also makes your work much more accessible.

Ability to Port Your Compute Environment

If you are a data scientist who is specializing in Machine Learning, the ability to frequently and efficiently change your computing environment has a considerable effect on your productivity.

 

Data science work often starts with prototyping, research, and exploration, which doesn’t necessarily need special computing power. That said, there often comes a stage where additional compute resources can significantly speed up your workflow.

 

A number of data scientists find themselves limited to a local computing environment, largely because recreating that environment on a remote machine seems like a hindrance. Here, Docker makes the difference: it allows you to port your work environment, including libraries and files, in just a few commands. The ability to swiftly move your computing environment is also a substantial advantage in Kaggle competitions.

Enhance your Engineering Skills

Once you are comfortable with Docker, you can deploy models as containers, making your work readily accessible to other users. Additionally, various other applications that you may need in your data science workflow may already be available as Docker containers.

Use Cases for Docker in Data Science

By making your applications portable, cheaper, and more secure, Docker helps free up time and resources that can be spent on other important things. It can help transform IT without the need to re-tool, re-educate, or re-code your existing applications, staff, or policies.

 

Here are just a few of the use cases of how Docker can help different organizations:

Docker for Education

Let’s revisit the Docker use case from the introduction. The faculty at the university used Docker to create a container running their application, PAGS. It allowed students to have the same environment on their compute machines and test machines without the need for a VM.

 

Docker provides a common environment that can run as a container on any Linux machine, which all but guarantees the same results on a different machine using the same container. Without Docker, this would have required more infrastructure and resources than they had.

 

Another particularly interesting scenario is setting up lab instances. Once a machine is configured the way you want, you can take a snapshot of it to build a Docker image, then pull that image onto all the other lab instances, saving time and resources.

Docker for Data Science Environment Set Up

Consider a scenario where you need to explore a few data science libraries in Python or R, but:

  1. without spending a lot of time installing either language on your machine,
  2. without browsing around to figure out which dependencies are essential, and
  3. without having to identify what works best for your version of Windows/macOS/Linux.

 

This is where Docker can help.

 

Using Docker, you can get a Jupyter ‘Data Science’ stack installed and ready to execute in no time flat. It allows you to run a ‘plug and play’ version of a Jupyter data science stack from within a container.

 

To start, you would first need to install Docker Community Edition on your machine. Once done, restart your machine and set up your Jupyter container. Prior to running a container, you need to specify the base image for the container.

 

In most cases, the image you’re looking for has already been built by someone else and includes everything needed for a fully loaded data science stack of your choice. All you need to do is specify a pre-built image that Docker can use to start a container.
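As a minimal sketch of that “plug and play” idea driven from Python, the snippet below uses the Docker SDK for Python (the shell equivalent is a single docker run command); the jupyter/datascience-notebook image and port mapping are common conventions rather than something prescribed by this post:

    import docker

    # Connect to the local Docker daemon (requires Docker to be installed and running)
    client = docker.from_env()

    # Pull and start a community-maintained Jupyter data science image,
    # exposing the notebook server on http://localhost:8888
    container = client.containers.run(
        "jupyter/datascience-notebook",
        ports={"8888/tcp": 8888},
        detach=True,
    )

    print("container id:", container.short_id)
    # Once the server is up, its logs include the URL and access token:
    # print(container.logs(tail=20).decode())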

Conclusion

In this article, we have just scratched the surface of what can be done with Docker, focusing only on the areas a data scientist is most likely to encounter. Below are some further resources that can help you on your journey of using and implementing Docker.

 

  1. Basic Docker Terminologies
  2. Useful Docker Commands
  3. Dockerfile Reference
  4. Pushing and Pulling to and from Docker Hub

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Intro to DS Assignment Sites

As an instructor, I want to provide high-quality assignments that are focused (so they achieve the learning objective), engaging (so students aren’t bored), and well supported (so students don’t end up frustrated). In an ideal world, I’d have time to write, test, debug, and administer all my own course-tailored assignments that meet these goals. I, however, do not live in an ideal world, nor do I have enough graduate/undergraduate minions to mimic this ideal world. Instead, I’ve turned to a few sites that already host assignments and resources, and even include auto-grading (without me needing to learn or set up the system).

Learn2Mine (L2M) is the first site I used in conjunction with my Data Mining course, and more recently my Introduction to Data Science course. Learn2Mine is a free, open source platform developed at the College of Charleston (CoC). While I have only really made use of the contents already there and CoC’s hosted site, you can contribute, or host your own version by getting the source directly from GitHub. Dr. Anderson is fairly responsive about keeping the site running and grading.

The positive features of L2M (beyond being totally free/open source) are that it includes a mix of introductory programming assignments and several more advanced machine learning/data mining lessons. It even has several search algorithm lessons (which I tend not to use). All of the lessons include auto-graded response boxes, which also provide limited feedback on the errors generated when comparing submitted work to answers. There is also an interface for instructors to create their own ‘courses’, which consist of a series of lessons on L2M. This allows the instructor to see student progress through lessons and download a grade-book in spreadsheet format.

Downsides for L2M are in line with what you pay (or invest time-wise). Even though there is feedback when students get answers wrong, it often consists only of identifying mismatched output lines (so pretty sparse). Students often get very frustrated trying to figure out what they are missing, which is exacerbated by instructions that are often unclear or insufficient for students to simply do the lessons. Also, as might be expected from a locally built and maintained project, a lot of “polish” features are missing, such as being able to reorder assignments in a course or associate a name with an account. Accounts are tied to the email students log in with, so it can sometimes be challenging to connect records with students. Overall, I’ve been considering phasing L2M out of my normal assignment structure, though the possibility of hosting my own local version and implementing different, better-explained lessons has also been tempting.

The prime contender to replace L2M for me has been DataCamp. I’ve known about DataCamp for a while now, but I had my first chance to actually use it and build assignments from it this spring when I was looking for data visualization lessons (see the visualization resources post). I’ve gone through a few lessons myself and found DataCamp to be basically exactly what I’d want online course-work to be. Most courses consist of short videos (a best practice) followed by several guided coding exercises. DataCamp is not (sort of) free, which turns out to be both a pro and a con.

If it’s not free, why is DataCamp going to replace L2M for me? Great question. Because, for academic purposes, DataCamp IS free. If you are an instructor at an academic institution teaching a course with 10+ students enrolled, you can request free premium access for students in your course(s). That access is limited (they give you six months), but hey, it’s free. What else makes DataCamp a nicer replacement? First, the coding exercises are scaffolded; that is, early exercises have more prewritten code, while later exercises require you to remember and use what you’ve already learned. In addition, the coding exercises have reasonably helpful error messages, often letting you debug code more accurately. There are also built-in hints and help available, so a student can’t get permanently stuck. Using those decreases the “XP” they gain, though, so you can still track how successful a student has been without help. The other major advantage is that DataCamp has a SIGNIFICANTLY larger set of lessons and courses to pull from.

There is no free lunch in data/computer science, though. DataCamp does have a few downsides. Perhaps the biggest is the granularity available in assignments. You have three choices: “collect XP”, “complete chapter”, or “complete course”. Given that a chapter is really the smallest cohesive learning unit on DataCamp, this makes a lot of sense educationally. However, it also means DataCamp isn’t exactly an alternative for assigning individual labs or homework. Instead, it serves best as a resource or major assignment tied to learning how to program in Python/R, or to a bigger topic.

Finally, I want to mention Gradescope. Gradescope isn’t a data science educational site; instead, it’s a jack-of-all-trades that can help ease the burden of assignments and grading. If DataCamp took L2M and removed granularity and options, Gradescope (in this context) goes the other direction. Lots of faculty use it for all kinds of courses, from computer science and mathematics to writing. Given its purpose, Gradescope doesn’t have any specific assignments (maybe that was obvious). Instead, it can serve as an autograder or collection site for your own assignments. I’ve included it here for those who might already have assignments (or who get them from others) but still want a speedy, simple way to get feedback to students.

I’d be remiss if I didn’t point out that there are some alternatives to DataCamp, depending on your goals. If all you need students to do is learn to program (not necessarily in a data-centric style), try Codecademy or explore Code.org. I also know there is an alternative to Gradescope, but I couldn’t track down the name/site (if someone knows, please email me or leave a comment). What I recall is that the alternative is NOT free, but does provide better support and scaling. You might also consider what options are available for, or can be integrated with, your learning management system (DataCamp is… but maybe not by you).

Hopefully you found this post informative. If you’ve got other suggestions for websites with assignments (particularly data-science related), please let me know or leave a comment.

 

Version Control and Reproducible Research/Data Science

A current hot topic in research, especially in statistically driven research, is “reproducible research”. In academia, the process of peer-reviewed publication is meant to ensure that any findings are reproducible by other scientists. But those of us in the trenches, especially on the data side of things, know that reproducibility is a theoretical outcome and far more rarely something actually tested. While academia is rightly under fire for this lack of actual reproducible research (see this great example from epidemiology), the problem is even bigger in industry: if an analysis can’t be reproduced, it can’t be applied to a new client base.

So why bring this up on an educational blog? I think it’s important to embed the idea of reproducible work deep inside our teaching and assignment practices. While the idea of repeating a specific analysis once the data has changed isn’t really novel, it becomes far more relevant when we begin talking about filtering or cleaning the input data. Just think about searching for outliers in a dataset. First, we might plot a histogram of values/categories; then we go back, remove the data points we want ignored, and replot the histogram. BAM! There we have a perfect opportunity to teach the value of reproducible work: we used exactly the same visualization technique (a histogram) on practically the same data (with outliers and without).

Where does the reproduction of the work fit in though? Python and R both have histogram functions, so this is definitely a toy example (but the whole idea of functions can serve to emphasize the idea of reproducible/reusable work). Instead, I think this is where the instructor has an opportunity. This idea of cleaning outliers could easily be demonstrated in the command line window of R or an interactive Python shell. And then you’ve lost your teaching moment. Instead, if this is embedded in an R script or Python/R Notebook you can reuse the code, retrace whatever removal process you used, etc. In the courses I’ve taught, I’ve seen student after student complete these sorts of tasks in the command-line window, especially when told to do so as part of an active, in-class demo. But they never move the code into a script so when they are left to their own devices they flounder and have to go look for help.
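To make the contrast concrete, here is a minimal sketch of what that outlier-cleaning script might look like in Python; the values and the interquartile-range cutoff are purely illustrative:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Toy data with a couple of obvious outliers (illustrative values only)
    values = pd.Series([12, 15, 14, 13, 16, 15, 14, 250, 13, 15, 14, 300])

    # First pass: histogram of the raw data
    values.hist(bins=20)
    plt.title("Raw values")
    plt.show()

    # Remove outliers with a simple interquartile-range rule
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    filtered = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]

    # Second pass: exactly the same visualization on the cleaned data
    filtered.hist(bins=20)
    plt.title("Values with outliers removed")
    plt.show()

Because both passes live in the same script, a student (or a collaborator) can rerun the whole analysis and see precisely which points were dropped and why.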

I titled this post “Version Control and Reproducible Research”… you might be wondering what version control has to do with this topic. The ideas described above are great if you are the sole purveyor of your code/project. But if you have your students working in teams, or are trying to collaborate yourself, this might not be exactly ideal. But it’s getting pretty close! Here’s the last nugget you need to make this work: version control, or in this case, specifically GitHub. The short version of what could be an entire separate post (I’ll probably try to write one eventually) is that git (and the cloud repository GitHub) is the tool software developers designed to facilitate collaborative development without the desire to kill each other over broken code. It stores versions of code (or really any file) that can be jointly contributed to without breaking each other’s work. For now, I’ll point you to a few resources:

First, a bit more from an industry blog on workflows to promote reproduction using github — Stripe’s Notebooks and Github Post

Second, for using Git/GitHub with R — Jenny Bryan, Prof. University of British Columbia — Note that this is a really long, complete webpage/workshop resource!

Third, a template/package for Python to help structure your reproducible git-hub work — Cookiecutter Data Science —  (heck, this could be an entire lesson itself in how to manage a project– more on that later)

Fourth, a template/package for R to help structure your reproducible git-hub/R work — ProjectTemplate