From January 2019

Guest Post: Open Source and Data Science

Open source solutions are improving how students learn—and how instructors teach. Matt Mullenweg, the founder of WordPress, shared his view of open source with TechCrunch a few years back: “When I first got into technology I didn’t really understand what open source was. Once I started writing software, I realized how important this would be.”

Open source software is almost everywhere, and tons of modern-day proprietary applications are built on top of it. Most students working through the intricacies of data science will already be fairly familiar with open source software, because many popular data science tools are open source.

There’s a popular perception that open-source tools are not as good as their proprietary peers. However, as Linux demonstrates, the fact that the underlying code is open and free does not mean it is of poorer quality. Truth be told, open-source tools are often best in class for development and data science alike.

In this post, we’ll have a straightforward look at how open source contributes to data science. We’ll also cover open-source tools and repositories that are related to data science.

Why is Open Source Good for Data Science?

Why exactly do Open Source and Data Science go hand in glove?

Open Source spurs innovation

Perhaps the most significant benefit of open source tools is that they give developers the freedom to modify and customize their tools. This enables quick improvements and experimentation, which in turn allows a package and its features to be used more extensively.

As is the case with any other development that captures the interest of the technology sector and the public, the critical piece lies in the ability to bring the end product to market as quickly and as thoroughly as possible. Open source tools prove to be a massive benefit to this end.

Faster problem solving

The open-source ecosystem helps you solve your data science problems faster. For instance, you can use tools like Jupyter for rapid prototyping and git for version control. There are other tools you can add to your toolbelt, like Docker, to minimize dependency issues and make quick deployments.
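For example, the first cell of a notebook-based prototype often looks something like this (a minimal sketch, assuming a hypothetical sales.csv with region and revenue columns):

    import pandas as pd

    # Load and summarize the data; in Jupyter each step can be inspected interactively.
    df = pd.read_csv("sales.csv")  # hypothetical file, used here only for illustration
    print(df.describe())
    print(df.groupby("region")["revenue"].sum())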

Continuous contributions

Google is a prime example of a company that contributes to open source, and TensorFlow is the best example of its contributions. Google uses TensorFlow for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. By open sourcing tools like TensorFlow, Google benefits from contributors outside the core team. As TF grows in popularity, many new research ideas are implemented in TF first, which makes it more efficient and robust.

Google explains this topic in-depth in their open-source documentation. 

While open source work may have benevolent results, it is not an act of charity. Releasing work as open source, together with the corresponding contribution process, eventually yields a higher return on the initial investment than the alternative closed-source process.

One of the significant benefits of open source, especially when it comes to data science breakthroughs, is the sheer size of the various open-source communities that developers can tap into for problem-solving and debugging.

These forums have hundreds of answers to frequently asked questions, and given that open-source data science tools are set to expand going forward, these communities and repositories of information will only grow.

Contribute and give back to the community

The best way to learn data science is to actively participate in the data science communities that you love. With open source, that’s entirely possible: you can start by simply following data science projects and repositories on GitHub. Take part in discussions and, when you feel you’re ready, contribute by reviewing code and submitting patches for open-source security bugs.

This will help you get involved, gain exposure and learn details that might otherwise be impossible to learn from your degree curriculum.

Open Source Data Science Tools You Should Know About

KDnuggets recently published the results of a Data Science and Machine Learning poll conducted earlier this year. The graph below shows the tools with the strongest associations, along with each tool’s rank by popularity.

[Figure: KDnuggets poll – associations between the top data science tools]

The weight of each bar indicates the strength of the association between two tools, and the numbers give the percentage of association. As the figure shows, TensorFlow and Keras are the most popular combination, with a weight of 149%. Anaconda and scikit-learn are another popular pairing.

The number to the left indicates each tool’s rank by popularity. The color shows the value of the lift: green for more Python, red for more R.
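For readers unfamiliar with lift: it is commonly computed as the ratio of the observed co-usage rate of two tools to the rate expected if usage were independent. A toy sketch with made-up numbers (not the actual poll data):

    import numpy as np

    np.random.seed(0)
    n = 1000
    uses_tf = np.random.rand(n) < 0.3                 # simulated TensorFlow users
    uses_keras = uses_tf & (np.random.rand(n) < 0.8)  # Keras users mostly overlap with TF

    # Lift > 1 means the tools co-occur more often than independence would predict.
    lift = (uses_tf & uses_keras).mean() / (uses_tf.mean() * uses_keras.mean())
    print(f"lift = {lift:.2f}")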

We’ll limit our discussion to some of the open-source data science and machine learning tools. This list is not based on popularity, but rather on usability from a learner’s perspective. Let’s get started.

TensorFlow

TensorFlow is an open source library built for Python with numerical computation in mind, with the goal of making machine learning more accessible and more efficient. Google’s TensorFlow eases the process of acquiring data, training models, serving predictions, and refining results.

Developed by the Google Brain team, TensorFlow is a library for large-scale machine learning and deep learning. It gathers together many different machine learning and deep learning algorithms and makes them available behind a common interface. TensorFlow uses Python as a convenient front-end API for building applications within the framework, while executing them with high-performance C++.

TensorFlow can train and execute deep neural networks for image recognition, handwritten digit classification, recurrent neural networks, word embeddings, sequence-to-sequence models, natural language processing, and partial differential equation (PDE) based simulations. It also supports scalable production prediction using the same models used in training.
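As a minimal illustration of the dataflow model (a sketch written against the TensorFlow 1.x API that was current when this post appeared): a graph is first defined symbolically, then executed by the runtime.

    import tensorflow as tf

    # Define a symbolic dataflow graph; nothing is computed at this point.
    a = tf.placeholder(tf.float32, name="a")
    b = tf.placeholder(tf.float32, name="b")
    c = a * b + 2.0

    # Execute the graph with the high-performance C++ runtime.
    with tf.Session() as sess:
        print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # 14.0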

Keras

Keras is a minimalist Python library for deep learning that runs on top of TensorFlow or Theano. Keras was developed to help implement deep learning models quickly and efficiently for research and development.

Keras runs on Python 2.7 and 3.5 and executes on CPUs and GPUs, depending on the underlying framework.

Keras was developed by an engineer at Google and has four guiding principles –

  1. Modularity – A model is understood as a standalone sequence or graph of modules, and the components of a deep learning model can be combined almost arbitrarily.
  2. Minimalism – The Keras library provides just enough information to help users achieve an outcome.
  3. Extensibility – New components are easy to add and implement within the framework. This is intentional, giving developers and researchers the freedom to trial and experiment with new ideas.
  4. Python – There is no requirement for additional files with custom file specifications. When working in Keras, everything is native Python.

Keras’ deep learning process can be summarized as follows –

  1. Define your Model – Create your sequence and add layers as needed
  2. Compile your Model – Identify optimizers and loss functions
  3. Fit your Model – Use the existing data to execute the model
  4. Make Predictions – Use the developed model to trigger predictions based on the data

H2O

H2O is a scalable, fast and distributed open source machine learning framework that provides many algorithms. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It supports smart applications including deep learning, random forests, gradient boosting, generalized linear modeling, etc.

H2O is a business-focused AI tool that allows users to derive insights from data through faster and improved predictive modeling. The core code of H2O is written in Java.

H2O handles vast amounts of data, enabling enterprise users to make fast, accurate predictions and to extract the information needed for decision making from large datasets.
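A short sketch of a typical H2O workflow in Python (the customers.csv file and churn column are hypothetical placeholders):

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start (or connect to) a local H2O cluster; the heavy lifting runs on the JVM.
    h2o.init()

    # Hypothetical dataset with a binary "churn" target column.
    frame = h2o.import_file("customers.csv")
    frame["churn"] = frame["churn"].asfactor()  # treat the target as categorical

    # Train a gradient boosting model; remaining columns are used as predictors.
    model = H2OGradientBoostingEstimator(ntrees=50)
    model.train(y="churn", training_frame=frame)
    print(model.auc())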

Apache Mahout

Apache Mahout is an open source framework built on top of the Hadoop platform. It assists with building scalable machine learning applications and is comparable to Spark’s MLlib.

The three main features of Mahout are –

  1. A scalable and straightforward programming framework and environment
  2. A wide range of pre-packaged algorithms for Apache Spark + Scala, Apache Flink and H2O
  3. A vector math experimentation environment called Samsara, with an R-like syntax dedicated to matrix calculations

Anaconda

Anaconda is a full open-source data science distribution that boasts a community of more than 6 million users. It is simple to download and install, and packages are available for macOS, Linux and Windows.

Anaconda comes with 1,000+ data science packages in addition to the conda package and virtual environment manager, eliminating the need to install each library independently.

The R and Python conda packages in the Anaconda Repository are curated and compiled in a secure environment, so users get the benefit of optimized binaries that work efficiently on their systems.

scikit-learn

scikit-learn is a library that enables machine learning in Python. It is efficient and straightforward to use for data mining and data analysis tasks, is reusable in many different contexts, and is accessible to almost all users.

scikit-learn includes a number of different classification, clustering and regression algorithms, including –

  1. Support vector machines
  2. Random forests
  3. k-means
  4. Gradient boosting, and
  5. DBSCAN
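A quick taste of the library’s uniform fit/predict interface, using the bundled iris dataset and one of the algorithms above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Split the bundled iris dataset into train and test portions.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Every estimator follows the same fit/predict/score pattern.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out data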

Top 5 Open Source Repositories to Get Started with Data Science

For any data science student, GitHub is a great place to find useful resources for learning data science.

Here are some of the top resources and repositories on GitHub. There are plenty of good libraries out there that we haven’t covered in this post. If you know of data science repositories that you’ve found useful, please share them in the comments.

Awesome Data Science Repo

The Awesome Data Science repository on GitHub is a go-to resource guide when it comes to data science. It has been developed over the years through multiple contributions, with linked resources ranging from getting-started guides to infographics to suggestions of experts you can follow on various social networking sites.

Here’s what you’ll find in that repo.

Machine Learning and Deep Learning Cheat Sheet

The cheatsheets-ai repository includes common techniques and tools put together in the form of cheatsheets. These range from simple tools like pandas to more complex procedures like deep learning.

Some of the common cheatsheets included here are – pandas, matplotlib, numpy, dplyr, scikit-learn, tidyr, ggplot, Neural Networks, and pySpark.

Oxford Deep Natural Language Processing Course Lectures

With the introduction of deep learning, NLP has seen significant progress, thanks to the capabilities of deep learning architectures like Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs).

This repository is based on the Oxford NLP lectures and takes the study of Natural Language Processing to the next level. The lectures cover terminology and advanced techniques, such as using Recurrent Neural Networks for language modeling, text to speech, speech recognition, and more.

PyTorch

PyTorch is an open source machine learning library for Python, based on Torch, used for applications such as natural language processing. PyTorch has garnered a fair amount of attention from the deep learning community given the ease of Pythonic-style coding, faster prototyping, and dynamic computation.
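A minimal sketch of that dynamic style: the computation graph is built on the fly as ordinary Python executes.

    import torch
    import torch.nn as nn

    # A tiny feed-forward network defined in plain Python.
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

    x = torch.randn(4, 8)        # a random batch of 4 samples
    loss = model(x).sum()        # the forward pass builds the graph dynamically
    loss.backward()              # autograd walks that graph to compute gradients
    print(model[0].weight.grad.shape)  # torch.Size([16, 8])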

The PyTorch tutorial repository includes code for deep learning tasks, from the basics of creating a neural network with PyTorch to coding Generative Adversarial Networks (GANs), RNNs, and neural style transfer. Most models are implemented in 30 lines of code or fewer.

Resources of the Neural Information Processing Systems (NIPS) 2017 Conference

This repository collects resources and slides from most of the tutorials, invited talks, and workshops held during the NIPS 2017 conference. For the uninitiated, NIPS is an annual conference dedicated to machine learning and computational neuroscience.

Much of the recent breakthrough research in the data science industry grew out of work first presented at these conferences.

Summary

Before starting a data science project, it is good to have a clear understanding of the technical requirements so that you can allocate resources and budgets accordingly. This is one of the main reasons an increasing number of organizations are choosing the flexibility of open source tools. The sheer variety of the open-source ecosystem has expanded the field’s knowledge base and brought in more new technologies than ever before.


This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

JMM Session: Technology and Resources for Teaching Statistics and Data Science

This blog post is a collection of the presentations from the session I chaired at the 2019 Joint Mathematics Meeting. The session was titled “Technology and Resources for Teaching Statistics and Data Science”. It was co-sponsored by the MAA Committee on Technology in Mathematics Education (CTiME) and the SIG-MAA: Statistics Education (Stat-Ed).

The abstract for the session was:

One of the five skill areas in the American Statistical Association’s curriculum guidelines is “Data Manipulation and Computation” (pg. 9), embracing the need for students to be competent with programming languages, simulation techniques, algorithmic thinking, data management and manipulation, as well as visualization techniques. Additionally, the emphasis on using real data and problems and their inherent complexity means that technology is often necessary outside of specifically prescribed computational courses. This session invites instructors to contribute talks exploring the use of any software or technology in statistics education. Talks may include effective instructional or pedagogical techniques for linking programming to statistics, interesting classroom problems and the use of technology to solve them, or more.

Abstracts for the talks can be found here.

Teaching a Technology-Rich Intro Stat Course in a Traditional Classroom, presented by Patti Frazer Lock, St. Lawrence University

Using the Islands in an Introductory Statistics Course, presented by Carl Clark, Indian River State College

StatPowers-A Simple Web-Based Statistics Suite for Introductory Statistics, presented by Brian R Powers, Arizona State University

Using R Programming to Enhance Mathematical and Statistical Learning, presented by Joseph McCollum, Siena College

Computational Experience for Linear Regression and Time Series using R, presented by Rasitha R. Jayasekare, Butler University

Statistics teaching and research with R, presented by Leon Kaganovskiy, Touro College

GAISEing into the Future with Fun, Flexible Mobile Data Collection and Analysis, presented by Adam F. Childers, Roanoke College

Written vs. Digital Feedback: Which Improves Student Learning?, presented by David R. Galbreath, United States Military Academy

Using Authentic Data in Spreadsheet Assignments and Quizzes to Improve Students’ Attitudes towards Elementary Statistics, presented by Daniel A. Showalter, Eastern Mennonite University

Democratizing Data: Expanding Opportunities for Students in Data Science, presented by Robin L. Angotti, University of Washington Bothell

I hope to have the recordings of the session posted in the future. Stay tuned for an updated post (I’ll also send an announcement).