This week I want to discuss a potentially divisive issue: should a program (or course, etc.) be taught in Python or R? I think a reasonable case could be made for teaching either language. Pragmatically, if you want your program’s graduates to be truly competitive for the largest variety of jobs in the current market, students need to be at least familiar with both (and possibly SAS or SPSS). There are already plenty of blog posts and other resources addressing this question, and I’ve provided links to a few of my favorites at the end of this post. Rather than rehashing those posts’ pros and cons, I’m going to focus on aspects of each language related to teaching (and learning).
Before considering each language, I want to frame the discussion by (re)stating a program-level student learning objective (SLO). In my first post about SLOs, objective 2 states: “Students will be able to implement solutions to mathematical and analytical questions in language(s) and tools appropriate for computer-based solutions, and do so with awareness of performance and design considerations”. Based on this objective, I’ll state three specific objectives for selecting a programming language:
- A language which can implement (complete) solutions to data science questions
- A language which allows good programming practices in terms of design
- A language which allows solutions to be implemented with awareness of performance issues, and improved where needed
Why Choose R?
As a programming language that originated in academia, particularly within the statistics community, R seems like a very natural choice in terms of teaching data science. Much of the syntax, function naming and even thinking about how to construct a data pipeline/workflow comes naturally from a statistical analysis perspective. This makes it very easy to convert knowledge of statistical processes into code and analysis within R. The easy conversion between notation and code becomes even more valuable when trying to work with advanced/obscure statistical techniques. Given R’s origins in academic statistics, there is a much broader range of packages for uncommon techniques than in most other languages. This makes R a strong candidate for the first requirement when working in statistical domains.
Other software/packages that make R appealing to teach with are RStudio, Jupyter Notebooks and R Markdown. RStudio provides a clean, user-friendly interface for R that makes interacting with plots and data easy. It even aids the transition from spreadsheet software (like Excel) by providing a similar, GUI-driven interaction with (simple) data frames. With Jupyter Notebooks’ recent addition of an R kernel option, it is also easy to transition from mathematics-focused software like Maple and Mathematica. See this DataCamp blog post for more information on using Jupyter Notebooks (or similar software) with R. Notebooks also facilitate teaching good practices such as code blocks and code annotation. Finally, R Markdown provides a (reasonably) simple way to convert executable code directly into final reports/outputs. That functionality further supports the teaching of (some) good programming and design practices.
Why Choose Python?
Python was originally developed to be an easy-to-learn programming language (see Wikipedia’s history on Python). This means the whole language’s syntax and styling is easier to learn from scratch than most other languages (notably R). Python’s basic list and set types map naturally onto mathematical sequences and sets, while dictionaries closely match logical constructions for unstructured data. Together with the use of indentation to indicate control flow, this makes it natural, in any introduction to the language, to teach how to make Python code (human) readable. These traits speak directly to teaching/achieving our second language-related objective, “allows good programming practices/design”.
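To make the point about data structures and readability concrete, here is a minimal sketch of the kind of example I might show in a first class. The survey records, names and the `users_of` helper are all made up for illustration; nothing here comes from a real dataset.

```python
# Hypothetical survey responses (invented for illustration). Each record is a
# dictionary, and records do not have to share the same keys.
responses = [
    {"name": "Ana", "languages": ["R", "Python"]},
    {"name": "Ben", "languages": ["Python"], "years_experience": 3},
    {"name": "Cy", "languages": ["SAS", "R"]},
]


def users_of(language, records):
    """Return the set of respondent names who list the given language."""
    return {rec["name"] for rec in records if language in rec["languages"]}


r_users = users_of("R", responses)
python_users = users_of("Python", responses)

# Set operators read much like the mathematical notation.
both = r_users & python_users      # intersection
either = r_users | python_users    # union
only_r = r_users - python_users    # difference

# Indentation alone marks the body of the loop and of each branch.
for rec in responses:
    if "years_experience" in rec:
        print(rec["name"], "reported", rec["years_experience"], "years")
    else:
        print(rec["name"], "did not report experience")
```

Even without knowing Python, most students can guess what `both`, `either` and `only_r` contain, which is the readability argument in miniature.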
For teaching, Python starts with many of the same advantages as R. There is a long-standing Python kernel for Jupyter Notebooks and several markdown packages available for turning code directly into HTML-styled reports. What makes Python noticeably different from R is that it is a general-purpose programming language. In terms of teaching, this opens up some interesting options related to the first and third goals above. In terms of developing solutions to data science problems, Python easily allows a very broad range of both input and output. Specifically, it has high-quality packages designed to deal with streaming data and better techniques for unstructured or big data. Also, because Python is regularly used to develop full programs and deployed software solutions, the methods available to study and improve performance are already well developed.
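As one small illustration of studying performance, the standard library’s `timeit` module lets students time two versions of the same computation. This is only a sketch: the two functions and the problem size are my own illustrative choices, and the timings will vary by machine, so no particular result is implied.

```python
import timeit


def sum_of_squares_loop(n):
    """Sum i*i for i in 0..n-1 using an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_builtin(n):
    """Same computation using the built-in sum and a generator expression."""
    return sum(i * i for i in range(n))


n = 10_000  # illustrative problem size

# Check that both versions agree before comparing their speed.
assert sum_of_squares_loop(n) == sum_of_squares_builtin(n)

# Time 200 calls of each version; results depend on the machine.
loop_time = timeit.timeit(lambda: sum_of_squares_loop(n), number=200)
builtin_time = timeit.timeit(lambda: sum_of_squares_builtin(n), number=200)

print(f"loop:    {loop_time:.4f} s")
print(f"builtin: {builtin_time:.4f} s")
```

The habit worth teaching is the order of operations: confirm the two versions give the same answer first, then measure, rather than guessing which one is faster.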
But What are People Actually Using?
There are way, way more Python users than R users (and probably will be for the foreseeable future) simply because Python is a general-purpose programming language. However, we are more concerned with users within the data science communities. That focus doesn’t make the answer to our question any clearer. 2016 data from O’Reilly’s Data Science Salary Survey places R (57%) slightly ahead of Python (54%), which matches KDnuggets’ 2016 rankings, where R was also slightly ahead. The 2017 KDnuggets survey results, however, now place Python slightly ahead. Burtch Works’ 2017 survey data still has R significantly ahead, and in fact still gives a very large market share to SAS, which didn’t even make KDnuggets’ list. But Burtch also notes that Python has been gaining share each year. Remember when considering these results that these are all self-reported and self-selecting surveys! It is hard to tell if these changes are actual changes in use, or just a changing definition/reach of who’s responding to the surveys. For example, when Burtch Works breaks down their results, at least one sub-group rarely used SAS and, similar to O’Reilly and KDnuggets, had Python ahead. More and more people are identifying with doing data science each year, but many of them have been doing similar things for a long time.
Some Undisguised Opinions
There is obviously value in either programming language, but from my perspective there is a really strong winner in Python. From a curriculum/planning perspective, since Python is a general-purpose language it is entirely feasible to have standard, introductory programming courses from a computer science department taught in Python. This reduces (potentially wasteful) duplication of similar courses (does every discipline really need its own intro programming?). It also lets computer scientists take advantage of years of educational research into how to better teach programming! Not to mention that Python was intentionally designed to be easier to learn programming in.
Add to this that data science students don’t really experience any major disadvantages from having Python as the primary curricular language but do gain several benefits. Key benefits include longer-term skill viability and increased versatility in job options, etc. This versatility even plays out when considering including advanced CS courses in a data science curriculum. Most data science curriculums are already going to struggle to incorporate all the necessary foundational skills in a reasonable length undergraduate (or graduate) program. So why add programming courses beyond those already needed to meet typical CS prerequisites?
Finally, looking at the trends in language/tool use in data science just adds more validation to this idea. As companies move to working with unstructured or streaming data, Python becomes even more natural. All the surveys report increasing use of Python, with no sign of that increase slowing down. It is important for academic programs not just to react to trends and needs in the job market and industry, but to anticipate them.
While I didn’t go into lots of detail on the pros and cons of R or Python (and didn’t even talk about SAS/SPSS), I have collected a few links that you might find valuable to read in making your own decision.
R vs. Python for Data Science: Summary of Modern Advances (EliteDataScience, Dec 2016): Does a nice job of highlighting the new things that make the languages pretty equal.
Python & R vs. SPSS & SAS (The Analytics Lab, 2017): This is nice because it also puts into perspective how SPSS and SAS play into the landscape, as well as providing additional historic perspectives.
Python vs. R: The battle for data scientist mind share (InfoWorld, 2017): A fairly balanced perspective on the value of both.
R vs. Python for Data Science (KDnuggets, 2015): A bit dated, but still provides some good comparisons.