By Karl Schmitt

Analytics Insight Interview

I was recently asked by Analytics Insight to respond to some interview questions about the state of the industry and Valpo's data science programs. I thought readers of this blog might be interested in my responses. They also interviewed several other program directors, and together the interviews provide a nice snapshot of the data science and analytics programs running across the country.

My responses are below, and match the full Valpo article found here:

https://www.analyticsinsight.net/valparaiso-university-helping-forge-the-future-of-big-data-and-analytics/

The entire e-newsletter is available through this link, with interviews from:

Purdue – Krannert School of Management – Director Karthik Kannan
St. Mary’s College – Director Kristin Kuter
University of Iowa – College of Business – Dept. Head Barrett W. Thomas
University of San Francisco – Director David Uminsky


The industry is seeing a rising importance of Big Data Analytics and AI. How do you see these emerging technologies impact the business sector?

Just as previous manufacturing technologies changed the type of work the average employee must do, I expect that AI and analytics will drive major innovations in workflow and process. We first saw factory line workers improve the throughput, and often the quality, of hand-made goods. Later improvements in machine automation moved workers into supervisory and troubleshooting roles over those machine processes. With AI, I think there will be yet another level of abstraction and distance between the produced goods and the workers overseeing them.

How is Valparaiso University’s Analytics and Data Science Program contributing to the growth and transformation of analytics and big data education?

Dr. Schmitt, the Director of Data Science Programs at Valparaiso, recently received a TRIPODS+EDU grant from the National Science Foundation (NSF) to investigate student difficulties in learning data science. Valpo is a prime site for doing this sort of education-focused research, and we hope to continue to expand that work in the near future.

See the NSF News announcement:

https://www.nsf.gov/news/news_summ.jsp?cntn_id=296537&org=NSF&from=news

See the Valparaiso University Press Release:

https://www.valpo.edu/news/2018/09/11/national-science-foundation-awards-tripodsx-grant-to-valpo-researchers-to-advance-data-science-education/

What is the edge Valparaiso University’s Analytics and Data Science Program has over other institutes in the industry?

Valpo's programs have two features, intimately tied together, that I believe make them stand out from their competitors. First, students begin working with external clients and real, decision-level data as early as their second semester (that is, freshman year). Second, Valpo has extensive ties with non-profits and government agencies, thanks to its religious affiliation and history of social impact (ranked 1st in the nation for Contribution to the Public Good by Washington Monthly). Together this means that every year a student studies at Valpo, their classwork and time can contribute to making the world a better place.

Kindly brief us about your role at Valparaiso University’s Analytics and Data Science Program and your journey in this highly promising sector.

I took over directing the Analytics and Modeling (AMOD) program in 2014, one year after I was originally hired at Valpo. For me, this was incredibly exciting, as the ideas central to the program (applied mathematics with simulation, modeling, and statistical analysis) were the core of my scientific background and research areas. Since then I have taken a lead role in shaping both the existing graduate program (AMOD) and a new undergraduate major, Data Science. My primary role within both programs is to shape the student experience and develop analytical talent from recruitment to employment. I help with a variety of recruitment activities, advise all students enrolled in the programs, teach several of the key courses, and help many students find internships or jobs after graduation.

What would you advise aspiring big data and analytics candidates?

Learn Python. Seriously, I firmly believe that Python will, over time, become the core language for data science and analytics, mostly because it has a far, far wider user base than other languages (maybe not in this specific field, but overall).

This is followed closely by: Learn GIT – and not just the desktop version!
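To make the Python recommendation above a bit more concrete, here is a minimal sketch of the kind of analysis the language makes easy with the pandas library; the file name and column names are hypothetical stand-ins, not from any real project.

    import pandas as pd

    # Load a tabular data set into a DataFrame.
    # "sales.csv" and its columns ("region", "revenue") are hypothetical.
    df = pd.read_csv("sales.csv")

    # A quick look at the data and basic summary statistics.
    print(df.head())
    print(df.describe())

    # Aggregate: average revenue by region, highest first.
    summary = df.groupby("region")["revenue"].mean().sort_values(ascending=False)
    print(summary)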

What are some of the challenges faced by the industry today?

I think the biggest challenges faced by the broader field and industry are fragmentation and a lack of vision across disciplinary boundaries. Turf wars between mathematics, statistics, computer science, business, or whoever aren't really going to serve the further development of the field. Moreover, I think an unwillingness to see the value of others' approaches can blind researchers and practitioners to major issues. Central among these are a lack of mathematical/statistical support for conclusions, the reproducibility of results, and inherent bias in any data or analysis.

Please share some major achievements of Valparaiso University's Analytics and Data Science Program under your leadership.

Under my leadership, Valparaiso University has established a full undergraduate major in Data Science and developed two new undergraduate courses (new for Valpo), with two more coming next year. Enrollment in the graduate program (AMOD) has also increased by over 300%. We have also received a National Science Foundation grant for education research through the TRIPODS+X mechanism, one of only 4 educational grants awarded.

Can you throw light on the latest employment trends in big data and analytics industry?

Sadly, I'm probably a bit out of touch here. I keep hearing about companies from nearly every sector wanting analytics skills, but the challenge is that each segment requires very different training to stay on the edge of its trends. Much of my focus is on identifying, and then providing to students, the skills that will allow them to ride today's and tomorrow's trends equally well.

 

Guest Post: Docker and the Data Scientist

Educational institutions and educators often face a problem when it comes to creating a common platform where instructors and students can view and share code. One university in Turkey had to find a way to address a common complaint from students: their compute environments differed from the testing machine.

 

A professor at Bilkent University in Ankara, Turkey, decided to use a technology called Docker to power a web platform that creates lab instances and grades assignments.

 

So, what is Docker? We will answer that in a moment. Before Docker was available, the next best solution was to use virtual machines. However, those machines needed to be extremely powerful and consequently required expensive infrastructure that most institutions couldn't budget for. Instead, students were forced to log on to a shared server, where they inadvertently interfered with each other's programs or, worse, crashed the whole infrastructure. Needless to say, it was impractical to assign a virtual machine to each student.

 

They used Docker to build a web-based application called the Programming Assignment Grading System (PAGS). A similar technique can be adopted by other universities to create lab instances and grade assignments for data science classes.

 

Although we haven't formally defined what Docker is, the above example demonstrates what Docker can do. The rest of the article focuses on Docker and how it can transform education and the data science industry.

 

The article is divided into four sections. First, we'll start with an introduction to Docker and Docker containers. Then, we'll answer the question, "Who is Docker for?" The third part will give you an overview of how Docker is a useful tool for data scientists. In the final section, we'll dive into a couple of interesting use cases for Docker in data science. Let's get started!

What is Docker?

 

Docker is the leading software containerization platform, actively developed by Docker Inc. It is an open-source project designed to help you create, run, and deploy applications inside containers.

 

So, what is a container? A container packages an application together with all the dependencies, libraries, and other related files it needs to run. Once you've created a container for your application, you can run it on any Linux machine, regardless of how the underlying machine is configured. If the machine you're using at one end is Ubuntu and it's Red Hat at the other end, fret not: Docker is meant precisely for situations like these.

 

You can create a snapshot of a container, and this snapshot is generally known as an image. Conversely, a container can be thought of as a running instance of a Docker image. Docker images themselves are inert, immutable files. When someone asked about the difference between an image and a container on Stack Overflow, a web developer named Julian came up with a quick analogy: "The image is the recipe, the container is the cake," he said, and that just about sums it up.
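To make the image-versus-container distinction concrete, here is a minimal sketch using the Docker SDK for Python (the docker package); it assumes the SDK is installed and a local Docker daemon is running.

    import docker

    # Connect to the local Docker daemon.
    client = docker.from_env()

    # Pull an image: the inert, immutable "recipe", stored locally after download.
    image = client.images.pull("python", tag="3.11-slim")
    print(image.tags)

    # Run a container: a live "cake" baked from that recipe.
    # With the default detach=False, run() returns the container's output.
    output = client.containers.run(
        "python:3.11-slim",
        ["python", "-c", "print('hello from a container')"],
        remove=True,
    )
    print(output.decode())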

 

You can store Docker images in a cloud registry like Docker Hub. There are numerous user-contributed Docker images that cover almost all the common use cases. You can also create private Docker images and share them with your co-workers and your organization. Alternatively, you can push them to a public repository to give back to the community.

 

The concept of Docker is similar to that of a virtual machine (VM). However, virtual machines are demanding beasts and run considerably slower on less powerful hardware. Virtualization shares the physical hardware among VMs, which lets you run one or more guest operating systems inside your host operating system. But you might need to upgrade your hardware if you're seriously planning to run your software on a virtual machine.

 

Unlike a VM, Docker uses the host kernel instead of creating new kernel instances; the virtualization happens at the operating-system level rather than underneath it. A Docker container encapsulates everything required to run the application on that host machine, which greatly improves the application's performance and reduces its footprint. What gives Docker its significant lead is that it enables a separation of concerns between the infrastructure, IT operations, the developer, and the application, creating a positive environment for innovation and collaboration.

Who is Docker for?

Docker is essentially a container platform, largely aimed at businesses. It enables IT organizations to efficiently assemble and administer complete applications without the fear of infrastructure or architecture lock-in.

 

Enterprises use Docker for everything from setting up development environments to deploying applications for testing and production. When you need to build more advanced systems, like a data warehouse composed of multiple modules, containers make a lot of sense: you can save the days of work you'd otherwise spend configuring each machine.
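As a rough sketch of what a small multi-module setup could look like with the Docker SDK for Python: the images, names, network, and password below are illustrative assumptions, and a production deployment would more likely be described with Docker Compose.

    import docker

    client = docker.from_env()

    # A private bridge network so the containers can reach each other by name.
    client.networks.create("warehouse-net", driver="bridge")

    # Module 1: a Postgres database acting as the warehouse store.
    db = client.containers.run(
        "postgres:13",
        detach=True,
        name="warehouse-db",
        network="warehouse-net",
        environment={"POSTGRES_PASSWORD": "example"},
    )

    # Module 2: a throwaway job container that would load or query the warehouse,
    # reaching the database at the hostname "warehouse-db".
    output = client.containers.run(
        "python:3.11-slim",
        ["python", "-c", "print('ETL job would connect to warehouse-db here')"],
        network="warehouse-net",
        remove=True,
    )
    print(output.decode())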

 

However, the Docker platform isn't relevant to developers and enterprises alone. It's also a pretty useful tool for data scientists, analysts, and even schools and colleges. Many educational institutions and universities are keen to transform digitally but are held back by their existing infrastructure.

Docker and Data Science

Why should you use Docker if you're a data scientist? Here are three reasons pointed out by Hamel Husain over at Towards Data Science:

Reproducibility

If you are a professional data scientist, it is imperative that your work can be reproduced. Reproducibility facilitates peer review and ensures that the analysis, model, or application you have built can run unhindered, which makes your deliverables both robust and time-tested.

 

As an example, suppose you have built a Python model. Running pip freeze and transferring the resulting requirements file to a colleague has often proven not to be enough, largely because the environment involves more than just Python packages.

 

Imagine being able to avoid manually moving dependencies like the compiler, configuration files, drivers, and so on. By simply bundling everything within a Docker container, you free yourself from that juggling. This not only spares others the task of recreating your environment, it also makes your work far more accessible.
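As a hedged sketch of what that bundling can look like with the Docker SDK for Python, assume your project directory contains a Dockerfile describing the model's environment; the directory path and image tag below are hypothetical.

    import docker

    client = docker.from_env()

    # Build an image from the project's Dockerfile; the Python version, packages,
    # system libraries, and config files are all baked into the image.
    image, build_logs = client.images.build(path="./churn-model",
                                            tag="churn-model:1.0")

    # Anyone with this image can re-run the model without reinstalling anything.
    output = client.containers.run("churn-model:1.0", remove=True)
    print(output.decode())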

Ability to Port Your Compute Environment

If you are a data scientist specializing in machine learning, the ability to change your computing environment quickly and efficiently has a considerable effect on your productivity.

 

Data science work often starts with prototyping, research, and exploration, which doesn't necessarily need special computing power. That said, there often comes a stage where more powerful compute resources can significantly speed up your workflow.

 

Many data scientists find themselves limited to a local computing environment, largely because of the perceived difficulty of recreating their local setup on a remote machine. Here, Docker makes the difference: it allows you to port your working environment, including libraries, files, and so on, in just a few commands. The ability to swiftly move your computing environment is also a substantial advantage in Kaggle competitions.
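A rough sketch of that hand-off with the Docker SDK for Python: push the environment image to a registry so the bigger machine can pull it. The image and repository names are hypothetical, and this assumes you are already logged in to the registry (for example via docker login).

    import docker

    client = docker.from_env()

    # Tag the locally built environment image under a registry repository name.
    image = client.images.get("ml-env:latest")       # hypothetical local image
    image.tag("myuser/ml-env", tag="latest")         # hypothetical repository

    # Push it to the registry (e.g., Docker Hub).
    for line in client.images.push("myuser/ml-env", tag="latest",
                                   stream=True, decode=True):
        print(line)

    # On the remote machine, the same environment is then one pull away:
    #   client.images.pull("myuser/ml-env", tag="latest")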

Enhance your Engineering Skills

Once you are comfortable with Docker, you can deploy your models as containers, which makes your work readily accessible to other users. Additionally, many of the other applications you may need as part of your data science workflow are already available as ready-made Docker containers.

Use Cases for Docker in Data Science

By making your applications portable, cheaper to run, and more secure, Docker frees up time and resources that can be spent on other important things. It can help transform IT without the need to re-tool, re-educate, or re-code your existing applications, staff, or policies.

 

Here are just a few of the use cases of how Docker can help different organizations:

Docker for Education

Let's revisit the Docker use case that we discussed in the introduction. The faculty at the university used Docker to create a container to run their PAGS application. It gave students the same environment on their compute machines and the test machines without the need for a VM.

 

Docker provides a common environment that runs as a container on any given Linux machine, which all but guarantees the same results on a different machine using the same container. Without Docker, this would have required more infrastructure and resources than the university had.

 

Another particularly interesting scenario is setting up lab instances. Once a machine is configured the way you want, you can take a snapshot of it to build a Docker image. You can then pull that snapshot onto all the other lab instances, saving time and resources.
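A minimal, hedged sketch of that snapshot step with the Docker SDK for Python might look like the following; the container and image names are hypothetical.

    import docker

    client = docker.from_env()

    # A container that has been configured by hand as the lab's reference machine
    # (hypothetical name).
    reference = client.containers.get("lab-reference")

    # Commit its current state as a reusable image.
    reference.commit(repository="campus/lab-env", tag="fall-lab")

    # Every lab seat (or student) can then start an identical instance.
    client.containers.run("campus/lab-env:fall-lab", detach=True,
                          name="lab-seat-01")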

Docker for Data Science Environment Set Up

Consider a scenario where you need to explore a few data science libraries in Python or R, but without:

  1.  spending a lot of time installing either language on your machine,
  2.  figuring out which dependencies are essential, and
  3.  identifying what works best for your version of Windows/OSX/Linux.

 

This is where Docker can help.

 

Using Docker, you can get a Jupyter data science stack installed and ready to execute in no time flat: a 'plug and play' version of the stack running entirely inside a container.

 

To start, first install Docker Community Edition on your machine. Once that's done, restart the machine and set up your Jupyter container. Before running a container, you need to specify the base image for it.

 

In most cases, the image you're looking for has already been built by someone else and includes everything needed for a fully loaded data science stack of your choice. All you need to do is specify that pre-defined image and let Docker start a container from it.
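For instance, here is a short sketch using the Docker SDK for Python and the community-maintained jupyter/datascience-notebook image (one of the Jupyter Docker Stacks); treat the port mapping and the mounted folder as assumptions to adapt to your own setup.

    import docker

    client = docker.from_env()

    # Pull a community-maintained image bundling Jupyter, Python, R, and the
    # usual data science libraries.
    client.images.pull("jupyter/datascience-notebook")

    # Start it, exposing the notebook server on localhost:8888 and mounting a
    # local "work" folder so notebooks persist outside the container.
    container = client.containers.run(
        "jupyter/datascience-notebook",
        detach=True,
        ports={"8888/tcp": 8888},
        volumes={"/home/me/work": {"bind": "/home/jovyan/work", "mode": "rw"}},
    )

    # The login token appears in the container logs once the server is up.
    print(container.logs().decode())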

Conclusion

In this article, we have only seen the tip of the iceberg in terms of what can be done with Docker. We have focused on the areas of Docker that a data scientist is most likely to encounter. Below are some further resources that can help you on your journey of using and implementing Docker.

 

  1. Basic Docker Terminologies
  2. Useful Docker Commands
  3. Dockerfile Reference
  4. Pushing and Pulling to and from Docker Hub

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Guest Post: Why Teach Machine Learning?

Guest Post by Limor Wainstein–

Why Teach Machine Learning?

Teaching machines to learn about the real world has been a goal in computer science since Alan Turing first showed how to mechanise logic. But it’s only recently that affordable hardware has evolved enough speed and capacity to make the idea commercially feasible in many domains – and more than feasible, seemingly inevitable.

Machine learning, alongside its siblings in data analytics and big data, is not only fashionable, it’s where the money and jobs are, thus attracting ambitious, commercially minded students. It’s also an increasingly important tool for all sciences, promoting interest among those aiming at careers in research and academia. Andrew Ng, former chief scientist at Baidu, the giant Chinese search engine company, and adjunct professor at Stanford, has called AI and machine learning ‘the new electricity’ for its potential to apply to and revolutionize all sectors of the economy and society.

That has become apparent in the job market. Towards the end of 2017, the Financial Times noted that three out of four of the top-paying jobs in software were for expertise in “the new profession” of machine learning. Ng says that the two biggest challenges for machine learning are acquiring the vast amounts of data required and finding skilled workers. Of the two, he said, the skill shortage is the biggest problem. Some entire job sectors, such as high frequency trading, are now entirely dependent on machine learning, and financial technology as a whole is moving in that direction rapidly. For example, J. P. Morgan recently issued a 280-page report on data analysis and machine learning in finance, focusing on the skills it needs to hire in large numbers – numbers that don’t exist.

Other highly prominent machine learning domains exist alongside financial technology, for example autonomous vehicles and medical diagnosis. Overtly AI-dominated companies like Google, Tesla, and IBM are adept at garnering publicity. Such high-profile efforts mask the huge number of more mundane machine learning tasks that exist in every industry. Amazon, for example, uses machine learning across its entire retail system, from web interface to warehousing, packaging, and delivery. Every company that operates with data at scale in retail has to follow those examples to compete.

Energy companies use machine learning to predict and manage supply and demand. Airlines manage pricing and route loading through machine learning. New medicines are developed using machine learning, and health services marshal their resources in response to short- and long-term trends in demand, tracked and predicted by machine learning. Agriculture, ditto. In fact, it's hard to find any area untouched by machine learning – even theology is in on the trend, with Murdoch University in Perth using machine learning to analyze ancient Thai palm-leaf texts on Buddhist doctrines. The new electricity, indeed.

So, what is machine learning?

Machine learning is a subset of artificial intelligence, but is mercifully free of the philosophical and near-religious arguments of some AI research. Instead, machine learning is simple to define and has well-defined tools, techniques and goals, and an ever-expanding field of practical applications.

Machine learning is the application of algorithms and techniques to data sets in order to find whether certain patterns exist. Whether this includes data acquisition and cleaning before analysis, or decision-making afterwards, depends on how tightly you want to draw the definition. All of these things are important in practical machine learning-based applications but are usually domain specific. However, the core of machine learning isn’t domain specific and can be applied very widely. This has led it to be taught as a self-contained field.
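To ground that definition, here is a minimal Python sketch using scikit-learn's bundled iris data set: fit a model to data, then check whether the patterns it found hold on examples it has not seen. The particular model and split are arbitrary illustrative choices.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # A small, classic data set: flower measurements and species labels.
    X, y = load_iris(return_X_y=True)

    # Hold out a test set so we measure generalization, not memorization.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Fit a model to the training data...
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # ...and evaluate the patterns it found on data it has never seen.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))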

Machine learning is inherently cross-disciplinary, and this is the greatest challenge in teaching the subject. There is a huge and unavoidable mathematical component, involving statistics, predicate calculus, linear algebra, and related concepts. This can come as a shock to computing students who have successfully minimized their exposure to such mathematical ideas up to that point. Computing skills are equally important, as machine learning involves the efficient manipulation of large and disparate data sets through complex transformations, often in highly parallel environments. With many practical machine learning applications bounded by hardware limitations, a deep understanding of system architecture and its practical consequences is also necessary. These facts will come as an equal shock to students in statistical machine learning courses who have avoided significant programming or hardware experience. A good machine learning practitioner needs to be fluent not only in programming but in systems architecture and data design. In addition, the practitioner needs to understand which of the many mathematical techniques to apply to a particular problem and how to apply them correctly.

In a real-life work environment, a data scientist or data engineer will typically find machine learning techniques useful; she may even require them to excel at her job. For example, she may need to design algorithms to search data, use patterns in data to make decisions and predictions, or apply other techniques, such as smart sorting or fuzzy logic, to prepare and manipulate data. These skills are at the heart of modern data science. It is clear, therefore, that a serious data science program should provide solid coverage of machine learning skills and techniques.

How should you teach it?

Picking the exact mix of tools, languages, and technologies for a course is to some extent a secondary issue, and can easily be based on what resources and skills are available to best match your choice of syllabus, project work and structure. Machine learning is a product of the Internet age and as such has a continuing evolution of best practice in its DNA. Checking out – and participating in – online machine learning communities such as Kaggle is one of the best ways to ensure alignment between machine learning teaching and actual student needs.

As with any subject, some students will have the skills, interest, or previous experience to easily follow one or both of the two major prongs of machine learning (the mathematical and the computational). Most will not. But teachers of machine learning have an advantage over their mathematics and computer science colleagues: they can use each prong to illustrate and contextualise the other. Students who experience a curriculum where each is taught independently often have problems, and this has unfortunately been common. On discussion boards where experienced ML practitioners advise students, disheartening comments abound.

Calvin John, an autonomous vehicle researcher, warned on Quora of his experience with a “…horrible textbook… very little conceptual basis for the theorems… bunch of isolated problems which were crudely connected in a very disjointed way”. Modern machine learning teaching is developing rapidly. Like many new interdisciplinary subjects, machine learning may be taught by different faculties, where each faculty is led by its own approach without relating to the needs of the other disciplines involved.

Andy J. Ko, program chair of informatics at the University of Washington, also discusses teaching machine learning in his essay "We need to learn how to teach machine learning" (August 21, 2017). He says: "We still know little about what students need to know, how to teach it, and what knowledge teachers need to have to teach it successfully." He also points out the wide range of student abilities and experience among those interested in machine learning, coming not only from previous undergraduate courses but from MOOCs and burgeoning commercial self-teaching products. He nevertheless advocates the adoption of good pedagogical tools: evolving analogies and practical examples that combine theory and practice. It's important, he says, to understand which concepts will be particularly difficult and to recognize what ideas, good and bad, students bring with them.

It's in the practical examples that machine learning teachers have the greatest chance to equip students with a good, broad, and deep understanding of the field. Machine learning's expanding applicability offers many choices: machine vision, text mining, and natural language processing are popular examples. The topic chosen should suit the project work across the syllabus. A judicious introduction of new mathematical ideas alongside practical examples, or of practical problems that lead to theoretical insights, can reinforce student appreciation of the whole.

Here are some additional resources that discuss teaching machine learning:

A worked ML curriculum bringing together best-of-breed MOOC courses.

Another site that has several courses, including MOOCs and other deep-learning topics is fast.ai.

(They also have an interesting brief post on adding data science to a college curriculum)

This was a guest post by: Limor Wainstein

Limor is a technical writer and editor with over 10 years’ experience writing technical articles and documentation for various audiences, including technical on-site content, software documentation, and dev guides. She holds a BA in Sociology and Literature and is an MA student in Science, Technology, Society (STS) at Bar-Ilan University. Limor is focusing her studies on the sociology of technology and is planning her research around coworking spaces in Israel.

Advice for Project-Based Courses

This post will be part of a series on teaching a project-based “Introduction to Data Science” course. It is also part of my collection of resources and materials on teaching this type of course, which can be found on its own section of the blog here.

Here I will focus on summarizing some of the advice I received when designing the course; it should be generally applicable to ANY project-based course, not just an "Intro" course.

By far, the biggest, most important piece of planning/advice is:

Find clients who understand (or work to understand) working with students, and are willing to actually commit time to the project.

Based on advice, conversations, and experience, this commitment doesn't need to be enormous: approximately one (1) hour per week of meeting/interacting with the students through a CONSISTENT liaison, and reasonable responsiveness otherwise (to emails, phone calls, etc.). Remember when establishing this expectation that the client is receiving many times their investment in work-hours. In a 3-person group with 3-5 hours/week per student, one hour of an employee's (and the students') time in a meeting still yields 6-12 additional hours of (cheap/free) work each week. Why is this SO important?

  • For Project Success:
    1. Makes sure the client will fulfill their end of the agreement (data, etc.) by personalizing it and identifying the responsible party at the client
    2. Makes sure the students are actually working on things the client wants to see happen
    3. Makes sure students feel obligated to keep working on the project throughout the semester (avoiding last-minute crams)
  • For Learning Success:
    1. Gets students practice talking about their work/data with someone who is not a data science expert
    2. Provides regular check-ins and reporting so that the project can't derail (similar to the above)
    3. Helps students stay out of the "weeds" of project improvement or code writing by reminding them regularly of the larger picture

Digging a little deeper, some valuable ideas came from a conversation with Joseph Mertz of Carnegie Mellon University (CMU), who has run or participated in CMU's team-based and capstone project courses for nearly two decades. He suggested that client commitment can be secured in several ways. One is charging per project, but another is "formalizing" the whole project selection process, for example by requiring a kick-off event and a concluding celebration to which other students and faculty are invited. One truly interesting tidbit he offered was the suggestion to have the final project presentation given by the client, focusing on the value they received from the project. This can really increase the long-term impact for the students and your colleagues, and it might also increase the client's long-term satisfaction.


This is getting long, so here’s a quick list of other advice (to be expanded into other posts later perhaps):

  • Start looking for projects early. Way earlier than you think (perhaps a full semester or more!)
  • Be sure to clearly manage the expectations clients have for their projects (especially compared to your own expectations). This also relates back to the idea above of having clients that understand what it means to work with a student group.
  • Consider carefully your learning objectives and how they relate to the projects/clients.
    • Do you want students to get the full experience of project scoping/design?
    • Do you want clients to arrive with a well-formed idea or specific question to be answered (simplifying the above)?
    • Should incoming data be clean already (more algorithm/presentation design), raw (the whole process), or even missing (generation/collection of data)?
  • When designing YOUR deadlines and expectations, remember that sometimes clients are hard to work with.
    • Are you (the students' professor) the 'real' client: hard deadlines, clear project expectations, etc.?
    • Or is the client/contact the 'real' client: big flexibility on specific deliverables, deadlines, etc. (within reason)?

NASEM Webinar 1: Data Acumen

This webinar aimed to discuss how to build undergraduates' "data acumen". If acumen isn't a word you use regularly (it wasn't for me before last year), it means "the ability to make good judgments and quick decisions". Data acumen, therefore, is the ability to make good judgments and quick decisions with data. Certainly a valuable and important skill for students to develop! The webinar's presenters were Dr. Nicole Lazar of the University of Georgia and Dr. Mladen Vouk of North Carolina State University. Dr. Lazar is a professor of statistics; Dr. Vouk is a distinguished professor of computer science and the Associate Vice Chancellor for Research Development and Administration.

Overall, this webinar seemed largely a waste of time if your goal was to understand what activities, curricular designs, and practices help students develop data acumen. (See my last paragraph for a suggested alternative.) On the other hand, if you'd like a decent description of the design and implementation of a capstone course, and of the process of scaling one, listen to Dr. Lazar's portion. If you still need an overview of the state of data science, then Dr. Vouk's portion provides reasonable context. The most valuable things in the entire webinar were slides 26 and 27 (around minute 48). Slide 26 shows an excellent diagram of an "End-to-End Data Science Curriculum" that articulates reasonably well how a student might mature (and thereby gain data acumen); see the figure below. Slide 27 provides well-articulated learning objectives for core, intermediate, and advanced data science courses (see the list below).

From the NASEM Data Acumen Webinar: North Carolina State University's Curriculum Vision
  • Core
    • Able to master individual core concepts within Bloom's taxonomy:
      Knowledge, Comprehension, Application, Analysis, Evaluation, and Synthesis
    • Able to adapt previously seen solutions to data science problems for target domain-focused applications utilizing these core concepts
  • Intermediate Electives
    • Able to synthesize multiple concepts to solve, evaluate and validate the proposed data science problem from the end-to-end perspective
    • Able to identify and properly apply the textbook-level techniques suitable for solving each part of the complex data science problem pipeline
  • Advanced Electives
    • Able to formulate new domain-targeted data science problems, justify their business value, and make data-guided actionable decisions
    • Able to research the cutting edge technologies, compare them and create the optimal ones for solving the DS problems at hand
    • Able to lead a small team working on the end-to-end execution of DS projects

 

An Alternative to the NASEM Webinar

While I found this particular webinar largely a waste of time, I also attended the NASEM Roundtable on "Alternative Educational Pathways for Data Science". While certainly not focused on data acumen, the first presentation at that roundtable described an excellent overall curriculum structure that does build students' data acumen. Eric Kolaczyk from Boston University described their non-traditional master's program in Statistical Practice. By integrating coursework, practicum experiences, and more, students are forced to exercise and build their ability to make good judgments about data investigations, methods, and results. The talk is well worth your time if you'd like some ideas for non-standard ways to build student skills and abilities.

Spring 2018: What’s Happening

Spring semester is off to an intense start for me! I’m again teaching an “Introduction to Data Science” using a project-based methodology. We’ve got 6 awesome projects from for-profit, government, and internal clients. There’s also plenty going on in the data science world as faculty gear up for SIGCSE (which has several data science sessions) and other conferences over the summer.

I'm going to run a series of summaries of the National Academies of Sciences, Engineering, and Medicine "Webinar Series on [Envisioning] Data Science Undergraduate Education". If you weren't able to watch the webinars this fall (I wasn't!), I'll be watching them, summarizing the general content, and pointing out useful highlights to take away. I'm hoping to get one out about every week (no promises though!)

You can find the summaries under the category “Webinar Summaries” and I’ll also tag them with NASEM Webinar. If there’s some pressing question you’d love to see a post on, let me know!

Big Month in Data Education — October

October has been an incredibly busy month! I’ve been traveling a lot, taking part in a wide variety of activities around data science education. It’s been a pretty big month and I’m here to give you a very quick run-down of what’s been happening!

The month kicked off with the Midwest Big Data Hub's "All-hands on Deck" meeting. I was invited as part of a planning grant the hub had received to develop a spoke proposal to create a "Resource Center for Non-R1 Universities". The meeting was very interesting, and we got to hear about some really neat work on using data science to advance agriculture, smart cities, and more. The most relevant part for data science education, though, was the final panel, "Education and Workforce Development." Panelists included Jim Barkley, David Mongeau, and Renata Rawlings-Goss. You can find their slides on the Midwest Big Data Hub site (Barkley Slides, Mongeau Slides, Rawlings-Goss Slides). There is also a video recording of the panel here. The other important event at the meeting was the afternoon grant-planning session. While I can't share documents from that yet, I left very excited about the possibility of establishing an important center for data science education that would help address the needs of non-R1 institutions. Some of the ideas shared included providing a clearinghouse for internships and project opportunities, connecting smaller institutions with interesting research projects, and helping programs find instructional expertise for more esoteric courses.

Mid-month (October 20th), the National Academies held their 4th roundtable on Data Science Education, "Alternative Institutional and Educational Mechanisms". You can find the webcast and agenda webpage here. I attended as a member of the public and was able to contribute a few ideas and questions. There were several great presentations, and some perspectives on education I hadn't considered were definitely presented. Eric Kolaczyk gave a great presentation describing a very nicely integrated learning pathway for building data expertise at the master's level, the MS in Statistical Practice. It is one of the few programs I know of (now) that actually redesigned several of its courses to create a more effective data science education and a cohesive learning structure. It was also very informative to hear about Metis's data science "bootcamps". It's pretty clear Metis is doing some excellent education work in data science, but of a very different kind from traditional, academic education. Additional talks worth listening to were Andrew Bray explaining the origin and evolution of the American Statistical Association's DataFest events, Ron Brachman describing Cornell Tech's 'entrepreneurial'-focused data science, and Catherine Cramer discussing the New York Hall of Science's Network Science education initiatives (I plan to use some of this material with my students who do network science research!).

Additionally, the National Academies have released an interim report from the ongoing "Envisioning the Data Science Discipline" study. The report is definitely worth reading and provides some very interesting views and findings. There's also a strong call for community input, so send your ideas in!

The last activity I participated in during October was the South Big Data Hub's workshop "Keeping Data Science Broad: Workshop on Negotiating the Digital and Data Divide". This workshop was an incredible pleasure to join! I think the best part was that, with the entire room filled with people who had already been thinking about what data science and data science education might look like, we were frequently able to move beyond the "what is data science" discussion. That meant we could really start discussing the roadblocks and opportunities inherent in data science. While I can't share more of the actual outcomes/products from the workshop yet, we've got a really aggressive schedule to turn the output into a report (due Dec 6th!). I'm hopeful that something really powerful will come out of it. I know a lot of writing was accomplished while we were there (I wrote 5-6 pages, and others did too), so look for another announcement of a report in early December.

Finally, while I haven't been participating in or watching them much yet, I need to mention the ongoing webinar series being run by the National Academies. You can find the entire webinar series here. October saw four webinars posted: "Communication Skills and Teamwork", "Inter-Departmental Collaboration and Institutional Organization", "Ethics", and "Assessment and Evaluation for Data Science Programs". I'm still hoping to watch these and provide summary posts… but that hasn't happened yet. If any of my readers have been watching them and would like to write a guest post with a summary, please get in touch!

Webinar Summary: Data Science Education in Traditional Contexts

Introduction

This post is a summary of and reflection on the webinar "Data Science Education in Traditional Contexts". The webinar was hosted on Aug 28th by the South Big Data Innovation Hub as part of their Keeping Data Science Broad: Bridging the Data Divide series. You can watch the entire webinar here. The webinar consisted of 5 speakers and a discussion section. I've provided a short summary of each panelist's presentation and of the questions discussed at the end. The speakers, in order, were:

  • Paul Anderson, College of Charleston
  • Mary Rudis, Great Bay Community College
  • Karl Schmitt, Valparaiso University
  • Pei Xu, Auburn University
  • Herman “Gene” Ray, Kennesaw State University

Summary of Presentation by Paul Anderson, College of Charleston

The first speaker was Paul Anderson, Program Director for Data Science at the College of Charleston. His portion of the presentation runs from 0:01:50-0:13:45 and expands on three challenges he has experienced: (1) being an unknown entity, (2) recruiting, and (3) designing an effective capstone. His first point, being an unknown entity, impacts a broad range of activities related to implementing and running a data science program. It can be a challenge when trying to convince administrators to support the program or new initiatives (such as external collaborations), and it means that other disciplines may not be interested in developing joint coursework (or approving your curricular changes). His second point covered what he has learned from several years of working on recruitment. His first observation here ties back to his first overall point: if your colleagues don't know what data science is, how are most high school students (or even your own students) to know? This has given him limited success with direct recruitment from high schools. Instead, he has focused on retooling the program's Introduction to Data Science course to be a microcosm of the entire program, both in process and in rigor. He has also worked to make the program friendly to students switching majors or double majoring by keeping prerequisites limited. His final portion discussed the various forms of capstone experiences Charleston has experimented with, starting from 1-to-1 student-faculty project pairs and moving to more group-based projects under a general faculty mentorship model. If you are considering including a capstone experience (and you should!), it's probably worth listening to this portion. However, not all colleges or universities will have sufficient students/faculty to move to their final model.

Summary of Presentation by Mary Rudis, Great Bay Community College

The second speaker was Mary Rudis, Associate Professor of Mathematics at Great Bay Community College. Her portion runs 0:14:25-0:19:19 and 0:20:46-0:29:08. A significant part of her presentation outlines the large enrollment and performance gaps for non-white and first-generation college students. Dr. Rudis saw building both an Associate Degree in Analytics and a Certificate in Data – Practical Data Science as the best way to combat these gaps. In researching the state of jobs and education, she found that community college students were struggling to compete for the limited internships and entry-level job opportunities available in data science compared to 4-year college students (like local M.I.T. students). Most companies were looking for master's-level education or significant work experience in the field. To help her students succeed, she built an articulation program with UNH-Manchester so that, upon final graduation, students originally enrolled at GBCC would be fully qualified for the current job market.

Summary of Presentation by Karl Schmitt, Valparaiso University

The third speaker was Karl Schmitt, Assistant Professor of Mathematics and Statistics, Affiliate Professor of Computing and Information Sciences, and Director of Data Sciences at Valparaiso University. His presentation runs from 0:30:30-0:45:20. The core of the presentation expanded on Dr. Anderson's first point about data science being an unknown entity. He offered ideas about how to differentiate data-focused programs from one another, both among similar programs within a college/university and relative to programs at other institutions. Valparaiso has 6 data-focused programs.

His talk described how the programs can be differentiated in terms of the data user/professional each program trains, and also in terms of course content and focus. He also talked about how Valpo is differentiating its program from other schools' with a focus on Data Science for Social Good. This has been achieved in part by seeking partners from the government and non-profit sectors, rather than traditional industry partners.

Summary of Presentation by Pei Xu, Auburn University

The fourth speaker was Pei Xu, Assistant Professor of Business Analytics at Auburn University. Her portion of the presentation runs from 0:46:05-0:57:55 and describes Auburn's undergraduate Business Analytics degree. Auburn's curriculum is designed around the data science process of Problem Formulation -> Data Prep -> Modeling -> Analysis -> Presentation. Each of the core classes covers 1-2 stages of this process, with the specialized degree courses typically beginning in a student's sophomore year. The program also actively engages many businesses to visit and provide information sessions. Dr. Xu detailed 4 challenges she has faced related to the program. First, she has found it hard to recruit qualified faculty to teach the courses, which she has overcome by hiring progressively over the last few years. Second, many students are turned away by the highly quantitative and computational nature of the program; this has been addressed by placing a stronger emphasis on project-based learning and more on interpretation than on developing new methods. Third, she discussed how many of the core courses in the program overlap significantly; for example, courses in several different areas all need to discuss data cleaning/preparation. Auburn's faculty have spent significant curriculum-development time discussing and planning exactly what content is duplicated and where. Finally, deciding among the various analytics tools for both the general curriculum and specific classes has proved challenging (you can see my extended discussion of Python, R, and other tools here).

Summary of Presentation by Herman “Gene” Ray, Kennesaw State University

The fifth speaker was Herman "Gene" Ray, Associate Professor of Statistics and Director of the Center for Statistics and Analytics Research at Kennesaw State University. His presentation runs from 0:58:36-1:07:35 and focuses on KSU's Applied Statistics Minor. KSU's program focuses strongly on domain areas, with most courses including a high level of application and some type of experiential learning opportunity. Additionally, almost all their courses use SAS, while also introducing students to a full range of data science software and tools. The first experiential learning model KSU uses integrates corporate data sets and guided tasks from business. The second model is a 'sponsored research class', with teams of undergraduates led by a graduate student working on corporation-provided problems or data. Gene provided extended examples involving an epidemiology company and Southron Power Company. The key benefits KSU has seen are that students receive real-world exposure and practice interacting with companies, sometimes even receiving awards, internships, and jobs. The largest challenge of this experiential learning model is that it requires a significant amount of time: first to develop the relationships with companies, then to manage corporate expectations, and finally to actually execute the projects, for both faculty and students.

Additional Webinar Discussion

The additional discussion begins at 1:08:32. Rather than summarize all the responses (which were fairly short), I'm simply going to list the questions in the order they were answered and encourage interested readers to listen to that portion of the webinar or stay tuned for follow-up posts here.

  1. What can High Schools do to prepare students for data science?
  2. What sort of mix do programs have between teaching analysis vs. presentation skills?
  3. Is it feasible for community colleges to only have an Introduction to Data Science course?
  4. How have prerequisites or program design affected diversity in data science?
  5. How is ethics being taught in each program? (and a side conversation about assessment)

Keeping Data Science Broad-Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies about data science degrees and curricula being developed by primarily undergraduate serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce but they can have different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

A Computational Linear Algebra Course

Within mathematics, Linear Algebra (LA) has long held a central importance. Many curricula used it for decades as the first class in which students encountered proofs (though this has changed in recent years for a significant portion of programs). Many other disciplines, like meteorology and much of engineering, require at least some course in basic matrix-based mathematics. This is especially true for data science, which relies heavily on linear algebra for data manipulation and decomposition algorithms. Most practitioners and instructors would agree on the importance of the topic, but what exactly should students be learning in that course (or courses)?

This is a challenging question, made even more difficult if LA actually is a mathematics program's introduction to proofs for majors. Generally speaking, the disciplines that use mathematics as a tool don't particularly value this proof-based approach. Additionally, traditional proof-based mathematics is almost inherently non-computational, in the sense that very few proofs of traditionally taught concepts require a computer or computations too complex to do by hand. This leads educators to spend significant portions of a course teaching things like row operations, which are then executed by hand, and that creates a (potentially) deep disconnect between many of the concepts and skills learned and the actual application of LA to solve problems.
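To illustrate the computational side of that gap, here is a minimal Python sketch using NumPy (the small system of equations is made up): the library performs the elimination and factorization, and the interesting questions shift to conditioning, scale, and interpretation.

    import numpy as np

    # Solve Ax = b without ever writing out an augmented matrix by hand.
    A = np.array([[3.0, 2.0, -1.0],
                  [2.0, -2.0, 4.0],
                  [-1.0, 0.5, -1.0]])
    b = np.array([1.0, -2.0, 0.0])
    x = np.linalg.solve(A, b)
    print("solution:", x)          # expected: [ 1. -2. -2.]

    # Conditioning indicates how much to trust that solution numerically.
    print("condition number:", np.linalg.cond(A))

    # The decompositions data science leans on (here, the SVD) are one call away.
    U, s, Vt = np.linalg.svd(A)
    print("singular values:", s)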

Recognizing this disconnect, I've long wanted to develop a "Computational Linear Algebra" course, one that potentially builds on a traditional LA course: a course that takes all the basic linear algebra but moves it into the computational realm, highlighting key applications and algorithms. I haven't had that chance, but this week a colleague forwarded me a blog post that got me revved up about the idea again. Jeremy Howard and Rachel Thomas of fast.ai have just released a new course that exemplifies it.

The course takes a non-traditional (for math) approach to learning, focusing on a "try it first" mentality. This sort of idea has a lot of support within CS as an alternative way to teach introductory programming. So, while it might seem a bit unusual for a math course, in the crossover world between mathematics and computer science (where the topic lives) it makes a ton of sense. Rachel does a great job of motivating and explaining their approach in this other blog post from fast.ai.

I have not yet had the time to dive into their materials, but I will report back when I do. Or, feel free to contact me if you try their materials in a course (good or bad!)