Webinar Summary: Data Science Education in Traditional Contexts

Introduction

This post is a summary and reflection on the webinar “Data Science Education in Traditional Contexts”. The webinar was hosted on Aug 28th by the South Big Data Innovation Hub as part of their Keeping Data Science Broad: Bridging the Data Divide series. You can watch the entire webinar here. The webinar consisted of five speakers and a discussion section. I’ve provided a short summary of each panelist’s presentation and the questions discussed at the end. The speakers, in order, were:

  • Paul Anderson, College of Charleston
  • Mary Rudis, Great Bay Community College
  • Karl Schmitt, Valparaiso University
  • Pei Xu, Auburn University
  • Herman “Gene” Ray, Kennesaw State University

Summary of Presentation by Paul Anderson, College of Charleston

The first speaker was Paul Anderson, Program Director for Data Science at the College of Charleston. His portion of the presentation runs from 0:01:50-0:13:45 and expands on three challenges he has experienced: (1) being an unknown entity, (2) recruiting, and (3) designing an effective capstone. His first point, being an unknown entity, impacts a broad range of activities related to implementing and running a data science program. It can make it hard to convince administrators to support the program or new initiatives (such as external collaborations). It also means that other disciplines may not be interested in developing joint course work (or approving your curricular changes). His second point covered what he’s learned from several years of working on recruitment. His first observation here ties to his first overall point: if your colleagues don’t know what data science is, how are most high school students (or even your own students) to know? This has led him to have limited success with direct recruitment from high schools. Instead, he’s focused on retooling the program’s Introduction to Data Science course to be a microcosm of his entire program, both in terms of process and rigor. He’s also worked to make his program friendly to students switching majors or double majoring by having limited prerequisites. His final portion discussed the various forms of capstone experiences Charleston has experimented with, starting from an initial 1-to-1 student-faculty project pairing and moving toward a more group-based format with a general faculty mentorship model. If you are considering including a capstone experience (and you should!), it’s probably worth listening to this portion. However, not all colleges or universities will have sufficient students/faculty to move into their final model.

Summary of Presentation by Mary Rudis, Great Bay Community College

The second speaker was Mary Rudis, Associate Professor of Mathematics at Great Bay Community College. Her portion runs 0:14:25-0:19:19 and 0:20:46-0:29:08. A significant portion of her presentation outlines the large enrollment and performance gap of non-white and first-generation college students. Dr. Rudis saw building both an Associate Degree in Analytics and a Certificate in Data – Practical Data Science as the best way to combat these gaps. In researching the state of jobs/education, she found that community college students were struggling to compete for the limited internships and entry-level job opportunities available in data science, compared to 4-yr college students (like local M.I.T. students). Most companies were looking to hire people with a Master’s-level education or significant work experience in the field. To help her students succeed, she built an articulation program with UNH-Manchester so that upon final graduation, students originally enrolled at GBCC would be fully qualified for the current job market.

Summary of Presentation by Karl Schmitt, Valparaiso University

The third speaker was Karl Schmitt, Assistant Professor of Mathematics and Statistics, Affiliate Professor of Computing and Information Sciences, and the Director of Data Sciences at Valparaiso University. His presentation runs from 0:30:30 – 0:45:20. The core of the presentation expanded on Dr. Anderson’s first point about data science being an unknown entity. He sought to provide ideas about how to differentiate a data science program from similar programs, both within a single college or university and relative to programs at other institutions. Valparaiso has 6 data-focused programs:

His talk described how the programs can be differentiated in terms of the data user/professional that the program trains, and also in terms of course content and focus. He also talked about how Valpo is differentiating its program from other schools with a focus on Data Science for Social Good. This has been achieved in part by seeking industry partners from the government and non-profit sectors, rather than traditional industrial partners.

Summary of Presentation by Pei Xu, Auburn University

The fourth speaker was Pei Xu, Assistant Professor of Business Analytics, Auburn University. Her portion of the presentation runs from 0:46:05 – 0:57:55 and describes Auburn’s undergraduate Business Analytics Degree. Auburn’s curriculum is designed around the data science process of Problem Formulation -> Data Prep -> Modeling -> Analysis -> Presentation. Each of the core classes covers 1-2 stages of this process, with the specialized degree courses typically beginning in a student’s sophomore year. Their program also actively engages many businesses to visit and provide information sessions. Dr. Xu detailed 4 challenges she’s faced related to their program. First, she has found it hard to recruit qualified faculty for teaching courses, which she’s overcome by progressively hiring over the last few years. Second, she has found that many students are turned away by the highly quantitative and computational nature of the program. This has been addressed by building a stronger emphasis on project-based learning and on interpretation rather than innovative process development. Third, she discussed how many of the core courses in their program overlap significantly. For example, many courses in different areas all need to discuss data cleaning/preparation. Auburn’s faculty has spent significant curriculum development time discussing and planning exactly what content is duplicated and where. Finally, deciding between the various analytics tools for both the general curriculum and specific classes has proved challenging (you can see an extended discussion by me of Python/R and others here).

Summary of Presentation by Herman “Gene” Ray, Kennesaw State University

The fifth speaker was Herman “Gene” Ray, Associate Professor of Statistics and Director for the Center for Statistics and Analytics Research, Kennesaw State University. His presentation is from 0:58:36 – 1:07:35 and focuses on KSU’s Applied Statistics Minor. KSU’s program strongly focuses on domain areas, with most courses including a high level of applications and experiential learning opportunities. Additionally, almost all their courses use SAS in addition to introducing their students to a full range of data science software/tools. The first experiential learning model KSU uses is an integration of corporate data-sets and guided tasks from business. The second model is a ‘sponsored research class’ with teams of undergraduates led by a graduate student on corporation provided problems or data. Gene provided extended examples about an epidemiology company and about Southron Power Company. The key benefits KSU has seen are that students receive real world exposure, practice interacting with companies, and potentially even receive awards, internships, and jobs. The largest challenge to this experiential learning model is that it requires a significant amount of time from both faculty and students: first to develop the relationships with companies, then to manage corporate expectations, and finally to actually execute the projects.

Additional Webinar Discussion

The additional discussion begins at 1:08:32. Rather than summarize all the responses (which were fairly short), I’m simply going to list the questions in the order they were answered, and encourage interested readers to listen to that portion of the webinar or stay tuned for follow-up posts here.

  1. What can high schools do to prepare students for data science?
  2. What sort of mix do programs have between teaching analysis vs. presentation skills?
  3. Is it feasible for community colleges to only have an Introduction to Data Science course?
  4. How have prerequisites or program design affected diversity in data science?
  5. How is ethics being taught in each program? (and a side conversation about assessment)

Keeping Data Science Broad-Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies about data science degrees and curricula being developed by primarily undergraduate serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce but they can have different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive posting of sites that have curriculum available, instead it’s intended to help reduce your search time for this kind of material.

If you are looking to find wholesale curriculums, including course materials, there are a few options available to start the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant funded with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc. To understand their vision, see the image below.

This idea of continual improvement was more than just a grant-seeking ploy, as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected in July 2017.

Overall, their curriculum structure (I haven’t evaluated their deeper content) has a fairly high emphasis on computational topics, with less emphasis on statistical/mathematical underpinnings. You can experience their curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable, though, is their actual grant’s deliverables. These deliverables include details on the overall design principles in their structure with learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major topic. However, this should be done with proper attribution and license checking (of course).

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found a data science consulting firm and other initiatives. While not every link on the site is actually to a free resource (there are several books to buy, etc.), it does a pretty nice job of highlighting the topics that will need to be covered (if possible), and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is Python-focused; however, the site also has a collection of R resources.

Corthell isn’t the only one, though, with an “open source” or “free” data science (masters) degree. Another collection of relatively similar material was assembled by David Venturi, who’s now a content developer at Udacity (writing data science curriculum, of course). For those designing curriculums, both Corthell and Venturi provide excellent resources and places to frame your learning. However, if you hit this page trying to get into data science, read this Quora post that I think accurately highlights the challenges of learning from/with these open source programs.

Another similar alternative, which I’d peg closer to an undergraduate degree, is the Open Source Society University’s data science curriculum. Their curriculum assumes a lot less prior knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in line with the recommendations for curriculum from the Park’s paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required to learn from each course. You’ll see a large repetition of topics, but the OSS-Univ’s curriculum has a lot less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer duplicates of a significant subset of classes from these curriculums. While emulation might be the highest form of praise, we’ll each need our own, unique take on these courses while striving for sufficient similarity to have a semi-standardized knowledge base for practitioners. Good luck!

 

Why an Undergraduate Data Science Degree?

The job ‘Data Scientist’ was heralded as “The Sexiest Job of the 21st Century” by Harvard Business Review in 2012[1] at a crest of the ongoing publicity in the career fields associated with ‘big data.’ Articles on both the discipline and reality regularly appear in a variety of popular press outlets, including The Economist[2] and The New York Times[3], concurrently with growing discussion in more scholarly venues. The increased need for this specialty is driven by the fact that human activity is already generating petabytes of data each day and “data is projected by some experts to increase by 2,000 percent between now and 2020”[4]. Society will need more professionals and researchers capable of competently dealing with the huge influx of data that will be accumulated in the next decade and onward.

All this is great, and certainly helps motivate the creation of an undergraduate degree in data science (the language above came from our internal proposal), but it’s not what actually inspired me to start the process. That came from two sessions at SIGCSE 2014[5]. The first was on a paper by Paul Anderson, James Bowring, Renee McCauley, George Pothering and Christopher Starr titled: “An Undergraduate Degree in Data Science: Curriculum and a Decade of Implementation Experience” (DOI: http://dx.doi.org/10.1145/2538862.2538936 also linked on the resource pages). The other was a panel session, “Data Science as an Undergraduate Degree” with Paul Anderson, James McGuffee and David Uminsky (DOI: http://dx.doi.org/10.1145/2538862.2538868). At these sessions I got to hear what the College of Charleston (Paul Anderson) and the University of San Francisco (David Uminsky) were doing with undergraduate degrees. And it sounded like things that Valparaiso University was already offering, with the exception of perhaps an introductory course in data science. Moreover, it sounded like exactly the sort of degree I wish I’d been able to take as an undergraduate!

However, being able to actually follow through with offering the program had more to do with several additional factors (besides the excitement). Before diving further into the process of actually creating the curriculum and elements, I want to discuss what made Valpo ready to start a Data Science degree so you can evaluate for yourself if it’s even feasible…

Valparaiso University already had…

  • A large Mathematics & Statistics Department (14 tenured/tenure-track faculty, two full-time lecturers, and adjuncts).
  • Significant faculty experience in operations research, graph theory and scientific computing
  • A deep statistics curriculum including an actuarial science major
  • A complete computer science degree, covering all the basics
  • The Mathematics & Statistics Department and Computing & Information Sciences department had only recently split into two departments, so still had very strong communication and ties together.
  • A large, top-ranked college of engineering requiring more frequent offerings of mathematics and statistics electives, partially populated by engineers.
  • A master’s degree in Information Technology with 150-250 students, regularly offering courses in data mining, and information management systems (databases).
  • A master’s degree in Analytics and Modeling, where many of the courses were cross-listed with undergraduate courses

Together these factors allowed Valpo to start the new degree with very minimal curricular changes or additions, which is not something feasible at most schools. Now, you certainly don’t need all of these factors to start your own program, but you’ll probably at least need strong mathematics, statistics and computer science departments with good, clear communication between them. The rest just makes it easier.


[1] https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

[2] 5,300 search hits for “Big Data” in the print edition of the Economist. www.economist.com

[3] 304 search hits for “Big Data” articles in the last 12 months of NY Times. www.nytimes.com

[4] http://www.wired.com/2015/01/a-new-generation-of-data-requires-next-generation-systems/

[5] SIGCSE refers to the Association for Computing Machinery (ACM)’s Special Interest Group on Computer Science Education. Specifically, SIGCSE is usually used to refer to the group’s annual Technical Symposium, typically held in early March.