Tagged data science education

Big Month in Data Education — October

October has been an incredibly busy month! I’ve been traveling a lot, taking part in a wide variety of activities around data science education. It’s been a pretty big month and I’m here to give you a very quick run-down of what’s been happening!

The month kicked off with the Midwest Big Data Innovation Hub’s “All-hands on Deck” Meeting. I was invited there as part of a planning grant the hub had received to develop a spoke proposal for the hub to create a “Resource Center for Non-R1 Universities”. The meeting was very interesting, and we got to hear about some really neat work on using data science to advance agriculture, smart cities and more. The most relevant for data science education though was the final panel, “Education and Workforce Development.” Panelists included Jim Barkley, David Mongeau and Renata Rawlings-Goss. You can find their slides on the Midwest Big Data Hub (Barkley Slides, Mongeau Slides, Rawlings-Goss Slides). There is also a video recording of the panel here. The other important event that happened at the meeting was the afternoon grant-planning session. While I can’t share documents from that yet, I left very excited about the possibilities of establishing an important educational center for data science education that would help address the needs of non-R1 institutions. Some of the ideas that were shared included providing a clearing house for internships and project opportunities, connecting smaller institutions with interesting research projects and facilitating finding instructional expertise for most esoteric courses.

Mid-Month (October 20th), the National Academy of Sciences’ held their 4th roundtable on Data Science Education, “Alternative Institutional and Educational Mechanisms”. You can find the webcast and agenda webpage here. I attended as a member of the public and was able to contribute a few ideas and questions. There were several great presentations and some perspectives on education I hadn’t considered were definitely presented. Eric Kolaczyk gave a great presentation that described a very nicely integrated learning pathway for building data expertise at the master’s level. The MS in Statistical Practice It is one of the few programs I know of (now) that actually redesigned several of their courses to make a more effective data science education, and cohesive learning structure. It was also very informative to hear about Metis’s data science “bootcamps”. It’s pretty clear Metis is doing some excellent education work in data science, but very different from traditional, academic education. Additional talks worth listening to were Andrew Bray, explaining the origin and evolution of the American Statistical Association’s DataFest events, Ron Brachman describing Cornell Tech’s ‘entrepreneurial’ focused data science, and Catherine Cramer discussing the New York Hall of Science‘s Network Science education initiatives (I plan to use some of this material for with my students who do network science research!).

Additionally, the National Academy of Sciences have released an interim report on the “Envisioning the Data Science Discipline” studies going on.  The report is definitely worth reading and provides some very interesting views and findings. There’s also a strong call for community input, so send you ideas in!

The last activity I participated in during October was the South Big Data Hub‘s workshop “Keeping Data Science Broad: Workshop on Negotiating the Digital and Data Divide“. This workshop was an incredible pleasure to join! I think the best part was that with the entire room filled with people who have already been thinking about what data science and data science education might look like, we were able to frequently move beyond the “what is data science” discussion. It meant that we could really start discussing the roadblocks and opportunities inherent in data science. While I can’t share more of the actual outcomes/products from the workshop yet, we’ve got a really aggressive schedule to turn the output into a report (due Dec 6th!). I’m hopeful that something really powerful will come out. I know there was a lot of writing accomplished while there (I wrote 5-6 pages, and others did too) so look for another announcement of a report in early december.

Finally, while I haven’t been participating/watching them much yet. I need to mention the ongoing webinar series being run by the National Academy of Sciences. You can find the entire webinar series here. October saw 4 webinars posters, “Communication Skills and Teamwork”, “Inter-Departmental Collaboration and Institutional Organization”, “Ethics”, and “Assessment and Evaluation for Data Science Programs”. I’m still hoping to watch these and provide summary posts… but that hasn’t happened yet. If any of my readers have been watching them and would like a guest-post with a summary, please get in touch!

Webinar Summary: Data Science Education in Traditional Contexts

Introduction

This post is a summary and reflection on the webinar “Data Science Education in Traditional Contexts”. The webinar was hosted on Aug 28th by the South Big Data Innovation Hub as part of their Keeping Data Science Broad: Bridging the Data Divide series. You can watch the entire webinar here. The webinar consisted of 5 speakers and a discussion section. I’ve provided a short summary of each panelist’s presentation and the questions discussed at the end. The speakers, in order were:

  • Paul Anderson, College of Charleston
  • Mary Rudis, Great Bay Community College
  • Karl Schmitt, Valparaiso University
  • Pei Xu, Auburn University
  • Herman “Gene” Ray, Kennesaw State University

Summary of Presentation by Paul Anderson, College of Charleston

The first speaker was Paul Anderson, Program Director for Data Science at the College of Charleston. His portion of the presentation runs from 0:01:50-0:13:45, and expands on three challenges he has experienced, (1) being an unknown entity, (2) recruiting, and (3) designing an effective capstone. His first point, being an unknown entity, impacts a broad range of activities related to implementing and running a data science program. It can cause a challenge when trying to convince administrators to support the program or new initiatives (such as external collaborations). It means that other disciplines may not be interested in developing joint course work (or approving your curricular changes). His second point discussed what he’s learned from several years of working on recruitment. His first observation here ties to his first overall point: If your colleagues don’t know what data science is, how are most high school students to know (or even your students)?. This has led him to have limited success with direct recruitment from high schools. Instead, he’s focused on retooling the program’s Introduction to Data Science Course to be a microcosm of his entire program, both in terms of process and rigor. He’s also worked to make his program friendly to students switching majors or double majoring by having limited prerequisites. His final portion discussed the various forms of capstone experiences Charleston has experimented with. Starting from an initially 1-to-1 student-faculty project pair, moving into more group-based with a general faculty mentorship model. If you are considering including a capstone experience (and you should!) it’s probably worth listening to this portion. However, not all colleges or universities will have sufficient students/faculty to move into their final model.

Summary of Presentation by Mary Rudis, Great Bay Community College

The second speaker was Mary Rudis, Associate Professor of Mathematics at Great Bay Community College. Her portion runs 0:14:25-0:19:19 and 0:20:46-0:29:08. A significant portion of her presentation outlines the large enrollment and performance gap of non-white and first generation college students. Dr. Rudis saw building both an Associate Degree in Analytics, and a Certificate in Data – Practical Data Science as the best way to combat these gaps. In researching the state of jobs/education she found that community college students were struggling to compete for the limited internships and entry-level job opportunities available in data science, compared to 4-yr college students (like local M.I.T. students). Most companies in terms of hires were looking for Master’s level education, or significant work experience in the field. To help her students succeed, she built an articulation program with UNH-Manchester so that upon final graduation, students originally enrolled at GBCC would be full-qualified for the current job market.

Summary of Presentation by Karl Schmitt, Valparaiso University

The third speaker was Karl Schmitt, Assistant Professor of Mathematics and Statistics, Affiliate Professor of Computing and Information Sciences, and the Director of Data Sciences at Valparaiso University. His presentation runs from 0:30:30 – 0:45:20. The core of the presentation expanded on Dr. Anderson’s first point about data science being an unknown entity. He sought to provide ideas about how to differentiate programs from other similar programs, both at the college/university level, but also make the programs different when looking outside his own institution. Valparaiso has 6 data-focused programs:

His talk described how the programs can be differentiated in terms of the data user/professional that the program trains, and also in terms of course content and focus. He also talked about how Valpo is differentiating its program from other schools with a focus on Data Science for Social Good. This has been achieved in part by seeking industry partners from the government and non-profit sectors, rather than traditional industrial partners.

Summary of Presentation by Pei Xu, Auburn University

The fourth speaker was Pei Xu, Assistant Professor of Business Analytics, Auburn University. Her portion of the presentation runs from 0:46:05 – 0:57:55 and describers Auburn’s undergraduate Business Analytics Degree. Auburn’s curriculum is designed around the data science process of Problem Formulation -> Data Prep -> Modeling -> Analysis -> Presentation. Each of the core classes covers 1-2 stages of this process, with the specialized degree courses typically beginning in a student’s sophomore year. Their program also actively engages many businesses to visit and provide information sessions. Dr. Xu detailed 4 challenges she’s faced related to their program. First, she has found it hard to recruit qualified faculty for teaching courses, which she’s overcome by progressively hiring over the last few years. She has also found many students to be turned away by the high quantitative and computational nature of the program. This has been addressed by building a stronger emphasis on project-based learning and more interpretation than innovative process development. Third, she discussed how many of the core courses in their program have significant overlap between courses. For example, many courses in different areas all need to discuss data cleaning/preparation. Auburn’s faculty has spent significant curriculum development time discussing and planning exactly what content is duplicated and where. Finally, deciding between the various analytics tools for both the general curriculum and specific classes has proved challenging (you can see an extended discussion by me of Python/R and others in here).

Summary of Presentation by Herman “Gene” Ray, Kennesaw State University

The fifth speaker was Herman “Gene” Ray, Associate Professor of Statistics and Director for the Center for Statistics and Analytics Research, Kennesaw State University. His presentation is from 0:58:36 – 1:07:35 and focuses on KSU’s Applied Statistics Minor.  KSU’s program strongly focuses on domain areas, with most courses having a high-level of applications included and types of experiential learning opportunities. Additionally, almost all their courses use SAS in addition to introducing their students to a full range of data science software/tools. The first experiential learning model KSU uses is an integration of corporate data-sets and guided tasks from business. The second model is a ‘sponsored research class’ with teams of undergraduates led by a graduate student on corporation provided problems or data. Gene provided extended examples about an epidemiology company and about Southron Power Company. The key benefits KSU has seen are that students receive real world exposure, practice interacting with companies, potentially even receiving awards, internships, and jobs. The largest challenge to this experiential learning model is that is requires a significant amount of time, first to develop the relationships with companies, managing corporate expectations, and finally in the actual execution of projects for both faculty and students.

Additional Webinar Discussion

The additional discussion begins at 1:08:32. Rather than summarize all the responses (which were fairly short), I’m simply going to list the questions, in-order as they were answered and encourage interested readers to listen to that portion of the webinar or stay tuned for follow-up posts here.

  1. What can High Schools do to prepare students for data science?
  2. What sort of mix do programs have between teaching analysis vs. presentation skills?
  3. Is it feasible for community colleges to only have an Introduction to Data Science course?
  4. How have prerequisites or program design affected diversity in data science?
  5. How is ethics being taught in each program? (and a side conversation about assessment)

Keeping Data Science Broad-Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies about data science degrees and curricula being developed by primarily undergraduate serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce but they can have different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

(Other) Official Curriculum Guides

Last week I discussed several places from which you could pull curriculum planning materials. This week will continue that theme, but with a bit more of an ‘official’ flavor, by discussing several professional societies’ curricular guides. While there is no (clear) leading data science professional society (and none with curricular guidelines to my knowledge), there are a few closely related societies with official guidelines. Depending on what path you took into data science, you may be more or less familiar with the following societies: Association of Computing Machinery (ACM), Institute of Electrical and Electronics Engineers (IEEE), Mathematical Association of America (MAA), and the American Statistical Association (ASA), . There are several other societies relevant to data science, but not as vital in terms of official curricular guidelines (SIAM, INFORMS, AMS, ASEE). All four major societies (ACM, IEEE, MAA, and ASA) have released curricular guidelines relevant to data science. This post will give a very high level overview of those guidelines and why you might care about what’s in them.

ACM and IEEE jointly released Curriculum Guidelines for Undergraduate Programs in Computer Science in 2013 (CS2013). The most valuable component of CS2013 for me is the specification of ‘Knowledge Areas’ that are obviously related to Data Science, and being able to see the professional community’s consensus on central learning objectives in these areas. Some clearly important/relevant areas are:

  • Computational Science
  • Discrete Structures
  • Graphics and Visualization
  • Information Management
  • Parallel and Distributed Computing

Other areas such as Algorithms and Complexity, Information Assurance and Security, or Programming Languages probably include specific learning objectives that are relevant to data science, but may not be needed in their entirety. Additionally, CS2013 allows you to to examine the suggested course hours expected to be devoted to these topics. From an industry perspective, this can provide valuable insight into whether a data scientist or computer scientist might be more knowledgeable about a particular subject. This differentiation in knowledge is important as data science strives to define itself independently of its founding disciplines. If you are interested in throwing your net a bit wider, ACM also has guides for other programs like Computer Engineering and Information Technology (coming in 2017) on their guidelines site.

The MAA’s 2015 Committee on the Undergraduate Programs (CUPM) in Mathematics Curriculum Guide to Majors in the Mathematical Sciences — CUPM Guide for short — can serve in largely the same way the CS2013 guide does, but from a mathematical/statistical approach. With more detailed reports on Applied Mathematics, Computational Science, Operations Research, and other areas of mathematics that data science often operates in, the CUPM Guide makes it possible to understand what exactly (from a mathematician’s or computational mathematician’s perspective) are the most relevant areas of mathematics to understand for success. This guide can also serve to help clarify exactly what sorts of mathematics courses a data science curriculum should require, by explaining where in the course structure specific topics like sets, relations, and functions, or other ideas get covered. In addition to their extensive undergraduate guide the MAA also provides a lot of interesting materials related to masters/Ph.D preparation, etc. These might be particular interesting as you consider what sorts of students to recruit or include in a master’s program.

Finally, the ASA has perhaps the most relevant and diverse, but in many ways least detailed, set of curriculum guides. The set of undergraduate guidelines and reports include how to assess instruction, program guidelines for statistical sciences, and even the Park 2016 Data Science guidelines (which I have commented on in other posts). They also have two sets of graduate guidelines from 2009 and 2012 for statistics masters/Ph.D. programs. What the ASA guidelines provide are much bigger, sweeping statements about the sorts of skills and knowledge that a statistics major should have. It includes side notes that give more details such as encouraged programming languages and even file formats. In many ways, I think the majority of the ASA guidelines could just replace “Statistics Major” with “Data Science Major” and remain nearly as applicable. The biggest difference might be in the level/depth required in “Statistical Methods and Theory” (less) and “Data Manipulation and Computation” (more). In a sense, this is at the heart of many statistician’s argument that “Data Science” isn’t really its own field. In practice though, I think the final implementation and mindset behinds a statistics major and a data science major will be very different, and certainly heavily influenced by the ‘host’ department.

That covers the breadth of the major professional societies’ curricular recommendations. I wasn’t able to find any (official) guidelines for a “business analytics” major from a professional society (see my resource page for a few unofficial documents), so if you know of one, please let me know.

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive posting of sites that have curriculum available, instead it’s intended to help reduce your search time for this kind of material.

If you are looking to find wholesale curriculums, including course materials, there are a few options available to start the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant funded with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc. To understand their vision, see the image below.

This idea of continual improvement was more than just a grant seeking ploy as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected July 2017.

Overall, their curriculum structure (I haven’t evaluated their deeper content) has a fairly high emphasis on computational topics, with less statistics/mathematical underpinnings. You can experience their curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable though is their actual grant’s deliverables. These deliverables include details on the overall design principles in their structure with learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major, important topic, however, this should be done with proper attribution and license checking (of course).

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found his own data science consulting firm and other initiatives. While not every link on the site is actually to a free resource (there’s several books to buy etc), it does a pretty nice job of highlighting the topics that will need to be covered (if possible), and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is python focused, however he also has a collection of R resources.

Corthell isn’t the only one though with an “open source” or “free” data science (masters) degree. Another collection of relatively similar material was collected by David Venturi, who’s now a content developer at Udacity (writing data science curriculum of course). For those designing curriculums, both Corthell and Venturi provide excellent resources and places to frame your learning. However if you hit this page trying to get into data science, read this Quora post that I think accurately highlights the challenges of learning from/with these open source programs.

Another similar alternative, that I’d peg closer to an undergraduate degree, is the Open Source Society University‘s data science curriculum. Their curriculum assumes a lot less pre-knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in-line with the recommendations for curriculum from the Park’s paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required to learn from each course. You’ll see a large repetition of topics, but the OSS-Univ’s curriculum has a lot less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer duplicates of a significant subset of classes from these curriculums. While emulation might be the highest form of praise, we’ll each need our own, unique take on these courses while striving for sufficient similarity to have a semi-standardized knowledge base for practitioners. Good luck!

 

Blog Intro and Information

Welcome to “From the Director’s Desk” a blog about data science education and curriculum. If you are interested in receiving regular updates when new posts appear you can use the RSS feed link above, or subscribe to the google-group (read more for the link, you don’t need a gmail-account to subscribe!). You can find a bit more about me, Karl Schmitt on the about page. If you are looking for full degree curriculum development materials I’ve created a resource page and tracked posts with a Program Development category. Individual course materials are tracked either generally with the “Course Development” category, or individually by each course the post relates to. Please feel free to email me or leave comments if you have questions, thoughts or something to share!

The original blog introduction, with a bit of why the blog exists and what it seeks to cover is here.