Webinar Summary: Data Science Education in Traditional Contexts

Introduction

This post is a summary and reflection on the webinar “Data Science Education in Traditional Contexts”. The webinar was hosted on Aug 28th by the South Big Data Innovation Hub as part of their Keeping Data Science Broad: Bridging the Data Divide series. You can watch the entire webinar here. The webinar consisted of five speakers and a discussion section. I’ve provided a short summary of each panelist’s presentation and the questions discussed at the end. The speakers, in order, were:

  • Paul Anderson, College of Charleston
  • Mary Rudis, Great Bay Community College
  • Karl Schmitt, Valparaiso University
  • Pei Xu, Auburn University
  • Herman “Gene” Ray, Kennesaw State University

Summary of Presentation by Paul Anderson, College of Charleston

The first speaker was Paul Anderson, Program Director for Data Science at the College of Charleston. His portion of the presentation runs from 0:01:50-0:13:45 and expands on three challenges he has experienced: (1) being an unknown entity, (2) recruiting, and (3) designing an effective capstone. His first point, being an unknown entity, affects a broad range of activities involved in implementing and running a data science program. It can make it hard to convince administrators to support the program or new initiatives (such as external collaborations), and it means that other disciplines may not be interested in developing joint coursework (or approving your curricular changes). His second point covered what he’s learned from several years of working on recruitment. His first observation here ties back to his first overall point: if your colleagues don’t know what data science is, how are most high school students (or even your own students) to know? This has given him limited success with direct recruitment from high schools. Instead, he has focused on retooling the program’s Introduction to Data Science course to be a microcosm of the entire program, in both process and rigor. He has also worked to make the program friendly to students switching majors or double majoring by keeping prerequisites limited. His final portion discussed the various forms of capstone experience Charleston has experimented with, starting from 1-to-1 student-faculty project pairs and moving toward group-based projects under a general faculty mentorship model. If you are considering including a capstone experience (and you should!), this portion is worth listening to. However, not all colleges or universities will have enough students and faculty to adopt their final model.

Summary of Presentation by Mary Rudis, Great Bay Community College

The second speaker was Mary Rudis, Associate Professor of Mathematics at Great Bay Community College. Her portion runs 0:14:25-0:19:19 and 0:20:46-0:29:08. A significant part of her presentation outlines the large enrollment and performance gaps faced by non-white and first-generation college students. Dr. Rudis saw building both an Associate Degree in Analytics and a Certificate in Data – Practical Data Science as the best way to combat these gaps. In researching the state of jobs and education, she found that community college students were struggling to compete with students from 4-year institutions (like nearby M.I.T.) for the limited internships and entry-level job opportunities available in data science. Most companies were hiring for master’s-level education or significant work experience in the field. To help her students succeed, she built an articulation program with UNH-Manchester so that, upon final graduation, students who originally enrolled at GBCC would be fully qualified for the current job market.

Summary of Presentation by Karl Schmitt, Valparaiso University

The third speaker was Karl Schmitt, Assistant Professor of Mathematics and Statistics, Affiliate Professor of Computing and Information Sciences, and the Director of Data Sciences at Valparaiso University. His presentation runs from 0:30:30 – 0:45:20. The core of the presentation expanded on Dr. Anderson’s first point about data science being an unknown entity. He offered ideas about how to differentiate data-focused programs from one another, both within a single college or university and relative to programs at other institutions. Valparaiso has six data-focused programs.

His talk described how the programs can be differentiated in terms of the data user/professional that the program trains, and also in terms of course content and focus. He also talked about how Valpo is differentiating its program from other schools with a focus on Data Science for Social Good. This has been achieved in part by seeking industry partners from the government and non-profit sectors, rather than traditional industrial partners.

Summary of Presentation by Pei Xu, Auburn University

The fourth speaker was Pei Xu, Assistant Professor of Business Analytics at Auburn University. Her portion of the presentation runs from 0:46:05 – 0:57:55 and describes Auburn’s undergraduate Business Analytics degree. Auburn’s curriculum is designed around the data science process of Problem Formulation -> Data Prep -> Modeling -> Analysis -> Presentation. Each core class covers 1-2 stages of this process, with the specialized degree courses typically beginning in a student’s sophomore year. The program also actively engages many businesses to visit and provide information sessions. Dr. Xu detailed four challenges she has faced. First, she found it hard to recruit qualified faculty to teach the courses, which she has overcome by hiring progressively over the last few years. Second, many students were deterred by the highly quantitative and computational nature of the program; this has been addressed by emphasizing project-based learning and interpretation over novel method development. Third, many of the program’s core courses overlap significantly; for example, courses in several different areas all need to discuss data cleaning/preparation. Auburn’s faculty have spent significant curriculum development time planning exactly what content is duplicated and where. Finally, deciding among the various analytics tools, for both the general curriculum and specific classes, has proved challenging (you can see my extended discussion of Python, R, and other tools here).

Summary of Presentation by Herman “Gene” Ray, Kennesaw State University

The fifth speaker was Herman “Gene” Ray, Associate Professor of Statistics and Director of the Center for Statistics and Analytics Research at Kennesaw State University. His presentation runs from 0:58:36 – 1:07:35 and focuses on KSU’s Applied Statistics Minor. KSU’s program strongly emphasizes domain areas: most courses include a high level of application and some form of experiential learning, and almost all courses use SAS while also introducing students to a full range of data science software/tools. The first experiential learning model KSU uses integrates corporate data sets and guided tasks from business. The second is a ‘sponsored research class’ in which teams of undergraduates, led by a graduate student, work on corporation-provided problems or data. Gene gave extended examples involving an epidemiology company and Southern Power Company. The key benefits KSU has seen are that students receive real-world exposure, practice interacting with companies, and can even receive awards, internships, and jobs. The largest challenge of this experiential learning model is that it requires a significant amount of time, for both faculty and students: first to develop relationships with companies, then to manage corporate expectations, and finally to execute the projects.

Additional Webinar Discussion

The additional discussion begins at 1:08:32. Rather than summarize all the responses (which were fairly short), I’m simply going to list the questions in the order they were answered and encourage interested readers to listen to that portion of the webinar or stay tuned for follow-up posts here.

  1. What can High Schools do to prepare students for data science?
  2. What sort of mix do programs have between teaching analysis vs. presentation skills?
  3. Is it feasible for community colleges to only have an Introduction to Data Science course?
  4. How have prerequisites or program design affected diversity in data science?
  5. How is ethics being taught in each program? (and a side conversation about assessment)

Keeping Data Science Broad Webinar

Please join me and other data science program directors for an educational webinar exploring undergraduate programs.

Keeping Data Science Broad: Data Science Education in Traditional Contexts | Aug 31, 2017 | Virtual
This webinar will highlight data science undergraduate programs that have been implemented at teaching institutions, community colleges, universities, minority-serving institutions, and more. The goal is to provide case studies of data science degrees and curricula being developed by primarily undergraduate-serving institutions. Such institutions are crucial connectors in the establishment of a robust data science pipeline and workforce, but they can face different constraints than large research-focused institutions when developing data science education programming.

More details about the webinar will be posted soon on the South Hub website: http://www.southbdhub.org/datadivideworkshop.html

Python or R?

This week I want to discuss a potentially divisive issue: should a program (or course, etc.) be taught in Python or R? I think a reasonable case can be made for teaching either language. Pragmatically, if you want your program’s graduates to be truly competitive for the largest variety of jobs in the current market, students need to be at least familiar with both (and possibly SAS or SPSS). There is already a lot of information and many blog posts addressing this question, and I’ve provided links to a few of my favorites at the end of this post. Rather than re-hashing those posts’ pros and cons, I’m going to focus on aspects of each language related to teaching (and learning).

Before considering each language, I want to frame the discussion by (re)stating a program-level student learning objective (SLO). In my first post about SLO’s, objective 2 states: “Students will be able to implement solutions to mathematical and analytical questions in language(s) and tools appropriate for computer-based solutions, and do so with awareness of performance and design considerations”. Based on this objective, I’ll state three specific objectives for selecting a programming language:

  • A language which can implement (complete) solutions to data science questions
  • A language which allows good programming practices in terms of design
  • A language in which solutions can be implemented with awareness of performance issues and improved accordingly

Why Choose R?

As a programming language that originated in academia, particularly within the statistics community, R seems like a very natural choice for teaching data science. Much of the syntax, function naming, and even the way one thinks about constructing a data pipeline/workflow comes naturally from a statistical analysis perspective. This makes it very easy to convert knowledge of statistical processes into code and analyses within R. The easy conversion between notation and code becomes even more valuable when working with advanced or obscure statistical techniques: because R originated in academic statistics, it has a much broader range of packages for uncommon techniques than most other languages. This makes R a strong candidate for the first requirement when working in statistical domains.

Other software/packages that make R appealing to teach with are RStudio, Jupyter Notebooks, and R Markdown. RStudio provides a clean, user-friendly interface for R that makes interacting with plots and data easy. It even eases the transition from spreadsheet software (like Excel) by providing a similar, GUI-driven interaction with (simple) data frames. With Jupyter Notebooks’ recent addition of an R kernel option, it is also easy to transition from mathematics-focused software like Maple and Mathematica. See this DataCamp blog post for more information on using Jupyter Notebooks (or similar software) with R. Notebooks also facilitate teaching good practices such as code blocks and code annotation. Finally, R Markdown provides a (reasonably) simple way to convert executable code directly into final reports/outputs. That functionality further supports the teaching of (some) good programming and design practices.

Why Choose Python?

Python was originally developed to be an easy-to-learn programming language (see Wikipedia’s history of Python). The whole language’s syntax and styling are easier to learn from scratch than most other languages (notably R). Python’s basic data structures map naturally onto mathematical and data concepts: lists and sets behave much like mathematical sequences and sets, while dictionaries closely match logical constructions for unstructured data. Together with the use of indentation to indicate control flow, it is natural, in any introduction to the language, to show how to make Python code (human-)readable. These traits speak directly to teaching/achieving our second language-related objective, “allows good programming practices/design”.
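
To make this concrete, here is a minimal sketch of those structures in action; the data and names are invented purely for illustration:

    # Sets support mathematical set operations directly, so partitioning
    # data reads almost like the notation. (Hypothetical example data.)
    enrolled = {"alice", "bob", "carol", "dana"}
    passed = {"alice", "bob", "dana"}
    needs_followup = enrolled - passed          # set difference

    # Dictionaries hold semi-structured records without a fixed schema.
    record = {"name": "carol", "major": "biology", "grades": [88, 93]}

    # Indentation *is* the control flow, which forces readable structure.
    for student in sorted(needs_followup):
        print(student, "has not yet passed")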

For teaching, Python starts with many of the same advantages as R. There is a long-standing Python kernel for Jupyter Notebooks, and several markdown packages are available for turning code directly into HTML-styled reports. What makes Python noticeably different from R is that it is a general-purpose programming language. For teaching, this opens up some interesting options related to the first and third goals above. In terms of developing solutions to data science problems, Python supports a very broad range of both input and output; in particular, it has high-quality packages designed for streaming data and better techniques for unstructured or big data. Also, because Python is regularly used to develop full programs and deployed software, the methods available for studying and improving performance are already well developed.
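
As one example of that last point, the standard library alone ships with profiling tools. A minimal sketch (the function being profiled is invented for the demonstration):

    import cProfile
    import timeit

    def pairwise_sums(values):
        # Deliberately quadratic, so the profiler has something to find.
        return [a + b for a in values for b in values]

    # timeit gives quick wall-clock comparisons of small snippets...
    print(timeit.timeit("pairwise_sums(range(200))",
                        globals=globals(), number=100))

    # ...while cProfile breaks a run down call by call.
    cProfile.run("pairwise_sums(range(500))")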


But What are People Actually Using?

There are way, way more Python users than R users (and probably will be for the foreseeable future), simply because Python is a general-purpose programming language. However, we are more concerned with users within the data science communities. That focus, however, doesn’t make the answer any clearer. 2016 data from O’Reilly’s Data Science Salary Survey places R (57%) slightly ahead of Python (54%), which matches KDnuggets’ rankings of R being slightly ahead in 2016. However, the 2017 KDnuggets survey results now place Python slightly ahead. Burtch Works’ 2017 survey data, however, still has R significantly ahead, and in fact still gives a very large market share to SAS, which didn’t even make KDnuggets’ list. But Burtch also notes that Python has been gaining share each year. Remember, when considering these results, that these are all self-reported, self-selecting surveys! It is hard to tell whether these changes reflect actual changes in use, or just a changing definition/reach of who responds to the surveys. For example, when Burtch Works breaks down their results, at least one sub-group rarely used SAS and, similar to O’Reilly and KDnuggets, had Python ahead. More and more people identify as doing data science each year, but many of them have been doing similar things for a long time.

Some Undisguised Opinions

There is obviously value in either programming language, but from my perspective there is a really strong winner: Python. From a curriculum/planning perspective, since Python is a general-purpose language, it is entirely feasible to have the standard introductory programming courses from a computer science department taught in Python. This reduces (potentially wasteful) duplication of similar courses (does every discipline really need its own intro programming?). It also lets computer scientists take advantage of years of educational research into how to better teach programming! Not to mention that Python was intentionally designed to be an easier language to learn programming in.

Add to this that data science students don’t really experience any major disadvantages from having Python as the primary curricular language but do gain several benefits. Key benefits include longer-term skill viability and increased versatility in job options, etc. This versatility even plays out when considering including advanced CS courses in a data science curriculum. Most data science curriculums are already going to struggle to incorporate all the necessary foundational skills in a reasonable length undergraduate (or graduate) program. So why add programming courses beyond those already needed to meet typical CS prerequisites?

Finally, looking at the trends in language/tool use in data science adds more validation to this idea. As companies move to working with unstructured or streaming data, Python becomes even more natural. All the surveys report increasing use of Python, with no sign of that increase slowing down. It is important for academic programs not just to react to, but to anticipate, trends and needs in the job market and industry.

Additional Resources

While I didn’t go into lots of detail on the pros and cons of R or Python (and didn’t even talk about SAS/SPSS), I have collected a few links that you might find valuable in making your own decision.

R vs. Python for Data Science: Summary of Modern Advances — EliteDataScience, Dec 2016 — Does a nice job of highlighting the new things that make the languages pretty equal.

Python & R vs. SPSS & SAS — The Analytics Lab, 2017 — Nice because it also puts into perspective how SPSS and SAS play into the landscape, and provides additional historical perspective.

Python vs. R: The battle for data scientist mind share — InfoWorld, 2017 — A fairly balanced perspective on the value of both.

R vs. Python for Data Science — KDnuggets, 2015 — A bit dated, but still provides some good comparisons.

(Other) Official Curriculum Guides

Last week I discussed several places from which you could pull curriculum planning materials. This week continues that theme, with a bit more of an ‘official’ flavor, by discussing several professional societies’ curricular guides. While there is no (clear) leading data science professional society (and, to my knowledge, none with curricular guidelines), there are a few closely related societies with official guidelines. Depending on what path you took into data science, you may be more or less familiar with the following: the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), the Mathematical Association of America (MAA), and the American Statistical Association (ASA). Several other societies are relevant to data science but less vital in terms of official curricular guidelines (SIAM, INFORMS, AMS, ASEE). All four major societies (ACM, IEEE, MAA, and ASA) have released curricular guidelines relevant to data science. This post gives a very high-level overview of those guidelines and why you might care about what’s in them.

ACM and IEEE jointly released Curriculum Guidelines for Undergraduate Programs in Computer Science in 2013 (CS2013). The most valuable component of CS2013, for me, is its specification of ‘Knowledge Areas’ obviously related to data science, and the chance to see the professional community’s consensus on central learning objectives in those areas. Some clearly important/relevant areas are:

  • Computational Science
  • Discrete Structures
  • Graphics and Visualization
  • Information Management
  • Parallel and Distributed Computing

Other areas, such as Algorithms and Complexity, Information Assurance and Security, or Programming Languages, probably include specific learning objectives that are relevant to data science but may not be needed in their entirety. Additionally, CS2013 allows you to examine the suggested course hours expected to be devoted to these topics. From an industry perspective, this can provide valuable insight into whether a data scientist or a computer scientist might be more knowledgeable about a particular subject. This differentiation in knowledge is important as data science strives to define itself independently of its founding disciplines. If you are interested in throwing your net a bit wider, ACM also has guides for other programs like Computer Engineering and Information Technology (coming in 2017) on their guidelines site.

The MAA’s Committee on the Undergraduate Program in Mathematics (CUPM) released its 2015 Curriculum Guide to Majors in the Mathematical Sciences — the CUPM Guide for short — which can serve in largely the same way the CS2013 guide does, but from a mathematical/statistical perspective. With more detailed reports on Applied Mathematics, Computational Science, Operations Research, and other areas of mathematics that data science often operates in, the CUPM Guide makes it possible to understand what exactly (from a mathematician’s or computational mathematician’s perspective) are the most relevant areas of mathematics to understand for success. The guide can also help clarify exactly what sorts of mathematics courses a data science curriculum should require, by explaining where in the course structure specific topics, like sets, relations, and functions, get covered. In addition to its extensive undergraduate guide, the MAA also provides a lot of interesting material related to master’s/Ph.D. preparation; this might be particularly interesting as you consider what sorts of students to recruit or include in a master’s program.

Finally, the ASA has perhaps the most relevant and diverse, but in many ways least detailed, set of curriculum guides. Its undergraduate guidelines and reports cover how to assess instruction, program guidelines for the statistical sciences, and even the Park 2016 data science guidelines (which I have commented on in other posts). The ASA also has two sets of graduate guidelines, from 2009 and 2012, for statistics master’s/Ph.D. programs. What the ASA guidelines provide is much bigger, sweeping statements about the sorts of skills and knowledge a statistics major should have, with side notes that give more details such as encouraged programming languages and even file formats. In many ways, I think the majority of the ASA guidelines could simply replace “Statistics Major” with “Data Science Major” and remain nearly as applicable. The biggest difference might be in the level/depth required in “Statistical Methods and Theory” (less) and “Data Manipulation and Computation” (more). In a sense, this is at the heart of many statisticians’ argument that “Data Science” isn’t really its own field. In practice, though, I think the final implementation of, and mindset behind, a statistics major and a data science major will be very different, and certainly heavily influenced by the ‘host’ department.

That covers the breadth of the major professional societies’ curricular recommendations. I wasn’t able to find any (official) guidelines for a “business analytics” major from a professional society (see my resource page for a few unofficial documents), so if you know of one, please let me know.

Course/Curriculum Resource Sites

Last week I posted about specific websites you might use to host or pull assignments from. This week I want to take a broader look at overall curriculum design. This is by no means a comprehensive posting of sites that have curriculum available, instead it’s intended to help reduce your search time for this kind of material.

If you are looking for wholesale curricula, including course materials, there are a few options available to get the creative juices flowing. The first, and probably most academic, is the European Data Science Academy (EDSA). The EDSA is grant-funded, with a large number of academic (university) and research institute partners from across Europe. The thing I like best about this work is that they started with a demand analysis study of the skills needed and the current jobs in data science across the EU. Furthermore, from the start the project built in a feedback and revision cycle to improve and enhance the topics, delivery, etc.

This idea of continual improvement was more than just a grant-seeking ploy, as shown by their list of releases, revisions, and project deliverables. While the current site still lists four learning modules as unreleased, they are expected in July 2017.

Overall, their curriculum structure (I haven’t evaluated the deeper content) places a fairly high emphasis on computational topics, with less on statistical/mathematical underpinnings. You can experience the curriculum directly (it’s free/open access) through their online course portal. What might be far more valuable, though, are the actual grant deliverables. These include details on the overall design principles behind the structure, with program-level learning objectives, individual courses with their own learning objectives, descriptions of lesson topics/content, and more. Using their outlines and ideas to guide your own construction of a curriculum is both reasonable and a great way to make sure you aren’t missing any major topic; of course, this should be done with proper attribution and license checking.

The other two places to look for curricular inspiration are also in the ‘open source’ category, but not funded by grants or (traditional) academic institutions. The Open Source Data Science Masters was constructed by Clare Corthell, who has gone on to found his own data science consulting firm and other initiatives. While not every link on the site leads to a free resource (there are several books to buy, etc.), it does a pretty nice job of highlighting the topics that need to be covered and provides lots of places to start pulling course materials from (or getting inspiration/ideas for content). The primary curriculum is Python-focused; however, he also has a collection of R resources.

Corthell isn’t the only one with an “open source” or “free” data science (master’s) degree. Another collection of relatively similar material was put together by David Venturi, who is now a content developer at Udacity (writing data science curriculum, of course). For those designing curriculums, both Corthell and Venturi provide excellent resources and places to frame your learning. However, if you hit this page trying to get into data science, read this Quora post, which I think accurately highlights the challenges of learning from/with these open source programs.

Another similar alternative, which I’d peg closer to an undergraduate degree, is the Open Source Society University’s data science curriculum. Their curriculum assumes much less prior knowledge in mathematics and statistics, providing links for Calculus, Intro Statistics, etc. This content is probably more in line with the curriculum recommendations from the Park City paper (see my Curriculum Resources page). What I particularly like about this (from a learning perspective) is that it actually details the amount of work per week required for each course. You’ll see a large repetition of topics, but the OSS-Univ curriculum has much less advanced material, with only a few courses in big data, wrangling, etc.

At the end of the day, if you are looking to implement an undergraduate or graduate degree in data science, your university is going to have to offer duplicates of a significant subset of classes from these curriculums. While emulation might be the highest form of praise, we’ll each need our own, unique take on these courses while striving for sufficient similarity to have a semi-standardized knowledge base for practitioners. Good luck!


Math (Courses) for Data Science

I want to share some thoughts on the math required for a data scientist (or at least, for a data science undergraduate degree). The discussion really boils down to one question: “Discrete Mathematics or Calculus 2?” Let’s first look at the outcomes from one in-progress and two completed working groups on outlining data science education.

An ACM-organized workshop in 2015 included participants from the ACM, ASA, IEEE-CS, AMS, and more. That workshop’s report does not explicitly state any math requirements, but it does make clear the need for sufficient supporting statistics courses. The clearest recommendations come from a group of faculty at the Park City Mathematics Institute in the summer of 2016. Their report gives suggestions on how to build a data science degree from existing courses, plus ideas for new integrated courses (this is the real gold in the report). If constructing a curriculum from existing courses, the group recommends three mathematics courses: Calculus 1, Calculus 2, and Linear Algebra. Last, a series of round-table discussions on Data Science Post-Secondary Education is currently underway at the National Academies. While all three round tables are interesting, only the first is relevant to this discussion. At that meeting, there was a presentation on the underlying mathematics of data science, which listed linear algebra, numerical analysis, and graph theory as the mathematics supporting data science.

In summary, all three groups clearly support linear algebra being part of any data science curriculum. I doubt you’ll find many objections, since linear algebra forms the mathematical foundation for manipulating data arranged in tables or arrays as rows/columns. If nothing else, simply learning the notation is vitally important for anyone wanting to extend algorithms for data science. All three also clearly support at least two traditional statistics courses, up through regression analysis. A little less clearly, I would argue that all three support requiring a Calc 1 course. The NAS round table discussed needing numerical analysis, which is traditionally based on calculus concepts. The ACM workshop supported disciplinary knowledge, and just about all science disciplines require at least one semester of calculus.
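
To see why the notation matters, here is a minimal sketch (with invented data) of how a data table is just a matrix, and how ordinary least-squares regression becomes a one-liner once you can read the linear algebra:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))    # 100 observations (rows), 3 features (columns)
    true_beta = np.array([2.0, -1.0, 0.5])
    y = X @ true_beta + rng.normal(scale=0.1, size=100)

    # Ordinary least squares minimizes ||X beta - y||; in matrix notation,
    # beta = (X^T X)^{-1} X^T y. lstsq solves this stably for us.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)    # approximately [2, -1, 0.5]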

Alright, on to the differences. The PCMI group included Calculus 2 in its “minimum” courses needed for data science. In my opinion, suggesting that Calc 2 belongs among the bare minimum courses for data science is indicative of the dominance of mathematicians (many applied) and statisticians in the group (there were a FEW computer scientists). While I think their recommendations are quite good overall, the inclusion of Calc 2 over discrete mathematics (as well as the odd placement of data mining) clearly reflects this make-up. The presentation on mathematics (from two mathematicians) at the first NAS round table, however, included graph theory as one of the three main supporting mathematical areas. So perhaps the question from these two groups is: “Calculus 2 or Discrete Mathematics?”

Here’s an alternative way to build an answer. Instead of focusing only on the topics covered, what about the requirements of the other supporting disciplines that make up data science? Computer science is pretty easy: almost all programs require Calculus 1 and discrete mathematics, and the ACM 2013 guidelines include a list of core topics (set theory, graph theory, and logic) traditionally covered in either a discrete mathematics course or a combination of several mathematics courses. They also articulate very clearly that some areas of computer science (like visualization or data science) require linear algebra and statistics. We can contrast this with the typical mathematics requirements of a statistics curriculum. Many statistics programs require a minimum of Calc 2 to support advanced probability courses (with a preference for multivariable calculus). The ASA 2014 guidelines specify that statistics majors should have both differentiation and integration (typically covered by Calc 1 and 2), plus linear algebra.

Looking to the supporting disciplines can leave us just as confused about what to require. I think there is an answer, but it requires taking off the mathematician glasses and thinking about jobs, applications, and where a student might be headed. First, a good portion of researchers and practitioners doing data science use graphs and networks, often mining those graphs for information. It turns out graphs (the node/edge type, not the line/bar plot type) are also a great way to visualize a lot of information. Another key skill in data science is the ability to partition data, that is, to think of data as either meeting or not meeting specific criteria. This is encompassed by set theory in mathematics, and is sometimes partially covered as part of logic. Together these topics provide two new ways of thinking about data that aren’t included in other mathematics courses. The need for this sort of knowledge, plus a basic introduction to proofs, is why discrete mathematics courses came into existence: to let CS majors get these topics without taking another 3 or 4 mathematics courses. To me, this is a far stronger case for including discrete mathematics than the (possible) need for Calculus 2 in advanced statistics courses. If you are requiring 4 math courses, by all means include Calculus 2 next. Or, if a student is particularly interested in the theoretical underpinnings of data science (via more statistics courses), then they should take Calc 2. But if we are really thinking about an undergraduate degree as a stand-alone, ready-to-enter-the-workforce degree, Calc 2 does not add much direct value for the student.
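
For illustration, here is a minimal sketch, using invented data and no libraries, of those two habits of mind: partitioning data by criteria, and representing relationships as a graph:

    # Partitioning: each record either meets the criterion or it doesn't.
    purchases = {"ann": 250, "ben": 40, "cai": 120, "dee": 75}
    high = {name for name, total in purchases.items() if total > 100}
    low = set(purchases) - high           # the complement, via set difference

    # Graphs: relationships as nodes and edges, stored as adjacency sets.
    follows = {"ann": {"ben"}, "ben": {"ann", "cai"}, "cai": set(), "dee": {"ann"}}
    mutual = {(a, b) for a in follows for b in follows[a] if a in follows[b]}

    print(high, low, mutual)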

Student Learning Objectives – Part 4

This post is part of a series on student learning objectives (SLO’s) for both curriculum and courses. The SLO’s in this post are course level, specifically topical objectives for an “Introduction to Data Science” (Data 151) class for new students. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

The last post focused on high-level learning objectives for the course “Introduction to Data Science” (I’ve repeated them below for reference). Those are certainly the big picture, but those four objectives are hardly enough to really design day-to-day lessons around. Data 151 also has seven topical objectives tied directly to those general objectives and modeled after Paul Anderson’s DISC 101 course objectives. I’ll tie each topical objective back to the course’s overall goals.

General Course Objectives:

A. Students understand the fundamental concepts of data science and knowledge discovery
B. Students can apply and perform the basic algorithmic and computational tasks for data science
C. Students develop and improve analytical thinking for problem formulation and solution validation, especially using technology
D. Students prepare for success in a world overflowing with data.

Topical Objectives:

  1. gain an overview of the field of knowledge discovery (A)
  2. learn introductory and state-of-the-art data mining algorithms (A,B)
  3. be able to distinguish and translate between data, information, and knowledge (A, C)
  4. apply algorithms for inductive and deductive reasoning (B,C)
  5. apply information filtering and validation on real world datasets (B,C)
  6. understand the social, ethical, and legal issues of informatics and data science (A,D)
  7. apply data mining, statistical inference, and machine learning algorithms to a variety of datasets including text, image, biological, and health (B,D)

Four of the topical objectives (1, 2, 3 & 6) focus on guiding students towards understanding the fundamental concepts behind data science. One can hardly call a course an “introduction” without giving an overall picture of the field (Obj. 1) or spending time on the key tools practitioners use (Obj. 2). While I fully anticipate that the state-of-the-art algorithms will change, basics like k-Nearest Neighbors, k-Means, and Decision Trees certainly will not. These algorithms provide a nice gateway into the idea of learning from a collection of data (Obj. A).
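
For a sense of the level involved, a “gateway” algorithm like k-Nearest Neighbors takes only a few lines with scikit-learn; this is a minimal sketch where the data set, split, and k are arbitrary choices for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # "Learning" here is just remembering labeled examples; prediction is a
    # majority vote among the 5 nearest training points.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))   # accuracy on held-out flowers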

We also know in data science that what you can learn from a data-set is limited by the quality of the input data (like a lot of other things in life, garbage-in = garbage-out). Objectives 5 & 7 articulate the sorts of data that will be used in the course, both real-world data and a mix of prepared/known data sets. These data sets provide a way to actually practice Objectives 2 & 4 in more than just an abstract way. I want students to walk away from this class knowing how practitioners actually make use of algorithms. Students need to get their hands dirty putting some of those algorithms to work (Obj. B/C).

Now, I think it’s important to note here that in their projects and general work, I’m not expecting a really deep understanding or application of the algorithms. That’s saved for two later courses, one explicitly on data mining and the other their capstone sequence. In Data 151 they should be learning enough to continue learning on their own, understand and interact with people who are really doing this work, and to grasp how the ideas can and are shaping the evolution of various disciplines or industries.

While Objectives 2, 4 & 5 articulate using data science skills, Objectives 2-5 have a second layer as well. These objectives aim to have students think about the implications and knowledge that come from the data science process. This course is about more than just data engineering or data mining; it’s really about the questions and, well, the science that can be done with data. It is only when students understand the processes of both inductive and deductive reasoning for science, and can transform raw data into actionable knowledge, that they become aware of the true power of the field (Obj. B/C).

Last, but certainly not least, Objective 6. As we know from Spider-Man (and some other great speeches), “With great power comes great responsibility.” If you believe, as I do, that data science could dramatically change what we know and how industries and society are run… then I hope you are also a little nervous, perhaps occasionally terrified. Because if we DON’T talk about the social, ethical, and legal issues surrounding informatics and data science, we might well end up with something like Ultron (the artificial intelligence gone bad in Marvel’s “Avengers: Age of Ultron”). More likely, we’ll end up with biased learning algorithms that perpetuate injustice or inequality. Making sure students have at least started to think about these issues may not prevent them from happening, but it is one (to my mind necessary) step towards that goal (Obj. D).

Together this is a pretty hefty set of things to accomplish in a semester. All in all though, I think they serve as a great lead into the entire field, and the overall goals of Valpo’s Data Science program (described in previous posts). Even if a student only takes Data 151 (as some certainly will), they will leave with a broad understanding of the field, enough knowledge to interact successfully with experts, and enough insight to see the real value that the effective and intelligent use of data can provide. I hope my business students leave prepared to be the “data-savvy business managers” that McKinsey & Co. described a few years ago, and that the rest (C.S., math, and stats students) leave prepared to work with, or become, true data scientists, engineers, or creators.

Student Learning Objectives – Part 3

This post is part of a series on student learning objectives (SLO’s) both for curriculum and courses. The SLO’s in this post are course level, specifically for an “Introduction to Data Science” (Data 151) class for new students. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

In many ways, the general course SLO’s for Data 151 mirror the SLO’s for the program as a whole. Students need to leave with an understanding of what data science is, know the basic algorithms, and be aware of the ethical and moral issues surrounding the use of data. Data 151 is intended to be a hook that draws in students from across our university to learn about data and then consider adding a major in Data Science. It also draws in juniors and seniors from less technical disciplines like business, which may make Data 151 the only course where those students explicitly think about data. The major difference between the curricular and course SLO’s is the depth of understanding I expect students to leave the course (as opposed to the program) with. This is most clear in the first two SLO’s below.

  1. Students understand the fundamental concepts of data science and knowledge discovery
  2. Students can apply and perform the basic algorithmic and computational tasks for data science

As noted, these are very close to the first two SLO’s for the whole curriculum, relating to both students’ ability to communicate data science concepts and their ability to implement solutions, though in both cases with lower levels of expertise expected. Data 151 has two additional SLO’s that target the broader (potential) audience for the course (in addition to continuing majors). These are:

3. Students develop and improve analytical thinking for problem formulation and solution validation, especially using technology
4. Students prepare for success in a world overflowing with data.

In many cases, students in Intro to Data Science are still gaining experience (aren’t we all?) with general problem-solving skills. Perhaps (to my mind) one of the most under-taught skills in STEM courses is how to actually formulate and structure the process of solving a problem. In many, many cases, a significant amount of time can be saved in execution by carefully planning how you are going to explore or solve a problem. Data science even has this explicitly built into several places in a typical workflow, specifically in performing exploratory data analysis and planning for solution validation.
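
As a concrete illustration of planning before solving, here is a minimal sketch in which the validation scheme is fixed before any exploration happens; the file name and columns are hypothetical:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("enrollment.csv")        # hypothetical data set

    # Decide on validation up front: hold out 20% before looking at anything.
    train, holdout = train_test_split(df, test_size=0.2, random_state=42)

    # Exploratory data analysis happens on `train` only, so the holdout
    # remains an honest check on whatever model is eventually chosen.
    print(train.describe())
    print(train.isna().mean())                # fraction missing per column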

Meanwhile, the final objective is meant to really be a catch-all. The field of data science is changing incredibly rapidly, as are the ways data is generated and used. I wanted Data 151 to be something that is capable of covering current, bleeding-edge topics. This SLO also nicely encompasses my plans to bring in alumni and current practitioners as speakers to give the students insight into what future jobs might look like. Bringing in these speakers also provides a chance for students to get an industry perspective on workflows and processes, something that can be very different from academia’s problem solving process.

These SLO’s are pretty high-level, but intentionally so. At Valpo, we’ve got both “course objectives” and also topical objectives. My next post will take a look at the specific, topical objectives for Data 151, which deal with the more nitty-gritty topics of what will actually get covered in Data 151.

Student Learning Objectives – Part 2

This post is part of a series on student learning objectives (SLO’s) both for curriculum and courses. The SLO’s in this post are curricular level and address soft-skills and versatile expertise. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

Data science is acknowledged to be an interdisciplinary field where a practitioner must have both broadly transferable skills (data skills that apply to any data problem) and specific skills/knowledge within the domains where questions arise. So a successful data science graduate must have domain knowledge, but it could be ANY domain knowledge. Moreover, they need not only to have domain knowledge but to be able to actually transfer the data skills they’ve learned in the program to solving domain problems. We need a student learning objective that captures that. Therefore, our third SLO is:

3. Students will be able to apply the skills and methods of the major [data science] for problem solving in data-intensive fields such as economics, meteorology, and biology, among others.

This objective captures the need for students to learn data skills to the point of transfer, without specifying exactly which fields they transfer those skills to. Since Valparaiso has its roots in a liberal arts tradition, this was particularly important: we want to encourage students in any discipline at all to consider adding a data science degree. While there are obviously more ‘traditional’ data-intensive fields (as listed), there are projects in any discipline that can make use of data skills, even a field as esoteric as studying historic French cookbooks (we have a computer science student helping with that project…). In practice, students can meet this objective in a variety of ways, but most often by adding a minor or taking one of a specific set of data-centric discipline courses.

The final curriculum SLO I want to discuss deals with a (in one sense) completely non-technical skill, but stems from Valparaiso’s core values as grounded in the Lutheran tradition.

4. Students will be aware of and engaged with the use and misuse of analytical and statistical data-derived conclusions in the wider world

The news has recently covered many issues related to the ethical use of data. Cathy O’Neil’s book “Weapons of Math Destruction” also highlights several cases of significantly unjust uses of data or data models. The Philosophical Transactions of the Royal Society even released a special themed issue in Dec. 2016 on ‘The ethical impact of data science’. As a religiously based institution, Valpo cannot simply skip training students in the potential pitfalls of building data models. But even beyond that, I believe that across the field, data scientists must think about and understand the moral, ethical, and social impact of the projects they work on. I believe every member of our society has a personal responsibility to act in ways that support the common good, even if/while they are making decisions to promote their own good. Without proper training, it is far too easy to overlook the bias potentially inherent in collected data, designed models, or even the implementation of actions in response to discoveries. The impact of a less-than-thoughtful choice or model could severely affect hundreds, thousands, or even millions of people. I certainly don’t want to look back and realize I have the training of that data scientist on my conscience…

Student Learning Objectives (Part 1)

This post is part of a series on student learning objectives (SLO’s) both for curriculum and courses. The SLO’s in this post are curricular level and address soft-skills and versatile expertise. Love them or hate them, student learning objectives are a part of higher education (I for one appreciate how they provide focus for curriculum and courses).

While there is definitely still debate about what exactly should be included in a full curriculum (post forthcoming on the ASA’s guidelines), there is some consensus on the basic knowledge needed to succeed. That consensus centers on (1) a mix of hard skills spanning Mathematics, Statistics, and Computer Science (MS-CS) and (2) a variety of soft skills that data scientists (or really students in general) will need. For example, among the tasks respondents to O’Reilly’s 2016 Data Science Salary Survey reported major involvement in: 58% spent time “communicating findings to business decision makers”, 39% were “organizing and guiding team projects”, and 28% had to “communicate with people outside [their] company”. This leads to the first SLO:

1. Students will demonstrate the ability to communicate effectively about mathematical and statistical concepts, as well as complex data analysis, in both written and verbal formats.

Phrase that SLO however you like, but I suspect nearly all data science programs will need something along those lines. Our next SLO focuses more on the hard skills. We know our students will need technical skills including programming, software design, and experience working with big data. However, if you are reading this blog, you are probably aware of the high volatility and rapid change occurring in the business and academic worlds of data science. Significant software workflows exist now that didn’t exist 6 months ago (heck, probably 2 months!) for answering data questions. Taking this into consideration, our second SLO addresses the skill need but phrases it generically, so that it maintains long-term viability.

2. Students will be able to implement solutions to mathematical and analytical questions in language(s) and tools appropriate for computer-based solutions, and do so with awareness of performance and design considerations.

With these two objectives, we’ve covered, from our perspective, the generic skills that a student with any data science degree is going to need in order to be successful. Part 2 will tackle SLO’s that relate to application knowledge and ethics.