94-887 Applied Analytics: the Machine Learning Pipeline (Spring 2020)

Course Information

94-887, Applied Analytics: the Machine Learning Pipeline, will be taught in the Spring semester of 2020. Classes begin 1/13/20 and end 5/1/20. Time: MW 10:30-11:50am. Room: Hamburg Hall 1007

Instructor

Jeremy C. Weiss M.D. Ph.D., Assistant Professor of Health Informatics; jeremyweiss@cmu.edu, OH: Wednesdays 5-6:30pm in room 2101E/F.

TA(s): Dylan Fitzpatrick, djfitzpa@cmu.edu, OH: Fridays 11am-noon in room 1208. The OH on Friday 1/31 is moved to Thursday 1/30, 11am-noon.

Faculty assistant: Carole McCoy, HBH 2102

Please bring your questions to office hours. You may also direct questions to the TA and/or instructor by email or on the Canvas discussion board.

Course Description

Machine learning is a valued set of analytics techniques, a confluence of ideas from computer science, statistics, economics, physics, and other fields. It is transforming fields with new capabilities and new ways of understanding and visualizing data, and it is becoming a key driver in decision making. However, knowing when (and how) to apply appropriate machine learning techniques requires an understanding of the data, of machine learning, and of the problem domain. This class seeks to teach students how to address the entire machine learning pipeline, starting from messy data and provisional questions and ending with actionable interpretations and insights.

The course will cover discovery, planning, analysis, and interpretation. Discovery involves understanding the data at hand, determining what is and is not answerable, and generating questions. Planning involves contrasting the application of the desired machine learning method on ideal clean data with its application on the messy data at hand. Dealing with representation, handling missing data, and designing appropriate machine learning machinery are all part of planning. Analysis involves applying the machine learning method and checking model performance and assumptions in a principled and responsible manner. Interpretation involves transforming algorithm outputs into meaningful and actionable characterizations of the results. The parts of the pipeline are interconnected, and students will learn to anticipate and address limitations by understanding the pipeline as a whole.

Throughout the course we will focus on one vertical, health care, recognizing that the methods developed will generalize to others. We will work with real, messy, structured and unstructured data--including databases, text, and images. We will contrast machine learning methods against what is currently used in health care analytics, and describe the advantages and promise of each.

This course will be a mixture of lectures, discussions and coding workshops. There will be a final project and no final exam.

Course prerequisites

Students should have completed or be concurrently taking Data Mining, Machine Learning for Problem Solving, ML 17-601, ML 17-401 or the equivalent. Experience with R, Python or another programming language is required. We will be using R for this course, so an introductory background in R is helpful.

Evaluation Method

Grades will be based on:

weekly exercises, mix of programming and short response assignments, 75%
course project, 25%
- proposal, 5%; report and app, 20%
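
As a rough illustration of how the weights above combine with the homework and late work policies stated under Additional Course Policies, here is a minimal Python sketch. It is not the official grade calculation: the function names are hypothetical, it assumes every item is scored from 0 to 100, that weekly exercises are weighted equally, and that "marked down 25%" means the score is multiplied by 0.75. It produces only a raw score, before any scaling to the Heinz grading policy.

```python
def late_adjusted(score, hours_late):
    """Apply the late work policy to one assignment score (0-100).

    Assumption: "marked down 25%" means the earned score is multiplied by 0.75.
    Work more than 24 hours late is not accepted, modeled here as a score of 0.
    """
    if hours_late <= 0:
        return float(score)
    if hours_late <= 24:
        return score * 0.75
    return 0.0


def raw_course_score(homework, proposal, report_and_app, drop_lowest=2):
    """Combine component scores (each 0-100) into a raw course score.

    Weights from the syllabus: weekly exercises 75%, proposal 5%,
    report and app 20%. The lowest `drop_lowest` homework scores are dropped.
    Assumption: the remaining homework scores are averaged with equal weight.
    """
    kept = sorted(homework)[drop_lowest:]
    hw_avg = sum(kept) / len(kept) if kept else 0.0
    return 0.75 * hw_avg + 0.05 * proposal + 0.20 * report_and_app


if __name__ == "__main__":
    # Hypothetical scores: the second exercise was turned in 12 hours late.
    hw = [100, late_adjusted(80, 12), 90, 70]
    print(raw_course_score(hw, proposal=100, report_and_app=90))
```

With four exercises, dropping the lowest two leaves only the top two scores in the homework average, so the effect of the drop policy shrinks as more exercises are assigned.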

Course Objectives

learn and adapt the mathematical formulations of machine learning methods for principled application
perform end-to-end machine learning analysis, including: data exploration, preparation, cleaning, prediction, validation, visualization, and interpretation
build working knowledge of a data science pipeline, e.g. the R tidyverse (the one we will use for class) or Python with scikit-learn and pandas
develop machine learning algorithms tailored to the data and the business or research question
understand the strengths and limitations of existing analytic strategies, including: randomized controlled trials, observational studies, Cox proportional hazards, logistic regression

Grading Scale

All grades are tallied and at the end of the course they are scaled to meet the Heinz grading policy.

Cheating and Plagiarism Notice

The project that is submitted for grading is to be the work of the individual or team alone. Similarly, homework assignments should be your work alone, although you are encouraged to discuss the problems with your classmates. Results that are identical or nearly identical across projects may be regarded as cheating. Penalties for cheating include lowering your grade or failing the course. In extreme cases, the instructors may recommend the termination of your enrollment at CMU.

Additional Course Policies

Homework Policy: The lowest two homework grades will be dropped.
Late Work Policy: You are expected to turn in all work on time. Assignments turned in up to 24 hours after the deadline will be marked down 25%. Assignments turned in later than that will not be accepted.
Wellness Policy: Take care of yourself and take care of others around you. There are resources to help you both in Heinz and around the University. The Counseling and Psychological Services (CaPS) help line is 412-268-2922. If the situation is life threatening, call the police.

Course Topics

data wrangling and visualization

evaluating predictions

partition-based methods

bias-variance tradeoffs

ensembling

prediction versus attribution

logistic regression

preprocessing

missing data

Bayesian networks

encoding domain expertise

neural networks/deep learning

dimensionality reduction

temporal modeling

language modeling

technical debt

Course materials

There is no required textbook. Readings will come from multiple sources and will be provided on Canvas and/or in class. Recommended texts available on the Resources page include James et al.'s Introduction to Statistical Learning (ISL), Bishop's Pattern Recognition and Machine Learning (PRML), Murphy's Machine Learning: a Probabilistic Perspective (Murphy), and Deisenroth et al.'s Mathematics for Machine Learning.

Practicum methods

R, RStudio, R Markdown, tidyverse, debug, keras, shiny