Care, Health and Reasoning Machines

The CHARM laboratory is led by Jeremy Weiss. We develop probabilistic machine learning algorithms and deploy them in healthcare settings at CMU and Pittsburgh. The Heinz College at CMU holds expertise in analytics-driven health policy, excellence in Computer Science and Machine Learning, and close ties to our healthcare industry partners and innovation community.

PhD students at CMU/Pittsburgh: if you have inquiries about my lab, send me an email with your CMU/Pittsburgh address and include "PhD" in the subject line.

Prospective PhD students not currently at CMU: consider applying to Heinz College, the Joint PhD Program in Machine Learning and Public Policy, or the University of Pittsburgh PhD Program in Biomedical Informatics.

News

"Survival-supervised topic modeling with anchor words: characterizing pancreatitis outcomes" was accepted to the NIPS Workshop on Machine Learning for Health, 2017.

"Piecewise-constant parametric approximations for survival learning" was accepted to Machine Learning for Healthcare 2017.

Motivation

Why EHRs: Electronic health records (EHRs) document over 80% of medical encounters, a 7-fold increase from one decade ago, as a result of government subsidies through the HITECH Act. Using EHR data, we can now conduct clinical analyses directly to improve health outcomes, without the high cost of running longitudinal studies. Alongside EHR data, device and -omic data are coming online and will require scalable, integrated methods to make personalized predictions and recommendations for patients.

Call for machine learning: However, EHR data are messy. Structured data--a fraction of data contained in EHRs--come as databases, not fixed-length feature vectors. Medicine changes over time, in prevalence, treatment, and best practices. What an EHR stores also changes over time.

Conclusions from data that guide health policy will require machine learning and statistical approaches that account for such characteristics.

With machine learning, however, we can understand EHR data better than ever before. We organize our analyses under three terms: characterization, prediction, and intervention.

Ongoing projects:

k-year risk of disease y

We can flexibly automate the process of constructing a longitudinal clinical study from EHR data. Most observational studies can be constructed using sorting, filtering, and matching operations. We are working to operationalize the use of EHR data across diseases and time.
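As a rough illustration of the sorting and filtering steps described above, the sketch below labels each patient with whether an outcome occurs within k years of their first recorded event. It is a minimal example under stated assumptions, not the lab's pipeline: the event rows, the code "disease_y", the function name k_year_labels, and the rule for dropping patients with insufficient follow-up are all hypothetical, and the matching step is omitted.

```python
from datetime import date, timedelta
from itertools import groupby
from operator import itemgetter

# Hypothetical EHR event rows: (patient_id, event_date, code). Illustrative only.
events = [
    ("p1", date(2010, 1, 5), "visit"),
    ("p1", date(2012, 6, 1), "disease_y"),
    ("p2", date(2011, 3, 9), "visit"),
    ("p2", date(2018, 3, 9), "visit"),
    ("p3", date(2015, 7, 1), "visit"),
    ("p3", date(2016, 1, 1), "visit"),
]


def k_year_labels(events, outcome_code, k, data_end):
    """Return {patient_id: bool}: did outcome_code occur within k years of the index date?

    The index date is assumed to be the patient's first recorded event.
    """
    window = timedelta(days=round(365.25 * k))
    labels = {}
    # Sort by patient, then by date, so each patient's rows are contiguous and ordered.
    ordered = sorted(events, key=itemgetter(0, 1))
    for patient_id, rows in groupby(ordered, key=itemgetter(0)):
        rows = list(rows)
        index_date = rows[0][1]
        window_end = index_date + window
        # Filter: exclude patients who already have the outcome at baseline.
        if any(code == outcome_code and when <= index_date for _, when, code in rows):
            continue
        had_outcome = any(
            code == outcome_code and index_date < when <= window_end
            for _, when, code in rows
        )
        # Filter: without the outcome, the full k-year window must be observed.
        if not had_outcome and window_end > data_end:
            continue
        labels[patient_id] = had_outcome
    return labels


labels = k_year_labels(events, "disease_y", k=5, data_end=date(2019, 1, 1))
print(labels)  # {'p1': True, 'p2': False}; p3 is dropped for insufficient follow-up
```

Dropping patients with incomplete follow-up is the simplest choice here; a survival model that handles censoring would keep them instead.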

Timelines as random and regular process mixtures

For example, a patient with diabetes has random events (such as diabetic ketoacidosis or diabetic retinopathy) mixed with regular events (blood sugar measurements, eye and foot exams, A1C measurements). Modeling mixtures of random and probabilistically regular events requires developing new machine learning models.

Timeline as a point process
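To make the idea of a mixed timeline concrete, the sketch below simulates one. It assumes, purely for illustration, that random events follow a homogeneous Poisson process and that regular events follow a fixed schedule with Gaussian jitter; the function name simulate_timeline and all parameter values are hypothetical and do not describe the lab's model.

```python
import random


def simulate_timeline(horizon_days, random_rate_per_day, regular_period_days,
                      jitter_sd_days, seed=0):
    """Return a time-sorted list of (day, kind) with kind in {"random", "regular"}."""
    rng = random.Random(seed)
    timeline = []

    # Random events: exponential gaps between arrivals give a Poisson process.
    t = rng.expovariate(random_rate_per_day)
    while t < horizon_days:
        timeline.append((t, "random"))
        t += rng.expovariate(random_rate_per_day)

    # Regular events: one per scheduled due date, observed with Gaussian jitter
    # and clipped to the observation window.
    due = regular_period_days
    while due < horizon_days:
        when = min(max(rng.gauss(due, jitter_sd_days), 0.0), horizon_days)
        timeline.append((when, "regular"))
        due += regular_period_days

    timeline.sort()
    return timeline


# Two years of follow-up: rare random events plus a roughly quarterly check-up.
timeline = simulate_timeline(730, random_rate_per_day=0.01,
                             regular_period_days=90, jitter_sd_days=7, seed=1)
for day, kind in timeline[:5]:
    print(f"day {day:6.1f}: {kind}")
```

In real data the event kinds are not labeled this way, so a model has to infer which events belong to which process.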

Constrained machine learning for inference

Inference is often the primary limiting factor in scaling machine learning models to big data, and many inference problems have been shown to be NP-hard. In the learning-for-inference subfield, Weiss et al. (AAAI 2015) showed how approximate inference could be markedly improved by using predictions from a constrained machine learning setup. We are analyzing this problem in greater detail.

Constrained ML to build a more accurate sampler

Deep timelines with cascade units

Timelines, or aperiodically time-stamped data, are understudied in the framework of deep learning. Yet all transactional and log data arrive in this format. We are working on scalable ways to characterize health transaction trends across time using deep learning, one of the most promising approaches to predictive analytics.

Individualized treatment effects (ITE) from randomized, semi-randomized and observational data

Randomized controlled trials (RCTs), also known as A/B tests, estimate the average treatment effect. To limit heterogeneity, trials narrow their inclusion criteria, yet investigators then want to draw general conclusions about the value of the therapy. Observational studies have plentiful data from heterogeneous populations, but they must make strong assumptions when used to model cause and effect. Methods that estimate the individualized treatment effect are crucial for making personalized causal claims. We believe semi-randomized data can help, and we are developing ITE estimators from such data.

ITE: attempting to estimate individualized treatment effect i.e. on the optimal diagonal
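The gap between an average effect and a more individualized one can be shown on simulated randomized data. The sketch below is an illustration under stated assumptions, not the lab's estimator: the data-generating process, the binary covariate "severe", and the function names are all hypothetical, and the subgroup estimates are conditional average effects, a coarse stand-in for truly individualized effects.

```python
import random
from statistics import mean


def simulate_trial(n, seed=0):
    """Simulate (severe, treated, outcome) rows from a hypothetical randomized trial."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        treated = rng.random() < 0.5  # randomized assignment, independent of severity
        effect = 2.0 if severe else 0.0  # assumed: treatment helps only severe patients
        outcome = rng.gauss(0.0, 1.0) + (effect if treated else 0.0)
        rows.append((severe, treated, outcome))
    return rows


def effect_estimate(rows):
    """Difference in mean outcome between the treated and control arms."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


rows = simulate_trial(20000, seed=0)
ate = effect_estimate(rows)                                   # pooled over everyone
effect_severe = effect_estimate([r for r in rows if r[0]])    # severe subgroup only
effect_mild = effect_estimate([r for r in rows if not r[0]])  # mild subgroup only
print(f"ATE {ate:.2f}, severe {effect_severe:.2f}, mild {effect_mild:.2f}")
```

Under these assumptions the pooled estimate sits near 1, while the subgroup estimates sit near 2 and 0, so the average describes neither group well. In observational data, treatment assignment is not independent of severity, and a simple difference in means no longer estimates a causal effect.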

