Final project: missing data & topcoded values
====
author: Prof. Chouldechova
date:
font-family: Gill Sans
autosize: false
width:1920
height:1080

Final project: missing data
====

There is a fair bit of missingness in the data set. There are several approaches to dealing with missing data:

1. **Exclude** - You can omit observations with missing values (e.g., remove any rows that contain missing data)

2. **Impute** - R has various packages (Amelia, mice, mi, the `impute()` function from Hmisc, etc.) that can help with imputing missing values.

3. Think carefully about whether certain kinds of missingness are **informative**

Final project: missing data
====

The downsides of the **Impute** approach:

- Imputation methods often rely on fairly **strong assumptions** about the process that generates the missing values (assumptions such as MAR, missing at random; or MCAR, missing completely at random).

- This is **a lot of hassle** to go through unless you want practice imputing values

Final project: missing data
====

Why the **think carefully** approach can be a good one:

- For factor variables, you can treat missing values as just another factor level. Sometimes **missingness can be informative (predictive)**, leading to a significant coefficient for the missing level.
    - E.g., just now we ran a logistic regression in which we used `?` as one of the levels of the `workingclass` variable to indicate individuals whose working class is unknown. Having `workingclass = ?` turned out to be strongly associated with earning under 50k a year.

- For numeric variables, there's not much you can do. Just recode negative values to `NA`.

Final project: missing data
====

**My recommendation**

1. Start by **thinking carefully** about missing values

2. If nothing interesting turns up, go ahead and **exclude** them (code as `NA`, proceed accordingly)

- **Warning**: Trying to impute can consume a lot of time
    - It is not guaranteed to produce better results than you'd get by simply excluding all observations with missing values.

Final project: topcoded outcome variable
====

- The income variable that you have available is **topcoded**.
    - For the top 2% of earners, you don't observe their actual income.
    - Instead, their income is recorded as the average of the top 2% of incomes.

- Standard regression applied to data with a topcoded outcome is **inconsistent**.
    - i.e., even with infinite data, your coefficient estimates won't converge to the "true" coefficients.

Final project: topcoded outcome variable
====

1. **Tobit regression** (censored regression).
    - We didn't talk about this method in class
    - It's not too difficult to understand if you already understand linear regression.
    - [A tutorial can be found here](http://www.ats.ucla.edu/stat/r/dae/tobit.htm).

2. Try fitting the regression models / running hypothesis tests two ways
    - **First way**: include the topcoded observations
    - **Second way**: exclude all observations with topcoded outcomes
    - If your estimates change a lot, then you probably don't want to use the topcoded observations
    - If you go this route, be sure to explain what omitting the high-earning individuals means for the scope of your conclusions.

**My recommendation**: Take approach (2), unless you want practice with tobit regression.
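
Final project: topcoded outcome variable
====

The two-way comparison in approach (2) might look something like the sketch below. It runs on simulated data, not the project data: the variable names (`educ`, `income.tc`, `is.topcoded`), the sample size, and the data-generating model are all made up for illustration. In the real data you would need to identify the topcoded rows yourself, most likely as the observations that share the single largest recorded income value.

```r
# Simulated stand-in for the project data; every name and number here is hypothetical.
set.seed(1)
n <- 5000
educ <- rnorm(n, mean = 13, sd = 2.5)                   # years of education
income <- 10000 + 3000 * educ + rnorm(n, sd = 15000)    # "true" income, never observed in full

# Topcode the outcome: the top 2% of incomes are replaced by their average
cutoff <- quantile(income, 0.98)
is.topcoded <- income > cutoff
income.tc <- ifelse(is.topcoded, mean(income[is.topcoded]), income)
dat <- data.frame(educ, income.tc, is.topcoded)

# First way: include the topcoded observations
fit.all <- lm(income.tc ~ educ, data = dat)

# Second way: exclude all observations with topcoded outcomes
fit.excl <- lm(income.tc ~ educ, data = subset(dat, !is.topcoded))

# Put the two sets of coefficient estimates side by side
coef.compare <- cbind(include = coef(fit.all), exclude = coef(fit.excl))
round(coef.compare, 1)
```

If the two columns differ a lot, that is the signal to think twice about using the topcoded observations. Keep in mind that the second fit only describes individuals below the topcoding threshold.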