--- title: Lecture 11 - Categorical variables and interaction terms in linear regression, Stratified regressions author: Prof. Alexandra Chouldechova date: 94-842 output: html_document: toc: true toc_depth: 5 --- Let's begin by loading the packages we'll need to get started ```{r} library(tidyverse) library(knitr) options(scipen=4) ``` We'll begin by doing all the same data processing as in previous lectures ```{r} # Load data from MASS into a tibble # Rename variables # Recode categorical variables birthwt <- as_tibble(MASS::birthwt) %>% rename(birthwt.below.2500 = low, mother.age = age, mother.weight = lwt, mother.smokes = smoke, previous.prem.labor = ptl, hypertension = ht, uterine.irr = ui, physician.visits = ftv, birthwt.grams = bwt) %>% mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other", )) %>% mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"), ~ recode_factor(.x, `0` = "no", `1` = "yes")) ``` ## Assessing significance of factors and interactions in regression ### Factors in linear regression #### Interpreting coefficients of factor variables In the case of quantitative predictors, we're more or less comfortable with the interpretation of the linear model coefficient as a "slope" or a "unit increase in outcome per unit increase in the covariate". This isn't the right interpretation for factor variables. In particular, the notion of a slope or unit change no longer makes sense when talking about a categorical variable. E.g., what does it even mean to say "unit increase in major" when studying the effect of college major on future earnings? To understand what the coefficients really mean, let's go back to the birthwt data and try regressing birthweight on mother's race and mother's age. ```{r} # Fit regression model birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt) # Regression model summary summary(birthwt.lm) ``` Note that there are two coefficients estimated for the race variable (`raceblack` and `raceother`). What's happening here? When you put a factor variable into a regression, you're allowing a **different intercept at every level of the factor**. In the present example, you're saying that you want to model `birthwt.grams` as **Baby's birthweight = Intercept(based on mother's race) + $\beta$ * mother's age**
We can rewrite this more succinctly as: $$ y = \text{Intercept}_{race} + \beta \times \text{age} $$ Essentially you're saying that your data is broken down into 3 racial groups, and you want to model your data as having the same slope governing how birthweight changes with mother's age, but potentially different intercepts. Here's a picture of what's happening. ```{r} # Calculate race-specific intercepts intercepts <- c(coef(birthwt.lm)["(Intercept)"], coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"], coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"]) lines.df <- data.frame(intercepts = intercepts, slopes = rep(coef(birthwt.lm)["mother.age"], 3), race = levels(birthwt$race)) lines.df qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df) ``` How do we interpret the 2 race coefficients? For categorical variables, the interpretation is relative to the given baseline. The baseline is just whatever level comes first (here, "WHITE"). E.g., the estimate of `raceother` means that the estimated intercept is `r abs(round(coef(birthwt.lm)["raceother"], 0))`g lower among "OTHER" race mothers compared to WHITE mothers. Similarly, the estimated intercept is `r abs(round(coef(birthwt.lm)["raceblack"],0))`g lower for BLACK mothers than WHITE mothers. > Another way of putting it: Among mothers of the same age, babies of black mothers are born on average weighing `r abs(round(coef(birthwt.lm)["raceblack"], 0))`g less than babies of white mothers. > **Note**: If you look at the inline code chunks, you'll notice I'm using the `abs()` function to take the absolute value of the coefficients. This is to avoid statements like: Among mothers of the same age, babies of BLACK mothers are born on average weighing `r round(coef(birthwt.lm)["raceblack"], 0)`g less than babies of WHITE mothers. > This statement is not correct, because saying something like "-100g less" actually translates into "100g more". ##### Why is one of the levels missing in the regression? As you've already noticed, there is no coefficient called "racewhite" in the estimated model. This is because this coefficient gets absorbed into the overall (Intercept) term. Let's peek under the hood. Using the `model.matrix()` function on our linear model object, we can get the data matrix that underlies our regression. Here are the first 20 rows. ```{r} head(model.matrix(birthwt.lm), 20) ``` Even though we think of the regression `birthwt.grams ~ race + mother.age` as being a regression on two variables (and an intercept), it's actually a regression on 3 variables (and an intercept). This is because the `race` variable gets represented as two dummy variables: one for `race == black` and the other for `race == other`. Why isn't there a column for representing the indicator of `race == white`? This gets back to our colinearity issue. By definition, we have that **raceblack + raceother + racewhite = 1 = (Intercept) **
This is because for every observation, one and only one of the race dummy variables will equal 1. Thus the group of 4 variables {raceblack, raceother, racewhite, (Intercept)} is perfectly colinear, and we can't include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, racewhite), and to keep the rest. #### Interaction terms Let's go back to the regression line plot we generated above. ```{r} qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df) ``` We have seen similar plots before by using the `geom_smooth` or `stat_smooth` commands in `ggplot`. Compare the plot above to the following. ```{r} qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + stat_smooth(method = "lm") ``` In this case we have not only race-specific intercepts, but also **race-specific slopes**. The plot above corresponds to the model: **Baby's birthweight = Intercept(based on mother's race) + $\beta$(based on mother's race) * mother's age**
We can rewrite this more succinctly as: $$ y = \text{Intercept}_{race} + \beta_{race}\times \text{age} $$ To specify this interaction model in R, we use the following syntax ```{r} birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt) summary(birthwt.lm.interact) ``` We now have new terms appearing. Terms like `raceblack:mother.age` are deviations from the baseline slope (the coefficient of `mother.age` in the model) in the same way that terms like `raceblack` are deviations from the baseline intercept. This models says that: > On average among WHITE mothers, every additional year of age is associated with a `r abs(round(coef(birthwt.lm.interact)["mother.age"], 1))`g INCREASE in the birthweight of the baby. To get the slope for BLACK mothers, we need to add the interaction term to the baseline. $$ \begin{aligned} \beta_{raceblack} &= \beta_{racewhite} + \beta_{raceblack:mother.age} \\ &= \text{mother.age} + \text{raceblack:mother.age} \\ &= `r round(coef(birthwt.lm.interact)["mother.age"], 1)` + (`r round(coef(birthwt.lm.interact)["raceblack:mother.age"], 1)`) \\ &= `r round(coef(birthwt.lm.interact)["mother.age"] + coef(birthwt.lm.interact)["raceblack:mother.age"], 1)` \end{aligned} $$ This slope estimate is NEGATIVE, which agrees with the regression plot above. #### Is a categorical variable in a regression statistically significant? Last class we considered modelling birthweight as a linear function of mother's age, allowing for a race-specific intercept for each of the three race categories. This model is fit again below. ```{r} birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt) ``` Here's a visualization of the model fit that we wound up with. Note that while there are 3 lines shown, this is a visualization of just one model: `birthwt.grams ~ race + mother.age`. This model produces 3 lines because the coefficients of the `race` variable result in different intercepts. ```{r} # Calculate race-specific intercepts intercepts <- c(coef(birthwt.lm)["(Intercept)"], coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"], coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"]) lines.df <- data.frame(intercepts = intercepts, slopes = rep(coef(birthwt.lm)["mother.age"], 3), race = levels(birthwt$race)) qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df) ``` At this stage we may be interested in assessing whether the `race` variable is statistically significant. i.e., Does including the `race` variable significantly improve the fit of our model, or is the simpler model `birthwt.grams ~ mother.age` just as good? Essentially, we want to know if the race-specific intercepts capture significantly more variation in the outcome (birthweight) than the single intercept model, or if allowing for different intercepts isn't doing much more than capturing random fluctuations in the data. Here's a picture of the two models we're comparing: ```{r, fig.width = 10} library(gridExtra) plot.complex <- qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + geom_abline(aes(intercept = intercepts, slope = slopes, color = race), data = lines.df) # Single intercept model (birthwt.grams ~ mother.age) p <- ggplot(birthwt, aes(x = mother.age, y = birthwt.grams)) plot.simple <- p + geom_point(aes(colour = race)) + stat_smooth(method = "lm") grid.arrange(plot.complex, plot.simple, ncol = 2) ``` To test this hypothesis, we use the `anova` function (not to be confused with the `aov` function). This function compares two **nested** models, accounting for their residual sums of squares (how well they fit the data) and their complexity (how many more variables are in the larger model) to assess statistical significance. If you have seen "F" tests in regression or "likelihood ratio tests" before, this is the type of test we're performing with `anova`. ```{r} # Fit the simpler model with mother.age as the only predictor birthwt.lm.simple <- lm(birthwt.grams ~ mother.age, data = birthwt) # Compare to more complex model anova(birthwt.lm.simple, birthwt.lm) ``` This output tells us that the `race` variable is statistically significant: It is unlikely that the improvement in fit when the add the `race` variable is simply due to random fluctuations in the data. Thus it is important to consider race when modeling how birthweight depends on the mother's age. #### Is an interaction term significant? Assessing significance of interaction terms operates on the same principle. We once again ask whether the improvement in model fit is worth the increased complexity of our model. For instance, consider the example we saw last class, where we allowed for a race-specific slope in addition to the race-specific intercept from before. ```{r} birthwt.lm.interact <- lm(birthwt.grams ~ race * mother.age, data = birthwt) summary(birthwt.lm.interact) ``` Here's a side-by-side visual comparison of the `race + mother.age` model and the `race + mother.age + race*mother.age` interaction model. ```{r, fig.width = 10} plot.interact <- qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) grid.arrange(plot.interact, plot.complex, ncol = 2) ``` So, do the lines with different slopes fit the data significantly better than the common slope model? Let's compare the two with the `anova()` function. ```{r} anova(birthwt.lm, birthwt.lm.interact) ``` This p-value turns out to not be statistically significant. So even though the estimated slopes in the interaction model look very different, our estimates are quite variable, so we don't have enough evidence to conclude that the interaction term (different slopes) is providing significant additional explanatory power over the simpler `race + mother.age` model. #### Is my complex model signficantly better than a simpler one? The testing strategy above applies to any two nested models. Here's an example where we add in a few more variables and see how it compares to the `race + mother.age` model from earlier. ```{r} birthwt.lm.complex <- lm(birthwt.grams ~ mother.smokes + physician.visits + race + mother.age, data = birthwt) summary(birthwt.lm.complex) ``` Let's compare to our earlier model: ```{r} anova(birthwt.lm, birthwt.lm.complex) ``` Highly significant! This is probably due to the fact that mother's smoking status has a tremendously high association with birthweight. ### Summarizing stratified regressions #### Gapminder life expectancy data ```{r} gapminder <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt", delim = "\t") # Load data gapminder # A look at the data ``` #### Example: Maximum life expectancy for each continent Before diving into a regression, let's look at some data summaries that use some familiar functions. Here's how we can form a table showing the maximum life expectancy on each continent each year, and the country that attained that maximum. ```{r, cache = TRUE} gapminder %>% group_by(continent, year) %>% summarize(max.life.exp = max(lifeExp), country = country[which.max(lifeExp)]) ``` #### Example (more involved): Fitting a linear model for each country We're now going to go through an example where we get the life expectancy in 1952 and the rate of change in life expectancy over time for each country. The rate of change will be obtained by regressing `lifeExp` on `year`. ##### (1) Figure out how to extract the desired information. Let's start with the data for a single country. ```{r, fig.align='center', fig.height=4, fig.width=5} country.name <- "Ireland" # Pick a country gapminder.sub <- filter(gapminder, country == country.name) # Pull data for this country gapminder.sub # Scatterplot of life exp vs year # with a regression line overlaid qplot(year, lifeExp, data = gapminder.sub, main = paste("Life expectancy in", country.name)) + geom_smooth(method = "lm") ``` We can confirm that it's a pretty good model for other countries as well, though not for all of them ```{r, fig.height = 20} ggplot(data = gapminder, aes(x = year, y = lifeExp)) + facet_wrap( ~ country) + geom_point() + geom_smooth(method = "lm") ``` Now let's fit a regression and extract the slope. ```{r} life.exp.lm <- lm(lifeExp ~ year, data = gapminder.sub) # Fit model coef(life.exp.lm) # Get coefficients coef(life.exp.lm)["year"] # The slope that we wanted ``` Here's how we can do everything in one line: ```{r} coef(lm(lifeExp ~ year, data = gapminder.sub))["year"] ``` ##### (2) Extract information for each country Here we'll extract the slope in the way we practiced above, and we'll also extract the "origin" life expectancy: the given country's life expectancy in 1952, the first year of our data. For the purpose of plotting we'll also want the continent information, so we'll capture that in this call too. ```{r} progress.df <- gapminder %>% group_by(country) %>% summarize(continent = continent[1], origin = lifeExp[year == 1952], slope = lm(lifeExp ~ year)$coef["year"]) progress.df ``` #### What can we learn from this output? Let's summarize our findings by creating a bar chart for the origin and slope, with the bars colored by continent. We'll do this in ggplot. #### Plotting origins coloured by continent This code is analogous to that presented at the end of Lecture 6 ```{r, fig.width = 18, warning = FALSE} # Reorder country factor by origin # Construct bar chart progress.df %>% mutate(country = reorder(country, origin)) %>% ggplot(aes(x = country, y = origin, fill = continent)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) ``` #### Plotting slopes coloured by continent ```{r, fig.width = 18, warning = FALSE} # Reorder country factor by slope # Construct bar chart progress.df %>% mutate(country = reorder(country, slope)) %>% ggplot(aes(x = country, y = slope, fill = continent)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) ``` These are very interesting plots. What can you tell from looking at them? #### Looking at per capita GDP by year Let's start by looking at some plots of how GDP per capita varied by year ```{r, fig.height = 20, fig.width = 15, cache = TRUE} # Use qplot from ggplot2 to generate plots qplot(year, gdpPercap, facets = ~ country, data = gapminder, colour = continent) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ``` What if we want to rearrange the plots by continent? This can be done by changing the order of the `country` level. ```{r, fig.height = 20, fig.width = 15, cache = TRUE} # First step: reorder the countries by continent # Produce a data frame that has the countries ordered alphabetically within continent # Arrange sorts the data according to the variable(s) provided country.df <- gapminder %>% group_by(country) %>% summarize(continent = continent[1]) %>% arrange(continent) gapminder.ordered <- gapminder %>% mutate(country = factor(country, levels = country.df$country)) # Let's make sure that things are now ordered correctly... levels(gapminder.ordered$country) # Use qplot from ggplot2 to generate plots qplot(year, gdpPercap, facets = ~ country, data = gapminder.ordered, colour = continent) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + stat_smooth(method = "lm") ```