Let’s begin by loading the packages we’ll need to get started

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.3.2     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(knitr)

options(scipen=4)

We’ll begin by doing all the same data processing as in previous lectures

# Load data from MASS into a tibble
# Rename variables
# Recode categorical variables
birthwt <- as_tibble(MASS::birthwt) %>%
  rename(birthwt.below.2500 = low, 
         mother.age = age,
         mother.weight = lwt,
         mother.smokes = smoke,
         previous.prem.labor = ptl,
         hypertension = ht,
         uterine.irr = ui,
         physician.visits = ftv,
         birthwt.grams = bwt)  %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other", )) %>%
  mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))

Assessing significance of factors and interactions in regression

Factors in linear regression

Interpreting coefficients of factor variables

In the case of quantitative predictors, we’re more or less comfortable with the interpretation of the linear model coefficient as a “slope” or a “unit increase in outcome per unit increase in the covariate”. This isn’t the right interpretation for factor variables. In particular, the notion of a slope or unit change no longer makes sense when talking about a categorical variable. E.g., what does it even mean to say “unit increase in major” when studying the effect of college major on future earnings?

To understand what the coefficients really mean, let’s go back to the birthwt data and try regressing birthweight on mother’s race and mother’s age.

# Fit regression model
birthwt.lm <- lm(birthwt.grams ~ race + mother.age, data = birthwt)

# Regression model summary
summary(birthwt.lm)
## 
## Call:
## lm(formula = birthwt.grams ~ race + mother.age, data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2131.57  -488.02    -1.16   521.87  1757.07 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2949.979    255.352  11.553   <2e-16 ***
## raceblack   -365.715    160.636  -2.277   0.0240 *  
## raceother   -285.466    115.531  -2.471   0.0144 *  
## mother.age     6.288     10.073   0.624   0.5332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 715.7 on 185 degrees of freedom
## Multiple R-squared:  0.05217,    Adjusted R-squared:  0.0368 
## F-statistic: 3.394 on 3 and 185 DF,  p-value: 0.01909

Note that there are two coefficients estimated for the race variable (raceblack and raceother). What’s happening here?

When you put a factor variable into a regression, you’re allowing a different intercept at every level of the factor. In the present example, you’re saying that you want to model birthwt.grams as

Baby’s birthweight = Intercept(based on mother’s race) + \(\beta\) * mother’s age


We can rewrite this more succinctly as: \[ y = \text{Intercept}_{race} + \beta \times \text{age} \]

Essentially you’re saying that your data is broken down into 3 racial groups, and you want to model your data as having the same slope governing how birthweight changes with mother’s age, but potentially different intercepts. Here’s a picture of what’s happening.

# Calculate race-specific intercepts
intercepts <- c(coef(birthwt.lm)["(Intercept)"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceblack"],
                coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = rep(coef(birthwt.lm)["mother.age"], 3),
                       race = levels(birthwt$race))

lines.df
##   intercepts   slopes  race
## 1   2949.979 6.287741 white
## 2   2584.264 6.287741 black
## 3   2664.513 6.287741 other
qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + 
  geom_abline(aes(intercept = intercepts, 
                  slope = slopes, 
                  color = race), data = lines.df)

How do we interpret the 2 race coefficients? For categorical variables, the interpretation is relative to the given baseline. The baseline is just whatever level comes first (here, “WHITE”). E.g., the estimate of raceother means that the estimated intercept is 285g lower among “OTHER” race mothers compared to WHITE mothers. Similarly, the estimated intercept is 366g lower for BLACK mothers than WHITE mothers.

Another way of putting it: Among mothers of the same age, babies of black mothers are born on average weighing 366g less than babies of white mothers.

Note: If you look at the inline code chunks, you’ll notice I’m using the abs() function to take the absolute value of the coefficients. This is to avoid statements like:

Among mothers of the same age, babies of BLACK mothers are born on average weighing -366g less than babies of WHITE mothers.

This statement is not correct, because saying something like “-100g less” actually translates into “100g more”.

Why is one of the levels missing in the regression?

As you’ve already noticed, there is no coefficient called “racewhite” in the estimated model. This is because this coefficient gets absorbed into the overall (Intercept) term.

Let’s peek under the hood. Using the model.matrix() function on our linear model object, we can get the data matrix that underlies our regression. Here are the first 20 rows.

head(model.matrix(birthwt.lm), 20)
##    (Intercept) raceblack raceother mother.age
## 1            1         1         0         19
## 2            1         0         1         33
## 3            1         0         0         20
## 4            1         0         0         21
## 5            1         0         0         18
## 6            1         0         1         21
## 7            1         0         0         22
## 8            1         0         1         17
## 9            1         0         0         29
## 10           1         0         0         26
## 11           1         0         1         19
## 12           1         0         1         19
## 13           1         0         1         22
## 14           1         0         1         30
## 15           1         0         0         18
## 16           1         0         0         18
## 17           1         1         0         15
## 18           1         0         0         25
## 19           1         0         1         20
## 20           1         0         0         28

Even though we think of the regression birthwt.grams ~ race + mother.age as being a regression on two variables (and an intercept), it’s actually a regression on 3 variables (and an intercept). This is because the race variable gets represented as two dummy variables: one for race == black and the other for race == other.

Why isn’t there a column for representing the indicator of race == white? This gets back to our colinearity issue. By definition, we have that

raceblack + raceother + racewhite = 1 = (Intercept)


This is because for every observation, one and only one of the race dummy variables will equal 1. Thus the group of 4 variables {raceblack, raceother, racewhite, (Intercept)} is perfectly colinear, and we can’t include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, racewhite), and to keep the rest.

Interaction terms

Let’s go back to the regression line plot we generated above.

qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) + 
  geom_abline(aes(intercept = intercepts, 
                  slope = slopes, 
                  color = race), data = lines.df)

We have seen similar plots before by using the geom_smooth or stat_smooth commands in ggplot. Compare the plot above to the following.

qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)
## `geom_smooth()` using formula 'y ~ x'