## Agenda

Let’s begin by loading the packages we’ll need to get started.

library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

## Exploring the birthwt data

We’ll begin by doing all the same data processing as in previous lectures.

# Load data from MASS into a tibble
birthwt <- as_tibble(MASS::birthwt)

# Rename variables
birthwt <- birthwt %>%
  rename(birthwt.below.2500 = low, 
         mother.age = age,
         mother.weight = lwt,
         mother.smokes = smoke,
         previous.prem.labor = ptl,
         hypertension = ht,
         uterine.irr = ui,
         physician.visits = ftv,
         birthwt.grams = bwt)

# Change factor level names
birthwt <- birthwt %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
  mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))

Over the past two lectures we created various tables and graphics to help us better understand the data. Our focus for today is to run hypothesis tests to assess whether the trends we observed last time are statistically significant.

One of the main reasons we want to understand hypothesis testing is that our tables and figures should convey statistical uncertainty whenever it is non-negligible, since failing to account for it can lead to misleading conclusions.

## Testing differences in means

One of the most common statistical tasks is to compare an outcome between two groups. The example here compares birth weight between smoking and non-smoking mothers.

To start, it always helps to plot things.

# Create boxplot showing how birthwt.grams varies between
# smoking status
qplot(x = mother.smokes, y = birthwt.grams,
      geom = "boxplot", data = birthwt,
      xlab = "Mother smokes", 
      ylab = "Birthweight (grams)",
      fill = I("lightblue"))

This plot suggests that smoking is associated with lower birth weight.

How can we assess whether this difference is statistically significant?

Let’s compute a summary table.

# Notice the consistent use of round() to ensure that our summaries 
# do not have too many decimal values
birthwt %>%
  group_by(mother.smokes) %>%
  summarize(mean.birthwt = round(mean(birthwt.grams), 0),
            sd.birthwt = round(sd(birthwt.grams), 0))
## # A tibble: 2 x 3
##   mother.smokes mean.birthwt sd.birthwt
##   <fct>                <dbl>      <dbl>
## 1 no                    3056        753
## 2 yes                   2772        660

The standard deviation is good to have, but to assess statistical significance we really want the standard error (the standard deviation divided by the square root of the group size).

birthwt %>%
  group_by(mother.smokes) %>%
  summarize(num.obs = n(),
            mean.birthwt = round(mean(birthwt.grams), 0),
            sd.birthwt = round(sd(birthwt.grams), 0),
            se.birthwt = round(sd(birthwt.grams) / sqrt(num.obs), 0))
## # A tibble: 2 x 5
##   mother.smokes num.obs mean.birthwt sd.birthwt se.birthwt
##   <fct>           <int>        <dbl>      <dbl>      <dbl>
## 1 no                115         3056        753         70
## 2 yes                74         2772        660         77
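
Since part of our goal is for tables and figures to convey statistical uncertainty, we could also present these results graphically. Here is a minimal sketch (not part of the original analysis) that plots the group means with error bars of ±1 standard error:

# Sketch: plot group means with +/- 1 standard error bars
birthwt %>%
  group_by(mother.smokes) %>%
  summarize(mean.birthwt = mean(birthwt.grams),
            se.birthwt = sd(birthwt.grams) / sqrt(n())) %>%
  ggplot(aes(x = mother.smokes, y = mean.birthwt)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean.birthwt - se.birthwt,
                    ymax = mean.birthwt + se.birthwt),
                width = 0.1) +
  labs(x = "Mother smokes", y = "Mean birthweight (grams)")

Judging from the table above, the two ±1 SE intervals do not overlap, which already hints at the outcome of the formal test we run next.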

### t-test via t.test()

This difference is looking quite significant. To run a two-sample t-test, we can simply use the t.test() function.

birthwt.t.test <- t.test(birthwt.grams ~ mother.smokes, data = birthwt)
birthwt.t.test
## 
##  Welch Two Sample t-test
## 
## data:  birthwt.grams by mother.smokes
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
##  mean in group no mean in group yes 
##          3055.696          2771.919

We see from this output that the difference is statistically significant (p ≈ 0.007). The t.test() function also outputs a confidence interval for us.

Notice that the function returns a lot of information, and we can access this information element by element.

names(birthwt.t.test)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
birthwt.t.test$p.value   # p-value
## [1] 0.007002548
birthwt.t.test$estimate  # group means
##  mean in group no mean in group yes 
##          3055.696          2771.919
birthwt.t.test$conf.int  # confidence interval for difference
## [1]  78.57486 488.97860
## attr(,"conf.level")
## [1] 0.95
attr(birthwt.t.test$conf.int, "conf.level")  # confidence level
## [1] 0.95

The ability to pull specific information from the output of the hypothesis test allows you to report your results using inline code chunks. That is, you don’t have to hardcode estimates, p-values, confidence intervals, etc.

# Calculate difference in means between smoking and nonsmoking groups
birthwt.t.test$estimate
##  mean in group no mean in group yes 
##          3055.696          2771.919
birthwt.smoke.diff <- round(birthwt.t.test$estimate[1] - birthwt.t.test$estimate[2], 1)

# Confidence level as a %
conf.level <- attr(birthwt.t.test$conf.int, "conf.level") * 100

Example: Here’s what happens when we knit the following paragraph.

Our study finds that birth weights are on average `r birthwt.smoke.diff`g higher in the non-smoking group compared to the smoking group (t-statistic `r round(birthwt.t.test$statistic,2)`, p=`r round(birthwt.t.test$p.value, 3)`, `r conf.level`% CI [`r round(birthwt.t.test$conf.int,1)`]g)

Output: Our study finds that birth weights are on average 283.8g higher in the non-smoking group compared to the smoking group (t-statistic 2.73, p=0.007, 95% CI [78.6, 489]g)

One other thing to know is that t.test() accepts input in multiple forms. I like using the formula form whenever it’s available, as I find it to be more easily interpretable. Here’s another way of specifying the same information.

with(birthwt, t.test(x=birthwt.grams[mother.smokes=="no"], 
                     y=birthwt.grams[mother.smokes=="yes"]))
## 
##  Welch Two Sample t-test
## 
## data:  birthwt.grams[mother.smokes == "no"] and birthwt.grams[mother.smokes == "yes"]
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
## mean of x mean of y 
##  3055.696  2771.919

Specifying x and y arguments to the t.test function runs a t-test to check whether x and y have the same mean.
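
If you want to convince yourself that the two calling conventions are equivalent, here is a quick check (not in the original notes) comparing the test statistic and p-value from both forms:

# Check that the formula and x/y interfaces give identical results
t.formula <- t.test(birthwt.grams ~ mother.smokes, data = birthwt)
t.xy <- with(birthwt, t.test(x = birthwt.grams[mother.smokes == "no"],
                             y = birthwt.grams[mother.smokes == "yes"]))
all.equal(t.formula$statistic, t.xy$statistic)
all.equal(t.formula$p.value, t.xy$p.value)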

## What is statistical significance testing doing?

Here’s a little simulation where we have two groups, a treatment group and a control group. We’re going to simulate observations from both groups. We’ll run the simulation two ways:

  • First simulation (Null case): the treatment has no effect
  • Second simulation (Non-null case): the treatment on average increases the outcome

set.seed(12345)
# Function to generate two-group data; mean.shift is added to the treatment group's outcomes
generateSimulationData <- function(n1, n2, mean.shift = 0) {
  y <- rnorm(n1 + n2) + c(rep(0, n1), rep(mean.shift, n2))
  groups <- c(rep("control", n1), rep("treatment", n2))
  data.frame(y = y, groups = groups)
}
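
As a quick sanity check (an extra step, not in the original notes), we can generate one large sample with a big mean shift and confirm that the treatment group's average is shifted up by roughly that amount. This draws extra random numbers, so we reset the seed afterwards to keep the realizations below reproducible.

# Sanity check: with large groups and mean.shift = 3, the treatment
# group's sample mean should be close to 3 and the control's close to 0
generateSimulationData(n1 = 1000, n2 = 1000, mean.shift = 3) %>%
  group_by(groups) %>%
  summarize(mean.y = mean(y))

# Reset the seed so the single realizations shown below are unaffected
set.seed(12345)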

Let’s look at a single realization in the null setting.

n1 <- 30
n2 <- 40
# Observation, null case
obs.data <- generateSimulationData(n1 = n1, n2 = n2)
obs.data
##              y    groups
## 1   0.58552882   control
## 2   0.70946602   control
## 3  -0.10930331   control
## 4  -0.45349717   control
## 5   0.60588746   control
## 6  -1.81795597   control
## 7   0.63009855   control
## 8  -0.27618411   control
## 9  -0.28415974   control
## 10 -0.91932200   control
## 11 -0.11624781   control
## 12  1.81731204   control
## 13  0.37062786   control
## 14  0.52021646   control
## 15 -0.75053199   control
## 16  0.81689984   control
## 17 -0.88635752   control
## 18 -0.33157759   control
## 19  1.12071265   control
## 20  0.29872370   control
## 21  0.77962192   control
## 22  1.45578508   control
## 23 -0.64432843   control
## 24 -1.55313741   control
## 25 -1.59770952   control
## 26  1.80509752   control
## 27 -0.48164736   control
## 28  0.62037980   control
## 29  0.61212349   control
## 30 -0.16231098   control
## 31  0.81187318 treatment
## 32  2.19683355 treatment
## 33  2.04919034 treatment
## 34  1.63244564 treatment
## 35  0.25427119 treatment
## 36  0.49118828 treatment
## 37 -0.32408658 treatment
## 38 -1.66205024 treatment
## 39  1.76773385 treatment
## 40  0.02580105 treatment
## 41  1.12851083 treatment
## 42 -2.38035806 treatment
## 43 -1.06026555 treatment
## 44  0.93714054 treatment
## 45  0.85445172 treatment
## 46  1.46072940 treatment
## 47 -1.41309878 treatment
## 48  0.56740325 treatment
## 49  0.58318765 treatment
## 50 -1.30679883 treatment
## 51 -0.54038607 treatment
## 52  1.94769266 treatment
## 53  0.05359027 treatment
## 54  0.35166284 treatment
## 55 -0.67097654 treatment
## 56  0.27795369 treatment
## 57  0.69117127 treatment
## 58  0.82379533 treatment
## 59  2.14506502 treatment
## 60 -2.34694398 treatment
## 61  0.14959198 treatment
## 62 -1.34253148 treatment
## 63  0.55330308 treatment
## 64  1.58996284 treatment
## 65 -0.58687959 treatment
## 66 -1.83237731 treatment
## 67  0.88813943 treatment
## 68  1.59348847 treatment
## 69  0.51685467 treatment
## 70 -1.29567168 treatment
# Box plots
qplot(x = groups, y = y, data = obs.data, geom = "boxplot")

# Density plots
qplot(fill = groups, x = y, data = obs.data, geom = "density", 
      alpha = I(0.5),
      adjust = 1.5, 
      xlim = c(-4, 6))

# t-test
t.test(y ~ groups, data = obs.data)
## 
##  Welch Two Sample t-test
## 
## data:  y by groups
## t = -0.61095, df = 67.998, p-value = 0.5433
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6856053  0.3641889
## sample estimates:
##   mean in group control mean in group treatment 
##              0.07880701              0.23951518

And here’s what happens in a random realization in the non-null setting.

# Non-null case, very strong treatment effect
# Observation, non-null case
obs.data <- generateSimulationData(n1 = n1, n2 = n2, mean.shift = 1.5)

# Box plots
qplot(x = groups, y = y, data = obs.data, geom = "boxplot")

# Density plots
qplot(fill = groups, x = y, data = obs.data, geom = "density", 
      alpha = I(0.5),
      adjust = 1.5, 
      xlim = c(-4, 6))
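
For symmetry with the null case we can run the t-test on this non-null realization as well (output not shown here; with a mean shift of 1.5 and these sample sizes the test will typically reject the null at conventional levels):

# t-test, non-null case
t.test(y ~ groups, data = obs.data)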