This homework is due by 2:50PM on Thursday, February 8.
To complete this assignment, follow these steps:
  1. Download the homework3.Rmd file from Canvas or the course website.

  2. Open homework3.Rmd in RStudio.

  3. Replace the “Your Name Here” text in the author: field with your own name.

  4. Supply your solutions to the homework by editing homework3.Rmd.

  5. When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework3_YourNameHere.Rmd (changing YourNameHere to your own name), and submit both the .Rmd file and the .html output file on Canvas.

Preamble: Loading packages and data

DO NOT CHANGE ANYTHING ABOUT THIS CODE!

library(ggplot2)
library(ISLR)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(leaps)  # needed for regsubsets
library(boot)   # needed for cv.glm

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

options(scipen = 4)

# Online news share count data
set.seed(20120180)
online.news <- read.csv("http://www.andrew.cmu.edu/user/achoulde/95791/data/online_news.csv")
# Augment the data with num.noise pure-noise predictor columns (named X1, X2, ...)
num.noise <- 50
news <- data.frame(online.news, 
                   matrix(rnorm(num.noise * nrow(online.news)), 
                            nrow = nrow(online.news))
                   )
# Extract covariates matrix (for lasso)
news.x <- as.matrix(news[, -which(names(news) == "shares")])
# Extract response variable (for lasso)
news.y <- news$shares

Data dictionary

If you want to learn more about this data set, you can have a look at the data dictionary provided here: Data dictionary for news share data.

Problem 1

This question walks you through a comparison of three variable selection procedures we discussed in class.

(a) Use the glm command to fit a linear regression of shares on all the other variables in the news data set. Print the names of the predictors whose coefficient estimates are statistically significant at the 0.05 level. Are any of the “noise” predictors statistically significant? (Recall that “noise” predictors all have variable names of the form X#.)
# Edit me

  • Your answer here.

Hint: To access the P-value column of a fitted model named my.fit, you’ll want to look at the coef(summary(my.fit)) object. If you are new to R, you may find the following section of the 94-842 note helpful.
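
Hint: a minimal sketch of one approach (the model name news.glm is illustrative):

news.glm <- glm(shares ~ ., data = news)
# Pull the p-value column and report the significant predictors
pvals <- coef(summary(news.glm))[, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]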

(b) Use the cv.glm command with 10-fold cross-validation to estimate the test error of the model you fit in part (a). Repeat with K = 2, 5, 10, 20, 50, 100 folds (use a loop).
Calculate the standard deviation of your CV estimates divided by the mean of the estimates. This quantity is called the coefficient of variation. Do the error estimates change much across the different choices of \(K\)?
# Edit me

  • Your answer here.

Note: This loop may take a few minutes to run. I have supplied the argument cache = TRUE in the header to prevent the code from needing to re-execute every time you knit. This code chunk will re-execute only if the code it contains gets changed.
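
Hint: one possible shape for the loop (this assumes news.glm from the sketch in part (a)):

k.vals <- c(2, 5, 10, 20, 50, 100)
cv.errors <- sapply(k.vals, function(k) {
  cv.glm(data = news, glmfit = news.glm, K = k)$delta[1]
})
sd(cv.errors) / mean(cv.errors)  # coefficient of variation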

(c) The code below produces estimates of test error using the validation set approach. Run this code 50 times (put the code in a loop). Calculate the standard deviation of the estimates divided by the mean of the estimates. Are the answers you get more or less variable than your answers from part (b)?
####
## Modify the code below as necessary to answer the question.
####
# Form a random split
rand.split <- sample(cut(1:nrow(news), breaks = 2, labels = FALSE))
# Fit model on first part
news.glm.train <- glm(shares ~ ., data = news[rand.split == 1,])
# Predict on the second part
news.glm.pred <- predict(news.glm.train, newdata = news[rand.split == 2, ])
# Calculate MSE on the second part
mean((news$shares[rand.split == 2] - news.glm.pred)^2)
## [1] 125708178

  • Your answer here.
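
Hint: a sketch of one way to wrap the supplied code in a loop (variable names are illustrative):

val.mse <- replicate(50, {
  rand.split <- sample(cut(1:nrow(news), breaks = 2, labels = FALSE))
  fit <- glm(shares ~ ., data = news[rand.split == 1, ])
  pred <- predict(fit, newdata = news[rand.split == 2, ])
  mean((news$shares[rand.split == 2] - pred)^2)
})
sd(val.mse) / mean(val.mse)  # coefficient of variation across the 50 splits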

Best subset selection

(d) The code below performs Best Subset Selection to identify which variables in the model are most important. We only go up to models of size 5, because beyond that the computation time starts to get excessive.
Which variables are included in the best model of each size? (You will want to work with the summary(news.subset) or coef(news.subset, id = ) object to determine this.) Are the models all nested? That is, is the best model of size k-1 always a subset of the best model of size k? Do any “noise predictors” appear in any of the models?
set.seed(12310)
# Get a smaller subset of the data to work with
# Use this ONLY for problem (d).
news.small <- news[sample(nrow(news), 2000), ]
# Best subset selection
news.subset <- regsubsets(shares ~ .,
               data = news.small,
               nbest = 1,    # 1 best model for each number of predictors
               nvmax = 5,    # NULL for no limit on number of variables
               method = "exhaustive", really.big = TRUE)

# Add code below to answer the question

  • Your answer here.
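
Hint: one way to read off the selected variables (a sketch; summary(news.subset)$which is a logical matrix with one row per model size):

subset.summary <- summary(news.subset)
# Names of the variables included in the best model of each size
apply(subset.summary$which, 1, function(row) names(which(row)))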

Forward Stepwise Selection

(e) Modify the code provided in part (d) to perform Forward stepwise selection instead of exhaustive search. There should be no limit on the maximum size of subset to consider.

NOTE: You will need to swap out news.small for the full news data. You should not use news.small for anything other than part (d).

# Edit me
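
Hint: a sketch of the modified call (the name news.fwd is illustrative; nvmax is set to the total number of predictors so that no subset size is excluded):

news.fwd <- regsubsets(shares ~ .,
                       data = news,             # full data, not news.small
                       nbest = 1,
                       nvmax = ncol(news) - 1,  # allow models up to all predictors
                       method = "forward")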

Note: Parts (f) - (k) all refer to the results produced by Forward stepwise selection.

(f) For models of size 1:12, display the variables that appear in each model. Are the models all nested? Do any “noise predictors” appear in any of the models?
# Edit me

  • Your answer here.
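
Hint: a sketch, assuming news.fwd from part (e):

fwd.summary <- summary(news.fwd)
# Variables appearing in the best forward stepwise model of each size 1 through 12
apply(fwd.summary$which[1:12, ], 1, function(row) names(which(row)))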

(g) When you run summary() on a regsubsets object you get a number of useful quantities. Construct a plot showing R-squared on the y-axis and model size on the x-axis. Use appropriate axis labels. Does R-squared always increase as we increase the model size? Explain.
# Edit me

  • Your answer here.
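
Hint: one possible plot, reusing fwd.summary from the sketch in part (f):

qplot(x = seq_along(fwd.summary$rsq), y = fwd.summary$rsq,
      geom = c("point", "line"),
      xlab = "Model size", ylab = "R-squared")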

(h) Construct a plot showing Residual sum of squares on the y-axis and model size on the x-axis. Does the RSS always decrease as we increase the model size? Explain.
# Edit me

  • Your answer here.
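
Hint: same idea as part (g), using the rss component of fwd.summary:

qplot(x = seq_along(fwd.summary$rss), y = fwd.summary$rss,
      geom = c("point", "line"),
      xlab = "Model size", ylab = "Residual sum of squares")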

(i) [2 points] Construct a plot showing AIC (aka Mallows Cp) on the y-axis and model size on the x-axis. Is the curve monotonic? Explain. What model size minimizes AIC? How many “noise predictors” get included in this model?
# Edit me

  • Your answer here.
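
Hint: a sketch using the cp component of fwd.summary:

qplot(x = seq_along(fwd.summary$cp), y = fwd.summary$cp,
      geom = c("point", "line"),
      xlab = "Model size", ylab = "Mallows Cp")
which.min(fwd.summary$cp)  # model size minimizing Cp/AIC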

(j) Construct a plot showing BIC on the y-axis and model size on the x-axis. Is the curve monotonic? Explain. What model size minimizes BIC? How many “noise predictors” get included in this model?
# Edit me

  • Your answer here.
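
Hint: a sketch using the bic component of fwd.summary:

qplot(x = seq_along(fwd.summary$bic), y = fwd.summary$bic,
      geom = c("point", "line"),
      xlab = "Model size", ylab = "BIC")
which.min(fwd.summary$bic)  # model size minimizing BIC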

(k) [2 points] Compare the models selected by AIC and BIC. Is one a subset of the other? Which criterion selects the smaller model? Does that criterion always result in a smaller model, or does this happen just by coincidence on the news data? Explain.

  • Your answer here.
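
Hint: a sketch for checking nestedness (assuming news.fwd and fwd.summary from the sketches above):

aic.size <- which.min(fwd.summary$cp)
bic.size <- which.min(fwd.summary$bic)
aic.vars <- names(coef(news.fwd, id = aic.size))
bic.vars <- names(coef(news.fwd, id = bic.size))
all(bic.vars %in% aic.vars)  # TRUE if the BIC model is a subset of the AIC model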

Lasso

For the Lasso problems below, you may find it helpful to review the code examples in the Linear regression with glmnet vignette. Running the command glmnet(x = X, y = y), where y is your response vector and X is your covariate matrix, will fit a Lasso.

(l) Variables news.x and news.y were pre-constructed in the preamble to this assignment. Use the glmnet command to fit a Lasso to this data. Call the result news.lasso.
# Edit me
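
Hint: a minimal sketch (glmnet fits the Lasso by default, since alpha = 1):

news.lasso <- glmnet(x = news.x, y = news.y)
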
(m) It turns out that news.lasso contains model fits for an entire sequence of \(\lambda\) values. Look at the news.lasso$lambda attribute. How many \(\lambda\) values do we have model fits for?
# Edit me

  • Your answer here.
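
Hint: one way to answer this:

length(news.lasso$lambda)  # number of lambda values in the fitted sequence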

(n) The coef(news.lasso, s = ) command will print out the estimated coefficients for our model at any lambda value s. Display the coefficient estimates for the 25th value of \(\lambda\) from part (l). How many coefficients in this model are non-zero? How many of the non-zero coefficients come from “noise predictors”?
# Edit me

  • Your answer here.
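
Hint: a sketch (the intercept is subtracted from the non-zero count since it is not a predictor):

lasso.coef.25 <- coef(news.lasso, s = news.lasso$lambda[25])
lasso.coef.25
sum(lasso.coef.25 != 0) - 1  # non-zero coefficients, excluding the intercept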

(o) Run the plot command on your news.lasso object to get a regularization plot. Review the help file for plot.glmnet to figure out how to set “norm” as the x-axis variable option, and how to add labels to the curves. In this parameterization of the x-axis, is the model fit getting more complex or less complex as the x-axis variable increases?
# Edit me

  • Your answer here.
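
Hint: a sketch based on the xvar and label arguments of plot.glmnet:

plot(news.lasso, xvar = "norm", label = TRUE)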

(p) cv.glmnet(x, y) will perform 10-fold cross-validation on the entire sequence of models fit by glmnet(x, y). Use the cv.glmnet command to perform cross-validation. Save the results in a variable called news.lasso.cv. Run the plot() command to get a CV error plot.
# Edit me
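
Hint: a minimal sketch:

news.lasso.cv <- cv.glmnet(x = news.x, y = news.y)
plot(news.lasso.cv)
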
(q) Use the news.lasso.cv object to figure out the value of \(\lambda\) that minimizes CV error. Which value of \(\lambda\) does the 1-SE rule tell us to use? How many non-zero variables are selected by the min-CV rule and the 1-SE rule? What is the estimated CV error for both of these models? How many “noise predictors” get included in each model?
# Edit me

  • Your answer here.
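
Hint: a sketch of the pieces you need (lambda.min, lambda.1se, nzero, and cvm are all components of a cv.glmnet fit):

news.lasso.cv$lambda.min  # lambda minimizing CV error
news.lasso.cv$lambda.1se  # lambda chosen by the 1-SE rule
# Number of non-zero coefficients at each choice
news.lasso.cv$nzero[news.lasso.cv$lambda == news.lasso.cv$lambda.min]
news.lasso.cv$nzero[news.lasso.cv$lambda == news.lasso.cv$lambda.1se]
# Estimated CV error at each choice
news.lasso.cv$cvm[news.lasso.cv$lambda == news.lasso.cv$lambda.min]
news.lasso.cv$cvm[news.lasso.cv$lambda == news.lasso.cv$lambda.1se]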

(r) How does the CV error of the lambda-min model compare to the 10-fold CV error of the linear model you fit in part (a)? Does it look like a model with a small number of predictors does as good a job of predicting share count?
# Edit me

  • Your answer here.
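
Hint: a sketch of the comparison (cv.errors and k.vals here refer to the sketch in part (b)):

min(news.lasso.cv$cvm)          # CV error of the lambda-min Lasso
cv.errors[which(k.vals == 10)]  # 10-fold CV error of the full linear model from (a)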