---
title: "Lab 4 Solutions"
author: "Prof. Chouldechova"
date: ""
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: cerulean
    highlight: tango
---

```{r package_load}
library(ggplot2)   # graphics library
library(ISLR)      # contains code and data from the textbook
library(knitr)     # contains kable() function
library(tree)      # for the tree-fitting 'tree' function
library(rpart)     # for nicer tree fitting
library(partykit)  # for nicer tree plotting
library(MASS)      # for the Boston data
options(scipen = 4)  # suppresses scientific notation
```

### 1. Changing the author field and file name.

##### (a) Change the `author:` field on the Rmd document from Your Name Here to your own name.

##### (b) Rename this file to "lab01_YourNameHere.Rmd", where YourNameHere is changed to your own name.
> The next portion of the lab gets you to carry out the Lab exercises in §8.3.1 - 8.3.3 of ISLR (Pages 323 - 330). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you're doing.

### 2. Classification trees

> You will need the `Carseats` data set from the `ISLR` library in order to complete this exercise.

> Please run all of the code indicated in §8.3.1 of ISLR, even if I don't explicitly ask you to do so in this document.

##### (a) Run the `View()` command on the `Carseats` data to see what the data set looks like.

```{r}
#View(Carseats)
```

##### The following code constructs the `High` variable for the purpose of classification. Our goal will be to classify whether Carseat sales in a store are high or not.

```{r}
High <- with(Carseats, ifelse(Sales <= 8, "No", "Yes"))
Carseats <- data.frame(Carseats, High)
```

##### (b) What proportion of stores have High carseat sales?

```{r}
prop.table(table(Carseats$High))
```

- We see that `r round(100 * prop.table(table(Carseats$High))["Yes"], 0)`% of locations had high carseat sales.

##### (c) Use the `tree` command to fit a decision tree to `High` using every variable in the `Carseats` data other than `Sales`. Run `summary` on your tree object. Run the `plot` and `text` commands on your tree object.

```{r, fig.width = 14, fig.height = 10}
# We don't want to use the Sales variable as an input because
# our outcome variable, High, is derived from Sales
tree.carseats <- tree::tree(High ~ . - Sales, Carseats)
summary(tree.carseats)
plot(tree.carseats)
text(tree.carseats, pretty = 0)
```

##### (d) Does the tree make use of all of the available variables, or are there some that the tree does not use? (Look at the tree summary print-out to figure this out.)
```{r}
summary(tree.carseats)$used
names(Carseats)[which(!(names(Carseats) %in% summary(tree.carseats)$used))]
```

- The output above shows which variables get used and which do not. We see that not all variables are used (`Sales` is not the only omitted variable).

##### (e) What is the misclassification rate of the tree? Is this a good misclassification rate? i.e., Is the rate low relative to the prevalence of High sales in the data?

```{r}
# summary()$misclass gives two numbers:
# the number misclassified, and the total number of observations
summary(tree.carseats)$misclass

misclass.rate <- summary(tree.carseats)$misclass[1] / summary(tree.carseats)$misclass[2]
misclass.rate
```

- We see that the misclassification rate is just `r round(100 * misclass.rate, 1)`%, which is much lower than the prevalence of High sales in the data.

##### (f) Study the output from the tree print-out you get when you run the command below. Try to figure out how it connects to the tree diagram you got from running `plot`. (You do not need to write down an answer.)

```{r}
tree.carseats
```

##### (g) What does the following line from the output mean? What does the `*` at the end mean? How many `High = yes` observations are there at this node?

```
18) Population < 207.5 16  21.170 Yes ( 0.37500 0.62500 ) *
```

- The `*` indicates that this node is a leaf (terminal) node. There are 16 observations in this node, and a proportion 0.625 of them are `High = yes`, so `0.62500 * 16 = 10` of them have `High = yes`.

##### Here is some code for splitting the data into training and testing, and for fitting a new `tree.carseats` model to just the training data.

```{r}
set.seed(2)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train, ]
High.test <- High[-train]
tree.carseats <- tree(High ~ . - Sales, Carseats, subset = train)
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
```

##### (h) Use the `cv.tree` command to carry out 10-Fold Cross-Validation pruning on `tree.carseats`.
You'll need to supply `FUN = prune.misclass` as an argument to ensure that the error metric is taken to be the number of misclassifications instead of the deviance.

##### How many leaf nodes does the lowest CV error tree have? How many misclassifications does it make? What is its misclassification rate?

```{r}
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
names(cv.carseats)
cv.carseats

# Index of the tree with minimum CV error
min.idx <- which.min(cv.carseats$dev)
min.idx

# Number of leaves in that tree
cv.carseats$size[min.idx]

# Number of misclassifications (this is a count)
cv.carseats$dev[min.idx]

# Misclassification rate
cv.carseats$dev[min.idx] / length(train)
```

##### The following code produces CV error plots indexed by tree size and k (essentially what we called alpha in class).

```{r}
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")
```

##### (i) Use the `prune.misclass` command to prune `tree.carseats` down to the subtree that has the lowest CV error. Plot this tree, and overlay text. Compare the variables that get used in this tree compared to those that get used in the unpruned tree. Does the pruned tree wind up using fewer variables?

```{r}
par(mfrow = c(1, 1))
prune.carseats <- prune.misclass(tree.carseats, best = 9)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

# Variables used in the pruned tree
summary(prune.carseats)$used

# Variables that are used in one model but not the other
c(setdiff(summary(prune.carseats)$used, summary(tree.carseats)$used),
  setdiff(summary(tree.carseats)$used, summary(prune.carseats)$used))
```

- As we can see, there are two variables that appear in the larger tree but not in the pruned tree.

##### (j) The code below produces a Test set confusion matrix for the pruned tree. What is the Test misclassification rate of this tree? How does this compare to the error you got in part (e)? Is there a big difference? Explain.
```{r}
tree.pred <- predict(prune.carseats, Carseats.test, type = "class")
confusion.pred <- table(tree.pred, High.test)
confusion.pred

# Misclassification rate
1 - sum(diag(confusion.pred)) / sum(confusion.pred)
```

- The misclassification rate on the test set is over 2x larger than the full-data misclassification rate we saw from the full tree. There are two factors at play here. First, the model that achieved the 0.09 error rate was trained on the full data set, which has twice as many observations as the 200-observation training set used here. Second, unpruned trees can greatly overfit the data.

##### (k) The following code produces a pruned decision tree using the `rpart` command instead of `tree`. This results in a fit that can be converted into a `party` object, and plotted in a more aesthetically pleasing way. The code below illustrates how to perform this conversion and how to get a tree out of it. There are some accompanying questions below the code.

```{r}
# Fit a decision tree using rpart
# Note: when you fit a tree using rpart, the fitting routine automatically
# performs 10-fold CV and stores the errors for later use
# (such as for pruning the tree)
carseats.rpart <- rpart(High ~ . - Sales, Carseats, method = "class", subset = train)

# Plot the CV error curve for the tree
plotcp(carseats.rpart)

# Identify the value of the complexity parameter that produces
# the lowest CV error
cp.min <- carseats.rpart$cptable[which.min(carseats.rpart$cptable[, "xerror"]), "CP"]

# Prune using the CV error minimizing choice of the complexity parameter cp
carseats.rpart.pruned <- prune(carseats.rpart, cp = cp.min)

# Convert the pruned tree to a party object
carseats.party <- as.party(carseats.rpart.pruned)

# Plot
plot(carseats.party)
```

a. **How many terminal nodes are in the tree?**
    - 5
b. **Which node corresponds to stores with the greatest likelihood of High sales of the product?**
    - Node 9. The height of the dark (high sales = "Yes") bar is highest here.
It looks like roughly 85% of stores in the training set falling in this node have high sales of the product.

c. **How does the tree characterize the stores in this node?**
    - These are the stores where the product is placed in a "Good" shelf location and the price of the product is set below $142.50.

### 3. Regression trees

> These questions prompt you to follow along with §8.3.2 in ISL. We'll once again be working with the `Boston` data set.

##### The code below splits the data into training and testing, and then fits a regression tree to the training data.

```{r}
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
plot(tree.boston)
text(tree.boston, pretty = 0)
```

##### (a) What is the first variable that is split on? Where is the split for this variable? Next, look at the terminal nodes (leaves) in the tree. What do the numbers appearing in the leaves of the tree mean?

- `lstat` (percentage of the population that is lower socioeconomic status) is the first variable that is split on. The split occurs at `lstat = 9.715`.
- The numbers refer to the average `medv` (median home value) among the training observations in that leaf node.

##### (b) The following code repeats the process of randomly splitting the data and fitting a tree for 10 iterations. It produces a list of all of the variables that get used in the tree fit, in the order in which they appear. Does this list change a lot from one random split to the next? Do all of the trees wind up using the same variables? Is the first split always on the same variable?

```{r}
set.seed(2)
for (i in 1:10) {
  train.b <- sample(1:nrow(Boston), nrow(Boston)/2)
  tree.boston.b <- tree(medv ~ ., Boston, subset = train.b)
  print(as.character(summary(tree.boston.b)$used))
}
```

- There's a lot of variability in both which variables get used and how many variables get used. The first split is always on either `lstat` or `rm`.
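The root split can also be read off programmatically: in a fitted `tree` object, the first row of the `$frame` component is the root node, and its `var` column holds the split variable. Here is a short sketch tabulating the root-split variable over the same ten random splits (it assumes `Boston` and `tree()` are loaded as above; `root.vars` and `fit.b` are just illustrative names):

```{r}
set.seed(2)
root.vars <- character(10)
for (i in 1:10) {
  train.b <- sample(1:nrow(Boston), nrow(Boston)/2)
  fit.b <- tree(medv ~ ., Boston, subset = train.b)
  # The first row of the frame is the root node; its 'var' entry is the split variable
  root.vars[i] <- as.character(fit.b$frame$var[1])
}
table(root.vars)
```

Because the seed matches the loop above, the tabulated root splits correspond to the same ten fits.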
##### The code below performs cross-validation to prune the tree. Make sure that you understand what each line of code is doing. A written reply is not required for this section.

Note: The y-axis on the CV error plot is `dev`, which stands for "deviance". For regression (prediction) problems the deviance is the sum of squared errors, so up to a factor of n it is the same as MSE.

```{r}
# Run CV to find the best level at which to prune
cv.boston <- cv.tree(tree.boston)

# Construct a plot (dev = squared-error deviance on the y-axis)
plot(cv.boston$size, cv.boston$dev, type = 'b')

# Prune the tree, display the pruned tree
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)

# Get test-set predictions from the (unpruned) tree
yhat.tree <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]

# Construct a plot of predicted values (x-axis) vs observed values (y-axis)
plot(yhat.tree, boston.test)
# Add a diagonal line
abline(0, 1)

# Calculate test set MSE
mean((yhat.tree - boston.test)^2)
```
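Since the chunk above reports the test MSE of the unpruned tree only, the same calculation can be repeated for the pruned tree for comparison. A minimal sketch, assuming `prune.boston`, `train`, and `Boston` exist as created above (`yhat.pruned` is just an illustrative name):

```{r}
# Test-set MSE of the pruned (5-leaf) tree, for comparison with the
# unpruned tree's test MSE computed above
yhat.pruned <- predict(prune.boston, newdata = Boston[-train, ])
mean((yhat.pruned - Boston[-train, "medv"])^2)
```

Comparing the two MSE values shows how much (if any) test-set accuracy is given up by pruning down to 5 leaves.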