``````library(ggplot2) # graphics library
library(ISLR)    # contains code and data from the textbook
library(knitr)   # contains kable() function
library(tree)    # For the tree-fitting 'tree' function
library(rpart)   # For nicer tree fitting
library(partykit)  # For nicer tree plotting``````
``## Loading required package: grid``
``````library(MASS)    # For Boston data

options(scipen = 4)  # Suppresses scientific notation``````

### 1. Changing the author field and file name.

##### (b) Rename this file to “lab01_YourNameHere.Rmd”, where YourNameHere is changed to your own name.

The next portion of the lab gets you to carry out the Lab exercises in §8.3.1 - 8.3.3 of ISLR (Pages 323 - 330). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.

### 2. Classification trees

You will need the `Carseats` data set from the `ISLR` library in order to complete this exercise.

Please run all of the code indicated in §8.3.1 of ISLR, even if I don’t explicitly ask you to do so in this document.

##### (a) Run the `View()` command on the `Carseats` data to see what the data set looks like.
``#View(Carseats)``
##### The following code constructs the `High` variable for the purpose of classification. Our goal will be to classify whether carseat sales in a store are high or not.
``````High <- with(Carseats, ifelse(Sales <= 8, "No", "Yes"))
Carseats <- data.frame(Carseats, High)``````
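One caveat worth noting: `ifelse()` returns a character vector, while recent versions of the `tree` package require the response to be a factor for classification trees. If `tree()` below complains about the response type, an explicit conversion (sketched here) should fix it:

``````# ifelse() returns a character vector; newer versions of tree()
# require a factor response for classification trees
Carseats$High <- factor(Carseats$High, levels = c("No", "Yes"))``````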
##### (b) What proportion of stores have High carseat sales?
``prop.table(table(Carseats$High))``
``````##
##   No  Yes
## 0.59 0.41``````
• We see that 41% of locations had high carseat sales, while the remaining 59% did not.
##### (c) Use the `tree` command to fit a decision tree for `High` using every variable in the `Carseats` data other than `Sales`. Run `summary` on your tree object. Run the `plot` and `text` commands on your tree object.
``````# We don't want to use the Sales variable as an input because
# our outcome variable, High, is derived from Sales
tree.carseats <- tree::tree(High ~ . -Sales, Carseats)
summary(tree.carseats)``````
``````##
## Classification tree:
## tree::tree(formula = High ~ . - Sales, data = Carseats)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"
## Number of terminal nodes:  27
## Residual mean deviance:  0.4575 = 170.7 / 373
## Misclassification error rate: 0.09 = 36 / 400``````
``````plot(tree.carseats)
text(tree.carseats, pretty = 0)``````
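The `rpart` and `partykit` packages loaded at the top are not used in the ISLR lab itself, but they can produce a nicer rendering of a comparable tree. A sketch (note that `rpart`'s default stopping and pruning rules differ from `tree()`'s, so the fitted splits will not match exactly):

``````# Fit a comparable classification tree with rpart
rpart.carseats <- rpart(High ~ . - Sales, data = Carseats, method = "class")
# as.party() converts the rpart fit so partykit can draw its nicer plot
plot(as.party(rpart.carseats))``````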

##### (d) Does the tree make use of all of the available variables, or are there some that the tree does not use? (Look at the tree summary print-out to figure this out.)
``summary(tree.carseats)$used``
``````## [1] ShelveLoc   Price       Income      CompPrice   Population  Advertising
## [7] Age         US
## 11 Levels: <leaf> CompPrice Income Advertising Population ... US``````
``names(Carseats)[which(!(names(Carseats) %in% summary(tree.carseats)$used))]``
``## [1] "Sales"     "Education" "Urban"     "High"``
• The output above shows which variables get used and which ones do not. We see that not all variables are used: besides `Sales` and `High` (which are excluded by construction), `Education` and `Urban` never appear in a split.
##### (e) What is the misclassification rate of the tree? Is this a good misclassification rate? i.e., Is the rate low relative to the prevalence of High sales in the data?
``````# This tells us: # misclass, # total
summary(tree.carseats)$misclass``````
``## [1]  36 400``
``````misclass.rate <- summary(tree.carseats)$misclass[1] / summary(tree.carseats)$misclass[2]
misclass.rate``````
``## [1] 0.09``
• We see that the misclassification rate is just 9%, which is much lower than the prevalence of High sales in the data.
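To see why 9% is low, compare it with a trivial baseline that always predicts the majority class ("No"): that classifier errs on every High store, i.e. 41% of the time.

``````# Error rate of always predicting the majority class ("No"):
# it equals the prevalence of the minority class
baseline.rate <- 1 - max(prop.table(table(Carseats$High)))
baseline.rate  # 0.41, versus 0.09 for the tree``````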
##### (f) Study the output from the tree-printout you get when you run the command below. Try to figure out how it connects to the tree diagram you got from running `plot`. (You do not need to write down an answer)
``tree.carseats``
``````## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
##
##   1) root 400 541.500 No ( 0.59000 0.41000 )
##     2) ShelveLoc: Bad,Medium 315 390.600 No ( 0.68889 0.31111 )
##       4) Price < 92.5 46  56.530 Yes ( 0.30435 0.69565 )
##         8) Income < 57 10  12.220 No ( 0.70000 0.30000 )
##          16) CompPrice < 110.5 5   0.000 No ( 1.00000 0.00000 ) *
##          17) CompPrice > 110.5 5   6.730 Yes ( 0.40000 0.60000 ) *
##         9) Income > 57 36  35.470 Yes ( 0.19444 0.80556 )
##          18) Population < 207.5 16  21.170 Yes ( 0.37500 0.62500 ) *
##          19) Population > 207.5 20   7.941 Yes ( 0.05000 0.95000 ) *
##       5) Price > 92.5 269 299.800 No ( 0.75465 0.24535 )
##        10) Advertising < 13.5 224 213.200 No ( 0.81696 0.18304 )
##          20) CompPrice < 124.5 96  44.890 No ( 0.93750 0.06250 )
##            40) Price < 106.5 38  33.150 No ( 0.84211 0.15789 )
##              80) Population < 177 12  16.300 No ( 0.58333 0.41667 )
##               160) Income < 60.5 6   0.000 No ( 1.00000 0.00000 ) *
##               161) Income > 60.5 6   5.407 Yes ( 0.16667 0.83333 ) *
##              81) Population > 177 26   8.477 No ( 0.96154 0.03846 ) *
##            41) Price > 106.5 58   0.000 No ( 1.00000 0.00000 ) *
##          21) CompPrice > 124.5 128 150.200 No ( 0.72656 0.27344 )
##            42) Price < 122.5 51  70.680 Yes ( 0.49020 0.50980 )
##              84) ShelveLoc: Bad 11   6.702 No ( 0.90909 0.09091 ) *
##              85) ShelveLoc: Medium 40  52.930 Yes ( 0.37500 0.62500 )
##               170) Price < 109.5 16   7.481 Yes ( 0.06250 0.93750 ) *
##               171) Price > 109.5 24  32.600 No ( 0.58333 0.41667 )
##                 342) Age < 49.5 13  16.050 Yes ( 0.30769 0.69231 ) *
##                 343) Age > 49.5 11   6.702 No ( 0.90909 0.09091 ) *
##            43) Price > 122.5 77  55.540 No ( 0.88312 0.11688 )
##              86) CompPrice < 147.5 58  17.400 No ( 0.96552 0.03448 ) *
##              87) CompPrice > 147.5 19  25.010 No ( 0.63158 0.36842 )
##               174) Price < 147 12  16.300 Yes ( 0.41667 0.58333 )
##                 348) CompPrice < 152.5 7   5.742 Yes ( 0.14286 0.85714 ) *
##                 349) CompPrice > 152.5 5   5.004 No ( 0.80000 0.20000 ) *
##               175) Price > 147 7   0.000 No ( 1.00000 0.00000 ) *
##        11) Advertising > 13.5 45  61.830 Yes ( 0.44444 0.55556 )
##          22) Age < 54.5 25  25.020 Yes ( 0.20000 0.80000 )
##            44) CompPrice < 130.5 14  18.250 Yes ( 0.35714 0.64286 )
##              88) Income < 100 9  12.370 No ( 0.55556 0.44444 ) *
##              89) Income > 100 5   0.000 Yes ( 0.00000 1.00000 ) *
##            45) CompPrice > 130.5 11   0.000 Yes ( 0.00000 1.00000 ) *
##          23) Age > 54.5 20  22.490 No ( 0.75000 0.25000 )
##            46) CompPrice < 122.5 10   0.000 No ( 1.00000 0.00000 ) *
##            47) CompPrice > 122.5 10  13.860 No ( 0.50000 0.50000 )
##              94) Price < 125 5   0.000 Yes ( 0.00000 1.00000 ) *
##              95) Price > 125 5   0.000 No ( 1.00000 0.00000 ) *
##     3) ShelveLoc: Good 85  90.330 Yes ( 0.22353 0.77647 )
##       6) Price < 135 68  49.260 Yes ( 0.11765 0.88235 )
##        12) US: No 17  22.070 Yes ( 0.35294 0.64706 )
##          24) Price < 109 8   0.000 Yes ( 0.00000 1.00000 ) *
##          25) Price > 109 9  11.460 No ( 0.66667 0.33333 ) *
##        13) US: Yes 51  16.880 Yes ( 0.03922 0.96078 ) *
##       7) Price > 135 17  22.070 No ( 0.64706 0.35294 )
##        14) Income < 46 6   0.000 No ( 1.00000 0.00000 ) *
##        15) Income > 46 11  15.160 Yes ( 0.45455 0.54545 ) *``````
##### (g) What does the following line from the output mean? What does the `*` at the end mean? How many `High = yes` observations are there at this node?
``18) Population < 207.5 16  21.170 Yes ( 0.37500 0.62500 ) *``
• The `*` indicates that this node is a terminal (leaf) node. There are 16 observations at this node; since the fitted probability of `Yes` is 0.625, `0.62500 * 16 = 10` of them have `High = Yes`.
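The count can be read straight off the printout fields: `n = 16` observations at the node, with fitted `Yes` probability 0.625.

``````# node 18: n = 16, P(High = Yes) = 0.625
16 * 0.625  # 10 observations with High = Yes``````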
##### Here is some code for splitting the data into training and testing, and for fitting a new `tree.carseats` model to just the training data.
``````set.seed(2)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train,]
High.test <- High[-train]
tree.carseats <- tree(High ~ . - Sales, Carseats, subset = train)
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)``````
``````##          High.test
## tree.pred No Yes
##       No  86  27
##       Yes 30  57``````
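From this confusion matrix we can compute the test-set misclassification rate (the off-diagonal entries are the errors). Note how much higher it is than the 9% training error above:

``````# Off-diagonal entries of the confusion matrix are misclassifications
conf.mat <- table(tree.pred, High.test)
1 - sum(diag(conf.mat)) / sum(conf.mat)  # (27 + 30) / 200 = 0.285``````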
##### How many leaf nodes does the lowest CV error tree have? How many misclassifications does it make? What is its misclassification rate?
``````cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
names(cv.carseats)``````
``## [1] "size"   "dev"    "k"      "method"``
``cv.carseats``
``````## $size
## [1] 19 17 14 13  9  7  3  2  1
##
## $dev
## [1] 53 53 50 50 48 51 67 66 80
##
## $k
## [1]       -Inf  0.0000000  0.6666667  1.0000000  1.7500000  2.0000000
## [7]  4.2500000  5.0000000 23.0000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune"         "tree.sequence"``````
``````# Index of tree with minimum error
min.idx <- which.min(cv.carseats$dev)
min.idx``````
``## [1] 5``
``````# Number of leaves in that tree
cv.carseats$size[min.idx]``````
``## [1] 9``
``````# Number of misclassifications (this is a count)
cv.carseats$dev[min.idx]``````
``## [1] 48``
``````# Misclassification rate
cv.carseats$dev[min.idx] / length(train)``````
``## [1] 0.24``
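Having identified the best size by cross-validation, the natural next step (as in the ISLR lab) is to prune the training-set tree to 9 leaves and re-evaluate it on the test set. A sketch; the resulting test error will depend on the random seed:

``````# Prune the training-set tree to the CV-selected size
prune.carseats <- prune.misclass(tree.carseats, best = 9)
plot(prune.carseats)
text(prune.carseats, pretty = 0)
# Confusion matrix on the held-out test set
prune.pred <- predict(prune.carseats, Carseats.test, type = "class")
table(prune.pred, High.test)``````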
##### The following code produces CV error plots indexed by tree size and k (essentially what we called alpha in class).
``````par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")``````