--- title: "Homework 4: Classification" author: "Your Name Here" date: 'Assigned: February 10, 2018' output: html_document: toc: true toc_depth: 3 theme: cerulean highlight: tango --- ##### This homework is due by **2:50PM Thursday, February 15**. ### Preamble: Loading packages and data ```{r, message=FALSE} library(ggplot2) library(ISLR) library(MASS) library(klaR) # You may need to install this one library(knitr) library(glmnet) library(plyr) library(gam) set.seed(14504008) cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") options(scipen = 4) # Adulthood data n.obs <- 3000 age <- pmin(pmax(rnorm(n.obs, mean = 30, sd = 10), 5), 50) is.adult <- as.numeric(age >= 18) age.data <- data.frame(age = age, is.adult = as.factor(is.adult)) ``` ```{r, cache = TRUE} # Spam data spamdf <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamdata.txt") varinfo <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/spamids2.txt", sep=" ", stringsAsFactors = FALSE) is.test <- read.table("http://www.andrew.cmu.edu/user/achoulde/95791/data/traintest_indicator.txt", header=FALSE) is.test <- as.integer(is.test[,1]) ``` ```{r} # Partition the spam data # log-transform the x variable spamdf.log <- spamdf spamdf.log[, 1:(ncol(spamdf) - 1)] <- log(0.1 + spamdf[, 1:(ncol(spamdf) - 1)]) # Add names colnames(spamdf.log) <- c(varinfo[,1], "is.spam") spam.train <- subset(spamdf.log, subset = is.test == 0) spam.test <- subset(spamdf.log, subset = is.test == 1) ``` ### Problem 1 [6 points]: Instability of Logistic regression > This question walks you through a simple example that illustrates the instability of logistic regression coefficient estimates in cases where the classes are clearly separable. > This instability can arise in practice when we have inputs $X$ that are categorical variables with a large number of levels. In such cases, particularly when we have low cell counts, it is not uncommon for all observed outcomes in a particular category to be either all $0$ or all $1$. This leads the coefficient corresponding to that category to be very unstable. ##### **(a)** The `age.data` data frame contains information on `r nrow(age.data)` individuals. We want to use the `age` variable to try to classify individuals as adults or non-adults. The outcome variable `is.adult` is 1 for adults and 0 for non-adults. ##### Following the `geom_histogram(position = "fill")` examples (at this link)[http://docs.ggplot2.org/0.9.3.1/geom_histogram.html], construct a conditional probability plot to show how the probability of being an adult varies with age. ```{r} # Edit me ``` - **Your answer here.** ##### **(b)** Is this a difficult classification problem? Can you think of a simple rule that gives 100\% classification accuracy for this task? Display a confusion matrix to verify that your rule works. ```{r} # Edit me ``` - **Your answer here.** ##### **(c)** Fit a logistic regression to the data. Use the `kable()` command to print out a nice summary of your coefficients estimate table. Is the coefficient of `age` statistically significant? ```{r} # Edit me ``` - **Your answer here.** ##### **(d)** Using a probability cutoff of 0.5, produce a confusion matrix for your logistic regression output. Calculate the mislcassification rate of the logistic regression classifier. Does the logistic regression classifier do a good job of classifying individuals as adult vs non-adult? 
```{r}
# Edit me
```

- **Your answer here.**

##### **(e)** Construct a histogram of the estimated probabilities from your logistic regression. Describe what you see.

```{r}
# Edit me
```

- **Your answer here.**

##### **(f)** When we have a single predictor in logistic regression, we saw that the probability at $X = x$ is given by
$$ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $$
Let's suppose we've ensured that $\beta_0 = 0$. Then $p(x)$ becomes
$$ p(x) = \frac{1}{1 + e^{-\beta_1 x}} $$

##### Suppose that we're in an easy classification problem where the rule $x > 0$ is a perfect classifier. What does $\beta_1$ have to be to ensure that $p(x) = 1$ when $x > 0$ and $p(x) = 0$ when $x < 0$?

- **Your answer here.**

### Problem 2 [7 points]: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes

> This problem introduces you to the `klaR` library, which provides a set of useful model fitting and visualization tools. You will also use some fitting functions from the `MASS` library.

> You may find the tutorial at [this link](http://www.statmethods.net/advstats/discriminant.html) helpful for solving this problem.

> We're going to use Fisher's famous `iris` data set for this problem. This data set comes pre-loaded with R. You can learn more about it by looking at the help file `?iris`. It's fair to say that everyone who has ever learned Data Mining in R has encountered the iris data at one point or another.

##### **(a)** Use the `lda` function from the `MASS` library to build an LDA classifier predicting `Species` from the 4 measurements in the `iris` data. Call this fit `iris.lda`.

```{r}
# Edit me
```

##### Explore the `iris.lda` object to answer the following: What are the group means and prior probabilities for each class?

```{r}
# Edit me
```

##### Run the `plot()` command on your `iris.lda` object. This produces what is called a discriminant plot. When we have $K$ possible classes, we get $K-1$ so-called linear discriminants. You should think of these as "derived features" that provide a helpful low-dimensional representation of the data. The more spread out the classes appear in these discriminant plots, the better the LDA method performs (and vice versa). You may supply the argument `col = as.numeric(iris$Species)` to colour the points based on the true class label.

```{r}
# Edit me
```

##### **(b)** Using the `predict()` function, calculate the 3x3 confusion matrix for the LDA classifier. What is the overall misclassification rate of the LDA classifier? Does LDA perform well on this problem?

```{r}
# Edit me
```

- **Your answer here.**

##### Again using the `predict()` function: What are the estimated posterior class probabilities for the 120th observation? You should run `zapsmall()` on the vector of posterior probabilities to get it to display nicely.

```{r}
# Edit me
```

##### **(c)** Use the `partimat()` function from the `klaR` package with `method = "lda"` to get bivariate plots showing the LDA decision boundaries. Misclassifications are indicated by red text.

```{r, cache = TRUE, fig.width = 10, fig.height = 6}
# Edit me
```

##### Two of the classes begin with the letter v, which makes the above plot hard to interpret. The following code produces a new data frame, where the Species column has been transformed according to: `S = setosa`, `C = versicolor`, `N = virginica`. Try constructing the plot again. Do all 2-variable combinations of the inputs do an equally good job of separating the three classes?
```{r, cache = TRUE, fig.width = 10, fig.height = 6}
iris2 <- transform(iris, Species = mapvalues(Species, c("setosa", "versicolor", "virginica"), c("S", "C", "N")))
# Edit me
```

- **Your answer here.**

##### **(d)** Using the `iris2` data frame, run the `partimat` command again, this time with `method = "qda"`. Does it look like allowing quadratic boundaries changes much?

```{r, cache = TRUE, fig.width = 10, fig.height = 6}
# Edit me
```

- **Your answer here.**

##### **(e)** Using the `geom = "density"` or `geom_density()` functionality in `ggplot2`, construct density plots for the 4 input variables. Your density plots should look similar to the ones shown for `income` and `balance` in Lecture 8. There are 3 classes in the iris data instead of two, so your plots should have 3 densities overlaid. The `alpha` argument may be used to change the transparency.

##### Based on these plots, does it look like Naive Bayes will be an effective classifier for the iris data? Explain.

```{r}
# Edit me
```

- **Your answer here.**

##### **(f)** Use the `NaiveBayes()` command with `usekernel = TRUE` to fit a Naive Bayes classifier to the `iris` data. Save your output as `iris.nb`. Produce a confusion matrix for the Naive Bayes classifier. What is the misclassification rate of Naive Bayes on this problem? How does the performance of Naive Bayes compare to that of LDA in this example?

```{r}
# Edit me
```

- **Your answer here.**

##### **(g)** What is the true class of the 120th observation? What are the estimated posterior probabilities for the 120th observation according to Naive Bayes? Are they similar to those estimated by LDA? Do LDA and Naive Bayes result in the same classification for this observation? Does either method classify this observation correctly?

```{r}
# Edit me
```

- **Your answer here.**

### Problem 3 [7 points]: Additive Logistic Regression with spam data

> In the preamble to this document I pre-processed spam data for you to use for this problem. You have two data sets: `spam.train` and `spam.test`. The outcome is `is.spam`, which is 1 if the given email is spam and 0 if it isn't. You will use `spam.train` to build a spam classifier, which you will then test on the data in `spam.test`.

> For more information on the meaning of these variables, you may refer to the variable information file here: [https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names)

> The input variables have extremely highly skewed distributions, so I applied the transformation `log(0.1 + x)` to every input variable. Thus the inputs are now approximately log-frequencies or log-counts instead of the raw frequencies/counts.

> To answer the questions in part (a), you will find it helpful to know that this data was made publicly available by George Forman from Hewlett-Packard Laboratories in Palo Alto, California. These are emails from his HP Labs work account.

##### [2 points] **(a)** Fit a logistic regression model to the training data. Name your logistic regression object `spam.logit`. Remember that the formula specification `y ~ .` will regress `y` on every other variable in the specified data set.

##### Use the `kable()` command to produce a nice summary of the logistic regression model. Make sure you set the `digits = ` argument appropriately so that the number of decimal places is reasonable for each column.
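##### As an optional, non-authoritative sketch of this fit-and-summarize step (shown with `eval = FALSE`; your actual answer goes in the answer chunk further below), the code might look like:

```{r, eval = FALSE}
# Hedged sketch only: fit on the training data and tabulate the coefficients.
spam.logit <- glm(is.spam ~ ., data = spam.train, family = binomial)
kable(summary(spam.logit)$coefficients, digits = 3)
```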
##### Is increased frequency of the `!` and `$` characters associated with higher or lower likelihood that the email is spam?

##### There are several terms that are associated with decreased likelihood that the email is spam. Knowing what you do about the source of the data, pick out 3 terms with a negative coefficient where you can explain why the terms should have negative coefficients.

```{r}
# Edit me
```

- **Your answer here.**

##### **(b)** Using `ggplot2` graphics, construct a single plot showing histograms of the estimated spam probabilities. Specify `fill = as.factor(spam.logit$y)` so that your plot colours the observations with `is.spam = 1` differently from those with `is.spam = 0`. Does logistic regression appear to do a good job of separating spam from not spam?

```{r}
# Edit me
```

- **Your answer here.**

##### **(c)** What is the prevalence of spam in the training data? What is the prevalence of spam in the test data? Using a probability cutoff of 0.5, construct a confusion matrix for both the training data and the test data. You will need to use the `predict()` function appropriately to get probability estimates for your test data. Look into the `type` argument of `predict.glm()` to see how to do this.

##### Calculate the misclassification rate for both the test data and the training data. Is this a good misclassification rate, relative to the prevalence of spam in the data?

```{r}
# Edit me
```

- **Your answer here.**

##### **(d)** The code below constructs an additive formula for fitting an additive model with degree of freedom 4 smoothing splines for every input. Supply this formula to the `gam` command to fit a logistic additive model to the training data. (Be sure that you are fitting a LOGISTIC additive model.) Call this fit `spam.gam`.

##### Use the `plot()` command to display the fitted splines for each term. You should colour the lines 'ForestGreen'. You should also use the `par(mfrow = ...)` command to set up a grid with 15 rows and 4 columns for the purpose of plotting. Does it look like many of the fits are highly non-linear?

```{r, fig.height = 40, fig.width = 8}
spam.formula <- formula(paste("is.spam ~ ",
                              paste("s(", varinfo[,1], ", 4)", sep = "", collapse = " + ")))
# Edit me
```

- **Your answer here.**

##### **(e)** Using a probability cutoff of 0.5, construct a confusion matrix to show how the logistic additive model performs on the test data. Calculate the misclassification rate. Compare this to the test misclassification rate of the standard logistic regression model.

```{r}
# Edit me
```

- **Your answer here.**

##### **(f)** We all hate seeing spam in our inbox, but what we hate even more is when real emails wind up in the spam folder. Suppose that each time we see a spam email in our inbox, we become 1 happiness unit more sad. Each time a real email winds up in the spam folder, we become 5 happiness units more sad. Our happiness is unaffected by correct classifications.

##### How much happiness do we lose using the additive model? How much happiness do we lose using the logistic model? Report your answers as happiness loss per email received. (Use the test data for your calculations.)

```{r}
# Edit me
```

- **Your answer here.**
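##### As a non-authoritative starting point for part (f) (not the required answer), one way the calculation might be sketched is below. It assumes `spam.logit` and `spam.gam` have been fit as in parts (a) and (d); the helper function `happiness.loss` and the intermediate object names are illustrative only, and the chunk is set to `eval = FALSE`.

```{r, eval = FALSE}
# Hedged sketch only -- object names other than spam.logit and spam.gam are illustrative.
logit.prob <- predict(spam.logit, newdata = spam.test, type = "response")
gam.prob   <- predict(spam.gam,   newdata = spam.test, type = "response")
logit.pred <- as.numeric(logit.prob > 0.5)
gam.pred   <- as.numeric(gam.prob > 0.5)
actual     <- spam.test$is.spam

# Loss: 1 unit per spam email that reaches the inbox (predicted 0, actual 1),
#       5 units per real email sent to the spam folder (predicted 1, actual 0),
#       reported per email received.
happiness.loss <- function(pred, actual) {
  (1 * sum(pred == 0 & actual == 1) + 5 * sum(pred == 1 & actual == 0)) / length(actual)
}
happiness.loss(logit.pred, actual)  # loss per email, logistic regression
happiness.loss(gam.pred, actual)    # loss per email, additive model
```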