library(ggplot2)
library(knitr)

# Generate random 5 x 2 matrix
set.seed(12345)
matrix(rpois(10, 100), nrow = 5)
##      [,1] [,2]
## [1,]  105  105
## [2,]  107   97
## [3,]   98  116
## [4,]   95   95
## [5,]  123  103

Linear regression for prediction

Many of the problems you’ll come across in the future are classification problems (and if they’re not, you can typically turn them into classification problems). But there are also notable cases where we want a good prediction of a quantitative outcome: stock prices, home values, crop yields.

Stock prices are really hard to predict (if I could do this with any accuracy, I’d be wealthy and retired by now). Instead, we’ll look at a problem with lower stakes: predicting students’ final course grades.

Case study: Predicting students’ math grades

set.seed(12345)
grades <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/student-mat.csv", sep = ";", header = TRUE)
grades <- grades[,-c(31, 32)]  # Remove mid-year grades (columns 31 and 32)

# Set aside 95 of the students for a test set
test.indexes <- sample(1:nrow(grades), 95)
train.indexes <- setdiff(1:nrow(grades), test.indexes)

grades.train <- grades[train.indexes, ]
grades.test <- grades[test.indexes, ]
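
As a quick sanity check (our addition; not shown in the original notes), we can confirm that the two pieces partition the full data set:

# The test set should have 95 rows; train and test together
# should account for every student exactly once
nrow(grades.test)
nrow(grades.train) + nrow(grades.test) == nrow(grades)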

Our outcome of interest is the student’s final math grade (a score between 0 and 20). This is represented by column G3 in the data.

All of the other variables in the dataset correspond to information that’s available on the first day of class, and have no direct connection to math ability. We’re essentially looking to see if we can predict final math grades based on the students’ age, gender, family situation, and some reported aspects of recreational time use. Tall order, but let’s see how we do.
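
Before fitting anything, it’s worth a quick look at the outcome (our addition; the original notes skip straight to the model):

# Distribution of final grades in the training set; values
# should lie in the stated 0-20 range
summary(grades.train$G3)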

Step 1: Fit a linear regression model

Once again, we’re taking the viewpoint that what goes into the regression doesn’t really matter. We’re not even going to look at the coefficients. We just want predictions.

grades.lm <- lm(G3 ~ ., data = grades.train)
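
The . on the right-hand side of the formula tells lm() to use every column in grades.train other than G3 as a predictor. To get a sense of how big the model is (our addition; not shown in the original notes), we can count the estimated coefficients:

# One intercept, plus one coefficient per numeric predictor and one
# per non-baseline level of each factor predictor
length(coef(grades.lm))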

OK, we fit a model. How well did we do?

Step 2: Assess the model on the training data

First, what does it mean to do well? If we didn’t have any information on the students whatsoever, but we knew what the average grade in the class tends to be, our “best guess” of everyone’s grade would be that average (or the median, if you prefer). The average course grade winds up being:

mean.grade <- mean(grades.train$G3)
mean.grade
## [1] 10.39333

How well does guessing the average work? We can calculate the mean squared error, root mean squared error, and mean absolute error of this prediction.

\[ MSE = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat y_i)^2 \]

\[ RMSE = \sqrt{MSE} \]

\[ MAE = \frac{1}{n} \sum_{i = 1}^n |y_i - \hat y_i| \]

We’ll use vectorized operations to compute all these quantities (instead of loops).

# MSE (mean squared error)
mean((grades.train$G3 - mean.grade)^2)
## [1] 20.74529
# RMSE (root mean squared error)
sqrt(mean((grades.train$G3 - mean.grade)^2))
## [1] 4.5547
# MAE (mean absolute error)
mean(abs(grades.train$G3 - mean.grade))
## [1] 3.384489
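
For contrast, here’s what the MSE computation would look like with an explicit loop (a quick sketch for illustration; the vectorized versions above are both faster and easier to read):

# Loop version of the MSE calculation, for comparison
n <- nrow(grades.train)
sse <- 0
for (i in 1:n) {
  sse <- sse + (grades.train$G3[i] - mean.grade)^2
}
sse / n  # should match the vectorized MSE above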

We’ll be calculating these quantities a lot, so let’s wrap up the entire calculation into a function.

# Function that calculates MSE, RMSE and MAE of a predictor
# Inputs:
#  fitted - fitted or predicted values
#  true - vector of observed outcomes
calcErrorMetrics <- function(fitted, true) {
  mse <- mean((fitted - true)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(fitted - true))
  return(c(mse = mse, rmse = rmse, mae = mae))
}

metrics.ave <- calcErrorMetrics(mean.grade, grades.train$G3)

metrics.ave
##       mse      rmse       mae 
## 20.745289  4.554700  3.384489

Now that we have a benchmark to compare to, let’s see how well our linear regression does.

metrics.lm <- calcErrorMetrics(grades.lm$fitted, grades.train$G3)

metrics.lm
##       mse      rmse       mae 
## 14.843056  3.852669  2.959920

How did we do?

metrics.lm / metrics.ave
##       mse      rmse       mae 
## 0.7154904 0.8458667 0.8745545

We improved somewhat, but not by much. Keep in mind, too, that these values are based on the training data. Let’s see how we do on the test data.

Step 3: Assess the model on the test data

Let’s repeat all of the above calculations, this time on the test data.

# Note: we're using the average grade calculated on the *training data*
metrics.ave.test <- calcErrorMetrics(mean.grade, grades.test$G3)

metrics.ave.test
##       mse      rmse       mae 
## 21.542220  4.641360  3.574807

And now with our regression model:

grades.predicted <- predict(grades.lm, grades.test)

metrics.lm.test <- calcErrorMetrics(grades.predicted, grades.test$G3)
metrics.lm.test
##       mse      rmse       mae 
## 18.956138  4.353865  3.434196

How did we do?

metrics.lm.test / metrics.ave.test
##       mse      rmse       mae 
## 0.8799528 0.9380580 0.9606661

Not well at all. If we look at MAE, we barely made a dent!

Indeed, let’s take a look at a scatterplot of the students’ actual score vs their predicted score.

qplot(x = grades.test$G3, y = grades.predicted)
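
One refinement worth making (our addition; the original plot has no reference line) is to overlay the 45-degree line, where points would fall if every prediction were exactly right:

qplot(x = grades.test$G3, y = grades.predicted) +
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  xlab("Actual final grade (G3)") +
  ylab("Predicted final grade")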

Our predictions don’t seem very close to the real outcomes. This happens. A lot. Not everything is easily predictable.

Why validation is important

Our performance on the grades data isn’t great. Evaluating our model on the test set allowed us to see this. If we don’t validate our model in this way, we can easily run into trouble. Here’s an example.

We’ll generate 200 variables that have nothing to do with the students’ math scores. These variables will just be random numbers. We’ll then regress math scores on these variables and look at our performance on the training data.

random.data <- data.frame(G3 = grades.train$G3, matrix(rnorm(200 * nrow(grades.train)), nrow = nrow(grades.train)))
random.lm <- lm(G3 ~ ., data = random.data)

metrics.random <- calcErrorMetrics(random.lm$fitted, grades.train$G3)

round(rbind(metrics.ave, metrics.lm, metrics.random), 2)
##                  mse rmse  mae
## metrics.ave    20.75 4.55 3.38
## metrics.lm     14.84 3.85 2.96
## metrics.random  6.17 2.48 1.91
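
Why does the random model fit the training data so well? With an intercept plus one slope per noise variable, it has a very large number of free parameters relative to the number of training observations, which gives it plenty of room to chase noise. A quick check (our addition; not shown in the original notes):

length(coef(random.lm))  # intercept + 200 slopes = 201 parameters
nrow(grades.train)       # number of training observations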

Our training-set prediction error is much lower using a bunch of completely random predictors that have nothing to do with end-of-term grades than it is when we regress on data actually collected from the students. Let’s look at how we do on the test data.

# Generate another batch of random numbers for the test data
random.test.data <- data.frame(matrix(rnorm(200 * nrow(grades.test)), nrow = nrow(grades.test)))
metrics.random.test <- calcErrorMetrics(predict(random.lm, random.test.data), grades.test$G3)

# prediction error on test data
round(rbind(metrics.ave.test, metrics.lm.test, metrics.random.test), 2)
##                       mse rmse  mae
## metrics.ave.test    21.54 4.64 3.57
## metrics.lm.test     18.96 4.35 3.43
## metrics.random.test 61.91 7.87 6.41

On the test data, we see that our random variable model does horribly. Its average prediction error is nearly twice as high as just predicting with the mean.

Takeaway: When you have a large number of covariates, it’s easy to overfit the data. When you overfit the training data, you get a model that describes the training data really well but doesn’t give good predictions on unseen data.

Overfitting

[Figure: overfitting illustration (source: http://pingax.com/regularization-implementation-r/)]
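
Since the figure isn’t reproduced here, here’s a minimal sketch of the same phenomenon using our grades data: as we feed the model more and more pure-noise predictors, training RMSE keeps falling while test RMSE gets worse. (This code is our addition; the names n.predictors and rmse.by.p are ours, and we assume grades.train and grades.test as constructed above.)

set.seed(12345)
n.predictors <- c(10, 50, 100, 150, 200)
rmse.by.p <- sapply(n.predictors, function(p) {
  # Regress G3 on p pure-noise predictors
  train.dat <- data.frame(G3 = grades.train$G3,
                          matrix(rnorm(p * nrow(grades.train)),
                                 nrow = nrow(grades.train)))
  test.dat <- data.frame(matrix(rnorm(p * nrow(grades.test)),
                                nrow = nrow(grades.test)))
  fit <- lm(G3 ~ ., data = train.dat)
  # Training RMSE vs. test RMSE at this model size
  c(train = sqrt(mean((fit$fitted - grades.train$G3)^2)),
    test = sqrt(mean((predict(fit, test.dat) - grades.test$G3)^2)))
})
colnames(rmse.by.p) <- n.predictors
round(rmse.by.p, 2)

The training row should shrink as p grows while the test row gets worse; that widening gap is the signature of overfitting.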
Where do you go from here?

In this class we’ve made considerable use of the ggplot2 and plyr packages. Here’s a list of packages that we haven’t seen in this class, but which you will likely find useful in your analytics career.


Data import/export

foreign - Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, …

xlsx - Read/write Excel data

RSQLite - SQLite Interface for R

RMySQL - MySQL Interface for R


Data summarization and manipulation

dplyr - Tools for efficiently subsetting, summarizing, rearranging, and joining together data sets (see the short example below)

tidyr - Tools for reshaping your data into “tidy” formatting
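
As a small taste of dplyr (a sketch using our grades data; the verbs filter, group_by, and summarize are dplyr’s, while the columns school and internet come from the student dataset):

library(dplyr)

# Average final grade by school, among students reporting
# home internet access
grades %>%
  filter(internet == "yes") %>%
  group_by(school) %>%
  summarize(mean.G3 = mean(G3))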


Interfacing R with other languages

Rcpp - Call C++ functions from R.

rPython - Call Python functions from R.


Visualization

shiny - A web application framework for R

ggvis - Interactive web-based graphics

htmlwidgets - “Bring the best of JavaScript data visualization to R”