---
title: "Week 1 lecture code"
author: "Alexandra Chouldechova"
date: "January 18, 2017"
output:
  html_document:
    toc: true
    toc_depth: 4
---

### Regression function figure

#### Data generation

```{r data_generation}
# Load packages
library(ggplot2)
library(FNN)   # Used for k-nearest-neighbours fit
library(ISLR)

## Note: if you get Error messages saying things like:
##   Error in library(ggplot2): there is no package called 'ggplot2'
## you need to run the command:
##   install.packages("ggplot2")
##
## Similarly for other packages.

# Define colour-blind-friendly colour palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Set seed for the random number generator
set.seed(4490)

# This is the true regression function
f.reg <- function(x) {
  1 + 0.1 * (x - 2.5)^2
}

# Generate data
n <- 2000  # num observations
x <- rnorm(n, mean = 4, sd = 1.75)
y <- f.reg(x) + rnorm(n, sd = 0.25)

# Combine into data frame
df <- data.frame(x = x, y = y)

x.val <- 5
y.val <- f.reg(x.val)
```

The code above produced `r n` randomly generated realizations from the underlying model specified by the `f.reg` function. More precisely, the code starts by generating `r n` x-values from the $Normal(4, \sigma^2 = 1.75^2)$ distribution, and then forms y-values according to:

$$y_i = 1 + 0.1(x_i - 2.5)^2 + \epsilon_i$$

where the errors are distributed as $\epsilon_i \sim Normal(0, \sigma^2 = 0.25^2)$.

#### Resulting plots

```{r reg_fun, fig.height = 4, fig.width = 7}
qplot(x = x, y = y, alpha = I(0.5), colour = I(cbPalette[1])) +
  stat_function(fun = f.reg, colour = cbPalette[2], lwd = 1.25) +
  geom_point(aes(x = x.val, y = y.val), size = 4, colour = cbPalette[4]) +
  geom_vline(xintercept = 5, colour = cbPalette[4], lwd = 1)
```

Here's a figure showing the true regression function, the observed data, and the linear regression (blue) and 50-nearest-neighbours (green) fits to the data.
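As a quick sanity check on the simulation (this sketch is not part of the original lecture code), the errors $y_i - f(x_i)$ should behave like draws from $Normal(0, 0.25^2)$: their sample mean should be near 0 and their sample standard deviation near 0.25. A minimal self-contained version:

```{r}
# Sketch (not in the lecture code): regenerate the data and confirm the
# noise around the true regression function has sd close to 0.25
set.seed(4490)
f.reg <- function(x) 1 + 0.1 * (x - 2.5)^2
n <- 2000
x <- rnorm(n, mean = 4, sd = 1.75)
y <- f.reg(x) + rnorm(n, sd = 0.25)

eps <- y - f.reg(x)
c(mean = mean(eps), sd = sd(eps))  # mean near 0, sd near 0.25
```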
```{r reg_fun_fits, fig.height = 4, fig.width = 7}
# 50-nearest-neighbours regression fit
knn.fit <- knn.reg(train = x, y = y, k = 50)

qplot(x = x, y = y, alpha = I(0.5), colour = I(cbPalette[1])) +
  stat_function(fun = f.reg, colour = cbPalette[2], lwd = 1.5) +
  stat_smooth(method = "lm", se = FALSE, lwd = 1) +
  geom_line(aes(x = x, y = knn.fit$pred), colour = I(cbPalette[4]), lwd = 0.75)
```

### Best linear approximation

#### Data generation

```{r}
set.seed(14316)

# True regression function (non-linear)
true.reg <- function(x) {
  (1 + 0.2 * cos(1.5 * x)) * (3 * x + 1)
}

n <- 5000     # num observations for getting 'best linear approximation'
n.sub <- 200  # num observations for getting linear regression fit
```

We generate two sets of data for this example. First, we generate `r n` points for the purpose of figuring out the "best linear predictor" $f_L(x)$. We could derive $f_L(x)$ analytically (i.e., do a bunch of math to figure out what it should be), but here we're *cheating* a little and just using a very large sample size to figure out what $f_L(x)$ should be. To get a linear regression predictor $\hat f(x)$, we generate just `r n.sub` points and treat these as our observations.
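The large-sample "cheat" amounts to running least squares on many simulated points: as the sample size grows, the fitted coefficients converge to those of the population best linear predictor $f_L$. A self-contained sketch of this idea (the sample size of 50000 here is my own choice, not from the lecture):

```{r}
# Sketch (not in the lecture code): approximate the best linear
# predictor f_L by least squares on a very large simulated sample
set.seed(14316)
true.reg <- function(x) (1 + 0.2 * cos(1.5 * x)) * (3 * x + 1)
n.big <- 50000
x.big <- rnorm(n.big, mean = 3, sd = 1)
y.big <- true.reg(x.big) + rnorm(n.big, sd = 5)

fL.hat <- coef(lm(y.big ~ x.big))
fL.hat  # approximately the intercept and slope of f_L
```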
```{r}
# Generate data for figuring out f_L
x <- rnorm(n, mean = 3, sd = 1)
y <- true.reg(x) + rnorm(n, sd = 5)

# Generate data to feed into linear regression
x.sub <- rnorm(n.sub, mean = 3, sd = 1)
y.sub <- true.reg(x.sub) + rnorm(n.sub, sd = 5)
```

#### Linear regression predictor plot

```{r}
# alpha = I(0) makes the large sample invisible: it is only used to set
# the plot axes and to fit the 'best linear predictor' line
qplot(x = x, y = y, colour = I(cbPalette[1]), alpha = I(0)) +
  geom_point(aes(x = x.sub, y = y.sub), alpha = I(0.7), colour = I(cbPalette[1])) +
  stat_function(fun = true.reg, colour = cbPalette[2], lwd = 1.25) +
  stat_smooth(method = "lm", aes(x = x.sub, y = y.sub), se = FALSE, lwd = 1) +
  stat_smooth(method = "lm", se = FALSE, colour = cbPalette[7], lty = 2, lwd = 1) +
  theme_bw()
```

Shown: the true regression function (solid orange), the best linear predictor (dashed burnt orange), and the linear regression fit based on the observed points (solid blue).

### Polynomial regression and step functions

```{r, fig.height = 4, fig.width = 7}
qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 1), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 3), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 4), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 10), lwd = 1.25) +
  theme_bw()

#####
# Step functions
#####

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm",
              formula = y ~ cut(x, breaks = c(-Inf, 65, Inf)), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm",
              formula = y ~ cut(x, breaks = c(-Inf, 35, 65, Inf)), lwd = 1.25) +
  theme_bw()

qplot(data = Wage, x = age, y = wage, xlab = "Age", ylab = "Wage",
      colour = I(cbPalette[1]), alpha = I(0.75)) +
  stat_smooth(method = "lm",
              formula = y ~ cut(x, breaks = c(-Inf, 25, 35, 65, Inf)), lwd = 1.25) +
  theme_bw()
```
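The step-function fits work because `cut()` converts the numeric `age` into a factor of intervals, so `lm()` estimates a separate constant within each bin. A small sketch (the example ages are mine, not from the lecture) showing which bin each value falls into under the finest set of breakpoints used above:

```{r}
# Sketch (not in the lecture code): cut() bins numeric values into
# intervals; lm() then fits one constant per interval (a step function)
ages <- c(20, 30, 50, 70)
bins <- cut(ages, breaks = c(-Inf, 25, 35, 65, Inf))
data.frame(age = ages, bin = bins)
```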