---
title: "Homework 5"
author: "Your Name Here"
date: 'Assigned: February 15, 2018'
output:
  html_document:
    theme: paper
    highlight: tango
    toc: true
    toc_depth: 3
---

##### This homework is due by **1:20pm on Tuesday, February 27**.

### Problem 1: Linear regression with bikeshare data

```{r, include = FALSE}
library(ggplot2)
library(plyr)
library(dplyr)
```

For this problem we'll be working with two years of bikeshare data from the Capital Bikeshare system in Washington DC. The dataset contains daily bikeshare counts, along with daily measurements of environmental and seasonal information that may affect bikeshare usage. Let's start by loading the data.

```{r}
bikes <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/bikes.csv", header = TRUE)

# Transform temp and atemp to degrees C instead of the [0,1] scale
# Transform humidity to %
# Transform wind speed (multiply by 67, the normalizing value)
bikes <- transform(bikes,
                   temp = 47 * temp - 8,
                   atemp = 66 * atemp - 16,
                   hum = 100 * hum,
                   windspeed = 67 * windspeed)
```

Here's information on what the variables mean.

- instant: record index
- dteday: date
- season: season (1: Winter, 2: Spring, 3: Summer, 4: Fall)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour (0 to 23)
- holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday: day of the week
- workingday: 1 if the day is neither a weekend nor a holiday, 0 otherwise
- weathersit:
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp: temperature in Celsius
- atemp: feeling temperature in Celsius
- hum: humidity (%)
- windspeed: wind speed
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes, including both casual and registered

**(a)** Season: Factor or numeric? Consider the following two regression models.

```{r}
bikes.lm1 <- lm(cnt ~ yr + temp + hum + season, data = bikes)

bikes.lm2 <- lm(cnt ~ yr + temp + hum + as.factor(season), data = bikes)
```

**What is the difference between these two models?**

Replace this text with your answer. (do not delete the html tags)

**(b)** What is the interpretation of the coefficient(s) of `season` in each of the models in part (a)?

Replace this text with your answer. (do not delete the html tags)

**(c)** Use ggplot2 graphics to construct boxplots of count for each season. Based on what you see from these plots, which model do you think makes more sense? Explain.

```{r}
# Edit me
```

Replace this text with your answer. (do not delete the html tags)

**(d)** Using `ggplot2` graphics, construct a scatterplot of `cnt` (bikeshare count) across `mnth` (month of the year). Describe what you see. Would a linear model be a good way of modeling how bikeshare count varies with month?

```{r}
# Edit me
```
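For orientation, here is a minimal sketch of the kind of plots parts (c) and (d) ask for; it assumes the transformed `bikes` data frame from above and is not meant as a complete answer.

```{r, eval = FALSE}
# Sketch only: one possible starting point for the plots in (c) and (d)

# Boxplots of count by season (treating season as a factor)
ggplot(bikes, aes(x = as.factor(season), y = cnt)) +
  geom_boxplot() +
  labs(x = "Season", y = "Bikeshare count")

# Scatterplot of count across month
ggplot(bikes, aes(x = mnth, y = cnt)) +
  geom_point() +
  labs(x = "Month", y = "Bikeshare count")
```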
**(e)** Consider the following three models. Figures of the model fits are shown to help you interpret the models.

```{r}
# Fit and plot first model
bikes.lm3 <- lm(cnt ~ mnth, data = bikes)

qplot(data = bikes, x = mnth, y = cnt) + stat_smooth(method = "lm")
```

```{r}
# Fit and plot second model
bikes.lm4 <- lm(cnt ~ as.factor(mnth), data = bikes)

# Construct data frame that has fitted values for each month
mnth.df <- data.frame(mnth = unique(bikes$mnth))
df.lm4 <- data.frame(mnth = unique(bikes$mnth),
                     fitted = predict(bikes.lm4, newdata = mnth.df))

# Red points are the fitted values from the model
qplot(data = bikes, x = mnth, y = cnt, alpha = 0.5) +
  guides(alpha = FALSE) +
  geom_point(data = df.lm4, aes(x = mnth, y = fitted),
             colour = I("red"), alpha = 1, size = I(5))
```

```{r}
# Fit and plot third model
bikes.lm5 <- lm(cnt ~ mnth + I(mnth^2), data = bikes)

qplot(data = bikes, x = mnth, y = cnt) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2))
```

**What are the differences between the models? How many coefficients are used to model the relationship between bikeshare count and month in each of the models?**

*Answer here*

**(f)** Use the `plot()` function to produce diagnostic plots for `bikes.lm3`. Do you observe any problems?

```{r}
# Edit me
```

**(g)** Produce diagnostic plots for `bikes.lm4` and `bikes.lm5`. Are any of the problems you identified in part (f) resolved? Do any remain? Explain.

```{r}
# Edit me
```

### Problem 2: Interpreting and testing linear models

This problem once again uses the `bikes` data.

**(a)** Use the `transform` and `mapvalues` functions to map the values of the `season` and `weathersit` variables to something more interpretable. Your new `weathersit` variable should have levels: Clear, Cloud, Light.precip, Heavy.precip.

```{r}
# Edit me
```
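A minimal sketch of the kind of recoding part (a) asks for, using `plyr::mapvalues`; the `season` labels follow the variable descriptions in Problem 1, and the `weathersit` labels follow the problem statement.

```{r, eval = FALSE}
# Sketch only: recode season and weathersit to interpretable labels
bikes <- transform(bikes,
                   season = mapvalues(season, from = 1:4,
                                      to = c("Winter", "Spring", "Summer", "Fall")),
                   weathersit = mapvalues(weathersit, from = 1:4,
                                          to = c("Clear", "Cloud",
                                                 "Light.precip", "Heavy.precip")))
```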
**(b)** Fit a linear regression model with `cnt` as the outcome, and season, workingday, weathersit, temp, atemp and hum as covariates. Use the `summary` function to print a summary of the model showing the model coefficients.

Note that you may wish to code the workingday variable as a factor, though this will not change the estimated coefficient or its interpretation (workingday is already a 0-1 indicator).

```{r}
# Edit me
```

**(c)** How do you interpret the coefficient of `workingday` in the model? What is its p-value? Is it statistically significant?

```{r}
# Edit me
```

**(d)** How do you interpret the coefficient corresponding to the Light.precip weather situation? What is its p-value? Is it statistically significant?

```{r}
# Edit me
```

**(e)** Use the `pairs` function to construct a pairs plot of temp, atemp and hum. The bottom panel of your plot should show correlations (see the example in Lecture 9).

```{r}
# Edit me
```

**Do you observe any strong collinearities between the variables?**

```{r}
# Edit me
```

**(f)** Use the `update` function to update your linear model by removing the `temp` variable. Has the coefficient of `atemp` changed? Has it changed a lot? Explain.

```{r}
# Edit me
```

**(g)** How do you interpret the coefficient of `atemp` in the model from part (f)?

**(h)** Use the `anova()` function on your model from part (f) to assess whether `weathersit` is a statistically significant factor in the model. Interpret your findings.

```{r}
# Edit me
```

### Problem 3: Exploring statistical significance

This problem will guide you through the process of conducting a simulation to investigate statistical significance in regression. In this setup, we'll repeatedly simulate `n` observations of 3 variables: an outcome `y`, a covariate `x1` that's associated with `y`, and a covariate `x2` that is unassociated with `y`.

Our model is:
$$ y_i = 0.5 x_{i1} + \epsilon_i, $$
where the $\epsilon_i$ are independent normal noise variables having standard deviation 5 (i.e., Normal random variables with mean 0 and standard deviation 5).

Here's the setup.

```{r}
set.seed(12345)  # Set seed of random number generator
n <- 200  # Number of observations
x1 <- runif(n, min = 0, max = 10)  # Random covariate
x2 <- rnorm(n, 0, 10)  # Another random covariate
```

To generate a random realization of the outcome `y`, use the following command.

```{r}
# Random realization of y
y <- 0.5 * x1 + rnorm(n, mean = 0, sd = 5)
```

Here are plots of that random realization of the outcome `y`, plotted against `x1` and `x2`.

```{r}
qplot(x = x1, y = y)
qplot(x = x2, y = y)
```

**(a)** Write code that implements the following simulation (you'll want to use a for loop):

```
for 2000 simulations:
  generate a random realization of y
  fit regression of y on x1 and x2
  record the coefficient estimates, standard errors and p-values for x1 and x2
```

At the end you should have 2000 instances of estimated slopes, standard errors, and corresponding p-values for both `x1` and `x2`. It's most convenient to store these in a data frame.

```{r, cache = TRUE}
# Note the cache = TRUE chunk option. This tells R Markdown to store the
# output of this code chunk and only re-run the code when code in this
# chunk changes. By caching, you won't wind up re-running this code every
# time you knit.

# Edit me
```

**(b)** This problem has multiple parts.

- Construct a histogram of the coefficient estimates for `x1`.
- Calculate the average of the coefficient estimates for `x1`. Is the average close to the true value?
- Calculate the average of the standard errors for the coefficient of `x1`. Calculate the standard deviation of the coefficient estimates for `x1`. Are these numbers similar?

```{r}
# Edit me
```

**Take-away from this problem**: the `Std. Error` value in the linear model summary is an estimate of the standard deviation of the coefficient estimates.

**(c)** Repeat part (b) for `x2`.

```{r}
# Edit me
```

**(d)** Construct a histogram of the p-values for the coefficient of `x1`. What do you see? What % of the time is the p-value significant at the 0.05 level?

```{r}
# Edit me
```

**(e)** Repeat part (d) with `x2`. What % of the time is the p-value significant at the 0.05 level?

```{r}
# Edit me
```

**(f)** Given a coefficient estimate $\hat \beta$ and a standard error estimate $\hat{se}(\hat\beta)$, we can construct an approximate 95% confidence interval using the "2 standard error rule". That is,
$$ [\hat \beta - 2 \hat{se}, \, \hat \beta + 2 \hat{se}] $$
is an approximate 95% confidence interval for the true unknown coefficient. As part of your simulation you stored $\hat \beta$ and $\hat{se}$ values for 2000 simulation instances. Use these estimates to construct approximate confidence intervals and answer the following questions.

- **Question**: In your simulation, what % of such confidence intervals constructed for the coefficient of `x1` actually contain the true value of the coefficient ($\beta_1 = 0.5$)?

Replace this text with your answer. (do not delete the html tags)

- **Question**: In your simulation, what % of such confidence intervals constructed for the coefficient of `x2` actually contain the true value of the coefficient ($\beta_2 = 0$)?

Replace this text with your answer. (do not delete the html tags)
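To make the structure of the simulation concrete, here is a minimal sketch of how parts (a) and (f) might be organized; it assumes the setup chunk above has been run, and the data frame and column names (`sim.results`, `b1`, `se1`, and so on) are placeholders rather than required names.

```{r, eval = FALSE}
# Sketch only: simulate 2000 realizations, fit the model, and record
# the estimate, standard error, and p-value for x1 and x2
n.sim <- 2000
sim.results <- data.frame(b1 = numeric(n.sim), se1 = numeric(n.sim),
                          p1 = numeric(n.sim), b2 = numeric(n.sim),
                          se2 = numeric(n.sim), p2 = numeric(n.sim))
for (i in 1:n.sim) {
  y <- 0.5 * x1 + rnorm(n, mean = 0, sd = 5)
  coefs <- summary(lm(y ~ x1 + x2))$coefficients
  # Columns 1, 2, 4 of the coefficient matrix are the estimate,
  # standard error, and p-value
  sim.results[i, ] <- c(coefs["x1", c(1, 2, 4)], coefs["x2", c(1, 2, 4)])
}

# Approximate 95% CI coverage for the coefficient of x1 (true value 0.5),
# using the 2 standard error rule from part (f)
with(sim.results, mean(b1 - 2 * se1 <= 0.5 & 0.5 <= b1 + 2 * se1))
```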
### Problem 4: A more robust approach to error bars

On homework 4 you completed several exercises that involved using `ddply` to calculate error bars via the `t.test` command. When the x-axis variable has one or more categories that have low counts, it's possible for this approach to fail. Here's an example of this issue.

```{r, cache = TRUE}
# Data import & transformation
# DO NOT CHANGE THIS CODE
income.data <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/homework/nlsy97_income.csv", header = TRUE)
income.data <- subset(income.data,
                      select = c("R0536300", "T7545600", "T6656900"),
                      subset = T6656900 %in% c(1:9, 11))
income.data[income.data < 0] <- NA
income.data <- na.omit(income.data)
colnames(income.data) <- c("gender", "income", "household.size")
income.data <- transform(income.data,
                         gender = mapvalues(gender, 1:2, c("Male", "Female")))
```

Let's look at how many observations we have at every level of household size:

```{r, cache = TRUE}
with(income.data, table(household.size, gender))
```

> We can see that counts are very low in some of the categories. For instance, we have just 5 men in households of size 9.

**(a)** Construct a modified t-test function that operates in the following way (CI below stands for "confidence interval"):

1. Inputs: 4 inputs
    - `x`: numeric vector
    - `group`: 2-level factor vector of the same length as `x`
    - `k`: integer, giving the minimum group size
    - `flip.sign`: logical. Should the CI and mean be flipped?
2. Output: a vector with the following elements:
    - `lower`: the lower value of the error bar (CI)
    - `upper`: the upper value of the error bar (CI)
    - `is.signif`: for groups of size >= `k`, this value should be 0 if the CI overlaps 0, and 1 if it doesn't. For groups of size < `k`, this value should always be 0.
3. Behavior:
    - If `group` has at least `k` observations in each level, return the lower and upper confidence interval values from running `t.test` with formula `x ~ group`.
    - If `group` has fewer than `k` observations in one or both levels of this factor, `lower` and `upper` should both be set equal to the mean difference.
    - If `flip.sign = TRUE`, the difference and confidence interval should be calculated as group2 - group1 instead of group1 - group2.

```{r}
# Add comments to this function
betterTTestErrorBars <- function(x, group, k = 5, flip.sign = TRUE) {
  # Edit me
}
```
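Before writing the function, it may help to see which pieces of the `t.test` output it will need; this is a sketch only, using the `income.data` frame loaded above.

```{r, eval = FALSE}
# Sketch only: components of t.test output relevant to the function
tt <- t.test(income ~ gender, data = income.data)
tt$conf.int  # lower and upper endpoints of the 95% CI for the mean difference
tt$estimate  # the two group means
tt$p.value   # p-value for the test of equal group means
```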
**(b)** Modify the code below to use the `betterTTestErrorBars` command instead of the `t.test` command to produce a bar chart with error bars showing how the income gap varies with household size. Specify that the bars should be filled based on `is.significant`.

**Note: you'll need to set `eval = TRUE` in the code chunk header for the code to run.**

```{r, fig.width = 5, eval = FALSE}
# Calculate income gaps (male - female) and 95% confidence intervals
# for the gap
# MODIFY THIS CODE
gap.data.conf <- ddply(nlsy, ~ household.size, summarize,
                       income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
                         mean(income[gender == "Female"], na.rm = TRUE),
                       upper = -t.test(income ~ gender)$conf.int[1],
                       lower = -t.test(income ~ gender)$conf.int[2],
                       is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))

# Re-order the race factor according to gap size
gap.data.conf <- transform(gap.data.conf,
                           race = reorder(race, income.gap))

# Plot, with error bars
# MODIFY THIS CODE TO WORK FOR household.size INSTEAD OF race
ggplot(data = gap.data.conf, aes(x = race, y = income.gap)) +
  geom_bar(stat = "identity") +
  xlab("Race") +
  ylab("Income gap ($)") +
  ggtitle("Income gap between men and women, by race") +
  guides(fill = FALSE) +
  geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
  theme(text = element_text(size = 12))
```

### Problem 5: dlply + ldply

For this problem we'll return to the `gapminder` data set from Lecture 10.

```{r}
gapminder <- read.delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/gapminder_five_year.txt")
```

**(a)** Use `dlply` to produce a list of linear regression models, one for each country, regressing `gdpPercap` on `year`.

```{r}
# Edit me
```

**(b)** Use `ldply` on your list from part (a) to produce a data frame displaying, for each country, whether the coefficient of `year` was significant at the 0.05 level. Your output should be a two-column data frame: the first column gives the country, and the second column displays a 0 if the coefficient of `year` was not significant, and a 1 if it was significant.

```{r}
# Edit me
```

**(c)** The following code produces a data frame giving the continent for each country.

```{r}
summary.continent <- ddply(gapminder, ~ country, summarize, continent = unique(continent))
```

Use the `merge` function to merge your data frame from part (b) with `summary.continent` to produce a data frame that shows both the slope significance indicator and the continent. (See the Lecture 10 notes for examples of how to do this.)

```{r}
# Edit me
```

**(d)** Using your data frame from part (c), produce a contingency table showing counts for each combination of slope significance and continent.

```{r}
# Edit me
```

**(e)** What do you observe in the table from part (d)? What does a non-significant slope (coefficient of year) suggest about a country's economic growth?

Replace this text with your answer. (do not delete the html tags)
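For reference, here is a minimal sketch of the kind of `dlply` + `ldply` pipeline that parts (a) and (b) of this problem describe; the object names (`country.models`, `slope.signif`) are placeholders, and this is one possible approach rather than the required one.

```{r, eval = FALSE}
# Sketch only: a per-country model list, then a significance indicator
country.models <- dlply(gapminder, ~ country,
                        function(df) lm(gdpPercap ~ year, data = df))
slope.signif <- ldply(country.models, function(model) {
  p.val <- summary(model)$coefficients["year", "Pr(>|t|)"]
  c(is.significant = as.numeric(p.val < 0.05))
})
head(slope.signif)
```

Because `ldply` is applied to the output of `dlply`, the country labels used for the split are carried along, giving the country column of the resulting data frame.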