Fall 2020

Agenda

  • Wrap up of Lecture 2 content
  • More on data frames
  • Basic tidyverse (dplyr) commands
  • Lists
  • Writing functions in R
  • If-else statements
  • R coding style

Wrapping up Lecture 2 content

Load tidyverse

  • Most of the functions we’re using just come from dplyr, but we’ll load all of tidyverse anyway
library(tidyverse)

Grab a toy dataset from MASS library

  • Rather than loading the full MASS library, we’ll use the :: syntax to pull a specific object/function from the library

  • Loading all of MASS with library(MASS) after tidyverse is loaded has the unintended consequence of masking the dplyr select() function with the MASS select() function. Calls to select() on a data frame then fail with confusing errors.

Cars93 <- MASS::Cars93
head(Cars93, 3)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3         Audi      90 Compact      25.9  29.1      32.3       20
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
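
As a hedged illustration of the :: syntax described above, the sketch below pulls one object from MASS and calls dplyr's select() by its full name, so the result does not depend on which packages are attached or in what order. It assumes the MASS and dplyr packages are installed; cars.small is a hypothetical name used only for illustration.

```r
# Assumes the MASS and dplyr packages are installed.
# MASS::Cars93 reads one object without attaching MASS.
cars <- MASS::Cars93

# pkg::fun() always calls that package's function, even when
# another attached package defines a function with the same name.
cars.small <- dplyr::select(cars, Manufacturer, Model, Price)
dim(cars.small)
```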

Adding a column: mutate() function from dplyr

  • mutate() returns a new data frame with columns modified or added as specified by the function call
Cars93.metric <- mutate(Cars93, 
                        KMPL.city = 0.425 * MPG.city, 
                        KMPL.highway = 0.425 * MPG.highway)
tail(names(Cars93.metric))
## [1] "Luggage.room" "Weight"       "Origin"       "Make"        
## [5] "KMPL.city"    "KMPL.highway"
  • Our data frame has two new columns, giving the fuel economy in km/l

Another approach

# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
## [1] "Weight"       "Origin"       "Make"         "KMPL.city"   
## [5] "KMPL.highway" "KMPL.city.2"
  • Let’s check that both approaches did the same thing
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
## [1] TRUE
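
Because mutate() returns a new data frame rather than changing its input, the original stays untouched unless you reassign it. Here is a minimal sketch on a made-up two-row data frame (toy and its columns are hypothetical, and dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical toy data, not part of Cars93
toy <- data.frame(model = c("A", "B"), mpg = c(25, 18))

# mutate() returns a modified copy; toy itself is not changed
toy.metric <- mutate(toy, kmpl = 0.425 * mpg)

names(toy)         # original still has only model and mpg
names(toy.metric)  # the copy has the extra kmpl column
```

By contrast, assigning with $ adds the column directly to the data frame named on the left-hand side.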

Changing levels of a factor: recode()

manufacturer <- Cars93$Manufacturer
head(manufacturer, 8)
## [1] Acura Acura Audi  Audi  BMW   Buick Buick Buick
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo

We’ll use the recode() function from the dplyr library, which gets loaded when you load tidyverse.

# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- recode(manufacturer, 
                                "Chevrolet" = "GM", "Pontiac" = "GM", "Buick" = "GM")

head(manufacturer.combined, 8)
## [1] Acura Acura Audi  Audi  BMW   GM    GM    GM   
## 30 Levels: Acura Audi BMW GM Cadillac Chrylser Chrysler Dodge ... Volvo
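
For comparison, the same kind of merge can be sketched in base R by assigning to levels(). This is an alternative approach, not a description of how recode() works internally; the makes vector and gm.brands are made-up names for illustration.

```r
# Hypothetical toy factor
makes <- factor(c("Buick", "Chevrolet", "Audi", "Pontiac"))
gm.brands <- c("Chevrolet", "Pontiac", "Buick")

# Renaming several levels to the same label merges them into one level
makes.combined <- makes
levels(makes.combined)[levels(makes.combined) %in% gm.brands] <- "GM"

levels(makes.combined)
## [1] "Audi" "GM"
as.character(makes.combined)
## [1] "GM"   "GM"   "Audi" "GM"
```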

Another example: recode_factor()

  • A lot of data comes with integer encodings of levels

  • You may want to convert the integers to more meaningful values for the purpose of your analysis

  • Let’s pretend that in the class survey ‘Program’ was coded as an integer with 1 = MISM, 2 = Other, 3 = PPM

# Load data
survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv", 
                     header=TRUE, sep=",") 
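# Recode Program to have integer codings
# (factor() first, so this also works in R >= 4.0, where read.table() returns character columns by default)
survey <- mutate(survey, Program = as.numeric(factor(Program)))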
head(survey)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1       3         Some experience       Never used         Windows    10.5
## 2       2    Extensive experience Basic competence        Mac OS X     3.0
## 3       1 Never programmed before Basic competence         Windows     0.0
## 4       3 Never programmed before       Never used         Windows    10.0
## 5       3 Never programmed before       Never used         Windows     4.0
## 6       3         Some experience Basic competence        Mac OS X     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
## 4          Excel
## 5 Microsoft Word
## 6 Microsoft Word

Example continued: recode_factor()

  • Here’s how we would get back the program names using recode_factor(), a variant of recode() that always returns a factor, with levels ordered according to the order of the mappings.
  • Note the backticks around the numbers. They are needed because bare numbers are not valid argument names in R.
survey <- mutate(survey,
                 Program = recode_factor(Program,
                                         `3` = "PPM",
                                         `1` = "MISM",
                                         `2` = "Other"))

head(survey)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
## 4     PPM Never programmed before       Never used         Windows    10.0
## 5     PPM Never programmed before       Never used         Windows     4.0
## 6     PPM         Some experience Basic competence        Mac OS X     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
## 4          Excel
## 5 Microsoft Word
## 6 Microsoft Word
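
A similar integer-to-label mapping can be sketched in base R with factor(), where the order of the levels argument sets the order of the resulting levels. The program.codes vector below is made up to mirror the first six rows shown above; it is not read from the survey file.

```r
# Hypothetical integer codes: 1 = MISM, 2 = Other, 3 = PPM
program.codes <- c(3, 2, 1, 3, 3, 3)

# levels gives the codes in the desired order; labels gives their names
program <- factor(program.codes,
                  levels = c(3, 1, 2),
                  labels = c("PPM", "MISM", "Other"))

levels(program)
## [1] "PPM"   "MISM"  "Other"
as.character(program)
## [1] "PPM"   "Other" "MISM"  "PPM"   "PPM"   "PPM"
```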

Some more data frame summaries: table() function

  • Let’s revisit the Cars93 dataset

  • The table() function builds contingency tables (i.e., count tables) showing counts at each combination of factor levels

table(Cars93$AirBags)
## 
## Driver & Passenger        Driver only               None 
##                 16                 43                 34

table(Cars93$Origin)
## 
##     USA non-USA 
##      48      45
table(Cars93$AirBags, Cars93$Origin)
##                     
##                      USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18
  • Looks like US and non-US cars had about the same distribution of AirBag types

  • Later in the class we’ll learn how to run hypothesis tests on this kind of data
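
One way to make the "about the same distribution" comparison concrete is to convert counts to column proportions with prop.table(). The sketch below re-enters the counts from the table above by hand so that it runs on its own; airbag.counts and col.props are illustrative names.

```r
# Counts copied from the AirBags x Origin table above
airbag.counts <- matrix(
  c(9, 23, 16,   # USA
    7, 20, 18),  # non-USA
  nrow = 3,
  dimnames = list(AirBags = c("Driver & Passenger", "Driver only", "None"),
                  Origin  = c("USA", "non-USA"))
)

# margin = 2 divides each column by its total,
# giving the share of each airbag type within each origin
col.props <- prop.table(airbag.counts, margin = 2)
round(col.props, 2)
```

With the real data, the same call works directly on table(Cars93$AirBags, Cars93$Origin).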

Alternative syntax

  • When table() is given a data frame, it cross-tabulates all of the columns in that data frame
head(Cars93[c("AirBags", "Origin")], 3)
##              AirBags  Origin
## 1               None non-USA
## 2 Driver & Passenger non-USA
## 3        Driver only non-USA
table(Cars93[c("AirBags", "Origin")])
##                     Origin
## AirBags              USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18

Tidy count tables: count()

If we’re going to be plotting or further analysing our results, it is helpful to have them in a data frame instead of a tabular layout. That’s where the count() function comes in.

Cars93 %>% count(AirBags)
## # A tibble: 3 x 2
##   AirBags                n
##   <fct>              <int>
## 1 Driver & Passenger    16
## 2 Driver only           43
## 3 None                  34
Cars93 %>% count(AirBags, Origin)
## # A tibble: 6 x 3
##   AirBags            Origin      n
##   <fct>              <fct>   <int>
## 1 Driver & Passenger USA         9
## 2 Driver & Passenger non-USA     7
## 3 Driver only        USA        23
## 4 Driver only        non-USA    20
## 5 None               USA        16
## 6 None               non-USA    18
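
A base R route to a similar long-format result is to wrap table() in as.data.frame(). This is a sketch on made-up data (toy and counts are hypothetical names). One difference to be aware of: this version keeps combinations that never occur, with a count of 0, whereas count() drops them by default.

```r
# Hypothetical toy data
toy <- data.frame(
  airbags = c("None", "Driver only", "None", "Driver only", "None"),
  origin  = c("USA", "USA", "non-USA", "USA", "USA")
)

# One row per combination of levels, with the count in column n
counts <- as.data.frame(table(toy), responseName = "n")
counts
```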

Basics of lists

A list is a data structure that can be used to store different kinds of data

  • Recall: a vector is a data structure for storing similar kinds of data

  • To better understand the difference, consider the following example.

my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1 
## [1] "Michael" "165"     "TRUE"
typeof(my.vector.1)  # All the elements are now character strings!
## [1] "character"

Lists vs. vectors

my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
typeof(my.vector.2)
## [1] "double"
  • Vectors expect elements to be all of the same type (e.g., logical, numeric, character)

  • When data of different types are put into a vector, R converts everything to a common type
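
The conversion follows a fixed ordering: logical values become numbers when mixed with numbers, and everything becomes a character string when mixed with strings. A short sketch:

```r
typeof(c(TRUE, 2L))    # logical + integer
## [1] "integer"
typeof(c(TRUE, 2.5))   # logical + double
## [1] "double"
typeof(c(2.5, "a"))    # double + character
## [1] "character"

# TRUE is converted to 1 and FALSE to 0
c(FALSE, TRUE, 27)
## [1]  0  1 27
```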

Lists

  • To store data of different types in the same object, we use lists

  • Simple way to construct lists: use list() function

  • (We’ll learn about functions like map and map_chr soon.)

my.list <- list("Michael", 165, TRUE)
my.list
## [[1]]
## [1] "Michael"
## 
## [[2]]
## [1] 165
## 
## [[3]]
## [1] TRUE
map_chr(my.list, typeof)
## [1] "character" "double"    "logical"

Named elements

patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
## $name
## [1] "Michael"
## 
## $weight
## [1] 165
## 
## $is.male
## [1] TRUE

Referencing elements of a list (similar to data frames)

patient.1$name # Get "name" element (returns a string)
## [1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
## [1] "Michael"
patient.1["name"] # Get "name" slice (returns a list)
## $name
## [1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
## [1] "character" "list"

Functions

  • We have used a lot of built-in functions: mean(), subset(), plot(), read.table()

  • An important part of programming and data analysis is to write custom functions

  • Functions help make code modular

  • Functions make debugging easier

  • Remember: this entire class is about applying functions to data

What is a function?

A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

  • Let’s look at a really simple function
addOne <- function(x) {
  x + 1
}
  • x is the argument or input

  • The function output is the input x incremented by 1

addOne(12)
## [1] 13

More interesting example

  • Here’s a function that returns a percentage given a numerator, a denominator, and the desired number of decimal places
calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)
## [1] 33.8
  • If you’re calculating several %’s for your report, you should use this kind of function instead of repeatedly copying and pasting code
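
Since division and round() both work element by element, the same function can compute several percentages in one call. The sketch below repeats the definition so that it runs on its own; the input values are made up.

```r
calculatePercentage <- function(x, y, d) {
  decimal <- x / y           # Calculate decimal value
  round(100 * decimal, d)    # Convert to % and round to d digits
}

# Three numerators, one shared denominator
calculatePercentage(c(20, 40, 60), 80, 1)
## [1] 25 50 75
```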

Function returning a list

  • Here’s a function that takes a person’s full name (FirstName LastName), weight in lb, and height in inches, and returns a list with the person’s first name, last name, weight in kg, height in m, and BMI.
createPatientRecord <- function(full.name, weight, height) {
  name.list <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.list[1]
  last.name <- name.list[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
       bmi=bmi)
}

Trying out the function

createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
## $first.name
## [1] "Michael"
## 
## $last.name
## [1] "Smith"
## 
## $weight
## [1] 84.09091
## 
## $height
## [1] 1.8542
## 
## $bmi
## [1] 24.45884

Another example: 3 number summary

  • Calculate mean, median and standard deviation
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
##     mean   median       sd 
## 5.296375 5.361622 2.081283

If-else statements

  • Oftentimes we want our code to have different effects depending on the features of the input

  • Example: Calculating a student’s letter grade
    • If grade >= 90, assign A
    • Otherwise, if grade >= 80, assign B
    • Otherwise, if grade >= 70, assign C
    • In all other cases, assign F
  • To code this up, we use if-else statements

If-else Example: Letter grades

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
map_chr(course.grades, calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"
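
The function is applied with map_chr() because if() expects a single TRUE/FALSE condition, so it handles one grade at a time. As an alternative sketch (not the approach used in this lecture), base R's cut() can bin the whole vector at once using the same cutoffs; letter.grades is an illustrative name.

```r
course.grades <- c(92, 78, 87, 91, 62)

# right = FALSE makes each bin include its lower bound,
# e.g. [80, 90) is a B and [90, Inf) is an A
letter.grades <- cut(course.grades,
                     breaks = c(-Inf, 70, 80, 90, Inf),
                     labels = c("F", "C", "B", "A"),
                     right = FALSE)

as.character(letter.grades)
## [1] "A" "C" "B" "A" "F"
```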

return()

  • In the previous examples we specified the output simply by writing the output variable as the last line of the function

  • More explicitly, we can use the return() function

addOne <- function(x) {
  return(x + 1)
}

addOne(12)
## [1] 13
  • We will generally avoid return() and rely on the last evaluated expression instead, but you can use it when it is necessary or when it makes a particular function easier to write.
  • Google’s R style guide recommends explicit returns; most other R style guides do not.
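
One case where return() can make a function easier to write is exiting early, before the rest of the body runs. A minimal sketch, using a hypothetical safeDivide() function that is not part of the lecture code:

```r
# Hypothetical helper: divide x by y, but give NA instead of Inf or NaN
# when the denominator is zero
safeDivide <- function(x, y) {
  if (y == 0) {
    return(NA_real_)  # exit early; the line below never runs
  }
  x / y               # otherwise the last expression is the return value
}

safeDivide(10, 4)
## [1] 2.5
safeDivide(1, 0)
## [1] NA
```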

R coding style

Reminders

  • Homework 1 due 1:30PM ET on Wednesday

  • Lab 3 is posted

  • If you have questions, feel free to post on the Piazza Discussion Forum or attend office hours