Lecture 3: Manipulating data; lists; functions; if-else statements

Prof. Alexandra Chouldechova

Agenda

  • More on data frames
  • Basic dplyr command
  • Lists
  • Writing functions in R
  • If-else statements

Load dplyr

  • In the coming weeks we'll learn a lot more about the dplyr and plyr libraries
  • Today we'll just need the mutate function from dplyr and mapvalues from plyr
library(plyr)
library(dplyr)
# Loading dplyr before plyr typically breaks things.  Don't do it.

Note Remember to run install.packages("dplyr") and similarly for plyr before trying to load the libraries for the first time.

More on data frames

library(MASS)
head(Cars93, 3)
  Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
1        Acura Integra   Small      12.9  15.9      18.8       25
2        Acura  Legend Midsize      29.2  33.9      38.7       18
3         Audi      90 Compact      25.9  29.1      32.3       20
  MPG.highway            AirBags DriveTrain Cylinders EngineSize
1          31               None      Front         4        1.8
2          25 Driver & Passenger      Front         6        3.2
3          26        Driver only      Front         6        2.8
  Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
1        140 6300         2890             Yes               13.2
2        200 5500         2335             Yes               18.0
3        172 5500         2280             Yes               16.9
  Passengers Length Wheelbase Width Turn.circle Rear.seat.room
1          5    177       102    68          37           26.5
2          5    195       115    71          38           30.0
3          5    180       102    67          37           28.0
  Luggage.room Weight  Origin          Make
1           11   2705 non-USA Acura Integra
2           15   3560 non-USA  Acura Legend
3           14   3375 non-USA       Audi 90

Adding a column: `mutate()` function from `dplyr`

  • mutate() returns a new data frame with columns modified or added as specified by the function call
Cars93.metric <- mutate(Cars93, 
                           KMPL.city = 0.425 * MPG.city, 
                           KMPL.highway = 0.425 * MPG.highway)
tail(names(Cars93.metric))
[1] "Luggage.room" "Weight"       "Origin"       "Make"        
[5] "KMPL.city"    "KMPL.highway"
  • Our data frame has two new columns, giving the fuel consumption in km/l

Another approach

# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
[1] "Weight"       "Origin"       "Make"         "KMPL.city"   
[5] "KMPL.highway" "KMPL.city.2" 
  • Let's check that both approaches did the same thing
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
[1] TRUE

Changing levels of a factor: mapvalues

manufacturer <- Cars93$Manufacturer
head(manufacturer, 10)
 [1] Acura    Acura    Audi     Audi     BMW      Buick    Buick   
 [8] Buick    Buick    Cadillac
32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo

We'll use the mapvalues(x, from, to) function from the plyr library.

# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- mapvalues(manufacturer, 
                                   from = c("Chevrolet", "Pontiac", "Buick"), 
                                   to = rep("GM", 3))

head(manufacturer.combined, 10)
 [1] Acura    Acura    Audi     Audi     BMW      GM       GM      
 [8] GM       GM       Cadillac
30 Levels: Acura Audi BMW GM Cadillac Chrylser Chrysler Dodge ... Volvo

Another example

  • A lot of data comes with integer encodings of levels

  • You may want to convert the integers to more meaningful values for the purpose of your analysis

  • Let's pretend that in the class survey 'Program' was coded as an integer with 1 = MISM, 2 = Other, 3 = PPM

survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data.csv", header=TRUE, sep=",") 
survey <- mutate(survey, Program=as.numeric(Program))
head(survey)
  Program                PriorExp      Rexperience OperatingSystem TVhours
1       3         Some experience       Never used        Mac OS X       2
2       3 Never programmed before       Never used         Windows      15
3       3         Some experience Basic competence        Mac OS X      16
4       2         Some experience Basic competence        Mac OS X       0
5       3 Never programmed before       Never used         Windows       2
6       3 Never programmed before       Never used         Windows       5
          Editor
1 Microsoft Word
2 Microsoft Word
3 Microsoft Word
4          LaTeX
5 Microsoft Word
6 Microsoft Word

Example continued

  • Here's how we would get back the program codings using the mutate(), as.factor() and mapvalues() functions
survey <- mutate(survey, 
                Program = as.factor(mapvalues(Program, 
                                              c(1, 2, 3), 
                                              c("MISM", "Other", "PPM")))
                )
head(survey)
  Program                PriorExp      Rexperience OperatingSystem TVhours
1     PPM         Some experience       Never used        Mac OS X       2
2     PPM Never programmed before       Never used         Windows      15
3     PPM         Some experience Basic competence        Mac OS X      16
4   Other         Some experience Basic competence        Mac OS X       0
5     PPM Never programmed before       Never used         Windows       2
6     PPM Never programmed before       Never used         Windows       5
          Editor
1 Microsoft Word
2 Microsoft Word
3 Microsoft Word
4          LaTeX
5 Microsoft Word
6 Microsoft Word

Some more data frame summaries: `table()` function

  • Let's revisit the Cars93 dataset

  • The table() function builds contingency tables showing counts at each combination of factor levels

table(Cars93$AirBags)

Driver & Passenger        Driver only               None 
                16                 43                 34 
table(Cars93$Origin)

    USA non-USA 
     48      45 
table(Cars93$AirBags, Cars93$Origin)

                     USA non-USA
  Driver & Passenger   9       7
  Driver only         23      20
  None                16      18
  • Looks like US and non-US cars had about the same distribution of AirBag types

  • Later in the class we'll learn how to do a hypothesis tests on this kind of data

Alternative syntax

  • When table() is supplied a data frame, it produces contingency tables for all combinations of factors
head(Cars93[c("AirBags", "Origin")], 3)
             AirBags  Origin
1               None non-USA
2 Driver & Passenger non-USA
3        Driver only non-USA
table(Cars93[c("AirBags", "Origin")])
                    Origin
AirBags              USA non-USA
  Driver & Passenger   9       7
  Driver only         23      20
  None                16      18

Basics of lists

A list is a data structure that can be used to store different kinds of data

  • Recall: a vector is a data structure for storing similar kinds of data

  • To better understand the difference, consider the following example.

my.vector.1 <- c("Michael", 165, TRUE) # (name, weight, is.male)
my.vector.1 
[1] "Michael" "165"     "TRUE"   
typeof(my.vector.1)  # All the elements are now character strings!
[1] "character"

Lists vs. vectors

my.vector.2 <- c(FALSE, TRUE, 27) # (is.male, is.citizen, age)
typeof(my.vector.2)
[1] "double"
  • Vectors expect elements to be all of the same type (e.g., Boolean, numeric, character)

  • When data of different types are put into a vector, the R converts everything to a common type

Lists

  • To store data of different types in the same object, we use lists

  • Simple way to build lists: use list() function

my.list <- list("Michael", 165, TRUE)
my.list
[[1]]
[1] "Michael"

[[2]]
[1] 165

[[3]]
[1] TRUE
sapply(my.list, typeof)
[1] "character" "double"    "logical"  

Named elements

patient.1 <- list(name="Michael", weight=165, is.male=TRUE)
patient.1
$name
[1] "Michael"

$weight
[1] 165

$is.male
[1] TRUE

Referencing elements of a list (similar to data frames)

patient.1$name # Get "name" element (returns a string)
[1] "Michael"
patient.1[["name"]] # Get "name" element (returns a string)
[1] "Michael"
patient.1["name"] # Get "name" slice (returns a sub-list)
$name
[1] "Michael"
c(typeof(patient.1$name), typeof(patient.1["name"]))
[1] "character" "list"     

Functions

  • We have used a lot of built-in functions: mean(), subset(), plot(), read.table()

  • An important part of programming and data analysis is to write custom functions

  • Functions help make code modular

  • Functions make debugging easier

  • Remember: this entire class is about applying functions to data

What is a function?

A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

  • Let's look at a really simple function
addOne <- function(x) {
  x + 1
}
  • x is the argument or input

  • The function output is the input x incremented by 1

addOne(12)
[1] 13

More interesting example

  • Here's a function that returns a % given a numerator, denominator, and desired number of decimal values
calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)
[1] 33.8
  • If you're calculating several %'s for your report, you should use this kind of function instead of repeatedly copying and pasting code

Function returning a list

  • Here's a function that takes a person's full name (FirstName LastName), weight in lb and height in inches and converts it into a list with the person's first name, person's last name, weight in kg, height in m, and BMI.
createPatientRecord <- function(full.name, weight, height) {
  name.list <- strsplit(full.name, split=" ")[[1]]
  first.name <- name.list[1]
  last.name <- name.list[2]
  weight.in.kg <- weight / 2.2
  height.in.m <- height * 0.0254
  bmi <- weight.in.kg / (height.in.m ^ 2)
  list(first.name=first.name, last.name=last.name, weight=weight.in.kg, height=height.in.m,
       bmi=bmi)
}

Trying out the function

createPatientRecord("Michael Smith", 185, 12 * 6 + 1)
$first.name
[1] "Michael"

$last.name
[1] "Smith"

$weight
[1] 84.09091

$height
[1] 1.8542

$bmi
[1] 24.45884

Another example: 3 number summary

  • Calculate mean, median and standard deviation
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
    mean   median       sd 
5.060500 4.968122 1.787651 

If-else statements

  • Oftentimes we want our code to have different effects depending on the features of the input

  • Example: Calculating a student's letter grade

    • If grade >= 90, assign A
    • Otherwise, if grade >= 80, assign B
    • Otherwise, if grade >= 70, assign C
    • In all other cases, assign F
  • To code this up, we use if-else statements

If-else Example: Letter grades

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
sapply(course.grades, FUN=calculateLetterGrade)
[1] "A" "C" "B" "A" "F"

`return()`

  • In the previous examples we specified the output simply by writing the output variable as the last line of the function

  • More explicitly, we can use the return() function

addOne <- function(x) {
  return(x + 1)
}

addOne(12)
[1] 13
  • We will generally avoid the return() function, but you can use it if necessary or if it makes writing a particular function easier.

Reminders

  • Homework 1 due 1:20PM on Thursday

  • Lab 3

  • If you have questions, feel free to post on the Piazza Discussion Forum or attend office hours