Fall 2020

Agenda

• For/while loops to iterate over data
• apply
• map, map_<type>, map_at, map_if
• mutate_at, mutate_if
• summarize_at, summarize_if

# Our favourite library
library(tidyverse)

# For Cars93 data again
Cars93 <- MASS::Cars93

# For the clean survey data:
header=TRUE, stringsAsFactors = FALSE)

More programming basics: loops

• We’ll now learn about loops, along with some loop alternatives that are more efficient and syntactically simpler

• Loops are ways of iterating over data

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

• The basic syntax takes the form
for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}
• The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"
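Looping over the values of a list drops the element names. If the names are needed as well, one option is to loop over positions with seq_along(). The sketch below reuses index.set from the example above; the types vector is a name introduced here purely for illustration.

```r
# Same list as in the example above
index.set <- list(name = "Michael", weight = 185, is.male = TRUE)

# Preallocate a character vector to hold one result per element
types <- character(length(index.set))

# Loop over positions 1, 2, 3 so both the name and the value are available
for (i in seq_along(index.set)) {
  types[i] <- typeof(index.set[[i]])
  cat(names(index.set)[i], "has type", types[i], "\n")
}
```

seq_along() is generally safer than 1:length(x), because it produces an empty index set when x is empty.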

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
##            [,1]      [,2]       [,3]       [,4]       [,5]
## [1,] -0.7567250 0.2771538 -0.5701664  0.5888506 -0.8897146
## [2,]  0.5451021 0.6679214 -0.5304780 -0.6796072 -0.5847928
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1]  8.118425  6.569486 20.780103 -2.837246  5.183321
colSums(fake.data) # A better approach (see also colMeans())
## [1]  8.118425  6.569486 20.780103 -2.837246  5.183321

while loops

• while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
• We won’t really be using while loops in this class

• Just be aware that they exist, and that they may become useful to you at some point in your analytics career
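A while loop is most natural when the number of iterations is not known in advance. Here is a small sketch using made-up numbers (a balance of 100 growing at 5% per year, neither of which comes from the course data) that counts how many years it takes for the balance to double.

```r
# Hypothetical values chosen only for illustration
balance <- 100   # starting balance
target  <- 200   # stop once the balance reaches this amount
rate    <- 0.05  # yearly growth rate
years   <- 0     # counter for completed years

# Keep compounding until the target is reached
while (balance < target) {
  balance <- balance * (1 + rate)
  years <- years + 1
}

years  # 15
```

A for loop would need the number of years up front; the while loop discovers it.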

Loop alternatives

Command | Description
--- | ---
apply(X, MARGIN, FUN) | Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
map(.x, .f, ...) | Obtain a list by applying .f to every element of a list or atomic vector .x
map_<type>(.x, .f, ...) | For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x
map_at(.x, .at, .f) | Obtain a list by applying .f to the elements of .x specified by name or index given in .at
map_if(.x, .p, .f) | Obtain a list by applying .f to the elements of .x selected by .p (a predicate function, or a logical vector)
mutate_all/_at/_if | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
summarize_all/_at/_if | Summarize all variables, specified (at) variables, or those selected by a predicate (if)

• These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

• The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
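The later slides do not show map_at() or map_if() in action, so here is a minimal sketch of both. The person list and the pounds-to-kilograms factor are made up for illustration and are not part of the course data.

```r
library(purrr)

# Hypothetical record, not part of the course data
person <- list(name = "Michael", city = "Pittsburgh", weight = 185)

# map_if(): apply toupper only to the elements for which is.character is TRUE
shout <- map_if(person, is.character, toupper)

# map_at(): apply the function only to the element named "weight"
metric <- map_at(person, "weight", ~ .x * 0.4536)  # pounds to kilograms

str(shout)
str(metric)
```

In both cases the elements that are not selected are returned unchanged, and the result is still a list.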

Example: apply()

colMeans(fake.data)
## [1]  0.08118425  0.06569486  0.20780103 -0.02837246  0.05183321
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1]  0.08118425  0.06569486  0.20780103 -0.02837246  0.05183321
# Function that calculates the proportion of elements of a vector that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
## [1] 0.50 0.53 0.50 0.52 0.55
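The examples above all use MARGIN=2. To see how the two margins differ, here is a sketch on a tiny matrix whose values are easy to check by eye (m, row.max and col.max are names introduced just for this illustration).

```r
# 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row.max <- apply(m, MARGIN = 1, FUN = max)  # one value per row:    5 6
col.max <- apply(m, MARGIN = 2, FUN = max)  # one value per column: 2 4 6
```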

Example: map, map_()

map(survey, is.numeric) # Returns a list
## $Program
## [1] FALSE
##
## $PriorExp
## [1] FALSE
##
## $Rexperience
## [1] FALSE
##
## $OperatingSystem
## [1] FALSE
##
## $TVhours
## [1] TRUE
##
## $Editor
## [1] FALSE
map_lgl(survey, is.numeric) # Returns a logical vector with named elements
##         Program        PriorExp     Rexperience OperatingSystem
##           FALSE           FALSE           FALSE           FALSE
##         TVhours          Editor
##            TRUE           FALSE

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # apply() coerces the data frame to a matrix first
## speed  dist
## 15.40 42.98
map(cars, mean) # Data frames are lists of columns
## $speed
## [1] 15.4
##
## $dist
## [1] 42.98
map_dbl(cars, mean) # map output as a double vector
## speed  dist
## 15.40 42.98

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"
• Note: this produces a copy of the Cars93 data in which every factor variable has been replaced with a character vector of lowercase values

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names are appended to the original variable names, creating new columns instead of replacing the existing ones

Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)
## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van
head(Cars93.lower$Type_lower)
## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from miles per gallon (MPG) to kilometres per litre (KMPL), this time using mutate_at

Cars93.metric <- Cars93 %>%
  mutate_at(c("MPG.city", "MPG.highway"),
            list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
## [1] "Luggage.room"     "Weight"           "Origin"
## [4] "Make"             "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for

function(.x){0.425 * .x}
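To check that the two forms really behave the same, here is a quick sketch using map_dbl() on a few made-up fuel economy values (the mpg vector is hypothetical, not taken from Cars93).

```r
library(purrr)

mpg <- c(20, 30, 40)  # made-up fuel economy values in miles per gallon

# Formula short-hand versus a full anonymous function
via.formula  <- map_dbl(mpg, ~ 0.425 * .x)
via.function <- map_dbl(mpg, function(.x) { 0.425 * .x })

identical(via.formula, via.function)  # TRUE
```

The ~ short-hand is understood by tidyverse functions such as map() and mutate_at(); base R functions like apply() expect an ordinary function.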

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)
##   Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
##        RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
## 1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
##      Width Turn.circle Rear.seat.room Luggage.room   Weight
## 1 69.37634    38.95699             NA           NA 3072.903
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
##   Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
## 1       17.12581   19.50968       21.89892      22.36559         29.08602
##   EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
## 1        2.667742         143.828 5280.645          2332.204
##   Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean
## 1                16.66452        5.086022    183.2043       103.9462
##   Width_mean Turn.circle_mean Rear.seat.room_mean Luggage.room_mean
## 1   69.37634         38.95699            27.82967          13.89024
##   Weight_mean
## 1    3072.903
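The NA entries for Rear.seat.room and Luggage.room in the first output appear because mean() returns NA whenever its input contains missing values. The extra na.rm=TRUE argument in the second call is passed along to mean(). A minimal sketch on a made-up vector:

```r
x <- c(26.5, NA, 30)  # made-up values with one missing entry

mean.with.na    <- mean(x)                # NA: the missing value propagates
mean.without.na <- mean(x, na.rm = TRUE)  # 28.25: NA is dropped before averaging
```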

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
## # A tibble: 6 x 3
##   Type    MPG.city MPG.highway
##   <fct>      <dbl>       <dbl>
## 1 Compact     22.7        29.9
## 2 Large       18.4        26.7
## 3 Midsize     19.5        26.7
## 4 Small       29.9        35.5
## 5 Sporty      21.8        28.8
## 6 Van         17          21.9

Another approach

We’ll learn about a bunch of select helper functions like contains() and starts_with().

Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 3
##   Type    MPG.city_mean MPG.highway_mean
##   <fct>           <dbl>            <dbl>
## 1 Compact          22.7             29.9
## 2 Large            18.4             26.7
## 3 Midsize          19.5             26.7
## 4 Small            29.9             35.5
## 5 Sporty           21.8             28.8
## 6 Van              17               21.9
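The other helper mentioned above, starts_with(), works the same way. As a sketch, here it is on R's built-in iris data (used only so the example is self-contained; petal.means is a name introduced for illustration).

```r
library(dplyr)

# Average every column whose name starts with "Petal", separately by species
petal.means <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(starts_with("Petal")), list(mean = mean))

petal.means
```

The result has one row per species and the columns Petal.Length_mean and Petal.Width_mean.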

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
## # A tibble: 6 x 4
## # Groups:   Origin [2]
##   Origin  AirBags            MPG.city_mean MPG.highway_mean
##   <fct>   <fct>                      <dbl>            <dbl>
## 1 USA     Driver & Passenger          19               27.2
## 2 USA     Driver only                 20.2             27.5
## 3 USA     None                        23.1             29.6
## 4 non-USA Driver & Passenger          20.3             27
## 5 non-USA Driver only                 23.2             29.4
## 6 non-USA None                        25.9             32

Assignments

• Homework 2 will be posted today
• Due: Wednesday, November 11, 1:30pm ET
• Submit your .Rmd and .html files on Canvas
• Lab 5 is available on Canvas and the course website
• You have until Friday evening to complete it
• Friday’s lab session will go over this week’s material and help you complete the labs