Lecture 5: Loops and some of their alternatives

Fall 2020

Agenda

For/while loops to iterate over data
apply
map, map_<type>, map_at, map_if
mutate_at, mutate_if
summarize_at, summarize_if

Package and data loading

# Our favourite library
library(tidyverse)

# For Cars93 data again
Cars93 <- MASS::Cars93 

# For the clean survey data:
survey <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv", 
                   header=TRUE, stringsAsFactors = FALSE)

More programming basics: loops

We’ll now learn about loops and some more efficient/syntactically simple loop alternatives
loops are ways of iterating over data

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4

phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}

## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

The basic syntax takes the form

for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}

The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}

## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows

##            [,1]      [,2]       [,3]       [,4]       [,5]
## [1,] -0.7567250 0.2771538 -0.5701664  0.5888506 -0.8897146
## [2,]  0.5451021 0.6679214 -0.5304780 -0.6796072 -0.5847928

col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums

## [1]  8.118425  6.569486 20.780103 -2.837246  5.183321

colSums(fake.data) # A better approach (see also colMeans())

## [1]  8.118425  6.569486 20.780103 -2.837246  5.183321

while loops

while loops repeat a chunk of code while the specified condition remains true

day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}

We won’t really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career

Loop alternatives

Command	Description
`apply(X, MARGIN, FUN)`	Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`map(.x, .f, ...)`	Obtain a list by applying `.f` to every element of a list or atomic vector `.x`
`map_<type>(.x, .f, ...)`	For `<type>` given by `lgl` (logical), `int` (integer), `dbl` (double) or `chr` (character), return a vector of this type obtained by applying `.f` to each element of `.x`
`map_at(.x, .at, .f)`	Obtain a list by applying `.f` to the elements of `.x` specified by name or index given in `.at`
`map_if(.x, .p, .f)`	Obtain a list `.f` to the elements of `.x` specified by `.p` (a predicate function, or a logical vector)
`mutate_all/_at/_if`	Mutate all variables, specified (at) variables, or those selected by a predicate (if)
`summarize_all/_at/_if`	Summarize all variables, specified variables, or those selected by a predicate (if)

These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.

Example: apply()

colMeans(fake.data)

## [1]  0.08118425  0.06569486  0.20780103 -0.02837246  0.05183321

apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns

## [1]  0.08118425  0.06569486  0.20780103 -0.02837246  0.05183321

# Function that calculates proportion of vector indexes that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)

## [1] 0.50 0.53 0.50 0.52 0.55

Example: map, map_()

map(survey, is.numeric) # Returns a list

## $Program
## [1] FALSE
## 
## $PriorExp
## [1] FALSE
## 
## $Rexperience
## [1] FALSE
## 
## $OperatingSystem
## [1] FALSE
## 
## $TVhours
## [1] TRUE
## 
## $Editor
## [1] FALSE

map_lgl(survey, is.numeric) # Returns a logical vector with named elements

##         Program        PriorExp     Rexperience OperatingSystem 
##           FALSE           FALSE           FALSE           FALSE 
##         TVhours          Editor 
##            TRUE           FALSE

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # Data frames are arrays

## speed  dist 
## 15.40 42.98

map(cars, mean) # Data frames are also lists

## $speed
## [1] 15.4
## 
## $dist
## [1] 42.98

map_dbl(cars, mean) # map output as a double vector

## speed  dist 
## 15.40 42.98

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)

## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van

Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)

## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Note: this has the effect of producing a copy of the Cars93 data where all of the factor variables have been replaced with versions containing lowercase values

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

Cars93.lower <- mutate_if(Cars93, is.factor, list(lower = tolower))
head(Cars93.lower$Type)

## [1] Small   Midsize Compact Midsize Midsize Midsize
## Levels: Compact Large Midsize Small Sporty Van

head(Cars93.lower$Type_lower)

## [1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from MPG to KPML but this time using mutate_at

Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), 
            list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))

## [1] "Luggage.room"     "Weight"           "Origin"          
## [4] "Make"             "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for

function(.x){0.425 * .x}

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)

##   Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
##        RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
## 1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
##      Width Turn.circle Rear.seat.room Luggage.room   Weight
## 1 69.37634    38.95699             NA           NA 3072.903

Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)

##   Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
## 1       17.12581   19.50968       21.89892      22.36559         29.08602
##   EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
## 1        2.667742         143.828 5280.645          2332.204
##   Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean
## 1                16.66452        5.086022    183.2043       103.9462
##   Width_mean Turn.circle_mean Rear.seat.room_mean Luggage.room_mean
## 1   69.37634         38.95699            27.82967          13.89024
##   Weight_mean
## 1    3072.903

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)

## # A tibble: 6 x 3
##   Type    MPG.city MPG.highway
##   <fct>      <dbl>       <dbl>
## 1 Compact     22.7        29.9
## 2 Large       18.4        26.7
## 3 Midsize     19.5        26.7
## 4 Small       29.9        35.5
## 5 Sporty      21.8        28.8
## 6 Van         17          21.9

Another approach

We’ll learn about a bunch of select helper functions like contains() and starts_with().

Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))

## # A tibble: 6 x 3
##   Type    MPG.city_mean MPG.highway_mean
##   <fct>           <dbl>            <dbl>
## 1 Compact          22.7             29.9
## 2 Large            18.4             26.7
## 3 Midsize          19.5             26.7
## 4 Small            29.9             35.5
## 5 Sporty           21.8             28.8
## 6 Van              17               21.9

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))

## # A tibble: 6 x 4
## # Groups:   Origin [2]
##   Origin  AirBags            MPG.city_mean MPG.highway_mean
##   <fct>   <fct>                      <dbl>            <dbl>
## 1 USA     Driver & Passenger          19               27.2
## 2 USA     Driver only                 20.2             27.5
## 3 USA     None                        23.1             29.6
## 4 non-USA Driver & Passenger          20.3             27  
## 5 non-USA Driver only                 23.2             29.4
## 6 non-USA None                        25.9             32

Assignments

Homework 2 will be posted today
- Due: Wednesday, November 11, 1:30pm ET
- Submit your .Rmd and .html files on Canvas
Lab 5 is available on Canvas and the course website
- You have until Friday evening to complete it
- Friday’s lab session will go over this week’s material and help you complete the labs