## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
We’re going to start by operating on the `birthwt` dataset from the `MASS` library
Let’s get it loaded and see what we’re working with. Remember, attaching the `MASS` library masks certain tidyverse functions (for example, `MASS::select()` masks `dplyr::select()`). We don’t want that, so when we need something from `MASS` we’ll extract that dataset or function directly with the `MASS::` prefix.
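As a minimal sketch of what "extracting directly" looks like (assuming the `tibble` package is installed), the `::` operator pulls a single object out of `MASS` without attaching the package:

```r
library(tibble)

# Pull the data set straight out of MASS with `::` so that MASS itself
# is never attached and none of its functions mask tidyverse ones.
birthwt <- as_tibble(MASS::birthwt)

# Tibbles print only the first 10 rows plus the column types.
birthwt

# Check whether MASS ended up on the search path.
"package:MASS" %in% search()
```

The same `pkg::name` pattern works for functions, so a one-off call into `MASS` never requires `library(MASS)`.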
`tibble`s are nicer data frames
The `dplyr` functions we’ve been using are very nice because they map tibbles to other tibbles.
## # A tibble: 189 x 10
## low age lwt race smoke ptl ht ui ftv bwt
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 19 182 2 0 0 0 1 0 2523
## 2 0 33 155 3 0 0 0 0 3 2551
## 3 0 20 105 1 1 0 0 0 1 2557
## 4 0 21 108 1 1 0 0 1 2 2594
## 5 0 18 107 1 1 0 0 1 0 2600
## 6 0 21 124 3 0 0 0 0 0 2622
## 7 0 22 118 1 0 0 0 0 1 2637
## 8 0 17 103 3 0 0 0 0 1 2637
## 9 0 29 123 1 1 0 0 0 1 2663
## 10 0 26 113 1 1 0 0 0 0 2665
## # … with 179 more rows
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
## 92 0 22 118 1 0 0 0 0 1 2637
## 93 0 17 103 3 0 0 0 0 1 2637
## 94 0 29 123 1 1 0 0 0 1 2663
## 95 0 26 113 1 1 0 0 0 0 2665
## 96 0 19 95 3 0 0 0 0 0 2722
## 97 0 19 150 3 0 0 0 0 1 2733
## 98 0 22 95 3 0 0 1 0 0 2751
## 99 0 30 107 3 0 1 0 1 2 2750
## 100 0 18 100 1 1 0 0 0 0 2769
## 101 0 18 100 1 1 0 0 0 0 2769
## 102 0 15 98 2 0 0 0 0 0 2778
## 103 0 25 118 1 1 0 0 0 3 2782
## 104 0 20 120 3 0 0 0 1 0 2807
## 105 0 28 120 1 1 0 0 0 1 2821
## 106 0 32 121 3 0 0 0 0 2 2835
## 107 0 31 100 1 0 0 0 1 3 2835
## 108 0 36 202 1 0 0 0 0 1 2836
## 109 0 28 120 3 0 0 0 0 0 2863
## 111 0 25 120 3 0 0 0 1 2 2877
## 112 0 28 167 1 0 0 0 0 0 2877
## 113 0 17 122 1 1 0 0 0 0 2906
## 114 0 29 150 1 0 0 0 0 2 2920
## 115 0 26 168 2 1 0 0 0 0 2920
## 116 0 17 113 2 0 0 0 0 1 2920
## 117 0 17 113 2 0 0 0 0 1 2920
## 118 0 24 90 1 1 1 0 0 1 2948
## 119 0 35 121 2 1 1 0 0 1 2948
## 120 0 25 155 1 0 0 0 0 1 2977
## 121 0 25 125 2 0 0 0 0 0 2977
## 123 0 29 140 1 1 0 0 0 2 2977
## 124 0 19 138 1 1 0 0 0 2 2977
## 125 0 27 124 1 1 0 0 0 0 2922
## 126 0 31 215 1 1 0 0 0 2 3005
## 127 0 33 109 1 1 0 0 0 1 3033
## 128 0 21 185 2 1 0 0 0 2 3042
## 129 0 19 189 1 0 0 0 0 2 3062
## 130 0 23 130 2 0 0 0 0 1 3062
## 131 0 21 160 1 0 0 0 0 0 3062
## 132 0 18 90 1 1 0 0 1 0 3062
## 133 0 18 90 1 1 0 0 1 0 3062
## 134 0 32 132 1 0 0 0 0 4 3080
## 135 0 19 132 3 0 0 0 0 0 3090
## 136 0 24 115 1 0 0 0 0 2 3090
## 137 0 22 85 3 1 0 0 0 0 3090
## 138 0 22 120 1 0 0 1 0 1 3100
## 139 0 23 128 3 0 0 0 0 0 3104
## 140 0 22 130 1 1 0 0 0 0 3132
## 141 0 30 95 1 1 0 0 0 2 3147
## 142 0 19 115 3 0 0 0 0 0 3175
## 143 0 16 110 3 0 0 0 0 0 3175
## 144 0 21 110 3 1 0 0 1 0 3203
## 145 0 30 153 3 0 0 0 0 0 3203
## 146 0 20 103 3 0 0 0 0 0 3203
## 147 0 17 119 3 0 0 0 0 0 3225
## 148 0 17 119 3 0 0 0 0 0 3225
## 149 0 23 119 3 0 0 0 0 2 3232
## 150 0 24 110 3 0 0 0 0 0 3232
## 151 0 28 140 1 0 0 0 0 0 3234
## 154 0 26 133 3 1 2 0 0 0 3260
## 155 0 20 169 3 0 1 0 1 1 3274
## 156 0 24 115 3 0 0 0 0 2 3274
## 159 0 28 250 3 1 0 0 0 6 3303
## 160 0 20 141 1 0 2 0 1 1 3317
## 161 0 22 158 2 0 1 0 0 2 3317
## 162 0 22 112 1 1 2 0 0 0 3317
## 163 0 31 150 3 1 0 0 0 2 3321
## 164 0 23 115 3 1 0 0 0 1 3331
## 166 0 16 112 2 0 0 0 0 0 3374
## 167 0 16 135 1 1 0 0 0 0 3374
## 168 0 18 229 2 0 0 0 0 0 3402
## 169 0 25 140 1 0 0 0 0 1 3416
## 170 0 32 134 1 1 1 0 0 4 3430
## 172 0 20 121 2 1 0 0 0 0 3444
## 173 0 23 190 1 0 0 0 0 0 3459
## 174 0 22 131 1 0 0 0 0 1 3460
## 175 0 32 170 1 0 0 0 0 0 3473
## 176 0 30 110 3 0 0 0 0 0 3544
## 177 0 20 127 3 0 0 0 0 0 3487
## 179 0 23 123 3 0 0 0 0 0 3544
## 180 0 17 120 3 1 0 0 0 0 3572
## 181 0 19 105 3 0 0 0 0 0 3572
## 182 0 23 130 1 0 0 0 0 0 3586
## 183 0 36 175 1 0 0 0 0 0 3600
## 184 0 22 125 1 0 0 0 0 1 3614
## 185 0 24 133 1 0 0 0 0 0 3614
## 186 0 21 134 3 0 0 0 0 2 3629
## 187 0 19 235 1 1 0 1 0 0 3629
## 188 0 25 95 1 1 3 0 1 0 3637
## 189 0 16 135 1 1 0 0 0 0 3643
## 190 0 29 135 1 0 0 0 0 1 3651
## 191 0 29 154 1 0 0 0 0 1 3651
## 192 0 19 147 1 1 0 0 0 0 3651
## 193 0 19 147 1 1 0 0 0 0 3651
## 195 0 30 137 1 0 0 0 0 1 3699
## 196 0 24 110 1 0 0 0 0 1 3728
## 197 0 19 184 1 1 0 1 0 0 3756
## 199 0 24 110 3 0 1 0 0 0 3770
## 200 0 23 110 1 0 0 0 0 1 3770
## 201 0 20 120 3 0 0 0 0 0 3770
## 202 0 25 241 2 0 0 1 0 0 3790
## 203 0 30 112 1 0 0 0 0 1 3799
## 204 0 22 169 1 0 0 0 0 0 3827
## 205 0 18 120 1 1 0 0 0 2 3856
## 206 0 16 170 2 0 0 0 0 4 3860
## 207 0 32 186 1 0 0 0 0 2 3860
## 208 0 18 120 3 0 0 0 0 1 3884
## 209 0 29 130 1 1 0 0 0 2 3884
## 210 0 33 117 1 0 0 0 1 1 3912
## 211 0 20 170 1 1 0 0 0 0 3940
## 212 0 28 134 3 0 0 0 0 1 3941
## 213 0 14 135 1 0 0 0 0 0 3941
## 214 0 28 130 3 0 0 0 0 0 3969
## 215 0 25 120 1 0 0 0 0 2 3983
## 216 0 16 95 3 0 0 0 0 1 3997
## 217 0 20 158 1 0 0 0 0 1 3997
## 218 0 26 160 3 0 0 0 0 0 4054
## 219 0 21 115 1 0 0 0 0 1 4054
## 220 0 22 129 1 0 0 0 0 0 4111
## 221 0 25 130 1 0 0 0 0 2 4153
## 222 0 31 120 1 0 0 0 0 2 4167
## 223 0 35 170 1 0 1 0 0 1 4174
## 224 0 19 120 1 1 0 0 0 0 4238
## 225 0 24 116 1 0 0 0 0 1 4593
## 226 0 45 123 1 0 0 0 0 1 4990
## 4 1 28 120 3 1 1 0 1 0 709
## 10 1 29 130 1 0 0 0 1 2 1021
## 11 1 34 187 2 1 0 1 0 0 1135
## 13 1 25 105 3 0 1 1 0 0 1330
## 15 1 25 85 3 0 0 0 1 0 1474
## 16 1 27 150 3 0 0 0 0 0 1588
## 17 1 23 97 3 0 0 0 1 1 1588
## 18 1 24 128 2 0 1 0 0 1 1701
## 19 1 24 132 3 0 0 1 0 0 1729
## 20 1 21 165 1 1 0 1 0 1 1790
## 22 1 32 105 1 1 0 0 0 0 1818
## 23 1 19 91 1 1 2 0 1 0 1885
## 24 1 25 115 3 0 0 0 0 0 1893
## 25 1 16 130 3 0 0 0 0 1 1899
## 26 1 25 92 1 1 0 0 0 0 1928
## 27 1 20 150 1 1 0 0 0 2 1928
## 28 1 21 200 2 0 0 0 1 2 1928
## 29 1 24 155 1 1 1 0 0 0 1936
## 30 1 21 103 3 0 0 0 0 0 1970
## 31 1 20 125 3 0 0 0 1 0 2055
## 32 1 25 89 3 0 2 0 0 1 2055
## 33 1 19 102 1 0 0 0 0 2 2082
## 34 1 19 112 1 1 0 0 1 0 2084
## 35 1 26 117 1 1 1 0 0 0 2084
## 36 1 24 138 1 0 0 0 0 0 2100
## 37 1 17 130 3 1 1 0 1 0 2125
## 40 1 20 120 2 1 0 0 0 3 2126
## 42 1 22 130 1 1 1 0 1 1 2187
## 43 1 27 130 2 0 0 0 1 0 2187
## 44 1 20 80 3 1 0 0 1 0 2211
## 45 1 17 110 1 1 0 0 0 0 2225
## 46 1 25 105 3 0 1 0 0 1 2240
## 47 1 20 109 3 0 0 0 0 0 2240
## 49 1 18 148 3 0 0 0 0 0 2282
## 50 1 18 110 2 1 1 0 0 0 2296
## 51 1 20 121 1 1 1 0 1 0 2296
## 52 1 21 100 3 0 1 0 0 4 2301
## 54 1 26 96 3 0 0 0 0 0 2325
## 56 1 31 102 1 1 1 0 0 1 2353
## 57 1 15 110 1 0 0 0 0 0 2353
## 59 1 23 187 2 1 0 0 0 1 2367
## 60 1 20 122 2 1 0 0 0 0 2381
## 61 1 24 105 2 1 0 0 0 0 2381
## 62 1 15 115 3 0 0 0 1 0 2381
## 63 1 23 120 3 0 0 0 0 0 2410
## 65 1 30 142 1 1 1 0 0 0 2410
## 67 1 22 130 1 1 0 0 0 1 2410
## 68 1 17 120 1 1 0 0 0 3 2414
## 69 1 23 110 1 1 1 0 0 0 2424
## 71 1 17 120 2 0 0 0 0 2 2438
## 75 1 26 154 3 0 1 1 0 1 2442
## 76 1 20 105 3 0 0 0 0 3 2450
## 77 1 26 190 1 1 0 0 0 0 2466
## 78 1 14 101 3 1 1 0 0 0 2466
## 79 1 28 95 1 1 0 0 0 2 2466
## 81 1 14 100 3 0 0 0 0 2 2495
## 82 1 23 94 3 1 0 0 0 0 2495
## 83 1 17 142 2 0 0 1 0 0 2495
## 84 1 21 130 1 1 0 1 0 3 2495
## [1] 19 33 20 21 18 21 22 17 29 26 19 19 22 30 18 18 15 25 20 28 32 31 36
## [24] 28 25 28 17 29 26 17 17 24 35 25 25 29 19 27 31 33 21 19 23 21 18 18
## [47] 32 19 24 22 22 23 22 30 19 16 21 30 20 17 17 23 24 28 26 20 24 28 20
## [70] 22 22 31 23 16 16 18 25 32 20 23 22 32 30 20 23 17 19 23 36 22 24 21
## [93] 19 25 16 29 29 19 19 30 24 19 24 23 20 25 30 22 18 16 32 18 29 33 20
## [116] 28 14 28 25 16 20 26 21 22 25 31 35 19 24 45 28 29 34 25 25 27 23 24
## [139] 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19 26 24 17 20 22 27 20 17
## [162] 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30 22 17 23 17 26 20 26 14
## [185] 28 14 23 17 21
## [1] 19 33 20 21 18 21 22 17 29 26 19 19 22 30 18 18 15 25 20 28 32 31 36
## [24] 28 25 28 17 29 26 17 17 24 35 25 25 29 19 27 31 33 21 19 23 21 18 18
## [47] 32 19 24 22 22 23 22 30 19 16 21 30 20 17 17 23 24 28 26 20 24 28 20
## [70] 22 22 31 23 16 16 18 25 32 20 23 22 32 30 20 23 17 19 23 36 22 24 21
## [93] 19 25 16 29 29 19 19 30 24 19 24 23 20 25 30 22 18 16 32 18 29 33 20
## [116] 28 14 28 25 16 20 26 21 22 25 31 35 19 24 45 28 29 34 25 25 27 23 24
## [139] 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19 26 24 17 20 22 27 20 17
## [162] 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30 22 17 23 17 26 20 26 14
## [185] 28 14 23 17 21
## [1] 19
## # A tibble: 1 x 1
## age
## <int>
## 1 19
## [1] 19
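The printouts above are consistent with comparing how a plain data frame and a tibble behave when printed and subset. The sketch below is one way to reproduce that kind of comparison; the object names `df` and `tbl` are illustrative and not taken from the original code.

```r
library(tibble)

df  <- MASS::birthwt             # plain data frame: prints every row
tbl <- as_tibble(MASS::birthwt)  # tibble: prints the first 10 rows

# Extracting a whole column gives the same integer vector either way.
identical(df$age, tbl[["age"]])

# Single-cell subsetting with `[` behaves differently:
df[1, "age"]    # data frame: simplifies to a bare value
tbl[1, "age"]   # tibble: stays a 1 x 1 tibble

# To get the bare value out of a tibble, extract the column first.
tbl$age[1]
```

The design point is that `[` on a tibble always returns a tibble, which makes the result type predictable inside pipelines.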
Note: If you want to import data directly into `tibble` format, you may use `read_delim()` and `read_csv()` instead of their base-R alternatives. Even though we started with the base alternatives, I recommend using these improved import commands going forward.
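A small sketch of the difference, assuming the `readr` package is installed. The CSV here is a toy file written to a temporary path purely for illustration:

```r
library(readr)

# Write a tiny CSV to a temporary file (toy values, for illustration only).
path <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,2523", "2,2551"), path)

df_base  <- read.csv(path)  # base R: returns a plain data.frame
df_readr <- read_csv(path)  # readr: returns a tibble

class(df_base)
class(df_readr)
```

`read_csv()` also reports the column types it guessed, which is a handy check that the file was parsed the way you expected.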
The dataset doesn’t come with very descriptive variable names
Let’s get better column names (use `?MASS::birthwt`, since `MASS` is not attached, to understand the variables and come up with better names)
## [1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui"
## [9] "ftv" "bwt"
# The default names are not very descriptive
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
                       "mother.weight", "race", "mother.smokes",
                       "previous.prem.labor", "hypertension",
                       "uterine.irr", "physician.visits", "birthwt.grams")
# Better names!
birthwt
## # A tibble: 189 x 10
## birthwt.below.2… mother.age mother.weight race mother.smokes
## <int> <int> <int> <int> <int>
## 1 0 19 182 2 0
## 2 0 33 155 3 0
## 3 0 20 105 1 1
## 4 0 21 108 1 1
## 5 0 18 107 1 1
## 6 0 21 124 3 0
## 7 0 22 118 1 0
## 8 0 17 103 3 0
## 9 0 29 123 1 1
## 10 0 26 113 1 1
## # … with 179 more rows, and 5 more variables: previous.prem.labor <int>,
## # hypertension <int>, uterine.irr <int>, physician.visits <int>,
## # birthwt.grams <int>
The `rename()` command
`rename()` operates by allowing you to specify a new variable name for whichever old variable name you want to change.
# Reload the data again
birthwt <- as_tibble(MASS::birthwt)
birthwt <- birthwt %>%
  rename(birthwt.below.2500 = low,
         mother.age = age,
         mother.weight = lwt,
         mother.smokes = smoke,
         previous.prem.labor = ptl,
         hypertension = ht,
         uterine.irr = ui,
         physician.visits = ftv,
         birthwt.grams = bwt)
colnames(birthwt)
## [1] "birthwt.below.2500" "mother.age" "mother.weight"
## [4] "race" "mother.smokes" "previous.prem.labor"
## [7] "hypertension" "uterine.irr" "physician.visits"
## [10] "birthwt.grams"
## # A tibble: 189 x 10
## birthwt.below.2… mother.age mother.weight race mother.smokes
## <int> <int> <int> <int> <int>
## 1 0 19 182 2 0
## 2 0 33 155 3 0
## 3 0 20 105 1 1
## 4 0 21 108 1 1
## 5 0 18 107 1 1
## 6 0 21 124 3 0
## 7 0 22 118 1 0
## 8 0 17 103 3 0
## 9 0 29 123 1 1
## 10 0 26 113 1 1
## # … with 179 more rows, and 5 more variables: previous.prem.labor <int>,
## # hypertension <int>, uterine.irr <int>, physician.visits <int>,
## # birthwt.grams <int>
Note that in this command we didn’t rename the race variable because it already had a good name.
All the factors are currently represented as integers
Let’s use the `mutate()`, `mutate_at()` and `recode_factor()` functions to convert variables to factors and give the factors more meaningful levels
## # A tibble: 189 x 10
## birthwt.below.2… mother.age mother.weight race mother.smokes
## <int> <int> <int> <int> <int>
## 1 0 19 182 2 0
## 2 0 33 155 3 0
## 3 0 20 105 1 1
## 4 0 21 108 1 1
## 5 0 18 107 1 1
## 6 0 21 124 3 0
## 7 0 22 118 1 0
## 8 0 17 103 3 0
## 9 0 29 123 1 1
## 10 0 26 113 1 1
## # … with 179 more rows, and 5 more variables: previous.prem.labor <int>,
## # hypertension <int>, uterine.irr <int>, physician.visits <int>,
## # birthwt.grams <int>
birthwt <- birthwt %>%
  mutate(race = recode_factor(race, `1` = "white", `2` = "black", `3` = "other")) %>%
  mutate_at(c("mother.smokes", "hypertension", "uterine.irr", "birthwt.below.2500"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))
birthwt
## # A tibble: 189 x 10
## birthwt.below.2… mother.age mother.weight race mother.smokes
## <fct> <int> <int> <fct> <fct>
## 1 no 19 182 black no
## 2 no 33 155 other no
## 3 no 20 105 white yes
## 4 no 21 108 white yes
## 5 no 18 107 white yes
## 6 no 21 124 other no
## 7 no 22 118 white no
## 8 no 17 103 other no
## 9 no 29 123 white yes
## 10 no 26 113 white yes
## # … with 179 more rows, and 5 more variables: previous.prem.labor <int>,
## # hypertension <fct>, uterine.irr <fct>, physician.visits <int>,
## # birthwt.grams <int>
Recall that the syntax `~ recode_factor(.x, ...)` defines an anonymous function that will be applied to every column specified in the first part of the `mutate_at()` call. In this case, all of the specified variables are binary 0/1 coded, and are being recoded to no/yes.
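To make the anonymous-function shorthand concrete, here is a small sketch on made-up data. The tibble `toy` and the helper `recode_yes_no()` are hypothetical names introduced only for this illustration. The last line uses `across()`, which assumes dplyr 1.0 or later (newer than the 0.8.3 shown in the attach message at the top).

```r
library(dplyr)

# Toy data with two 0/1-coded columns (made up for illustration).
toy <- tibble(smokes = c(0L, 1L, 1L), hypertension = c(1L, 0L, 0L))

# A named function that does the same job as the `~ ...` shorthand.
recode_yes_no <- function(x) recode_factor(x, `0` = "no", `1` = "yes")

# Formula shorthand: `.x` stands for whichever column is being processed.
toy_a <- toy %>%
  mutate_at(c("smokes", "hypertension"),
            ~ recode_factor(.x, `0` = "no", `1` = "yes"))

# Passing the named function instead gives the same result.
toy_b <- toy %>%
  mutate_at(c("smokes", "hypertension"), recode_yes_no)

# Assumption: dplyr >= 1.0, where across() supersedes mutate_at().
toy_c <- toy %>%
  mutate(across(c(smokes, hypertension), recode_yes_no))

identical(toy_a, toy_b)
```

The shorthand saves naming a helper you will only use once; a named function is the better choice when the same recoding is needed in several places.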
## birthwt.below.2500 mother.age mother.weight race
## no :130 Min. :14.00 Min. : 80.0 white:96
## yes: 59 1st Qu.:19.00 1st Qu.:110.0 black:26
## Median :23.00 Median :121.0 other:67
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## mother.smokes previous.prem.labor hypertension uterine.irr
## no :115 Min. :0.0000 no :177 no :161
## yes: 74 1st Qu.:0.0000 yes: 12 yes: 28
## Median :0.0000
## Mean :0.1958
## 3rd Qu.:0.0000
## Max. :3.0000
## physician.visits birthwt.grams
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
The `summarize()` and `group_by()` functions let us see what the average birthweight looks like when broken down by race and smoking status. To make the printout nicer we’ll round to the nearest gram.
tbl.mean.bwt <- birthwt %>%
  group_by(race, mother.smokes) %>%
  summarize(mean.birthwt = round(mean(birthwt.grams), 0))
tbl.mean.bwt
## # A tibble: 6 x 3
## # Groups: race [3]
## race mother.smokes mean.birthwt
## <fct> <fct> <dbl>
## 1 white no 3429
## 2 white yes 2827
## 3 black no 2854
## 4 black yes 2504
## 5 other no 2816
## 6 other yes 2757
We can reshape the table with the `pivot_wider()` function from `tidyr`
- In earlier versions of `tidyr` this would have been achieved through the `spread()` function
- `spread()` still works, but the new preferred call is to `pivot_wider()`
- To use `pivot_wider()` for this type of reshaping, we’ll want to specify the `data`, `names_from` and `values_from`:
## # A tibble: 3 x 3
## # Groups: race [3]
## race no yes
## <fct> <dbl> <dbl>
## 1 white 3429 2827
## 2 black 2854 2504
## 3 other 2816 2757
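The reshaping can be sketched in isolation as follows. To keep the example self-contained, the long table is typed in by hand from the group means printed earlier; the names `tbl.long`, `tbl.wide` and `tbl.spread` are illustrative.

```r
library(tibble)
library(tidyr)

# Long format: one row per (race, smoking status) pair.
# Values copied from the summary table printed above.
tbl.long <- tribble(
  ~race,   ~mother.smokes, ~mean.birthwt,
  "white", "no",           3429,
  "white", "yes",          2827,
  "black", "no",           2854,
  "black", "yes",          2504,
  "other", "no",           2816,
  "other", "yes",          2757
)

# Wide format: one row per race, one column per smoking status.
tbl.wide <- pivot_wider(tbl.long,
                        names_from = mother.smokes,
                        values_from = mean.birthwt)
tbl.wide

# The older spread() call expresses the same reshaping.
tbl.spread <- spread(tbl.long, key = mother.smokes, value = mean.birthwt)
```

`names_from` picks the column whose values become the new column names, and `values_from` picks the column that fills the cells.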
Let’s use the chunk header `{r, results='asis'}`, along with the `kable()` function from the `knitr` library
We can store the table and call `kable` on it directly with a `kable(x, format)` command. Or, we can take our table code from before, and pipe it into a `kable` command.
# Print nicely
tbl.mean.bwt %>%
  pivot_wider(names_from = mother.smokes, values_from = mean.birthwt) %>%
  kable(format = "markdown")
| race  | no   | yes  |
|-------|------|------|
| white | 3429 | 2827 |
| black | 2854 | 2504 |
| other | 2816 | 2757 |
`kable()` outputs the table in a way that Markdown can read and nicely display.
Note: changing the CSS changes the table appearance.
## [1] 0.09031781
## # A tibble: 2 x 2
## mother.smokes cor_bwt_age
## <fct> <dbl>
## 1 no 0.201
## 2 yes -0.144
## # A tibble: 3 x 2
## race cor_bwt_age
## <fct> <dbl>
## 1 white 0.166
## 2 black -0.329
## 3 other -0.0293
There does look to be variation, but we don’t know if it’s statistically significant without further investigation.
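The code that produced the correlation printouts is not shown above. One plausible way to get tables of that shape is sketched below; it is an assumption, not the original code. For self-containment it works on the raw `MASS::birthwt` columns (`bwt`, `age`, `smoke`), so the grouping variable is still 0/1 coded here.

```r
library(dplyr)

bw <- as_tibble(MASS::birthwt)

# Overall correlation between birth weight and mother's age.
cor.overall <- with(bw, cor(bwt, age))
cor.overall

# Correlation computed separately within each smoking group.
cor.by.smoke <- bw %>%
  group_by(smoke) %>%
  summarize(cor_bwt_age = cor(bwt, age))
cor.by.smoke
```

Swapping `smoke` for `race` in `group_by()` gives the per-race version. To judge whether the group differences are statistically meaningful you would need a formal test or a regression with an interaction term, which is beyond this sketch.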
We now know a lot about how to tabulate data.
It’s often easier to look at plots instead of tables.
We’ll now talk about some of the standard plotting options.
Let’s continue with the `birthwt` data from the `MASS` library.
Here are some basic single-variable plots.
par(mfrow = c(2,2)) # Display plots in a single 2 x 2 figure
plot(birthwt$mother.age)
with(birthwt, hist(mother.age))
plot(birthwt$mother.smokes)
plot(birthwt$birthwt.grams)
Note that the result of calling `plot(x, ...)` varies depending on what `x` is.
- When `x` is numeric, you get a plot showing the value of `x` at every index.
- When `x` is a factor, you get a bar plot of counts for every level.
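The two cases in the list above can be seen side by side with a pair of tiny vectors. The names `ages` and `smokes` are illustrative; the values are the first five rows of the data shown earlier.

```r
ages   <- c(19, 33, 20, 21, 18)                       # numeric vector
smokes <- factor(c("no", "no", "yes", "yes", "yes"))  # factor

par(mfrow = c(1, 2))  # two plots side by side
plot(ages)            # numeric: value of each element against its index
plot(smokes)          # factor: bar plot of the counts in each level
par(mfrow = c(1, 1))  # restore the single-plot layout

# The bar heights in the second plot are these counts.
table(smokes)
```

This works because `plot()` is a generic function: R picks the method that matches the class of `x`.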
Let’s add more information to the smoking bar plot, and also change the color by setting the `col` option.
par(mfrow = c(1,1))
plot(birthwt$mother.smokes,
     main = "Mothers Who Smoked In Pregnancy",
     xlab = "Smoking during pregnancy",
     ylab = "Count of Mothers",
     col = "lightgrey")
ggplot2 has a slightly steeper learning curve than the base graphics functions, but it also generally produces far better and more easily customizable graphics.
There are two basic calls in ggplot:
- `qplot(x, y, ..., data)`: a “quick-plot” routine, which essentially replaces the base `plot()`
- `ggplot(data, aes(x, y, ...), ...)`: defines a graphics object from which plots can be generated, along with aesthetic mappings that specify how variables are mapped to visual properties.

Here’s how the default scatterplots look in ggplot compared to the base graphics. We’ll illustrate things by continuing to use the birthwt data from the `MASS` library.
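A minimal sketch of that comparison, using the raw `MASS::birthwt` column names (`age`, `bwt`) so that it runs on its own. Only the `ggplot()` form is shown for ggplot2, because `qplot()` has been deprecated since ggplot2 3.4.0 (newer than the 3.2.1 shown in the attach message at the top).

```r
library(ggplot2)

bw <- MASS::birthwt

# Base graphics: scatterplot of birth weight against mother's age.
plot(bw$age, bw$bwt)

# ggplot2: map age and bwt to the x and y aesthetics, then add a point layer.
p <- ggplot(bw, aes(x = age, y = bwt)) +
  geom_point()
p
```

Because `p` is an ordinary object, further layers such as labels or a smoother can be added to it with `+` before it is printed.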