14  R4DS: Data transformation

To run these solutions you need to load the necessary libraries.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)

As always, its good to take a glimpse at the data

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

That’s a lot of flights! Anyway…

14.1 4.2 Rows

14.1.1 Exercise 1

TO avoid extensive outputs, we only provide the code for these solutions.

# Had an arrival delay of two or more hours
flights |>
  filter(arr_delay > 120)

# Flew to Houston (IAH or HOU)
flights |>
  filter(dest == "IAH" | dest == "HOU" )

# Were operated by United, American, or Delta
# Here we use "|" for or.
flights |>
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL")

# Here we use "%in%".
flights |>
  filter(carrier %in% c("UA", "AA", "DL"))

# Departed in summer (July, August, and September)
flights |>
  filter(month %in% c(7, 8, 9)) 

# Arrived more than two hours late, but didn’t leave late
flights |>
  filter(arr_delay > 120 & dep_delay <= 0) 

# Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
  filter(dep_delay > 60 & arr_delay < 30) 

In this solution we used both | and %in%. When you have multiple conditions the %in% operator can be more concise and easier to understand. | or more specifically || is a more traditional way of indicating or conditions, but this is outside of the scope of this course.

14.1.2 Exercise 2

Here we fist arrange by dep_delay in desc-ending order to find the flights with the longest delay. We also use relocate, which is introduced in 4.2.6 to move the column to the left in the tibble so it is visible in our output.

flights |>
  arrange(desc(dep_delay)) |>
  relocate(dep_delay) |>
  head(3)
# A tibble: 3 × 19
  dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
      <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
1      1301  2013     1     9      641            900     1242           1530
2      1137  2013     6    15     1432           1935     1607           2120
3      1126  2013     1    10     1121           1635     1239           1810
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Then we arrange by dep_time to find the flights that left earliest in the morning. Here we use select, also introduced in 4.2.6 to select only the column that we arranged by.

flights |>
  arrange(dep_time) |>
  select(dep_time) |>
  head(3)
# A tibble: 3 × 1
  dep_time
     <int>
1        1
2        1
3        1

Did you do it differently? I interpreted the question in this way, but maybe it was actually supposed to be

flights |>
  arrange(desc(dep_delay), dep_time) |>
  relocate(dep_delay, dep_time) |>
  head(3)
# A tibble: 3 × 19
  dep_delay dep_time  year month   day sched_dep_time arr_time sched_arr_time
      <dbl>    <int> <int> <int> <int>          <int>    <int>          <int>
1      1301      641  2013     1     9            900     1242           1530
2      1137     1432  2013     6    15           1935     1607           2120
3      1126     1121  2013     1    10           1635     1239           1810
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

14.1.3 Exercise 3

Here we remember our high school physics, speed = distance / time. Remember to use desc to get the

flights |>
  arrange(desc(distance / air_time)) |>
  relocate(distance, air_time) |>
  head(3)
# A tibble: 3 × 19
  distance air_time  year month   day dep_time sched_dep_time dep_delay arr_time
     <dbl>    <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1      762       65  2013     5    25     1709           1700         9     1923
2     1008       93  2013     7     2     1558           1513        45     1745
3      594       55  2013     5    13     2040           2025        15     2225
# ℹ 10 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

A better solution would be to use mutate to calculate a new variable, speed:

flights |>
  mutate(speed = distance / air_time) |>
  arrange(desc(speed)) |>
  relocate(speed) |>
  head(3)
# A tibble: 3 × 20
  speed  year month   day dep_time sched_dep_time dep_delay arr_time
  <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  11.7  2013     5    25     1709           1700         9     1923
2  10.8  2013     7     2     1558           1513        45     1745
3  10.8  2013     5    13     2040           2025        15     2225
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

14.1.4 Exercise 4

flights |>
  distinct(month, day) |>
  count()
# A tibble: 1 × 1
      n
  <int>
1   365

We see that there are 365 distinct values and 2013 was not a leap year, so there were flights everyday of 2013. It is not necessary to use count, but the output is more concise.

14.1.5 Exercise 5

We simply arrange with (longest flights) or without (shortest flights) desc.

flights |>
  arrange(desc(distance)) |>
  relocate(distance) |>
  head(3)
# A tibble: 3 × 19
  distance  year month   day dep_time sched_dep_time dep_delay arr_time
     <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1     4983  2013     1     1      857            900        -3     1516
2     4983  2013     1     2      909            900         9     1525
3     4983  2013     1     3      914            900        14     1504
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  arrange(distance) |>
  relocate(distance) |>
  head(3)
# A tibble: 3 × 19
  distance  year month   day dep_time sched_dep_time dep_delay arr_time
     <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1       17  2013     7    27       NA            106        NA       NA
2       80  2013     1     3     2127           2129        -2     2222
3       80  2013     1     4     1240           1200        40     1333
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The shortest flight looks suspicious. Maybe there is something wrong with this entry?

14.1.6 Exercise 6

It does not matter in which order we filter or arrange, but it is always preferable to filter as early as possible. filter makes the data frame shorter and means that the subsequent operations (function calls) will need to process less data, and therefore be faster. However, if you are not working with a lot of data, this will not matter much. Consider it a good practice to adhere to though.

14.2 4.3 Columns

14.2.1 Exercise 1

We would believe that dep_delay = dep_time - sched_dep_time. We can test this by creating a new variable called my_dep_delay and then compare it to dep_delay.

flights |>
  mutate(
    my_dep_delay = dep_time - sched_dep_time,
    dep_delay = dep_delay,
    .keep = "used") |>
  head(5)
# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay my_dep_delay
     <int>          <int>     <dbl>        <int>
1      517            515         2            2
2      533            529         4            4
3      542            540         2            2
4      544            545        -1           -1
5      554            600        -6          -46

We see that there is something wrong in our calculations. If we use ?flights we can see that dep_time is stored in the format HHMM and dep_delay in minutes. There are several ways we can solve this, see chapter 11.4 for another, similar solution. If you are interested however, here is a solution:

flights |>
  mutate(
    dep_hours = floor(dep_time / 100),
    dep_minutes = dep_time - dep_hours * 100,
    dep_time_minutes = 60 * dep_hours + dep_minutes,
    sched_dep_hours = floor(sched_dep_time / 100),
    sched_dep_minutes = sched_dep_time - sched_dep_hours * 100,
    sched_dep_time_minutes = 60 * sched_dep_hours + sched_dep_minutes,
    my_dep_delay = dep_time_minutes - sched_dep_time_minutes,
    dep_delay = dep_delay,
    .keep = "used") |>
  relocate(dep_delay, my_dep_delay) |>
  head(5)
# A tibble: 5 × 10
  dep_delay my_dep_delay dep_time sched_dep_time dep_hours dep_minutes
      <dbl>        <dbl>    <int>          <int>     <dbl>       <dbl>
1         2            2      517            515         5          17
2         4            4      533            529         5          33
3         2            2      542            540         5          42
4        -1           -1      544            545         5          44
5        -6           -6      554            600         5          54
# ℹ 4 more variables: dep_time_minutes <dbl>, sched_dep_hours <dbl>,
#   sched_dep_minutes <dbl>, sched_dep_time_minutes <dbl>

where the floor function rounds down a number.

14.2.2 Exercise 2

Here are a two suggestions where we use names which extracts the column names of a data frame.

# Using select
flights |>
  select(dep_time, dep_delay, arr_time, arr_delay) |>
  names()
[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"
# Using starts_with
flights |>
  select(starts_with("dep") | starts_with("arr")) |>
  names()
[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"
# A unintuitive and not recommended solution, using mutate and .keep = used
flights |>
  mutate(
    dep_time = dep_time, 
    dep_delay = dep_delay, 
    arr_time = arr_time, 
    arr_delay = arr_delay,
    .keep = "used") |>
  names()
[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"

14.2.3 Exercise 3

Nothing seems to happen…

flights |>
  select(dep_time, dep_time) |>
  head(3)
# A tibble: 3 × 1
  dep_time
     <int>
1      517
2      533
3      542

14.2.4 Exercise 4

any_of can be used to select any matching variables in the vector.

variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |>
  select(any_of(variables)) |>
  head(3)
# A tibble: 3 × 5
   year month   day dep_delay arr_delay
  <int> <int> <int>     <dbl>     <dbl>
1  2013     1     1         2        11
2  2013     1     1         4        20
3  2013     1     1         2        33

produces the same result as

variables <- c("bill_depth_mm", "year", "month", "day", "dep_delay", "arr_delay")
flights |>
  select(any_of(variables)) |>
  head(3)
# A tibble: 3 × 5
   year month   day dep_delay arr_delay
  <int> <int> <int>     <dbl>     <dbl>
1  2013     1     1         2        11
2  2013     1     1         4        20
3  2013     1     1         2        33

while all_of would produce an error

variables <- c("bill_depth_mm", "year", "month", "day", "dep_delay", "arr_delay")
flights |>
  select(all_of(variables)) |>
  head(3)
Error in `select()`:
ℹ In argument: `all_of(variables)`.
Caused by error in `all_of()`:
! Can't subset elements that don't exist.
✖ Element `bill_depth_mm` doesn't exist.

These functions may be useful if you select variables across multiple data frames.

14.2.5 Exercise 5

flights |> 
  select(contains("TIME")) |>
  head(3)
# A tibble: 3 × 6
  dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
     <int>          <int>    <int>          <int>    <dbl> <dttm>             
1      517            515      830            819      227 2013-01-01 05:00:00
2      533            529      850            830      227 2013-01-01 05:00:00
3      542            540      923            850      160 2013-01-01 05:00:00

The default behavior is to ignore case, ignore.case = TRUE. Set it to FALSE to match case

flights |> 
  select(contains("TIME", ignore.case = FALSE)) |>
  head(3)
# A tibble: 3 × 0

14.2.6 Exercise 6

flights |> 
  rename(air_time_min = air_time) |>
  relocate(air_time_min) |>
  head(3)
# A tibble: 3 × 19
  air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
         <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1          227  2013     1     1      517            515         2      830
2          227  2013     1     1      533            529         4      850
3          160  2013     1     1      542            540         2      923
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

14.2.7 Exercise 7

By select-ing tailnum we drop all other columns. Thus there is no column called arr_delay anymore.

flights |> 
  select(tailnum) |>
  head(3)
# A tibble: 3 × 1
  tailnum
  <chr>  
1 N14228 
2 N24211 
3 N619AA 

14.3 4.5 Groups

14.3.1 Exercise 1

First we will look at the carriers with the worst delays. Either by using group_by

flights |>
  group_by(carrier) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))

or by using .by

flights |>
  summarize(
    dep_delay_avg = mean(dep_delay, na.rm = TRUE),
    .by = carrier) |>
  head(3)
# A tibble: 3 × 2
  carrier dep_delay_avg
  <chr>           <dbl>
1 UA              12.1 
2 AA               8.59
3 B6              13.0 

Note: we recommend using .by, at least when grouping over more variables to keep grouping tied to specific summarize. Performing multiple summarize with one group_by makes it harder to follow the code, and there is a risk that the end result is a grouped tibble, which is hard to keep track of the next time it is used.

Now we’ll try the challenge. First we look at the hint provided, where we use group_by in conjunction with the .group argument set to drop, and then try to visualize how many flights each carrier has to each destination.

flights |>
  group_by(carrier, dest) |>
  summarize(
    n = n(),
    .groups = "drop") |>
  ggplot(aes(x = dest, y = carrier, col = n)) + 
  geom_tile()

Since several carriers only fly to a few destinations, it may be hard to disentangle if the fault is with the carrier or the airport. Do you have a better solution? I’m sure there is one! Also, wouldn’t it be more reasonable to blame the origin? Hmm…

14.3.2 Exercise 2

flights |>
  slice_max(
    dep_delay,
    by = origin) |>
  relocate(origin, dep_delay) |>
  head(3)
# A tibble: 3 × 19
  origin dep_delay  year month   day dep_time sched_dep_time arr_time
  <chr>      <dbl> <int> <int> <int>    <int>          <int>    <int>
1 EWR         1126  2013     1    10     1121           1635     1239
2 LGA          911  2013     3    17     2321            810      135
3 JFK         1301  2013     1     9      641            900     1242
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Alternatively

flights |>
group_by(origin) |>
  slice_max(dep_delay) |>
  relocate(origin, dep_delay) |>
  head(3)

14.3.3 Exercise 3

First we average dep_delay grouped by hour. Remember to use na.rm =TRUE to remove any missing values

avg_delays <- flights |>
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
    .by = hour) |>
  arrange(desc(avg_dep_delay))
avg_delays |> head(5)
# A tibble: 5 × 2
   hour avg_dep_delay
  <dbl>         <dbl>
1    19          24.8
2    20          24.3
3    21          24.2
4    18          21.1
5    17          21.1

Looking at the full tibble, we can see that no flights depart at hours 2, 3, and 4, and that there seems to be something wrong with hour 1 since we have a Not-a-Number or NaN. We will ignore this now, but mention that NaN is used when some computation doesn’t have a meaningful result, such as division by zero, e.g. 1 / 0. NaN are sometimes used for missing values, but NA is preferable in that case.

Since both hour and dep_delay are numerical variables, we may use a scatter plot, geom_point, but in this case it may be more appealing to visualize the delays using a bar chart, geom_bar, or histogram, geom_histogram with stat = "identity" and avg_dep_delay selected as the y aesthetic. We see that this works with geom_freqpoly too (and geom_density).

avg_delays |>
  arrange(hour) |>
  ggplot(aes(x = hour, y = avg_dep_delay)) + 
  geom_histogram(stat = "identity") + 
  geom_freqpoly(stat = "identity")
Warning in geom_histogram(stat = "identity"): Ignoring unknown parameters:
`binwidth`, `bins`, and `pad`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_path()`).

We see that the most delays are around 19-21.

14.3.4 Exercise 4

When we slice using e negative value we drop rows instead of selecting them. slice_min drops from the bottom and slice_max drops from the top, in an ordered tibble.

avg_delays |>
  slice_min(
    avg_dep_delay, 
    n = -17)
# A tibble: 3 × 2
   hour avg_dep_delay
  <dbl>         <dbl>
1     5         0.688
2     6         1.64 
3     7         1.91 

14.3.5 Exercise 5

count counts the number of occurences of a value of a variable. If we set the sort parameter to TRUE the resulting tibble will be sorted according to the counts. Compare with distinct. Below we count the number of flights departing at each hour.

flights |>
  count(
    hour, 
    sort = TRUE) |>
  head(3)
# A tibble: 3 × 2
   hour     n
  <dbl> <int>
1     8 27242
2     6 25951
3    17 24426

We see that most flights depart in the morning. An interesting question is why the most delays are in the evening when most flights depart in the morning. Maybe we can look at when there are most arrivals?

flights |>
  mutate(arr_hour = floor(arr_time / 100)) |>
  count(
    arr_hour, 
    sort = TRUE) |>
  head(5)
# A tibble: 5 × 2
  arr_hour     n
     <dbl> <int>
1       10 25034
2       18 22367
3       20 22123
4       19 21017
5       21 20866

Here we see what we expect, the hours with the most delays coincide with top 5 hours of most arrivals. It still doesn’t explain why we don’t have more delays at 10. To investigate this we would have to explore our data further, and possibly conclude that the answer may lie beyond the recorded data.

14.3.6 Exercise 6

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

14.3.6.1 a.

df |>
  group_by(y)
# A tibble: 5 × 3
# Groups:   y [2]
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     2 b     K    
3     3 a     L    
4     4 a     L    
5     5 b     K    

Observe that the output is now grouped, which may be confusing if other operations are applied to df in a new code segment. Compare with 6 e. below.

14.3.6.2 b.

df |>
  arrange(y)
# A tibble: 5 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     3 a     L    
3     4 a     L    
4     2 b     K    
5     5 b     K    

The tibble is sorted according to y in alphabetical order.

14.3.6.3 c.

df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
# A tibble: 2 × 2
  y     mean_x
  <chr>  <dbl>
1 a       2.67
2 b       3.5 

The mean of x is calculated for all values of y.

14.3.6.4 d.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

The mean of x is calculated for all combinations of y and z. The output is still grouped by y.

14.3.6.5 e.

df |>
  group_by(y, z) |>
  summarize(
    mean_x = mean(x),
    .groups = "drop")
# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

The mean of x is calculated for all combinations of y and z. The output is not grouped by y.

14.3.6.6 f.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5
df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups:   y, z [3]
      x y     z     mean_x
  <int> <chr> <chr>  <dbl>
1     1 a     K        1  
2     2 b     K        3.5
3     3 a     L        3.5
4     4 a     L        3.5
5     5 b     K        3.5

In the first piece of code we summarize over z and the resulting tibble only has the unique values of z. In the second piece of code we create a new variable with the mean value of x for each combination of y and z.