13  R4DS: Data visualization

In these solutions we use the pipe operator, |> (you may also see the older %>%). You will encounter it later in the book, in Chapter 4.4. It is very useful for writing clear and understandable code. In these exercises, the only difference between using the pipe and not using it is that we write

penguins |>
  ggplot(aes(...))

instead of

ggplot(data = penguins, mapping = aes(...))

and it doesn’t matter which one you choose for these exercises. However, as you progress in the course, piping is good practice, so we use it in our solutions and you can use them as a reference later. We also leave out the explicit argument names data = and mapping =, in line with the later parts of the book.
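To check that the two forms really are the same call, here is a minimal sketch using base R's quote(), which is not part of the original solutions. It assumes R 4.1 or later, where the native pipe |> is available. Nothing is evaluated, so no packages need to be loaded.

```r
# The native pipe is rewritten by the parser before anything is evaluated,
# so quoting a piped expression reveals the ordinary call it stands for.
piped <- quote(penguins |> ggplot(aes(x = species)))
plain <- quote(ggplot(penguins, aes(x = species)))

# Compare the two unevaluated calls.
same_call <- identical(piped, plain)
same_call
```

Printing piped shows the call with penguins inserted as the first argument of ggplot().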

13.1 Setup

First we need to load the necessary libraries.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

It is also good practice to take a glimpse at the data before we begin, even if we are not looking at something in particular.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

That’s a number of penguins. Anyway…

13.2 2.2 First steps

13.2.1 Exercise 1

# Using piping
penguins |> nrow()
[1] 344
penguins |> ncol()
[1] 8
# Using functions with arguments
nrow(penguins)
[1] 344
ncol(penguins)
[1] 8

13.2.2 Exercise 3

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

There seem to be multiple groups within the data (in this case, the different species): there appears to be one positive correlation for bill lengths below about 40 mm and another for bill lengths between 40 and 60 mm.

13.2.3 Exercise 4

penguins |>
  ggplot(aes(x = species, y = bill_depth_mm)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

As suspected, bill depth differs between the species. A box plot is a good way to visualize continuous data across categories. Another option is to draw histograms or density plots split by species:

penguins |>
  ggplot(aes(fill = species, x = bill_depth_mm)) +
  geom_histogram(alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

Note that we set alpha = 0.5 to make the bars semi-transparent. Which one do you prefer?

13.2.4 Exercise 5

We haven’t defined any aesthetic mappings! geom_point requires both the x and y aesthetics to be mapped.

13.2.5 Exercise 6

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(na.rm = TRUE)

The na.rm argument specifies how the plot deals with missing values. With the default, na.rm = FALSE, rows with NAs are removed with a warning. If we set na.rm = TRUE, as in this exercise, they are removed silently and the plot does not output any warning. Compare with exercise 3!
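To see where the "Removed 2 rows" in the warning comes from, we can count the rows with a missing value in either plotted variable. This is a small sketch in base R and not part of the original solution; missing_rows is just an illustrative name.

```r
library(palmerpenguins)

# Rows where either plotted variable is missing cannot be drawn as points.
missing_rows <- is.na(penguins$bill_length_mm) | is.na(penguins$bill_depth_mm)

# Number of rows that are dropped, with or without the warning.
sum(missing_rows)
```

The count should match the number reported in the warning from exercise 3.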

13.2.6 Exercise 7

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(na.rm = TRUE) +
  labs(caption = "Data come from the palmerpenguins package")

Use ?labs to see the documentation for the labs function.

13.2.7 Exercise 8

penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(col = bill_depth_mm), na.rm = TRUE) +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

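Here we mapped col only in geom_point. Does it make a difference if we map it globally in ggplot instead? Spoiler: the plot looks the same! Only the points are colored, because geom_smooth draws a single line based on many points, so there is no one bill depth value with which to color it. Recent versions of ggplot2 may, however, warn that the colour aesthetic was dropped when it is mapped globally.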

13.2.8 Exercise 9

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

We now color based on the categorical variable island instead. Now we see that our geom_smooth indeed is affected, as it now creates one line for each of the different islands.

We have also set se = FALSE, which hides the shaded area around the curve. se stands for standard error, and the shaded area shows the uncertainty of the line at different points. We do not expect you to fully understand this concept now, but it is useful to at least get an idea of what’s going on.
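If we want the points colored by island but still a single smooth line for all penguins, one option is to map color locally in geom_point only. The following is a sketch of that variation, not part of the exercise itself; p is just an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# color is mapped only inside geom_point(), so geom_smooth() does not
# inherit it and fits one curve to all penguins.
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = island), na.rm = TRUE) +
  geom_smooth(se = FALSE, na.rm = TRUE)

p
```

Moving color = island into the global aes() call would again give one line per island.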

13.2.9 Exercise 10

# Here we use patchwork to set the plots next to each other.
library(patchwork)

p1 <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

p2 <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

p1 + p2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

We see that it doesn’t make a difference. We can define aesthetics globally in ggplot or locally in each geom_. However, it is better to define as much as possible in ggplot, as this avoids repetition and makes it clear, both to you writing the code and to anyone reading it, what the plot will show.

Here we used library(patchwork) to put the plots next to each other. You can try it out, but you don’t have to use it!

13.3 2.4 Visualizing distributions

13.3.1 Exercise 1

penguins |>
  ggplot(aes(y = species)) +
  geom_bar()

The bars now extend horizontally.

13.3.2 Exercise 2

library(patchwork)
p1 <- penguins |>
  ggplot(aes(y = species)) +
  geom_bar(col = "red")
p2 <- penguins |>
  ggplot(aes(y = species)) +
  geom_bar(fill = "red")

p1 + p2

col sets the color of the bar outlines, while fill sets the color of the bar interiors, so fill is usually the more interesting property to adjust when it is applicable. Colors and fills are worth experimenting with to make a plot easier to read, both for the reader and for ourselves. A more useful coloring may be one color per species, with a color scheme that is suitable for colorblind people:

# We import the ggthemes package to get the scale_fill_colorblind
library(ggthemes)
penguins |>
  ggplot(aes(y = species, fill = species)) +
  geom_bar() + 
  scale_fill_colorblind()

13.3.3 Exercise 3

To find out, we can look at ?geom_histogram.

bins
Number of bins. Overridden by binwidth. Defaults to 30.

The number of bins can affect how we perceive the histogram, and therefore the data.

library(patchwork)
p1 <- penguins |>
  ggplot(aes(fill = species, x = bill_depth_mm)) +
  geom_histogram(bins = 50, alpha = 0.5)
p2 <- penguins |>
  ggplot(aes(fill = species, x = bill_depth_mm)) +
  geom_histogram(bins = 10, alpha = 0.5)

p1 + p2
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 2 rows containing non-finite outside the scale range (`stat_bin()`).

Which one is better depends on the data and what features you are interested in. If the histogram doesn’t look like you would expect, changing the number of bins may be a good idea. Try using binwidth as well, as a bin width may be more intuitive to derive from the data.
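As a sketch of the binwidth alternative: the value 0.5 below is an arbitrary choice for illustration (half a millimetre per bin), not a recommendation from the book.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth is given in the units of the x variable (here millimetres)
# and overrides bins.
p <- ggplot(penguins, aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(binwidth = 0.5, alpha = 0.5, na.rm = TRUE)

p
```

Because the width is stated in millimetres, it is often easier to reason about than a bin count when you know the precision of the measurements.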

13.3.4 Exercise 4

First it’s a good idea to get a glimpse of the diamonds data. glimpse is a function that gives a quick overview, a glimpse, of a data frame.

glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Then we can create the histogram over carat:

diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

13.3.5 Extra exercise

diamonds |>
  ggplot(aes(x = carat)) +
  geom_density()

Using geom_density is another way of visualizing a distribution. It is less detailed than a histogram but gives a good overview of the data. It is also useful for visualizing many categories at the same time, which can get messy with histograms. Note, however, that we do not escape the problem of choosing a suitable number of bins: geom_density has a similar parameter, the bandwidth bw. We do not expect you to understand this parameter or even use it, but it is good to be aware of it.
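To get a feeling for the bandwidth without choosing bw directly, one option is the adjust argument of geom_density, which rescales the automatically chosen bandwidth. The sketch below is an addition to the solutions, and the values 0.5 and 2 are arbitrary.

```r
library(ggplot2)

# adjust multiplies the default bandwidth:
# values below 1 give a wigglier curve, values above 1 a smoother one.
p_wiggly <- ggplot(diamonds, aes(x = carat)) +
  geom_density(adjust = 0.5)
p_smooth <- ggplot(diamonds, aes(x = carat)) +
  geom_density(adjust = 2)

p_wiggly
p_smooth
```

Comparing the two plots shows the same trade-off as choosing many or few bins in a histogram.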

An example of multiple densities, colored by cut.

diamonds |>
  ggplot(aes(x = carat, col = cut)) +
  geom_density() 

13.4 2.5 Visualizing relationships

13.4.1 Exercise 1

Again we can use glimpse to get an overview, and get to know which kinds of variables the data frame contains:

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
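The column types in the glimpse output can also be sorted programmatically. This is a small base R sketch, not part of the original solution; it treats character columns as categorical and numeric columns as numerical, which is an assumption that works for mpg but not for every data frame.

```r
library(ggplot2)  # provides the mpg data set

# Character columns (<chr> in the glimpse output) are the categorical variables.
categorical <- names(mpg)[sapply(mpg, is.character)]

# Integer and double columns (<int>, <dbl>) are the numerical variables.
numerical <- names(mpg)[sapply(mpg, is.numeric)]

categorical
numerical
```

Note that some numerical variables with few distinct values, such as cyl or year, can reasonably be treated as categorical depending on the question.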

13.4.2 Exercise 2

One example using color and size at the same time.

mpg |>
  ggplot(aes(x = hwy, y = displ, col = cty, size = cyl)) +
  geom_point()

Note that we cannot map a continuous variable to shape! For that the variable needs to be discrete, for example a factor, as in this example:

mpg |>
  ggplot(aes(x = hwy, y = displ, col = cty, shape = as.factor(cyl))) +
  geom_point()

13.4.3 Exercise 3

linewidth has no effect on the points drawn by geom_point in a scatterplot, so the mapping is ignored.

13.4.4 Exercise 4

It is possible to map the same variable to multiple aesthetics, and it can even be useful. For example, we may have a reason to use a specific color scheme but also want the plot to be easily understood by someone who is colorblind; see the example below.

mpg |>
  ggplot(aes(x = hwy, y = displ, col = as.factor(cyl), shape = as.factor(cyl))) +
  geom_point()

13.4.5 Exercise 5

As we saw in 2.2.4, exercise 3, there seemed to be groups within the scatter plot between bill_length_mm and bill_depth_mm. Now we will investigate this by coloring the plot by species.

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, col = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Now it becomes obvious that the different species in our data were responsible for the different patterns.

Interesting fact: This is an example of Simpson’s paradox: a relationship between two variables that shows up within sub-populations (the different species) disappears, or even reverses, when we look at the whole population (all penguins).

We will try faceting as well:
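One way to check this numerically is to compare the correlation over all penguins with the correlation within each species. The sketch below is an addition to the solutions; overall and by_species are illustrative names, and use = "complete.obs" simply drops the rows with missing values.

```r
library(dplyr)
library(palmerpenguins)

# Correlation between bill length and bill depth over all penguins.
overall <- cor(
  penguins$bill_length_mm, penguins$bill_depth_mm,
  use = "complete.obs"
)

# The same correlation computed separately for each species.
by_species <- penguins |>
  group_by(species) |>
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

overall
by_species
```

If the sign of overall differs from the signs in by_species, the trend within the groups is reversed when the groups are combined.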

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, col = species)) +
  geom_point() +
  facet_wrap(~species)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Which one do you think is most useful?

13.4.6 Exercise 6

Legends take the name from the aesthetic mapping that creates them, in this case species. When we manually rename a legend, as in labs(color = "Species"), it only affects the specified aesthetic, i.e. color. We can remedy this by specifying the same label for shape, as below.

penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, col = species, shape = species)) +
  geom_point() +
  labs(color = "Species",
       shape = "Species")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

13.4.7 Exercise 7

library(patchwork)
p1 <- penguins |>
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "fill")
p2 <- penguins |>
  ggplot(aes(x = species, fill = island)) +
  geom_bar(position = "fill")

p1 + p2

In the first plot, p1, we can answer what fraction of the penguins on each island belongs to a particular species. In p2 we can answer what fraction of each species resides on which island.
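The fractions shown in p1 can also be computed as a table. This is a sketch using dplyr, which is loaded with the tidyverse; it is not part of the original solution, and species_by_island is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

# Count penguins per island and species, then divide by the island total
# to get the fraction of each species on each island.
species_by_island <- penguins |>
  count(island, species) |>
  group_by(island) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

species_by_island
```

Swapping the roles of island and species in group_by gives the fractions shown in p2.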

13.5 2.6 Saving your plots

Note that it is good practice to save your plots as PDF whenever possible. PDF is a vector graphics format, which lets you zoom in as much as you want without the image becoming pixelated. It is also good practice to always reflect on and specify the width and height of your plots to make sure the format suits your needs.

13.5.1 Exercise 1 & 2

By default, ggsave saves the last plot that was displayed. If you have multiple plots and want to save all of them, put a ggsave call after each plot. To save as PDF, use the .pdf extension in the filename.

penguins |>
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "fill")
ggsave("species-by-island.pdf", width = 10, height = 8)
penguins |>
  ggplot(aes(x = species, fill = island)) +
  geom_bar(position = "fill")
ggsave("island-by-species.pdf", width = 10, height = 8)
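An alternative that does not depend on which plot was displayed last is to store the plot in a variable and pass it to ggsave through its plot argument. The sketch below is an addition to the solutions; p and out_file are illustrative names, and the temporary directory is used only so that the example leaves no files behind.

```r
library(ggplot2)
library(palmerpenguins)

# Store the plot in a variable instead of only displaying it.
p <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

# Pass the plot explicitly, so the order in which plots were drawn
# no longer matters.
out_file <- file.path(tempdir(), "species-by-island.pdf")
ggsave(out_file, plot = p, width = 10, height = 8)
```

In your own project you would replace out_file with a path inside the project folder.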