12  Basic statistics

12.1 Variables

12.1.1 Exercise 1

We run the summary function on the penguins dataset after loading palmerpenguins.

library(palmerpenguins)
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

We see that the continuous variables show summary statistics, while the categorical variables show counts of their levels, such as Adelie, Chinstrap, and Gentoo in the species variable. You will learn more about these concepts in the Descriptive statistics section.
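To see this behaviour in isolation, here is a minimal sketch using two small made-up vectors (the names `bill` and `species` and their values are purely illustrative, not taken from the penguins data). It only assumes base R.

```r
# Made-up toy vectors for illustration; not real penguin measurements
bill <- c(39.1, 39.5, 40.3, NA, 36.7)   # numeric, with one missing value
species <- factor(c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"))

# For a numeric vector, summary() reports min, quartiles, median, mean, max
# and, if there are any, the number of missing values
num_summary <- summary(bill)

# For a factor, summary() reports one count per level
fct_summary <- summary(species)

print(num_summary)
print(fct_summary)
```

The same dispatch happens column by column when summary is called on a whole data frame.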

12.1.2 Exercise 2

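We see that both variables have changed from integer to double, as indicated by the <dbl> in the glimpse output. The summary output is unchanged, because the conversion alters only the storage type and not the values.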

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
penguins$flipper_length_mm <- as.numeric(penguins$flipper_length_mm)
penguins$body_mass_g <- as.numeric(penguins$body_mass_g)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
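The conversion can also be checked on a small scale. The sketch below uses a made-up integer vector (`x_int` is a hypothetical name, and its values are illustrative) to show that as.numeric changes the storage type while leaving values and missing entries as they were.

```r
# Made-up integer vector; the L suffix marks integer literals
x_int <- c(181L, 186L, 197L, NA)
x_dbl <- as.numeric(x_int)

typeof(x_int)  # "integer"
typeof(x_dbl)  # "double"

# The values and the position of the missing value are unchanged
same_values <- isTRUE(all.equal(as.numeric(x_int), x_dbl))
same_mean <- isTRUE(all.equal(mean(x_int, na.rm = TRUE),
                              mean(x_dbl, na.rm = TRUE)))
print(c(same_values = same_values, same_mean = same_mean))
```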

12.2 Descriptive statistics

12.2.1 Exercise 1

If we don’t specify na.rm = TRUE, the mean function does not drop the missing values. Because the mean of a set of values that includes an unknown value is itself unknown, the function returns a missing value, NA.

# Not specifying na.rm = TRUE
mean(penguins$flipper_length_mm)
[1] NA
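For comparison, a minimal sketch on a made-up vector (`flipper` here is a hypothetical stand-in with illustrative values, not the real column) shows the result with and without na.rm = TRUE, and how to count the missing values first.

```r
# Made-up values for illustration, with one missing value
flipper <- c(181, 186, 195, NA, 193)

mean_with_na <- mean(flipper)                  # NA: the missing value propagates
mean_without_na <- mean(flipper, na.rm = TRUE) # 188.75: NA dropped before averaging
n_missing <- sum(is.na(flipper))               # 1

print(mean_with_na)
print(mean_without_na)
print(n_missing)
```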

12.2.2 Exercise 2

Calculating the mean of a factor produces a warning that the argument is not numeric or logical, and the function returns NA. What would the mean of a categorical variable (a factor) even represent? Does it make sense?

mean(penguins$species, na.rm = TRUE)
Warning in mean.default(penguins$species, na.rm = TRUE): argument is not
numeric or logical: returning NA
[1] NA
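A categorical variable is usually described with counts or proportions instead of a mean. The following sketch, on a made-up factor (the name `species` and its values are illustrative only), uses the base R functions table, prop.table, and which.max to get the counts, the shares, and the most common level.

```r
# Made-up factor for illustration
species <- factor(c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"))

counts <- table(species)        # number of observations per level
shares <- prop.table(counts)    # the same counts as proportions

# The most frequent level (the mode) is a sensible "typical value" for a factor
most_common <- names(counts)[which.max(counts)]

print(counts)
print(shares)
print(most_common)
```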

12.2.3 Exercise 3

The median and the 50th percentile (or second quartile) are the same value! The median is the value that splits the data into two parts with the same number of observations, and the 50th percentile is the value that has half of the data below it and half above. Do you see that these are two ways of saying the same thing?

# 50th percentile
penguins |> 
  pull(body_mass_g) |> 
  quantile(probs = 0.5, na.rm = TRUE)
 50% 
4050 
# Median
penguins |> 
  pull(body_mass_g) |> 
  median(na.rm = TRUE)
[1] 4050
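The equivalence can be checked directly. This is a small sketch on made-up values (`mass` is a hypothetical vector, not the real body_mass_g column), assuming the default quantile algorithm in base R.

```r
# Made-up values for illustration; an even number of observations
mass <- c(3750, 3800, 3250, 3450, 3650, 3625)

q50 <- quantile(mass, probs = 0.5)  # named vector with the name "50%"
med <- median(mass)

# quantile() keeps a name on its result, so drop it before comparing
same <- isTRUE(all.equal(unname(q50), med))

# Half of the observations lie below the median
share_below <- mean(mass < med)

print(same)
print(share_below)
```

With an even number of observations, both functions return the midpoint between the two middle values.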

12.2.4 Exercise 4

The variables cut, color, and clarity are categorical; the others are numerical. There are no missing values.

library(tidyverse)
summary(diamonds)
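Both claims can also be checked programmatically instead of reading them off the summary output. The sketch below is one possible way to do it, assuming the ggplot2 package (which provides the diamonds dataset) is installed. The names `is_categorical` and `n_missing` are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Which columns are stored as factors (categorical)?
is_categorical <- sapply(diamonds, is.factor)
categorical_vars <- names(which(is_categorical))

# How many missing values are there in each column?
n_missing <- colSums(is.na(diamonds))

print(categorical_vars)
print(n_missing)
```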

12.3 Relationships between variables

12.3.1 Exercise 1

We see that the values in the matrix correspond to the colors in the correlation plot. A strong correlation is indicated by values close to either 1 (positive correlation) or -1 (negative correlation), while values close to 0 indicate a weak linear relationship.

library(ggcorrplot)
penguins |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs")
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm        1.00000000   -0.23505287         0.6561813  0.59510982
bill_depth_mm        -0.23505287    1.00000000        -0.5838512 -0.47191562
flipper_length_mm     0.65618134   -0.58385122         1.0000000  0.87120177
body_mass_g           0.59510982   -0.47191562         0.8712018  1.00000000
year                  0.05454458   -0.06035364         0.1696751  0.04220939
                         year
bill_length_mm     0.05454458
bill_depth_mm     -0.06035364
flipper_length_mm  0.16967511
body_mass_g        0.04220939
year               1.00000000
penguins |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  ggcorrplot()
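The argument use = "complete.obs" matters because the data contain missing values. The following sketch uses a small made-up data frame (`df` and its columns are hypothetical and only illustrative) to show that cor returns NA by default when values are missing, and that "complete.obs" gives the same result as first dropping every row with a missing value.

```r
# Made-up data for illustration, with missing values in two columns
df <- data.frame(
  flipper = c(181, 186, 195, NA, 193, 190),
  mass    = c(3750, 3800, 3250, 3450, NA, 3650),
  year    = c(2007, 2007, 2008, 2008, 2009, 2009)
)

# Default behaviour: any pair involving a column with NA gives NA
cor_default <- cor(df)

# Only rows without any missing value are used
cor_complete <- cor(df, use = "complete.obs")

# Equivalent: drop the incomplete rows by hand, then correlate
cor_manual <- cor(df[complete.cases(df), ])

print(cor_default)
print(cor_complete)
print(isTRUE(all.equal(cor_complete, cor_manual)))
```

Note that "complete.obs" drops a row if any variable in it is missing, so every correlation in the matrix is computed on the same set of rows.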