18 Linear regression

18.1 Setup

First we load the dataset and have a glimpse at it.

library(tidyverse)
library(palmerpenguins)
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

18.2 A linear relationship

18.2.1 Exercise 1

We see that the average body weight for a penguin with flipper length of 201 mm is 4206, an increase of 50 g from 200. Note we round the values in the plot outputs.

x <- 201
model <- penguins |> 
  lm(formula = body_mass_g ~ flipper_length_mm)
prediction <- data.frame(
  flipper_length_mm = x,
  body_mass_g = model |> predict(data.frame(flipper_length_mm = x)))

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_segment(x = prediction$flipper_length_mm, xend = prediction$flipper_length_mm,  y = 0, yend = prediction$body_mass_g, color = "black", linetype = 2, linewidth = 0.2) +
  geom_segment(x = 0, xend = prediction$flipper_length_mm,  y = prediction$body_mass_g, yend = prediction$body_mass_g, color = "black", linetype = 2, linewidth = 0.2) +
  scale_x_continuous(breaks = c(prediction$flipper_length_mm)) +
  scale_y_continuous(breaks = round(c(prediction$body_mass_g))) +
  geom_point(data = prediction, col = "red", size = 3) + 
  labs(
    x = "Flipper length (mm)", 
    y = "Body mass (g)")

18.2.2 Exercise 2

The predicted average body mass is -812 g, which doesn’t seem to make any sense! When we are predicting based on values that are outside of our sample we may get poor predictions, as in this case. Always be mindful when predicting outside your sample.

x <- 100
model <- penguins |> 
  lm(formula = body_mass_g ~ flipper_length_mm)
prediction <- data.frame(
  flipper_length_mm = x,
  body_mass_g = model |> predict(data.frame(flipper_length_mm = x)))

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_segment(x = prediction$flipper_length_mm, xend = prediction$flipper_length_mm,  y = 0, yend = prediction$body_mass_g, color = "black", linetype = 2, linewidth = 0.2) +
  geom_segment(x = 0, xend = prediction$flipper_length_mm,  y = prediction$body_mass_g, yend = prediction$body_mass_g, color = "black", linetype = 2, linewidth = 0.2) +
  scale_x_continuous(breaks = c(prediction$flipper_length_mm)) +
  scale_y_continuous(breaks = round(c(prediction$body_mass_g))) +
  geom_point(data = prediction, col = "red", size = 3) + 
  labs(
    x = "Flipper length (mm)", 
    y = "Body mass (g)")

18.3 Fitting a linear model

18.3.1 Exercise 1

model <- penguins |> 
  lm(formula = flipper_length_mm ~ body_mass_g)
summary(model)


Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.7626  -4.9138   0.9891   5.1166  16.6392 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.367e+02  1.997e+00   68.47   <2e-16 ***
body_mass_g 1.528e-02  4.668e-04   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.913 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

The linear model doesn’t care which variable is the dependent and independent variable, it will fit a model either way. The coefficients will be very different, as the increase in body mass is much smaller per mm flipper length than the other way around, but the relationship will still be significant. The coefficient is 1.528e-02 which is the same as \(1.528 \cdot 10^{-2} = 0.01528\). Thus the flipper length increases on average by 0.015 cm per g the body mass increases. We can never draw any causal conclusions without additional assumptions, and in this case it doesn’t seem appropriate, as it seems more like a “chicken and egg” dilemma than a causal effect. The model can still be useful for predicting and to understand the relationship between the variables, without suggesting a causal relationship.

penguins |> 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(
    x = "Flipper length (mm)", 
    y = "Body mass (g)")

18.3.2 Exercise 2

From the visualization we see that the sales increase during the summer months and decreases during winter, but the line is completely flat. There is a clear relationship, but it is not a linear relationship! We can see from how the data is generated that it depends on the sin function, which is not linear.

# Simulating data
day <- seq(0, 12, length.out = 356)
sales <- 400 * sin(pi * day / 12)
ice_cream <- tibble(day = day, sales = sales)

ice_cream |> 
  ggplot(aes(x = day, y = sales)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_x_continuous(
    breaks = seq(0, 12)) +
  labs(
    x = "Month",
    y = "Ice cream sales")

`geom_smooth()` using formula = 'y ~ x'

We see that the \(p\)-value is 1, there is no significant relationship. It is important to remember that we are looking for a linear relationship when performing linear regression. It will not detect any other relationship.

ice_cream |> 
  lm(formula = sales ~ day) |> 
  summary()


Call:
lm(formula = sales ~ day, data = ice_cream)

Residuals:
    Min      1Q  Median      3Q     Max 
-253.93 -102.08   28.28  115.45  146.06 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.539e+02  1.312e+01   19.36   <2e-16 ***
day         1.638e-14  1.892e+00    0.00        1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 124 on 354 degrees of freedom
Multiple R-squared:  2.54e-31,  Adjusted R-squared:  -0.002825 
F-statistic: 8.991e-29 on 1 and 354 DF,  p-value: 1

18.3.3 Exercise 3

We see that we get one coefficient for Chinstrap and one for Gentoo. The value, under “Estimate” is the difference from the Adelie penguins in average body weight, compare with the values computed below.

penguins |> 
  lm(formula = body_mass_g ~ species) |>
  summary()


Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3700.66      37.62   98.37   <2e-16 ***
speciesChinstrap    32.43      67.51    0.48    0.631    
speciesGentoo     1375.35      56.15   24.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16

We can check this

# Calculating mean body weight for each species
adelie_mean <- penguins |> 
  filter(species == "Adelie") |> 
  pull(body_mass_g) |>
  mean(na.rm = TRUE)
chinstrap_mean <- penguins |> 
  filter(species == "Chinstrap") |> 
  pull(body_mass_g) |>
  mean(na.rm = TRUE)
gentoo_mean <- penguins |> 
  filter(species == "Gentoo") |> 
  pull(body_mass_g) |>
  mean(na.rm = TRUE)

 # Differences in mean
chinstrap_mean - adelie_mean

[1] 32.42598

gentoo_mean - adelie_mean

[1] 1375.354

In addition to comparing the means the linear regression outputs \(p\)-values that tests if there is a significant difference between the species. Only the difference between Adelie and Gentoos is significant.

18.3.4 Exercise 4

We see that the \(p\)-values are the same and we can use linear regression instead of the t-test. Linear regression is more versatile and can accommodate many levels in the categorical predictor. If it is possible to use a t-test it may be more straightforward and clear though.

# t-test
female_gentoo_body_mass <- penguins |>
  filter(species == "Gentoo" & sex == "female") |>
  select(body_mass_g) |>
  drop_na()
male_gentoo_body_mass <- penguins |>
  filter(species == "Gentoo" & sex == "male") |>
  select(body_mass_g) |>
  drop_na()

t.test(male_gentoo_body_mass, female_gentoo_body_mass)


    Welch Two Sample t-test

data:  male_gentoo_body_mass and female_gentoo_body_mass
t = 14.761, df = 116.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 697.0763 913.1130
sample estimates:
mean of x mean of y 
 5484.836  4679.741

# Linear regression
penguins |> 
  filter(species == "Gentoo") |>
  lm(formula = body_mass_g ~ sex) |>
  summary()


Call:
lm(formula = body_mass_g ~ sex, data = filter(penguins, species == 
    "Gentoo"))

Residuals:
    Min      1Q  Median      3Q     Max 
-734.84 -207.29   20.26  215.16  815.16 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4679.74      39.15  119.52   <2e-16 ***
sexmale       805.09      54.69   14.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 298.2 on 117 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.6494,    Adjusted R-squared:  0.6464 
F-statistic: 216.7 on 1 and 117 DF,  p-value: < 2.2e-16

18.4 Categorical predictors

18.4.1 Exercise 1

We start with the linear model.

penguins |> 
  lm(formula = body_mass_g ~ island) |> 
  summary()


Call:
lm(formula = body_mass_g ~ island, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1866.02  -362.90    -6.37   433.98  1583.98 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4716.02      48.47   97.30   <2e-16 ***
islandDream     -1003.11      74.25  -13.51   <2e-16 ***
islandTorgersen -1009.65     100.21  -10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 626.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.3936,    Adjusted R-squared:   0.39 
F-statistic:   110 on 2 and 339 DF,  p-value: < 2.2e-16

We see that the coefficients for both Dream and Torgersen are significant, but let’s have a look at a density plot to better understand the result.

Code

mean_bws <- penguins |> 
  summarize(
    body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .by = island)
penguins |> 
  ggplot(aes(x = body_mass_g, fill = island)) + 
  geom_density(alpha = 0.3) +
  geom_vline(
    aes(
      xintercept = body_mass_g,
      col = island),
    data = mean_bws) +
  labs(
    x = "Body mass (g)",
    y = "",
    col = "Species",
    fill = "Species") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

We see that lm choose Biscoe as the reference island, making the coefficients for both the other islands significant. The result would have been different if Dream or Torgersen would’ve been the reference.

18.4.2 Exercise 2

We set Torgersen as reference and rerun the linear regression.

penguins_re <- penguins |> mutate(island = relevel(island, ref = "Torgersen"))
model_summary <- penguins_re |> 
  lm(formula = body_mass_g ~ island) |> 
  summary()
model_summary$coefficients

We see that now only Biscoe is significant.

18.4.3 Exercise 3

Gentoos only exists on Biscoe, which is the island with the significant difference in body weight. It is fair to assume that species is more responsible than island for the effect we see on body mass.

penguins |> 
  ggplot(aes(x = island, fill = species)) +
  geom_bar() +
  labs(
    x = "Island",
    y = "Count",
    fill = "Species")

18.4.4 Exercise 4

We see that island doesn’t have a significant effect on the Adelie penguins body mass and the coefficients are quite small in comparison to the total body mass of the average penguin.

model_summary <- penguins |> 
  filter(species == "Adelie") |>  
  lm(formula = body_mass_g ~ island) |> 
  summary()
model_summary$coefficients

                   Estimate Std. Error     t value     Pr(>|t|)
(Intercept)     3709.659091   69.58192 53.31355215 1.671273e-98
islandDream      -21.266234   92.98275 -0.22871161 8.194088e-01
islandTorgersen   -3.286542   94.96708 -0.03460717 9.724396e-01

We can visualize the result to see that it seems to make sense.

mean_bws <- penguins |> 
  filter(species == "Adelie") |>  
  summarize(
    body_mass_g = mean(body_mass_g, na.rm = TRUE),
    .by = island)
penguins |> 
  filter(species == "Adelie") |>  
  ggplot(aes(x = body_mass_g, fill = island, col = island)) + 
  geom_density(alpha = 0.3) +
  geom_vline(
    aes(xintercept = body_mass_g,
        col = island),
    data = mean_bws) +
  labs(
    x = "Body mass (g)",
    y = "",
    col = "Island",
    fill = "Island") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_density()`).

18.5 A surprising result (Simpson’s paradox)

18.5.1 Exercise 2

We see that the \(p\)-values are low for both species, even if it is just above 0.05 for Gentoos. Even if the points seemingly line up well, the difference is still statistically significant, at least for one species. This shows that it is hard to determine statistical significance by eye only.

model_summary <- penguins |> 
  lm(formula = body_mass_g ~ flipper_length_mm + species) |> 
  summary()
model_summary$coefficients

                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)       -4031.4769  584.15134 -6.901425 2.551453e-11
flipper_length_mm    40.7054    3.07102 13.254686 1.401600e-32
speciesChinstrap   -206.5101   57.73065 -3.577132 3.980986e-04
speciesGentoo       266.8096   95.26374  2.800747 5.391808e-03

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, col = species)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    col = "Species")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

18.6 Prediction and prediction interval

18.6.1 Exercise 1

model <- penguins |>  lm(formula = body_mass_g ~ flipper_length_mm)
model |> predict(data.frame(flipper_length_mm = 201), interval = "prediction", level = 0.95)

       fit      lwr      upr
1 4205.967 3429.303 4982.632

model |> predict(data.frame(flipper_length_mm = 100), interval = "prediction", level = 0.95)

        fit       lwr      upr
1 -812.2747 -1645.371 20.82124

18.6.2 Exercise 2

The interval becomes smaller. The confidence level specifies our confidence in a new observation falling within the interval. If we make the interval smaller, we have to reduce our confidence. On the other hand, if we make our interval infidelity big, we can be 100% confident it will cover the next observation. Of course there is no use for such a 100% interval in practice.

model |> predict(data.frame(flipper_length_mm = 200), interval = "prediction", level = 0.5)

       fit     lwr      upr
1 4156.282 3889.67 4422.894