20  Logistic regression

20.1 Setup

First we load the dataset and have a glimpse at it.

library(tidyverse)
#install.packages("mlbench") # Install to get PimaIndiansDiabetes2 dataset
library(mlbench)
data("PimaIndiansDiabetes2") # Using 2 because it is a corrected version.
glimpse(PimaIndiansDiabetes2)
Rows: 768
Columns: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1…
$ glucose  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,…
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, 72, N…
$ triceps  <dbl> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, NA, 23, 19, N…
$ insulin  <dbl> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA, NA, 846, 17…
$ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA, 37.…
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158…
$ age      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3…
$ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n…

20.2 The Pima Indians dataset

20.2.1 Exercise 1

Our population is all women of the Pima Indian population aged 21 or older. Looking at the original paper, we can see that the study started in 1965, which we may also want to specify.

20.2.2 Exercise 2

We see that the blood pressure seems higher in the group that were diagnosed with diabetes. In the next chapter we will see if it is a significant relationship.

library(patchwork)
p1 <- PimaIndiansDiabetes2 |>
  ggplot(aes(y = pressure)) + 
  facet_grid(~ diabetes) +
  geom_boxplot() +
  theme(
    axis.text.x = element_blank(),  # Remove axis text
    axis.ticks.x = element_blank()  # Remove axis title
  ) +
  labs(y = "Blood pressure")
p2 <- PimaIndiansDiabetes2 |>
  ggplot(aes(x = pressure, fill = diabetes)) + 
  geom_density(alpha = 0.5) +
  theme(
    legend.position = "top",
    axis.text.y = element_blank(),  # Remove axis text
    axis.ticks.y = element_blank()  # Remove axis title
  ) +
  labs(
    x = "Blood pressure",
    y = "",
    fill = "Diabetes"
    )

p1 + p2
Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_density()`).

20.3 Using linear regression

20.3.1 Exercise 1

First we formulate our hypotheses:

H_0: There is no linear relationship between mass and blood pressure in the women in the Pima Indian tribe.

H_a: There is a linear relationship between mass and blood pressure in the women in the Pima Indian tribe.

Then we fit our model.

model <- lm(pressure ~ mass, data = PimaIndiansDiabetes2)
summary(model)

Call:
lm(formula = pressure ~ mass, data = PimaIndiansDiabetes2)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.081  -7.628  -0.331   7.262  54.868 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.48694    2.11810  26.197  < 2e-16 ***
mass         0.51989    0.06382   8.147 1.63e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.86 on 727 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared:  0.08365,   Adjusted R-squared:  0.08239 
F-statistic: 66.37 on 1 and 727 DF,  p-value: 1.63e-15

From the summary output we can see that the \(p\)-value is very small. According to our model there is a linear relationship between mass and blood pressure. We can also visualize this to see that our results are reasonable.

PimaIndiansDiabetes2 |> 
  ggplot(aes(x = mass, y = pressure)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 39 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 39 rows containing missing values or values outside the scale range
(`geom_point()`).

We see a clear slope of the line.

20.3.2 Exercise 2

We formulate the hypotheses:

H_0: There is no linear relationship between glucose and the onset of diabetes in the women in the Pima Indian tribe.

H_a: There is a linear relationship between glucose levels and the onset of diabetes in the women in the Pima Indian tribe.

As discussed in the chapter, a linear relationship is not right for this problem, since the outcome is binary.

20.4 Using logistic regression

20.4.1 Exercise 1 (a)

We start by formulating the hypotheses

H_0: There is no relationship between blood pressure and being diagnosed with diabetes.

H_a: There is a relationship between blood pressure level and being diagnosed with diabetes.

then fit our model

model <- glm(diabetes ~ pressure, data = PimaIndiansDiabetes2, family = "binomial")
summary(model)

Call:
glm(formula = diabetes ~ pressure, family = "binomial", data = PimaIndiansDiabetes2)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.830259   0.491750  -5.755 8.64e-09 ***
pressure     0.029884   0.006587   4.537 5.72e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 943.40  on 732  degrees of freedom
Residual deviance: 921.78  on 731  degrees of freedom
  (35 observations deleted due to missingness)
AIC: 925.78

Number of Fisher Scoring iterations: 4

We see that the \(p\)-value is low and that there is a significant relationship between blood pressure and diabetes at significance level 0.05.

20.4.2 Exercise 1 (b)

We start by formulating the hypotheses

H_0: There is no relationship between body mass index and being diagnosed with diabetes.

H_a: There is a relationship between body mass index level and being diagnosed with diabetes.

then fit our model

model <- glm(diabetes ~ mass, data = PimaIndiansDiabetes2, family = "binomial")
summary(model)

Call:
glm(formula = diabetes ~ mass, family = "binomial", data = PimaIndiansDiabetes2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.99682    0.42885   -9.32  < 2e-16 ***
mass         0.10250    0.01261    8.13 4.31e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 981.53  on 756  degrees of freedom
Residual deviance: 904.89  on 755  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 908.89

Number of Fisher Scoring iterations: 4

We see that the \(p\)-value is low and that there is a significant relationship between BMI and diabetes at significance level 0.05.

20.5 Prediction

20.5.1 Exercise 1

Here are two examples of predicting the probability of being diagnosed with diabetes based on pressure and mass

pressure_model <- glm(diabetes ~ pressure, data = PimaIndiansDiabetes2, family = "binomial")
predict(pressure_model, data.frame(pressure = 90), type = "response")
        1 
0.4648821 
mass_model <- glm(diabetes ~ mass, data = PimaIndiansDiabetes2, family = "binomial")
predict(mass_model, data.frame(mass = 25), type = "response")
        1 
0.1924252 

20.5.2 Exercise 2

We try to find the x value (glucose level) that corresponds to 0.9 probability of being diagnosed with diabetes. According to the plot this roughly coincides with 195 mg/dl. Our prediction confirms this.

model <- glm(diabetes ~ glucose, data = PimaIndiansDiabetes2, family = "binomial")
predict(model, data.frame(glucose = 195), type = "response")
        1 
0.9010074 

20.5.3 Exercise 3

Our model predicts a \(\sim0.92\) probability of the onset of diabetes given a glucose level of 200 mg/dl. Our model was not trained on any data of women with diabetes, i.e. the model was not trained on any glucose levels 200 or above and is not suited for predictions outside of what was observed in the dataset. Further, the model doesn’t “know” about the classification of diabetes, i.e. that glucose levels over 200 are classified as diabetes.

model <- glm(diabetes ~ glucose, data = PimaIndiansDiabetes2, family = "binomial")
predict(model, data.frame(glucose = 200), type = "response")
        1 
0.9177104