21  Multiple logistic regression

21.1 Setup

First we load the dataset and take a glimpse at it.

library(tidyverse)
#install.packages("mlbench") # Install to get PimaIndiansDiabetes2 dataset
library(mlbench)
data("PimaIndiansDiabetes2") # Using 2 because it is a corrected version.
glimpse(PimaIndiansDiabetes2)
Rows: 768
Columns: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1…
$ glucose  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,…
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, 72, N…
$ triceps  <dbl> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, NA, 23, 19, N…
$ insulin  <dbl> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA, NA, 846, 17…
$ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA, 37.…
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158…
$ age      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3…
$ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n…
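
The glimpse already shows several missing values (NA). As a quick check before modelling, a short summary, our own addition rather than part of the exercise, counts the missing values per column:

# Count the number of NAs in each column
PimaIndiansDiabetes2 |> 
  summarize(across(everything(), ~ sum(is.na(.x))))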

21.2 Model selection

21.2.1 Exercise 1

We fit the two models and compare them with the AIC function.

# Pressure model
pressure_model <- glm(diabetes ~ pressure, data = PimaIndiansDiabetes2, family = "binomial")
AIC(pressure_model)
[1] 925.7838
# Triceps model
triceps_model <- glm(diabetes ~ triceps, data = PimaIndiansDiabetes2, family = "binomial")
AIC(triceps_model)
[1] 654.857

We see that the triceps model has a much lower AIC and, comparing with the result in the lecture notes, that including pressure barely improves the model!
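
One caveat worth noting (our own addition, not part of the original solution): glm drops observations with missing values, and pressure and triceps are missing for different subjects, so the two AIC values above are computed on different subsets of the data. A fairer comparison refits both models on the common complete cases:

# Refit both models on the rows where both predictors are observed
complete_cases <- PimaIndiansDiabetes2 |> 
  drop_na(pressure, triceps)
AIC(glm(diabetes ~ pressure, data = complete_cases, family = "binomial"))
AIC(glm(diabetes ~ triceps, data = complete_cases, family = "binomial"))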

21.2.2 Exercise 2

We fit the model with all three predictors and look at the AIC and VIF values.

model <- glm(diabetes ~ pressure + mass + triceps, data = PimaIndiansDiabetes2, family = "binomial")
summary(model)

Call:
glm(formula = diabetes ~ pressure + mass + triceps, family = "binomial", 
    data = PimaIndiansDiabetes2)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.105936   0.719401  -7.097 1.27e-12 ***
pressure     0.018727   0.008321   2.251 0.024416 *  
mass         0.071762   0.019412   3.697 0.000218 ***
triceps      0.021741   0.012031   1.807 0.070760 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 683.62  on 536  degrees of freedom
Residual deviance: 623.13  on 533  degrees of freedom
  (231 observations deleted due to missingness)
AIC: 631.13

Number of Fisher Scoring iterations: 3

We see that the AIC is lower than for the models using only pressure or only triceps, but the \(p\)-value for triceps is now above 0.05, so its relationship with diabetes is not statistically significant at significance level 0.05. The \(p\)-value is still fairly low, so we can keep triceps in the model, but it may be reasonable to use only one of BMI and triceps skin fold thickness, as both are measurements relating to body fat.
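
One way to check this (a sketch, not part of the original solution) is to refit the model without triceps on the same complete cases and compare the AIC values:

# Compare models with and without triceps on the same rows
cc <- PimaIndiansDiabetes2 |> 
  drop_na(pressure, mass, triceps)
AIC(glm(diabetes ~ pressure + mass + triceps, data = cc, family = "binomial"))
AIC(glm(diabetes ~ pressure + mass, data = cc, family = "binomial"))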

We also calculate the VIF values to see if we have multicollinearity.

library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
vif(model)
pressure     mass  triceps 
1.054880 1.594327 1.549618 

We see that all values are below 5, so we do not need to remove any variable because of multicollinearity.
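
As a reminder of what vif computes: the VIF of a predictor is \(1 / (1 - R^2)\), where \(R^2\) comes from regressing that predictor on the remaining predictors. A small sketch reproduces this for mass by hand (note that car::vif works from the fitted model's coefficient covariance, so for a GLM the numbers can differ slightly from this linear-regression version):

# VIF for mass by hand: regress mass on the other predictors
cc <- PimaIndiansDiabetes2 |> 
  drop_na(pressure, mass, triceps)
r2 <- summary(lm(mass ~ pressure + triceps, data = cc))$r.squared
1 / (1 - r2)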

21.2.3 Model selection strategies

21.2.3.1 Exercise 1

See Assignment 2.

21.3 Accuracy

21.3.1 Exercise 1

We see that the accuracy is the same with or without pressure. In this case the simpler model, not including pressure, is the better option.

model <- PimaIndiansDiabetes2 |> 
  glm(formula = diabetes ~ ., family = "binomial")

# Calculating the accuracy
PimaIndiansDiabetes2 |> 
  # Calculate the predicted probabilities for all observations
  mutate(predicted_prob = model |> predict(PimaIndiansDiabetes2, type = "response")) |> 
  # Predicting the outcome
  mutate(predicted_outcome = ifelse(predicted_prob > 0.5, "pos", "neg")) |> 
  # Dropping predictions with missing values
  filter(!is.na(predicted_outcome)) |>
  # Summarizing (Accuracy = # Correct predictions / # Total predictions)
  summarize(accuracy = mean(predicted_outcome == diabetes))
   accuracy
1 0.7831633
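
The corresponding computation for the model without pressure only changes the formula. A sketch (the accuracy printed above is for the full model):

# Fit and evaluate the model without pressure
model_reduced <- PimaIndiansDiabetes2 |> 
  glm(formula = diabetes ~ . - pressure, family = "binomial")
PimaIndiansDiabetes2 |> 
  mutate(predicted_prob = predict(model_reduced, PimaIndiansDiabetes2, type = "response")) |> 
  mutate(predicted_outcome = ifelse(predicted_prob > 0.5, "pos", "neg")) |> 
  filter(!is.na(predicted_outcome)) |> 
  summarize(accuracy = mean(predicted_outcome == diabetes))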

21.3.2 Exercise 2

We leave this one to you.

21.3.3 Exercise 3

Using this seed, the accuracy is the same for both models.

# Set seed for reproducibility
set.seed(42) 

# Create an index to keep track of subjects
pima <- PimaIndiansDiabetes2 |> 
  drop_na() |> # Using only complete observations
  mutate(index = 1:n())

# Create training and testing set
training_set <- pima |> 
  sample_frac(0.8) # Use 80% of the data in training set
test_set <- pima |> 
  anti_join(training_set, by = "index")

# Train model on training set (excluding the helper index column)
model <- training_set |> 
  glm(formula = diabetes ~ . - index, family = "binomial")

# Evaluate model by accuracy on test set
predictions <- model |> 
  predict(test_set, type = "response")
test_set |> 
  mutate(prediction = ifelse(predictions > 0.5, "pos", "neg")) |> 
  drop_na() |> # Dropping predictions with missing values
  summarize(accuracy = sum(prediction == diabetes) / n())
   accuracy
1 0.8076923
We then fit the model without pressure and evaluate it the same way.

model <- training_set |> 
  glm(formula = diabetes ~ . - pressure - index, family = "binomial")

# Evaluate model by accuracy on test set
predictions <- model |> 
  predict(test_set, type = "response")
test_set |> 
  mutate(prediction = ifelse(predictions > 0.5, "pos", "neg")) |> 
  drop_na() |> # Dropping predictions with missing values
  summarize(accuracy = mean(prediction == diabetes))
   accuracy
1 0.8076923

With a different seed, the full model has a higher accuracy.

# Set seed for reproducibility
set.seed(44) 

# Create an index to keep track of subjects
pima <- PimaIndiansDiabetes2 |> 
  drop_na() |> # Using only complete observations
  mutate(index = 1:n())

# Create training and testing set
training_set <- pima |> 
  sample_frac(0.8) # Use 80% of the data in training set
test_set <- pima |> 
  anti_join(training_set, by = "index")

# Train model on training set (excluding the helper index column)
model <- training_set |> 
  glm(formula = diabetes ~ . - index, family = "binomial")

# Evaluate model by accuracy on test set
predictions <- model |> 
  predict(test_set, type = "response")
test_set |> 
  mutate(prediction = ifelse(predictions > 0.5, "pos", "neg")) |> 
  summarize(accuracy = sum(prediction == diabetes) / n())
   accuracy
1 0.7820513
We then fit the model without pressure and evaluate it again.

model <- training_set |> 
  glm(formula = diabetes ~ . - pressure - index, family = "binomial")

# Evaluate model by accuracy on test set
predictions <- model |> 
  predict(test_set, type = "response")
test_set |> 
  mutate(prediction = ifelse(predictions > 0.5, "pos", "neg")) |> 
  summarize(accuracy = sum(prediction == diabetes) / n())
   accuracy
1 0.7692308

To properly study accuracy it is good to use more than one train-test split, or to use k-fold cross-validation. However, this is beyond the scope of this course.

We see that including pressure doesn’t seem to improve our accuracy much, suggesting it is better to go for the smaller, less complex model.
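
For the curious, here is a minimal sketch of what the 5-fold cross-validation mentioned above could look like for the full model (our own addition; the fold assignment and the 0.5 threshold are arbitrary choices):

# Minimal 5-fold cross-validation sketch on the complete cases
set.seed(42)
k <- 5
pima_cc <- PimaIndiansDiabetes2 |> 
  drop_na()
folds <- sample(rep(1:k, length.out = nrow(pima_cc)))
accuracies <- sapply(1:k, function(i) {
  fit <- glm(diabetes ~ ., data = pima_cc[folds != i, ], family = "binomial")
  prob <- predict(fit, pima_cc[folds == i, ], type = "response")
  mean(ifelse(prob > 0.5, "pos", "neg") == pima_cc$diabetes[folds == i])
})
mean(accuracies) # average accuracy over the five folds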

21.3.4 Exercise 4

The model would achieve an accuracy of 90% since \(\text{Accuracy} = \#\text{Correct predictions} / \#\text{Total predictions} = 90 / 100 = 0.9 = 90 \%\).