Lab 4: Generalized Linear Models

Published

May 7, 2022

R Packages Used

library(VGAM) # Multinomial and Ordinal Regression
library(statmod) #Used to obtaine quatile residuals
library(survival) # Data sets
library(MASS) # Data Set
library(GLMsData) # Data Set
library(tidyverse)
library(haven) # Reading Stata File

References

For more information on this topic, I recommend you reading the following books:

Generalized Linear Models with Examples in R - Dunn
Vectorized Generalized Linear and Additive Models - Yee

Generalized Linear Models

In linear regression, we are fitting a model between a set of predictors and an outcome variable. The outcome variable is generally normally distributed. However, if the outcome is not normally distributed, you must use a generalized linear model to assess the association between a set of predictors and an outcome variable.

Link Functions

Due to the outcome variable being non-normally distributed, we must use link functions. The link function allows us to link the outcome variable to a linear model.

Logistic Regression

A logistic regression model is used when the outcome is binary, yes and no. The link function is the logit function that models the association between a set of linear predictors and the odds of observing yes:

\[ \log\left\{\frac{P(Y=1)}{P(Y=0)}\right\} = \boldsymbol{X^T\beta} \] Fitting a logistic regression requires you to use the glm() function and specifying family=binomial. Then use the summary function to obtain the coefficients and broom::tidy()¹ We are going to fit y the presence of bacteria and ap active or placebo drug and hilo high/low compliance.

logit_glm <- glm(y ~ ap + hilo, data = bacteria, 
                 family = binomial)
summary(logit_glm)


Call:
glm(formula = y ~ ap + hilo, family = binomial, data = bacteria)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.3539     0.2855   4.743 2.11e-06 ***
app           0.7933     0.3748   2.116   0.0343 *  
hilolo       -0.4816     0.3480  -1.384   0.1664    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217.38  on 219  degrees of freedom
Residual deviance: 209.87  on 217  degrees of freedom
AIC: 215.87

Number of Fisher Scoring iterations: 4

## OR ##

logit_glm %>% broom::tidy(exp=T)

# A tibble: 3 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    3.87      0.285      4.74 0.00000211
2 app            2.21      0.375      2.12 0.0343    
3 hilolo         0.618     0.348     -1.38 0.166

A \(\beta\) coefficient is interpreted as an odds ratio. The odds of having bacteria presence is 2.21 times higher with a placebo drug than when an active drug is used, adjusting for high/low compliance.

Poisson Regression

Poisson regression is used when the outcome is count data. The link function is the log of the mean count:

\[ \log(\mu)=\boldsymbol{X^T\beta} \]

Fitting a Poisson regression requires you to use the glm() function and specifying family=poisson. Then use the summary function to obtain the coefficients and broom::tidy()²

pois_glm <- glm(recur ~ treatment + number, data = bladder1, 
                family = poisson)
summary(pois_glm)


Call:
glm(formula = recur ~ treatment + number, family = poisson, data = bladder1)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.00918    0.06057  16.661  < 2e-16 ***
treatmentpyridoxine  0.25506    0.06889   3.702 0.000214 ***
treatmentthiotepa   -0.45167    0.08626  -5.236 1.64e-07 ***
number               0.11603    0.01620   7.164 7.82e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 868.47  on 293  degrees of freedom
Residual deviance: 772.19  on 290  degrees of freedom
AIC: 1529.5

Number of Fisher Scoring iterations: 5

pois_glm %>% broom::tidy(exp=T)

# A tibble: 4 × 5
  term                estimate std.error statistic  p.value
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            2.74     0.0606     16.7  2.51e-62
2 treatmentpyridoxine    1.29     0.0689      3.70 2.14e- 4
3 treatmentthiotepa      0.637    0.0863     -5.24 1.64e- 7
4 number                 1.12     0.0162      7.16 7.82e-13

The line pois_glm %>% broom::tidy(exp=T) provides the exponentiated values of the \(\beta\) coefficients. Interpreting the number variable, as the initial number of tumors increases, the mean number of recurrences increases by a factor of 1.12, adjusting for treatment status.

Questions

Answer the following questions and check for outliers.

The ccancer data set in GLMSData package contains information on the number of deaths Count for different regions in Canada. Fit a model on the variable Count and the independent variables Gender and Region. Use data("ccancer") to load the data.
The diabetic data set in the survival package contains information on diabetic retinopathy prescence (status) after treatment and follow-up. Fit a model on status with independent variables age and trt (treatment).
Submit your assignment on canvas as an RMD file

Due May 14 @ 11:59 PM

Footnotes

The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎
The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎