library(VGAM) # Multinomial and Ordinal Regression
library(statmod) #Used to obtaine quatile residuals
library(survival) # Data sets
library(MASS) # Data Set
library(GLMsData) # Data Set
library(tidyverse)
library(haven) # Reading Stata File
Lab 4: Generalized Linear Models
R Packages Used
References
For more information on this topic, I recommend you reading the following books:
Generalized Linear Models with Examples in R - Dunn
Vectorized Generalized Linear and Additive Models - Yee
Generalized Linear Models
In linear regression, we are fitting a model between a set of predictors and an outcome variable. The outcome variable is generally normally distributed. However, if the outcome is not normally distributed, you must use a generalized linear model to assess the association between a set of predictors and an outcome variable.
Link Functions
Due to the outcome variable being non-normally distributed, we must use link functions. The link function allows us to link the outcome variable to a linear model.
Logistic Regression
A logistic regression model is used when the outcome is binary, yes and no. The link function is the logit function that models the association between a set of linear predictors and the odds of observing yes:
\[
\log\left\{\frac{P(Y=1)}{P(Y=0)}\right\} = \boldsymbol{X^T\beta}
\] Fitting a logistic regression requires you to use the glm()
function and specifying family=binomial
. Then use the summary function to obtain the coefficients and broom::tidy()
1 We are going to fit y
the presence of bacteria and ap
active or placebo drug and hilo
high/low compliance.
<- glm(y ~ ap + hilo, data = bacteria,
logit_glm family = binomial)
summary(logit_glm)
Call:
glm(formula = y ~ ap + hilo, family = binomial, data = bacteria)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3539 0.2855 4.743 2.11e-06 ***
app 0.7933 0.3748 2.116 0.0343 *
hilolo -0.4816 0.3480 -1.384 0.1664
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 217.38 on 219 degrees of freedom
Residual deviance: 209.87 on 217 degrees of freedom
AIC: 215.87
Number of Fisher Scoring iterations: 4
## OR ##
%>% broom::tidy(exp=T) logit_glm
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.87 0.285 4.74 0.00000211
2 app 2.21 0.375 2.12 0.0343
3 hilolo 0.618 0.348 -1.38 0.166
A \(\beta\) coefficient is interpreted as an odds ratio. The odds of having bacteria presence is 2.21 times higher with a placebo drug than when an active drug is used, adjusting for high/low compliance.
Poisson Regression
Poisson regression is used when the outcome is count data. The link function is the log of the mean count:
\[ \log(\mu)=\boldsymbol{X^T\beta} \]
Fitting a Poisson regression requires you to use the glm()
function and specifying family=poisson
. Then use the summary function to obtain the coefficients and broom::tidy()
2
<- glm(recur ~ treatment + number, data = bladder1,
pois_glm family = poisson)
summary(pois_glm)
Call:
glm(formula = recur ~ treatment + number, family = poisson, data = bladder1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.00918 0.06057 16.661 < 2e-16 ***
treatmentpyridoxine 0.25506 0.06889 3.702 0.000214 ***
treatmentthiotepa -0.45167 0.08626 -5.236 1.64e-07 ***
number 0.11603 0.01620 7.164 7.82e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 868.47 on 293 degrees of freedom
Residual deviance: 772.19 on 290 degrees of freedom
AIC: 1529.5
Number of Fisher Scoring iterations: 5
%>% broom::tidy(exp=T) pois_glm
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.74 0.0606 16.7 2.51e-62
2 treatmentpyridoxine 1.29 0.0689 3.70 2.14e- 4
3 treatmentthiotepa 0.637 0.0863 -5.24 1.64e- 7
4 number 1.12 0.0162 7.16 7.82e-13
The line pois_glm %>% broom::tidy(exp=T)
provides the exponentiated values of the \(\beta\) coefficients. Interpreting the number
variable, as the initial number of tumors increases, the mean number of recurrences increases by a factor of 1.12, adjusting for treatment status.
Questions
Answer the following questions and check for outliers.
The
ccancer
data set inGLMSData
package contains information on the number of deathsCount
for different regions in Canada. Fit a model on the variableCount
and the independent variablesGender
andRegion
. Usedata("ccancer")
to load the data.The
diabetic
data set in thesurvival
package contains information on diabetic retinopathy prescence (status
) after treatment and follow-up. Fit a model onstatus
with independent variablesage
andtrt
(treatment).
Submit your assignment on canvas as an RMD file
Due May 14 @ 11:59 PM