Homework 8

Published

May 4, 2023

Homework 8 is due 5/12/2023 at 11:59 PM. Submit your homework on Canvas as one PDF document.

Problem 1

The palmerpenguins dataset is a collection of morphological measurements and observational data for three species of penguins (Adelie, Gentoo, and Chinstrap) collected from 2007 to 2009 on the Palmer Archipelago, Antarctica. The dataset consists of 344 penguin observations with 8 variables: species, island, sex, bill length, bill depth, flipper length, body mass, and year.

Researchers want to know whether there is a significant relationship between body_mass_g (body mass measured in grams) and a set of predictors: flipper_length_mm, bill_length_mm, and bill_depth_mm (each measured in mm). They decide to run a multiple linear regression to describe the relationship between the variables. The output is provided below:


Call:
lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm + 
    bill_depth_mm)

Residuals:
     Min       1Q   Median       3Q      Max 
-1051.37  -284.50   -20.37   241.03  1283.51 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6445.476    566.130 -11.385   <2e-16 ***
flipper_length_mm    50.762      2.497  20.327   <2e-16 ***
bill_length_mm        3.293      5.366   0.614    0.540    
bill_depth_mm        17.836     13.826   1.290    0.198    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393 on 329 degrees of freedom
Multiple R-squared:  0.7639,    Adjusted R-squared:  0.7618 
F-statistic: 354.9 on 3 and 329 DF,  p-value: < 2.2e-16

Describe the model. Is there a significant relationship between body mass and the predictors? What is the predicted body mass of a penguin with a flipper length of 213 mm, a bill length of 43 mm, and a bill depth of 17 mm?

Problem 2

In logistic regression, the odds ratio is a measure of the strength of the relationship between the predictor variables and the dependent variable. It represents the ratio of the odds of the dependent variable for two different values of the predictor variable, holding all other variables constant.

To interpret the odds ratio, we first need to understand what odds are. Odds are a way of expressing the probability of an event occurring as a ratio of the number of times the event occurs to the number of times it does not occur. For example, if a coin has a 50% chance of landing heads, the odds of it landing heads are 1:1 or 1/1.
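The conversion from probability to odds described above can be sketched in a few lines of Python. The function name `odds` is an illustrative choice, not something defined in this assignment.

```python
def odds(p: float) -> float:
    """Convert a probability p (0 <= p < 1) into odds, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


# Fair coin from the text: a 50% chance of heads gives odds of 1, i.e. 1:1.
print(odds(0.5))

# A 75% chance corresponds to odds of 3, i.e. 3:1.
print(odds(0.75))
```

Odds above 1 mean the event is more likely to occur than not; odds below 1 mean the opposite.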

To compute the odds ratio:

1. Determine the beta coefficient for the predictor variable of interest from the logistic regression model output.
2. Exponentiate the beta coefficient. This can be done using a scientific calculator or software like R or Python. For example, if the beta coefficient is 0.8, then e^0.8 = 2.2255.
3. Interpret the resulting value as the odds ratio. The odds ratio is the multiplicative change in the odds of the outcome for a one-unit increase in the predictor variable, holding all other variables constant. For example, an odds ratio of 2.2255 means that each one-unit increase in the predictor multiplies the odds of the outcome by 2.2255.
4. Consider the confidence interval and the p-value associated with the odds ratio. The confidence interval gives a range of plausible values for the true odds ratio at a stated confidence level (for example, 95%). A p-value less than 0.05 indicates that the odds ratio differs significantly from 1 at the 5% level.
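The steps above can be sketched in Python using the coefficient 0.8 from the example. The standard error of 0.3 is a hypothetical value chosen only for illustration; it does not come from any model in this homework. The interval shown is the usual approximate Wald interval, obtained by exponentiating the endpoints of beta plus or minus 1.96 standard errors.

```python
import math

beta = 0.8  # coefficient from the example in the text
se = 0.3    # hypothetical standard error, for illustration only

# Steps 2 and 3: exponentiate the coefficient to get the odds ratio.
odds_ratio = math.exp(beta)

# Step 4: approximate 95% Wald confidence interval on the odds-ratio scale.
z = 1.96
ci_low = math.exp(beta - z * se)
ci_high = math.exp(beta + z * se)

print(f"odds ratio: {odds_ratio:.4f}")
print(f"approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval excludes 1, the predictor's association with the odds of the outcome is significant at the corresponding level, which agrees with reading the p-value in the model output.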

Researchers are interested in whether the probability of being admitted to graduate school is affected by an individual’s GRE score and GPA. They fit a logistic regression model for the probability of being admitted. The output is printed below:


Call:
glm(formula = admit ~ gre + gpa, family = "binomial", data = mydata)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.949378   1.075093  -4.604 4.15e-06 ***
gre          0.002691   0.001057   2.544   0.0109 *  
gpa          0.754687   0.319586   2.361   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 480.34  on 397  degrees of freedom
AIC: 486.34

Number of Fisher Scoring iterations: 4

Describe the model. Is there a significant relationship between the probability of being admitted and the predictors?

Problem 3

In Poisson regression, the beta coefficient ($\beta$) represents the change in the logarithm of the expected count of the response variable for a one-unit increase in the corresponding predictor variable while holding all other variables constant.

More specifically, the Poisson regression model assumes that the expected count of the response variable $Y$, denoted $\lambda$, is related to the predictor variables $X_1, \dots, X_p$ through the following equation:

$$\lambda = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$

where $\beta_0$ is the intercept term, $\beta_1$ to $\beta_p$ are the coefficients for the predictor variables $X_1$ to $X_p$, and $\exp()$ is the exponential function.

Thus, a beta coefficient of $\beta_j$ means that, for a one-unit increase in the predictor variable $X_j$, the expected count of the response variable $Y$ is multiplied by a factor of $\exp(\beta_j)$, while holding all other predictor variables constant.

For example, if the coefficient for a predictor variable is 0.2, then a one-unit increase in that predictor variable multiplies the expected count of the response variable by $\exp(0.2) \approx 1.22$, an increase of about 22%, while holding all other predictor variables constant.
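As a rough illustration of the multiplicative interpretation, the following Python sketch evaluates the expected count at two predictor values one unit apart. The helper `expected_count` and the intercept of -1.0 are hypothetical and chosen only for illustration; the coefficient 0.2 is the one from the example above.

```python
import math


def expected_count(intercept, coefs, x):
    """Expected count: exp(intercept + sum of coefficient * predictor value)."""
    linear_predictor = intercept + sum(b * v for b, v in zip(coefs, x))
    return math.exp(linear_predictor)


b0 = -1.0      # hypothetical intercept, for illustration only
coefs = [0.2]  # single coefficient from the example in the text

lam_at_3 = expected_count(b0, coefs, [3.0])
lam_at_4 = expected_count(b0, coefs, [4.0])

# The ratio of expected counts one unit apart equals exp(0.2), about 1.22,
# regardless of the intercept or the starting value of the predictor.
print(round(lam_at_4 / lam_at_3, 4))
```

The same function applies with several predictors: pass one coefficient and one predictor value per term, with indicator variables set to 0 for the reference category.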

Researchers want to know whether the number of awards earned by high school students can be explained by the type of program (prog: General, Vocational, Academic) and their final math exam score (math). They fit a Poisson regression of the number of awards on the predictors prog (reference = General) and math. The output is printed below:


Call:
glm(formula = num_awards ~ prog + math, family = "poisson", data = p)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
progAcademic    1.08386    0.35825   3.025  0.00248 ** 
progVocational  0.36981    0.44107   0.838  0.40179    
math            0.07015    0.01060   6.619 3.63e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 287.67  on 199  degrees of freedom
Residual deviance: 189.45  on 196  degrees of freedom
AIC: 373.5

Number of Fisher Scoring iterations: 6

Describe the model. Is there a significant relationship between the expected number of awards and the predictors? What is the expected number of awards for a student in the General program with a math score of 75?