Matrix Formulation
Multiple Linear Regression
Model Assumptions
\[ Y_i = \boldsymbol X_i^\mathrm T \boldsymbol \beta + \epsilon_i \]
\(Y_i\): Outcome Variable
\(\boldsymbol X_i=(1, X_i)^\mathrm T\): Predictors
\(\boldsymbol \beta = (\beta_0, \beta_1)^\mathrm T\): Coefficients
\(\epsilon_i\): Error term
For \(n\) data points
\[ \boldsymbol Y = \boldsymbol X\boldsymbol \beta + \boldsymbol \epsilon \]
\(\boldsymbol Y = (Y_1, \cdots, Y_n)^\mathrm T\): Outcome Variable
\(\boldsymbol X=(\boldsymbol X_1, \cdots, \boldsymbol X_n)^\mathrm T\): Predictors
\(\boldsymbol \beta = (\beta_0, \beta_1)^\mathrm T\): Coefficients
\(\boldsymbol \epsilon = (\epsilon_1, \cdots, \epsilon_n)^\mathrm T\): Error terms
Least-squares criterion, minimized over \(\boldsymbol \beta\):
\[ (\boldsymbol Y - \boldsymbol X\boldsymbol \beta)^\mathrm T(\boldsymbol Y - \boldsymbol X\boldsymbol \beta) \]
Least-squares estimator:
\[ \hat{\boldsymbol \beta} = (\boldsymbol X ^\mathrm T\boldsymbol X)^{-1}\boldsymbol X ^\mathrm T\boldsymbol Y \]
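As a minimal sketch of the estimator above, the following NumPy snippet builds a design matrix whose rows are \((1, x_i)\) and solves the normal equations \(\boldsymbol X^\mathrm T\boldsymbol X\boldsymbol \beta = \boldsymbol X^\mathrm T\boldsymbol Y\). The function name `ols_fit` and the data are hypothetical, chosen only for illustration; the outcome is generated without noise from \(Y = 2 + 3x\), so the estimate should recover those two coefficients up to rounding.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares estimate obtained by solving (X^T X) beta = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical data: n = 5 points lying exactly on Y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])  # n x 2 design matrix, rows (1, x_i)
y = 2.0 + 3.0 * x

beta_hat = ols_fit(X, y)
print(beta_hat)  # intercept and slope estimates
```

Solving the system with `np.linalg.solve` is algebraically the same as applying \((\boldsymbol X^\mathrm T\boldsymbol X)^{-1}\boldsymbol X^\mathrm T\boldsymbol Y\), but is usually the more numerically stable choice.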
Multivariable linear regression models are used when more than one explanatory variable is needed to explain the outcome of interest.
To include an additional continuous predictor, add it to the model as another term:
\[ Y = \beta_0 +\beta_1 X_1 + \beta_2 X_2 \]
A categorical variable can be included in a model, but a reference category must be specified.
To fit a model with a categorical variable, we use dummy (binary) variables that indicate which category an observation belongs to. We use \(C-1\) dummy variables, where \(C\) is the number of categories. When coded correctly, each category is represented by a unique combination of dummy-variable values.
If we have 4 categories, we will need 3 dummy variables:
| | Cat 1 | Cat 2 | Cat 3 | Cat 4 |
|---|---|---|---|---|
| Dummy 1 | 1 | 0 | 0 | 0 |
| Dummy 2 | 0 | 1 | 0 | 0 |
| Dummy 3 | 0 | 0 | 1 | 0 |
Which one is the reference category?
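In the table above, Cat 4 is the only category with a 0 for every dummy variable, so it acts as the reference. The sketch below reproduces that coding with NumPy. The helper `dummy_code` and the sample labels are hypothetical and only illustrate the idea; statistical software typically builds these columns automatically.

```python
import numpy as np


def dummy_code(labels, categories, reference):
    """Return an (n, C-1) 0/1 matrix and the names of the kept categories.

    Observations in the reference category get a row of all zeros.
    """
    labels = np.asarray(labels)
    kept = [c for c in categories if c != reference]
    # One indicator column per non-reference category.
    columns = [(labels == c).astype(int) for c in kept]
    return np.column_stack(columns), kept


categories = ["Cat 1", "Cat 2", "Cat 3", "Cat 4"]
labels = ["Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 2"]  # hypothetical sample

D, kept = dummy_code(labels, categories, reference="Cat 4")
print(kept)  # categories that received a dummy variable
print(D)     # the row for the "Cat 4" observation is all zeros
```

Choosing a different `reference` changes which category is absorbed into the intercept; the other coefficients are then interpreted as differences from that category.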
\[ Y = \boldsymbol \beta^T\boldsymbol X \]
\(\boldsymbol \beta\): a column vector of regression coefficients
\(\boldsymbol X\): a column vector of predictor variables
Errors are normally distributed
Constant Variance
Linearity
Independence
No outliers
A residual analysis is used to assess the validity of the assumptions.
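A residual analysis usually relies on plots (residuals against fitted values, a normal Q-Q plot). As a rough numeric stand-in, the sketch below fits a model to simulated data and computes a few summaries related to the assumptions listed above. The data, the seed, the half-split comparison of spread, and the cutoff of 3 for flagging large standardized residuals are all assumptions made for illustration, not prescribed rules.

```python
import numpy as np

# Simulated data (hypothetical): Y = 1 + 0.5x + normal error.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept column
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=n)

# Least-squares fit, fitted values, and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# With an intercept, residuals average to zero and are uncorrelated
# with the fitted values by construction.
print("mean residual:", resid.mean())
print("corr(fitted, resid):", np.corrcoef(fitted, resid)[0, 1])

# Constant variance: compare residual spread for low vs. high fitted values.
order = np.argsort(fitted)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
print("sd (low fitted):", low.std(ddof=1), "sd (high fitted):", high.std(ddof=1))

# Outliers: standardize by the residual standard error and flag large values.
sigma_hat = np.sqrt(np.sum(resid**2) / (n - X.shape[1]))
std_resid = resid / sigma_hat
print("number of |standardized residual| > 3:", int(np.sum(np.abs(std_resid) > 3)))
```

Similar spreads in the two halves and few flagged points are consistent with the assumptions, but they do not prove them; plots remain the more informative check.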