Regression

Linear Regression

Weekly design


Pre-class video

  • Eng ver.
  • Kor ver.
  • Pre-class PPT pdf

Discussion

Discussion #6


Class


Motivation

Linear regression is a foundational technique in statistical analysis and machine learning that helps us understand and quantify relationships between variables. As social scientists, we often aim to analyze the effect of certain factors on an outcome of interest. Linear regression provides us with a way to model these relationships, quantify their effects, and make predictions based on our findings. By mastering linear regression, social scientists can gain valuable insights into various phenomena, test hypotheses, and make data-driven decisions.


Usage and Importance

Linear regression is widely used in social science research for several reasons:

  • Simplicity: Linear regression is relatively easy to understand and implement, making it an accessible method for researchers across disciplines. Despite its simplicity, it can often provide valuable insights and predictions.

  • Interpretability: The coefficients obtained from linear regression have a clear interpretation, allowing researchers to understand the effect of each independent variable on the dependent variable.

  • Basis for Advanced Techniques: Linear regression serves as a foundation for more advanced statistical and machine learning techniques. Gaining a deep understanding of linear regression helps social scientists better understand and apply these more advanced methods.

Real-world Applications

Linear regression has a wide range of applications in social science research. Some examples include:

  • Economics: Linear regression can be used to study the impact of various factors on economic indicators, such as GDP growth, unemployment rate, and inflation.

  • Political Science: Researchers can use linear regression to analyze the effects of political factors on election outcomes, public opinion, or policy decisions.

  • Sociology: Linear regression can help us understand the relationship between social variables, such as education level, income, and various social outcomes like crime rates, health status, and life satisfaction.

  • Psychology: Researchers can use linear regression to study the effects of different psychological factors on human behavior, mental health, and well-being.

  • Education: Linear regression can be used to analyze the impact of various factors on educational outcomes, such as standardized test scores, graduation rates, and college enrollment.

Overall, linear regression is a versatile and powerful tool for social scientists, enabling them to gain insights into the relationships between variables and make evidence-based predictions.


Theory

Simple Linear Regression

Simple linear regression is a statistical method that helps us understand the relationship between one dependent variable (y) and one independent variable (x). It models the relationship as a linear function.

Equation:

\[ y = β_1 + β_2x + ε \]

  • \(y\) : dependent variable (outcome)

  • \(x\) : independent variable (predictor)

  • \(β_1\) : intercept (value of y when x = 0)

  • \(β_2\) : slope (change in y for a one-unit increase in x)

  • \(ε\) : error term (difference between the predicted and observed values of y)


Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression that allows us to model the relationship between one dependent variable (y) and multiple independent variables (x₁, x₂, …, xₙ). It is useful when we want to analyze the impact of several predictors on an outcome variable.

Equation

\[ y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε \]

  • \(y\) : dependent variable (outcome)

  • \(x₁, x₂, …, xₙ\) : independent variables (predictors)

  • \(β₀\) : intercept (value of y when all x’s are 0)

  • \(β₁, β₂, …, βₙ\) : coefficients (change in y for a one-unit increase in the corresponding x)

  • \(ε\) : error term (difference between the predicted and observed values of y)


Assumptions of Linear Regression

  • Linearity: The relationship between the dependent variable and the independent variables is linear.

  • Independence: The observations in the dataset are independent of each other.

  • Homoscedasticity: The variance of the error term is constant for all values of the independent variables.

  • Normality: The error term follows a normal distribution.

  • No multicollinearity: The independent variables are not highly correlated with each other.


Coefficient Estimation: Least Squares (LS) Method

Minimize the sum of the squared differences between the observed and predicted values of the dependent variable.

  • Formula:

    \[ β = (X'X)^{-1}X'y \]where X is the matrix of independent variables, y is the dependent variable, and β is the vector of coefficients.

Model Evaluation Metrics

  • R-squared (Coefficient of Determination): Proportion of the variance in the dependent variable that can be explained by the independent variables. Ranges from 0 to 1.

    \[ R^2 = 1- \frac{SSE}{SST} \]

    \[ SSE = \sum(y_i - \hat{y_i})^2 \]

    \[ SST=\sum(y_i - \bar{y_i})^2 \]

    where SSE is the sum of squared errors and SST is the sum of squared total

  • Adjusted R-squared: R-squared adjusted for the number of predictors in the model. Useful for comparing models with different numbers of predictors.

    \[ Adj.R^2=1-\frac{(1-R^2)(N-1)}{N-p-1} \]

    where \(R^2\) is sample R-squared, \(N\) is Total Sample Size, and \(p\) is the number of independent variables

  • Root Mean Squared Error (RMSE): The square root of the average squared differences between the observed and predicted values of the dependent variable. A measure of the model’s prediction accuracy.

    \[ RMSE = \sqrt{\frac{\sum_{i=1}^{N}{(y_i-\hat{y_i})^2}}{N}} \]

    where N is the number of data points (observations)


Pop-up Quizzes

  1. What is the main objective of simple linear regression?

    1. To predict the value of a dependent variable based on the values of multiple independent variables.
    2. To predict the value of a dependent variable based on the value of one independent variable.
    3. To predict the values of independent variables based on the value of a dependent variable.
    4. To find the correlation between two independent variables.
  2. In a multiple linear regression model, what happens when two or more independent variables are highly correlated with each other?

    1. The model becomes more accurate.
    2. The model becomes more interpretable.
    3. Multicollinearity occurs, which can affect the stability and interpretation of the coefficients.
    4. The model’s R-squared value decreases significantly.
  3. Which of the following is NOT an assumption of linear regression?

    1. Linearity
    2. Independence
    3. Homoscedasticity
    4. Exponential distribution of the error term
  4. In a simple linear regression model with the equation y = β₀ + β₁x + ε, what does β₁ represent?

    1. The intercept of the model, or the value of y when x = 0.
    2. The slope of the model, or the change in y for a one-unit increase in x.
    3. The error term, or the difference between the predicted and observed values of y.
    4. The coefficient of determination, or the proportion of the variance in y explained by x.
  5. Which of the following metrics can be used to evaluate the performance of a linear regression model?

    1. R-squared
    2. Adjusted R-squared
    3. Root Mean Squared Error (RMSE)
    4. All of the above

    Answers: bcdbd


Hands-on Practice

For this hands-on practice, we will use the mtcars dataset, which is built into R. The dataset contains information about various car models, including miles per gallon (mpg), number of cylinders (cyl), horsepower (hp), and weight (wt). The goal is to predict miles per gallon based on the number of cylinders, horsepower, and weight using linear regression.

# Load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the 'mtcars' dataset
data(mtcars)

# View the first few rows of the dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Exploratory Data Analysis

# Summary statistics
summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
# Visualize relationships between variables using scatterplots
pairs(mtcars[, c("mpg", "cyl", "hp", "wt")])

Simple Linear Regression in R (Predicting mpg based on weight)

# Fit a simple linear regression model
simple_model <- lm(mpg ~ wt, data = mtcars)

# Model summary and interpretation
summary(simple_model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
# Model diagnostics (residuals vs. fitted values)
plot(simple_model, which = 1)

Hypothesis Testing and Statistical Significance in Linear Regression

T-statistics and p-values are essential concepts in statistical hypothesis testing and linear regression analysis.

  • T-statistics

    A t-statistic is a measure of how many standard deviations a regression coefficient is from zero. It is used to test the null hypothesis that there is no relationship between the independent and dependent variables (i.e., the coefficient is zero). A higher t-statistic value indicates a stronger relationship between the variables.

    The t-statistic for a regression coefficient can be calculated as:

    \[ t = \frac{\beta - H₀}{se(\beta)} \]

    where \(t\) is the t-statistic, \(\beta\) is the estimated regression coefficient, \(H₀\) is the null hypothesis value (usually 0), and \(se(\beta)\) is the standard error of the estimated coefficient.

    See appendix (further study part) if you want to dig in more about the way of estimating \(se(\beta)\).


Note
  • t-통계량클 수록 좋음: 분자는 클 수록, 분모는 작을 수록

  • 분자가 크려면: 회귀 계수 (Beta의 추정값)이 커야함

  • 분모가 작으려면: 회귀 계수의 표준 오차가 작아야함 (표준 오차 추정법은 부록 참고)

  • 회귀 계수의 표준 오차가 작으려면: \(se(\beta)\) = MSE / (X의 표준편차 * 표본수) 이므로 MSE가 작아야 하고 표본수가 커야함.

  • 종합하면, 회귀 계수가 크고, MSE가 작고, 표본 수가 커질 수록 t-통계량이 커진다


  • P-values

    A p-value is the probability of obtaining a test statistic as extreme as the observed value under the null hypothesis. It helps us determine the statistical significance of a regression coefficient. In general, a smaller p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, suggesting that the coefficient is significantly different from zero.

    To calculate the p-value for a t-statistic, we use the cumulative distribution function (CDF) of the t-distribution with n - k degrees of freedom, where n is the number of observations and k is the number of estimated coefficients (including the intercept).

    \[ P(T > |t|) = 1 - CDF(t, df = n - k) \]


Multiple Linear Regression in R (Adding number of cylinders and horsepower as predictors)

# Fit a multiple linear regression model
multiple_model <- lm(mpg ~ cyl + hp + wt, data = mtcars)

# Model summary and interpretation
summary(multiple_model)

Call:
lm(formula = mpg ~ cyl + hp + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9290 -1.5598 -0.5311  1.1850  5.8986 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
cyl         -0.94162    0.55092  -1.709 0.098480 .  
hp          -0.01804    0.01188  -1.519 0.140015    
wt          -3.16697    0.74058  -4.276 0.000199 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared:  0.8431,    Adjusted R-squared:  0.8263 
F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
# Model diagnostics (residuals vs. fitted values)
plot(multiple_model, which = 1)

Model Evaluation and Comparison

# Calculate R-squared and adjusted R-squared for both models
simple_r_squared <- summary(simple_model)$r.squared
simple_adj_r_squared <- summary(simple_model)$adj.r.squared

multiple_r_squared <- summary(multiple_model)$r.squared
multiple_adj_r_squared <- summary(multiple_model)$adj.r.squared

# Compare R-squared and adjusted R-squared values
cat("Simple Model - R-squared:", simple_r_squared, "Adjusted R-squared:", simple_adj_r_squared, "\n")
Simple Model - R-squared: 0.7528328 Adjusted R-squared: 0.7445939 
cat("Multiple Model - R-squared:", multiple_r_squared, "Adjusted R-squared:", multiple_adj_r_squared, "\n")
Multiple Model - R-squared: 0.84315 Adjusted R-squared: 0.8263446 

Model Predictions

# Make predictions using the multiple linear regression model
new_data <- data.frame(
  cyl = c(4, 6, 8),
  hp = c(100, 150, 200),
  wt = c(2.5, 3.0, 3.5)
)

predicted_mpg <- predict(multiple_model, newdata = new_data)

# View predicted mpg values
predicted_mpg
       1        2        3 
25.26408 20.89545 16.52683 


Addressing Multi-collinearity

# Check for multicollinearity using the Variance Inflation Factor (VIF)
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
vif(multiple_model)
     cyl       hp       wt 
4.757456 3.258481 2.580486 

Variance Inflation Factor (VIF) is a measure used to detect the presence and severity of multicollinearity in a multiple linear regression model. Multicollinearity occurs when two or more independent variables in the model are highly correlated, which can lead to instability in the estimated regression coefficients and make it difficult to interpret their individual effects on the dependent variable.

If VIF values are significantly greater than 1 (> 5 or 10), consider removing or combining correlated predictors

VIF for the j-th independent variable can be calculated as:

\[ VIF(j) = \frac{1}{1 - R²(j)} \]

Here, \(R²(j)\) is the coefficient of determination (R-squared) of the regression model when the j-th independent variable is regressed on all the other independent variables in the model. In other words, \(R²(j)\) measures the proportion of variance in the j-th independent variable that can be explained by the other independent variables.

If the VIF value for a particular independent variable is close to 1, it means that there is no significant multicollinearity between that variable and the other independent variables. As the VIF value increases, it suggests a higher degree of multicollinearity.

The general interpretation of VIF values is as follows:

  • VIF = 1: No multicollinearity

  • VIF between 1 and 5: Moderate multicollinearity

  • VIF greater than 5 or 10: High multicollinearity (threshold values may vary depending on the field of study)

If high multicollinearity is detected, it is often advisable to address the issue by removing or combining correlated predictors, or by using regularization techniques such as Lasso, Ridge, or Elastic Net regression. This can help improve the stability and interpretability of the regression coefficients.


Optional: Regularization techniques (Lasso, Ridge, and Elastic Net)


Lasso, Ridge, and Elastic Net are regularization techniques used in linear regression models to address issues like multicollinearity, overfitting, and feature selection. They work by adding a penalty term to the linear regression’s objective function, which helps to shrink the coefficients towards zero and simplify the model. Here’s a brief explanation of each technique along with the relevant equations:

  1. Lasso Regression (Least Absolute Shrinkage and Selection Operator)

    Lasso regression adds an L1 penalty term to the linear regression’s objective function. The L1 penalty term is the sum of the absolute values of the coefficients. The objective function for Lasso regression is:

    \[ Objective = RSS + λ Σ|β_j| \]

    where:

    • \(RSS\) is the residual sum of squares.

    • \(β_j\) represents the j-th coefficient in the model.

    • \(λ\) (lambda) is the regularization parameter that controls the strength of the L1 penalty. Higher values of λ result in more shrinkage and simpler models.

    Lasso regression can drive some coefficients to zero, effectively performing feature selection by excluding irrelevant variables from the model.


  1. Ridge Regression

    Ridge regression adds an L2 penalty term to the linear regression’s objective function. The L2 penalty term is the sum of the squares of the coefficients. The objective function for Ridge regression is:

    \[ Objective = RSS + λ Σ(β_j)^2 \]

    where:

    • \(RSS\) is the residual sum of squares.

    • \(β_j\) represents the j-th coefficient in the model.

    • \(λ\) (lambda) is the regularization parameter that controls the strength of the L2 penalty. Higher values of λ result in more shrinkage and simpler models.

    Ridge regression doesn’t drive coefficients to zero but can shrink them close to zero, leading to a more stable and interpretable model, especially when multicollinearity is present.


  1. Elastic Net Regression

    Elastic Net regression combines both L1 and L2 penalty terms, effectively blending Lasso and Ridge regression (진리의 반반). The objective function for Elastic Net regression is:

    \[ Objective = RSS + λ [(1 - α) Σ(β_j)^2 + α Σ|β_j|] \]

    where:

    • \(RSS\) is the residual sum of squares.

    • \(β_j\) represents the j-th coefficient in the model.

    • \(λ\) (lambda) is the regularization parameter that controls the overall strength of the penalty.

    • \(α\) (alpha) is the mixing parameter that determines the balance between L1 (Lasso) and L2 (Ridge) penalties.

      • α = 1 results in Lasso regression,

      • α = 0 results in Ridge regression,

      • and values between 0 and 1 produce a mix of both.

    Elastic Net regression can be useful when there are many correlated predictors, as it can perform feature selection like Lasso while maintaining the stability and robustness of Ridge regression.


Let’s learn how to code lasso, ridge, and elastic net regression.

# Load necessary library
library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-8
# Prepare data for regularization
x <- model.matrix(mpg ~ cyl + hp + wt, data = mtcars)[, -1]
y <- mtcars$mpg

# Fit Lasso, Ridge, and Elastic Net models
lasso_model <- glmnet(x, y, alpha = 1)
ridge_model <- glmnet(x, y, alpha = 0)
elastic_net_model <- glmnet(x, y, alpha = 0.5)

# Cross-validation to find the optimal lambda value
cv_lasso <- cv.glmnet(x, y, alpha = 1)
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_elastic_net <- cv.glmnet(x, y, alpha = 0.5)

# Model summary and interpretation
cat("Lasso - Optimal Lambda:", cv_lasso$lambda.min, "\n")
Lasso - Optimal Lambda: 0.2176783 
cat("Ridge - Optimal Lambda:", cv_ridge$lambda.min, "\n")
Ridge - Optimal Lambda: 0.5146981 
cat("Elastic Net - Optimal Lambda:", cv_elastic_net$lambda.min, "\n")
Elastic Net - Optimal Lambda: 0.3293313 
# Make predictions using Lasso, Ridge, and Elastic Net models:
# Create new data for predictions
new_data <- data.frame(
  cyl = c(4, 6, 8),
  hp = c(100, 150, 200),
  wt = c(2.5, 3.0, 3.5)
)

# Prepare new data for predictions
new_data_x <- model.matrix(~ cyl + hp + wt, data = new_data)[, -1]

# Make predictions
lasso_predictions <- predict(cv_lasso, new_data_x, s = "lambda.min")
ridge_predictions <- predict(cv_ridge, new_data_x, s = "lambda.min")
elastic_net_predictions <- predict(cv_elastic_net, new_data_x, s = "lambda.min")

# View predictions
cat("Lasso Predictions:", lasso_predictions, "\n")
Lasso Predictions: 25.06234 20.87144 16.68053 
cat("Ridge Predictions:", ridge_predictions, "\n")
Ridge Predictions: 25.13371 20.8278 16.52189 
cat("Elastic Net Predictions:", elastic_net_predictions, "\n")
Elastic Net Predictions: 25.07144 20.8533 16.63516 

Conclusion

In this hands-on practice, we used the ‘mtcars’ dataset to predict miles per gallon (mpg) based on the number of cylinders, horsepower, and weight of a car. We started with a simple linear regression model using only weight as a predictor and then moved to a multiple linear regression model with three predictors. We also explored regularization techniques and made predictions using our multiple linear regression model.

Remember to always perform exploratory data analysis and check the assumptions of linear regression before fitting your models. Also, consider applying regularization techniques when multicollinearity is present or when the model is overfitting the data.


For your further study

Derivation of the Coefficient: Least Squares (LS) Method

The goal of the Least Squares (LS) method is to minimize the sum of the squared differences between the observed and predicted values of the dependent variable. This method is commonly used in linear regression to estimate the coefficients \(\beta\).


1. Objective Function

We aim to minimize the following objective function:

\[ \text{Objective: } \min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 = \min_{\beta} (y - X\beta)^T (y - X\beta) \]

Where: - \(y\) is the vector of observed dependent variable values. - \(X\) is the matrix of independent variables. - \(\beta\) is the vector of coefficients. - \(\hat{y_i} = X\beta\) is the vector of predicted values.


2. Expanding the Objective Function

Expand the objective function:

\[ (y - X\beta)^T (y - X\beta) = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta \]

3. Taking the Derivative

To minimize the objective function, take the derivative with respect to \(\beta\) and set it equal to 0:

\[ \frac{\partial}{\partial \beta} \left( y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta \right) = 0 \]

This simplifies to:

\[ -2 X^T y + 2 X^T X \beta = 0 \]

4. Solving for beta

Rearrange the equation to solve for \(\beta\):

\[ X^T X \beta = X^T y \]


Finally, solve for \(\beta\):

\[ \beta = (X^T X)^{-1} X^T y \]


Steps to Estimate the Standard Error \(\text{SE}(\hat{\beta})\)

  1. Residual Sum of Squares (RSS): First, compute the residuals (the difference between the observed values \(y\) and the predicted values \(\hat{y} = X\hat{\beta}\)) and the residual sum of squares.

    \[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{\epsilon}_i^2 \]

    Where:

    • \(\hat{y}_i = X_i \hat{\beta}\) are the predicted values.
    • \(\hat{\epsilon}_i = y_i - X_i \hat{\beta}\) are the residuals.
  2. Estimate of Variance of Errors \(\sigma^2\): The variance of the error terms \(\sigma^2\) is estimated as:

    \[ \hat{\sigma}^2 = \frac{\text{RSS}}{n - p} \]

    Where:

    • \(n\) is the number of observations.
    • \(p\) is the number of parameters in the model (including the intercept).
    • \(\text{RSS}\) is the residual sum of squares.
  3. Variance-Covariance Matrix of \(\hat{\beta}\): The variance-covariance matrix of the estimated coefficients \(\hat{\beta}\) is given by:

    \[ \text{Var}(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1} \]

    Where:

    • \(X\) is the design matrix (the matrix of independent variables, including the intercept).
    • \(X^T X\) is the matrix product of \(X\) transposed and \(X\).
    • \((X^T X)^{-1}\) is the inverse of the matrix \(X^T X\).
  4. Standard Error of \(\hat{\beta}\): The standard error of each coefficient \(\hat{\beta}_j\) (where \(j = 1, 2, ..., p\)) is the square root of the diagonal elements of the variance-covariance matrix:

    \[ \text{SE}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)} = \sqrt{\hat{\sigma}^2 \cdot (X^T X)^{-1}_{jj}} \]

    Where:

    • \((X^T X)^{-1}_{jj}\) is the \(j\)-th diagonal element of the matrix \((X^T X)^{-1}\).


Practice in R

# Generate some example data
set.seed(123)  # For reproducibility

# Create a matrix of independent variables (X) with an intercept
n <- 100  # Number of observations
p <- 3    # Number of predictors

# Create a matrix X of independent variables with an intercept
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # X includes intercept

# Create a true beta (coefficients)
beta_true <- c(2, 1.5, -3)  # True coefficients for the model
# intercept: 2
# B1: 1.5
# B2: -3

# Generate the dependent variable (y) with some noise
y <- X %*% beta_true + rnorm(n)

# Compute beta using the formula: beta = (X'X)^(-1) X'y

# Step 1: Compute X'X (transpose of X times X)
XtX <- t(X) %*% X

# Step 2: Compute X'y (transpose of X times y)
Xty <- t(X) %*% y

# Step 3: Compute the inverse of X'X
XtX_inv <- solve(XtX)

# Step 4: Compute the estimated beta
beta_hat <- XtX_inv %*% Xty



# Print the estimated coefficients
print("Estimated beta coefficients:")
[1] "Estimated beta coefficients:"
print(beta_hat)
          [,1]
[1,]  2.135065
[2,]  1.366828
[3,] -2.976189
# Compare the estimated beta with the true beta
print("True beta coefficients:")
[1] "True beta coefficients:"
print(beta_true)
[1]  2.0  1.5 -3.0
################################################
# Check how well the model fits
y_hat <- X %*% beta_hat  # Predicted values

# Compute residuals
residuals <- y - y_hat

# Print residual sum of squares (RSS)
rss <- sum(residuals^2)
print(paste("Residual Sum of Squares (RSS):", rss))
[1] "Residual Sum of Squares (RSS): 87.7818656295526"
# Step 5: Estimate variance of residuals (sigma^2)
sigma2_hat <- rss / (n - p)

# Step 6: Calculate standard errors of beta_hat
var_beta_hat <- sigma2_hat * solve(t(X) %*% X)
se_beta_hat <- sqrt(diag(var_beta_hat))


# --- Using lm() to fit the model ---

# Create a data frame to use with lm()
df <- data.frame(y = y, X1 = X[, 2], X2 = X[, 3])

# Fit the linear model using lm()
model_lm <- lm(y ~ X1 + X2, data = df)

# Print the estimated coefficients (lm())
print("Estimated beta coefficients (lm):")
[1] "Estimated beta coefficients (lm):"
print(coef(model_lm))
(Intercept)          X1          X2 
   2.135065    1.366828   -2.976189 
summary(model_lm)

Call:
lm(formula = y ~ X1 + X2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8730 -0.6607 -0.1245  0.6214  2.0798 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.13507    0.09614   22.21   <2e-16 ***
X1           1.36683    0.10487   13.03   <2e-16 ***
X2          -2.97619    0.09899  -30.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9513 on 97 degrees of freedom
Multiple R-squared:   0.92, Adjusted R-squared:  0.9183 
F-statistic: 557.6 on 2 and 97 DF,  p-value: < 2.2e-16


Maximum Likelihood Estimation

In linear regression, the method of least squares is commonly used to estimate the coefficients of the regression model. However, there is another estimation method called Maximum Likelihood Estimation (MLE) that can be used as an alternative to least squares. In this optional material, we will introduce the concept of MLE, explain how it works, and discuss its advantages and disadvantages compared to least squares.


Maximum Likelihood Estimation

Maximum Likelihood Estimation is a statistical method used to estimate the parameters of a model by finding the values that maximize the likelihood function. The likelihood function measures how likely the observed data is, given the parameters of the model. In the context of linear regression, MLE seeks to find the values of the coefficients that maximize the likelihood of observing the data, assuming that the error terms follow a normal distribution


MLE in Linear Regression

Let’s consider the linear regression model:

\[ y_i = β_0 + β_1 x_i + ε_i \]

where \(y_i\) is the dependent variable, \(x_i\) is the independent variable, \(β_0\) and \(β_1\) are the regression coefficients, and \(ε_i\) is the error term.

Assuming that the error terms \(ε_i\) are normally distributed with mean 0 and constant variance \(σ^2\), the probability density function (PDF) of the normal distribution for a single observation is:

\[ f(y_i | x_i, β_0, β_1, σ^2) = \frac{1}{σ \sqrt{2π}} exp(\frac{-(y_i - (β_0 + β_1 x_i))^2} {2 σ^2}) \]

The likelihood function is the product of the PDFs for all observations:

\[ L(β_0, β_1, σ^2) = Π f(y_i | x_i, β_0, β_1, σ^2) \]

To make the optimization problem easier, we take the natural logarithm of the likelihood function, which is called the log-likelihood function:

\[ logL(β_0, β_1, σ^2) = Σ log(f(y_i | x_i, β_0, β_1, σ^2)) \]

The goal of MLE is to find the values of \(β_0\), \(β_1\), and \(σ^2\) that maximize the log-likelihood function.


Practice in R

# Generate some example data
set.seed(123)  # For reproducibility

# Create a matrix of independent variables (X) with an intercept
n <- 100  # Number of observations
p <- 3    # Number of predictors

# Create a matrix X of independent variables with an intercept
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # X includes intercept

# Create a true beta (coefficients)
beta_true <- c(2, 1.5, -3)  # True coefficients for the model

# Generate the dependent variable (y) with some noise
y <- X %*% beta_true + rnorm(n)

# --- Define the log-likelihood function for linear regression ---
log_likelihood <- function(params) {
  beta <- params[1:p]  # Extract beta parameters
  sigma2 <- params[p + 1]  # Extract variance (sigma^2)
  
  # Calculate predicted values
  y_hat <- X %*% beta
  
  # Calculate residuals
  residuals <- y - y_hat
  
  # Compute the log-likelihood
  log_like <- -(n / 2) * log(2 * pi * sigma2) - (1 / (2 * sigma2)) * sum(residuals^2)
  
  return(-log_like)  # Return negative log-likelihood for minimization
}

# --- Initial parameter guesses for beta and sigma^2 ---
init_params <- c(rep(0, p), var(y))  # Initial guesses: zeros for beta, var(y) for sigma^2

# --- Perform MLE using optim to minimize the negative log-likelihood ---
mle_results <- optim(par = init_params, fn = log_likelihood, method = "BFGS")
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
Warning in log(2 * pi * sigma2): NaNs produced
# Extract estimated betas and sigma^2
beta_mle <- mle_results$par[1:p]
sigma2_mle <- mle_results$par[p + 1]

# Print the results
print("Estimated beta coefficients (MLE):")
[1] "Estimated beta coefficients (MLE):"
print(beta_mle)
[1]  2.135065  1.366829 -2.976189
print(paste("Estimated variance (sigma^2, MLE):", sigma2_mle))
[1] "Estimated variance (sigma^2, MLE): 0.877819388641911"
# Compare the estimated betas (MLE) with the true betas
print("True beta coefficients:")
[1] "True beta coefficients:"
print(beta_true)
[1]  2.0  1.5 -3.0
# Check how well the model fits using MLE
y_hat_mle <- X %*% beta_mle  # Predicted values using MLE betas

# Compute residual sum of squares (RSS) for MLE
rss_mle <- sum((y - y_hat_mle)^2)
print(paste("Residual Sum of Squares (RSS, MLE):", rss_mle))
[1] "Residual Sum of Squares (RSS, MLE): 87.7818656295527"


The optim() function in R is a general-purpose optimization function. Its role is to find the minimum or maximum of a given objective function. In the case of MLE, we want to maximize the log-likelihood (or equivalently minimize the negative log-likelihood) to find the best estimates for our model parameters β and \(\sigma^2\).

The optimization problem is often framed as minimizing some objective, so in MLE, we minimize the negative log-likelihood. The optim() function adjusts the parameter values to reduce the negative log-likelihood as much as possible, eventually finding the optimal estimates.

  • par = init_params: This argument provides the initial guesses for the parameters β and \(\sigma^2\) that we are trying to estimate. In our case:

    • We initialize β with zeros (or some other sensible guess).

    • We initialize \(\sigma^2\) with the variance of y, since it’s a reasonable starting point for the variance of the errors.

  • fn = log_likelihood: This is the objective function we are trying to minimize. In our case, it’s the negative log-likelihood function (log_likelihood). The goal is to minimize this function by adjusting the parameters β and \(\sigma^2\)

  • method = "BFGS": This specifies the optimization method. BFGS stands for Broyden–Fletcher–Goldfarb–Shanno algorithm, which is a popular method used for optimization when the objective function is smooth and differentiable (as is the case here). The BFGS algorithm is well-suited for problems like this because it approximates the second derivatives of the objective function (related to curvature), making it efficient for finding local minima.

What Happens Inside optim()?


  1. Initial Step:

    • The function starts with the initial guesses for β and \(\sigma^2\) provided in init_params.
  2. Evaluate the Objective Function:

    • optim() calculates the negative log-likelihood for the current parameter values using the log_likelihood() function.

    • It evaluates how far off the current guess is by computing the residuals \(y - X\beta\) and using those residuals to calculate the likelihood.

  3. Iterative Optimization:

    • optim() adjusts the parameter values iteratively. It evaluates the gradient (the direction in which the objective function decreases most quickly) and step size (how far to move in the direction of the gradient).

    • With each iteration, the algorithm updates β and \(\sigma^2\) slightly and recalculates the negative log-likelihood.

    • This process continues until the algorithm finds a set of parameters where the negative log-likelihood is minimized (i.e., the most likely parameters given the data).

  4. Convergence:

    • Once the negative log-likelihood cannot be significantly reduced by adjusting the parameters (i.e., when it has “converged” to a solution), optim() stops.

    • At this point, the current values of β and \(\sigma^2\) are considered the Maximum Likelihood Estimates (MLEs).

The Log-Likelihood Function: What’s Being Minimized?

The log-likelihood function describes the probability of observing the data given the parameters \(\beta\) and \(\sigma^2\). For a linear regression model with normally distributed errors, the log-likelihood is:

\[ \mathcal{L}(\beta, \sigma^2 | X, y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i \beta)^2 \]

Our goal is to maximize this function. Since optim() minimizes functions by default, we instead minimize the negative log-likelihood:

\[ -\mathcal{L}(\beta, \sigma^2 | X, y) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i \beta)^2 \]

The negative log-likelihood consists of two parts:

  1. The first part, \(\frac{n}{2} \log(2\pi\sigma^2)\), depends on the variance \(\sigma^2\).
  2. The second part, \(\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i \beta)^2\), depends on the residuals (differences between observed and predicted values) and both \(\beta\) and \(\sigma^2\).

The optimizer (optim()) will adjust the values of \(\beta\) and \(\sigma^2\) to minimize this expression, thereby maximizing the likelihood of the data given the model parameters.


Advantages and Disadvantages of MLE

Advantages:

  • MLE provides a general framework that can be applied to a wide range of statistical models, not just linear regression.

  • MLE is asymptotically unbiased and efficient, meaning that as the sample size increases, the estimates converge to the true parameter values, and the estimates have the smallest possible variance.

  • MLE allows for the estimation of additional parameters, such as the error variance \(σ^2\) in linear regression.

Disadvantages:

  • MLE can be computationally intensive, especially for complex models with many parameters.

  • MLE relies on the assumption that the error terms follow a specific distribution (e.g., normal distribution in linear regression). If this assumption is not met, the estimates may be biased or inefficient.

Let’s demonstrate the similarity of the estimates by fitting a linear regression model using both LS and MLE, and then visualize the fitted lines. To do this, we’ll predict miles per gallon (mpg) based on the weight (wt) of a car using the ‘mtcars’ dataset.

# Load necessary libraries
library(tidyverse)
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
# Load the 'mtcars' dataset
data(mtcars)

Fit the linear regression model using LS (lm function):

# Fit the model using LS
ls_model <- lm(mpg ~ wt, data = mtcars)

summary(ls_model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Fit the linear regression model using MLE (fit a normal linear model with mle2 function):

# Load necessary libraries
library(bbmle)
Loading required package: stats4

Attaching package: 'bbmle'
The following object is masked from 'package:dplyr':

    slice
# Define the log-likelihood function for MLE
loglik_fn <- function(beta0, beta1, sigma) {
  y <- mtcars$mpg
  x <- mtcars$wt
  n <- length(y)
  
  mu <- beta0 + beta1 * x
  epsilon <- y - mu
  
  loglik <- -n/2 * log(2 * pi) - n/2 * log(sigma^2) - 1/(2 * sigma^2) * sum(epsilon^2)
  return(-loglik) # The optimization function will minimize the function, so we need to negate the log-likelihood
}


# Fit the model using MLE
mle_model <- mle2(loglik_fn, start = list(beta0 = coef(ls_model)[1], 
                                          beta1 = coef(ls_model)[2], 
                                          sigma = 1))
summary(mle_model)
Maximum likelihood estimation

Call:
mle2(minuslogl = loglik_fn, start = list(beta0 = coef(ls_model)[1], 
    beta1 = coef(ls_model)[2], sigma = 1))

Coefficients:
      Estimate Std. Error z value     Pr(z)    
beta0 37.28513    1.81800 20.5088 < 2.2e-16 ***
beta1 -5.34447    0.54135 -9.8725 < 2.2e-16 ***
sigma  2.94916    0.36864  8.0000 1.244e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 160.0294 

Visualize the fitted lines:

# Extract coefficients from the LS and MLE models
ls_coefs <- coef(ls_model)
mle_coefs <- coef(mle_model)

# Create a scatter plot of mpg vs. wt
mtcars_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  xlim(c(1.5, 5.5)) +
  ylim(c(5, 40))

# Add the LS and MLE fitted lines to the plot
mtcars_plot +
  geom_abline(aes(intercept = ls_coefs[1], slope = ls_coefs[2], color = "LS", linetype = "LS"), size = 1, alpha=0.5) +
  geom_abline(aes(intercept = mle_coefs[1], slope = mle_coefs[2], color = "MLE", linetype = "MLE"), size = 1) +
  scale_color_manual("Model", values = c("LS" = "blue", "MLE" = "red")) +
  scale_linetype_manual("Model", values = c("LS" = "solid", "MLE" = "dashed")) +
  labs(title = "Linear Regression: LS vs. MLE", x = "Weight", y = "Miles per Gallon") +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

In the resulting plot, you’ll notice that the LS and MLE fitted lines are almost indistinguishable, which confirms that the estimates are the same when the error terms follow a normal distribution.