Week 9: Regression

What is regression about?

  • Modeling the relationship between variables
  • Response variable (dependent variable, outcome): what we want to predict or explain
  • Predictor variable(s) (independent variable(s), explanatory variable(s)): what we use to predict or explain
  • Potential goals:
    • Prediction: estimate the response for new observations
    • Inference: understand whether and how predictors affect the response

The regression framework

Simple linear regression

  • Models the relationship between:
    • One numeric response variable (\(Y\))
    • One numeric predictor variable (\(X\))
  • The model equation: \[Y = \beta_0 + \beta_1 X + \varepsilon \]

Components of the model

Symbol Name Description
\(Y\) Response The variable we want to predict
\(X\) Predictor The variable we use to predict
\(\beta_0\) Intercept Expected value of \(Y\) when \(X = 0\)
\(\beta_1\) Slope Change in \(Y\) for one-unit increase in \(X\)
\(\varepsilon\) Error term Random variation not explained by \(X\)

Interpreting the slope

  • \(\beta_1\) represents the rate of change
  • For every one-unit increase in \(X\), we expect \(Y\) to change by \(\beta_1\) units
  • Examples:
    • If \(X\) = years of education and \(Y\) = salary, then \(\beta_1\) = expected change in salary per additional year of education
    • If \(X\) = temperature and \(Y\) = ice cream sales, then \(\beta_1\) = expected change in sales per degree increase

The least squares principle

  • We don’t know the true values of \(\beta_0\) and \(\beta_1\)
  • We estimate them from data using least squares:
    • Choose \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals
    • Residual = observed value - predicted value = \(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\)
  • The fitted (predicted) values: \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\)
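A minimal sketch of the closed-form least squares estimates, using hypothetical simulated data (lm() computes the same quantities)
# Simulate a small hypothetical dataset
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)
y <- 2 + 1.5 * x + rnorm(30, sd = 2)

# Closed-form least squares estimates for simple linear regression
b1_hat <- cov(x, y) / var(x)           # slope = Sxy / Sxx
b0_hat <- mean(y) - b1_hat * mean(x)   # intercept = ybar - slope * xbar
c(b0_hat, b1_hat)

# lm() gives the same estimates
coef(lm(y ~ x))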

Visual representation

Illustrating the regression line and residuals
library(ggplot2)  # plotting package used for the figures below

set.seed(42)
x <- rnorm(20, mean = 10, sd = 2)
y <- 2 + 1.5 * x + rnorm(20, mean = 0, sd = 2)
df <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = df)

ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  geom_segment(aes(xend = x, yend = fitted(fit)), 
               linetype = "dashed", alpha = 0.5) +
  labs(title = "Linear regression: fitted line and residuals",
       x = "Predictor (X)", y = "Response (Y)") +
  theme_minimal()

Fitting regression models in R

The lm() function

  • lm stands for “linear model”
  • Basic syntax:
model <- lm(response ~ predictor, data = dataset)
  • Formula notation: response ~ predictor
    • ~ means “is modeled as a function of”
    • Left side: response variable
    • Right side: predictor variable(s)

Example: Cars dataset

  • Built-in dataset on cars from the 1920s
  • Variables:
    • speed: speed of car (mph)
    • dist: stopping distance (ft)
View the cars dataset
data(cars)
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
Structure of the cars dataset
str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

Visualizing the relationship

Scatterplot of speed vs. stopping distance
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(size = 3, color = "steelblue") +
  labs(title = "Car speed vs. stopping distance",
       x = "Speed (mph)", y = "Stopping distance (ft)") +
  theme_minimal()

Research question

  • How does stopping distance depend on speed?
  • We will model stopping distance as a function of speed: \[\text{dist} = \beta_0 + \beta_1 \times \text{speed} + \varepsilon\]

Fitting the model

cars_model <- lm(dist ~ speed, data = cars)
summary(cars_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Breaking down the output

summary(cars_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Key components:

  • Coefficients table: estimates of \(\beta_0\) and \(\beta_1\), with standard errors, t-statistics, and p-values
  • Residual standard error: estimate of \(\sigma\) (standard deviation of errors)
  • R-squared: proportion of variance explained by the model (0 to 1)
  • F-statistic: overall test of whether the model is useful
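Each of these can also be extracted from the fitted model object directly (a quick sketch):
summary(cars_model)$coefficients   # coefficients table
sigma(cars_model)                  # residual standard error
summary(cars_model)$r.squared      # R-squared
summary(cars_model)$fstatistic     # F-statistic with its degrees of freedom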

Interpreting the coefficients

coef(cars_model)
(Intercept)       speed 
 -17.579095    3.932409 
  • Intercept (\(\hat{\beta}_0 = -17.58\)): estimated stopping distance when speed = 0 mph
    • Often not meaningful when predictor values near 0 are not observed; here a negative stopping distance is impossible
  • Slope (\(\hat{\beta}_1 = 3.93\)): for each 1 mph increase in speed, stopping distance increases by about 3.93 feet on average

Hypothesis test for the slope

  • Testing whether there is a linear relationship:
    • \(H_0: \beta_1 = 0\) (mean of stopping distance does not depend on speed)
    • \(H_A: \beta_1 \neq 0\) (mean of stopping distance does depend on speed)
summary(cars_model)$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
speed         3.932409  0.4155128  9.463990 1.489836e-12
  • The p-value for speed is very small (< 0.001)
  • We reject \(H_0\): strong evidence of a linear relationship
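The t-statistic and p-value can be reconstructed by hand from the estimate and standard error (a sketch):
est <- summary(cars_model)$coefficients["speed", "Estimate"]
se  <- summary(cars_model)$coefficients["speed", "Std. Error"]
t_stat <- est / se                     # t = estimate / standard error
t_stat
2 * pt(abs(t_stat), df = df.residual(cars_model), lower.tail = FALSE)  # two-sided p-value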

Confidence intervals for coefficients

confint(cars_model, level = 0.95)
                 2.5 %    97.5 %
(Intercept) -31.167850 -3.990340
speed         3.096964  4.767853
  • We are 95% confident that the true slope is between 3.10 and 4.77
  • For each 1 mph increase in speed, the mean stopping distance increases by approximately 3.10 to 4.77 feet
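The interval matches the usual formula \(\hat{\beta}_1 \pm t_{0.975,\,n-2} \times SE(\hat{\beta}_1)\) (a sketch):
est <- summary(cars_model)$coefficients["speed", "Estimate"]
se  <- summary(cars_model)$coefficients["speed", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(cars_model)) * se  # 95% CI for the slope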

Adding the regression line to the plot

Scatterplot with fitted regression line
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Car speed vs. stopping distance",
       subtitle = "With fitted regression line and 95% confidence band",
       x = "Speed (mph)", y = "Stopping distance (ft)") +
  theme_minimal()

Making predictions

  • Once we have a fitted model, we can predict the response for new predictor values
# Predict stopping distance for a car going 22 mph
new_data <- data.frame(speed = 22)
predict(cars_model, newdata = new_data)
      1 
68.9339 
  • Expected stopping distance for a car traveling at 22 mph is about 68.9 feet
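This is the same as plugging 22 into the fitted equation by hand:
# hat(y) = beta0_hat + beta1_hat * speed
coef(cars_model)[1] + coef(cars_model)[2] * 22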

Confidence interval for the mean response

# Get prediction with confidence interval
predict(cars_model, newdata = new_data, interval = "confidence")
      fit     lwr      upr
1 68.9339 61.8963 75.97149
  • This is a confidence interval for the mean response at speed = 22 mph
  • We are 95% confident that the average stopping distance for cars at 22 mph is between 61.9 and 76.0 feet

Prediction intervals for a new observation

# Get prediction interval for individual observation
predict(cars_model, newdata = new_data, interval = "prediction")
      fit      lwr      upr
1 68.9339 37.22044 100.6474
  • This is a prediction interval for a single new observation
  • We are 95% confident that a single car traveling at 22 mph will have a stopping distance between 37.2 and 100.6 feet
  • Prediction intervals are wider than confidence intervals. How much wider?
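Where the extra width comes from: for a new predictor value \(x_0\) in simple linear regression, the standard formulas are
\[\hat{y}_0 \pm t_{0.975,\,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} \quad \text{(confidence interval for the mean response)}\]
\[\hat{y}_0 \pm t_{0.975,\,n-2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} \quad \text{(prediction interval)}\]
The extra 1 under the square root reflects the variability of a single observation around its mean. Comparing the widths directly:
conf_int <- predict(cars_model, newdata = new_data, interval = "confidence")
pred_int <- predict(cars_model, newdata = new_data, interval = "prediction")
(pred_int[, "upr"] - pred_int[, "lwr"]) / (conf_int[, "upr"] - conf_int[, "lwr"])  # roughly 4.5 here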

Model diagnostics

Checking model assumptions

Linear regression assumes:

  1. Linearity: relationship between \(X\) and \(Y\) is linear
  2. Independence: observations are independent
  3. Normality: residuals are normally distributed
  4. Equal variance (homoscedasticity): variance of residuals is constant across predictor values

We can check these using diagnostic plots

Residual plots

Standard diagnostic plots for regression
par(mfrow = c(2, 2))
plot(cars_model)

Interpreting diagnostic plots

  1. Residuals vs Fitted:
    • Check for linearity and equal variance
    • Should see random scatter around horizontal line at 0
  2. Normal Q-Q:
    • Check normality of residuals
    • Points should fall close to diagonal line
  3. Scale-Location:
    • Check equal variance assumption
    • Should see horizontal line with random scatter
  4. Residuals vs Leverage:
    • Identify influential observations
    • Points far from the pack may have undue influence

Impact of violated assumptions

When assumptions are violated, several problems can occur:

  1. Biased coefficient estimates (wrong slope/intercept)
  2. Invalid inference (p-values and confidence intervals are wrong)
  3. Poor predictions (unreliable forecasts)
  4. Inefficient estimates (larger standard errors than necessary)

Let’s see examples of each violation…

Violation 1: Non-linearity 👻

Simulate non-linear association
set.seed(123)
x <- seq(-10, 10, length.out = 50)
y <- 2 + 0.1 * x^2 + rnorm(50, sd = 3)
df_nonlinear <- data.frame(x, y)

ggplot(df_nonlinear, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, 
              color = "red") +
  labs(title = "Linear fit to curved data",
       subtitle = "Red line misses the pattern") +
  theme_minimal()

Consequences:

  • Linear model fails to capture the true relationship
  • Systematic bias in predictions
  • Residuals show clear pattern

Solutions:

  • Depending on the type of non-linearity, some steps that may help:
    • Transform variables (log, square root)
    • Add polynomial terms (x + I(x^2)); see the sketch after this list
    • Use non-linear models
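For instance, a quadratic term captures the curvature in the simulated data above (a sketch; the true relationship here is quadratic by construction):
# Compare a quadratic fit to the straight-line fit
model_quad <- lm(y ~ x + I(x^2), data = df_nonlinear)
summary(model_quad)$r.squared                       # quadratic fit
summary(lm(y ~ x, data = df_nonlinear))$r.squared   # linear fit, much lower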

Non-linearity: Diagnostic plot 🎃

model_nonlinear <- lm(y ~ x, data = df_nonlinear)
par(mfrow = c(1, 2))
plot(model_nonlinear, which = 1)  # Residuals vs. Fitted
plot(model_nonlinear, which = 2)  # Q-Q plot

Notice the curved pattern in the Residuals vs. Fitted plot: it indicates non-linearity!

Violation 2: Non-constant variance (heteroskedasticity) 🕷️

Simulate non-constant variance
set.seed(456)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)  # SD increases with x
df_hetero <- data.frame(x, y)

ggplot(df_hetero, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, 
              color = "red") +
  labs(title = "Heteroskedasticity",
       subtitle = "Variance increases with x") +
  theme_minimal()

Consequences:

  • Coefficient estimates are still unbiased
  • Standard errors are wrong
  • Confidence intervals and p-values are unreliable
  • Some predictions are more uncertain than others

Solutions:

  • Weighted least squares
  • Transform response variable
  • Use robust standard errors
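For example, heteroskedasticity-consistent (robust) standard errors are available via the sandwich and lmtest packages (a sketch, assuming both packages are installed):
library(sandwich)
library(lmtest)

model_hetero <- lm(y ~ x, data = df_hetero)
coeftest(model_hetero, vcov = vcovHC(model_hetero, type = "HC3"))  # robust standard errors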

Heteroskedasticity: Diagnostic plot 🕸️

model_hetero <- lm(y ~ x, data = df_hetero)
par(mfrow = c(1, 2))
plot(model_hetero, which = 1)  # Residuals vs. Fitted
plot(model_hetero, which = 3)  # Scale-Location

The “fan shape” in the residual plot reveals non-constant variance.

Violation 3: Non-normal residuals 🧟

Simulate non-normal errors
set.seed(789)
x <- runif(100, 0, 10)
# Add some extreme outliers
y <- 2 + 3 * x + c(rt(95, df = 2) * 2, 
                    rnorm(5, mean = 20, sd = 1))
df_nonnormal <- data.frame(x, y)

ggplot(df_nonnormal, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, 
              color = "red") +
  labs(title = "Heavy-tailed residuals",
       subtitle = "With outliers") +
  theme_minimal()

Consequences:

  • Coefficients still unbiased (the least squares principle still applies and does not require normality)
  • Unless the sample size is very large, hypothesis tests and confidence intervals may lose validity (confidence intervals too narrow or too wide; tests not holding their stated significance level)

Solutions:

  • Transform response variable (e.g. Box-Cox transformations)
  • Interpret confidence intervals / hypothesis tests with skepticism

Non-normality: Diagnostic plot 💀

model_nonnormal <- lm(y ~ x, data = df_nonnormal)
par(mfrow = c(1, 2))
plot(model_nonnormal, which = 2)  # Q-Q plot
plot(model_nonnormal, which = 5)  # Residuals vs Leverage

Departures from the diagonal line in the Q-Q plot indicate non-normality. Point 99 is influential!

Violation 4: Non-independence 🦇

Simulate correlated errors
set.seed(101)
n <- 100
x <- 1:n
errors <- arima.sim(n = n, 
                    list(ar = 0.7), sd = 2)
y <- 2 + 0.5 * x + errors
df_autocor <- data.frame(x, y)

ggplot(df_autocor, aes(x, y)) +
  geom_point() +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, 
              color = "red") +
  labs(title = "Autocorrelated errors",
       subtitle = "Time series data") +
  theme_minimal()

Consequences:

  • Common in time series or clustered data
  • Standard errors are severely underestimated
  • P-values too small (false positives)
  • Confidence intervals too narrow

Solutions:

  • Time series models (model the correlation in the error terms)
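A sketch of both steps, assuming the lmtest and nlme packages are available: test for autocorrelation, then refit allowing AR(1) errors.
library(lmtest)
library(nlme)

model_autocor <- lm(y ~ x, data = df_autocor)
dwtest(model_autocor)   # Durbin-Watson test for autocorrelated errors

# Refit with an AR(1) correlation structure for the error terms
gls_fit <- gls(y ~ x, data = df_autocor, correlation = corAR1())
summary(gls_fit)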

Key takeaway

  • Always check diagnostic plots before interpreting results
  • Violated assumptions don’t always invalidate the analysis:
    • Some violations are worse than others
    • Large samples can mitigate some problems
    • Transformations can often help
  • When in doubt:
    • Use robust methods
    • Report diagnostics
    • Be transparent about limitations
  • No model is perfect; focus on whether violations are severe enough to affect your conclusions

Multiple regression

Extending to multiple predictors

  • Multiple regression includes more than one predictor \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon\]

  • Each coefficient \(\beta_j\) represents the effect of \(X_j\) on \(Y\), holding other predictors constant

  • R syntax:

model <- lm(response ~ predictor1 + predictor2 + predictor3,
            data = dataset)

Example: mtcars dataset

View the mtcars dataset
data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • mpg: miles per gallon
  • wt: weight (1000 lbs)
  • hp: horsepower
  • cyl: number of cylinders

Research question

  • How do weight and horsepower jointly affect fuel efficiency (mpg)? \[\text{mpg} = \beta_0 + \beta_1 \times \text{wt} + \beta_2 \times \text{hp} + \varepsilon\]

Fitting the multiple regression model

mtcars_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(mtcars_model)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpreting multiple regression coefficients

coef(mtcars_model)
(Intercept)          wt          hp 
37.22727012 -3.87783074 -0.03177295 
  • Intercept (\(\hat{\beta}_0 = 37.23\)): expected mpg when weight and horsepower are both 0 (not meaningful)
  • wt (\(\hat{\beta}_1 = -3.88\)): for each 1000 lb increase in weight, mpg decreases by about 3.88, holding horsepower constant
  • hp (\(\hat{\beta}_2 = -0.03\)): for each 1 hp increase, mpg decreases by about 0.03, holding weight constant

Comparing models

  • We can compare models with different numbers of predictors
# Simple model: mpg vs weight only
model1 <- lm(mpg ~ wt, data = mtcars)

# Multiple model: mpg vs weight and horsepower
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# Compare R-squared
summary(model1)$r.squared
[1] 0.7528328
summary(model2)$r.squared
[1] 0.8267855
  • Model 2 explains more variance (higher R-squared)

ANOVA for model comparison

anova(model1, model2)
Analysis of Variance Table

Model 1: mpg ~ wt
Model 2: mpg ~ wt + hp
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     30 278.32                                
2     29 195.05  1    83.274 12.381 0.001451 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The ANOVA F-test compares the two models
  • Small p-value indicates that adding hp significantly improves the model
  • The more complex model (model2) is preferred

Categorical predictors

Including categorical variables

  • Categorical predictors are automatically converted to dummy variables
  • Example: modeling mpg as a function of number of cylinders
    • cyl can be 4, 6, or 8
mtcars$cyl <- factor(mtcars$cyl)
cyl_model <- lm(mpg ~ cyl, data = mtcars)
summary(cyl_model)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl6         -6.9208     1.5583  -4.441 0.000119 ***
cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

Interpreting categorical coefficients

coef(cyl_model)
(Intercept)        cyl6        cyl8 
  26.663636   -6.920779  -11.563636 
  • R creates dummy variables with the first level as the reference category
    • Here, 4 cylinders is the reference
  • Intercept (\(\hat{\beta}_0 = 26.66\)): mean mpg for 4-cylinder cars
  • cyl6 (\(\hat{\beta}_1 = -6.92\)): 6-cylinder cars have about 6.92 lower mpg than 4-cylinder cars, on average
  • cyl8 (\(\hat{\beta}_2 = -11.56\)): 8-cylinder cars have about 11.56 lower mpg than 4-cylinder cars, on average
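To see the dummy coding explicitly, or to change the reference category, a short sketch (using a copy so the original data are unchanged):
# The dummy variables R creates behind the scenes
head(model.matrix(~ cyl, data = mtcars))

# Make 8 cylinders the reference level instead
mtcars2 <- mtcars
mtcars2$cyl <- relevel(mtcars2$cyl, ref = "8")
coef(lm(mpg ~ cyl, data = mtcars2))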

Visualizing categorical regression

Boxplot with group means
ggplot(mtcars, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot(alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
  labs(title = "MPG by number of cylinders",
       subtitle = "Red diamonds show group means (regression estimates)",
       x = "Number of cylinders", y = "Miles per gallon") +
  theme_minimal() +
  theme(legend.position = "none")

Key R functions for regression

Summary table

Function Purpose Example
lm() Fit linear model lm(y ~ x, data = df)
summary() Model summary statistics summary(model)
coef() Extract coefficients coef(model)
confint() Confidence intervals confint(model)
predict() Make predictions predict(model, newdata)
plot() Diagnostic plots plot(model)
anova() Compare models anova(model1, model2)

Formula syntax in R

Formula Meaning
y ~ x Simple regression: y on x
y ~ x1 + x2 Multiple regression: y on x1 and x2
y ~ x1 + x2 + x1:x2 Include interaction between x1 and x2
y ~ x1 * x2 Equivalent to y ~ x1 + x2 + x1:x2
y ~ . Use all other variables as predictors
y ~ x - 1 Regression without intercept
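For example, the interaction shorthand applied to the mtcars model (a sketch):
# mpg ~ wt * hp expands to wt + hp + wt:hp
interaction_model <- lm(mpg ~ wt * hp, data = mtcars)
coef(interaction_model)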

Important considerations

Correlation vs. causation

  • Regression reveals associations, not necessarily causation
  • Just because \(X\) predicts \(Y\) doesn’t mean \(X\) causes \(Y\)
  • Possible explanations for correlation:
    1. \(X\) causes \(Y\)
    2. \(Y\) causes \(X\)
    3. A third variable causes both \(X\) and \(Y\) (confounding)
    4. The association is due to chance
  • Randomized experiments (like the clinical trials in Week 4) provide stronger evidence for causation

Extrapolation

  • Extrapolation: making predictions outside the range of observed data
  • The linear relationship may not hold beyond the observed range
  • Example: predicting stopping distance for a car going 100 mph using our cars model
    • The data only include speeds from 4 to 25 mph
    • Extrapolating to 100 mph is risky
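Nothing in R stops us from extrapolating; the point is that the answer should not be trusted (a sketch):
# Predicted stopping distance at 100 mph, far outside the observed 4-25 mph range
predict(cars_model, newdata = data.frame(speed = 100))  # about 376 ft, not to be trusted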

Limitations of R-squared

  • R-squared measures the proportion of variance explained
  • Higher R-squared doesn’t always mean a better model:
    • Adding more predictors always increases R-squared
    • A model with lower R-squared might generalize better
    • R-squared doesn’t tell us if the model assumptions are met
  • Use adjusted R-squared when comparing models with different numbers of predictors
    • Penalizes for adding unnecessary predictors
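Adjusted R-squared is stored in the model summary; comparing the two mtcars models from earlier (a sketch):
summary(model1)$adj.r.squared   # mpg ~ wt
summary(model2)$adj.r.squared   # mpg ~ wt + hp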

Example: Full analysis

Palmer Penguins dataset

Load and display the penguins dataset
if(!require(palmerpenguins)) install.packages("palmerpenguins")
library(palmerpenguins)
data(penguins)
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Research question

  • How does body mass depend on flipper length and species?
  • Let’s build a multiple regression model

Exploratory visualization

Scatterplot colored by species
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(size = 2, alpha = 0.6) +
#   geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Body mass vs. flipper length by species",
       x = "Flipper length (mm)", y = "Body mass (g)") +
  theme_minimal()

Fitting the model

# Remove missing values
penguins_clean <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "species")])

# Fit model
penguin_model <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins_clean)
summary(penguin_model)

Call:
lm(formula = body_mass_g ~ flipper_length_mm + species, data = penguins_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-927.70 -254.82  -23.92  241.16 1191.68 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -4031.477    584.151  -6.901 2.55e-11 ***
flipper_length_mm    40.705      3.071  13.255  < 2e-16 ***
speciesChinstrap   -206.510     57.731  -3.577 0.000398 ***
speciesGentoo       266.810     95.264   2.801 0.005392 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 375.5 on 338 degrees of freedom
Multiple R-squared:  0.7826,    Adjusted R-squared:  0.7807 
F-statistic: 405.7 on 3 and 338 DF,  p-value: < 2.2e-16

Interpreting the results

coef(penguin_model)
      (Intercept) flipper_length_mm  speciesChinstrap     speciesGentoo 
       -4031.4769           40.7054         -206.5101          266.8096 
  • Intercept: Expected body mass for an Adelie penguin with 0 mm flipper length (not meaningful)
  • flipper_length_mm: For each 1 mm increase in flipper length, body mass increases by about 40.7 g, holding species constant
  • speciesChinstrap: Chinstrap penguins are about 206.5 g lighter than Adelie, holding flipper length constant
  • speciesGentoo: Gentoo penguins are about 266.8 g heavier than Adelie, holding flipper length constant

Model diagnostics

Diagnostic plots
par(mfrow = c(2, 2))
plot(penguin_model)

Making predictions

# Predict body mass for a Gentoo penguin with 210 mm flippers
new_penguin <- data.frame(
  flipper_length_mm = 210,
  species = "Gentoo"
)

predict(penguin_model, newdata = new_penguin, interval = "prediction")
       fit     lwr      upr
1 4783.467 4040.52 5526.413
  • We predict a Gentoo penguin with 210 mm flippers will have a body mass around 4783 g
  • The 95% prediction interval is (4041, 5526) g

The two cultures

Two approaches to statistical modeling

Leo Breiman (2001) identified two cultures in statistical modeling:

Culture 1: Data Modeling

  • Focus on inference
  • Assume a stochastic model (e.g., linear regression)
  • Estimate parameters and test hypotheses
  • Interpretability is key
  • Assumptions matter greatly
  • This lecture has focused here

Culture 2: Algorithmic Modeling

  • Focus on prediction
  • Treat the data mechanism as unknown
  • Prioritize predictive accuracy
  • “Black box” models acceptable
  • Less concern about distributional assumptions
  • The box-office prediction case study has some of this flavor

Implications for your work

  • In prediction tasks:
    • Diagnostic plots are less critical if the model predicts well on new data
    • Large sample sizes can overcome some assumption violations
    • Cross-validation and test set performance matter more than p-values (see the sketch after this list)
    • The “true” model may be unknowable and that’s okay
  • In inference tasks (like much of today):
    • Understanding relationships is the goal
    • Assumptions must be checked carefully
    • Coefficient interpretation requires proper model specification
    • Causal claims require even stronger assumptions
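A minimal sketch of the holdout idea for the cars model (a hypothetical 80/20 split; cross-validation repeats this with multiple splits):
# Hold out 20% of the cars data and evaluate prediction error on it
set.seed(1)
n <- nrow(cars)
test_idx <- sample(n, size = round(0.2 * n))
train <- cars[-test_idx, ]
test  <- cars[test_idx, ]

fit_train <- lm(dist ~ speed, data = train)
pred <- predict(fit_train, newdata = test)
sqrt(mean((test$dist - pred)^2))   # test-set RMSE, in feet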

Summary

Key takeaways

  1. Regression models relationships between variables
  2. Simple linear regression: one predictor, estimates slope and intercept
  3. Multiple regression: multiple predictors, each coefficient represents effect holding others constant
  4. R function: lm(response ~ predictors, data = dataset)
  5. Use summary(), coef(), confint(), and predict() to interpret results
  6. Check model assumptions using diagnostic plots
  7. Categorical predictors are handled automatically via dummy variables
  8. Remember: correlation ≠ causation

Next steps

  • Continue with Case Study 1
  • Practice fitting and interpreting regression models
  • Think about:
    • What makes a good predictor?
    • How do we choose which variables to include?
    • What should we do if assumptions are violated?