Moving Beyond Simple: Multiple Linear Regression

Go beyond single factors and learn how multiple inputs influence an outcome.

In Simple Linear Regression, we saw how to predict an outcome (like house price) using just one input factor (like house size). But reality is often more complex! House prices usually depend on size, location, number of bedrooms, age, and more.

That’s where Multiple Linear Regression (MLR) comes in. It’s a powerful extension that allows us to use multiple independent variables (inputs) to predict a single dependent variable (output). It helps us build more realistic and often more accurate models.

What is Multiple Linear Regression?

The Core Idea

MLR assumes that the relationship between the inputs and the output can still be represented by a linear equation, but one that now incorporates multiple factors. Geometrically, the fit is a flat plane with two inputs, or a hyperplane with more, rather than a line.

The Equation

The mathematical formula looks like an expanded version of the SLR equation:

y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ

Where:

y is the predicted Dependent Variable (e.g., predicted profit).
x₁, x₂, …, xₙ are the different Independent Variables (e.g., R&D Spend, Marketing Spend, State).
b₀ is the Intercept (predicted value of y when all x’s are 0).
b₁, b₂, …, bₙ are the Coefficients: Each bᵢ shows how much y changes for a one-unit increase in the corresponding xᵢ, assuming all other x variables are held constant.

Just like in SLR, the goal is to find the best values for the intercept (b₀) and all the coefficients (b₁ to bₙ) that make the equation fit our data points as closely as possible.
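As a rough illustration of what "finding the best values" can mean, the sketch below fits b₀, b₁, and b₂ by ordinary least squares, that is, by minimizing the sum of squared errors, which is the criterion scikit-learn's LinearRegression uses. The data is synthetic and the "true" coefficients (5, 2, -3) are made up for the example.

```python
import numpy as np

# Synthetic data (invented for illustration): y = 5 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first fitted weight plays the role of b0
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: choose the weights that minimize the sum of squared errors
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

The recovered values should land close to the made-up coefficients; b1 can then be read as the change in y for a one-unit increase in x₁ with x₂ held constant.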

Important Rules (Assumptions) for MLR

MLR shares assumptions with SLR, but adds a crucial new one:

Linearity: The relationship between y and each independent variable should be linear.
Independence of Errors: Errors should be independent of each other.
Homoscedasticity: Errors should have constant variance across all levels of the independent variables.
Normality of Errors: Errors should be normally distributed.
Lack of Multicollinearity: The independent variables should not be highly correlated with each other. If two inputs are highly correlated (e.g., ‘Years of Experience’ and ‘Age’), it’s hard for the model to tell which one is truly influencing the output, leading to unstable coefficient estimates.

Checking for multicollinearity often involves looking at correlation matrices or calculating Variance Inflation Factors (VIFs).
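The sketch below shows one way both checks could look, using only NumPy. The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The helper name vif and the synthetic 'experience', 'age', and 'hours' columns are assumptions made for this example, not part of any library.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n_samples), others])
        coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coeffs
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example: 'age' is almost a linear function of 'experience'
rng = np.random.default_rng(0)
experience = rng.uniform(0, 30, size=300)
age = experience + 22 + rng.normal(scale=1.0, size=300)
hours = rng.uniform(20, 60, size=300)  # unrelated to the other two columns

X = np.column_stack([experience, age, hours])
vif_values = vif(X)

print(np.corrcoef(X, rowvar=False).round(2))  # pairwise correlation matrix
print(vif_values.round(1))                    # one VIF per column
```

A VIF near 1 means a feature is barely explained by the others. Values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a rule of thumb rather than a strict cutoff. Note that this helper would divide by zero for perfectly collinear columns.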

Dealing with Categories: Dummy Variables

Converting Text to Numbers

Regression models need numbers. What if one of your inputs is categorical, like ‘State’ or ‘Gender’? We need to convert these into a numerical format using Dummy Variables.

The most common way is One-Hot Encoding:

Create a new binary (0 or 1) column for each category.
For a given row, the column corresponding to that row’s category gets a ‘1’, and all other columns get a ‘0’.

Example: ‘State’ with [California, Florida, New York]

Row with ‘California’ → California=1, Florida=0, New York=0
Row with ‘Florida’ → California=0, Florida=1, New York=0
Row with ‘New York’ → California=0, Florida=0, New York=1
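The same encoding can be sketched with pandas, assuming a small made-up DataFrame with a single 'State' column:

```python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame({"State": ["California", "Florida", "New York", "Florida"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(df["State"], dtype=int)
print(dummies)
```

Each row has exactly one '1', in the column that matches its state.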

Avoiding the Dummy Variable Trap!

There’s a catch! If you include all the dummy columns, they become perfectly predictable from each other (if Florida=0 and New York=0, you know California must be 1). In a model with an intercept, this creates perfect multicollinearity: the dummy columns always sum to 1, which duplicates the intercept term, so there is no unique solution for the coefficients.

Solution: Always drop one of the dummy variable columns for each original categorical feature. If you have ‘m’ categories, include only ‘m-1’ dummy columns in your model. The dropped category becomes the “reference” category.
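To make the trap concrete, the sketch below builds two design matrices from made-up data, each with an intercept column of ones, and compares their rank to their number of columns. The helper with_intercept is hypothetical and written only for this example.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration
states = pd.Series(["California", "Florida", "New York", "Florida", "California"])

all_dummies = pd.get_dummies(states, dtype=int)               # m = 3 columns
reduced = pd.get_dummies(states, drop_first=True, dtype=int)  # m - 1 = 2 columns

def with_intercept(dummy_frame):
    """Prepend a column of ones, as a regression with an intercept would."""
    ones = np.ones((len(dummy_frame), 1))
    return np.hstack([ones, dummy_frame.to_numpy(dtype=float)])

full_design = with_intercept(all_dummies)
reduced_design = with_intercept(reduced)

full_rank = np.linalg.matrix_rank(full_design)
reduced_rank = np.linalg.matrix_rank(reduced_design)

# Rank below the column count means the columns are linearly dependent
print(f"All dummies: {full_design.shape[1]} columns, rank {full_rank}")
print(f"One dropped: {reduced_design.shape[1]} columns, rank {reduced_rank}")
```

With every dummy kept, the matrix has one more column than its rank, which is the perfect multicollinearity described above. Dropping one dummy (here 'California', which becomes the reference category) restores full rank.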

Building an MLR Model (Python Workflow)

Load & Prepare Data: Import your dataset, separate features (X) and target (y), handle missing values.
Encode Categorical Features: Use One-Hot Encoding with drop='first' to create dummy variables.
Split Data: Divide into training (80%) and testing (20%) sets.
Feature Scaling (Optional): Apply StandardScaler after splitting (fit only on training data).
Train the Model: Create a LinearRegression instance and fit to training data.
Make Predictions: Use the trained model to predict on test data.
Evaluate: Calculate metrics like MSE and R².

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

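# Load data (hypothetical file name; replace with your own dataset).
# Assumes the target is the last column and the categorical feature is at index 3.
dataset = pd.read_csv('your_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Preprocessing: one-hot encode the categorical column, dropping the first dummy
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough',
    sparse_threshold=0  # always return a dense array so StandardScaler can center it
)

X = preprocessor.fit_transform(X)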

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

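# Scale features (optional; fit on training data only).
# After scaling, each coefficient is the change in y per one standard deviation of that feature.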
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
print(f"MSE: {mse:.4f}")
print(f"R-squared: {r2:.4f}")
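For reference, the two metrics printed above can be computed by hand. The sketch below uses the standard definitions, MSE as the mean of the squared errors and R² as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is also how scikit-learn's mean_squared_error and r2_score define them. The helper name mse_and_r2 and the four data points are made up for illustration.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Compute mean squared error and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Toy values invented for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse, r2 = mse_and_r2(y_true, y_pred)
print(f"MSE: {mse:.4f}")        # MSE: 0.1250
print(f"R-squared: {r2:.4f}")   # R-squared: 0.9750
```

Lower MSE means smaller errors in the units of y squared, while R² closer to 1 means the model explains more of the variation in y.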

Multiple Linear Regression: Key Points

MLR predicts a dependent variable (y) using two or more independent variables.
The equation is y = b₀ + b₁x₁ + ... + bₙxₙ.
Key assumptions include Linearity, Independence, Homoscedasticity, Normality of Errors, and crucially, Lack of Multicollinearity.
Categorical predictors must be converted to Dummy Variables using One-Hot Encoding.
Avoid the Dummy Variable Trap by dropping one dummy column per original categorical feature.
Evaluation uses MSE and R².
Feature selection methods can help refine the model by keeping only significant predictors.