Why It's Needed
Why Neural Networks Need Activation Functions
The Linear Activation Problem
- If we used a linear activation function for every neuron in the neural network:
  - The network becomes no different from linear regression
  - “This would defeat the entire purpose of using a neural network”
  - It cannot fit anything more complex than a basic linear regression model
Mathematical Proof with a Simple Example
Network Structure:
- Input: x (scalar)
- Hidden layer: one unit with parameters w₁ and b₁, outputs a₁
- Output layer: one unit with parameters w₂ and b₂, outputs a₂ (which equals f(x))
Using Linear Activation Throughout:
- Calculate the hidden layer output:
  - a₁ = g(w₁x + b₁)
  - Since g(z) = z (linear activation), a₁ = w₁x + b₁
- Calculate the final output:
  - a₂ = g(w₂a₁ + b₂)
  - Since g(z) = z, a₂ = w₂a₁ + b₂
- Substitute a₁ into the equation:
  - a₂ = w₂(w₁x + b₁) + b₂
  - a₂ = w₂w₁x + w₂b₁ + b₂
- Simplify by setting:
  - w = w₂w₁
  - b = w₂b₁ + b₂
- Result:
  - a₂ = wx + b
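As a quick numerical check, below is a minimal NumPy sketch of this two-unit network; the values of w₁, b₁, w₂, b₂ are arbitrary illustrative choices, not values from the lecture. It confirms that two stacked linear layers produce exactly the collapsed model wx + b.

```python
import numpy as np

# Arbitrary illustrative parameters for the two linear units
w1, b1 = 2.0, 0.5    # hidden unit
w2, b2 = -3.0, 1.0   # output unit

x = np.linspace(-5, 5, 11)

# Forward pass with linear activation g(z) = z
a1 = w1 * x + b1     # hidden layer output
a2 = w2 * a1 + b2    # network output

# Collapsed single linear model: w = w2*w1, b = w2*b1 + b2
w = w2 * w1
b = w2 * b1 + b2

print(np.allclose(a2, w * x + b))  # True: the "network" is just wx + b
```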
Fundamental Limitation
- “A linear function of a linear function is itself a linear function”
- Multiple layers with linear activations don’t enable the network to compute:
- More complex features
- Non-linear relationships
- Complex patterns in data
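To spell out the general claim, here is the same substitution argument written in LaTeX for L stacked scalar units with g(z) = z: each layer only rescales and shifts the previous one, so the whole stack is still a single line.

```latex
\begin{aligned}
a_1 &= w_1 x + b_1 \\
a_2 &= w_2 a_1 + b_2 = (w_2 w_1)\,x + (w_2 b_1 + b_2) \\
    &\;\;\vdots \\
a_L &= w_L a_{L-1} + b_L
     = \underbrace{(w_L w_{L-1} \cdots w_1)}_{w}\, x
     + \underbrace{(w_L \cdots w_2\, b_1 + \cdots + w_L\, b_{L-1} + b_L)}_{b}
     = w x + b
\end{aligned}
```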
Generalizing to Larger Networks
- A neural network with multiple hidden layers, all using linear activations:
  - The output will always be equivalent to linear regression
  - It can be expressed as wx + b (for some values of w and b)
- If the hidden layers use linear activations but the output layer uses a sigmoid:
  - The network becomes equivalent to logistic regression
  - The output can be expressed as 1/(1 + e^(-(wx + b)))
  - “This big neural network doesn’t do anything that you can’t also do with logistic regression”
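Continuing the NumPy sketch from above (parameter values again made up for illustration), adding a sigmoid only at the output reproduces plain logistic regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative parameters: two linear hidden units, one sigmoid output unit
w1, b1 = 0.8, -0.2
w2, b2 = 1.5, 0.3
w3, b3 = -2.0, 0.7

x = np.linspace(-4, 4, 9)

# Forward pass: linear activations in the hidden layers, sigmoid at the output
a1 = w1 * x + b1
a2 = w2 * a1 + b2
a3 = sigmoid(w3 * a2 + b3)

# Equivalent logistic regression: collapse the linear layers into a single w, b
w = w3 * w2 * w1
b = w3 * (w2 * b1 + b2) + b3

print(np.allclose(a3, sigmoid(w * x + b)))  # True: it is just logistic regression
```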
Best Practice Rule
- “Don’t use the linear activation function in the hidden layers of the neural network”
- Recommendation: Use ReLU activation for hidden layers
- This enables the network to learn non-linear relationships
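For contrast, here is a minimal sketch of the same two-unit structure with ReLU in the hidden layer (parameters again arbitrary): the output is now piecewise linear, with a kink where the hidden unit's pre-activation crosses zero, so it can no longer be written as a single wx + b.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Same two-unit structure as before, but with ReLU in the hidden layer
w1, b1 = 2.0, 0.5
w2, b2 = -3.0, 1.0

x = np.linspace(-5, 5, 11)

a1 = relu(w1 * x + b1)   # hidden layer output with ReLU
a2 = w2 * a1 + b2        # network output

# a2 equals b2 wherever w1*x + b1 <= 0 and is linear in x elsewhere,
# i.e. it bends at x = -b1/w1 and cannot be collapsed into one wx + b.
print(a2)
```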
Looking Ahead
- So far, we’ve covered neural networks for:
  - Binary classification (y is 0 or 1)
  - Regression (y can take positive or negative values, or only non-negative values)
- Next: Multi-class classification
  - When y can take 3, 4, 10, or more categorical values
Non-linear activation functions are essential for neural networks to work effectively. Without them, a neural network simply collapses into linear or logistic regression, regardless of how many layers it has, making it impossible to learn complex patterns in data.