Why It's Needed
Why Neural Networks Need Activation Functions
The Linear Activation Problem
- If we used a linear activation function for every neuron in the neural network:
  - The network becomes no different from linear regression
  - “This would defeat the entire purpose of using a neural network”
  - It cannot fit anything more complex than a basic linear regression model
Mathematical Proof with a Simple Example
Network Structure:
- Input: x (scalar)
- Hidden layer: one unit with parameters w₁ and b₁, outputs a₁
- Output layer: one unit with parameters w₂ and b₂, outputs a₂ (which equals f(x))
Using Linear Activation Throughout:
- Calculate the hidden layer output:
  - a₁ = g(w₁x + b₁)
  - Since g(z) = z (linear activation), a₁ = w₁x + b₁
- Calculate the final output:
  - a₂ = g(w₂a₁ + b₂)
  - Since g(z) = z, a₂ = w₂a₁ + b₂
- Substitute a₁ into the equation:
  - a₂ = w₂(w₁x + b₁) + b₂
  - a₂ = w₂w₁x + w₂b₁ + b₂
- Simplify by setting:
  - w = w₂w₁
  - b = w₂b₁ + b₂
- Result:
  - a₂ = wx + b
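As a quick numerical check, below is a minimal NumPy sketch of this two-unit network; the values of w₁, b₁, w₂, b₂ are arbitrary illustrative choices, not values from the lecture. It confirms that two stacked linear layers produce exactly the collapsed model wx + b.

```python
import numpy as np

# Arbitrary illustrative parameters for the two linear units
w1, b1 = 2.0, 0.5    # hidden unit
w2, b2 = -3.0, 1.0   # output unit

x = np.linspace(-5, 5, 11)

# Forward pass with linear activation g(z) = z
a1 = w1 * x + b1     # hidden layer output
a2 = w2 * a1 + b2    # network output

# Collapsed single linear model: w = w2*w1, b = w2*b1 + b2
w = w2 * w1
b = w2 * b1 + b2

print(np.allclose(a2, w * x + b))  # True: the "network" is just wx + b
```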
Fundamental Limitation
- “A linear function of a linear function is itself a linear function”
- Multiple layers with linear activations don’t enable the network to compute:
- More complex features
- Non-linear relationships
- Complex patterns in data
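To spell out the general claim, here is the same substitution argument written in LaTeX for L stacked scalar units with g(z) = z: each layer only rescales and shifts the previous one, so the whole stack is still a single line.

```latex
\begin{aligned}
a_1 &= w_1 x + b_1 \\
a_2 &= w_2 a_1 + b_2 = (w_2 w_1)\,x + (w_2 b_1 + b_2) \\
    &\;\;\vdots \\
a_L &= w_L a_{L-1} + b_L
     = \underbrace{(w_L w_{L-1} \cdots w_1)}_{w}\, x
     + \underbrace{(w_L \cdots w_2\, b_1 + \cdots + w_L\, b_{L-1} + b_L)}_{b}
     = w x + b
\end{aligned}
```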
Generalizing to Larger Networks
- A neural network with multiple hidden layers, all using linear activations:
  - The output will always be equivalent to linear regression
  - It can be expressed as wx + b (for some values of w and b)
- If the hidden layers use linear activations but the output layer uses a sigmoid:
  - The network becomes equivalent to logistic regression
  - The output can be expressed as 1/(1 + e^(-(wx + b)))
  - “This big neural network doesn’t do anything that you can’t also do with logistic regression”
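Continuing the NumPy sketch from above (parameter values again made up for illustration), adding a sigmoid only at the output reproduces plain logistic regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative parameters: two linear hidden units, one sigmoid output unit
w1, b1 = 0.8, -0.2
w2, b2 = 1.5, 0.3
w3, b3 = -2.0, 0.7

x = np.linspace(-4, 4, 9)

# Forward pass: linear activations in the hidden layers, sigmoid at the output
a1 = w1 * x + b1
a2 = w2 * a1 + b2
a3 = sigmoid(w3 * a2 + b3)

# Equivalent logistic regression: collapse the linear layers into a single w, b
w = w3 * w2 * w1
b = w3 * (w2 * b1 + b2) + b3

print(np.allclose(a3, sigmoid(w * x + b)))  # True: it is just logistic regression
```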
Best Practice Rule
- “Don’t use the linear activation function in the hidden layers of the neural network”
- Recommendation: Use ReLU activation for hidden layers
- This enables the network to learn non-linear relationships
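For contrast, here is a minimal sketch of the same two-unit structure with ReLU in the hidden layer (parameters again arbitrary): the output is now piecewise linear, with a kink where the hidden unit's pre-activation crosses zero, so it can no longer be written as a single wx + b.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Same two-unit structure as before, but with ReLU in the hidden layer
w1, b1 = 2.0, 0.5
w2, b2 = -3.0, 1.0

x = np.linspace(-5, 5, 11)

a1 = relu(w1 * x + b1)   # hidden layer output with ReLU
a2 = w2 * a1 + b2        # network output

# a2 equals b2 wherever w1*x + b1 <= 0 and is linear in x elsewhere,
# i.e. it bends at x = -b1/w1 and cannot be collapsed into one wx + b.
print(a2)
```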
Looking Ahead
- So far, we’ve covered neural networks for:
  - Binary classification (y is 0 or 1)
  - Regression (y can take positive or negative values, or only non-negative values)
- Next: Multi-class classification
  - When y can take 3, 4, 10, or more categorical values
Non-linear activation functions are essential for neural networks to work effectively. Without them, a neural network simply collapses into linear or logistic regression, regardless of how many layers it has, making it impossible to learn complex patterns in data.