Optimization

Better Than Gradient Descent
  • Gradient descent is a widely used optimization algorithm in machine learning
  • Foundation for linear regression, logistic regression, and early neural networks
  • However, newer optimization algorithms can train neural networks much faster

Too Small Learning Rate

  • When starting far from the minimum
  • Steps are small and point in similar directions
  • Convergence is unnecessarily slow
  • Ideally would increase the learning rate

Too Large Learning Rate

  • When close to the minimum or in a narrow valley
  • Steps overshoot and oscillate back and forth
  • Never properly converges
  • Ideally would decrease the learning rate (both cases are illustrated in the sketch below)
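
To make the two failure modes concrete, here is a small Python sketch (illustrative, not from the course) that runs plain gradient descent on the toy cost J(w) = w², whose gradient is 2w and whose minimum is at w = 0:

def run_gradient_descent(lr, steps=100, w=10.0):
    # Plain gradient descent with a fixed learning rate on J(w) = w**2
    for _ in range(steps):
        w = w - lr * 2 * w        # gradient of w**2 is 2w
    return w

print(run_gradient_descent(lr=0.01))  # too small: about 1.3, still far from 0 after 100 steps
print(run_gradient_descent(lr=1.0))   # too large: w flips between +10 and -10, never converges

A single fixed learning rate cannot fix both cases at once, which is the motivation for Adam below.
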
Adaptive Moment Estimation
  • Adam stands for Adaptive Moment Estimation
  • Automatically adjusts learning rates during training
  • Uses a separate learning rate for each parameter in your model
  • For example, with parameters w₁ through w₁₀ and b, Adam maintains 11 different learning rates
  • If parameter w_j consistently moves in the same direction:
      • Increase the learning rate for that specific parameter
      • Accelerates progress toward the minimum
  • If parameter w_j oscillates back and forth:
      • Decrease the learning rate for that specific parameter
      • Prevents bouncing and allows convergence (see the sketch below)
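
These bullets describe the effect; the mechanism inside Adam is a pair of exponentially weighted moving averages of the gradient and the squared gradient, kept separately for every parameter. The following NumPy sketch shows the standard Adam update (Kingma & Ba, 2014) for illustration; it is not code from the course, and the hyperparameter values are the usual defaults:

import numpy as np

def adam_update(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of gradients: stays large when steps keep pointing the same way,
    # averages toward zero when the gradient keeps flipping sign
    m = beta1 * m + (1 - beta1) * grad
    # Moving average of squared gradients: tracks how large recent gradients are
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: roughly alpha when gradients are consistent,
    # much smaller when they oscillate (m_hat cancels while v_hat stays large)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

Because the averages are element-wise, every entry of w gets its own effective step size, which is why Adam behaves as if it maintained a separate learning rate per parameter.
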

Adam Implementation
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Define the network architecture
model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Swap in the Adam optimizer when compiling
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # initial learning rate of 0.001
)
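
Training then proceeds exactly as before; X_train, y_train, and the epoch count below are placeholder names for your own data and settings:

model.fit(X_train, y_train, epochs=100)  # only the optimizer changed; the rest of the workflow is unchanged
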
Practitioner's Choice
  • Typically works much faster than gradient descent
  • More robust to the choice of the initial learning rate
  • Has become the de facto standard for training neural networks
  • A safe choice for most deep learning applications

What's Next
  • Upcoming videos will cover more advanced neural network concepts
  • Will explore alternative layer types for neural networks

The Adam optimization algorithm provides significant training speedups by adaptively adjusting learning rates for each parameter. Instead of using a single global learning rate, it increases rates for parameters moving consistently in one direction and decreases rates for oscillating parameters. This adaptive approach has made Adam the preferred choice for most modern neural network training.