1. Gradient Descent

Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize a function, typically a loss function. By repeatedly stepping in the direction of steepest decrease, which is the negative of the gradient, it searches for the parameters that best fit the data.

Equation:

θ = θ - α * ∇J(θ)

Description:

  • θ: Parameters (weights) of the model being optimized.
  • α (alpha): Learning rate, determines the size of the steps taken during optimization.
  • ∇J(θ): Gradient of the cost function with respect to the parameters.
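
As a concrete illustration, the update rule fits in a few lines of NumPy. The quadratic loss, its gradient, and the learning rate below are made-up stand-ins for an actual cost function, not part of the demo described later.

Example (Python):

import numpy as np

def grad_J(theta):
    # Gradient of an illustrative quadratic loss J(theta) = ||theta - 3||^2
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)   # initial parameters
alpha = 0.1           # learning rate

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta = theta - alpha * grad J(theta)

print(theta)          # approaches the minimizer [3, 3]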

2. Stochastic Gradient Descent (SGD)

Unlike traditional gradient descent, which computes the gradient over the entire dataset, SGD updates the model using a single data point (or a small mini-batch) per iteration. This makes each update far cheaper, and the added noise can help the optimizer escape shallow local minima, though it produces a noisier convergence path.

Equation:

θ = θ - α * ∇J(θ, xi, yi)

Description:

  • θ: Parameters (weights) of the model being optimized.
  • α (alpha): Learning rate, determines the size of the steps taken during optimization.
  • ∇J(θ, xi, yi): Gradient of the cost function with respect to the parameters computed on a single training example (xi, yi).
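
A minimal sketch of the same update applied one example at a time is shown below; the linear-regression loss, the synthetic data, and the learning rate are illustrative assumptions.

Example (Python):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                               # synthetic inputs
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=100)  # synthetic targets

def grad_J(theta, xi, yi):
    # Gradient of the squared error (xi . theta - yi)^2 for a single example
    return 2.0 * (xi @ theta - yi) * xi

theta = np.zeros(2)
alpha = 0.01

for epoch in range(20):
    for i in rng.permutation(len(X)):   # visit the examples in random order
        theta = theta - alpha * grad_J(theta, X[i], y[i])

print(theta)   # close to the true weights [1.5, -2.0]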

3. Momentum

Momentum is an optimization technique that accelerates gradient descent by taking past gradients into account when updating model parameters. Instead of relying only on the current gradient, momentum adds a fraction of the previous update to the current one. This smooths the updates, allowing the algorithm to keep moving in a consistent direction and dampening oscillations, especially in narrow valleys where the gradient is much steeper along some directions than others. By building up velocity along directions of consistent gradients, momentum speeds up convergence and can help the model escape local minima.

Equation:

v = β * v + (1 - β) * ∇J(θ)

θ = θ - α * v

Description:

  • θ: Parameters (weights) of the model being optimized.
  • α (alpha): Learning rate, determines the size of the steps taken during optimization.
  • β: Momentum coefficient, controls the influence of the past gradients.
  • v: Momentum term, which accumulates the exponentially decaying moving average of past gradients.
  • ∇J(θ): Gradient of the cost function with respect to the parameters.
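
Following the exponentially weighted form of the equations above, a minimal sketch on the same illustrative quadratic loss might look like this; beta and alpha are arbitrary example values.

Example (Python):

import numpy as np

def grad_J(theta):
    # Gradient of an illustrative quadratic loss J(theta) = ||theta - 3||^2
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
v = np.zeros(2)        # velocity: moving average of past gradients
alpha, beta = 0.1, 0.9

for _ in range(200):
    v = beta * v + (1 - beta) * grad_J(theta)   # accumulate past gradients
    theta = theta - alpha * v                   # step along the smoothed direction

print(theta)           # approaches the minimizer [3, 3]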

4. RMSprop (Root Mean Square Propagation)

Root Mean Square Propagation is an adaptive learning rate optimization algorithm. It adjusts the learning rate for each parameter based on a moving average of that parameter's squared gradients from previous iterations. By dividing each step by the square root of this moving average, RMSprop keeps updates from becoming too large or too small, ensuring more stable and efficient convergence.

Equation:

v = β * v + (1 - β) * (∇J(θ))^2

θ = θ - α * ∇J(θ) / (sqrt(v) + ε)

Description:

  • θ: Parameters (weights) of the model being optimized.
  • α (alpha): Learning rate, determines the size of the steps taken during optimization.
  • β: RMSprop decay rate, controls the exponential moving average of squared gradients.
  • v: RMSprop term, which accumulates the exponentially decaying moving average of squared gradients.
  • ∇J(θ): Gradient of the cost function with respect to the parameters.
  • ε (epsilon): Small constant to prevent division by zero.
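
A minimal sketch of the update on the same illustrative quadratic loss follows; the hyperparameter values are arbitrary examples.

Example (Python):

import numpy as np

def grad_J(theta):
    # Gradient of an illustrative quadratic loss J(theta) = ||theta - 3||^2
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
v = np.zeros(2)        # moving average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8

for _ in range(1000):
    g = grad_J(theta)
    v = beta * v + (1 - beta) * g**2                 # track the squared-gradient scale
    theta = theta - alpha * g / (np.sqrt(v) + eps)   # per-parameter adaptive step

print(theta)           # settles close to the minimizer [3, 3]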

5. Adam (Adaptive Moment Estimation)

Adam combines the benefits of two other methods: momentum and RMSprop. It computes adaptive learning rates for each parameter by maintaining exponentially decaying averages of past gradients (momentum) and their squared values (RMSprop). This approach allows Adam to adjust the learning rate dynamically, leading to faster convergence and better handling of noisy or sparse gradients.

Equation:

m = β1 * m + (1 - β1) * ∇J(θ)
v = β2 * v + (1 - β2) * (∇J(θ))^2
θ = θ - α * m / (sqrt(v) + ε)

Description:

  • θ: Parameters (weights) of the model being optimized.
  • α (alpha): Learning rate, determines the size of the steps taken during optimization.
  • β1, β2: Exponential decay rates for the moment estimates.
  • m: First moment (mean) estimate.
  • v: Second moment (uncentered variance) estimate.
  • ∇J(θ): Gradient of the cost function with respect to the parameters.
  • ε (epsilon): Small constant to prevent division by zero.
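
The sketch below follows the equations above and additionally applies the bias-correction step from the original Adam paper (dividing m by 1 - β1^t and v by 1 - β2^t), which the simplified equations omit; the loss and hyperparameters are again illustrative.

Example (Python):

import numpy as np

def grad_J(theta):
    # Gradient of an illustrative quadratic loss J(theta) = ||theta - 3||^2
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
m = np.zeros(2)        # first moment: moving average of gradients
v = np.zeros(2)        # second moment: moving average of squared gradients
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)           # close to the minimizer [3, 3]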

Click anywhere on the function heatmap to start a minimization. You can toggle the different algorithms (SGD, Momentum, RMSProp, Adam) by clicking on the circles in the lower bar.

The global minimum is on the left. A local minimum is found on the right. Interestingly, different initializations make some algorithms converge to the local minimum while others converge to the global minimum.

Note: The learning rate is 1e-2 for Adam, SGD with Momentum, and RMSProp, while it is 2e-2 for plain SGD (to make it converge faster).

This demo shows how quickly the Adam optimizer converges relative to the other optimizers, thanks to the momentum term introduced into the optimization. The source code of this visualization is from this GitHub repo by Emilien Dupont.