How Adamax and Other Optimizers Address Local Minima
How Do We Address the Challenge of Local Minima in Deep Neural Networks?
Deep neural networks often face non-convex loss surfaces with many local minima and saddle points. To address this:
Use advanced optimizers like Adam or RMSProp that adapt learning rates and incorporate momentum.
Initialize weights smartly (e.g., He or Xavier initialization) to avoid poor starting points.
Add noise through techniques like dropout or small batch sizes to help escape shallow minima.
Use batch normalization to smooth the loss landscape.
Train with mini-batches, which introduce stochasticity and help the model explore the loss surface.
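Several of these techniques fit together in a few lines of Keras. The sketch below is illustrative only (the layer sizes, dropout rate, and input shape are arbitrary assumptions): it combines He initialization, batch normalization, dropout, and the Adam optimizer, and would then be trained with mini-batches via model.fit.
import tensorflow as tf

# Illustrative sketch: He initialization, batch normalization, dropout,
# and an adaptive optimizer (Adam) combined in one small model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu',
                          kernel_initializer='he_normal',   # smart weight initialization
                          input_shape=(100,)),
    tf.keras.layers.BatchNormalization(),                    # smooths the loss landscape
    tf.keras.layers.Dropout(0.2),                            # injects noise to escape shallow minima
    tf.keras.layers.Dense(10, activation='softmax')
])

# Adam adapts per-parameter learning rates and incorporates momentum;
# training with mini-batches (e.g., batch_size=64) adds useful stochasticity.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X, y, batch_size=64, epochs=10)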
What Is the Adam Optimizer?
Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the strengths of Momentum and RMSProp. It:
Maintains a moving average of gradients (first moment).
Maintains a moving average of squared gradients (second moment).
Adapts learning rates for each parameter.
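For reference, here is a minimal NumPy sketch of a single Adam update for one parameter vector, using the usual default hyperparameters (learning rate 0.001, beta1 = 0.9, beta2 = 0.999); the function and variable names are illustrative.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector theta given gradient grad (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2         # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias correction for the first moment
    v_hat = v / (1 - beta2**t)                    # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v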
Training Time Comparison
Adam usually converges faster than:
SGD, which requires more epochs and careful learning rate tuning.
RMSProp, which adapts learning rates but lacks momentum.
AdaGrad, which may shrink learning rates too aggressively.
Adam is widely used because it performs well out of the box on many deep learning tasks.
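A simple way to check these claims on your own task is to train the same model under each optimizer and compare wall-clock time and final loss. The sketch below uses random dummy data, so only the relative behaviour is meaningful; the optimizer names are the standard Keras identifier strings.
import time
import numpy as np
import tensorflow as tf

# Dummy data: results only show relative optimizer behaviour, not real accuracy.
X = np.random.rand(1000, 100)
y = tf.keras.utils.to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)

for opt in ['adam', 'rmsprop', 'adagrad', 'sgd']:
    # Rebuild the model each time so every optimizer starts from fresh weights.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    start = time.time()
    history = model.fit(X, y, batch_size=64, epochs=10, verbose=0)
    print(f"{opt}: {time.time() - start:.1f}s, final loss {history.history['loss'][-1]:.3f}")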
What Is AdaComp?
AdaComp (Adaptive Compression) is a gradient compression technique used in distributed training. It:
Selects and transmits only the most important gradients.
Reduces communication overhead between devices.
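The full AdaComp algorithm adapts its compression rate and uses local residual accumulation; the sketch below is only a simplified illustration of the core idea (transmit the largest-magnitude gradient entries, keep the rest locally for later). The function name and fixed top-k rule are assumptions for illustration, not the exact AdaComp procedure.
import torch

def compress_gradient(grad, residual, k_ratio=0.01):
    """Simplified top-k gradient sparsification (illustrative, not the exact AdaComp rule)."""
    g = grad + residual                              # fold in previously untransmitted gradient
    k = max(1, int(k_ratio * g.numel()))             # keep roughly 1% of the entries
    _, idx = torch.topk(g.abs().flatten(), k)        # indices of the most important gradients
    sparse = torch.zeros_like(g).flatten()
    sparse[idx] = g.flatten()[idx]                   # values that would be sent to other workers
    sparse = sparse.view_as(g)
    new_residual = g - sparse                        # kept locally and added to the next gradient
    return sparse, new_residual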
Computational Efficiency Comparison with Adam
Adam improves optimization speed.
AdaComp improves communication efficiency in distributed systems.
You can use AdaComp with Adam to scale training across multiple machines efficiently.
What Is Stochastic Gradient Descent (SGD)?
SGD updates model parameters using one randomly selected data point at a time.
How It Works:
Shuffle the training dataset.
Randomly select one data point.
Compute the gradient of the loss with respect to model parameters.
Update the parameters using this gradient.
This process repeats for each data point in the dataset.
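As a concrete illustration, here is a minimal NumPy sketch of one SGD epoch for a linear model with squared-error loss; the model and loss are assumptions chosen to keep the example short.
import numpy as np

def sgd_epoch(w, X, y, lr=0.01):
    """One pass of per-sample SGD for a linear model with squared-error loss (illustrative)."""
    indices = np.random.permutation(len(X))          # shuffle the training data
    for i in indices:                                # one randomly ordered sample at a time
        xi, yi = X[i], y[i]
        error = xi @ w - yi                          # prediction error for this single sample
        grad = 2 * error * xi                        # gradient of the squared loss w.r.t. w
        w = w - lr * grad                            # update using this noisy single-sample gradient
    return w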
Noisy Gradient Signal
Because SGD uses only one sample per update:
The gradient estimate is noisy.
The model parameters may "jump" around.
The loss curve may fluctuate heavily.
What Is the Adamax Optimizer?
Adamax is a variant of Adam that uses the infinity norm (maximum absolute value) instead of the L2 norm.
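Concretely, Adamax tracks an exponentially weighted infinity norm of past gradients and divides each step by it. Below is a minimal NumPy sketch of one update; the function and variable names, and the small epsilon added for numerical safety, are illustrative.
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update: the infinity norm u replaces Adam's squared-gradient average."""
    m = beta1 * m + (1 - beta1) * grad               # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))          # exponentially weighted infinity norm
    theta = theta - (lr / (1 - beta1**t)) * m / (u + eps)  # bias-corrected step, bounded by u
    return theta, m, u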
How Adamax Addresses Local Minima
Adamax provides stable updates even when gradients are large or sparse.
Because each parameter's step is scaled by the infinity norm of its past gradients, the effective step size stays bounded and consistent, which helps the optimizer keep moving through flat regions and shallow local minima rather than stalling.
Comparison with SGD
Adamax adapts learning rates and handles sparse gradients better.
SGD is simpler and may generalize better but often converges more slowly.
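Both optimizers are available off the shelf; in PyTorch, for example, switching between them is a one-line change. The model below is just a placeholder and the learning rates are the library defaults.
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 10)   # placeholder model for illustration

# Adaptive optimizer: per-parameter learning rates, robust to sparse gradients.
optimizer = optim.Adamax(model.parameters(), lr=0.002)

# Simpler alternative: plain SGD with momentum; often generalizes well but
# usually needs more careful learning-rate tuning and more epochs.
# optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)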
What Is Mini-Batch Gradient Descent?
Mini-batch gradient descent splits the training data into small batches (e.g., 32, 64 samples) and updates weights using the average gradient of each batch.
How It Works
Shuffle the dataset.
Split it into batches of fixed size.
For each batch:
Compute the average gradient.
Update the model parameters.
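As shown in the sketch below, the only change from the per-sample SGD sketch above is that gradients are averaged over a batch before each update; the linear model and squared-error loss are again illustrative assumptions.
import numpy as np

def minibatch_epoch(w, X, y, lr=0.01, batch_size=32):
    """One epoch of mini-batch gradient descent for a linear model (illustrative)."""
    indices = np.random.permutation(len(X))              # shuffle the dataset
    for start in range(0, len(X), batch_size):           # split into fixed-size batches
        batch = indices[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        errors = Xb @ w - yb                              # prediction errors for the whole batch
        grad = 2 * Xb.T @ errors / len(batch)             # average gradient over the batch
        w = w - lr * grad                                 # one update per batch
    return w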
Comparison with SGD
SGD: Updates per sample → fast but noisy.
Mini-batch: Updates per batch → smoother, more efficient on GPUs.
Batch Size Trade-offs
The table below summarizes the trade-offs between smaller and larger batch sizes.
Batch Size | Pros | Cons
Small (e.g., 32) | Fast updates, better generalization | Noisy gradients
Large (e.g., 1024) | Stable gradients, efficient on GPUs | Slower convergence, risk of poor generalization
Example: Large Batch Slows Convergence
In image classification, a batch size of 1024 may:
Estimate gradients more accurately.
Converge more slowly due to fewer updates per epoch.
A batch size of 64 may:
Converge faster due to more frequent updates.
Generalize better despite noisier gradients.
TensorFlow Code Example
import tensorflow as tf
import numpy as np
# Generate dummy data
X = np.random.rand(1000, 100)
y = tf.keras.utils.to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)
# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Compile with Adam optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Train with large batch size
model.fit(X, y, batch_size=1024, epochs=10, verbose=2)
How It Works
Creates a simple neural network.
Uses Adam optimizer.
Trains with a large batch size (1024).
You can change batch_size to 64 to compare convergence speed.
PyTorch Code Example
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Generate dummy data
X = torch.rand(1000, 100)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
# DataLoader with large batch size
loader = DataLoader(dataset, batch_size=1024, shuffle=True)
# Define a simple model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(100, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = Net()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Training loop
for epoch in range(10):
    for batch_X, batch_y in loader:
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
How It Works
Defines a simple feedforward neural network.
Uses Adam optimizer and cross-entropy loss.
Loads data in large batches (1024).
You can change batch_size to 64 to observe faster convergence.