The Adam optimizer, short for Adaptive Moment Estimation, is a popular optimization algorithm used in training deep learning models. It combines the advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSProp. Adam computes adaptive learning rates for each parameter by maintaining running averages of both the gradients (first moment) and the squared gradients (second moment). These moving averages are estimates of the mean and the uncentered variance of the gradients, respectively.
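As a concrete reference, the update rules from the original Adam paper (Kingma & Ba, 2015) can be sketched as follows; g_t denotes the gradient at step t, θ_t the parameters, α the learning rate, m_t and v_t the two moving averages, and β₁, β₂, ε are the hyperparameters discussed below:

```latex
\begin{align*}
  m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
      && \text{first moment (mean of the gradients)} \\
  v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}
      && \text{second moment (uncentered variance)} \\
  \hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, \qquad
  \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
      && \text{bias-corrected estimates} \\
  \theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
      && \text{parameter update}
\end{align*}
```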
Adam uses two hyperparameters, typically denoted β₁ and β₂, which control the exponential decay rates of these moving averages, plus a small constant ε added to the denominator of the update to avoid division by zero when the second-moment estimate is near zero. Because both moving averages are initialized at zero, Adam also applies a bias correction so that the early estimates are not skewed toward zero. The optimizer is well suited to problems with large datasets or many parameters, and it is known for its computational efficiency and low memory requirements.
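To make the role of β₁, β₂, and ε concrete, here is a minimal NumPy sketch of a single Adam step; the function name adam_step and its signature are illustrative, not taken from any particular library:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` given gradient `grad`.

    m and v are the running first- and second-moment estimates, and t is the
    1-based step count. Returns the updated (theta, m, v).
    Illustrative sketch, not a production implementation.
    """
    m = beta1 * m + (1 - beta1) * grad            # update biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # update biased second moment
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v
```

The only per-parameter state carried between steps is m and v (plus the shared step counter t), which is where the modest memory overhead comes from.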
Adam is widely used because it generally requires little tuning and works well in practice across a wide range of deep learning architectures, including convolutional and recurrent neural networks.
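For example, in PyTorch, Adam is typically used with its default hyperparameters (lr=1e-3, betas=(0.9, 0.999), eps=1e-8); the tiny model and random data below are placeholders chosen only to make the snippet self-contained:

```python
import torch
import torch.nn as nn

# Placeholder model and data, purely for illustration.
model = nn.Linear(10, 1)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()

for _ in range(100):          # simple training loop
    optimizer.zero_grad()     # clear accumulated gradients
    loss = loss_fn(model(inputs), targets)
    loss.backward()           # compute gradients
    optimizer.step()          # Adam parameter update
```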