The Role of the Softmax Activation Function in a NN Model
18 Apr, 2025

Role of Activation Functions and Loss Functions in Neural Network Models

 

Activation Functions 

Activation functions introduce non-linearity into the model, allowing it to learn complex patterns. They determine whether a neuron should be activated based on the input it receives. Common activation functions include (a short code sketch follows the list):

  • ReLU (Rectified Linear Unit): $\text{ReLU}(x) = \max(0, x)$

  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

  • Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
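
As a quick illustration, here is a minimal NumPy sketch of these three functions; the input array is an arbitrary example:

import numpy as np

def relu(x):
    # Zero out negatives, pass positives through
    return np.maximum(0, x)

def sigmoid(x):
    # Squash inputs into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Squash inputs into the range (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(sigmoid(x))  # approximately [0.12 0.5  0.88]
print(tanh(x))     # approximately [-0.96 0.   0.96]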

 

Loss Functions

Loss functions measure how well the model's predictions match the actual data. They guide the optimization process by providing a metric to minimize. Common loss functions include (illustrated in the sketch after the list):

  • Cross-Entropy Loss: Used for classification tasks, it measures the difference between the predicted probability distribution and the actual distribution. 

  • Mean Squared Error (MSE): Used for regression tasks, it measures the average squared difference between predicted and actual values. 
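
To make these concrete, here is a minimal NumPy sketch of both losses; the label and prediction arrays are made-up values, not outputs of a real model:

import numpy as np

def cross_entropy(y_true, y_pred):
    # Cross-entropy for one-hot labels, averaged over the batch;
    # the small epsilon guards against log(0)
    eps = 1e-12
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

def mse(y_true, y_pred):
    # Average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

# Classification: one-hot label vs. predicted probabilities
print(cross_entropy(np.array([[1.0, 0.0, 0.0]]),
                    np.array([[0.7, 0.2, 0.1]])))  # approximately 0.357

# Regression: actual vs. predicted values
print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))  # 0.25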

The Softmax Activation Function

The softmax activation function converts a vector of raw scores (logits) into a probability distribution over classes. Each element of the output vector represents the probability that the input belongs to the corresponding class. The formula for the softmax function is:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$ 

Input of Softmax Activation Function

The input to the softmax function is a $K$-dimensional vector $z = [z_1, z_2, \dots, z_K]$, where $K$ is the number of classes. Each $z_i$ represents the raw score (logit) for the $i$-th class.

Why Softmax Divides by the Sum of Exponentials

  • Normalization: By dividing by the sum of all exponentials, the softmax function ensures that the sum of the output probabilities is 1. This is crucial for interpreting the output as a probability distribution.

  • Exponentiation: The exponential function ($e^{z_i}$) ensures that all values are positive and amplifies larger values more than smaller ones. This helps in differentiating between the classes more clearly. 

  • Relative Comparison: The division by the sum of exponentials allows each output to be compared relative to the others. This means that the probability of a class is influenced by the scores of all classes, not just the individual score. 
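
Putting these points together, a minimal NumPy implementation of softmax might look like this. Subtracting the maximum logit before exponentiating is a common numerical-stability trick; it leaves the result unchanged because the shift cancels in the ratio:

import numpy as np

def softmax(z):
    # Shift by the max logit so the exponentials cannot overflow;
    # the shift cancels out between numerator and denominator
    e = np.exp(z - np.max(z))
    # Normalize so the outputs sum to 1
    return e / np.sum(e)

z = np.array([3.0, 1.0, 0.2])   # arbitrary example logits
p = softmax(z)
print(p)        # approximately [0.84 0.11 0.05]
print(p.sum())  # 1.0, a valid probability distribution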

Example of Multi-Class Classification 

Let's consider a neural network designed to classify images into one of three categories: cat, dog, and rabbit. The last layer of this neural network has three neurons, each corresponding to one of these classes. 

Step-by-Step Process

Logits Calculation: Suppose the output logits from the last layer are: 

  • $z_{\text{cat}} = 2.0$

  • $z_{\text{dog}} = 1.0$

  • $z_{\text{rabbit}} = 0.1$

 

Exponentiation: Apply the exponential function to each logit: 

  • $e^{z_{\text{cat}}} = e^{2.0} \approx 7.39$

  • $e^{z_{\text{dog}}} = e^{1.0} \approx 2.72$

  • $e^{z_{\text{rabbit}}} = e^{0.1} \approx 1.11$

Sum of Exponentials: Calculate the sum of these exponentials: 

  • $\sum_{j=1}^{3} e^{z_j} = 7.39 + 2.72 + 1.11 \approx 11.22$

Softmax Calculation: Compute the softmax probabilities: 

  • $\text{softmax}(z_{\text{cat}}) = \frac{7.39}{11.22} \approx 0.66$

  • $\text{softmax}(z_{\text{dog}}) = \frac{2.72}{11.22} \approx 0.24$

  • $\text{softmax}(z_{\text{rabbit}}) = \frac{1.11}{11.22} \approx 0.10$

 

Interpretation

  • The probability of the image being a cat is approximately 66%. 

  • The probability of the image being a dog is approximately 24%. 

  • The probability of the image being a rabbit is approximately 10%.
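
These numbers are straightforward to verify in a few lines of NumPy; the logits are the ones from the example above:

import numpy as np

z = np.array([2.0, 1.0, 0.1])  # logits for cat, dog, rabbit
e = np.exp(z)
print(e)            # approximately [7.39 2.72 1.11]
print(e / e.sum())  # approximately [0.66 0.24 0.10]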

 

Example of a CNN Model 

Let's build a Convolutional Neural Network (CNN) model for image classification, where all convolutional layers and fully connected layers use the ReLU (Rectified Linear Unit) activation function. The final layer will use the softmax activation function. This model will be trained using the cross-entropy loss function and the stochastic gradient descent (SGD) optimizer. 

CNN Model Architecture

import tensorflow as tf 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Activation 
from tensorflow.keras.optimizers import SGD 
 
# Define the CNN model 
model = Sequential() 
 
# Convolutional layer 1 
model.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3))) 
model.add(Activation('relu')) 
model.add(MaxPooling2D(pool_size=(2, 2))) 
 
# Convolutional layer 2 
model.add(Conv2D(64, (3, 3))) 
model.add(Activation('relu')) 
model.add(MaxPooling2D(pool_size=(2, 2))) 
 
# Convolutional layer 3 
model.add(Conv2D(128, (3, 3))) 
model.add(Activation('relu')) 
model.add(MaxPooling2D(pool_size=(2, 2))) 
 
# Fully connected layer 1 
model.add(Flatten()) 
model.add(Dense(128)) 
model.add(Activation('relu')) 
 
# Fully connected layer 2 
model.add(Dense(64)) 
model.add(Activation('relu')) 
 
# Output layer with softmax activation 
model.add(Dense(3))  # Assuming 3 classes: cat, dog, rabbit 
model.add(Activation('softmax')) 
 
# Compile the model 
model.compile(loss='categorical_crossentropy', 
              optimizer=SGD(), 
              metrics=['accuracy']) 
 
# Summary of the model 
model.summary() 

Training the Model 

To train the model, you would typically use a dataset of labeled images. Here's an example of how you might train the model: 

# Assuming X_train and y_train are your training data and labels;
# y_train must be one-hot encoded to match categorical_crossentropy
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
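
If you do not yet have a real dataset, the pipeline can be smoke-tested with random stand-in data; the arrays below are hypothetical placeholders, with one-hot labels to match the categorical_crossentropy loss:

import numpy as np

# 100 random 64x64 RGB "images" and random labels for the 3 classes
X_train = np.random.rand(100, 64, 64, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(
    np.random.randint(0, 3, size=100), num_classes=3)

# Reuses the compiled model from above; one epoch suffices for a smoke test
model.fit(X_train, y_train, epochs=1, batch_size=32, validation_split=0.2)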

This example demonstrates how to build and train a CNN model for a multi-class classification problem, using ReLU activation functions in the convolutional and fully connected layers, and softmax in the output layer. The model is trained using the cross-entropy loss function and the stochastic gradient descent optimizer. 
