Layer Normalization in Deep Learning
Detailed Explanation of Layer Normalization
Layer normalization is a technique used in deep learning to normalize the inputs across the features for each data sample, rather than across the batch as in batch normalization. This method is particularly useful in recurrent neural networks (RNNs) and transformers, where it helps stabilize the training process and improve model performance.
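As a quick orientation, here is a minimal sketch of applying layer normalization with PyTorch's built-in `nn.LayerNorm` module; the batch size and feature count are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A batch of 4 samples, each with 16 features
x = torch.randn(4, 16)

# Normalize over the last dimension (the 16 features of each sample)
layer_norm = nn.LayerNorm(normalized_shape=16)
y = layer_norm(x)

# Each sample is now approximately zero-mean, unit-variance across its features
print(y.mean(dim=-1))                  # ~0 for every sample
print(y.var(dim=-1, unbiased=False))   # ~1 for every sample
```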
How Layer Normalization Works
- Compute the Mean and Variance: For a given input sample, compute the mean (μ) and variance (σ²) across all of its features.
- Normalize: Normalize the features using the computed mean and variance: x̂ = (x − μ) / √(σ² + ε), where ε is a small constant added for numerical stability.
- Scale and Shift: Apply learned parameters gamma (γ) and beta (β) to scale and shift the normalized values: y = γ · x̂ + β (a sketch of these three steps is shown after this list).
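The three steps above can be written out directly. The following is a minimal sketch in PyTorch; the function name `layer_norm_manual` and the epsilon value are illustrative, not taken from any particular library:

```python
import torch

def layer_norm_manual(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last (feature) dimension of x."""
    # Step 1: per-sample mean and variance across the features
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    # Step 2: normalize (eps avoids division by zero)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    # Step 3: scale and shift with the learned parameters gamma and beta
    return gamma * x_hat + beta

x = torch.randn(4, 16)
gamma = torch.ones(16)   # learned scale, typically initialized to 1
beta = torch.zeros(16)   # learned shift, typically initialized to 0
out = layer_norm_manual(x, gamma, beta)
```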
Properties and Advantages
- Independent of Batch Size: Effective for training models with small batch sizes or single samples, because the statistics come from each sample alone (a quick check follows this list).
- Stabilizes Training: Reduces internal covariate shift and keeps activations at a consistent scale throughout training.
- Consistent Normalization: Ensures consistent normalization across different features within each sample.
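A simple way to see the batch-size independence mentioned above: because the statistics are computed per sample, a sample's normalized output does not change when it is processed alone versus inside a larger batch. A minimal check using PyTorch's `nn.LayerNorm`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)
batch = torch.randn(8, 16)

# Normalize the full batch, then normalize the first sample on its own
out_in_batch = ln(batch)[0]
out_alone = ln(batch[0:1])[0]

# The results match: layer norm never looks at the other samples
print(torch.allclose(out_in_batch, out_alone))  # True
```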
Uses
- Recurrent Neural Networks (RNNs): Often used in RNNs, where batch statistics are awkward to compute across time steps and training can otherwise be unstable.
- Transformers: Widely used in transformer architectures for tasks like language modeling (a sketch of the common pre-norm placement follows this list).
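For the transformer use case, a common placement is the "pre-norm" residual pattern, where each sublayer's input is layer-normalized before the sublayer itself. Below is a rough sketch of that pattern for a feed-forward sublayer; the module name and sizes are illustrative only, not taken from any specific transformer implementation:

```python
import torch
import torch.nn as nn

class PreNormFeedForwardBlock(nn.Module):
    """Residual feed-forward sublayer with layer norm applied before the sublayer."""
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Normalize, transform, then add the residual connection
        return x + self.ff(self.norm(x))

block = PreNormFeedForwardBlock()
x = torch.randn(2, 10, 64)   # (batch, sequence length, model dimension)
y = block(x)                 # same shape as x
```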
Comparison with Batch Normalization
Batch normalization (BN) normalizes each feature across the mini-batch, so its statistics depend on the batch composition; this makes it sensitive to batch size and less suitable for RNNs or very small batches.
Batch Normalization (BN) Overview
- Compute the Mean and Variance: For a given mini-batch, compute the mean (μ) and variance (σ²) of each feature across the batch.
- Normalize: Normalize the batch using the computed mean and variance: x̂ = (x − μ) / √(σ² + ε).
- Scale and Shift: Apply learned parameters gamma (γ) and beta (β) to scale and shift the normalized values: y = γ · x̂ + β (a sketch of these steps, mirroring the layer-norm one above, follows this list).
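The batch-norm steps can be written out the same way as the layer-norm sketch earlier. This minimal version covers training-time behavior only; real implementations (for example `nn.BatchNorm1d`) also track running statistics for use at inference time. The function name and epsilon are illustrative:

```python
import torch

def batch_norm_manual(x, gamma, beta, eps=1e-5):
    """Batch normalization for a 2-D input of shape (batch, features)."""
    # Step 1: per-feature mean and variance across the batch dimension
    mu = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    # Step 2: normalize each feature using the batch statistics
    x_hat = (x - mu) / torch.sqrt(var + eps)
    # Step 3: scale and shift with the learned parameters gamma and beta
    return gamma * x_hat + beta

x = torch.randn(32, 16)      # a mini-batch of 32 samples, 16 features each
gamma = torch.ones(16)
beta = torch.zeros(16)
out = batch_norm_manual(x, gamma, beta)
```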
Properties and Advantages
- Effective with Large Batches: Performs well with large batch sizes.
- Stabilizes Training: Reduces internal covariate shift, speeding up the training process.
- Regularization Effect: Can act as a form of regularization, reducing the need for other techniques like dropout.
Uses
- Convolutional Neural Networks (CNNs): Commonly used in CNNs to normalize feature maps (a sketch follows this list).
- Large Batch Training: Suitable for models trained with large batch sizes.
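In a CNN, `nn.BatchNorm2d` is typically placed right after a convolution and normalizes each channel's feature map over the batch and spatial dimensions. A minimal sketch, with arbitrary layer sizes chosen for illustration:

```python
import torch
import torch.nn as nn

# Convolution -> batch norm -> activation, a common ordering in CNNs
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),  # one mean/variance pair per channel
    nn.ReLU(),
)

images = torch.randn(8, 3, 32, 32)   # (batch, channels, height, width)
features = conv_block(images)        # shape: (8, 16, 32, 32)
```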
Comparison of Layer Normalization and Batch Normalization
| Feature | Layer Normalization | Batch Normalization |
|---|---|---|
| Normalization Dimension | Normalizes across features for each sample | Normalizes across the batch for each feature |
| Formula | y = γ · (x − μ) / √(σ² + ε) + β, with μ and σ² computed over the features of each sample | y = γ · (x − μ) / √(σ² + ε) + β, with μ and σ² computed over the batch for each feature |
| Dependence on Batch Size | Independent of batch size, suitable for small batches or single samples | Requires larger batch sizes for accurate statistics |
| Use Cases | Recurrent neural networks, small mini-batches, or single data points | Convolutional neural networks, large mini-batches |
| Advantages | Stabilizes training, effective in RNNs, no dependence on batch size | Stabilizes training, accelerates convergence, acts as a regularizer |
| Disadvantages | May be less effective in CNNs compared to batch normalization, introduces additional computations | Performance may degrade with very small batch sizes, requires computation of batch statistics |