Batch Normalization in Deep Learning

Detailed Explanation of Batch Normalization Layer

Batch normalization is a technique used to improve the training of deep neural networks. It normalizes the output of a preceding layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This stabilizes the learning process and can substantially reduce the number of training epochs required to train deep networks.

How Batch Normalization Works

  1. Compute the mean and variance: For a given mini-batch of m examples, compute the mean (\mu_B) and variance (\sigma_B^2) for each feature.

    \mu_B=\frac{1}{m}\sum_{i=1}^{m}x_i
    \sigma_B^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2

  2. Normalize: Normalize the batch using the computed mean and variance, where \epsilon is a small constant added for numerical stability.

    \hat{x_i}=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}

  3. Scale and shift: Apply the learnable parameters gamma (\gamma) and beta (\beta) to scale and shift the normalized value. (A short code sketch of all three steps follows this list.)

    y_i=\gamma\hat{x_i}+\beta
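
The three steps above can be written compactly in code. Below is a minimal NumPy sketch of the training-time forward pass; the function name, the default eps=1e-5, and the example shapes are illustrative assumptions, not part of any particular library. (At inference time, frameworks typically replace the batch statistics with running averages collected during training.)

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Batch normalization for a mini-batch x of shape (m, num_features)."""
        mu = x.mean(axis=0)                     # per-feature mean over the batch
        var = x.var(axis=0)                     # per-feature (biased) variance over the batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each feature
        return gamma * x_hat + beta             # scale and shift with learnable parameters

    # Example usage: 32 samples with 4 features; gamma and beta would normally be learned.
    x = np.random.randn(32, 4)
    y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))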

Properties and Advantages

  • Stabilizes training: Reduces internal covariate shift
  • Speeds up training: Allows for higher learning rates
  • Regularization effect: Can reduce the need for dropout

Uses

  • Deep neural networks: Common in CNNs and fully connected networks
  • Improving convergence: Helps in faster convergence during training

Layer Normalization

Layer normalization is another normalization technique that normalizes the inputs across the features for each data sample, rather than across the batch.

How Layer Normalization Works

  1. Compute the mean and variance: For a given input sample, compute the mean (\mu_L) and variance (\sigma_L^2) across all H features.

    \mu_L=\frac{1}{H}\sum_{i=1}^{H}x_i
    \sigma_L^2=\frac{1}{H}\sum_{i=1}^{H}(x_i-\mu_L)^2

  2. Normalize: Normalize the features using the computed mean and variance.

    \hat{x_i}=\frac{x_i-\mu_L}{\sqrt{\sigma_L^2+\epsilon}}

  3. Scale and shift: Apply the learnable parameters gamma (\gamma) and beta (\beta) to scale and shift the normalized value. (A code sketch follows this list.)

    y_i=\gamma\hat{x_i}+\beta
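
A corresponding NumPy sketch for layer normalization is shown below; the only substantive change from the batch-normalization sketch is the axis over which the statistics are computed (per sample, across its H features). Names and shapes are again illustrative.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        """Layer normalization for inputs x of shape (..., H): statistics per sample."""
        mu = x.mean(axis=-1, keepdims=True)     # per-sample mean over the H features
        var = x.var(axis=-1, keepdims=True)     # per-sample (biased) variance over the features
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each sample
        return gamma * x_hat + beta             # scale and shift with learnable parameters

    # Example usage: each of the 32 rows is normalized independently of the others.
    y = layer_norm(np.random.randn(32, 4), gamma=np.ones(4), beta=np.zeros(4))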

Properties and Advantages

  • Independent of batch size: Useful in recurrent neural networks
  • Stabilizes training: Similar to batch normalization

Uses

  • Recurrent neural networks: Generally preferred over batch normalization for RNNs and other sequence models
  • Small batch sizes: Remains effective where batch statistics are too noisy for batch normalization

Comparison of Batch Normalization and Layer Normalization

Normalization Dimension
  • Batch Normalization: Normalizes across the batch for each feature
  • Layer Normalization: Normalizes across the features for each sample

Formula
  • Batch Normalization: \hat{x_i}=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}
  • Layer Normalization: \hat{x_i}=\frac{x_i-\mu_L}{\sqrt{\sigma_L^2+\epsilon}}

Dependence on Batch Size
  • Batch Normalization: Requires larger batch sizes for accurate statistics
  • Layer Normalization: Independent of batch size; suitable for small batches or single samples

Use Cases
  • Batch Normalization: Convolutional neural networks, large mini-batches
  • Layer Normalization: Recurrent neural networks, small mini-batches, or single data points

Advantages
  • Batch Normalization: Stabilizes training, accelerates convergence, acts as a regularizer
  • Layer Normalization: Stabilizes training, effective in RNNs, no dependence on batch size

Disadvantages
  • Batch Normalization: Performance may degrade with very small batch sizes; requires computation of batch statistics
  • Layer Normalization: May be less effective in CNNs than batch normalization; introduces additional computation
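
The difference in normalization dimension is easy to verify in a framework. The short PyTorch sketch below applies the standard torch.nn.BatchNorm1d and torch.nn.LayerNorm modules to the same mini-batch; the tensor sizes are arbitrary choices for illustration.

    import torch
    import torch.nn as nn

    x = torch.randn(8, 16)       # mini-batch of 8 samples, 16 features each

    bn = nn.BatchNorm1d(16)      # normalizes each feature across the batch dimension
    ln = nn.LayerNorm(16)        # normalizes each sample across its 16 features

    y_bn = bn(x)                 # per-feature (column-wise) mean is ~0 in training mode
    y_ln = ln(x)                 # per-sample (row-wise) mean is ~0

    print(y_bn.mean(dim=0))      # close to zero for every feature
    print(y_ln.mean(dim=1))      # close to zero for every sample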