Batch Normalization in Deep Learning

Detailed Explanation of Batch Normalization Layer

Batch normalization is a technique used to improve the training of deep neural networks. It normalizes the output of a preceding layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This stabilizes the learning process and can substantially reduce the number of training epochs required to train deep networks.

How Batch Normalization Works

  1. Compute the mean and variance: For a given mini-batch of m examples, compute the mean (\mu_B) and variance (\sigma_B^2) for each feature.

    \mu_B=\frac{1}{m}\sum_{i=1}^{m}x_i
    \sigma_B^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2

  2. Normalize: Normalize the batch using the computed mean and variance, where \epsilon is a small constant added for numerical stability.

    \hat{x_i}=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}

  3. Scale and shift: Apply the learnable parameters gamma (\gamma) and beta (\beta) to scale and shift the normalized value. (A short code sketch of all three steps follows this list.)

    y_i=\gamma\hat{x_i}+\beta
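
The three steps above can be written compactly in code. Below is a minimal NumPy sketch of the training-time forward pass; the function name, the default eps=1e-5, and the example shapes are illustrative assumptions, not part of any particular library. (At inference time, frameworks typically replace the batch statistics with running averages collected during training.)

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Batch normalization for a mini-batch x of shape (m, num_features)."""
        mu = x.mean(axis=0)                     # per-feature mean over the batch
        var = x.var(axis=0)                     # per-feature (biased) variance over the batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each feature
        return gamma * x_hat + beta             # scale and shift with learnable parameters

    # Example usage: 32 samples with 4 features; gamma and beta would normally be learned.
    x = np.random.randn(32, 4)
    y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))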

Properties and Advantages

  • Stabilizes training: Reduces internal covariate shift
  • Speeds up training: Allows for higher learning rates
  • Regularization effect: Can reduce the need for dropout

Uses

  • Deep neural networks: Common in CNNs and fully connected networks
  • Improving convergence: Helps in faster convergence during training

Layer Normalization

Layer normalization is another normalization technique that normalizes the inputs across the features for each data sample, rather than across the batch.

How Layer Normalization Works

  1. Compute the mean and variance: For a given input sample, compute the mean (\mu_L) and variance (\sigma_L^2) across all H features.

    \mu_L=\frac{1}{H}\sum_{i=1}^{H}x_i
    \sigma_L^2=\frac{1}{H}\sum_{i=1}^{H}(x_i-\mu_L)^2

  2. Normalize: Normalize the features using the computed mean and variance.

    \hat{x_i}=\frac{x_i-\mu_L}{\sqrt{\sigma_L^2+\epsilon}}

  3. Scale and shift: Apply the learnable parameters gamma (\gamma) and beta (\beta) to scale and shift the normalized value. (A code sketch follows this list.)

    y_i=\gamma\hat{x_i}+\beta
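
A corresponding NumPy sketch for layer normalization is shown below; the only substantive change from the batch-normalization sketch is the axis over which the statistics are computed (per sample, across its H features). Names and shapes are again illustrative.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        """Layer normalization for inputs x of shape (..., H): statistics per sample."""
        mu = x.mean(axis=-1, keepdims=True)     # per-sample mean over the H features
        var = x.var(axis=-1, keepdims=True)     # per-sample (biased) variance over the features
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each sample
        return gamma * x_hat + beta             # scale and shift with learnable parameters

    # Example usage: each of the 32 rows is normalized independently of the others.
    y = layer_norm(np.random.randn(32, 4), gamma=np.ones(4), beta=np.zeros(4))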

Properties and Advantages

  • Independent of batch size: Useful in recurrent neural networks
  • Stabilizes training: Similar to batch normalization

Uses

  • Recurrent neural networks: Generally preferred over batch normalization for RNNs and other sequence models
  • Small batch sizes: Remains effective where batch statistics are too noisy for batch normalization

Comparison of Batch Normalization and Layer Normalization

Normalization Dimension
  • Batch Normalization: Normalizes across the batch for each feature
  • Layer Normalization: Normalizes across the features for each sample

Formula
  • Batch Normalization: \hat{x_i}=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}
  • Layer Normalization: \hat{x_i}=\frac{x_i-\mu_L}{\sqrt{\sigma_L^2+\epsilon}}

Dependence on Batch Size
  • Batch Normalization: Requires larger batch sizes for accurate statistics
  • Layer Normalization: Independent of batch size; suitable for small batches or single samples

Use Cases
  • Batch Normalization: Convolutional neural networks, large mini-batches
  • Layer Normalization: Recurrent neural networks, small mini-batches, or single data points

Advantages
  • Batch Normalization: Stabilizes training, accelerates convergence, acts as a regularizer
  • Layer Normalization: Stabilizes training, effective in RNNs, no dependence on batch size

Disadvantages
  • Batch Normalization: Performance may degrade with very small batch sizes; requires computation of batch statistics
  • Layer Normalization: May be less effective in CNNs than batch normalization; introduces additional computation
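
The difference in normalization dimension is easy to verify in a framework. The short PyTorch sketch below applies the standard torch.nn.BatchNorm1d and torch.nn.LayerNorm modules to the same mini-batch; the tensor sizes are arbitrary choices for illustration.

    import torch
    import torch.nn as nn

    x = torch.randn(8, 16)       # mini-batch of 8 samples, 16 features each

    bn = nn.BatchNorm1d(16)      # normalizes each feature across the batch dimension
    ln = nn.LayerNorm(16)        # normalizes each sample across its 16 features

    y_bn = bn(x)                 # per-feature (column-wise) mean is ~0 in training mode
    y_ln = ln(x)                 # per-sample (row-wise) mean is ~0

    print(y_bn.mean(dim=0))      # close to zero for every feature
    print(y_ln.mean(dim=1))      # close to zero for every sample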