Batch Normalization in Deep Learning
Detailed Explanation of the Batch Normalization Layer
Batch normalization is a technique used to improve the training of deep neural networks. It normalizes the output of a preceding activation layer by subtracting the batch mean and dividing by the batch standard deviation. This stabilizes learning and can significantly reduce the number of epochs needed to train deep networks.
How Batch Normalization Works
- Compute the mean and variance: For a given mini-batch, compute the mean ($\mu_B$) and variance ($\sigma_B^2$) for each feature.
- Normalize: Normalize the batch using the computed mean and variance.
- Scale and shift: Apply learned parameters gamma ($\gamma$) and beta ($\beta$) to scale and shift the normalized value (see the sketch after this list).
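To make the three steps concrete, here is a minimal NumPy sketch of the forward pass. The function name, the (batch, features) input shape, and the epsilon value are illustrative choices, not part of any particular framework's API:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # scale and shift (learned parameters in practice)

# Example: 4 samples, 3 features
x = np.random.randn(4, 3)
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # approximately 0 for each feature
print(y.std(axis=0))   # approximately 1 for each feature
```

At inference time, frameworks typically replace the batch statistics with running averages accumulated during training; that bookkeeping is omitted here for brevity.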
Properties and Advantages
- Stabilizes training: Reduces internal covariate shift
- Speeds up training: Allows for higher learning rates
- Regularization effect: Can reduce the need for dropout
Uses
- Deep neural networks: Common in CNNs and other feed-forward architectures
- Improving convergence: Helps in faster convergence during training
Layer Normalization
Layer normalization is another normalization technique that normalizes the inputs across the features for each data sample, rather than across the batch.
How Layer Normalization Works
- Compute the mean and variance: For a given input sample, compute the mean ($\mu$) and variance ($\sigma^2$) across all features.
- Normalize: Normalize the features using the computed mean and variance.
- Scale and shift: Apply learned parameters gamma ($\gamma$) and beta ($\beta$) to scale and shift the normalized value (see the sketch after this list).
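The steps above can be sketched in a few lines of NumPy as well; the only difference from the batch-norm sketch is the axis over which the statistics are computed (again, names and shapes here are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the feature dimension for each sample in x (batch, features)."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean across features
    var = x.var(axis=-1, keepdims=True)    # per-sample variance across features
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each sample
    return gamma * x_hat + beta            # scale and shift (learned parameters in practice)

x = np.random.randn(4, 3)
y = layer_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=-1))  # approximately 0 for each sample
print(y.std(axis=-1))   # approximately 1 for each sample
```

Because the statistics are computed per sample, the result for a given input does not change whether it is processed alone or inside a larger batch.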
Properties and Advantages
- Independent of batch size: Useful in recurrent neural networks
- Stabilizes training: Similar to batch normalization
Uses
- Recurrent neural networks: Better suited than batch normalization for sequence models
- Small batch sizes: Works well where batch normalization's statistics become unreliable
Comparison of Batch Normalization and Layer Normalization
| Feature | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalization Dimension | Normalizes across the batch for each feature | Normalizes across features for each sample |
| Formula | $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $y = \gamma \hat{x} + \beta$, with $\mu_B$, $\sigma_B^2$ computed over the batch for each feature | $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $y = \gamma \hat{x} + \beta$, with $\mu$, $\sigma^2$ computed over the features of each sample |
| Dependence on Batch Size | Requires larger batch sizes for accurate statistics | Independent of batch size, suitable for small batches or single samples |
| Use Cases | Convolutional neural networks, large mini-batches | Recurrent neural networks, small mini-batches, or single data points |
| Advantages | Stabilizes training, accelerates convergence, acts as a regularizer | Stabilizes training, effective in RNNs, no dependence on batch size |
| Disadvantages | Performance may degrade with very small batch sizes, requires computation of batch statistics | May be less effective in CNNs compared to batch normalization, introduces additional computations |
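The batch-size rows of the table can be illustrated with a small, hypothetical example: with a batch of one sample, the batch-norm statistics collapse (the per-feature variance is zero), while the layer-norm statistics are still well defined:

```python
import numpy as np

x = np.random.randn(1, 3)  # a "batch" containing a single sample

# Batch norm takes statistics over the batch dimension: with one sample,
# the per-feature variance is 0 and the normalized output is all zeros.
mu_b, var_b = x.mean(axis=0), x.var(axis=0)
print(var_b)                               # [0. 0. 0.]
print((x - mu_b) / np.sqrt(var_b + 1e-5))  # [[0. 0. 0.]]

# Layer norm takes statistics over the feature dimension of each sample,
# so the output remains well defined for batch size 1.
mu_l = x.mean(axis=-1, keepdims=True)
var_l = x.var(axis=-1, keepdims=True)
print((x - mu_l) / np.sqrt(var_l + 1e-5))  # zero mean, unit variance within the sample
```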