Deep Encoder–Decoder vs Transformer for Semantic Segmentation
Team project comparing convolutional encoder–decoder networks (SegNet) and transformer-based architectures (SegFormer) for semantic segmentation on the CamVid dataset. Analyzed architectural trade-offs, class imbalance effects, and recall–IoU performance across model sizes and training regimes.
In this team project, we quantitatively compared convolutional and transformer-based segmentation architectures under data scarcity and class imbalance. We implemented SegNet and trained it from scratch, fine-tuned pretrained SegFormer models (small and large variants), and evaluated all of them on CamVid. We also analyzed a relabelled transfer-learning setup in which a Cityscapes-pretrained SegFormer is applied to CamVid without further training, to assess how well it adapts across label spaces. Performance was measured with class-wise Intersection over Union (IoU), with particular attention to sparse classes that occupy few pixels. We ran a systematic hyperparameter analysis covering learning rate, data augmentation strategy, and model size, and compared convergence dynamics for rare versus dominant classes. Our results show that transformer-based models generalize more effectively under limited data, and that larger SegFormer variants significantly improve detection of rare classes. We further show that class imbalance materially influences optimization dynamics and IoU convergence, while data augmentation improves structural coherence at the potential cost of fine-grained detail. Overall, the project highlights architectural trade-offs, per-class statistical evaluation, and the sensitivity of segmentation metrics in safety-critical perception settings.
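As an illustration of the class-wise evaluation described above, the following is a minimal sketch of how per-class IoU and recall can be derived from a pixel-level confusion matrix. It is not the project's actual evaluation code: the function names, the `ignore_index` parameter (for example, for a void label), and the toy label maps are assumptions made for this example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes, ignore_index=None):
    """Pixel-level confusion matrix; rows are true classes, columns are predictions.

    `ignore_index` is a hypothetical parameter for excluding unlabeled pixels.
    """
    # Cast to int64 so that y_true * num_classes cannot overflow small dtypes.
    y_true = np.asarray(y_true, dtype=np.int64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.int64).ravel()
    if ignore_index is not None:
        keep = y_true != ignore_index
        y_true, y_pred = y_true[keep], y_pred[keep]
    # Encode each (true, pred) pair as a single index and count occurrences.
    idx = y_true * num_classes + y_pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_iou_and_recall(cm):
    """Return (iou, recall) arrays; classes with no pixels get NaN."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp   # true pixels of the class that were missed
    fp = cm.sum(axis=0) - tp   # pixels wrongly assigned to the class
    union = tp + fp + fn
    support = tp + fn
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    recall = np.where(support > 0, tp / np.maximum(support, 1), np.nan)
    return iou, recall

# Toy 2x3 label maps with three classes (made-up values, not CamVid data).
y_true = np.array([[0, 0, 1], [1, 2, 2]])
y_pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
iou, recall = per_class_iou_and_recall(cm)
mean_iou = np.nanmean(iou)  # classes absent from both maps are skipped
```

Reporting the per-class arrays alongside the mean matters under class imbalance: a mean IoU averages over classes, but overall pixel accuracy is dominated by frequent classes, so a rare class can score poorly on IoU and recall without visibly affecting aggregate accuracy.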
Academic Context