This project is a structured replication of the Vision Transformer (ViT) architecture introduced by Google Research in the paper:
"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
Dosovitskiy et al., 2020 (arXiv:2010.11929)
The goal is to:
- Reproduce the model architecture and results on CIFAR-10 and Tiny ImageNet
- Validate core claims (e.g., that transformers can match or exceed CNN performance given sufficient pre-training data)
- Extend the study with visualization and interpretability techniques
This is being developed as part of a personal research initiative to demonstrate competency in modern AI architectures.
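The paper's title refers to treating an image as a sequence of 16x16 patch "words": each image is split into non-overlapping patches, flattened, and linearly projected into token embeddings before entering the transformer. As a minimal sketch of that patch-extraction step (in NumPy rather than the full training stack; the function name `patchify` and the random test image are illustrative, not from the paper's codebase):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    returning an array of shape (num_patches, patch_size * patch_size * C)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)        # group the two patch-grid axes together
             .reshape(-1, patch_size * patch_size * c)
    )

# A CIFAR-10-sized RGB image (32x32x3) yields 4 patches of dimension 16*16*3 = 768
image = np.random.rand(32, 32, 3)
tokens = patchify(image)
print(tokens.shape)  # (4, 768)
```

In the full model these flattened patches are multiplied by a learned projection matrix (or equivalently produced by a strided convolution), prepended with a class token, and summed with position embeddings.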