The Vision Transformer (ViT) Paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", Dosovitskiy et al. (ICLR'21)
Inspired by the success of Transformer scaling in NLP, the authors apply a standard Transformer directly to images with the fewest possible modifications. To do so, they split an image into patches and feed the sequence of linear embeddings of these patches to a Transformer; image patches are treated the same way as tokens (words) in an NLP application. The model is trained on image classification in a supervised fashion.
Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, so they perform relatively poorly when trained on insufficient amounts of data; after large-scale pre-training, however, they perform surprisingly well. The resulting Vision Transformer matches or exceeds the state of the art on many image classification benchmarks while being relatively cheap to pre-train.
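As a rough illustration of the patch-to-token step described above, here is a minimal PyTorch sketch of patch embedding with a class token and learned position embeddings (module and parameter names are illustrative, not taken from the paper's or this repo's code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to a token embedding."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the classification token
        return x + self.pos_embed              # add learned position embeddings
```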
- ViT Model Implementation: I implemented the Vision Transformer architecture from scratch (self-attention, multi-head self-attention, the Transformer layer, and the Vision Transformer block); a condensed sketch follows this list.
- Trainer Implementation: I implemented the training and testing code, including the plotter.
- During the implementation, I was inspired by https://github.com/omihub777/ViT-CIFAR and followed their code to preprocess the CIFAR-10 dataset.
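For reference, here is a condensed sketch of the attention and Transformer-layer components listed above, assuming a pre-norm design; the actual modules in this repo may differ in details:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention (sketch)."""
    def __init__(self, dim=384, heads=12, dropout=0.0):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                   # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = self.drop(attn.softmax(dim=-1))
        x = (attn @ v).transpose(1, 2).reshape(B, N, D)    # merge heads back into one dim
        return self.out(x)

class TransformerLayer(nn.Module):
    """Pre-norm Transformer encoder layer: attention + MLP, each with a residual connection."""
    def __init__(self, dim=384, heads=12, mlp_dim=384, dropout=0.0):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MultiHeadSelfAttention(dim, heads, dropout)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```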
Param | Value |
---|---|
n_epochs | 200 |
batch_size | 128 |
Optimizer | Adam |
Adam β1 | 0.9 |
Adam β2 | 0.999 |
Weight Decay | 5e-5 |
LR Scheduler | Cosine |
(Init LR, Last LR) | (1e-3, 1e-5) |
Warmup | 5 epochs |
Dropout | 0.0 |
AutoAugment | True |
Label Smoothing | 0.1 |
Heads | 12 |
Transformer Layers | 7 |
ViT Hidden Dim | 384 |
MLP Hidden Dim | 384 |
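For reference, the loss, optimizer, and schedule implied by the table above could be set up roughly as follows; this is a sketch, not the repo's exact code, `model` is a placeholder, and the 5-epoch warmup wrapper is omitted:

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 10)  # placeholder; the actual ViT model goes here

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,   # init LR 1e-3
                             betas=(0.9, 0.999), weight_decay=5e-5)
# Cosine decay from the initial LR (1e-3) to the final LR (1e-5) over 200 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)
# The 5-epoch warmup would be handled separately, e.g. by linearly scaling the LR
# at the start of training.
```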
- Trained on one NVIDIA RTX A6000 (48 GB) for around 2 hours.
- #Params: 6,268,810
- Best epoch = 200 with dev_eval acc = 89.87%
- Here is the plot showing the loss curve on the training dataset and the accuracy curve on the development dataset.
- The original ViT paper pretrained the model on a very large dataset and then fine-tuned it on CIFAR-10, reaching a very high accuracy (99.5%). Due to limited resources, I trained and tested the model only on CIFAR-10.
- Due to limited time, I trained for 200 epochs and obtained the best result at the 200th epoch; the model's performance might still improve slightly with further training.
- Dropout didn't help, so I set it to 0. The regularization tricks (label smoothing, AutoAugment, weight decay, and warmup) are very important for this small-scale training; see the data-augmentation sketch below. Tuning the hyperparameters may further improve the performance.
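As a sketch of how AutoAugment is typically wired into a CIFAR-10 training pipeline with torchvision; the normalization statistics below are common CIFAR-10 values and the exact preprocessing in this repo (which follows the ViT-CIFAR reference above) may differ:

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# AutoAugment with the CIFAR-10 policy, plus standard crop/flip augmentation.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.CIFAR10),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
```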
cd KunViT
pip install -r requirements.txt
nohup python -u main.py > run.log 2>&1 &