Self-implemented Vision Transformer

This repository contains a self-implemented Vision Transformer (ViT) for the course Visual Media. The ViT paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", Dosovitskiy, A., et al. (ICLR 2021).

Summary of the Paper

Methodology

Inspired by the Transformer's scaling successes in NLP, the authors apply a standard Transformer directly to images with the fewest possible modifications. To do so, they split an image into patches and feed the sequence of linear embeddings of these patches to a Transformer; image patches are treated the same way as tokens (words) in an NLP application. The model is trained on image classification in a supervised fashion.
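The patch-to-token step can be sketched as follows, assuming PyTorch (the class name `PatchEmbed` and the CIFAR-10-sized defaults are illustrative, not the repository's actual code). A strided convolution whose kernel equals the patch size is equivalent to cutting the image into non-overlapping patches and applying one shared linear projection to each flattened patch.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, embed them, and prepend a [CLS] token."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Strided conv == per-patch linear projection of the flattened pixels.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend classification token
        return x + self.pos_embed            # add learned position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 32, 32))  # -> shape (2, 65, 384)
```

The resulting token sequence is then fed to a standard Transformer encoder, exactly as a sentence of word embeddings would be in NLP.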

Highlights

Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, which leads to relatively poor performance when they are trained on insufficient amounts of data. After large-scale training, however, the models perform surprisingly well: the resulting Vision Transformer matches or exceeds the state of the art on many image classification datasets, while being relatively cheap to pre-train.

Summary of the Self-implemented Code

Experiments

Hyperparameters

| Param | Value |
|---|---|
| n_epochs | 200 |
| batch_size | 128 |
| Optimizer | Adam |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.999 |
| Weight Decay | 5e-5 |
| LR Scheduler | Cosine |
| (Init LR, Last LR) | (1e-3, 1e-5) |
| Warmup | 5 epochs |
| Dropout | 0.0 |
| AutoAugment | True |
| Label Smoothing | 0.1 |
| Heads | 12 |
| Transformer Layers | 7 |
| ViT Hidden Dim | 384 |
| MLP Hidden Dim | 384 |
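To make the optimizer, learning-rate, and model-size rows concrete, here is a minimal sketch assuming PyTorch; the `nn.TransformerEncoder` stand-in, the `build_optimizer` helper, and the `LinearLR`-based warmup are illustrative assumptions, not necessarily how `main.py` realizes these settings.

```python
import torch
import torch.nn as nn

# Encoder matching the table: 7 layers, 12 heads, 384-dim hidden and MLP, no dropout.
# (The repository defines its own modules; this built-in encoder is just a stand-in.)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=384, nhead=12, dim_feedforward=384,
    dropout=0.0, batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=7)

def build_optimizer(model, n_epochs=200, warmup_epochs=5,
                    init_lr=1e-3, last_lr=1e-5, weight_decay=5e-5):
    """Adam with a 5-epoch warmup, then cosine decay from 1e-3 down to 1e-5."""
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr,
                                 betas=(0.9, 0.999), weight_decay=weight_decay)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=n_epochs - warmup_epochs, eta_min=last_lr)
    return optimizer, torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

optimizer, scheduler = build_optimizer(encoder)
# Step the scheduler once per epoch; the LR then follows warmup + cosine decay.
```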

Device

  • One NVIDIA RTX A6000 (48 GB), for around 2 hours of training.

Results

  • #Params: 6,268,810
  • Best epoch: 200, with a development-set accuracy (dev_eval acc) of 89.87%
  • [Plot: loss curve on the training set and accuracy curve on the development set]

Analysis

  • The original ViT paper pretrained the model on a very large dataset and then fine-tuned it on CIFAR-10, reaching a very high accuracy (99.5%). Due to limited resources, I trained and tested the model on CIFAR-10 only.
  • Due to limited time, I trained for 200 epochs; the best result came at the 200th epoch, so performance may still improve slightly with further training.
  • Dropout did not help, so I set it to 0. The regularization tricks (label smoothing, AutoAugment, weight decay, and warmup) are very important for this small-scale training; see the sketch after this list. Tuning the hyperparameters may further improve performance.
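As an illustration of two of these regularizers, below is a minimal sketch assuming torchvision's `AutoAugment` and PyTorch's built-in label smoothing; the normalization statistics and dataset path are illustrative and may differ from the repository's pipeline.

```python
import torch.nn as nn
from torchvision import datasets, transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# CIFAR-10 training transform with the CIFAR-10 AutoAugment policy applied
# before tensor conversion; mean/std are the commonly used CIFAR-10 statistics.
train_tf = transforms.Compose([
    AutoAugment(AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10(root="./data", train=True,
                             download=True, transform=train_tf)

# Label smoothing of 0.1, matching the hyperparameter table.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```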

Usage

```bash
cd KunViT
pip install -r requirements.txt
nohup python -u main.py > run.log 2>&1 &
```
