Aryan8912/pytorch-paligemma-main

pytorch-paligemma

A multimodal (vision) language model implemented from scratch using only Python and PyTorch.

Topics covered in building the PaliGemma Vision Language Model from scratch:

  • Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
  • Vision Transformer model
  • Contrastive learning (CLIP, SigLIP)
  • Numerical stability of the Softmax and the Cross Entropy Loss
  • Rotary Positional Embedding
  • Multi-Head Attention
  • Grouped Query Attention
  • Normalization layers (Batch, Layer and RMS)
  • KV-Cache (prefilling and token generation)
  • Attention masks (causal and non-causal)
  • Weight tying
  • Top-P Sampling and Temperature
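
As an illustration of the "numerical stability of the Softmax and the Cross Entropy Loss" item above, here is a minimal sketch in plain Python. It is not taken from this repository, which works on PyTorch tensors; the function names `stable_softmax` and `stable_cross_entropy` are hypothetical. The idea shown is the standard one: subtract the largest logit before exponentiating so `exp` cannot overflow, and compute the loss through log-sum-exp rather than taking the log of a softmax output.

```python
import math


def stable_softmax(logits):
    """Softmax that subtracts the max logit before exponentiating.

    Shifting every logit by the same constant leaves the result unchanged,
    but keeps every exponent <= 0, so math.exp cannot overflow.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stable_cross_entropy(logits, target):
    """Cross entropy for one example, computed via log-sum-exp.

    loss = log(sum_j exp(logit_j)) - logit_target
    This avoids log(softmax(...)), which can hit log(0) for tiny probabilities.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Logits this large would overflow a naive exp(x) / sum(exp(x)).
probs = stable_softmax([1000.0, 1001.0, 1002.0])
loss = stable_cross_entropy([1000.0, 1001.0, 1002.0], target=2)
```

With tensors, the same shift is usually applied along the vocabulary dimension; the list-based version here is only meant to make the arithmetic visible.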

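The "Top-P Sampling and Temperature" item in the list above can be sketched in the same spirit. This is an assumption-laden illustration in plain Python, not the repository's PyTorch code; `sample_top_p` and its default values are hypothetical. Logits are divided by the temperature, turned into probabilities, sorted in descending order, and only the smallest set of tokens whose cumulative probability reaches `top_p` is kept for sampling.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely, keeping the smallest prefix
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# A seeded generator makes the draw reproducible.
token = sample_top_p([2.0, 1.0, -5.0], rng=random.Random(0))
```

In this example the third token has negligible probability, so it falls outside the nucleus and is never sampled.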