pytorch-paligemma

A Multimodal (Vision) Language Model from scratch using only Python and PyTorch.

This repository codes the PaliGemma Vision Language Model from scratch and covers the following topics:

Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
Vision Transformer model
Contrastive learning (CLIP, SigLIP)
Numerical stability of the Softmax and the Cross Entropy Loss
Rotary Positional Embedding
Multi-Head Attention
Grouped Query Attention
Normalization layers (Batch, Layer and RMS)
KV-Cache (prefilling and token generation)
Attention masks (causal and non-causal)
Weight tying
Top-P Sampling and Temperature
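
As a rough illustration of two topics from the list above (the numerically stable Softmax and Top-P sampling with temperature), here is a minimal sketch. It is written in plain Python with only the standard library so that it runs without dependencies; it is not taken from this repository, and the function names `stable_softmax`, `top_p_filter`, and `sample_token` are hypothetical names chosen for this example.

```python
import math
import random


def stable_softmax(logits, temperature=1.0):
    """Softmax with temperature, computed after subtracting the max logit."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # The largest exponent is 0, so exp() cannot overflow.
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the temperature-scaled, Top-P filtered distribution."""
    probs = top_p_filter(stable_softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the Top-P cut; a tensor-based version would follow the same steps with a sort and a cumulative sum.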

Repository contents:

notes
.gitignore
README.md
inference.py
launch_inference.sh
modeling_gemma.py
modeling_siglip.py
processing_paligemma.py
requirements.txt
utils.py
