We break Andrej Karpathy's video tutorial on building a Generative Pre-trained Transformer (GPT) from scratch into 10 sections. The tutorial follows the ideas introduced in "Attention Is All You Need" (Vaswani et al., 2017).
Using PyTorch, we begin by coding a single-layer neural-network bigram model that takes a single token as input and generates the next token. We then progress in complexity all the way to a decoder with 6 Transformer layers. The final model takes a sequence as input and generates the next token.
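As a reference point, here is a minimal sketch of the kind of bigram model the tutorial starts from: each token looks up the logits for the next token directly from an embedding table. Exact variable names and training details may differ from the scripts.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Minimal bigram model: the next-token logits depend only on the current token."""

    def __init__(self, vocab_size):
        super().__init__()
        # Each row of the table holds the logits for the token that follows.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token and append it to the sequence.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # only the last time step matters
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```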
-
Self-Attention Mechanism:
-
The core innovation of the Transformer is the self-attention mechanism. It allows the model to weigh the importance of different parts of the input sequence when processing a particular token, replacing the need for fixed-length context windows or recurrence. Because we build a decoder, the self-attention mechanism only attends to the current token and the tokens before it; future positions are masked out, as sketched below.
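The causal masking can be illustrated with a lower-triangular matrix: scores for future positions are set to minus infinity before the softmax, so each row only distributes weight over the current and earlier tokens. The sizes here are illustrative, not taken from the scripts.

```python
import torch

T = 8                                    # illustrative context length
scores = torch.randn(T, T)               # raw attention scores for one head
tril = torch.tril(torch.ones(T, T))      # lower-triangular mask: 1 = allowed, 0 = future
scores = scores.masked_fill(tril == 0, float('-inf'))  # block attention to future tokens
weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over past + current positions
```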
-
Scaled Dot-Product Attention: Self-attention computes attention scores by taking the dot product of a query vector with the key vectors and scaling the result by the square root of the key dimension. A softmax then turns the scores into attention weights, which are used to compute a weighted sum of the value vectors. A single attention head along these lines is sketched below.
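The following sketch shows one head of causal scaled dot-product attention. The hyperparameter names (n_embd, head_size, block_size, dropout) are illustrative and may not match the scripts exactly.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask, stored as a buffer (not a trainable parameter).
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scaled dot-product attention scores: softmax(Q K^T / sqrt(head_size)).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # weighted sum of values, (B, T, head_size)
```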
-
Multi-Head Attention:
- Multi-head attention captures different types of relationships in the input data. Instead of using a single attention mechanism, multiple mechanisms (heads) are used in parallel.
- Each head learns a different representation of the input sequence; the head outputs are concatenated and projected to form the final output (see the sketch below).
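A multi-head layer can be sketched by running several of the Head modules from the previous snippet in parallel, concatenating their outputs, and projecting back to the embedding size. As before, the parameter names are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel; outputs are concatenated and projected."""

    def __init__(self, n_embd, num_heads, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // num_heads  # assumes n_embd is divisible by num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(n_embd, n_embd)  # mix information across heads
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.dropout(self.proj(out))
```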
-
Positional Encoding:
- Since the Transformer model has no built-in notion of token order (unlike sequential models such as RNNs), positional encodings are added to the input embeddings to give the model information about the position of each token in the sequence. The original paper uses fixed sinusoidal encodings, while the tutorial uses a learned position-embedding table, as sketched below.
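A minimal sketch of the learned-position-embedding approach: token embeddings and position embeddings are simply summed before entering the first Transformer block. The sizes below are illustrative, not necessarily those used in the scripts.

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 384, 256   # illustrative sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (4, block_size))           # (B, T) batch of token ids
tok_emb = token_embedding_table(idx)                           # (B, T, n_embd) token identity
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, n_embd) token position
x = tok_emb + pos_emb   # position information is added, broadcast over the batch
```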
-
Feed-Forward Neural Networks:
- After the self-attention mechanism, each decoder block contains a position-wise feed-forward network, applied to each token independently (see the sketch below).
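A sketch of the feed-forward sub-layer: a two-layer MLP that expands the embedding dimension and projects it back. The 4x expansion factor follows the original paper; the dropout value is illustrative.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network, applied to each token independently."""

    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand (factor of 4, as in the paper)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # project back to the embedding size
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```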
-
Residual Connections and Layer Normalization:
- To facilitate training deep networks, residual connections are used around each sub-layer, together with layer normalization. The tutorial follows the pre-norm formulation (normalization applied before each sub-layer), whereas the original paper applies it after; a block combining these pieces is sketched below.
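A sketch of one decoder block, reusing the MultiHeadAttention and FeedForward sketches above: attention then feed-forward, each wrapped in a residual connection with pre-layer-norm. Parameter names are again illustrative.

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer decoder block: communication (attention) then computation (FFN)."""

    def __init__(self, n_embd, num_heads, block_size, dropout=0.1):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, num_heads, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual connection around self-attention
        x = x + self.ffwd(self.ln2(x))  # residual connection around feed-forward
        return x
```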
The scripts were run on a MacBook Pro M2 Max, so the device is set to "mps" to take advantage of the Apple GPU. The scripts also include a code snippet to accommodate "cuda" or "cpu".
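The device selection is along these lines (the exact snippet in the scripts may differ):

```python
import torch

# Pick the best available backend: Apple GPU (mps), NVIDIA GPU (cuda), or CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
```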