Tokenizer & Model Training
November 17, 2024
Dataset Samples Description:
The scraped data curated by all the team members was used to train the tokenizer; this merged corpus is called all_final_dataset. Random samples from this dataset (drawn with a fixed seed) were used to create subsets of fixed sizes. Additionally, a dataset containing only news articles was compiled, called all_news_scrapped, along with standalone datasets such as Sangrah and Nepali_Deva, which were substantial in size on their own.
- Dataset Sample 1: all_final_dataset (200 MB)
- Dataset Sample 2: all_final_dataset (350 MB)
- Dataset Sample 3: all_final_dataset (500 MB)
- Dataset Sample 4: all_final_dataset (750 MB)
- Dataset Sample 5: all_final_dataset (1 GB)
- Dataset Sample 6: all_news_scrapped (350 MB)
- Dataset Sample 7: Sangrah (750 MB)
- Dataset Sample 8: Nepali_Deva (1.2 GB)
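For reproducibility, the fixed-size samples can be drawn with a seeded shuffle of the merged corpus. The sketch below is illustrative only: the file names, the blank-line document delimiter, the seed, and the 200 MB target are assumptions, not the exact script used.

```python
import random

# Illustrative sketch: draw a reproducible, fixed-size (in bytes) sample of
# documents from the merged corpus. File names, the blank-line delimiter, and
# the seed are assumptions for demonstration.
random.seed(42)                              # fixed seed -> reproducible sample
TARGET_BYTES = 200 * 1024 * 1024             # e.g. the 200 MB sample

with open("all_final_dataset.txt", encoding="utf-8") as f:
    documents = f.read().split("\n\n")       # assumes blank-line-separated documents

random.shuffle(documents)

sampled, size = [], 0
for doc in documents:
    sampled.append(doc)
    size += len(doc.encode("utf-8"))
    if size >= TARGET_BYTES:
        break

with open("all_final_dataset_200MB.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(sampled))
```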
Tokenizer Details:
Tokenizer | Algorithm | Vocab Size | Dataset Information |
---|---|---|---|
Tokenizer 1 | SentencePiece | 49,152 | all_final_dataset (200 MB) |
Tokenizer 2 | SentencePiece | 49,152 | all_final_dataset (350 MB) |
Tokenizer 3 | SentencePiece | 49,152 | all_final_dataset (500 MB) |
Tokenizer 4 | SentencePiece | 49,152 | all_final_dataset (750 MB) |
Tokenizer 5 | SentencePiece | 49,152 | all_final_dataset (1 GB) |
Tokenizer 6 | SentencePiece | 49,152 | all_news_scrapped (350 MB) |
Tokenizer 7 | SentencePiece | 50,000 | Sangrah (750 MB) |
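Each tokenizer was trained with SentencePiece on the corresponding raw-text sample. A minimal training sketch is shown below; the model type (unigram vs. BPE) and character coverage are assumptions, since the report only fixes the algorithm family and the vocabulary size.

```python
import sentencepiece as spm

# Minimal sketch of training Tokenizer 1 from the table above.
# model_type and character_coverage are assumed values.
spm.SentencePieceTrainer.train(
    input="all_final_dataset_200MB.txt",   # raw-text file for this tokenizer
    model_prefix="tokenizer_1",            # writes tokenizer_1.model / tokenizer_1.vocab
    vocab_size=49152,
    model_type="unigram",                  # assumed; "bpe" is the other common choice
    character_coverage=0.9995,             # assumed; keeps rare Devanagari characters
)

sp = spm.SentencePieceProcessor(model_file="tokenizer_1.model")
print(sp.encode("नेपाली भाषा", out_type=str))  # inspect the resulting subwords
```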
Methodology:
Fertility scores were calculated for the trained tokenizers on both their respective training datasets and a held-out dataset (all_news_scrapped).
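Fertility is taken here as the average number of subword tokens produced per whitespace-separated word; this assumed definition is consistent with the reported values of roughly 1.25 to 1.30. A sketch of the computation:

```python
import sentencepiece as spm

def fertility(model_file: str, corpus_file: str) -> float:
    """Average number of subword tokens per whitespace-separated word
    (assumed definition of the fertility score)."""
    sp = spm.SentencePieceProcessor(model_file=model_file)
    n_tokens = n_words = 0
    with open(corpus_file, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if not words:
                continue
            n_words += len(words)
            n_tokens += len(sp.encode(line))   # token ids for the whole line
    return n_tokens / n_words

# e.g. evaluate Tokenizer 1 on the held-out news corpus (file name assumed)
print(fertility("tokenizer_1.model", "all_news_scrapped.txt"))
```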
Results when tested on the same dataset it was trained on:
Tokenizer | Fertility Score | Dataset Info (Train) | Dataset Info (Test) |
---|---|---|---|
Tokenizer 1 | 1.2911565101 | all_final_dataset (200 MB) | all_final_dataset (200 MB) |
Tokenizer 2 | 1.2952214366 | all_final_dataset (350 MB) | all_final_dataset (350 MB) |
Tokenizer 3 | 1.2952510265 | all_final_dataset (500 MB) | all_final_dataset (500 MB) |
Tokenizer 4 | 1.2964268691 | all_final_dataset (750 MB) | all_final_dataset (750 MB) |
Tokenizer 5 | 1.2960094995 | all_final_dataset (1 GB) | all_final_dataset (1 GB) |
Tokenizer 7 | 1.2469413133 | Sangrah (200 MB) | Sangrah (200 MB) |
Results when tested on a fixed held-out dataset (all_news_scrapped):
Tokenizer | Fertility Score | Dataset Info (Train) | Dataset Info (Test) |
---|---|---|---|
Tokenizer 1 | 1.2529094399 | all_final_dataset (200 MB) | all_news_scrapped (350 MB) |
Tokenizer 2 | 1.2515642463 | all_final_dataset (350 MB) | all_news_scrapped (350 MB) |
Tokenizer 3 | 1.2509406335 | all_final_dataset (500 MB) | all_news_scrapped (350 MB) |
Tokenizer 4 | 1.2504782866 | all_final_dataset (750 MB) | all_news_scrapped (350 MB) |
Tokenizer 5 | 1.2515644638 | all_final_dataset (1 GB) | all_news_scrapped (350 MB) |
Model Architecture Chosen:
The architecture chosen is LlamaForCausalLM, a transformer architecture optimized for causal language modeling. It employs multi-head self-attention with rotary position embeddings, RMS normalization, and an efficient gated feed-forward network.
Adjustments Made:
To reduce the total number of parameters to below 100M, the following modifications were made:
- Reduced the hidden_size to 512 (compared to higher defaults in larger models).
- Set the number of attention heads to 8, with a head dimension of 64.
- Kept 16 hidden layers (num_hidden_layers).
- Set intermediate_size to 1024, balancing the computational load in the feed-forward layers.
- Used a vocabulary size of 50,000, optimized for the dataset size.
- Excluded additional biases in the feed-forward layers (mlp_bias: false) to reduce parameters.
Total Parameters:
The final model has approximately 86M parameters, which is within the 100M constraint.
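For reference, the configuration above maps directly onto Hugging Face's LlamaConfig. The sketch below is an approximation: values not stated in the report (context length, embedding tying, etc.) are assumptions, so the printed parameter count may not match the reported ~86M exactly.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=50_000,
    hidden_size=512,                 # 8 heads x head dimension 64
    num_attention_heads=8,
    num_hidden_layers=16,
    intermediate_size=1024,
    mlp_bias=False,                  # no biases in the feed-forward layers
    max_position_embeddings=512,     # assumed context length
)

model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")
```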
Best Tokenizer Chosen: Tokenizer 7 (trained on Sangrah (750 MB)) was chosen for model training.
Reason for Choice:
It had a fertility score of 1.2469413133, the lowest among all the tokenizers trained on the different datasets, so it was chosen as the best tokenizer for model training.
Training Details:
- Number of Epochs: 10
- Batch Size: 8
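A sketch of the training setup with these values, assuming a recent transformers Trainer API and pre-tokenized datasets of fixed-length blocks whose labels are copies of the input ids; the 500-step evaluation cadence matches the table below, while the remaining arguments are assumptions.

```python
from transformers import Trainer, TrainingArguments

# `model` comes from the configuration sketch above; train_dataset / eval_dataset
# are assumed to be pre-tokenized fixed-length blocks with labels == input_ids.
args = TrainingArguments(
    output_dir="nepali-llm",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    eval_strategy="steps",
    eval_steps=500,                  # matches the 500-step cadence of the table
    logging_steps=500,
    save_steps=500,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```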
Perplexity Results During Training:
The following table shows the training loss, validation loss, and perplexity at various steps during model training. Every 500 steps corresponds to 0.1 epoch.
Steps | Training Loss | Validation Loss | Perplexity |
---|---|---|---|
500 | 7.918729 | 7.703134 | 2102.312241 |
1000 | 7.529400 | 7.505957 | 1818.844430 |
1500 | 7.207300 | 7.229811 | 1437.476760 |
2000 | 6.884200 | 6.893766 | 986.108044 |
2500 | 6.688900 | 6.741419 | 838.426617 |
3000 | 6.493600 | 6.508142 | 670.579000 |
3500 | 6.371900 | 6.391015 | 591.100484 |
4000 | 6.250200 | 6.223870 | 504.652362 |
4500 | 6.111200 | 6.113407 | 451.556748 |
5000 | 5.971700 | 6.002983 | 404.633863 |
5500 | 5.877500 | 5.910859 | 366.748795 |
6000 | 5.783300 | 5.816735 | 335.873564 |
6500 | 5.687900 | 5.723983 | 310.632056 |
7000 | 5.630000 | 5.672997 | 290.905210 |
7500 | 5.577500 | 5.599084 | 274.989125 |
8000 | 5.524400 | 5.547181 | 256.513335 |
8500 | 5.469400 | 5.487976 | 241.756643 |
9000 | 5.444300 | 5.444356 | 231.448171 |
9500 | 5.385400 | 5.398842 | 222.632577 |
10000 | 5.327800 | 5.347515 | 210.085508 |
10500 | 5.276100 | 5.306705 | 202.228561 |
11000 | 5.227900 | 5.265094 | 193.464456 |
11500 | 5.157300 | 5.214261 | 188.233344 |
12000 | 5.184900 | 5.194471 | 180.272732 |
12500 | 5.113100 | 5.126363 | 173.604526 |
13000 | 5.041700 | 5.132655 | 169.466479 |
13500 | 5.028200 | 5.092561 | 163.415618 |
14000 | 5.021400 | 5.080809 | 160.904193 |
14500 | 5.015000 | 5.068379 | 157.665767 |
15000 | 5.016100 | 5.031531 | 153.167311 |
15500 | 4.993000 | 5.009063 | 149.152026 |
16000 | 4.924900 | 4.986255 | 146.387131 |
16500 | 4.890000 | 4.967847 | 142.557062 |
17000 | 4.855600 | 4.947319 | 140.796985 |
17500 | 4.850000 | 4.915397 | 137.633544 |
18000 | 4.834600 | 4.908449 | 135.429160 |
18500 | 4.832000 | 4.885120 | 133.049802 |
19000 | 4.840000 | 4.873579 | 130.788174 |
19500 | 4.774500 | 4.853282 | 128.125492 |
20000 | 4.688000 | 4.849658 | 127.696651 |
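Perplexity is the exponential of the mean cross-entropy (validation) loss; for example, the final row of the table can be checked directly:

```python
import math

val_loss = 4.849658           # validation loss at step 20,000
print(math.exp(val_loss))     # ≈ 127.70, matching the reported 127.696651
```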
The prompts and generated texts are in Devanagari and are documented in the Google Doc linked here: Link to Google Doc
The best-trained model was chosen for the final submission. Below are the details of the other models that were trained:
- Nepali LLM Version 1:
  - Trained on: all_final_dataset (500 MB)
  - Model Architecture: LlamaForCausalLM
  - Parameters: 86.000128 million
  - Tokenizer Used: Tokenizer 4 (trained on all_final_dataset (750 MB))
  - Perplexity: 4000
  - Epochs: 10
  - Training Split: 80:20
- Nepali LLM Version 2:
  - Trained on: all_final_dataset (200 MB)
  - Model Architecture: LlamaForCausalLM
  - Parameters: 86.000128 million
  - Tokenizer Used: Tokenizer 4 (trained on all_final_dataset (750 MB))
  - Perplexity: 171
  - Epochs: 10
  - Training Split: 90:10
Team Contributions:
- Aditya Mehta (Roll Number: 22110017)
  - Tokenized the all_news_scrapped dataset and identified duplicates with other datasets.
  - Worked on the tokenization of the Nepali_Deva text corpus.
  - Prepared the entire report.
- Daksh Jain (Roll Number: 22110066)
  - Tokenized a Nepali Wikipedia-scraped dataset and evaluated fertility scores to determine the best tokenization strategy.
  - Tested scalability by tokenizing datasets of various sizes.
  - Ran prompt tests on the final model to validate its performance.
- Hrriday Ruparel (Roll Number: 22110099)
  - Wrote scripts for dataset merging, tokenization, and model training.
  - Trained tokenizers on the scraped dataset.
  - Trained an 80M-parameter model on the scraped dataset.
- Kishan Ved (Roll Number: 22110122)
  - Trained five different tokenizers and identified the one with the best fertility score.
  - Trained an 86M-parameter model on the scraped dataset.
  - Trained a tokenizer on the Sangrah dataset and tokenized it.
  - Trained an 86M-parameter model on the Sangrah dataset.
- Summet Sawale (Roll Number: 22110234)
  - Assisted in writing code for tokenizing text into tensors of context length.
  - Tested models using prompts.
  - Documented testing results.
  - Downloaded the scraped data from the server.