unit testing for pretraining
Kye committed Jul 7, 2023
1 parent 4a320cc commit 287b49c
Showing 2 changed files with 91 additions and 0 deletions.
22 changes: 22 additions & 0 deletions testing/README.md
@@ -0,0 +1,22 @@
To ensure the performance and reliability of the language model, we need to track a variety of metrics. These include both quantitative measures, such as accuracy, perplexity, speed, and memory consumption, and qualitative measures such as the coherence, relevance, and versatility of the generated responses. Here is a list of potential metrics to consider:

**1. Accuracy Metrics:**
- **Perplexity**: A measure of how well the model predicts a sample; lower values are better. A minimal sketch of computing it from the cross-entropy loss follows this list.
- **BLEU (Bilingual Evaluation Understudy) Score**: Measures n-gram overlap between the predicted and reference outputs, rewarding matching word order. It is most useful in tasks like translation.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score**: Measures the quality of summaries by counting overlapping units such as n-grams, word sequences, and word pairs between the generated and reference text.
- **F1 Score**: Harmonic mean of precision (how many selected items are relevant) and recall (how many relevant items are selected).
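
As a minimal sketch, perplexity can be computed as the exponential of the mean token-level cross-entropy. The snippet below assumes a decoder-style model that returns logits of shape `(batch, seq_len, vocab)` for integer token inputs; the `model` and `tokens` names are illustrative and not part of this repository:

```python
import torch
import torch.nn.functional as F

def perplexity(model, tokens):
    """Perplexity = exp(mean cross-entropy of next-token prediction). Lower is better."""
    with torch.no_grad():
        logits = model(tokens)                                   # (batch, seq_len, vocab)
        shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
        shift_targets = tokens[:, 1:].reshape(-1)                # predict token t+1 from the prefix up to t
        loss = F.cross_entropy(shift_logits, shift_targets)
    return torch.exp(loss).item()
```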

**2. Speed and Resource Metrics:**
- **Latency**: The time it takes to generate a response after the input is given. A GPU measurement sketch follows this list.
- **Throughput**: The number of tasks the model can complete in a given time period.
- **Memory consumption**: The amount of RAM consumed during the prediction phase.
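
Below is a rough sketch of measuring single-request latency and peak GPU memory. It assumes a CUDA device and calls `torch.cuda.synchronize()` so the timer waits for asynchronous GPU kernels to finish; the `model` and `batch` names are placeholders:

```python
import time
import torch

def measure_latency_and_memory(model, batch):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(batch)
    torch.cuda.synchronize()                      # wait for GPU work before stopping the clock
    latency = time.perf_counter() - start         # seconds
    peak_mem = torch.cuda.max_memory_allocated()  # bytes
    return latency, peak_mem
```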

**3. Qualitative Metrics:**
- **Coherence**: Whether the output makes sense.
- **Relevance**: Whether the output is relevant to the input query.
- **Versatility**: Whether the model can handle a variety of input types and still produce coherent, relevant output.


This suite tests the model for speed (latency and throughput) and memory consumption. In addition, you should conduct manual tests to evaluate the model's output on a variety of inputs for coherence, relevance, and versatility.

Remember, there are no automated tests for accuracy metrics such as perplexity, BLEU, ROUGE, or F1; these are task-specific and need to be evaluated on a per-task basis.
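
As an example of such a per-task check, BLEU can be computed with NLTK once reference and candidate outputs are available for the task at hand. This is a sketch assuming the third-party `nltk` package is installed (it is not a dependency of this repository), and the example strings are placeholders:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the model generates a coherent answer".split()
candidate = "the model produces a coherent answer".split()

# sentence_bleu takes a list of reference token lists and one candidate token list.
score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,  # smoothing avoids zero scores on short texts
)
print(f"BLEU: {score:.3f}")
```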
69 changes: 69 additions & 0 deletions testing/main.py
@@ -0,0 +1,69 @@
import unittest
import torch
import time
from Andromeda.model import Andromeda
from Andromeda.utils.stable_adamw import StableAdamWUnfused

class AndromedaTest(unittest.TestCase):

    def setUp(self):
        # Assumes Andromeda is a model class; if the repo exposes a pre-built instance, drop the parentheses.
        self.model = Andromeda().cuda()
        # The optimizer needs the model parameters in order to update them.
        self.optimizer = StableAdamWUnfused(self.model.parameters())
        self.loss_function = torch.nn.CrossEntropyLoss()
        self.test_input = torch.randint(0, 256, (1, 1024)).cuda()

    def test_forward_pass(self):
        # Test that the model's forward pass produces an output of the expected shape.
        output = self.model(self.test_input)
        self.assertEqual(output.shape, (1, 1024, 64007))  # (batch, sequence length, vocabulary size)

    def test_backward_pass(self):
        # Test that the backward pass produces finite gradients for every parameter.
        self.optimizer.zero_grad()
        output = self.model(self.test_input)
        # CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten both.
        loss = self.loss_function(output.reshape(-1, output.size(-1)), self.test_input.reshape(-1))

        loss.backward()
        for name, parameter in self.model.named_parameters():
            if parameter.grad is None:
                continue  # parameters not used in this forward pass have no gradient
            self.assertFalse(torch.isnan(parameter.grad).any(), f"Gradient for {name} contains NaNs")
            self.assertFalse(torch.isinf(parameter.grad).any(), f"Gradient for {name} contains Infs")

    def test_optimizer_step(self):
        # Test that an optimizer step actually updates the model parameters.
        initial_params = [param.clone() for param in self.model.parameters()]
        output = self.model(self.test_input)
        loss = self.loss_function(output.reshape(-1, output.size(-1)), self.test_input.reshape(-1))

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        for initial_param, param in zip(initial_params, self.model.parameters()):
            self.assertFalse(torch.equal(initial_param, param), "Model parameters did not change after an optimizer step")

    def test_prediction(self):
        # Test single-inference latency and output shape.
        start_time = time.time()
        prediction = self.model(self.test_input)
        latency = time.time() - start_time

        self.assertLess(latency, 1)  # latency should be under 1 second
        self.assertEqual(prediction.shape, (1, 1024, 64007))  # test if output shape is correct


    def test_memory_consumption(self):
        # Test that a single forward pass allocates less than 2 GB of additional GPU memory.
        start_mem = torch.cuda.memory_allocated()
        prediction = self.model(self.test_input)
        end_mem = torch.cuda.memory_allocated()
        mem_diff = end_mem - start_mem
        self.assertLess(mem_diff, 2 * 1024 ** 3)  # memory increase should be less than 2 GB

    def test_throughput(self):
        # Test that the model sustains at least 10 inferences per second over 100 runs.
        start_time = time.time()
        for _ in range(100):
            prediction = self.model(self.test_input)
        end_time = time.time()
        throughput = 100 / (end_time - start_time)
        self.assertGreater(throughput, 10)  # model should handle at least 10 inferences per second


if __name__ == "__main__":
    unittest.main()
