
Add a Benchmarking / profiling example #11

Open · 4 tasks
lebrice opened this issue Jun 26, 2024 · 3 comments · May be fixed by #45

lebrice commented Jun 26, 2024

Should be completed after mila-iqia/mila-docs#247

Now that we have an example of how to benchmark the throughput and identify bottlenecks in the mila-docs, the research project template should also make this easy to do.

  • Add an example experiment configuration and accompanying notebook that use the PyTorch profiler and do the same kind of profiling as in that example, but with the template.
  • Add an example of a sweep over some parameters, with training throughput as the metric, across different kinds of GPUs.
  • Create a wandb report comparing the throughput between the different GPU types:
      1. Find the datamodule parameters that maximize throughput (batches per second) without training (NoOp algo); a rough sketch of this measurement is shown after this list.
      2. Measure the performance on different GPUs using the optimal datamodule params from step 1 (keeping the other parameters the same).
      3. Using those results, do a simple sweep over model hyper-parameters to maximize the utilization of the selected GPU (chosen as a tradeoff between performance and how hard it is to get an allocation). For example, if the RTX8000s are 20% slower than A100s but 5x easier to get an allocation on, use those instead.
  • If done after DRAC support, also include a comparison between the Mila and DRAC clusters. (For example, the optimal num_workers might be higher on DRAC due to the very slow $SCRATCH filesystems; that could be interesting to look into.)
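
For the first step of that report (finding datamodule parameters that maximize throughput without training), the measurement itself is roughly the following. This is only a hand-rolled sketch using a fake dataset; in the template it would be driven by a Hydra sweep with the NoOp algorithm and logged to wandb:

```python
# Rough sketch: measure dataloading throughput (batches/sec) for several
# num_workers values, without training anything. FakeData just keeps the
# example self-contained; a real run would use the actual dataset.
import time

from torch.utils.data import DataLoader
from torchvision import datasets, transforms


def batches_per_second(loader: DataLoader, max_batches: int = 100) -> float:
    """Iterate over (up to) `max_batches` batches and return the batches/sec rate."""
    start = time.perf_counter()
    n = 0
    for n, _batch in enumerate(loader, start=1):
        if n >= max_batches:
            break
    return n / (time.perf_counter() - start)


if __name__ == "__main__":  # guard needed for num_workers > 0 with the spawn start method
    dataset = datasets.FakeData(size=10_000, transform=transforms.ToTensor())
    for num_workers in (0, 2, 4, 8, 16):
        loader = DataLoader(dataset, batch_size=128, num_workers=num_workers)
        print(f"num_workers={num_workers}: {batches_per_second(loader):.1f} batches/s")
```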

lebrice commented Jul 5, 2024

Also interesting: https://github.com/nschloe/tuna
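
In case it helps, a minimal sketch of how tuna could be used here; the `main` import below is a hypothetical placeholder for whatever entry point you want to profile, not the template's actual API:

```python
# Rough sketch: dump a cProfile trace of a training run, then visualize it with tuna.
import cProfile

from my_project.main import main  # hypothetical entry point, not the template's API

with cProfile.Profile() as profiler:
    main()
profiler.dump_stats("train.prof")

# Then, from a shell:  tuna train.prof
# tuna can also visualize import times:
#   python -X importtime my_script.py 2> import.log && tuna import.log
```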


lebrice commented Sep 10, 2024

More specific breakdown of the example notebook steps:

  1. Instrumenting your code: adding metrics so you can measure the things you care about, e.g. a) training speed (steps or samples per second), b) CPU/GPU utilization, c) RAM / VRAM utilization, etc.
    • This is achieved with a callback (MeasureSamplesPerSecondCallback); a minimal sketch of such a callback is shown after this list.
    • An easy way to set this up is with wandb: you get these "for free" in the system metrics panel.
  2. Establish a baseline: what values do we get for the metrics above with our initial configuration?
  3. Check whether dataloading is the bottleneck: using the NoOp algorithm, check whether the throughput (metric a) is much higher than when actually training.
    • If it is much higher, we can safely assume that the dataloader isn't the bottleneck and move on to other problems.
  4. Do we even need a GPU? Compare the speed using CPU only vs. the slowest GPU available, for a small number of steps.
    • If the CPU performance is loosely comparable to the GPU (for instance, only 1.5-2x slower), then it might be worth considering! (Let me know if this happens; one option would be to increase the number of CPUs, measure how performance scales, and then ship this kind of job to a DRAC cluster.)
    • In most workflows, using a GPU actually helps a lot.
  5. What performance do you get with each type of GPU? Based on the VRAM requirements of the job (step 1), try all the GPU types on the cluster that can accommodate this kind of job.
  6. How well are we using the GPU?
    • Once we've selected the target GPU that we want to use, measure the GPU utilization. Is it high (>80%)?
    • If it's high, then we can either stop here or keep going a bit further.
    • If it's low, then we can use the PyTorch profiler (or any other tool) to try to figure out what the bottleneck is; see the profiler sketch after this list.
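
Regarding step 1, here is a minimal sketch of such a throughput callback. This is not the template's actual MeasureSamplesPerSecondCallback, just the general idea, assuming a Lightning setup and an `(inputs, targets)` batch format:

```python
import time

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import Callback


class SamplesPerSecond(Callback):
    """Sketch of a throughput-measuring callback (the real MeasureSamplesPerSecondCallback may differ)."""

    def on_train_batch_start(self, trainer: Trainer, pl_module: LightningModule, batch, batch_idx: int) -> None:
        self._start = time.perf_counter()

    def on_train_batch_end(self, trainer: Trainer, pl_module: LightningModule, outputs, batch, batch_idx: int) -> None:
        elapsed = time.perf_counter() - self._start
        samples = batch[0].shape[0]  # assumes an (inputs, targets) batch; adapt to your datamodule
        pl_module.log("train/samples_per_second", samples / elapsed, on_step=True, prog_bar=True)
```

Adding this callback to the Trainer's `callbacks` list is enough to get the metric logged at every training step.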

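And for step 6, a rough sketch of how the PyTorch profiler could be used to look for the bottleneck when GPU utilization is low (the tiny model and synthetic data below are only there to keep the example self-contained):

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64)

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3),  # skip the first steps, then record 3
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    profile_memory=True,
) as prof:
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x.to(device)), y.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler that a training step is done
        if step >= 5:
            break

# Inspect the trace in TensorBoard, or print a summary (sort by a CUDA column on GPU runs):
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```
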

lebrice commented Sep 10, 2024

An example of step 7+ would be something like this: https://pytorch.org/blog/accelerating-generative-ai/
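
As a small, hedged illustration of one of the optimizations that post covers (compiling the model with torch.compile; not something taken from the template):

```python
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device).eval()
compiled_model = torch.compile(model)  # PyTorch 2.x

x = torch.randn(8, 3, 224, 224, device=device)
with torch.no_grad():
    compiled_model(x)  # the first call triggers compilation; later calls reuse the compiled graph
```

The post covers several other techniques as well (e.g. quantization), which go further than this example needs to.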

lebrice linked a pull request Sep 18, 2024 that will close this issue