Training Infrastructure Scripts

This repository contains scripts developed at Imbue for managing a large cluster of H100s, detecting and fixing hardware issues, and keeping model training running smoothly. You can read more about our process here.

The code is organized as follows:

gpu_stress_test contains a check that the GPUs on each machine are able to allocate large tensors and perform standard operations.
health_checks contains various checks we use to determine which hosts are healthy, as well as automated solutions to common issues.
host_validation contains tests to check that the GPUs on a given machine are able to communicate with each other (via NVLink) and with GPUs on other machines (via InfiniBand).
ufm_events contains a script that parses the UFM event log and other logs, checks for relevant events, and determines which network ports should be disabled.
ib_burn contains a script for generating a comprehensive burn-in workload for IB fabrics, aiming to exercise every available link.
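
To make the idea of exercising every connection concrete, here is a minimal sketch of a round-robin pairing schedule: each round pairs hosts into disjoint pairs, and across all rounds every pair of hosts communicates exactly once. This is an illustration only and not the method used by the ib_burn script, which may work quite differently (for example, by taking the fabric topology into account). The function names `round_robin_rounds` and `covers_all_pairs` and the host names are hypothetical.

```python
from itertools import combinations


def round_robin_rounds(hosts):
    """Return a list of rounds; each round is a list of disjoint (a, b) host pairs.

    Uses the standard "circle method": the first host stays fixed while the
    others rotate, so every pair of hosts meets exactly once overall.
    """
    hosts = list(hosts)
    if len(hosts) % 2 == 1:
        # Odd host count: add a placeholder; whoever is paired with it sits out.
        hosts.append(None)
    n = len(hosts)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = hosts[i], hosts[n - 1 - i]
            if a is not None and b is not None:
                pairs.append((a, b))
        rounds.append(pairs)
        # Keep the first host fixed and rotate the rest by one position.
        hosts = [hosts[0], hosts[-1]] + hosts[1:-1]
    return rounds


def covers_all_pairs(hosts, rounds):
    """Check that the schedule contains every unordered host pair exactly once."""
    scheduled = [frozenset(pair) for rnd in rounds for pair in rnd]
    expected = {frozenset(pair) for pair in combinations(hosts, 2)}
    return len(scheduled) == len(expected) and set(scheduled) == expected


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    example_hosts = ["host-a", "host-b", "host-c", "host-d"]
    for number, rnd in enumerate(round_robin_rounds(example_hosts), start=1):
        print(f"round {number}: {rnd}")
```

Because the pairs within a round are disjoint, all transfers in a round could run concurrently. A pairwise schedule like this only guarantees that every host pair is covered; it says nothing about which physical links the traffic traverses, so a real burn-in would need knowledge of the fabric's routing.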

imbue-ai/cluster-health
