Create Getting_Started_Benchmarking.md #1179

benchmarks/Getting_Started_Benchmarking.md (new file, 84 additions)
### Getting started with running benchmarks in MaxText

There are two approaches:

1. Run a model recipe with a single CLI command. This is useful for replicating previously measured performance results. See https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/training/trillium for examples.
2. Run several experiments pythonically across a sweep of parameters (cluster configuration, MaxText parameters) as XPK workloads.

- **xla_flags_library.py**: A grouping of XLA flags organized by purpose, with details on how they can be applied to a model.
- **maxtext_trillium_model_config.py**: A list of model definitions for Trillium. It shows the optimized models and how they apply XLA flags, and provides a pythonic way to run MaxText models (a sketch follows this list).
- **benchmark_runner.py**: A CLI for running a specific model recipe with one command, either directly on Pathways or McJAX, or with orchestration such as XPK.
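
To show how these pieces fit together, here is a minimal sketch of what a model definition might look like. This is an assumption-laden illustration: the `MaxTextModel` class, the import paths, the flag-group names, and the tuning parameters are illustrative, not copied from the repository.

```python
# Illustrative sketch only: the names below are assumptions, not exact repo contents.
from benchmarks import xla_flags_library                           # assumed import path
from benchmarks.maxtext_trillium_model_config import MaxTextModel  # assumed class/location

llama3_1_8b_8192 = MaxTextModel(
    model_name="llama3_1-8b-8192",   # hypothetical model entry
    model_type="llama3.1-8b",
    tuning_params={
        "per_device_batch_size": 4,
        "remat_policy": "full",
        "max_target_length": 8192,
    },
    # XLA flag groups from xla_flags_library are combined per model.
    xla_flags=(
        xla_flags_library.DENSE_VMEM_LIMIT_FLAG
        + xla_flags_library.CF_FOR_ALL_GATHER
    ),
)
```

The commands below run `benchmark_runner.py` against a cluster, first directly with McJAX and then with Pathways: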

```shell
# McJax
CLUSTER=my-cluster
ZONE=my-zone
PROJECT=my-project
python3 benchmarks/benchmark_runner.py --project $PROJECT --zone $ZONE --cluster_name $CLUSTER --device_type v6e-256 --base_output_directory gs://maxtext-experiments-tpem/ --num_steps=5
```

```shell
# Pathways
export RUNNER=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/maxtext_jax_stable
export PROXY_IMAGE=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/proxy_server:latest
export SERVER_IMAGE=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/server:latest

python3 benchmarks/benchmark_runner.py --project $PROJECT --zone $ZONE --cluster_name $CLUSTER --device_type v6e-256 --base_output_directory gs://maxtext-experiments-tpem/ --num_steps=5 --pathways_server_image="${SERVER_IMAGE}" --pathways_proxy_image="${PROXY_IMAGE}" --pathways_runner_image="${RUNNER}"
```
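
Both invocations use the same cluster variables (`CLUSTER`, `ZONE`, `PROJECT`); the Pathways variant additionally points the runner at the Pathways server, proxy, and runner container images through the three `--pathways_*` flags.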

- **maxtext_xpk_runner.py**: A pythonic way to run XPK workloads. With plain Python for-loops, a single script can launch several XPK workloads across a sweep of parameters, including the libtpu version, GKE clusters, and MaxText parameters.

```python
# Loop possibilities:
# 1. Test different libtpu nightly versions.
#    for libtpu_type in [LibTpuType.NIGHTLY]:
#      todays_date = time.strftime('%Y%m%d')
#      for date in ['20241201', '20241202', todays_date]:
#
# 2. Test different model configurations.
#    for remat_policy in ['qkv_proj_offloaded', 'minimal']:
#      model.tuning_params['remat_policy'] = remat_policy

xpk_workload_names = []
xpk_workload_cmds = []

for model in list_of_models:
  # Run workloads on the clusters below.
  for cluster_config in [
      v5e_cluster_config,
      v6e_cluster_config,
  ]:
    # Run workloads in the following slice configurations.
    for num_slices in [1, 2, 4]:
      # Use the libtpu dependencies from:
      for libtpu_type in [
          LibTpuType.MAXTEXT,
      ]:
        wl_config = WorkloadConfig(
            model=model,
            num_slices=num_slices,
            device_type=cluster_config.device_type,
            base_output_directory=base_output_dir,
            priority="medium",
            max_restarts=0,
            libtpu_type=libtpu_type,
            libtpu_nightly_version="",
            base_docker_image=base_docker_image,
            pathways_config=None,
        )
        command, name = generate_xpk_workload_cmd(
            cluster_config=cluster_config,
            wl_config=wl_config,
        )

        print(f"Name of the workload is: {name} \n")
        xpk_workload_names.append(name)

        print(f"XPK command to be used is: {command} \n")
        xpk_workload_cmds.append(command)

for xpk_workload_name, xpk_workload_cmd in zip(xpk_workload_names, xpk_workload_cmds):
  return_code = run_command_with_updates(xpk_workload_cmd, xpk_workload_name)
  if return_code != 0:
    print(f'Unable to run xpk workload: {xpk_workload_name}')
```
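
Note that the sweep first collects every generated XPK command into a list and only then launches them one by one, so a workload that fails to launch (non-zero return code) is reported without stopping the rest of the sweep.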