Skip to content

Commit

Permalink
[Misc] Reorganize Zeus NSDI 23 paper artifacts (#126)
Browse files Browse the repository at this point in the history
  • Loading branch information
jaywonchung authored Sep 19, 2024
1 parent 360d0ee commit 43297f6
Show file tree
Hide file tree
Showing 19 changed files with 32 additions and 22 deletions.
7 changes: 2 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,17 +36,14 @@ zeus/
│ ├── device/ # - Abstraction layer over CPU and GPU devices
│ ├── utils/ # - Utility functions and classes
│ ├── _legacy/ # - Legacy code to keep our research papers reproducible
│ ├── show_env.py # - Installation & device detection verification script
│ └── callback.py # - Base class for callbacks during training
├── zeusd # 🌩️ Zeus daemon
├── docker/ # 🐳 Dockerfiles and Docker Compose files
├── examples/ # 🛠️ Zeus usage examples
├── capriccio/ # 🌊 A drifting sentiment analysis dataset
└── trace/ # 🗃️ Training and energy traces for various GPUs and DNNs
└── examples/ # 🛠️ Zeus usage examples
```

## Getting Started
Expand Down
2 changes: 1 addition & 1 deletion docs/research_overview/zeus.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,7 @@ We have our trace-driven simulator open-sourced [here](https://github.com/ml-ene
### Extending the Zeus simulator

Users can implement custom policies that optimize batch size and power limit, and plug it into the Zeus simulator.
We have training and energy traces for 6 different DNNs and 4 different NVIDIA GPU microarchitectures [here](https://github.com/ml-energy/zeus/tree/master/trace){.external}, which the simulator runs with.
We have training and energy traces for 6 different DNNs and 4 different NVIDIA GPU microarchitectures [here](https://github.com/ml-energy/zeus/tree/master/examples/research_reproducibility/zeus_nsdi23/trace){.external}, which the simulator runs with.

Zeus defines two abstract classes [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] in [`zeus._legacy.policy.interface`][zeus._legacy.policy.interface].
Each class optimizes the batch size and power limit of a recurring training job respectively.
Expand Down
6 changes: 3 additions & 3 deletions examples/batch_size_optimizer/capriccio/README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# Capriccio + BSO

This example will demonstrate how to integrate Zeus with [Capriccio](../../../capriccio), a drifting sentiment analysis dataset.
This example will demonstrate how to integrate Zeus with [Capriccio](../../research_reproducibility/zeus_nsdi23/capriccio), a drifting sentiment analysis dataset.

## Dependencies

1. Generate Capriccio, following the instructions in [Capriccio's README.md](../../../capriccio/).
1. If you're not using our [Docker image](https://ml.energy/zeus/getting_started/environment/), install `zeus` following our [Getting Started](https://ml.energy/zeus/getting_started/) guide.
1. Generate Capriccio, following the instructions in [Capriccio's README](../../research_reproducibility/zeus_nsdi23/capriccio).
1. Either use our Docker images or install `zeus` following our [documentation](https://ml.energy/zeus/getting_started/).
1. Install python dependencies for this example:
```sh
pip install -r requirements.txt
Expand Down
9 changes: 6 additions & 3 deletions examples/research_reproducibility/zeus_nsdi23/README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
# Running Zeus in a Trace-Driven Fashion
# Zeus NSDI 23 paper artifacts

Two big chunks in our Zeus NSDI 23 paper were (1) running Zeus in a trace-driven fashion, and (2) Capriccio, a sentiment analysis dataset with data drift.
See the [capriccio](capriccio) directory for more about the Capriccio dataset; read on for trace-driven Zeus.

While the existence of recurring jobs in production GPU clusters is clear, it is not really easy to run 50 DNN training jobs sequentially to evaluate energy optimization methods.
Thus, Zeus provides a trace-driven simulator that allows users to plug in their own customized batch size optimizer and power limit optimizers and observe gains.
Expand All @@ -8,7 +11,7 @@ We provide two types of traces.
1. Train trace: We trained six different (model, dataset) pairs with many different batch sizes. And we repeated training at least four times for each triplet with different random seeds. Thus, when we would like to know the result of training a model on a dataset with a certain batch size, we can sample a *training path* from this trace.
2. Power trace: We profiled the duration of one epoch and average power consumption for six (model, dataset) pairs with many different (batch size, power limit) configurations. These results not stochastic, and can be fetched from the trace to construct TTA (time to accuracy) and ETA (energy to accuracy) values.

Refer to the [`trace`](../../trace/) directory for more information about the traces we provide.
Refer to the [`trace`](trace) directory for more information about the traces we provide.

## Simulating the recurrence of one job

Expand Down Expand Up @@ -45,7 +48,7 @@ Please refer to our paper for details on how jobs in our train/power traces are

### Dependencies

Install `zeus` following [Installing and Building](https://ml.energy/zeus/getting_started). The power monitor is not needed.
Install `zeus` following [our Getting Started guide](https://ml.energy/zeus/getting_started).

All dependencies are already installed you're using our Docker image (`mlenergy/zeus:latest`).

Expand Down
File renamed without changes.
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
(%d_val.json).
Usage:
python capriccio/generate.py /path/to/sentiment140.json
python generate.py /path/to/sentiment140.json
"""

import argparse
Expand Down
File renamed without changes.
6 changes: 3 additions & 3 deletions examples/research_reproducibility/zeus_nsdi23/run_alibaba.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ def run_simulator(
) -> list[tuple[str, list[HistoryEntry]]]:
"""Run the simulator on the given job."""
# Read in the Alibaba trace
alibaba_df = pd.DataFrame(pd.read_csv("../../../trace/alibaba_groups.csv.xz"))
alibaba_df = pd.DataFrame(pd.read_csv("trace/alibaba_groups.csv.xz"))
print("Read in the Alibaba trace.")
print(f"Number of groups: {alibaba_df.group.nunique()}")

Expand Down Expand Up @@ -83,13 +83,13 @@ def simulate_group(
@lru_cache(maxsize=1)
def read_train_trace() -> pd.DataFrame:
"""Read the train trace file as a Pandas DataFrame."""
return pd.DataFrame(pd.read_csv("../../../trace/summary_train.csv"))
return pd.DataFrame(pd.read_csv("trace/summary_train.csv"))


@lru_cache(maxsize=1)
def read_power_trace(gpu: Literal["a40", "v100", "p100", "rtx6000"]) -> pd.DataFrame:
"""Read the power trace of the given GPU as a Pandas DataFrame."""
return pd.DataFrame(pd.read_csv(f"../../../trace/summary_power_{gpu}.csv"))
return pd.DataFrame(pd.read_csv(f"trace/summary_power_{gpu}.csv"))


def get_job_with_defaults(
Expand Down
4 changes: 2 additions & 2 deletions examples/research_reproducibility/zeus_nsdi23/run_single.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,8 +46,8 @@ def read_trace(
gpu: Literal["a40", "v100", "p100", "rtx6000"]
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Read the train and power trace files as Pandas DataFrames."""
train_df = pd.DataFrame(pd.read_csv("../../trace/summary_train.csv"))
power_df = pd.DataFrame(pd.read_csv(f"../../trace/summary_power_{gpu}.csv"))
train_df = pd.DataFrame(pd.read_csv("trace/summary_train.csv"))
power_df = pd.DataFrame(pd.read_csv(f"trace/summary_power_{gpu}.csv"))
return train_df, power_df


Expand Down
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions scripts/lint.sh
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,9 @@
set -ev

if [[ -z $GITHUB_ACTION ]]; then
black zeus capriccio tests
black zeus tests
else
black --check zeus capriccio tests
black --check zeus tests
fi

ruff check zeus
Expand Down
14 changes: 12 additions & 2 deletions tests/optimizer/batch_size/test_simulator.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,8 +55,18 @@ def read_trace(
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Read the train and power trace files as Pandas DataFrames."""
trace_dir = Path(__file__).resolve(strict=True).parents[3]
train_df = pd.DataFrame(pd.read_csv(trace_dir / "trace/summary_train.csv"))
power_df = pd.DataFrame(pd.read_csv(trace_dir / f"trace/summary_power_{gpu}.csv"))
train_df = pd.DataFrame(
pd.read_csv(
trace_dir
/ "examples/research_reproducibility/zeus_nsdi23/trace/summary_train.csv"
)
)
power_df = pd.DataFrame(
pd.read_csv(
trace_dir
/ f"examples/research_reproducibility/zeus_nsdi23/trace/summary_power_{gpu}.csv"
)
)
return train_df, power_df


Expand Down

0 comments on commit 43297f6

Please sign in to comment.