Move things & add a few pages
Delaunay committed Apr 10, 2024
1 parent cc0ff39 commit d26140b
Showing 17 changed files with 401 additions and 206 deletions.
20 changes: 20 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,20 @@
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.9"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py

python:
  install:
    - method: pip
      path: .
    - method: pip
      path: Sphinx
    - method: pip
      path: sphinx-rtd-theme
5 changes: 5 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,5 @@
{
    "githubPullRequests.ignoredPullRequestBranches": [
        "master"
    ]
}
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
49 changes: 49 additions & 0 deletions docs/Contributing/Design.rst
@@ -0,0 +1,49 @@
Design
======

Milabench aims to simulate research workloads for benchmarking purposes.

* Performance is measured as throughput (samples per second).
  For example, for a model like resnet, the throughput would be images per second.

* Single-GPU workloads are spawned once per GPU to ensure the entire machine is used,
  simulating something similar to a hyperparameter search.
  The performance of the benchmark is the sum of the throughput of each process
  (a worked example follows this list).

* Multi-GPU workloads

* Multi-node workloads

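As a worked example (the numbers are illustrative, not measured): on a
hypothetical 8-GPU node where each spawned resnet process sustains about
900 images per second, the reported score would be the sum over the
eight processes, i.e. 8 × 900 = 7200 images per second.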

Run
===

* Milabench manager process

  * Handles messages from benchmark processes
  * Saves messages into a file for future analysis

* Benchmark processes

  * Run using ``voir``
  * voir is configured to intercept and send events during the training process
  * This allows us to add models from git repositories without modification
  * voir sends data through a file descriptor that was created by the milabench
    main process (a minimal sketch follows below)

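A minimal sketch of what a benchmark process does on that channel (the
descriptor number and the event fields here are assumptions for
illustration, not the actual voir protocol):

.. code-block:: bash

   # Emit one JSON event on an inherited file descriptor.
   # fd 3 and the field names are illustrative assumptions.
   echo '{"event": "rate", "rate": 1234.5, "units": "items/s"}' >&3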

What milabench is
=================

* Training focused
* milabench shows candid performance numbers

  * No optimization beyond batch size scaling is performed
  * We want to measure the performance our researchers will see,
    not the performance they could get.

* PyTorch centric

  * PyTorch has become the de facto library for research
  * We are looking for accelerators with good maturity that can support
    this framework with limited code changes.


What milabench is not
=====================

* milabench's goal is not to be a performance showcase for an accelerator.
File renamed without changes.
File renamed without changes.
161 changes: 44 additions & 117 deletions docs/docker.rst → docs/GettingStarted/Docker.rst
@@ -3,6 +3,40 @@ Docker

`Docker Images <https://github.com/mila-iqia/milabench/pkgs/container/milabench>`_ are created for each release. They come with all the benchmarks installed and the necessary datasets. No additional downloads are necessary.


Setup
------

0. Make sure the machines can ssh to each other without passwords
   (one way to set this up is sketched right after this list)
1. Pull the milabench docker image you would like to run on all machines
   - ``docker pull``
2. Create the output directory
   - ``mkdir -p results``
3. Create a list of the nodes that will participate in the benchmark inside a ``results/system.yaml`` file (see example below)
   - ``vi results/system.yaml``
4. Call milabench, specifying the node list we created.
   - ``docker ... -v $(pwd)/results:/milabench/envs/runs -v <privatekey>:/milabench/id_milabench milabench run ... --system /milabench/envs/runs/system.yaml``

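One way to set up passwordless ssh (a sketch; it assumes the same
``<username>`` on every node and a key pair under ``$HOME/.ssh``):

.. code-block:: bash

   # Generate a key pair on the main node if one does not already exist
   ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -N ""

   # Copy the public key to every worker node (repeat for each node)
   ssh-copy-id <username>@192.168.0.26

   # Verify that login now works without a password
   ssh <username>@192.168.0.26 hostname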

.. code-block:: yaml

   system:
     sshkey: <privatekey>
     arch: cuda
     docker_image: ghcr.io/mila-iqia/milabench:${system.arch}-nightly
     nodes:
       - name: node1
         ip: 192.168.0.25
         main: true
         port: 8123
         user: <username>
       - name: node2
         ip: 192.168.0.26
         main: false
         user: <username>

CUDA
----

@@ -22,16 +56,19 @@ storing the results inside the ``results`` folder on the host machine:

.. code-block:: bash

   export SSH_KEY_FILE=$HOME/.ssh/id_rsa

   # Choose the image you want to use
   export MILABENCH_IMAGE=ghcr.io/mila-iqia/milabench:cuda-nightly

   # Pull the image we are going to run
   docker pull $MILABENCH_IMAGE

   # Run milabench
   docker run -it --rm --ipc=host --gpus=all --network host --privileged \
       -v $SSH_KEY_FILE:/milabench/id_milabench \
       -v $(pwd)/results:/milabench/envs/runs \
       $MILABENCH_IMAGE \
       milabench run
``--ipc=host`` removes shared memory restrictions, but you can also set ``--shm-size`` to a high value instead (at least ``8G``, possibly more).
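
A variant of the same command using ``--shm-size`` instead of
``--ipc=host`` (a sketch; the ``8G`` value follows the suggestion above):

.. code-block:: bash

   docker run -it --rm --shm-size=8G --gpus=all --network host --privileged \
       -v $SSH_KEY_FILE:/milabench/id_milabench \
       -v $(pwd)/results:/milabench/envs/runs \
       $MILABENCH_IMAGE \
       milabench run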
@@ -63,16 +100,19 @@ For ROCM the usage is similar to CUDA, but you must use a different image and th

.. code-block:: bash

   export SSH_KEY_FILE=$HOME/.ssh/id_rsa

   # Choose the image you want to use
   export MILABENCH_IMAGE=ghcr.io/mila-iqia/milabench:rocm-nightly

   # Pull the image we are going to run
   docker pull $MILABENCH_IMAGE

   # Run milabench
   docker run -it --rm --ipc=host --network host --privileged \
       --device=/dev/kfd --device=/dev/dri \
       --security-opt seccomp=unconfined --group-add video \
       -v $SSH_KEY_FILE:/milabench/id_milabench \
       -v /opt/amdgpu/share/libdrm/amdgpu.ids:/opt/amdgpu/share/libdrm/amdgpu.ids \
       -v /opt/rocm:/opt/rocm \
       -v $(pwd)/results:/milabench/envs/runs \
@@ -90,119 +130,6 @@ For the performance report, it is the same command:
.. code-block:: bash

   milabench report --runs /milabench/envs/runs

Multi-node benchmark
^^^^^^^^^^^^^^^^^^^^

There are currently two multi-node benchmarks: ``opt-1_3b-multinode`` (data-parallel) and
``opt-6_7b-multinode`` (model-parallel; that model is too large to fit on a single GPU). Here is how to run them:

0. Make sure the machines can ssh to each other without passwords
1. Pull the milabench docker image you would like to run on all machines
   - ``docker pull``
2. Create the output directory
   - ``mkdir -p results``
3. Create a list of the nodes that will participate in the benchmark inside a ``results/system.yaml`` file (see example below)
   - ``vi results/system.yaml``
4. Call milabench, specifying the node list we created.
   - ``docker ... -v $(pwd)/results:/milabench/envs/runs -v <privatekey>:/milabench/id_milabench milabench run ... --system /milabench/envs/runs/system.yaml``

.. note::

   The main node is the node that will be in charge of managing the other worker nodes.

.. code-block:: yaml

   system:
     sshkey: <privatekey>
     arch: cuda
     docker_image: ghcr.io/mila-iqia/milabench:${system.arch}-nightly
     nodes:
       - name: node1
         ip: 192.168.0.25
         main: true
         port: 8123
         user: <username>
       - name: node2
         ip: 192.168.0.26
         main: false
         user: <username>

Then, the command should look like this:

.. code-block:: bash

   # On manager-node:
   # Change if needed
   export SSH_KEY_FILE=$HOME/.ssh/id_rsa
   export MILABENCH_IMAGE=ghcr.io/mila-iqia/milabench:cuda-nightly

   docker run -it --rm --gpus all --network host --ipc=host --privileged \
       -v $SSH_KEY_FILE:/milabench/id_milabench \
       -v $(pwd)/results:/milabench/envs/runs \
       $MILABENCH_IMAGE \
       milabench run --system /milabench/envs/runs/system.yaml \
       --select multinode

The last line (``--select multinode``) specifically selects the multi-node benchmarks. Omit that line to run all benchmarks.

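As an illustration (assuming ``--select`` accepts individual benchmark
names as well as tags), running only the data-parallel multi-node
benchmark would look like:

.. code-block:: bash

   # Run a single benchmark by name (illustrative; the name must match
   # one defined in the benchmark configuration)
   docker ... milabench run --system /milabench/envs/runs/system.yaml \
       --select opt-1_3b-multinode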
If you need to use more than two nodes, edit or copy ``system.yaml`` and simply add the other nodes' addresses in ``nodes``.
You will also need to update the benchmark definition to increase the maximum number of nodes, by creating a new ``overrides.yaml`` file.

For example, for 4 nodes:


.. code-block:: yaml

   # Name of the benchmark. You can also override values in other benchmarks.
   opt-6_7b-multinode:
     num_machines: 4

.. code-block:: yaml

   system:
     arch: cuda
     docker_image: ghcr.io/mila-iqia/milabench:${system.arch}-nightly
     nodes:
       - name: node1
         ip: 192.168.0.25
         main: true
         port: 8123
         user: <username>
       - name: node2
         ip: 192.168.0.26
         main: false
         user: <username>
       - name: node3
         ip: 192.168.0.27
         main: false
         user: <username>
       - name: node4
         ip: 192.168.0.28
         main: false
         user: <username>

The command would then look like this:

.. code-block:: bash

   docker ... milabench run ... --system /milabench/envs/runs/system.yaml --overrides /milabench/envs/runs/overrides.yaml

.. note::

   The multi-node benchmark is sensitive to network performance. If the mono-node benchmark ``opt-6_7b`` is significantly faster than ``opt-6_7b-multinode`` (e.g. processes more than twice the items per second), this likely indicates that Infiniband is either not present or not used. (It is not abnormal for the multi-node benchmark to perform *a bit* worse than the mono-node benchmark since it has not been optimized to minimize the impact of communication costs.)

   Even if Infiniband is properly configured, the benchmark may fail to use it unless the ``--privileged`` flag is set when running the container.

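One quick way to check whether Infiniband is visible from inside the
container (a sketch; ``ibstat`` comes from the ``infiniband-diags``
package and may not be present in every image):

.. code-block:: bash

   # List InfiniBand devices known to the kernel
   ls /sys/class/infiniband

   # If infiniband-diags is installed, show port state and link rate
   ibstat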

Building images
---------------
