Signed-off-by: Dmitry Shmulevich <[email protected]>
Showing 4 changed files with 178 additions and 133 deletions.
@@ -0,0 +1,15 @@
# Topograph with Kubernetes

In Kubernetes, Topograph performs two main actions:

- Creates a ConfigMap containing the topology information.
- Applies node labels that define the node’s position within the cloud topology. For instance, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will label the node with the following:
  - `topology.kubernetes.io/network-level-1: S1`
  - `topology.kubernetes.io/network-level-2: S2`
  - `topology.kubernetes.io/network-level-3: S3`
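
As a rough illustration of the labeling scheme above, the following sketch prints the labels for a node given its switch path, ordered from the nearest switch outward. The function name `print_topology_labels` is hypothetical and not part of Topograph; it only mirrors the naming pattern described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Topograph): print the labels described
# above for a node, given its switches ordered from nearest to farthest.
print_topology_labels() {
  local level=1
  local switch
  for switch in "$@"; do
    printf 'topology.kubernetes.io/network-level-%d: %s\n' "$level" "$switch"
    level=$((level + 1))
  done
}

# Node -> S1 -> S2 -> S3, as in the example above.
print_topology_labels S1 S2 S3
```

On a live cluster, something like `kubectl get nodes -L topology.kubernetes.io/network-level-1` should show the applied value as an extra column, assuming the labels were set as described.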

## Configuration and Deployment
TBD

## Validation and Testing
TBD
@@ -0,0 +1,99 @@
# Topograph with SLURM

For the SLURM engine, Topograph supports [tree](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/tree) and [block](https://slurm.schedmd.com/topology.conf.html#SECTION_topology/block) topology configurations.

### Test Provider and Engine
There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes.

## Installation and Configuration
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.

The configuration file and certificates created by the installer are located in the `/etc/topograph` directory.

#### Service Management
To enable and start the service, run the following commands:
```bash
systemctl enable topograph.service
systemctl start topograph.service
```

Upon starting, the service executes:
```bash
/usr/local/bin/topograph -c /etc/topograph/topograph-config.yaml
```

To disable and stop the service, run the following commands:
```bash
systemctl stop topograph.service
systemctl disable topograph.service
systemctl daemon-reload
```

#### Verifying Health
To verify the service is healthy, you can use the following command:

```bash
curl http://localhost:49021/healthz
```
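
Because the service may take a moment to come up after `systemctl start`, a script can poll the health endpoint rather than check it once. The sketch below is illustrative only: the function name `wait_for_healthy` and its retry parameters are hypothetical, and it assumes a healthy service answers `/healthz` with a 2xx status.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the health endpoint until it answers successfully.
# Assumption: a healthy service returns a 2xx status on /healthz.
wait_for_healthy() {
  local url="${1:-http://localhost:49021/healthz}"
  local attempts="${2:-10}"   # maximum number of tries
  local delay="${3:-2}"       # seconds to wait between tries
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}
```

A typical call would be `wait_for_healthy && echo "topograph is healthy"`, using the default URL from the command above.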

#### Using Toposim
To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim.
```yaml
forward_service_url: dns:localhost:49025
```
Then run the topograph service as normal.

Next, start the toposim service as follows, setting the path to the test model that you want to use in the simulation:
```bash
/usr/local/bin/topograph -m /usr/local/bin/tests/models/<cluster-model>.yaml
```

You can then verify the simulated topology results by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider.
To view the tree topology, use the command:
```bash
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test"}}' http://localhost:49021/v1/generate)
```

To view the block topology (with specified block sizes), use the command:
```bash
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/<cluster-model>.yaml"}},"engine":{"name":"test", "params":{"plugin":"topology/block", "block_sizes": <block-sizes>}}}' http://localhost:49021/v1/generate)
```

You can query the results of either topology request with:
```bash
curl -s "http://localhost:49021/v1/topology?uid=$id"
```
Note that the model path specified in the topograph query should point to the same model that was provided to toposim.
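
The two-step flow above (submit a request, then fetch the result by ID) can be wrapped in a small helper. This is a sketch under assumptions: the names `topology_url` and `fetch_topology` are hypothetical, and it assumes the result endpoint returns HTTP 200 only once the topology is ready, which the text above does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of topograph).
TOPOGRAPH_URL="http://localhost:49021"

# Build the result URL for a request ID returned by /v1/generate.
topology_url() {
  printf '%s/v1/topology?uid=%s\n' "$TOPOGRAPH_URL" "$1"
}

# Fetch the result for a request ID, retrying while the status is not 200.
# Assumption: a non-200 status means the result is not ready yet.
fetch_topology() {
  local id="$1"
  local attempts="${2:-10}"   # maximum number of tries
  local delay="${3:-2}"       # seconds to wait between tries
  local i=0
  local body status
  body="$(mktemp)"
  while [ "$i" -lt "$attempts" ]; do
    # -w prints the HTTP status code; the response body goes to a temp file.
    status="$(curl -s -o "$body" -w '%{http_code}' "$(topology_url "$id")" || true)"
    if [ "$status" = "200" ]; then
      cat "$body"
      rm -f "$body"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  rm -f "$body"
  return 1
}
```

With `id` set by one of the `generate` requests above, `fetch_topology "$id"` would print the topology once it is available, or return a non-zero status after the retries run out.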

#### Automated Solution for SLURM

The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:

```bash
strigger --set --node --down --up --flags=perm --program=<script>
```

In this setup, the `<script>` would contain the curl command to call the endpoint:

```bash
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
```
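
For illustration, a trigger script along these lines could be generated as follows. This is only a sketch: the function name `write_update_script` and the file locations are assumptions, and the `strigger` registration is left commented out because it requires a running SLURM controller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a trigger script that posts a payload file
# to the generate endpoint, then make the script executable.
write_update_script() {
  local script_path="$1"
  local payload_path="$2"
  # Unquoted heredoc: ${payload_path} is expanded when the file is written.
  cat > "$script_path" <<EOF
#!/bin/bash
# Generated file: ask topograph to regenerate the cluster topology.
curl -s -X POST -H "Content-Type: application/json" -d "@${payload_path}" http://localhost:49021/v1/generate
EOF
  chmod +x "$script_path"
}

# Example (paths are placeholders):
# write_update_script /tmp/update-topology.sh /tmp/payload.json
# strigger --set --node --down --up --flags=perm --program=/tmp/update-topology.sh
```

Writing the payload path into the generated script keeps the trigger self-contained, since `strigger` runs the program without any knowledge of the caller's working directory.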

We provide the [create-topology-update-script.sh](../scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with `strigger`.

The script accepts the following parameters:
- **provider name** (aws, oci, gcp, cw, baremetal)
- **path to the generated topology update script**
- **path to the topology.conf file**

Usage:
```bash
create-topology-update-script.sh -p <provider name> -s <topology update script> -c <path to topology.conf>
```

Example:
```bash
create-topology-update-script.sh -p aws -s /etc/slurm/update-topology-config.sh -c /etc/slurm/topology.conf
```

This automation ensures that the cluster topology is updated and the SLURM configuration is reloaded whenever node status changes, keeping the cluster configuration up to date.