From 37297cf92d5f6c7c3c092c5e1da7443f224777dd Mon Sep 17 00:00:00 2001 From: Dmitry Shmulevich Date: Tue, 29 Oct 2024 09:20:22 -0700 Subject: [PATCH] update docs (#17) Signed-off-by: Dmitry Shmulevich --- README.md | 197 +++++++++++++---------------------------- docs/assets/design.png | Bin 0 -> 245860 bytes docs/k8s.md | 15 ++++ docs/slurm.md | 99 +++++++++++++++++++++ 4 files changed, 178 insertions(+), 133 deletions(-) create mode 100644 docs/assets/design.png create mode 100644 docs/k8s.md create mode 100644 docs/slurm.md diff --git a/README.md b/README.md index 146c8b7..176d036 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# topograph +# Topograph Topograph is a component designed to expose the underlying physical network topology of a cluster to enable a workload manager make network-topology aware scheduling decisions. It consists of four major components: @@ -7,14 +7,14 @@ Topograph is a component designed to expose the underlying physical network topo 3. **Topology Generator** 4. **Node Observer** +

Design

+ ## Components ### 1. CSP Connector - The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, OCI, GCP, CoreWeave, bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator. ### 2. API Server - The API Server listens for network topology configuration requests on a specific port. When a request is received, the server triggers the Topology Generator to populate the configuration. The API Server exposes two endpoints: one for synchronous requests and one for asynchronous requests. @@ -23,7 +23,6 @@ The API Server exposes two endpoints: one for synchronous requests and one for a - In the asynchronous mode, the API Server promptly returns a "202 Accepted" response to the HTTP request. It then begins generating and serializing the topology configuration. ### 3. Topology Generator - The Topology Generator is the central component that manages the overall network topology of the cluster. It performs the following functions: - **Notification Handling:** Receives notifications from the API Server. @@ -34,119 +33,84 @@ The Topology Generator is the central component that manages the overall network The Node Observer is used when the Topology Generator is deployed in a Kubernetes cluster. It monitors changes in the cluster nodes. If a node's status changes (e.g., a node goes down or comes up), the Node Observer sends a request to the API Server to generate a new topology configuration. -## Supported Environments - -Topograph functions using the concepts of `provider` and `engine`. Here, a `provider` refers to a CSP, and an `engine` denotes a scheduling system such as SLURM or Kubernetes. - -### SLURM Engine - -For the SLURM engine, topograph supports the following CSPs: -- AWS -- OCI -- GCP -- CoreWeave -- Bare metal - -### Kubernetes Engine - -Support for the Kubernetes engine is currently in the development stage. - -### Test Provider and Engine - -There is a special *provider* and *engine* named `test`, which supports both SLURM and Kubernetes. This configuration returns static results and is primarily used for testing purposes. - ## Workflow -- The API Server listens on the port and notifies the Topology Generator about incoming requests. +- The API Server listens on the port and notifies the Topology Generator about incoming requests. In kubernetes, the incoming requests sent by the Node Observer, which watches changes in the node status. - The Topology Generator receives the notification and attempts to gather the current network topology of the cluster. - The Topology Generator instructs the CSP Connector to retrieve the network topology from the CSP. - The CSP Connector fetches the topology and translates it from the CSP-specific format to an internal format. - The Topology Generator converts the internal format into the format expected by the user cluster (e.g., SLURM or Kubernetes). -- The Topology Generator returns the network topology configuration to the API Server, which then relays it back to the requester. -## Topograph Installation and Configuration -Topograph can operate as a standalone service within SLURM clusters or be deployed in Kubernetes clusters. - -### Topograph as a Standalone Service -Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch. - -#### Configuration -The default configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). It includes settings for: - - HTTP endpoint for the Topology Generator - - SSL/TLS connection - - environment variables - -By default, SSL/TLS is disabled, but the server certificate and key are generated during package installation. - -The configuration file also includes an optional section for environment variables. When specified, these variables are added to the shell environment. Note that the `PATH` variable, if provided, is appended to the existing `PATH`. - -#### Service Management -To enable and start the service, run the following commands: -```bash -systemctl enable topograph.service -systemctl start topograph.service -``` - -Upon starting, the service executes: -```bash -/usr/local/bin/topograph -c /etc/topograph/topograph-config.yaml +## Configuration +Topograph accepts its configuration file path using the `-c` command-line parameter. The configuration file is a YAML document. A sample configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). + +The configuration file supports the following parameters: +```yaml +# serving topograph endpoint +http: + # port: specifies the port on which the API server will listen (required). + port: 49021 + # ssl: enables HTTPS protocol if set to `true` (optional). + ssl: false + +# request_aggregation_delay: defines the delay before processing a request (required). +# Topograph aggregates multiple sequential requests within this delay into a single request, +# processing only if no new requests arrive during the specified duration. +request_aggregation_delay: 15s + +# forward_service_url: specifies the URL of an external gRPC service +# to which requests are forwarded (optional). +# This can be useful for testing or integration with external systems. +# See protos/topology.proto for details. +# forward_service_url: + +# page_size: sets the page size for topology requests against a CSP API (optional). +page_size: 100 + +# ssl: specifies the paths to the TLS certificate, private key, +# and CA certificate (required if `http.ssl=true`). +ssl: + cert: /etc/topograph/ssl/server-cert.pem + key: /etc/topograph/ssl/server-key.pem + ca_cert: /etc/topograph/ssl/ca-cert.pem + +# credentials_path: specifies the path to a file containing CSP credentials (optional). +# credentials_path: + +# env: environment variable names and values to inject into Topograph's shell (optional). +# The `PATH` variable, if provided, will append the specified value to the existing `PATH`. +env: +# SLURM_CONF: /etc/slurm/slurm.conf +# PATH: ``` -To disable and stop the service, run the following commands: -```bash -systemctl stop topograph.service -systemctl disable topograph.service -systemctl daemon-reload -``` - -#### Verifying Health -To verify the service is healthy, you can use the following command: - -```bash -curl http://localhost:49021/healthz -``` - -#### Using Toposim -To test the service on a simulated cluster, first add the following line to `/etc/topograph/topograph-config.yaml` so that any topology requests are forwarded to toposim. -```bash -forward_service_url: dns:localhost:49025 -``` -Then run the topograph service as normal. - -You must then start the toposim service as such, setting the path to the test model that you want to use in simulation: -```bash -/usr/local/bin/topograph -m /usr/local/bin/tests/models/.yaml -``` - -You can then verify the topology results via simulation by querying topograph using the `test` provider and engine, and specifying the test model path as a parameter to the provider. If you want to view the tree topology, then use the command: -```bash -id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/.yaml"}},"engine":{"name":"test"}}' http://localhost:49021/v1/generate) -``` +## Supported Environments -And if you want to view the block topology (with specified block sizes), use the command: -```bash -id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test", "params":{"model_path":"/usr/local/bin/topograph/tests/models/.yaml"}},"engine":{"name":"test", "params":{"plugin":"topology/block", "block_sizes": }}}' http://localhost:49021/v1/generate) -``` +Topograph operates with two primary concepts: `provider` and `engine`. A `provider` represents a CSP or a similar environment, while an engine refers to a scheduling system like SLURM or Kubernetes. -You can query the results of either topology request with: -```bash -curl -s "http://localhost:49021/v1/topology?uid=$id" -``` +Currently supported providers: +- AWS +- OCI +- GCP +- CoreWeave +- Bare metal -Note the path specified in the topograph query should point to the same model as provided to toposim. +For detailed information on supported engines, see: +- [SLURM](./docs/slurm.md) +- [Kubernetes](./docs/k8s.md) -#### Using the Cluster Topology Generator +## Using Topograph -The Cluster Topology Generator offers three endpoints for interacting with the service. Below are the details of each endpoint: +Topograph offers three endpoints for interacting with the service. Below are the details of each endpoint: -##### 1. Health Endpoint +### 1. Health Endpoint - **URL:** `http://:/healthz` - **Description:** This endpoint verifies the service status. It returns a "200 OK" HTTP response if the service is operational. -##### 2. Topology Request Endpoint +### 2. Topology Request Endpoint -- **URL:** `http(s)://:/v1/generate` +- **URL:** `http://:/v1/generate` - **Description:** This endpoint is used to request a new cluster topology. - **Payload:** The payload is a JSON object that includes the following fields: - **provider name**: (mandatory) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `cw`, `baremetal` or `test`. @@ -162,7 +126,7 @@ The Cluster Topology Generator offers three endpoints for interacting with the s - **topology_config_path**: (mandatory) A string specifying the key for the topology config in the ConfigMap. - **topology_configmap_name**: (mandatory) A string specifying the name of the ConfigMap containing the topology config. - **topology_configmap_namespace**: (mandatory) A string specifying the namespace of the ConfigMap containing the topology config. - - **nodes**: (optional) An array of regions mapping instance IDs to node names. + - **nodes**: (optional) An array of regions mapping instance IDs to node names. Example: @@ -205,9 +169,9 @@ The Cluster Topology Generator offers three endpoints for interacting with the s - **Response:** This endpoint immediately returns a "202 Accepted" status with a unique request ID if the request is valid. If not, it returns an appropriate error code. -##### 3. Topology Result Endpoint +### 3. Topology Result Endpoint -- **URL:** `http(s)://:/v1/topology` +- **URL:** `http://:/v1/topology` - **Description:** This endpoint retrieves the result of a topology request. - **URL Query Parameters:** - **uid**: Specifies the request ID returned by the topology request endpoint. @@ -223,36 +187,3 @@ id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http:/ curl -s "http://localhost:49021/v1/topology?uid=$id" ``` - -#### Automated Solution for SLURM - -The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up: - -```bash -strigger --set --node --down --up --flags=perm --program=