diff --git a/README.md b/README.md
index 71bb24e..a6298af 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Topograph is a component designed to expose the underlying physical network topo
 
 ### 1. CSP Connector
 
-The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, with plans to add support for OCI, GCP, and Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.
+The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, OCI, GCP, CoreWeave, bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.
 
 ### 2. API Server
 
@@ -45,6 +45,7 @@ For the SLURM engine, topograph supports the following CSPs:
 - OCI
 - GCP
 - CoreWeave
+- Bare metal
 
 ### Kubernetes Engine
 
@@ -64,19 +65,22 @@ There is a special *provider* and *engine* named `test`, which supports both SLU
 - The Topology Generator returns the network topology configuration to the API Server, which then relays it back to the requester.
 
 ## Topograph Installation and Configuration
+Topograph can operate as a standalone service within SLURM clusters or be deployed in Kubernetes clusters.
+
+### Topograph as a Standalone Service
 Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.
 
-### Configuration
+#### Configuration
 The default configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). It includes settings for:
 - HTTP endpoint for the Topology Generator
 - SSL/TLS connection
 - environment variables
 
-By default, SSL/TLS is enabled, and the server certificate and key are generated during package installation.
+By default, SSL/TLS is disabled, but the server certificate and key are generated during package installation.
 
 The configuration file also includes an optional section for environment variables. When specified, these variables are added to the shell environment. Note that the `PATH` variable, if provided, is appended to the existing `PATH`.
 
-### Service Management
+#### Service Management
 To enable and start the service, run the following commands:
 ```bash
 systemctl enable topograph.service
@@ -95,39 +99,90 @@ systemctl disable topograph.service
 systemctl daemon-reload
 ```
 
-### Testing the Service
+#### Testing the Service
 To verify the service is running correctly, you can use the following commands:
+
 ```bash
 curl http://localhost:49021/healthz
 
-curl -X POST "http://localhost:49021/v1/generate?provider=test&engine=test"
+id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test"},"engine":{"name":"test"}}' http://localhost:49021/v1/generate)
+
+curl -s "http://localhost:49021/v1/topology?uid=$id"
 ```
 
-### Using the Cluster Topology Generator
+#### Using the Cluster Topology Generator
 
 The Cluster Topology Generator offers three endpoints for interacting with the service. Below are the details of each endpoint:
 
-#### 1. Health Endpoint
+##### 1. Health Endpoint
 
 - **URL:** `http://:/healthz`
 - **Description:** This endpoint verifies the service status. It returns a "200 OK" HTTP response if the service is operational.
 
-#### 2. Topology Request Endpoint
+##### 2. Topology Request Endpoint
 
-- **URL:** `http://:/v1/generate`
+- **URL:** `http(s)://:/v1/generate`
 - **Description:** This endpoint is used to request a new cluster topology.
-- **URL Query Parameters:**
-  - **provider**: (mandatory) Specifies the Cloud Service Provider (CSP) such as `aws`, `oci`, `gcp`, `azure`, or `test`.
-  - **engine**: (mandatory) Specifies the configuration format, either `slurm` or `k8s`.
-  - **topology_config_path**: (optional for `engine=slurm`, mandatory for `engine=k8s`) Specifies the file path for the topology config when `engine=slurm`, or the key for the topology config in the configmap when `engine=k8s`. If omitted for `slurm`, the content of the topology config is returned with the HTTP response.
-  - **topology_configmap_name**: (mandatory for `engine=k8s`) Specifies the name of the configmap containing the topology config.
-  - **topology_configmap_namespace**: (mandatory for `engine=k8s`) Specifies the namespace of the configmap containing the topology config.
-  - **skip_reload**: (optional for `engine=slurm`) Omit cluster reconfiguration if present.
+- **Payload:** The payload is a JSON object that includes the following fields:
+  - **provider name**: (mandatory) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `cw`, `baremetal` or `test`.
+  - **provider credentials**: (optional) A key-value map with provider-specific parameters for authentication.
+  - **engine name**: (mandatory) A string specifying the topology output, either `slurm` or `k8s`.
+  - **engine parameters**: A key-value map with engine-specific parameters.
+    - **slurm parameters**:
+      - **topology_config_path**: (optional) A string specifying the file path for the topology configuration. If omitted, the topology config content is returned in the HTTP response.
+      - **plugin**: (optional) A string specifying topology plugin. Default topology/tree.
+      - **block_sizes**: (optional) A string specifying block size for topology/block plugin
+      - **skip_reload**: (optional) If present, the cluster reconfiguration is skipped.
+    - **k8s parameters**:
+      - **topology_config_path**: (mandatory) A string specifying the key for the topology config in the ConfigMap.
+      - **topology_configmap_name**: (mandatory) A string specifying the name of the ConfigMap containing the topology config.
+      - **topology_configmap_namespace**: (mandatory) A string specifying the namespace of the ConfigMap containing the topology config.
+  - **nodes**: (optional) An array of regions mapping instance IDs to node names.
+
+  Example:
+
+```json
+{
+  "provider": {
+    "name": "aws",
+    "creds": {
+      "access_key_id": "id",
+      "secret_access_key": "secret"
+    }
+  },
+  "engine": {
+    "name": "slurm",
+    "params": {
+      "plugin": "topology/block",
+      "block_sizes": "30,120"
+    }
+  },
+  "nodes": [
+    {
+      "region": "region1",
+      "instances": {
+        "instance1": "node1",
+        "instance2": "node2",
+        "instance3": "node3"
+      }
+    },
+    {
+      "region": "region2",
+      "instances": {
+        "instance4": "node4",
+        "instance5": "node5",
+        "instance6": "node6"
+      }
+    }
+  ]
+}
+```
+
 - **Response:** This endpoint immediately returns a "202 Accepted" status with a unique request ID if the request is valid. If not, it returns an appropriate error code.
 
-#### 3. Topology Result Endpoint
+##### 3. Topology Result Endpoint
 
-- **URL:** `http://:/v1/topology`
+- **URL:** `http(s)://:/v1/topology`
 - **Description:** This endpoint retrieves the result of a topology request.
 - **URL Query Parameters:**
   - **uid**: Specifies the request ID returned by the topology request endpoint.
@@ -139,20 +194,12 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
 Example usage:
 
 ```bash
-id=$(curl -s -X POST "http://localhost:49021/v1/generate?provider=aws&engine=slurm&topology_config_path=/path/to/topology.conf")
-
-curl -s http://localhost:49021/v1/topology?uid=$id
-```
-
-You can optionally skip the SLURM reconfiguration:
-
-```bash
-id=$(curl -X POST "http://localhost:49021/v1/generate?provider=oci&engine=slurm&topology_config_path=/path/to/topology.conf&skip_reload)
+id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate)
 
-curl -s http://localhost:49021/v1/topology?uid=$id
+curl -s "http://localhost:49021/v1/topology?uid=$id"
 ```
 
-### Automated Solution for SLURM
+#### Automated Solution for SLURM
 
 The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:
 
@@ -160,16 +207,16 @@ The Cluster Topology Generator enables a fully automated solution when combined
 strigger --set --node --down --up --flags=perm --program=