replace URL query with URL payload (#10)
Signed-off-by: Dmitry Shmulevich <[email protected]>
dmitsh authored Oct 19, 2024
1 parent 676ad8f commit b7d5a58
Showing 23 changed files with 319 additions and 294 deletions.
142 changes: 80 additions & 62 deletions README.md
@@ -11,7 +11,7 @@ Topograph is a component designed to expose the underlying physical network topo

### 1. CSP Connector

The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, with plans to add support for OCI, GCP, and Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.
The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. Currently, it supports AWS, OCI, GCP, CoreWeave, bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several subsequent API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats to an internal format that can be utilized by the Topology Generator.

### 2. API Server

@@ -45,6 +45,7 @@ For the SLURM engine, topograph supports the following CSPs:
- OCI
- GCP
- CoreWeave
- Bare metal

### Kubernetes Engine

@@ -64,19 +65,22 @@ There is a special *provider* and *engine* named `test`, which supports both SLU
- The Topology Generator returns the network topology configuration to the API Server, which then relays it back to the requester.

## Topograph Installation and Configuration
Topograph can operate as a standalone service within SLURM clusters or be deployed in Kubernetes clusters.

### Topograph as a Standalone Service
Topograph can be installed using the `topograph` Debian or RPM package. This package sets up a service but does not start it automatically, allowing users to update the configuration before launch.

### Configuration
#### Configuration
The default configuration file is located at [config/topograph-config.yaml](config/topograph-config.yaml). It includes settings for:
- HTTP endpoint for the Topology Generator
- SSL/TLS connection
- environment variables

By default, SSL/TLS is enabled, and the server certificate and key are generated during package installation.
By default, SSL/TLS is disabled, but the server certificate and key are generated during package installation.

The configuration file also includes an optional section for environment variables. When specified, these variables are added to the shell environment. Note that the `PATH` variable, if provided, is appended to the existing `PATH`.
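
The `PATH` rule above can be illustrated with a minimal Go sketch. This is not Topograph's implementation: `mergeEnv` is a hypothetical helper, and it assumes that variables other than `PATH` are simply set to the configured value.

```go
package main

import (
	"fmt"
	"os"
)

// mergeEnv is a hypothetical helper (not part of Topograph).
// It returns a copy of current with the extra variables applied.
// PATH is appended to any existing value; every other variable is
// set as given (an assumption, the README only says they are "added").
func mergeEnv(current, extra map[string]string) map[string]string {
	merged := make(map[string]string, len(current)+len(extra))
	for k, v := range current {
		merged[k] = v
	}
	for k, v := range extra {
		if k == "PATH" && merged["PATH"] != "" {
			// Append rather than replace, using the OS list separator.
			merged[k] = merged["PATH"] + string(os.PathListSeparator) + v
			continue
		}
		merged[k] = v
	}
	return merged
}

func main() {
	current := map[string]string{"PATH": os.Getenv("PATH")}
	// Illustrative values only.
	extra := map[string]string{"PATH": "/opt/example/bin", "EXAMPLE_VAR": "value"}
	for k, v := range mergeEnv(current, extra) {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```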

### Service Management
#### Service Management
To enable and start the service, run the following commands:
```bash
systemctl enable topograph.service
@@ -95,39 +99,90 @@ systemctl disable topograph.service
systemctl daemon-reload
```

### Testing the Service
#### Testing the Service
To verify the service is running correctly, you can use the following commands:

```bash
curl http://localhost:49021/healthz

curl -X POST "http://localhost:49021/v1/generate?provider=test&engine=test"
id=$(curl -s -X POST -H "Content-Type: application/json" -d '{"provider":{"name":"test"},"engine":{"name":"test"}}' http://localhost:49021/v1/generate)

curl -s "http://localhost:49021/v1/topology?uid=$id"
```

### Using the Cluster Topology Generator
#### Using the Cluster Topology Generator

The Cluster Topology Generator offers three endpoints for interacting with the service. Below are the details of each endpoint:

#### 1. Health Endpoint
##### 1. Health Endpoint

- **URL:** `http://<server>:<port>/healthz`
- **Description:** This endpoint verifies the service status. It returns a "200 OK" HTTP response if the service is operational.

#### 2. Topology Request Endpoint
##### 2. Topology Request Endpoint

- **URL:** `http://<server>:<port>/v1/generate`
- **URL:** `http(s)://<server>:<port>/v1/generate`
- **Description:** This endpoint is used to request a new cluster topology.
- **URL Query Parameters:**
- **provider**: (mandatory) Specifies the Cloud Service Provider (CSP) such as `aws`, `oci`, `gcp`, `azure`, or `test`.
- **engine**: (mandatory) Specifies the configuration format, either `slurm` or `k8s`.
- **topology_config_path**: (optional for `engine=slurm`, mandatory for `engine=k8s`) Specifies the file path for the topology config when `engine=slurm`, or the key for the topology config in the configmap when `engine=k8s`. If omitted for `slurm`, the content of the topology config is returned with the HTTP response.
- **topology_configmap_name**: (mandatory for `engine=k8s`) Specifies the name of the configmap containing the topology config.
- **topology_configmap_namespace**: (mandatory for `engine=k8s`) Specifies the namespace of the configmap containing the topology config.
- **skip_reload**: (optional for `engine=slurm`) Omit cluster reconfiguration if present.
- **Payload:** The payload is a JSON object that includes the following fields:
- **provider name**: (mandatory) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `cw`, `baremetal` or `test`.
- **provider credentials**: (optional) A key-value map with provider-specific parameters for authentication.
- **engine name**: (mandatory) A string specifying the topology output, either `slurm` or `k8s`.
- **engine parameters**: A key-value map with engine-specific parameters.
- **slurm parameters**:
- **topology_config_path**: (optional) A string specifying the file path for the topology configuration. If omitted, the topology config content is returned in the HTTP response.
- **plugin**: (optional) A string specifying topology plugin. Default topology/tree.
- **block_sizes**: (optional) A string specifying block size for topology/block plugin
- **skip_reload**: (optional) If present, the cluster reconfiguration is skipped.
- **k8s parameters**:
- **topology_config_path**: (mandatory) A string specifying the key for the topology config in the ConfigMap.
- **topology_configmap_name**: (mandatory) A string specifying the name of the ConfigMap containing the topology config.
- **topology_configmap_namespace**: (mandatory) A string specifying the namespace of the ConfigMap containing the topology config.
- **nodes**: (optional) An array of regions mapping instance IDs to node names.

Example:

```json
{
"provider": {
"name": "aws",
"creds": {
"access_key_id": "id",
"secret_access_key": "secret"
}
},
"engine": {
"name": "slurm",
"params": {
"plugin": "topology/block",
"block_sizes": "30,120"
}
},
"nodes": [
{
"region": "region1",
"instances": {
"instance1": "node1",
"instance2": "node2",
"instance3": "node3"
}
},
{
"region": "region2",
"instances": {
"instance4": "node4",
"instance5": "node5",
"instance6": "node6"
}
}
]
}
```

- **Response:** This endpoint immediately returns a "202 Accepted" status with a unique request ID if the request is valid. If not, it returns an appropriate error code.
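
A client written in Go could assemble the payload described above along the following lines. This is a sketch under assumptions: the type names (`topologyRequest`, `component`, `computeInstances`) and `buildPayload` are hypothetical and only mirror the documented JSON field names; they are not the types used by the server.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// computeInstances mirrors one entry of the documented "nodes" array.
type computeInstances struct {
	Region    string            `json:"region"`
	Instances map[string]string `json:"instances"` // instance ID -> node name
}

// component is a hypothetical shared shape for "provider" and "engine":
// a name plus optional key-value maps.
type component struct {
	Name   string            `json:"name"`
	Creds  map[string]string `json:"creds,omitempty"`
	Params map[string]string `json:"params,omitempty"`
}

// topologyRequest mirrors the documented top-level payload fields.
type topologyRequest struct {
	Provider component          `json:"provider"`
	Engine   component          `json:"engine"`
	Nodes    []computeInstances `json:"nodes,omitempty"`
}

// buildPayload marshals a request with the given provider, engine and
// engine parameters. Credentials and nodes are left out for brevity.
func buildPayload(provider, engine string, params map[string]string) ([]byte, error) {
	req := topologyRequest{
		Provider: component{Name: provider},
		Engine:   component{Name: engine, Params: params},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildPayload("aws", "slurm", map[string]string{
		"plugin":      "topology/block",
		"block_sizes": "30,120",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Empty maps and slices are omitted from the output, so `buildPayload("test", "test", nil)` yields the same minimal body used in the test command earlier in this README.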

#### 3. Topology Result Endpoint
##### 3. Topology Result Endpoint

- **URL:** `http://<server>:<port>/v1/topology`
- **URL:** `http(s)://<server>:<port>/v1/topology`
- **Description:** This endpoint retrieves the result of a topology request.
- **URL Query Parameters:**
- **uid**: Specifies the request ID returned by the topology request endpoint.
@@ -139,37 +194,29 @@ The Cluster Topology Generator offers three endpoints for interacting with the s
Example usage:

```bash
id=$(curl -s -X POST "http://localhost:49021/v1/generate?provider=aws&engine=slurm&topology_config_path=/path/to/topology.conf")

curl -s http://localhost:49021/v1/topology?uid=$id
```

You can optionally skip the SLURM reconfiguration:

```bash
id=$(curl -X POST "http://localhost:49021/v1/generate?provider=oci&engine=slurm&topology_config_path=/path/to/topology.conf&skip_reload)
id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate)

curl -s http://localhost:49021/v1/topology?uid=$id
curl -s "http://localhost:49021/v1/topology?uid=$id"
```
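
The same two-step flow can be sketched in Go. The helpers `submit`, `fetch` and `newStubServer` are hypothetical. The sketch assumes that `/v1/generate` returns the request ID as the plain response body (as the `id=$(curl ...)` usage suggests), and it runs against a local stub instead of a real Topograph instance, so the ID and topology text are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// submit posts the JSON payload to /v1/generate and returns the request ID.
// Assumption: the ID is the plain-text response body of a 202 response.
func submit(client *http.Client, baseURL string, payload []byte) (string, error) {
	resp, err := client.Post(baseURL+"/v1/generate", "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("generate: unexpected status %d: %s", resp.StatusCode, body)
	}
	return strings.TrimSpace(string(body)), nil
}

// fetch queries /v1/topology for the given request ID and returns the
// HTTP status code and body, leaving interpretation to the caller.
func fetch(client *http.Client, baseURL, uid string) (int, string, error) {
	resp, err := client.Get(baseURL + "/v1/topology?uid=" + url.QueryEscape(uid))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(body), nil
}

// newStubServer stands in for Topograph; its responses are placeholders.
func newStubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/generate", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprint(w, "req-1")
	})
	mux.HandleFunc("/v1/topology", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("uid") != "req-1" {
			http.Error(w, "unknown request ID", http.StatusNotFound)
			return
		}
		fmt.Fprint(w, "stub topology")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	payload := []byte(`{"provider":{"name":"test"},"engine":{"name":"test"}}`)
	id, err := submit(srv.Client(), srv.URL, payload)
	if err != nil {
		panic(err)
	}
	status, body, err := fetch(srv.Client(), srv.URL, id)
	if err != nil {
		panic(err)
	}
	fmt.Println(id, status, body)
}
```

Against a real deployment, `baseURL` would be something like `http://localhost:49021`, and a caller would likely retry `fetch` until the result is ready.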

### Automated Solution for SLURM
#### Automated Solution for SLURM

The Cluster Topology Generator enables a fully automated solution when combined with SLURM's `strigger` command. You can set up a trigger that runs whenever a node goes down or comes up:

```bash
strigger --set --node --down --up --flags=perm --program=<script>
```

In this setup, the `<script>` would contain the curl command to call the asynchronous endpoint:
In this setup, the `<script>` would contain the curl command to call the endpoint:

```bash
curl -X POST "http://localhost:49021/v1/generate?provider=aws&engine=slurm&topology_config_path=/path/to/topology.conf"
curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate
```

We provide the [create-topology-update-script.sh](scripts/create-topology-update-script.sh) script, which performs the steps outlined above: it creates the topology update script and registers it with the strigger.

The script accepts the following parameters:
- **provider name** (aws, oci, gcp)
- **provider name** (aws, oci, gcp, cw, baremetal)
- **path to the generated topology update script**
- **path to the topology.conf file**

@@ -184,32 +231,3 @@ create-topology-update-script.sh -p aws -s /etc/slurm/update-topology-config.sh
```

This automation ensures that your cluster topology is updated and SLURM configuration is reloaded whenever there are changes in node status, maintaining an up-to-date cluster configuration.
### Optional: Instance ID to Node Name Mapping
You can provide a mapping between CSP VM instance IDs and SLURM cluster node names as part of the request payload:
```bash
cat payload.json
{
"nodes": [
{
"region": "region1",
"instances": {
"instance1": "node1",
"instance2": "node2",
"instance3": "node3"
}
},
{
"region": "region2",
"instances": {
"instance4": "node4",
"instance5": "node5",
"instance6": "node6"
}
}
]
}
curl -X POST -H "Content-Type: application/json" -d @payload.json "http://localhost:49021/v1/generate?provider=aws&engine=slurm"
```
2 changes: 1 addition & 1 deletion config/topograph-config.yaml
@@ -1,7 +1,7 @@
# serving topograph endpoint
http:
port: 49021
ssl: true
ssl: false

# waiting period before processing a request
request_aggregation_delay: 15s
2 changes: 0 additions & 2 deletions pkg/common/const.go
@@ -29,8 +29,6 @@ const (
EngineTest = "test"

KeyUID = "uid"
KeyProvider = "provider"
KeyEngine = "engine"
KeyTopoConfigPath = "topology_config_path"
KeyTopoConfigmapName = "topology_configmap_name"
KeyTopoConfigmapNamespace = "topology_configmap_namespace"
93 changes: 37 additions & 56 deletions pkg/common/types.go
@@ -59,7 +59,7 @@ func (e *HTTPError) Error() string {
}

type Provider interface {
GetCredentials(*Credentials) (interface{}, error)
GetCredentials(map[string]string) (interface{}, error)
GetComputeInstances(context.Context, Engine) ([]ComputeInstances, error)
GenerateTopologyConfig(context.Context, interface{}, int, []ComputeInstances) (*Vertex, error)
}
@@ -68,69 +68,58 @@ type Engine interface {
GenerateOutput(context.Context, *Vertex, map[string]string) ([]byte, error)
}

type Payload struct {
Nodes []ComputeInstances `json:"nodes"`
Creds *Credentials `json:"creds,omitempty"` // access credentials
type TopologyRequest struct {
Provider provider `json:"provider"`
Engine engine `json:"engine"`
Nodes []ComputeInstances `json:"nodes"`
}

type ComputeInstances struct {
Region string `json:"region"`
Instances map[string]string `json:"instances"` // <instance ID>:<node name> map
type provider struct {
Name string `json:"name"`
Creds map[string]string `json:"creds"` // access credentials
}

type Credentials struct {
AWS *AWSCredentials `yaml:"aws,omitempty" json:"aws,omitempty"` // AWS credentials
OCI *OCICredentials `yaml:"oci,omitempty" json:"oci,omitempty"` // OCI credentials
type engine struct {
Name string `json:"name"`
Params map[string]string `json:"params"` // access credentials
}

type AWSCredentials struct {
AccessKeyId string `yaml:"access_key_id" json:"access_key_id"`
SecretAccessKey string `yaml:"secret_access_key" json:"secret_access_key"`
Token string `yaml:"token,omitempty" json:"token,omitempty"` // token is optional
type ComputeInstances struct {
Region string `json:"region"`
Instances map[string]string `json:"instances"` // <instance ID>:<node name> map
}

type OCICredentials struct {
TenancyID string `yaml:"tenancy_id" json:"tenancy_id"`
UserID string `yaml:"user_id" json:"user_id"`
Region string `yaml:"region" json:"region"`
Fingerprint string `yaml:"fingerprint" json:"fingerprint"`
PrivateKey string `yaml:"private_key" json:"private_key"`
Passphrase string `yaml:"passphrase,omitempty" json:"passphrase,omitempty"` // passphrase is optional
func NewTopologyRequest(prv string, creds map[string]string, eng string, params map[string]string) *TopologyRequest {
return &TopologyRequest{
Provider: provider{
Name: prv,
Creds: creds,
},
Engine: engine{
Name: eng,
Params: params,
},
}
}

func (p *Payload) String() string {
func (p *TopologyRequest) String() string {
var sb strings.Builder

sb.WriteString(fmt.Sprintf("Payload:\n Nodes: %v\n", p.Nodes))
if p.Creds != nil {
sb.WriteString(" Credentials:\n")
if p.Creds.AWS != nil {
var accessKeyId, secretAccessKey, token string
if len(p.Creds.AWS.AccessKeyId) != 0 {
accessKeyId = "***"
}
if len(p.Creds.AWS.SecretAccessKey) != 0 {
secretAccessKey = "***"
}
if len(p.Creds.AWS.Token) != 0 {
token = "***"
}
sb.WriteString(fmt.Sprintf(" AWS: AccessKeyID=%s SecretAccessKey=%s SessionToken=%s\n",
accessKeyId, secretAccessKey, token))
}
if p.Creds.OCI != nil {
sb.WriteString(" OCI:\n")
sb.WriteString(fmt.Sprintf(" UserID=%s\n", p.Creds.OCI.UserID))
sb.WriteString(fmt.Sprintf(" TenancyID=%s\n", p.Creds.OCI.TenancyID))
sb.WriteString(fmt.Sprintf(" Region=%s\n", p.Creds.OCI.Region))
}
sb.WriteString("TopologyRequest:\n")
sb.WriteString(fmt.Sprintf(" Provider: %s\n", p.Provider.Name))
sb.WriteString(" Credentials: ")
for key := range p.Provider.Creds {
sb.WriteString(fmt.Sprintf("%s=***,", key))
}
sb.WriteString("\n")
sb.WriteString(fmt.Sprintf(" Engine: %s\n", p.Engine.Name))
sb.WriteString(fmt.Sprintf(" Parameters: %v\n", p.Engine.Params))
sb.WriteString(fmt.Sprintf(" Nodes: %s\n", p.Nodes))

return sb.String()
}

func GetPayload(body []byte) (*Payload, error) {
var payload Payload
func GetTopologyRequest(body []byte) (*TopologyRequest, error) {
var payload TopologyRequest

if len(body) == 0 {
return &payload, nil
@@ -140,13 +129,5 @@ func GetPayload(body []byte) (*Payload, error) {
return nil, fmt.Errorf("failed to parse payload: %v", err)
}

if payload.Creds != nil {
if payload.Creds.AWS != nil {
if len(payload.Creds.AWS.AccessKeyId) == 0 || len(payload.Creds.AWS.SecretAccessKey) == 0 {
return nil, fmt.Errorf("invalid payload: must provide access_key_id and secret_access_key for AWS")
}
}
}

return &payload, nil
}
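
The new `String()` method masks every credential value by iterating over the `creds` map, and Go does not guarantee map iteration order, so the key order in the logged line may differ between calls. A possible variation, not part of this commit, sorts the keys first to get stable output. `maskCreds` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// maskCreds is a hypothetical helper (not in the commit). It renders
// credential keys in sorted order with every value replaced by "***".
func maskCreds(creds map[string]string) string {
	keys := make([]string, 0, len(creds))
	for k := range creds {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order regardless of map iteration

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"=***")
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(maskCreds(map[string]string{
		"secret_access_key": "secret",
		"access_key_id":     "id",
	}))
	// prints: access_key_id=***,secret_access_key=***
}
```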