Commit 04ddd10: update k8s doc

dmitsh committed Oct 30, 2024 (1 parent 37297cf)
Signed-off-by: Dmitry Shmulevich <[email protected]>
Showing 2 changed files with 82 additions and 24 deletions.
README.md: 14 additions & 18 deletions
# Topograph

Topograph is a component designed to expose the underlying physical network topology of a cluster, enabling a workload manager to make network-topology-aware scheduling decisions.

Topograph consists of four major components:

1. **API Server**
2. **Node Observer**
3. **CSP Connector**
4. **Topology Generator**

<p align="center"><img src="docs/assets/design.png" width="600" alt="Design"></p>

## Components


### 1. API Server
The API Server listens for network topology configuration requests on a specific port. When a request is received, the server triggers the Topology Generator to populate the configuration.

### 2. Node Observer
The Node Observer is used when the Topology Generator is deployed in a Kubernetes cluster. It monitors changes in the cluster nodes.
If a node's status changes (e.g., a node goes down or comes up), the Node Observer sends a request to the API Server to generate a new topology configuration.
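The readiness-transition check the Node Observer performs can be sketched as a pure function. This is a hypothetical illustration, not Topograph's actual implementation; the function names and the node data shape are assumptions, loosely modeled on the Kubernetes `Node` status conditions:

```python
# Hypothetical sketch of the Node Observer's decision logic: request a
# topology refresh only when a node's Ready condition actually flips.
# Names and data shapes are illustrative, not Topograph's real code.

def is_ready(node: dict) -> bool:
    """Return True if the node's Ready condition reports 'True'."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False

def needs_topology_refresh(old_node: dict, new_node: dict) -> bool:
    """A refresh is warranted when readiness changes in either direction."""
    return is_ready(old_node) != is_ready(new_node)
```

Filtering on an actual readiness transition avoids regenerating the topology on every unrelated node update the watch delivers.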

### 3. CSP Connector
The CSP Connector is responsible for interfacing with various CSPs to retrieve cluster-related information. It currently supports AWS, OCI, GCP, CoreWeave, and bare metal, with plans to add support for Azure. The primary goal of the CSP Connector is to obtain the network topology configuration of a cluster, which may require several successive API calls. Once the information is obtained, the CSP Connector translates the network topology from CSP-specific formats into an internal format that the Topology Generator can use.

### 4. Topology Generator
The Topology Generator is the central component that manages the overall network topology of the cluster. It performs the following functions:

- **Notification Handling:** Receives notifications from the API Server.
- **Topology Gathering:** Instructs the CSP Connector to fetch the current network topology from the CSP.
- **User Cluster Update:** Translates network topology from the internal format into a format expected by the user cluster, such as SLURM or Kubernetes.
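The "User Cluster Update" step for SLURM can be illustrated with a small sketch that renders a two-level switch tree into SLURM's `topology.conf` syntax (`SwitchName=... Nodes=...` for leaf switches, `SwitchName=... Switches=...` for aggregating switches). The internal data shape shown here is an assumption for illustration, not Topograph's actual format:

```python
# Sketch: render a two-level switch tree into SLURM topology.conf lines.
# The input shape (leaf switch name -> node names) is hypothetical.

def to_slurm_topology(leaf_switches: dict, spine: str) -> str:
    lines = [
        f"SwitchName={sw} Nodes={','.join(nodes)}"
        for sw, nodes in sorted(leaf_switches.items())
    ]
    # The spine switch aggregates all leaf switches into one tree.
    lines.append(
        f"SwitchName={spine} Switches={','.join(sorted(leaf_switches))}"
    )
    return "\n".join(lines)
```

For Kubernetes, the analogous step emits node labels and a ConfigMap instead, as described in docs/k8s.md.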

## Workflow

- The API Server listens on a designated port and notifies the Topology Generator about incoming requests. In Kubernetes, incoming requests are sent by the Node Observer, which watches for changes in node status.
docs/k8s.md: 68 additions & 6 deletions
# Topograph with Kubernetes

Topograph is a tool designed to enhance scheduling decisions in Kubernetes clusters by leveraging network topology information.

### Overview

Topograph's primary objective is to assist the Kubernetes scheduler in making intelligent pod placement decisions based on the cluster's network topology. It achieves this by:

1. Interacting with Cloud Service Providers (CSPs)
2. Extracting cluster topology information
3. Updating the Kubernetes environment with this topology data

### Current Functionality

Topograph performs the following key actions:

1. **ConfigMap Creation**: Generates a ConfigMap containing topology information. This ConfigMap is not currently utilized but serves as an example for potential future integration with the scheduler or other systems.

2. **Node Labeling**: Applies labels to nodes that define their position within the cloud topology. For example, if a node connects to switch S1, which connects to switch S2, and then to switch S3, Topograph will apply the following labels to the node:

```
topology.kubernetes.io/network-level-1: S1
topology.kubernetes.io/network-level-2: S2
topology.kubernetes.io/network-level-3: S3
```
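A consumer of these labels, whether a scheduler plugin or an ad hoc script, can group nodes by their level-1 switch to identify sets of nodes behind the same leaf switch. A minimal sketch, assuming the node-to-labels mapping has already been fetched from the Kubernetes API (the function name and input shape are illustrative):

```python
from collections import defaultdict

# Sketch: group nodes by their network-level-1 label to find sets of
# nodes behind the same leaf switch. The label key matches the ones
# Topograph applies; the node data shape is illustrative.
LEVEL1 = "topology.kubernetes.io/network-level-1"

def placement_groups(nodes: dict) -> dict:
    """nodes maps node name -> label dict; returns switch -> node names."""
    groups = defaultdict(list)
    for name, labels in nodes.items():
        switch = labels.get(LEVEL1)
        if switch is not None:
            groups[switch].append(name)
    return dict(groups)
```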

### Use of Topograph

While there is currently no fully network-aware scheduler capable of optimally placing groups of pods based on network considerations, Topograph serves as a stepping stone toward developing such a scheduler.

Topograph can be used in conjunction with Kubernetes' existing PodAffinity and Topology Spread Constraints features. This combination enhances pod distribution based on network topology information.
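For instance, a `topologySpreadConstraints` stanza can spread replicas across level-2 network blocks to improve fault isolation. This is a sketch: the label key is the one Topograph applies, while the app label and skew values are illustrative:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/network-level-2
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: myapp
```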

The following excerpt describes a Kubernetes object specification for a cluster with a three-tier network switch hierarchy. The goal is to improve inter-pod communication by assigning pods to nodes within
closer network proximity.

```yaml
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 70
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- myapp
topologyKey: topology.kubernetes.io/network-level-2
- weight: 90
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- myapp
topologyKey: topology.kubernetes.io/network-level-1
```
Pods are preferentially placed on nodes sharing the same `topology.kubernetes.io/network-level-1` label value (the higher weight, 90). These nodes are connected to the same network switch, ensuring the lowest latency for communication.

Nodes sharing the same `topology.kubernetes.io/network-level-2` label value are next in priority (weight 70). Pods on these nodes will still be relatively close, but with slightly higher latency.

In a three-tier network, all nodes share the same top-level switch and therefore the same `topology.kubernetes.io/network-level-3` label, so that key does not need to be included in the pod affinity settings.

Since the default Kubernetes scheduler places one pod at a time, the placement may vary depending on where
the first pod is placed. As a result, each scheduling decision might not be globally optimal.
However, by aligning pod placement with network-aware labels, we can significantly improve inter-pod
communication efficiency within the limitations of the scheduler.

## Configuration and Deployment
TBD