NVIDIA Maintenance Operator provides a Kubernetes API (Custom Resource Definition) that allows node maintenance operations in a K8s cluster to be performed in a coordinated manner. It performs common operations that prepare a node for maintenance, such as cordoning and draining the node.
Users/consumers can request maintenance on a node by creating a NodeMaintenance Custom Resource (CR). The operator then reconciles NodeMaintenance CRs. At a high level, this is the reconcile flow:
- Scheduling - schedule NodeMaintenance to be processed by the operator, taking into account constraints such as the maximal allowed parallel operations.
- Node preparation for maintenance, such as cordoning and draining the node
- Mark NodeMaintenance as Ready (via condition)
- Cleanup on deletion of NodeMaintenance such as node uncordon
Prerequisites: a Kubernetes cluster.
# Clone project
git clone https://github.com/Mellanox/maintenance-operator.git ; cd maintenance-operator
# Install Operator
helm install -n maintenance-operator --create-namespace --set operator.image.tag=latest maintenance-operator ./deployment/maintenance-operator-chart
# View deployed resources
kubectl -n maintenance-operator get all
Note: refer to the Helm values documentation for more information.
helm install -n maintenance-operator --create-namespace maintenance-operator oci://ghcr.io/mellanox/maintenance-operator-chart
# clone project
git clone https://github.com/Mellanox/maintenance-operator.git ; cd maintenance-operator
# build image
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make docker-build
# push image
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make docker-push
# deploy
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make deploy
# undeploy
make undeploy
The MaintenanceOperatorConfig CRD is used for operator runtime configuration.
For more information, refer to the api-reference.
apiVersion: maintenance.nvidia.com/v1alpha1
kind: MaintenanceOperatorConfig
metadata:
  name: default
  namespace: maintenance-operator
spec:
  logLevel: info
  maxParallelOperations: 4
In this example we configure the following for the operator:
- Log level (logLevel) is set to info
- The maximum number of parallel maintenance operations (maxParallelOperations) is set to 4
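To change one of these settings on a running cluster without re-applying the whole manifest, a merge patch is one option. The following is a minimal sketch, not a documented workflow of the operator. It assumes the CRD's plural resource name is `maintenanceoperatorconfigs.maintenance.nvidia.com` (inferred from the naming of `nodemaintenances.maintenance.nvidia.com`, not stated here), and the helper name `set_max_parallel` is hypothetical. The sketch accepts only plain integers, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: set spec.maxParallelOperations on the "default"
# MaintenanceOperatorConfig in the maintenance-operator namespace.
# Assumption: the resource's plural name is "maintenanceoperatorconfigs".
set_max_parallel() {
  count="$1"
  # Reject anything that is not a non-negative integer before calling kubectl.
  case "$count" in
    ''|*[!0-9]*) echo "usage: set_max_parallel <integer>" >&2; return 1 ;;
  esac
  kubectl -n maintenance-operator patch \
    maintenanceoperatorconfigs.maintenance.nvidia.com default \
    --type merge \
    -p "{\"spec\":{\"maxParallelOperations\":${count}}}"
}
```

For example, `set_max_parallel 2` would lower the limit from the example above to two parallel maintenance operations.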
The NodeMaintenance CRD is used to request a maintenance operation on a specific K8s node. In addition, it specifies which common (K8s related) operations need to happen in order to prepare the node for maintenance.
Once the node is ready for maintenance, the operator sets the Ready condition in the status field to True.
After the requestor has completed the maintenance operation, the NodeMaintenance CR should be deleted to finish the maintenance operation.
For more information, refer to the api-reference.
apiVersion: maintenance.nvidia.com/v1alpha1
kind: NodeMaintenance
metadata:
  name: my-maintenance-operation
  namespace: default
spec:
  requestorID: some.one.acme.com
  nodeName: worker-01
  cordon: true
  waitForPodCompletion:
    podSelector: "app=important"
    timeoutSeconds: 0
  drainSpec:
    force: true
    podSelector: ""
    timeoutSeconds: 0
    deleteEmptyDir: true
    podEvictionFilters:
    - byResourceNameRegex: nvidia.com/gpu-*
    - byResourceNameRegex: nvidia.com/rdma*
In this example we request maintenance for node worker-01.
The following steps occur before the node is marked as ready for maintenance:
- cordon of the worker-01 node
- waiting for pods with the app: important label to finish
- draining of worker-01 with the provided drainSpec
  - force draining of pods even if they don't belong to a controller
  - allow draining of pods with an emptyDir mount
  - only drain pods that consume either nvidia.com/gpu-* or nvidia.com/rdma* resources
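The filter field is named byResourceNameRegex, which suggests its values are regular expressions rather than shell globs. Under regex semantics, `gpu-*` means "gpu" followed by zero or more hyphens, and `.` matches any character. The sketch below uses `grep -E` purely to illustrate that difference. It does not reproduce the operator's matching code, and whether the operator matches anywhere in the name or against the whole name is not stated here, so both variants are shown. The helper names and the resource names other than `nvidia.com/gpu` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers illustrating regex (not glob) semantics with grep -E.
# They do not reproduce the operator's matching code.

# Succeeds if the pattern matches anywhere in the resource name.
matches_anywhere() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Succeeds only if the pattern matches the whole resource name.
matches_whole() {
  printf '%s\n' "$2" | grep -Exq -- "$1"
}

# Print how each pattern from the example behaves under both interpretations.
for name in nvidia.com/gpu nvidia.com/gpu-a100 nvidia.com/rdma_shared_a cpu; do
  for pattern in 'nvidia.com/gpu-*' 'nvidia.com/rdma*'; do
    if matches_anywhere "$pattern" "$name"; then any=yes; else any=no; fi
    if matches_whole "$pattern" "$name"; then whole=yes; else whole=no; fi
    echo "$pattern vs $name: anywhere=$any whole=$whole"
  done
done
```

If the intent is "any resource whose name starts with nvidia.com/gpu", a pattern such as `nvidia.com/gpu.*` expresses that under either interpretation.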
Once the node is ready for maintenance, the Ready condition will be True.
$ kubectl get nodemaintenances.maintenance.nvidia.com -A
NAME                       NODE        REQUESTOR           READY   PHASE   FAILED
my-maintenance-operation   worker-01   some.one.acme.com   True    Ready
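A requestor that automates this flow needs to wait for the Ready condition, perform its maintenance, and then delete the CR. The following is a minimal sketch of that sequence using `kubectl wait` and `kubectl delete`. The helper name `with_node_maintenance` and its argument order are hypothetical, and it assumes the NodeMaintenance CR has already been created. As a simplification, it deletes the CR even when the maintenance command fails.

```shell
#!/bin/sh
# Hypothetical requestor-side helper: wait until the operator reports the node
# as ready, run the maintenance command, then delete the NodeMaintenance CR.
# Usage: with_node_maintenance <name> <namespace> <timeout> <command> [args...]
with_node_maintenance() {
  name="$1"; namespace="$2"; timeout="$3"
  shift 3
  resource="nodemaintenances.maintenance.nvidia.com/${name}"

  # Block until the Ready condition is True, or give up after the timeout.
  kubectl -n "$namespace" wait --for=condition=Ready "$resource" \
    --timeout="$timeout" || return 1

  # Run the caller's maintenance command and remember its exit status.
  status=0
  "$@" || status=$?

  # Deleting the CR tells the operator the maintenance is finished,
  # so it can clean up (for example, uncordon the node).
  kubectl -n "$namespace" delete "$resource" || return 1
  return "$status"
}
```

A call could look like `with_node_maintenance my-maintenance-operation default 10m ./do-maintenance.sh`, where `./do-maintenance.sh` stands in for whatever the requestor actually runs.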
stateDiagram-v2
pending: maintenance request registered, waiting to be scheduled
scheduled: maintenance request scheduled
cordon: cordon node
waitForPodCompletion: wait for specified pods to complete
draining: node draining
ready: node ready for maintenance
requestorFailed: requestor failed the maintenance operations
[*] --> pending : NodeMaintenance created
pending --> scheduled : scheduler selected NodeMaintenance for maintenance, add finalizer
scheduled --> cordon : preparation for cordon completed
cordon --> waitForPodCompletion : cordon completed
waitForPodCompletion --> draining : finished waiting for pods
draining --> ready : drain operation completed successfully, node is ready for maintenance, Ready condition is set to True
ready --> requestorFailed : requestor has set RequestorFailed condition
pending --> [*] : object deleted
scheduled --> [*] : object deleted
cordon --> [*] : object marked for deletion, cleanup before deletion
waitForPodCompletion --> [*] : object marked for deletion, cleanup before deletion
draining --> [*] : object marked for deletion, cleanup before deletion
ready --> [*] : object marked for deletion, cleanup before deletion
requestorFailed --> [*] : RequestorFailed condition cleared by requestor or external user, object marked for deletion, cleanup before deletion
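The forward transitions in the diagram above can be written as a small lookup, which may help when reasoning about where a request currently sits. This is an illustrative sketch only: the function name `next_state` is hypothetical, it covers just the path from creation to ready, and it ignores deletion and the requestorFailed branch.

```shell
#!/bin/sh
# Hypothetical lookup of the forward transitions from the state diagram.
# Prints the next state, or returns 1 when there is no automatic forward
# transition (ready moves on only when the requestor acts or the object
# is deleted).
next_state() {
  case "$1" in
    pending)              echo scheduled ;;
    scheduled)            echo cordon ;;
    cordon)               echo waitForPodCompletion ;;
    waitForPodCompletion) echo draining ;;
    draining)             echo ready ;;
    *)                    return 1 ;;
  esac
}

# Walk the path from creation until the node is ready for maintenance.
state=pending
path="$state"
while next="$(next_state "$state")"; do
  state="$next"
  path="$path -> $state"
done
echo "$path"
```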