Skip to content

Commit

Permalink
expand/update appwrapper FT discussion
Browse files Browse the repository at this point in the history
  • Loading branch information
dgrove-oss committed Aug 12, 2024
1 parent a31a8ab commit ea3c534
Showing 1 changed file with 73 additions and 25 deletions.
98 changes: 73 additions & 25 deletions USAGE.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ is the recommended way of submitting workloads.

### PyTorchJobs

To submit a `PyTorchJob` to MLBatch, simply include the queue name:
To submit an unwrapped `PyTorchJob` to MLBatch, simply include the queue name:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
Expand Down Expand Up @@ -114,7 +114,9 @@ Try the above with:
```sh
oc apply -n team1 -f samples/pytorchjob.yaml
```
MLBatch implicitly enables gang scheduling and packing for `PyTorchJobs`.
MLBatch implicitly enables gang scheduling and packing for `PyTorchJobs` by
configuring the Kubeflow Training Operator to automatically inject the
necessary scheduling directives into all Pods it creates for `PyTorchJobs`.

### AppWrappers

Expand Down Expand Up @@ -154,7 +156,7 @@ See [AppWrappers](https://project-codeflare.github.io/appwrapper/) for more
information and use cases.

MLBatch implicitly enables packing for `AppWrappers`. For workloads consisting
of multiple pods, add a `PodGroup` to enable and gang scheduling, for instance:
of multiple pods, add a `PodGroup` to enable gang scheduling, for instance:
```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
Expand Down Expand Up @@ -282,28 +284,74 @@ on constant human monitoring of workload health. AppWrappers should be used to
submit all workloads that are intended to run without close human supervision of
their progress.
Throughout the execution of a workload, the AppWrapper controller monitors the
number and health of the workload's Pods and uses this information to determine
if a workload is unhealthy. A workload can be deemed *unhealthy* if any of the
following conditions are true:
+ It has a non-zero number of `Failed` Pods.
+ It takes longer than `AdmissionGracePeriod` for the expected number of Pods
to reach the `Pending` state.
+ It takes longer than the `WarmupGracePeriod` for the expected number of
Pods to reach the `Running` state.

If a workload is determined to be unhealthy, the AppWrapper controller first
waits for a `FailureGracePeriod` to allow the primary resource controller an
opportunity to react and return the workload to a healthy state. If the
`FailureGracePeriod` passes and the workload is still unhealthy, the AppWrapper
controller will *reset* the workload by deleting its resources, waiting for a
`RetryPausePeriod`, and then creating new instances of the resources. During
this retry pause, the AppWrapper **does not** release the workload's quota; this
ensures that when the resources are recreated they will still have sufficient
quota to execute. The number of times an AppWrapper is reset is tracked as part
of its status; if the number of resets exceeds the `RetryLimit`, then the
AppWrapper moves into a `Failed` state and its resources are deleted (thus
finally releasing its quota).
```mermaid!
---
title: Overview of AppWrapper Fault Tolerance Phase Transitions
---
stateDiagram-v2

rn : Running
s : Succeeded
f : Failed
rt : Resetting
rs : Resuming

%% Happy Path
rn --> s

%% Requeuing
rn --> f : Retries Exceeded
rn --> rt : Workload Unhealthy
rt --> rs : All Resources Removed
rs --> rn : All Resources Recreated

classDef quota fill:lightblue
class rs quota
class rn quota
class rt quota

classDef failed fill:pink
class f failed

classDef succeeded fill:lightgreen
class s succeeded
```

Throughout the execution of the workload, the AppWrapper controller
monitors the number and health of the workload's Pods. It also watches
the top-level created resources and for selected resources types
understands how to interpret their status information. This information
is combined to determine if a workload is unhealthy. A workload can be
deemed *unhealthy* if any of the following conditions are true:
+ There are a non-zero number of `Failed` Pods.
+ It takes longer than `AdmissionGracePeriod` for the expected
number of Pods to reach the `Pending` state.
+ It takes longer than the `WarmupGracePeriod` for the expected
number of Pods to reach the `Running` state.
+ If a non-zero number of `Running` Pods are using resources
that Autopilot has tagged as `NoExecute`.
+ The status information of a batch/v1 Job or PyTorchJob indicates
that it has failed.
+ A top-level wrapped resource is externally deleted.

If a workload is determined to be unhealthy by one of the first three
Pod-level conditions above, the AppWrapper controller first waits for
a `FailureGracePeriod` to allow the primary resource controller an
opportunity to react and return the workload to a healthy state. The
`FailureGracePeriod` is elided for the remaining conditions because the
primary resource controller is not expected to take any further
action. If the `FailureGracePeriod` passes and the workload is still
unhealthy, the AppWrapper controller will *reset* the workload by
deleting its resources, waiting for a `RetryPausePeriod`, and then
creating new instances of the resources. During this retry pause, the
AppWrapper **does not** release the workload's quota; this ensures
that when the resources are recreated they will still have sufficient
quota to execute. The number of times an AppWrapper is reset is
tracked as part of its status; if the number of resets exceeds the
`RetryLimit`, then the AppWrapper moves into a `Failed` state and its
resources are deleted (thus finally releasing its quota). Deletion of
a top-level wrapped resource will cause the AppWrapper directly enter
the `Failed` state independent of the `RetryLimit`.

To support debugging `Failed` workloads, an annotation can be added to an
AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the
Expand Down

0 comments on commit ea3c534

Please sign in to comment.