Correctly handle resource overrides in KF plugins #4467
Conversation
Force-pushed from 55063aa to cfc2f24
Codecov Report

@@            Coverage Diff             @@
##           master    #4467      +/-   ##
==========================================
- Coverage   59.63%   59.52%   -0.12%
==========================================
  Files         638      636       -2
  Lines       53995    53838     -157
==========================================
- Hits        32201    32045     -156
- Misses      19262    19265       +3
+ Partials     2532     2528       -4
    return nil, err
}
launcherReplica.RestartPolicy = common.ParseRestartPolicy(launcherReplicaSpec.GetRestartPolicy())
launcherReplicaSpec, err = common.ToReplicaSpecWithOverrides(ctx, taskCtx, kfMPITaskExtraArgs.GetLauncherReplicas(), kubeflowv1.MPIJobDefaultContainerName, true)
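For context, common.ParseRestartPolicy translates the restart-policy enum from the task config into the kubeflow operator's value. A minimal self-contained sketch of that kind of mapping, using stand-in types rather than the real flyteidl/kubeflow definitions:

```go
package main

import "fmt"

// Stand-ins for the flyteidl plugin enum and the kubeflow operator type;
// the real definitions live in flyteidl and kubeflow/common respectively.
type idlRestartPolicy int

const (
	restartPolicyNever idlRestartPolicy = iota
	restartPolicyOnFailure
	restartPolicyAlways
)

type kubeflowRestartPolicy string

// parseRestartPolicy mirrors what a helper like common.ParseRestartPolicy
// plausibly does: translate the task-config enum into the operator's
// string constants.
func parseRestartPolicy(p idlRestartPolicy) kubeflowRestartPolicy {
	switch p {
	case restartPolicyOnFailure:
		return "OnFailure"
	case restartPolicyAlways:
		return "Always"
	default:
		return "Never"
	}
}

func main() {
	fmt.Println(parseRestartPolicy(restartPolicyOnFailure)) // OnFailure
}
```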
Need to confirm whether launcher pods should be treated as special and made non-interruptible.
No clear evidence to suggest one way or the other. This pod is responsible for launching the jobs on the MPI workers (somewhat akin to a Ray head node), so we'll give it special treatment for now.
taskCtxOptions := []flytek8s.PluginTaskExecutionContextOption{}
// Master should always run as non-interruptible
if isMaster {
    taskCtxOptions = append(taskCtxOptions, flytek8s.WithInterruptible(false))
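The taskCtxOptions slice here is Go's functional-options pattern. A self-contained sketch of the idea, with names modeled on the diff but not the verbatim flytek8s source:

```go
package main

import "fmt"

// pluginTaskExecutionContext is a stand-in for the wrapper this PR adds
// around the core TaskExecutionContext; it carries optional overrides.
type pluginTaskExecutionContext struct {
	interruptible *bool // nil means "no override; defer to the parent context"
}

// PluginTaskExecutionContextOption mutates the wrapper at construction time.
type PluginTaskExecutionContextOption func(*pluginTaskExecutionContext)

// WithInterruptible returns an option that pins interruptibility to a value.
func WithInterruptible(v bool) PluginTaskExecutionContextOption {
	return func(p *pluginTaskExecutionContext) {
		p.interruptible = &v
	}
}

// NewPluginTaskExecutionContext applies all options to a fresh wrapper.
func NewPluginTaskExecutionContext(opts ...PluginTaskExecutionContextOption) *pluginTaskExecutionContext {
	p := &pluginTaskExecutionContext{}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	tCtx := NewPluginTaskExecutionContext(WithInterruptible(false))
	fmt.Println(*tCtx.interruptible) // false
}
```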
Note that "master" pods will be tagged as non-interruptible. Currently this is set for PyTorch masters and MPI launchers.
This looks great, thank you!
Re: whether kfoperators master pods should be non-interruptible, why do we have to consider those as special? Is it just for consistency with the other k8s plugins (dask and spark have their driver/master pods as non-interruptible)?
If it would be detrimental to lose the master, we should run them as non-interruptible. Same logic as dask and spark, yes.
Force-pushed from 3253966 to 1d29e02
    return nil, flyteerr.Errorf(flyteerr.BadTaskSpecification, "Unable to create replica spec: [%v]", err.Error())
}
launcherReplicaSpec = replicaSpec.DeepCopy()
// TODO (jeev): Is this even a valid configuration. Can there be more than 1
This maintains parity for now, but we will need to follow up with a subsequent investigation.
lgtm. would feel more confident with a few more eyes here. @fg91 @yubofredwang mind taking a quick look?
I don't agree with the change that a pytorchjob master replica should always be non-interruptible. The master replica doesn't start the worker replicas or restart them in case they are preempted. If one of the worker replicas is preempted, all other pods including the master replica have to stop/be restarted. So there is no value in preventing the master replica from being preempted. Or am I overlooking something here?
I think differentiating between master and worker replicas doesn't make that much sense for pytorchjobs in the first place, and in the newer elastic pytorchjobs there are no more master replicas at all.
Could you please revert this change (unless I'm overlooking something)?
I appreciate that you tagged me @hamersaw @jeevb since we rely on pytorch jobs a lot :)
I ran a flytepropeller with your changes in our staging cluster and apart from the non-interruptible master replica everything works as expected.
However, for pytorchjobs everything worked as expected before this change as well. Do I understand correctly that the large diff in pytorch.go comes from the fact that the overriding of resources and resource/interruptible tolerations now happens more centrally, instead of being handled separately in every single plugin?
I went ahead and reverted forcing non-interruptibility on KF plugin masters. This keeps parity with the current implementation for now. @fg91: The big change here was to apply resource overrides from plugin-specific config before the base pod spec is constructed in ToK8sPodSpec.
Parity also for MPI? Asking because I only know for pytorch that the master shouldn't always be non-interruptible - I don't know for MPI.
Sounds great! Out of curiosity, does this mean that resource tolerations are automatically handled correctly if another plugin is added, because this is handled before the plugin builds the resource?
Yes, we're keeping parity for MPI as well for now, since no one has been able to provide a convincing argument one way or the other. I don't know MPI well enough. Let me know if you have thoughts, @fg91.
This is still handled when the plugin is building the resource. The overrides in this case are derived from parsing the plugin configuration. These changes basically give plugins a simple mechanism to inject overrides and benefit from the shared helpers in flytek8s.
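To make that flow concrete, here is a small self-contained sketch (stand-in types only, not the actual flyteplugins code): overrides parsed from plugin-specific config are injected into the context first, so the shared pod-spec builder can derive resource-specific tolerations in one place:

```go
package main

import "fmt"

// taskCtx is a stand-in for TaskExecutionContext; it exposes the effective
// resource request that the pod-spec builder will see.
type taskCtx struct {
	gpus int
}

type option func(*taskCtx)

// withGPUOverride mimics an override parsed from plugin-specific config
// (e.g. a worker replica spec in a kubeflow task).
func withGPUOverride(gpus int) option {
	return func(t *taskCtx) { t.gpus = gpus }
}

// wrap applies overrides to a copy of the base context before it is
// handed to the builder.
func wrap(base taskCtx, opts ...option) taskCtx {
	for _, opt := range opts {
		opt(&base)
	}
	return base
}

// buildPodSpec stands in for the shared ToK8sPodSpec helper: because the
// overrides were applied to the context first, GPU tolerations are derived
// centrally here rather than re-implemented in every plugin.
func buildPodSpec(t taskCtx) (tolerations []string) {
	if t.gpus > 0 {
		tolerations = append(tolerations, "nvidia.com/gpu:NoSchedule")
	}
	return tolerations
}

func main() {
	base := taskCtx{gpus: 0}
	worker := wrap(base, withGPUOverride(2))
	fmt.Println(buildPodSpec(worker)) // [nvidia.com/gpu:NoSchedule]
}
```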
I don't know MPI well enough either unfortunately 🤷 But I'm sure about pytorch jobs ... and as long as we keep parity here, I think this is good to merge.
Nice. I always disliked that the plugins had to reinvent the wheel here instead of using shared helpers.
Tracking issue
Fixes: #4422
Describe your changes
This PR:
- Introduces a simple mechanism for plugins to inject overrides into the TaskExecutionContext before passing it to ToK8sPodSpec, where the base pod spec is constructed and hydrated with defaults and additional configuration (e.g. affinity, tolerations, etc.).
- Refactors the dask and spark plugins to use this new helper for overriding interruptibility of the driver pods.
- Refactors the KF plugins (mpi, pytorch and tensorflow) to correctly handle resource overrides specified in the respective task configs. This allows for correctly applying resource-specific tolerations (e.g. GPU) to the pod spec.
Config:
Task decorator:
Launcher tolerations:
Worker tolerations:
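As a purely illustrative sketch of the outcome described above (not taken from the PR; the key and effect are assumptions, since the actual values come from the cluster's resource-tolerations configuration), a GPU toleration on a launcher or worker pod spec looks roughly like:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical GPU toleration that the shared pod-spec builder would
	// attach once resource overrides from the task config are honored.
	tol := corev1.Toleration{
		Key:      "nvidia.com/gpu",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}
	fmt.Printf("%+v\n", tol)
}
```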
Check all the applicable boxes
Setup Process
Screenshots
Note to reviewers
Related PRs