Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neuron Device Plugin Addon #777

Merged
merged 14 commits into from
Feb 14, 2024
Merged

Neuron Device Plugin Addon #777

merged 14 commits into from
Feb 14, 2024

Conversation

youngjeong46
Copy link
Collaborator

@youngjeong46 youngjeong46 commented Jul 16, 2023

Issue #, if available:

Description of changes: This PR adds Neuron Device Plugin to be able to leverage Inferentia and Trainium nodes on EKS Blueprints. This includes:

  • Addon
  • Example showing how to deploy it with an Inferentia Nodegroup (inf1 & inf2)
  • Documentations

Other related changes include revisions to helper functions to load yaml files and additional unit tests.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Collaborator

@shapirov103 shapirov103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngjeong46 great work, I added a few tactical comments.
With respect to the usage of the manifests, can you check if there is any approach to test whether there is a new version of Neuron and how can we incorporate it for continuous maintenance, including patch releases, version upgrades, etc.

lib/addons/neuron/index.ts Outdated Show resolved Hide resolved
lib/addons/neuron/index.ts Outdated Show resolved Hide resolved
lib/addons/neuron/index.ts Outdated Show resolved Hide resolved
docs/addons/neuron-plugin-addon.md Outdated Show resolved Hide resolved
import { ClusterAddOn, ClusterInfo } from "../../spi";
import { loadYaml, readYamlDocument } from "../../utils/yaml-utils";

export class NeuronPluginAddOn implements ClusterAddOn {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No options to deploy? Not even namespace? It is fine if ns should be kube-system, just want to check if anything is reasonable to expose for configuration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this one, it is straight forward. There may be some optional scheduler which I just saw, that I will do with options in a fast follow-up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind creating an issue and assigning to you or Riccardo if you want to do a fast follow-up for options

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do it in this PR actually, testing it right now.

@freschri
Copy link
Collaborator

freschri commented Sep 12, 2023

@youngjeong46 great work, I added a few tactical comments. With respect to the usage of the manifests, can you check if there is any approach to test whether there is a new version of Neuron and how can we incorporate it for continuous maintenance, including patch releases, version upgrades, etc.

@shapirov103 @youngjeong46
can these 2 locations be an option?
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/f57f27621e52b305dba7d624c477977a/k8s-neuron-device-plugin.yml
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/46fb1da6e5e79c3310ebc0cbd6ad2353/k8s-neuron-device-plugin-rbac.yml

Copy link
Collaborator

@elamaran11 elamaran11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@youngjeong46 Nice work, thankyou for doing this. I have some comments on the PR. Also want mkdocs and Addon index update is missing for the new addon.

examples/young-construct/index.ts Outdated Show resolved Hide resolved
import { ClusterAddOn, ClusterInfo } from "../../spi";
import { loadYaml, readYamlDocument } from "../../utils/yaml-utils";

export class NeuronPluginAddOn implements ClusterAddOn {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind creating an issue and assigning to you or Riccardo if you want to do a fast follow-up for options

lib/utils/yaml-utils.ts Show resolved Hide resolved
@youngjeong46
Copy link
Collaborator Author

@youngjeong46 great work, I added a few tactical comments. With respect to the usage of the manifests, can you check if there is any approach to test whether there is a new version of Neuron and how can we incorporate it for continuous maintenance, including patch releases, version upgrades, etc.

@shapirov103 @youngjeong46 can these 2 locations be an option? https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/f57f27621e52b305dba7d624c477977a/k8s-neuron-device-plugin.yml https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/46fb1da6e5e79c3310ebc0cbd6ad2353/k8s-neuron-device-plugin-rbac.yml

@freschri I'm linking it to the following two links (as they are raw files that can be imported straight as part of the addon). For now, unless the Neuron team announces otherwise, they would update the manifests through this location on GitHub.
https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml
https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml

@elamaran11
Copy link
Collaborator

@youngjeong46 @freschri Are you folks planning to make progress to this PR?

@youngjeong46
Copy link
Collaborator Author

@shapirov103 It looks good to me, also just tested locally and can verify it works.

@shapirov103
Copy link
Collaborator

/do-e2e-tests

Copy link

@aws-ia-ci aws-ia-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end to end tests failed. A maintainer can provide more details.

@shapirov103
Copy link
Collaborator

Error:

| blueprint-construct-dev | 145/255 | 11:32:44 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | blueprint-construct-dev/adot-collector-sa/manifest-adot-collector-saServiceAccountResource/Resource/Default (blueprintconstructdevadotcollectorsamanifestadotcollectorsaServiceAccountResourceA34512D7) Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (NotFound): error when creating "/tmp/manifest.yaml": namespaces "adot" not found\n'

@elamaran11 - the issue seems to be specific to adot, has nothing to do with neuron.

@elamaran11
Copy link
Collaborator

elamaran11 commented Feb 14, 2024

Error:

| blueprint-construct-dev | 145/255 | 11:32:44 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | blueprint-construct-dev/adot-collector-sa/manifest-adot-collector-saServiceAccountResource/Resource/Default (blueprintconstructdevadotcollectorsamanifestadotcollectorsaServiceAccountResourceA34512D7) Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (NotFound): error when creating "/tmp/manifest.yaml": namespaces "adot" not found\n'

@elamaran11 - the issue seems to be specific to adot, has nothing to do with neuron.

@shapirov103 I ran the e2e even yesterday on my local, didnt see any issues. I think the PR might not have pulled the latest changes from main we had for OTEL config. I recommend to rerun the E2E after merging from main. Also recommend a local run. Plus we are using ADOT addon consistently in CDK accelerator with no issues.

@youngjeong46
Copy link
Collaborator Author

Error:

| blueprint-construct-dev | 145/255 | 11:32:44 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | blueprint-construct-dev/adot-collector-sa/manifest-adot-collector-saServiceAccountResource/Resource/Default (blueprintconstructdevadotcollectorsamanifestadotcollectorsaServiceAccountResourceA34512D7) Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (NotFound): error when creating "/tmp/manifest.yaml": namespaces "adot" not found\n'

@elamaran11 - the issue seems to be specific to adot, has nothing to do with neuron.

@shapirov103 I ran the e2e even yesterday on my local, didnt see any issues. I think the PR might not have pulled the latest changes from main we had for OTEL config. I recommend to rerun the E2E after merging from main. Also recommend a local run. Plus we are using ADOT addon consistently in CDK accelerator with no issues.

@elamaran11 This is with the main merged.

@shapirov103
Copy link
Collaborator

@youngjeong46 I submitted a fix for it. The defect is very intermittent, race condition, but it is also non-determinstic. the same blueprint may fail one time, will succeed another time. See #931

@shapirov103
Copy link
Collaborator

/do-e2e-tests

Copy link

@aws-ia-ci aws-ia-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end to end tests passed

@elamaran11 elamaran11 merged commit 7f6442a into main Feb 14, 2024
2 checks passed
@elamaran11 elamaran11 deleted the feature/neuron-addon branch February 14, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants