Neuron Device Plugin Addon #777
Merged
14 commits:

- 3423025 first commit - adding the addons, manifests, and example with GPU nodes (youngjeong46)
- 25a717b add documentations (youngjeong46)
- 402eb10 slight fix to the docs (youngjeong46)
- 0e878d6 merge main (youngjeong46)
- 6276978 Merge branch 'main' into feature/neuron-addon (youngjeong46)
- 38b70f6 test fix, lint fix, and removed local testing example blueprint (youngjeong46)
- 5643240 doc fix to remove nodegroup (youngjeong46)
- 7f1302a added back GPU node group (youngjeong46)
- 51103c9 removed local neuron yaml files in place of urls (youngjeong46)
- b272111 removing local construct test (youngjeong46)
- b4f9530 removed unneccessary yaml util, added jsdocs on helper functions (youngjeong46)
- c71979a changed mkdocks and docs/index.md to point to neuron addon
- b072506 Merge pull request #923 from ariveroi/feature/neuron-addon (shapirov103)
- e93145a Merge branch 'main' into feature/neuron-addon (youngjeong46)

@@ -0,0 +1,52 @@

# Neuron Device Plugin Addon

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. This addon installs the Neuron Device Plugin, which is required to run Neuron workloads on these instances in Amazon EKS (and Blueprints). Note that you **must** use *inf1, inf2, trn1,* or *trn1n* instances.

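The instance-family requirement above can be checked in code before the blueprint is synthesized. The snippet below is a minimal sketch and is not part of the addon: `NEURON_FAMILIES` and `isNeuronInstanceType` are hypothetical names, and the check assumes EC2 instance type strings of the form `<family>.<size>`, such as `inf1.2xlarge`.

```typescript
// Hypothetical list of the instance families named in the documentation above.
const NEURON_FAMILIES: readonly string[] = ["inf1", "inf2", "trn1", "trn1n"];

// Returns true when the family part of an EC2 instance type string
// (the text before the first ".") is one of NEURON_FAMILIES.
function isNeuronInstanceType(instanceType: string): boolean {
    const family = instanceType.split(".")[0] ?? "";
    return NEURON_FAMILIES.indexOf(family) !== -1;
}

// Example: flag instance types that the device plugin cannot use.
const requested: string[] = ["inf1.2xlarge", "trn1n.32xlarge", "m5.large"];
const unsupported = requested.filter((t) => !isNeuronInstanceType(t));
console.log(unsupported.join(", ")); // m5.large
```

Matching on the exact family (rather than a prefix such as `inf`) keeps `trn1` and `trn1n` distinct and rejects unrelated families.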
## Usage

#### **`index.ts`**
```typescript
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Installs the Neuron device plugin and its RBAC resources.
const addOn = new blueprints.addons.NeuronPluginAddOn();

const clusterProvider = new blueprints.GenericClusterProvider({
    version: KubernetesVersion.V1_27,
    managedNodeGroups: [
        inferentiaNodeGroup()
    ]
});

// Managed node group of Inferentia (inf1) instances in private subnets.
function inferentiaNodeGroup(): blueprints.ManagedNodeGroup {
    return {
        id: "mng1",
        instanceTypes: [new ec2.InstanceType('inf1.2xlarge')],
        desiredSize: 1,
        maxSize: 2,
        nodeGroupSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    };
}

const blueprint = blueprints.EksBlueprint.builder()
    .clusterProvider(clusterProvider)
    .addOns(addOn)
    .build(app, 'my-stack-name');
```

Once deployed, you can see the plugin DaemonSet in the `kube-system` namespace.

```sh
$ kubectl get daemonset neuron-device-plugin-daemonset -n kube-system

NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
neuron-device-plugin-daemonset 1 1 1 1 1 <none> 24m 20m
```
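
The `DESIRED` and `READY` columns above correspond to the `status.desiredNumberScheduled` and `status.numberReady` fields of the Kubernetes DaemonSet object, which `kubectl get daemonset neuron-device-plugin-daemonset -n kube-system -o json` returns. A minimal readiness check over that JSON might look like the sketch below; `isDaemonSetReady` is a hypothetical helper, and treating a desired count of 0 (no eligible nodes) as not ready is a design choice of this sketch.

```typescript
// Subset of the Kubernetes DaemonSet status behind the DESIRED and READY columns.
interface DaemonSetStatus {
    desiredNumberScheduled: number;
    numberReady: number;
}

interface DaemonSet {
    metadata: { name: string; namespace?: string };
    status: DaemonSetStatus;
}

// Ready when at least one node should run the plugin and every such node
// reports a ready plugin pod. A desired count of 0 (no eligible nodes) is
// treated as not ready, which is a choice made for this sketch.
function isDaemonSetReady(ds: DaemonSet): boolean {
    const { desiredNumberScheduled, numberReady } = ds.status;
    return desiredNumberScheduled > 0 && numberReady === desiredNumberScheduled;
}

// Trimmed example of the JSON returned by:
// kubectl get daemonset neuron-device-plugin-daemonset -n kube-system -o json
const sample = JSON.parse(`{
  "metadata": { "name": "neuron-device-plugin-daemonset", "namespace": "kube-system" },
  "status": { "desiredNumberScheduled": 1, "numberReady": 1 }
}`) as DaemonSet;

console.log(isDaemonSetReady(sample)); // true
```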

## Functionality

1. Deploys the plugin DaemonSet in the `kube-system` namespace by default.
2. Installs the device plugin so that workloads in the blueprint can use the Neuron SDK on Inferentia or Trainium instances.
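
The addon in this PR accepts no options, and a reviewer on the PR asks whether the namespace should be configurable. If options were added, one common pattern is an optional props object merged over defaults. The sketch below is hypothetical: `NeuronPluginAddOnProps`, `defaultProps`, and `resolveProps` are illustrative names, not part of this PR or the Blueprints API.

```typescript
// Hypothetical options for the addon; the addon in this PR accepts none.
interface NeuronPluginAddOnProps {
    // Namespace for the device plugin manifest.
    namespace?: string;
}

// Defaults mirror the documented behavior: the plugin goes to kube-system.
const defaultProps: Required<NeuronPluginAddOnProps> = {
    namespace: "kube-system",
};

// Merges caller-supplied options over the defaults. A key that is present
// but undefined is ignored so it does not erase the default.
function resolveProps(props: NeuronPluginAddOnProps = {}): Required<NeuronPluginAddOnProps> {
    const resolved: Required<NeuronPluginAddOnProps> = { ...defaultProps };
    if (props.namespace !== undefined) {
        resolved.namespace = props.namespace;
    }
    return resolved;
}

console.log(resolveProps().namespace);                        // kube-system
console.log(resolveProps({ namespace: "neuron" }).namespace); // neuron
```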

@@ -0,0 +1,39 @@

```typescript
import { Construct } from "constructs";

import { ClusterAddOn, ClusterInfo } from "../../spi";
import { KubectlProvider, ManifestDeployment } from "../helm-addon/kubectl-provider";
import { loadExternalYaml } from "../../utils/yaml-utils";

// Upstream manifests from the aws-neuron-sdk repository: the device plugin and its RBAC resources.
const PLUGIN_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml";
const RBAC_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml";

/**
 * Installs the AWS Neuron device plugin on the cluster.
 * The RBAC manifest is applied first, then the plugin manifest in kube-system.
 */
export class NeuronPluginAddOn implements ClusterAddOn {
    deploy(clusterInfo: ClusterInfo): Promise<Construct> {
        const kubectlProvider = new KubectlProvider(clusterInfo);

        // Read in the YAML documents from the upstream URLs
        const rbac = loadExternalYaml(RBAC_URL);
        const rbacManifest: ManifestDeployment = {
            name: "neuron-rbac-manifest",
            namespace: "",
            manifest: rbac,
            values: {}
        };

        const plugin = loadExternalYaml(PLUGIN_URL);
        const pluginManifest: ManifestDeployment = {
            name: "neuron-plugin-manifest",
            namespace: "kube-system",
            manifest: plugin,
            values: {}
        };

        const rbacStatement = kubectlProvider.addManifest(rbacManifest);
        const pluginStatement = kubectlProvider.addManifest(pluginManifest);

        // Plugin dependency on the RBAC manifest
        pluginStatement.node.addDependency(rbacStatement);

        return Promise.resolve(pluginStatement);
    }
}
```
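
In `deploy`, `pluginStatement.node.addDependency(rbacStatement)` declares that the plugin manifest depends on the RBAC manifest, so the RBAC resources are applied first. The ordering idea can be illustrated without CDK using a plain topological sort (Kahn's algorithm). This is only an illustration: `deploymentOrder` is a hypothetical helper and does not reflect how CDK or CloudFormation resolve dependencies internally.

```typescript
// Maps each manifest ID to the IDs of the manifests it depends on.
type DependencyGraph = Record<string, string[]>;

// Kahn's algorithm: repeatedly emit a manifest whose dependencies have all
// been emitted. Throws on unknown dependencies or on a dependency cycle.
function deploymentOrder(graph: DependencyGraph): string[] {
    const nodes = Object.keys(graph);
    const remaining: Record<string, number> = {};    // unmet dependency count per manifest
    const dependents: Record<string, string[]> = {}; // reverse edges: dependency -> dependents

    for (const node of nodes) {
        const deps = graph[node] ?? [];
        remaining[node] = deps.length;
        for (const dep of deps) {
            if (!Object.prototype.hasOwnProperty.call(graph, dep)) {
                throw new Error(`unknown dependency: ${dep}`);
            }
            const list = dependents[dep] ?? [];
            list.push(node);
            dependents[dep] = list;
        }
    }

    // Start with manifests that have no dependencies.
    const ready = nodes.filter((node) => remaining[node] === 0);
    const result: string[] = [];
    while (ready.length > 0) {
        const node = ready.shift() as string;
        result.push(node);
        for (const next of dependents[node] ?? []) {
            const left = (remaining[next] ?? 0) - 1;
            remaining[next] = left;
            if (left === 0) {
                ready.push(next);
            }
        }
    }

    if (result.length !== nodes.length) {
        throw new Error("dependency cycle detected");
    }
    return result;
}

// The two manifests deployed by the addon, with the plugin depending on RBAC.
const plan = deploymentOrder({
    "neuron-plugin-manifest": ["neuron-rbac-manifest"],
    "neuron-rbac-manifest": [],
});
console.log(plan.join(" -> ")); // neuron-rbac-manifest -> neuron-plugin-manifest
```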

@@ -0,0 +1,6 @@

```yaml
---
kind: ClusterRole
---
kind: Deployment
---
kind: Pod
```
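
The multi-resource test for this fixture expects `doc.length` to be 4 and begins its checks at `doc[1]`. One implementation consistent with that, assumed here because the body of `loadMultiResourceYaml` is not shown in this diff, splits the file text on `---`: the leading separator yields an empty first chunk. The sketch below reproduces that behavior; `splitMultiDocument` and `parseFlatYaml` are hypothetical, and `parseFlatYaml` handles only flat `key: value` lines rather than real YAML.

```typescript
// Hypothetical stand-in for a YAML parser: handles only flat "key: value" lines.
// Returns null for a chunk with no key/value lines (e.g. the empty first chunk).
function parseFlatYaml(chunk: string): Record<string, string> | null {
    const result: Record<string, string> = {};
    let found = false;
    for (const line of chunk.split("\n")) {
        const idx = line.indexOf(":");
        if (idx === -1) continue;
        result[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
        found = true;
    }
    return found ? result : null;
}

// Splits a multi-document string on "---" and parses every chunk,
// including the empty chunk before a leading separator.
function splitMultiDocument(text: string): Array<Record<string, string> | null> {
    return text.split("---").map((chunk) => parseFlatYaml(chunk));
}

const multiDoc = "---\nkind: ClusterRole\n---\nkind: Deployment\n---\nkind: Pod";
const docs = splitMultiDocument(multiDoc);
console.log(docs.length);                // 4
console.log(docs[0]);                    // null
console.log(JSON.stringify(docs[1]));    // {"kind":"ClusterRole"}
```

Naive splitting also breaks on any value that contains `---`; a multi-document loader such as js-yaml's `loadAll` parses document boundaries properly.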

@@ -0,0 +1 @@

```yaml
apiVersion: apps/v1
```

@@ -0,0 +1,65 @@

```typescript
import * as yaml from "../../lib/utils/yaml-utils";

describe('Unit tests for yaml utils', () => {

    test("The YAML Document file is read correctly", () => {
        const doc = yaml.readYamlDocument(__dirname + '/yaml-test.yaml');

        expect(doc).toBe("apiVersion: apps/v1");
    });

    test("The YAML Document file is serialized correctly", () => {
        const sample = {"apiVersion": "apps/v1", "resource": "Deployment"};

        const serialized = yaml.serializeYaml(sample);

        expect(serialized.length).toBe(41);
    });

    test("The YAML Document with multiple resources is read correctly", () => {
        const doc = yaml.loadMultiResourceYaml(__dirname + '/multi-yaml-test.yaml');

        const firstPart = { "kind": "ClusterRole" };
        const secondPart = { "kind": "Deployment" };
        const lastPart = { "kind": "Pod" };

        expect(doc.length).toBe(4);
        expect(doc[1]).toStrictEqual(firstPart);
        expect(doc[2]).toStrictEqual(secondPart);
        expect(doc[3]).toStrictEqual(lastPart);
    });

    test("External YAML Document is read correctly", () => {
        const doc = yaml.loadExternalYaml('https://raw.githubusercontent.com/kubernetes/examples/master/guestbook/legacy/frontend-controller.yaml');
        const part = {
            apiVersion: "v1",
            kind: "ReplicationController",
            metadata: {name: "frontend"},
            spec: {
                replicas: 3,
                template: {
                    metadata: {
                        labels: {app: "guestbook", tier: "frontend"}
                    },
                    spec: {
                        containers: [{
                            name: "php-redis",
                            image: "gcr.io/google_samples/gb-frontend:v4",
                            resources: {
                                requests: {
                                    cpu: "100m",
                                    memory: "100Mi"
                                }
                            },
                            env: [{name: "GET_HOSTS_FROM", value: "dns"}],
                            ports: [{containerPort: 80}]
                        }]
                    }
                }
            }
        };

        expect(doc.length).toBe(1);
        expect(doc[0]).toStrictEqual(part);
    });
});
```
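
The expected length of 41 in the serialization test can be checked by hand, assuming `serializeYaml` emits one `key: value` line per field with a trailing newline, which is the usual block-style output for a flat object of plain strings. `serializeFlat` below is a hypothetical stand-in for that output, not the utility under test.

```typescript
// Hypothetical minimal serializer for flat objects of plain string values.
// Real YAML dumpers also handle quoting, nesting, and lists; this does not.
function serializeFlat(obj: Record<string, string>): string {
    return Object.keys(obj)
        .map((key) => `${key}: ${obj[key]}\n`)
        .join("");
}

const serialized = serializeFlat({ apiVersion: "apps/v1", resource: "Deployment" });
// "apiVersion: apps/v1\n" is 20 characters and "resource: Deployment\n" is 21.
console.log(serialized.length); // 41
```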
No options to deploy? Not even the namespace? It is fine if the namespace should be kube-system; I just want to check whether anything is reasonable to expose for configuration.

This one is straightforward. I just noticed there may be an optional scheduler, which I will expose through options in a fast follow-up.

Do you mind creating an issue and assigning it to yourself or Riccardo if you want to do a fast follow-up for options?

Will do it in this PR actually, testing it right now.