Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK-e2e #306

Merged
merged 10 commits into from
Jan 10, 2024
Merged

SDK-e2e #306

merged 10 commits into from
Jan 10, 2024

Conversation

ChristianZaccaria
Copy link
Collaborator

@ChristianZaccaria ChristianZaccaria commented Aug 14, 2023

Issue link

Closes #292

What changes have been made

  • Added e2e test workflow.
  • Added e2e test that creates a Ray cluster, and trains the MNIST dataset using the CodeFlare SDK. Asserts successful completion of the training job. Covering the installation of the CodeFlare SDK, as well as the RBAC required for the SDK to successfully perform requests to the cluster.
  • Added action to setup KinD cluster and configuration.
  • Added SDK script, MNIST training script, pip requirements, and SDK installation script.
  • Initilialized go.mod and go.sum to make use of common codeflare functions.
  • Created a script that copies the root directory (codeflare-sdk) to the volume mount in the Job container. This is done to install the codeflare-sdk with poetry with the latest changes, i.e., for opened PRs.

Verification steps

To run the e2e test locally on KinD cluster, follow these steps:

  • Setup Phase:
    make kind-e2e
    export CLUSTER_HOSTNAME=kind
    export CLUSTER_TYPE=KIND
    export CODEFLARE_TEST_TIMEOUT_LONG=20m
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.0.0-rc.4
    make setup-e2e
    
  • Test Phase:
    • Once we have the codeflare-operator and kuberay-operator running and ready, we can run the e2e test on the codeflare-sdk repo:
    go test -timeout 30m -v ./tests/e2e
    

To run the e2e on HyperShift cluster, follow these steps:

  • Setup Phase:
    export CLUSTER_TYPE=HYPERSHIFT
    export CODEFLARE_TEST_TIMEOUT_LONG=20m
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.0.0-rc.4
    make setup-e2e
    
  • Test Phase:
    • Once we have the codeflare-operator and kuberay-operator running and ready, we can run the e2e test on the codeflare-sdk repo:
    go test -timeout 30m -v ./tests/e2e
    

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@ChristianZaccaria ChristianZaccaria added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 5, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 11, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2023
@ChristianZaccaria
Copy link
Collaborator Author

@ChristianZaccaria Have you considered whether it has more sense for e2e test to be implemented in Python instead of Go, to install and leverage SDK directly?

I think @astefanutti suggested we kept it as a go test, but this could change.

We were discussing that in this PR we currently have a well working e2e test and that it might be best to merge it to gate PRs and unblock dependabot updates.

To not diverge from the purpose of this PR, we have this issue opened to refactor/simplify the test and make use of unprivileged users: https://issues.redhat.com/browse/RHOAIENG-60. In the scope of that issue most items including the job spec, RBAC, and configmap in go test will be removed. Taking that into account as well, you're right I think we should implement the e2e test all in Python.

I'll be working on the new issue with @Bobbins228 if you think that's okay :)

@sutaakar
Copy link
Contributor

sutaakar commented Jan 3, 2024

Sure, fine for me

Copy link
Contributor

@sutaakar sutaakar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

As intermittent solution to have some e2e tests up and running.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2024
@astefanutti
Copy link
Contributor

@ChristianZaccaria Have you considered whether it has more sense for e2e test to be implemented in Python instead of Go, to install and leverage SDK directly?

I think @astefanutti suggested we kept it as a go test, but this could change.

We were discussing that in this PR we currently have a well working e2e test and that it might be best to merge it to gate PRs and unblock dependabot updates.

Right, this PR has been open for quite some times already, so now that we have a working e2e test, we would want to merge it ASAP. We can rework the test logic in Python in a separate PR, but that's lower priority compared to having the e2e test suite running in CI.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2024
@sutaakar
Copy link
Contributor

sutaakar commented Jan 4, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2024
@jbusche
Copy link
Contributor

jbusche commented Jan 5, 2024

I tried this out on my Mac connected to my OpenShift Cluster, and it worked after I made 2 adjustments.

  1. The SED on my MAC doesn't have the same syntax as Linux, so to get the make commands to work I had to edit the Makefile and change:
SED ?= /usr/bin/sed
to
SED ?= /opt/homebrew/bin/gsed
  1. The Ray operator wasn't able to find it's image, so I had to do this:
oc edit deploy kuberay-operator -n ray-system
and  change:
 image: kuberay/operator:v1.0.0-rc.1
to
 image: quay.io/kuberay/operator:v1.0.0-rc.1

At that point, then the ray operator image can start and the e2e test can run and succeed.

go test -timeout 30m -v ./tests/e2e
# github.com/project-codeflare/codeflare-sdk/tests/e2e.test
ld: warning: -bind_at_load is deprecated on macOS
ld: warning: '/private/var/folders/_4/270ms5_50qq81ql6yt7f27d00000gn/T/go-link-1664163050/go.o' has malformed LC_DYSYMTAB, expected 122 undefined symbols to start at index 147609, found 142 undefined symbols starting at index 83
=== RUN   TestMNISTRayClusterSDK
=== PAUSE TestMNISTRayClusterSDK
=== CONT  TestMNISTRayClusterSDK
    mnist_raycluster_sdk_test.go:49: Created ConfigMap test-ns-dncvt/config-gfp5d successfully
    mnist_raycluster_sdk_test.go:84: Created ServiceAccount test-ns-dncvt/sa-8zmvh successfully
    mnist_raycluster_sdk_test.go:85: Created Role test-ns-dncvt/role-phv6m successfully
    mnist_raycluster_sdk_test.go:86: Created RoleBinding test-ns-dncvt/role-phv6m successfully
    mnist_raycluster_sdk_test.go:195: Created Job test-ns-dncvt/sdk successfully
    mnist_raycluster_sdk_test.go:232: Waiting for Job test-ns-dncvt/sdk to complete
    mnist_raycluster_sdk_test.go:214: Waiting for pod to start...
    mnist_raycluster_sdk_test.go:209: Pod is running!
    test.go:134: Retrieving Pod Container test-ns-dncvt/mnist-head-wvw4m/ray-head logs
    test.go:122: Creating ephemeral output directory as CODEFLARE_TEST_OUTPUT_DIR env variable is unset
    test.go:125: Output directory has been created at: /var/folders/_4/270ms5_50qq81ql6yt7f27d00000gn/T/TestMNISTRayClusterSDK1609076889/001
    test.go:134: Retrieving Pod Container test-ns-dncvt/mnist-worker-small-group-mnist-27b27/machine-learning logs
    test.go:134: Retrieving Pod Container test-ns-dncvt/sdk-jlrd2/test logs
--- PASS: TestMNISTRayClusterSDK (296.83s)
PASS
ok  	github.com/project-codeflare/codeflare-sdk/tests/e2e	297.377s

Copy link
Contributor

@jbusche jbusche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

openshift-ci bot commented Jan 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jbusche
Once this PR has been reviewed and has the lgtm label, please ask for approval from dimakis. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChristianZaccaria
Copy link
Collaborator Author

I tried this out on my Mac connected to my OpenShift Cluster, and it worked after I made 2 adjustments.

  1. The SED on my MAC doesn't have the same syntax as Linux, so to get the make commands to work I had to edit the Makefile and change:
SED ?= /usr/bin/sed
to
SED ?= /opt/homebrew/bin/gsed

Thanks a lot for testing it out and for pointing those details out! I see the SED command comes from CodeFlare-Operator's makefile, perhaps we could make it check for the OS and assign the SED path accordingly.

  1. The Ray operator wasn't able to find it's image, so I had to do this:
oc edit deploy kuberay-operator -n ray-system
and  change:
 image: kuberay/operator:v1.0.0-rc.1
to
 image: quay.io/kuberay/operator:v1.0.0-rc.1

At that point, then the ray operator image can start and the e2e test can run and succeed.

--- PASS: TestMNISTRayClusterSDK (296.83s)
PASS
ok  	github.com/project-codeflare/codeflare-sdk/tests/e2e	297.377s

In terms of the KubeRay image I'm not really sure. When running this in KinD, the pod uses this image when running make setup-e2e:

image: docker.io/kuberay/operator:v1.0.0-rc.1

Thanks again James!

@astefanutti astefanutti merged commit 3ed1b42 into project-codeflare:main Jan 10, 2024
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SDK e2e Testing
7 participants