GH action to generate report #199
base: main
Conversation
(Also increase timeout)
.github/workflows/report.yaml
pull_request:
  branches:
    - main
Should we also run this on a weekly basis?
It's on PR push just so I can test easily, but yes, 1/week sounds good to me. @kabilar Do you have a preference for what day/time?
Great, thank you. How about Mondays at 6am EST? We can then review the report on Monday mornings.
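For the weekly run discussed above, a minimal sketch of the trigger block is shown below. GitHub Actions evaluates `schedule` cron expressions in UTC, so Mondays at 6am EST corresponds to 11:00 UTC; the exact values here are an illustration, not the final workflow.

```yaml
# Sketch of the trigger block; keeps the existing pull_request trigger from the diff above.
on:
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 11 * * 1"  # Mondays 11:00 UTC = 6am EST (cron does not track daylight saving)
```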
Hi @asmacdo, just checking in to see how this report generation is going? Thanks.
Force-pushed from f53e820 to ff52971 (compare)
Job shouldn't take that long, but this is wall time, which includes docker pull, etc.
Took almost 60 sec to start up.
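The timeout change mentioned in the description is measured against this wall time. The thread doesn't say where the timeout is set; if it is on the workflow job itself, GitHub Actions exposes it as `timeout-minutes`. A minimal sketch under that assumption, with the job name and value made up:

```yaml
# Sketch only: job name and timeout value are assumptions, not taken from this PR.
jobs:
  report:
    runs-on: ubuntu-latest
    timeout-minutes: 30  # generous cap to absorb the docker pull and ~60s startup
```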
Force-pushed from b2eac64 to 0191c85 (compare)
Force-pushed from a4b44f9 to 965a81e (compare)
see PR for comment
Ran into some problems. The du script ran for about 50 minutes and then the pod disappeared without logs. Worse, it kicked my jupyterhub pod as well as another user's.
I think this means we need to take a different approach. By setting resource limits, we should have isolated our job from the other pods, but since I have no other logs about what happened here, I think we need a more conservative approach that is completely isolated from user pods. I did it this way because I thought it would be simpler, but if there's any chance that we affect a running user pod, we would be better off directly deploying a separate EC2 instance and mounting the EFS on it, avoiding Kubernetes altogether.
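For context, the isolation referred to above would come from resource requests and limits on the Job's pod spec. A minimal sketch of what such a manifest might look like follows; the names, image, mount path, and values are all assumptions, not the actual manifest from this PR.

```yaml
# Hypothetical du Job with resource limits; names, image, PVC, and values are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: du-report
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: du
          image: busybox
          command: ["du", "-sh", "/efs"]  # stand-in for the actual du script
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
          volumeMounts:
            - name: efs
              mountPath: /efs
              readOnly: true
      volumes:
        - name: efs
          persistentVolumeClaim:
            claimName: efs-pvc  # hypothetical PVC name
```

Even with limits set, a node under memory or disk pressure can still evict neighboring pods, which is one possible explanation for the jupyterhub pods being kicked.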
Fixes #177
Step 1: Create Skeleton
I've verified that when a user-node is available (created by running a tiny jupyterhub), the job pod schedules on that node. I then shut down my jupyterhub and all user-nodes scaled down. I reran this job, and Karpenter successfully scaled up a new spot node; the pod was scheduled on it, ran successfully, was deleted, and the node was cleaned up. Step 1 complete!

Step 2: Generate Report
Step 3: Push Report
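Step 3 is still a placeholder; one possible shape, if the report ends up being committed back to the repository from the workflow (job name, file path, and commit message are all assumptions):

```yaml
# Hypothetical job for committing a generated report; path and message are assumptions.
jobs:
  push-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Commit report
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add reports/disk-usage.txt
          git commit -m "Add weekly disk usage report" || echo "No changes to commit"
          git push
```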
Questions to answer: