
GH action to generate report #199

Draft · wants to merge 39 commits into main

Conversation

asmacdo (Member) commented Sep 25, 2024

Fixes #177

Step 1: Create Skeleton

  • Authenticate with AWS
  • Connect to K8s cluster
  • Deploy our job-runner pod onto a Karpenter NodeClaim
  • Create a SPOT node as needed
  • Run dummy job
  • Delete pod
  • Scale down
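
For concreteness, a minimal sketch of these skeleton steps as a workflow, assuming the standard actions/checkout and aws-actions/configure-aws-credentials actions; the role ARN, region, cluster name, and manifest path are placeholders, not the values used in this PR:

name: disk-usage-report
on:
  pull_request:
    branches:
      - main
permissions:
  id-token: write   # needed for OIDC role assumption
  contents: read
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate with AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/example-report-role   # placeholder
          aws-region: us-east-1                                                 # placeholder
      - name: Connect to K8s cluster
        run: aws eks update-kubeconfig --name example-cluster --region us-east-1   # placeholder
      - name: Run dummy job
        run: |
          kubectl apply -f .github/manifests/job-runner.yaml   # placeholder path
          kubectl wait --for=condition=complete job/job-runner --timeout=30m
      - name: Delete pod
        if: always()
        run: kubectl delete job job-runner --ignore-not-found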

I've verified that when a user-node is available (created by running a tiny jupyterhub), the job pod schedules on that node. I then shut down my jupyterhub and all user-nodes scaled down. I reran this job, and Karpenter successfully scaled up a new spot node, the pod was scheduled on it, ran successfully, was deleted, and the node cleaned up. Step 1 complete!
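
For reference, steering the job pod onto a Karpenter SPOT node generally comes down to scheduling constraints like the following in the pod spec. This is only a sketch: the nodeSelector uses Karpenter's well-known capacity-type label, the toleration key is a hypothetical taint on a dedicated NodePool, and (given the behavior above) this branch may not pin the pod to spot at all.

spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: example.com/job-runner   # hypothetical NodePool taint
      operator: Exists
      effect: NoSchedule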

Step 2: Generate Report

  • Connect Pod to EFS
  • List users
  • du each user
  • du shared
  • Collate data into report
  • Double-check that nodes come up and down successfully
  • Run job several times in 1 day, check next day for EFS usage spike (IIUC we should be fine because EFS is in Bursting mode)
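
A rough sketch of what the report job could look like once the EFS volume is mounted; the PVC name and the /efs/home and /efs/shared layout are assumptions:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-usage-report
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: du
          image: busybox:1.36
          command: ["/bin/sh", "-c"]
          args:
            - |
              report=/tmp/report-$(date +%Y-%m-%d).txt
              # du each user's directory, then the shared directory
              for user in /efs/home/*; do
                du -sh "$user" >> "$report"
              done
              du -sh /efs/shared >> "$report"
              cat "$report"
          volumeMounts:
            - name: efs
              mountPath: /efs
      volumes:
        - name: efs
          persistentVolumeClaim:
            claimName: efs-home-pvc   # hypothetical claim name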

Step 3: Push Report

  • Create private GitHub repository to store reports
  • Configure bot permission to push to repo
  • Push report to repo on completion
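
The push itself could be a single workflow step along these lines; the reports repository name, the report file name, and the bot token secret are assumptions (the token would be a PAT for the bot account with push access to that repo):

- name: Push report
  env:
    GH_TOKEN: ${{ secrets.REPORT_BOT_TOKEN }}   # hypothetical bot secret
  run: |
    git clone "https://x-access-token:${GH_TOKEN}@github.com/example-org/usage-reports.git"
    cp report-*.txt usage-reports/   # hypothetical report file name
    cd usage-reports
    git config user.name "report-bot"
    git config user.email "report-bot@users.noreply.github.com"
    git add .
    git commit -m "Add disk usage report $(date +%Y-%m-%d)"
    git push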

Questions to answer:

  • If a SPOT node is preempted, can we redeploy again later?

Comment on lines 4 to 6:

pull_request:
  branches:
    - main
Member commented:
Should we also run this on a weekly basis?

asmacdo (Member Author) replied:
It's on PR push just so I can test easily, but yes, 1/week sounds good to me. @kabilar Do you have a preference for what day/time?

kabilar (Member) commented Sep 26, 2024:
Great, thank you. How about Mondays at 6am EST? We can then review the report on Monday mornings.
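
For the record, GitHub Actions cron schedules are specified in UTC, so Monday 6am EST would be:

on:
  schedule:
    - cron: "0 11 * * 1"   # 11:00 UTC Monday = 6:00am EST (7:00am local during EDT)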

kabilar commented Nov 7, 2024

Hi @asmacdo, just checking in to see how this report generation is going? Thanks.

asmacdo (Member Author) commented Nov 11, 2024

Ran into some problems

The du script ran for about 50 minutes and then the pod disappeared without logs.

Worse, it kicked my jupyterhub pod as well as another user's.

[I 2024-11-11 17:57:57.999 JupyterHub log:192] 200 GET /hub/error/503?url=%2Fuser%2Fasmacdo%2Fterminals%2Fwebsocket%2F1 (@100.64.247.104) 7.52ms
[W 2024-11-11 17:57:59.266 JupyterHub base:1254] User asmacdo server stopped, with exit code: 1
[I 2024-11-11 17:57:59.266 JupyterHub proxy:357] Removing user asmacdo from proxy (/user/asmacdo/)

I think this means we need to take a different approach. Setting resource limits should have isolated our job from the other pods, but since I have no other logs about what happened here, I think we need a more conservative approach that is completely isolated from user pods.
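
(For context, the isolation referred to here is the usual requests/limits block on the job container; the values below are illustrative, not the ones used on this branch.)

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi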

I did it this way because I thought it would be simpler, but if there's any chance of affecting a running user pod, we would be better off deploying a separate EC2 instance and mounting the EFS directly, avoiding Kubernetes altogether.
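
(One hedged sketch of that alternative: mount the filesystem on a standalone instance with amazon-efs-utils via cloud-init user data; the EFS filesystem ID and mount point below are placeholders.)

#cloud-config
packages:
  - amazon-efs-utils
runcmd:
  - mkdir -p /mnt/efs
  - mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs   # placeholder filesystem ID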
