
GH action to generate report #199

Draft · wants to merge 39 commits into main

Conversation

asmacdo (Member) commented Sep 25, 2024

Fixes #177

Step 1: Create Skeleton

  • Authenticate with AWS
  • Connect to K8s cluster
  • Deploy our job-runner pod onto a Karpenter NodeClaim
  • Create a SPOT node as needed
  • Run dummy job
  • Delete pod
  • Scale down
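
For concreteness, a minimal sketch of these skeleton steps as a workflow, assuming the standard actions/checkout and aws-actions/configure-aws-credentials actions; the role ARN, region, cluster name, and manifest path are placeholders, not the values used in this PR:

name: disk-usage-report
on:
  pull_request:
    branches:
      - main
permissions:
  id-token: write   # needed for OIDC role assumption
  contents: read
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate with AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/example-report-role   # placeholder
          aws-region: us-east-1                                                 # placeholder
      - name: Connect to K8s cluster
        run: aws eks update-kubeconfig --name example-cluster --region us-east-1   # placeholder
      - name: Run dummy job
        run: |
          kubectl apply -f .github/manifests/job-runner.yaml   # placeholder path
          kubectl wait --for=condition=complete job/job-runner --timeout=30m
      - name: Delete pod
        if: always()
        run: kubectl delete job job-runner --ignore-not-found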

I've verified that when a user-node is available (created by running a tiny jupyterhub), the job pod schedules on that node. I then shut down my jupyterhub and all user-nodes scaled down. I reran this job, and Karpenter successfully scaled up a new spot node, the pod was scheduled on it, ran successfully, was deleted, and the node cleaned up. Step 1 complete!
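
For reference, steering the job pod onto a Karpenter SPOT node generally comes down to scheduling constraints like the following in the pod spec. This is only a sketch: the nodeSelector uses Karpenter's well-known capacity-type label, the toleration key is a hypothetical taint on a dedicated NodePool, and (given the behavior above) this branch may not pin the pod to spot at all.

spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: example.com/job-runner   # hypothetical NodePool taint
      operator: Exists
      effect: NoSchedule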

Step 2: Generate Report

  • Connect Pod to EFS
  • List users
  • du each user
  • du shared
  • Collate data into report
  • Double-check that nodes come up and down successfully
  • Run job several times in 1 day, check next day for EFS usage spike (IIUC we should be fine because EFS is in Bursting mode)
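
A rough sketch of what the report job could look like once the EFS volume is mounted; the PVC name and the /efs/home and /efs/shared layout are assumptions:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-usage-report
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: du
          image: busybox:1.36
          command: ["/bin/sh", "-c"]
          args:
            - |
              report=/tmp/report-$(date +%Y-%m-%d).txt
              # du each user's directory, then the shared directory
              for user in /efs/home/*; do
                du -sh "$user" >> "$report"
              done
              du -sh /efs/shared >> "$report"
              cat "$report"
          volumeMounts:
            - name: efs
              mountPath: /efs
      volumes:
        - name: efs
          persistentVolumeClaim:
            claimName: efs-home-pvc   # hypothetical claim name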

Step 3: Push Report

  • Create private GitHub repository to store reports
  • Configure bot permission to push to repo
  • Push report to repo on completion
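
The push itself could be a single workflow step along these lines; the reports repository name, the report file name, and the bot token secret are assumptions (the token would be a PAT for the bot account with push access to that repo):

- name: Push report
  env:
    GH_TOKEN: ${{ secrets.REPORT_BOT_TOKEN }}   # hypothetical bot secret
  run: |
    git clone "https://x-access-token:${GH_TOKEN}@github.com/example-org/usage-reports.git"
    cp report-*.txt usage-reports/   # hypothetical report file name
    cd usage-reports
    git config user.name "report-bot"
    git config user.email "report-bot@users.noreply.github.com"
    git add .
    git commit -m "Add disk usage report $(date +%Y-%m-%d)"
    git push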

Questions to answer:

  • If a SPOT node is preempted, can we redeploy again later?

Comment on lines 4 to 6:

pull_request:
  branches:
    - main
Member commented:
Should we also run this on a weekly basis?

asmacdo (Member Author) replied:
It's on PR push just so I can test easily, but yes, 1/week sounds good to me. @kabilar Do you have a preference for what day/time?

kabilar (Member) commented Sep 26, 2024:
Great, thank you. How about Mondays at 6am EST? We can then review the report on Monday mornings.
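
For the record, GitHub Actions cron schedules are specified in UTC, so Monday 6am EST would be:

on:
  schedule:
    - cron: "0 11 * * 1"   # 11:00 UTC Monday = 6:00am EST (7:00am local during EDT)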

kabilar commented Nov 7, 2024

Hi @asmacdo, just checking in to see how this report generation is going? Thanks.

asmacdo (Member Author) commented Nov 11, 2024

Ran into some problems

The du script ran for about 50 minutes and then the pod disappeared without logs.

Worse, it kicked my jupyterhub pod as well as another user's.

[I 2024-11-11 17:57:57.999 JupyterHub log:192] 200 GET /hub/error/503?url=%2Fuser%2Fasmacdo%2Fterminals%2Fwebsocket%2F1 (@100.64.247.104) 7.52ms
[W 2024-11-11 17:57:59.266 JupyterHub base:1254] User asmacdo server stopped, with exit code: 1
[I 2024-11-11 17:57:59.266 JupyterHub proxy:357] Removing user asmacdo from proxy (/user/asmacdo/)

I think this means we need to take a different approach. Setting resource limits should have isolated our job from the other pods, but since I have no other logs about what happened here, I think we need a more conservative approach that is completely isolated from user pods.
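
(For context, the isolation referred to here is the usual requests/limits block on the job container; the values below are illustrative, not the ones used on this branch.)

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi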

I did it this way because I thought it would be simpler, but if there's any chance of affecting a running user pod, we would be better off deploying a separate EC2 instance and mounting the EFS directly, avoiding Kubernetes altogether.
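
(One hedged sketch of that alternative: mount the filesystem on a standalone instance with amazon-efs-utils via cloud-init user data; the EFS filesystem ID and mount point below are placeholders.)

#cloud-config
packages:
  - amazon-efs-utils
runcmd:
  - mkdir -p /mnt/efs
  - mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs   # placeholder filesystem ID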
