Describe the bug
After launching an experiment that had an invalid podspec, e.g.:
[2023-01-13 07:07:21] [52c09f4b] Pod "exp-180-trial-219-0-180.c5109f82-4158-4f2b-be1a-81d9c0dfae7e.1-current-pigeon" is invalid: spec.containers[1].volumeMounts[22].name: Not found: "test"
<error> [2023-01-13 07:07:21] || ERROR: Trial (Experiment 180) was terminated: allocation failed: task failed without an associated exit code: pod actor exited while pod was running
Deleting that experiment then fails, because the checkpoint GC task launched by the deletion reuses the same invalid podspec:
2023-01-18T19:04:29.849447686Z [info]: resources are requested by /delete-checkpoint-gc-2c9920e2-363f-4aaa-a728-14b2df07897c/2f662cac-2d5f-477f-a8f2-7b8e495adf45.1 (Allocation ID: 2f662cac-2d5f-477f-a8f2-7b8e495adf45.1) actor-local-addr="kubernetes" actor-system="master" go-type="kubernetesResourcePool"
2023-01-18T19:04:29.893581143Z [error]: error creating pod gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat actor-local-addr="kubernetes-worker-3" actor-system="master" error="Pod \"gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat\" is invalid: spec.containers[1].volumeMounts[17].name: Not found: \"test\"" go-type="requestProcessingWorker" handler="/pods/pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a"
2023-01-18T19:04:29.893627422Z [error]: pod actor notified that resource creation failed actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" error="Pod \"gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat\" is invalid: spec.containers[1].volumeMounts[17].name: Not found: \"test\"" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.893656662Z [info]: requesting to delete kubernetes resources actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.893674352Z [warning]: updating container state after pod actor exited unexpectedly actor-local-addr="pod-54e7c0b4-1ede-41a3-b4df-7247cbdce93a" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="pod" pod="gc-0-6ea2373f-2090-45c4-92e0-12a0c048f32e.1-vital-gnat" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.900406597Z [error]: allocation encountered fatal error actor-local-addr="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" error="task failed without an associated exit code: pod actor exited while pod was running" go-type="Allocation" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.903045789Z [info]: allocation failed: task failed without an associated exit code: pod actor exited while pod was running actor-local-addr="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" actor-system="master" allocation-id="6ea2373f-2090-45c4-92e0-12a0c048f32e.1" go-type="Allocation" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.908565983Z [error]: wasn't able to delete checkpoints from checkpoint storage actor-local-addr="delete-checkpoint-gc-4c179240-6a91-410d-a7bb-71dbcfbedefc" actor-system="master" error="task failed without an associated exit code: pod actor exited while pod was running" go-type="checkpointGCTask" task-id="6ea2373f-2090-45c4-92e0-12a0c048f32e" task-type="CHECKPOINT_GC"
2023-01-18T19:04:29.912193128Z [error]: deleting experiment 180 error="failed to gc checkpoints for experiment: checkpoint GC task failed because allocation failed: task failed without an associated exit code: pod actor exited while pod was running"
Reproduction Steps
Launch an experiment with an invalid podspec
Try to delete the failed experiment
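For reference, the validation error in the logs is what Kubernetes reports when a container's volumeMounts entry names a volume that is not declared under spec.volumes. The following is a minimal Python sketch of that check, assuming the pod is represented as a plain dict. The helper name find_missing_volume_mounts and the example pod are hypothetical; this is not Determined or Kubernetes client code, only an illustration of the kind of podspec that triggers the failure.

```python
def find_missing_volume_mounts(pod):
    """Return one message per volumeMount whose name has no matching spec.volumes entry."""
    spec = pod.get("spec", {})
    # Names of all volumes declared at the pod level.
    declared = {volume["name"] for volume in spec.get("volumes", [])}
    errors = []
    for ci, container in enumerate(spec.get("containers", [])):
        for mi, mount in enumerate(container.get("volumeMounts", [])):
            name = mount["name"]
            if name not in declared:
                # Message shaped like the one in the logs (illustrative only).
                errors.append(
                    f'spec.containers[{ci}].volumeMounts[{mi}].name: Not found: "{name}"'
                )
    return errors


# Hypothetical podspec: the container mounts "test", but no volume named "test" exists.
bad_pod = {
    "spec": {
        "volumes": [{"name": "data"}],
        "containers": [
            {
                "name": "trial",
                "volumeMounts": [
                    {"name": "data", "mountPath": "/data"},
                    {"name": "test", "mountPath": "/test"},
                ],
            }
        ],
    }
}

problems = find_missing_volume_mounts(bad_pod)
```

Any podspec for which such a check returns a non-empty list should be rejected at pod creation, which is the state the experiment needs to be in to reproduce the deletion failure.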
Expected Behavior
Failed experiments should be deleted regardless of failure reason.
Screenshot
N/A
Environment
Running determined 0.19.9 in a k8s cluster
Additional Context
No response
Interesting situation. The checkpoint-gc task inherits the podspec from the experiment for various reasons, but if the experiment couldn't run, then there can't be any checkpoints, and so there shouldn't have to be a checkpoint-gc in the first place.
That would actually skirt the other issue here, which is "what do we do if there are checkpoints that can't be deleted for some reason".
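The idea in this comment can be sketched roughly as follows. This is not Determined's actual deletion code; the function delete_experiment, the CheckpointGCError exception, and the three callbacks are all hypothetical stand-ins, used only to show the control flow of skipping checkpoint GC when the experiment has no checkpoints.

```python
class CheckpointGCError(Exception):
    """Raised when the (hypothetical) checkpoint GC task cannot run."""


def delete_experiment(exp_id, list_checkpoints, run_checkpoint_gc, delete_record):
    """Delete an experiment, launching checkpoint GC only if checkpoints exist.

    Returns True if a GC task was run, False if it was skipped.
    """
    checkpoints = list_checkpoints(exp_id)
    if checkpoints:
        # GC inherits the experiment's podspec, so this can still fail;
        # the error propagates and the experiment record is kept.
        run_checkpoint_gc(exp_id, checkpoints)
    delete_record(exp_id)
    return bool(checkpoints)


def failing_gc(exp_id, checkpoints):
    # Stand-in for a GC pod that cannot be created from an invalid podspec.
    raise CheckpointGCError("pod is invalid: volumeMounts name not found")


deleted = []
# Experiment 180 never ran, so it has no checkpoints and GC is never attempted.
gc_ran = delete_experiment(180, lambda exp_id: [], failing_gc, deleted.append)
```

With no checkpoints, the broken GC path is never exercised and the record is removed. The second question in the comment stays open in this sketch: if checkpoints do exist and GC fails, the error still propagates and the experiment is not deleted.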