How is run status handled? #3176

Open
gpascale opened this issue Jun 25, 2024 · 1 comment
Labels
type / question Issue type: question

Comments

@gpascale

gpascale commented Jun 25, 2024

❓Question

It's extremely unclear to me how run status (active, finished, failed, etc.) is determined, and specifically what decides whether a run is active. In my code, I call report_successful_finish once my model has finished training and testing and I've uploaded the figures I want, but I can't tell whether this actually affects the state. Most of my runs automatically transition to the finished state, but not always. Does this happen automatically when the process exits? When the run object is destroyed?

My dashboard is littered with week-old runs that still show as in progress. In some cases, maybe the processes crashed? I can't tell. I've tried using the CLI to "close" them with little success - usually it reports no errors but the run still shows as in progress.

I've searched extensively through the documentation but I hardly see anything about this.

@gpascale gpascale added the type / question Issue type: question label Jun 25, 2024
@mihran113
Contributor

Hey @gpascale! Sorry for the delayed response, and thanks for the question. We try to automatically transition the run to the finished state when the process exits (even if exceptions are thrown). But there are cases where the process hangs or is killed, and in those cases we can't do much.
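
To illustrate the general idea of finalizing on exit, here is a minimal sketch in plain Python. It does not use Aim's API: `TrackedRun`, its `status` field, and `train` are hypothetical names made up for this example. It assumes the finalizer is wired up with a context manager plus an `atexit` hook, which may differ from what Aim does internally.

```python
import atexit


class TrackedRun:
    """Hypothetical stand-in for a tracked run; this is NOT Aim's Run class."""

    def __init__(self, name):
        self.name = name
        self.status = "active"
        # Exit hook: runs at normal interpreter shutdown, including after an
        # uncaught exception. It does not run if the process is killed
        # (e.g. SIGKILL) or if it hangs and never exits.
        atexit.register(self.close)

    def close(self):
        # Idempotent, so an explicit call and the exit hook can both run.
        if self.status == "active":
            self.status = "finished"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # let the exception propagate to the caller


def train(run, fail=False):
    # Hypothetical training step that may crash.
    if fail:
        raise RuntimeError("training crashed")


run = TrackedRun("demo")
try:
    with run:
        train(run, fail=True)
except RuntimeError:
    pass

print(run.status)
```

In this sketch the run ends up finished even though training raised, because the cleanup path still executes. A killed or hung process never reaches either cleanup path, which matches the limitation described above.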

However, we also have a background task in the aim up command as a backup plan: it checks for runs that are still in the active state while no process is holding locks for them (this is the case when the process is killed). So the only unhandled case should be when the process hangs. If you can provide some more details on how specifically these cases happen, maybe I can provide some more help or try to reproduce it on my end to see what's going wrong.
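
The backup check described above can be sketched roughly as follows. This is an in-memory model only, not Aim's implementation: `RunRecord`, `lock_held`, and `sweep_stale_runs` are hypothetical names, and it assumes a stale run is simply marked finished once no process holds its lock.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RunRecord:
    """Hypothetical record of a run as a background checker might see it."""
    run_hash: str
    status: str      # "active" or "finished"
    lock_held: bool  # True while some live process still holds the run's lock


def sweep_stale_runs(runs: Iterable[RunRecord]) -> List[str]:
    """Mark active runs with no lock holder as finished; return their hashes."""
    closed = []
    for run in runs:
        if run.status == "active" and not run.lock_held:
            run.status = "finished"
            closed.append(run.run_hash)
    return closed


runs = [
    RunRecord("killed", status="active", lock_held=False),   # process was killed
    RunRecord("hung", status="active", lock_held=True),      # process still alive
    RunRecord("done", status="finished", lock_held=False),   # closed normally
]

closed = sweep_stale_runs(runs)
print(closed)
print([(r.run_hash, r.status) for r in runs])
```

Under these assumptions, the killed run is cleaned up because its lock was released when the process died, while the hung run stays active because its process still holds the lock.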
