
Handle scenarios where registered processes disappear outside regular flow #337

Merged: rosa merged 6 commits into main from improve-ungracious-termination on Sep 11, 2024

Conversation

rosa (Member) commented on Sep 8, 2024:

This is a simple way to mitigate the effects of a computer going to sleep and then coming back to life, which leaves processes whose heartbeats have expired but that are nonetheless alive. It came up as part of the quest to reproduce another, unrelated error in #324. It's not trivial for the supervisor to know whether a process is actually alive, because the pid might have been reassigned by the OS to a different process. It can't rely on its registered supervisees either, because we might be running multiple supervisors. We could possibly check its current forks and skip the pruning when the pid matches, but I'm wary of introducing complicated logic there because the risk is ending up with dead registered processes and locked jobs.

Since this scenario should happen only in development, I opted for keeping the pruning logic the same (except for preventing the supervisor from pruning itself, since in that case the supervisor running the pruning is very much alive) and just having each process check, when it heartbeats, whether its registered process record is gone. If it's gone, it simply stops as if a TERM or INT signal had been received. If that process was a supervised fork, its supervisor will replace it as soon as it realises it's gone.
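A rough sketch of that heartbeat-time check (illustrative only: `process_id` and `stop` are assumed helpers, not necessarily the actual Solid Queue internals):

```ruby
# Illustrative sketch only; `process_id` and `stop` are assumed helpers,
# not necessarily the actual Solid Queue code.
def heartbeat
  if registered_process = SolidQueue::Process.find_by(id: process_id)
    # Still registered: refresh the heartbeat timestamp under a row lock.
    registered_process.with_lock { registered_process.heartbeat }
  else
    # Our record was pruned (or deleted) while we were suspended: wind down
    # as if a TERM or INT signal had been received. A supervised fork will
    # be replaced by its supervisor once it notices the exit.
    stop
  end
end
```

The point is that each process decides about its own liveness, so neither the supervisor nor the pruning logic needs to guess whether a pid still belongs to the same process.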

This might be handy in the future as well, to stop a given worker from Mission Control.

Big thanks to @npezza93 for catching this scenario.

rosa force-pushed the improve-ungracious-termination branch 2 times, most recently from 3a46abf to 29f297c on September 9, 2024 at 17:22.
@@ -53,7 +53,10 @@ def stop_heartbeat
      end

      def heartbeat
-       process.heartbeat
+       process.with_lock(&:heartbeat)
npezza93 commented on Sep 9, 2024:

Looks like with_lock doesn't pass the object to the block. Seeing a `SolidQueue-0.8.2 Error in thread (0.0ms)  error: "ArgumentError no receiver given"` error.

process.with_lock { process.heartbeat }

rosa (Member, Author) replied:

D'oh, I'm dumb, seriously. This was working on Sunday and I decided to break it yesterday without testing 😅 I'll fix it.

npezza93 replied:

Haha no worries!

rosa (Member, Author) replied:

Huh, no, that was correct 😕 I was wondering why the tests were passing... and this is how with_lock is supposed to be used. The error must come from something else, I'll investigate and fix it.

rosa (Member, Author) replied on Sep 10, 2024:

Ah, no, no, your version is the correct one 😆 And it doesn't break the tests because the error doesn't happen there! The first time the heartbeat happens, the process is gone, so the block is never called. Ahh, I need a holiday soon 😅 😅
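For context on the fix above: as noted in the thread, `with_lock` doesn't pass the record to its block, so `&:heartbeat` (via `Symbol#to_proc`) has no receiver to call the method on, which is where the `ArgumentError` comes from. A minimal comparison, assuming `process` is an Active Record model instance:

```ruby
process.with_lock(&:heartbeat)            # block receives no argument, so Symbol#to_proc
                                          # raises ArgumentError: no receiver given
process.with_lock { process.heartbeat }   # the block closes over `process`, so this works
```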

npezza93 commented on Sep 9, 2024:

If I manually go into the db and trash all the processes, I get this error:

[screenshot of the error]

In supervisor.rb:

        if registered_process = process&.supervisees&.find_by(name: terminated_fork.name)

Adding this patch fixes the error, but a new supervisor never gets started.

rosa (Member, Author) commented on Sep 10, 2024:

> If I manually go into the db and trash all the processes, I get this error:

Ahhh! Yes, I was only considering the case where the processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you are running multiple supervisors and one of them deregisters the other but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if suddenly the server running your jobs decides to go to sleep) 🤔
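A hypothetical sketch of the "supervisor never deregisters itself" guard described above (the `prune_dead_processes` method and the `prunable`/`deregister` names are assumptions, not necessarily the real Solid Queue code):

```ruby
# Hypothetical sketch; scope and method names are assumptions.
def prune_dead_processes
  # Deregister every process whose heartbeat has expired, except the
  # supervisor running this maintenance task, which is evidently alive.
  SolidQueue::Process.prunable.where.not(id: process.id).find_each(&:deregister)
end
```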

npezza93 replied:

> Ahhh! Yes, I was only considering the case where the processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you are running multiple supervisors and one of them deregisters the other but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if suddenly the server running your jobs decides to go to sleep) 🤔

That's a good point. I was just impatient and decided to nuke the records instead of waiting for my Mac to hibernate. Going the hibernation route, this never gets raised, so I think you can ignore me here.

Commit descriptions pushed to the branch:

- For example, when coming back from being suspended and having its heartbeat expired.
- As these are all common between all processes.
- If, for some reason, the process failed a heartbeat and the supervisor pruned it, we shouldn't continue running. Just stop as if we had received a signal.
- This could be used in the future from Mission Control to stop a worker.
- To guard against race conditions of the record being deleted precisely then.

rosa force-pushed the improve-ungracious-termination branch from fdb7bc0 to 6d7bc6f on September 11, 2024 at 11:59, followed by one more commit description:

- Left-over from something I was rewriting.
rosa (Member, Author) commented on Sep 11, 2024:

We're now running this in production in HEY to make sure everything is ok.

rosa merged commit 439202c into main on Sep 11, 2024; 8 checks passed.
rosa deleted the improve-ungracious-termination branch on September 11, 2024 at 17:43.