🐛 Resolve SIGKILL and SIGINT errors on macOS #367

Merged
merged 1 commit into konveyor:main from exit-more-gracefully on Sep 21, 2024

Conversation

jmontleon
Member

No description provided.

@jmontleon jmontleon force-pushed the exit-more-gracefully branch 2 times, most recently from 759fb99 to 61f875e Compare September 16, 2024 21:46
@jwmatthews
Member

I saw the crash screen immediately upon launching make run-server.

[Screenshot: macOS crash report dialog, 2024-09-17 8:44 AM]
Output from the console:
$ make run-server
PYTHONPATH="/Users/jmatthews/git/jwmatthews/kai/kai:" python kai/server.py
[2024-09-17 08:45:16 -0400] [54123] [INFO] Starting gunicorn 22.0.0
[2024-09-17 08:45:16 -0400] [54123] [INFO] Listening at: http://0.0.0.0:8080 (54123)
[2024-09-17 08:45:16 -0400] [54123] [INFO] Using worker: aiohttp.GunicornWebWorker
[2024-09-17 08:45:16 -0400] [54126] [INFO] Booting worker with pid: 54126
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-17 08:45:16,695 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-17 08:45:16,697 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-17 08:45:16,697 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
objc[54126]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[54126]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-17 08:45:16 -0400] [54123] [ERROR] Worker (pid:54126) was sent SIGKILL! Perhaps out of memory?
[2024-09-17 08:45:16 -0400] [54127] [INFO] Booting worker with pid: 54127
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-17 08:45:16,796 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-17 08:45:16,798 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-17 08:45:16,798 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
[2024-09-17 08:45:16 -0400] [54128] [INFO] Booting worker with pid: 54128
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-17 08:45:16,807 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-17 08:45:16,810 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-17 08:45:16,810 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
objc[54127]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[54127]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-17 08:45:16 -0400] [54123] [ERROR] Worker (pid:54127) was sent SIGKILL! Perhaps out of memory?
objc[54128]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[54128]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-17 08:45:16 -0400] [54123] [ERROR] Worker (pid:54128) was sent SIGKILL! Perhaps out of memory?
[2024-09-17 08:45:16 -0400] [54129] [INFO] Booting worker with pid: 54129
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-17 08:45:16,870 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-17 08:45:16,872 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-17 08:45:16,873 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
objc[54129]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[54129]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-17 08:45:16 -0400] [54123] [ERROR] Worker (pid:54129) was sent SIGKILL! Perhaps out of memory?
[2024-09-17 08:45:16 -0400] [54130] [INFO] Booting worker with pid: 54130
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
....

@jwmatthews left a comment

This failed to run on macOS (arm64).
I've added what I saw from the console; I don't know the cause or a fix at the moment.

@jmontleon jmontleon marked this pull request as draft September 17, 2024 12:59
@jmontleon
Member Author

signal.signal(signal.SIGINT, signal.SIG_IGN)

This should just cause it to ignore (SIG_IGN) additional interrupt signals (SIGINT).
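
For reference, a minimal runnable sketch of that behavior (illustrative only, not Kai's actual shutdown code):

import signal
import time

def handle_sigint(signum, frame):
    # First Ctrl-C: begin shutdown and ignore any further SIGINTs so
    # repeated interrupts can't cut the cleanup short.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    print("SIGINT received; ignoring further interrupts while shutting down")
    raise SystemExit(0)

signal.signal(signal.SIGINT, handle_sigint)

while True:
    time.sleep(1)  # stand-in for the server's main loop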

In your log it is complaining about receiving a SIGKILL though, which seems odd.
[2024-09-17 08:45:16 -0400] [54123] [ERROR] Worker (pid:54126) was sent SIGKILL! Perhaps out of memory?

@jmontleon
Member Author

@jwmatthews here's a new attempt that handles this from the shell, if you wouldn't mind testing.

All this does is trap SIGINT and send a SIGTERM, signaling to gunicorn that it should start a graceful shutdown rather than an immediate one:

If all goes well, you should see:
^C[2024-09-17 11:18:14 -0400] [60418] [INFO] Handling signal: term

instead of

^C[2024-09-17 11:21:07 -0400] [61156] [INFO] Handling signal: int

Workers should shut down with an [INFO] tag instead of [ERROR] as well.

On Linux this looks like:

INFO - 2024-09-17 11:18:13,100 - kai.server - [           server.py:58   -                  app()] - Kai server is ready to receive requests.
INFO - 2024-09-17 11:18:13,199 - kai.service.kai_application.kai_application - [  kai_application.py:84   -             __init__()] - Selected incident store: postgresql
INFO - 2024-09-17 11:18:13,200 - kai.server - [           server.py:58   -                  app()] - Kai server is ready to receive requests.
^C[2024-09-17 11:18:14 -0400] [60418] [INFO] Handling signal: term
[2024-09-17 11:18:15 -0400] [60488] [INFO] Worker exiting (pid: 60488)
[2024-09-17 11:18:15 -0400] [60489] [INFO] Worker exiting (pid: 60489)
[2024-09-17 11:18:15 -0400] [60490] [INFO] Worker exiting (pid: 60490)
[2024-09-17 11:18:15 -0400] [60493] [INFO] Worker exiting (pid: 60493)
[2024-09-17 11:18:15 -0400] [60495] [INFO] Worker exiting (pid: 60495)
[2024-09-17 11:18:15 -0400] [60485] [INFO] Worker exiting (pid: 60485)
[2024-09-17 11:18:15 -0400] [60486] [INFO] Worker exiting (pid: 60486)
[2024-09-17 11:18:15 -0400] [60487] [INFO] Worker exiting (pid: 60487)
[2024-09-17 11:18:16 -0400] [60418] [INFO] Shutting down: Master
[1]+  Done                    PYTHONPATH="/home/jason/Documents/go/src/github.com/konveyor/kai/kai:" python kai/server.py
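
For anyone who finds the Makefile escaping hard to read, here is a hypothetical Python equivalent of the same trap logic (the wrapper itself is illustrative; only the kai/server.py invocation comes from the Makefile):

import signal
import subprocess
import sys

# Run the server in its own process group (the `set -m` analogue) so the
# terminal's Ctrl-C does not reach it directly.
proc = subprocess.Popen([sys.executable, "kai/server.py"], start_new_session=True)

def forward_as_term(signum, frame):
    # Translate our SIGINT into a SIGTERM so gunicorn begins a graceful
    # shutdown instead of a quick one.
    proc.terminate()

signal.signal(signal.SIGINT, forward_as_term)

# Mirrors the `while kill -0 $PID; do wait $PID; done` loop: don't exit
# until the child's graceful shutdown is complete.
proc.wait()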

Makefile (outdated)
-	PYTHONPATH=$(KAI_PYTHON_PATH) python kai/server.py
+	bash -c 'set -m; _trap () { kill -15 $$PID; } ; trap _trap SIGINT ;\
+	PYTHONPATH=$(KAI_PYTHON_PATH) python kai/server.py & export PID=$$! ;\
+	while kill -0 $$PID > /dev/null 2>&1; do wait $$PID; done'
@jmontleon (Member Author) commented:

"kill -0" sends no signal; it just checks for the pids existence. This loop just prevents the shell from exiting before the graceful shutdown is complete.

@jwmatthews
Member

@jmontleon I updated to the latest commit on this branch and saw an immediate crash.

I see the below immediately when I try to run.

[Screenshot: macOS crash report dialog, 2024-09-18 8:20 AM]

I then did a CTRL-C and saw another crash message from macOS pop up.


File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-18 08:20:48,336 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-18 08:20:48,338 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-18 08:20:48,339 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
objc[65369]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[65369]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-18 08:20:48 -0400] [65063] [ERROR] Worker (pid:65369) was sent SIGKILL! Perhaps out of memory?
objc[65370]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[65370]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-18 08:20:48 -0400] [65371] [INFO] Booting worker with pid: 65371
[2024-09-18 08:20:48 -0400] [65063] [ERROR] Worker (pid:65370) was sent SIGKILL! Perhaps out of memory?
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
File logging for 'kai' is set to level 'DEBUG' writing to file: '/Users/jmatthews/git/jwmatthews/kai/logs/kai_server.log'
INFO - 2024-09-18 08:20:48,442 - kai.service.kai_application.kai_application - [  kai_application.py:54   -             __init__()] - Tracing enabled.
INFO - 2024-09-18 08:20:48,445 - kai.service.kai_application.kai_application - [  kai_application.py:63   -             __init__()] - Selected provider: ChatIBMGenAI
INFO - 2024-09-18 08:20:48,445 - kai.service.kai_application.kai_application - [  kai_application.py:64   -             __init__()] - Selected model: meta-llama/llama-3-70b-instruct
[2024-09-18 08:20:48 -0400] [65063] [INFO] Handling signal: term
objc[65371]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[65371]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-09-18 08:20:48 -0400] [65063] [ERROR] Worker (pid:65371) was sent SIGKILL! Perhaps out of memory?
[2024-09-18 08:20:48 -0400] [65063] [INFO] Shutting down: Master
[1]+  Done                    PYTHONPATH="/Users/jmatthews/git/jwmatthews/kai/kai:" python kai/server.py

@jmontleon
Member Author

The SIGKILL errors are addressed in #374

I can confirm the bash trap worked to exit cleanly before the new SIGKILL issue appeared; however, with the dependency update that introduced the new problem, and its workaround, the trap no longer seems necessary.

If we hit this again, this will send a SIGTERM for a graceful shutdown instead of a SIGINT for a quick shutdown:

run-server:
	bash -c 'set -m; _trap () { kill -15 $$PID; } ; trap _trap SIGINT ;\
	PYTHONPATH=$(KAI_PYTHON_PATH) python kai/server.py & export PID=$$! ;\
	while kill -0 $$PID > /dev/null 2>&1; do wait $$PID; done'

@jmontleon jmontleon force-pushed the exit-more-gracefully branch 3 times, most recently from 8b5f4b6 to cf9ad3d Compare September 18, 2024 14:47
@jwmatthews
Member

Looking good on macOS with regard to running and automatically setting OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES.

$ make run-server
if [[ "$(uname)" -eq "Darwin" ]] ; then export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES ; fi ;\
	PYTHONPATH="/Users/jmatthews/git/jwmatthews/kai/kai:" python kai/server.py
[2024-09-21 07:59:29 -0400] [12827] [INFO] Starting gunicorn 22.0.0
[2024-09-21 07:59:29 -0400] [12827] [INFO] Listening at: http://0.0.0.0:8080 (12827)
[2024-09-21 07:59:29 -0400] [12827] [INFO] Using worker: aiohttp.GunicornWebWorker
[2024-09-21 07:59:29 -0400] [12830] [INFO] Booting worker with pid: 12830
Config loaded: KaiConfig(log_level='info', file_log_level='debug', log_dir='$pwd/logs', demo_mode=False, trace_enabled=True, gunicorn_workers=8, gunicorn_timeout=3600, gunicorn_bind='0.0.0.0:8080', incident_store=KaiConfigIncidentStore(solution_detectors=<SolutionDetectorKind.NAIVE: 'naive'>, solution_producers=<SolutionProducerKind.TEXT_ONLY: 'text_only'>, args=KaiConfigIncidentStorePostgreSQLArgs(provider=<KaiConfigIncidentStoreProvider.POSTGRESQL: 'postgresql'>, host='127.0.0.1', database='kai', user='kai', password='dog8code', connection_string=None, solution_detection=<SolutionDetectorKind.NAIVE: 'naive'>)), models=KaiConfigModels(provider='ChatIBMGenAI', args={'model_id': 'meta-llama/llama-3-70b-instruct', 'parameters': {'max_new_tokens': 2048}}, template=None, llama_header=None, llm_retries=5, llm_retry_delay=10.0), solution_consumers=[<SolutionConsumerKind.DIFF_ONLY: 'diff_only'>, <SolutionConsumerKind.LLM_SUMMARY: 'llm_summary'>])
Console logging for 'kai' is set to level 'INFO'
....

Note: I still see the odd crash when I CTRL-C make run-server to exit.
Full stack trace: https://gist.github.com/jwmatthews/c353df2178999ca37f9a7ea5bb37d300
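
If the override ever needs to be set from Python rather than the Makefile, a hedged sketch (it must run before gunicorn forks any workers):

import os
import platform

# Hypothetical alternative to the Makefile's uname check: macOS reports
# platform.system() == "Darwin". Setting this before workers fork avoids
# the Objective-C fork-safety crash shown earlier in the thread.
if platform.system() == "Darwin":
    os.environ.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")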

@jwmatthews left a comment

Confirmed: this fixes the issue we recently saw where make run-server would crash immediately on startup.

@jwmatthews jwmatthews merged commit 5fb7153 into konveyor:main Sep 21, 2024
5 checks passed