Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added logging and gracefully terminating workers #230

Merged
merged 2 commits into from
Dec 3, 2023
Merged

Conversation

coderabhigupta
Copy link
Contributor

@coderabhigupta coderabhigupta commented Dec 2, 2023

  • Fixed logging to make sure the parent TaskHandler process and child TaskRunner worker process logs are logged properly. Previously no logs from TaskHandler were going through. Many TaskRunner logs were out of order or missing.
    • We're using a multiprocessing queue.
    • Parent TaskHandler process writes logs to this queue by configuring a stream handler.
    • All the child processes also write logs to this queue by configuring a dedicated stream handler.
    • A dedicated logger process reads from this queue and acts as a centralized log writer.
  • Terminate child processes by default rather than killing them and kill only when exception is encountered while terminating
  • Terminate processes when a Keyboard Interrupt is sent from a worker app which calls join_processes on TaskHandler

Logs before changes:

2023-12-02 16:12:23,975 [32377] conductor.client.automator.task_runner DEBUG    Polling task for: python_execute_example
2023-12-02 16:12:23,977 [32377] conductor.client.automator.task_runner DEBUG    Polling task for: python_execute_example_2
2023-12-02 16:12:23,978 [32377] conductor.client.automator.task_runner DEBUG    Polling task for: python_task_type_b
2023-12-02 16:12:23,980 [32377] conductor.client.automator.task_runner DEBUG    Polling task for: python_task_type_a
2023-12-02 16:12:23,985 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): localhost:8080
2023-12-02 16:12:23,986 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): localhost:8080
2023-12-02 16:12:23,988 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): localhost:8080
2023-12-02 16:12:24,093 urllib3.connectionpool DEBUG    http://localhost:8080 "GET /api/tasks/poll/python_execute_example_2?workerid=e2&domain=hmmm HTTP/1.1" 204 0
2023-12-02 16:12:24,093 urllib3.connectionpool DEBUG    http://localhost:8080 "GET /api/tasks/poll/python_task_type_b?workerid=b&domain=pheww HTTP/1.1" 204 0
2023-12-02 16:12:24,094 urllib3.connectionpool DEBUG    http://localhost:8080 "GET /api/tasks/poll/python_execute_example?workerid=e1&domain=environment HTTP/1.1" 204 0
2023-12-02 16:12:24,094 [32377] conductor.client.automator.task_runner DEBUG    Polled task: python_task_type_b, worker_id: b, domain: pheww
2023-12-02 16:12:24,094 [32377] conductor.client.automator.task_runner DEBUG    Polled task: python_execute_example_2, worker_id: e2, domain: hmmm
2023-12-02 16:12:24,095 [32377] conductor.client.automator.task_runner DEBUG    Sleep for 0.9 seconds
2023-12-02 16:12:24,095 [32377] conductor.client.automator.task_runner DEBUG    Sleep for 3.1 seconds
2023-12-02 16:12:24,095 urllib3.connectionpool DEBUG    http://localhost:8080 "GET /api/tasks/poll/python_task_type_a?workerid=a&domain=environment HTTP/1.1" 204 0
2023-12-02 16:12:24,095 [32377] conductor.client.automator.task_runner DEBUG    Polled task: python_execute_example, worker_id: e1, domain: environment
2023-12-02 16:12:24,096 [32377] conductor.client.automator.task_runner DEBUG    Sleep for 0.9 seconds
2023-12-02 16:12:24,096 [32377] conductor.client.automator.task_runner DEBUG    Polled task: python_task_type_a, worker_id: a, domain: environment
2023-12-02 16:12:24,096 [32377] conductor.client.automator.task_runner DEBUG    Sleep for 2.1 seconds
...
...
...
Traceback (most recent call last):
  File "/Users/abhigupta/Work/orkes/workers/python-domain-worker/main.py", line 78, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
    main()
Traceback (most recent call last):
  File "/Users/abhigupta/Work/orkes/workers/python-domain-worker/main.py", line 75, in main
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/abhigupta/.pyenv/versions/sdk/lib/python3.10/site-packages/conductor/client/automator/task_runner.py", line 56, in run
    self.run_once()
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abhigupta/.pyenv/versions/3.10.10/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abhigupta/.pyenv/versions/sdk/lib/python3.10/site-packages/conductor/client/automator/task_runner.py", line 65, in run_once
    self.__wait_for_polling_interval()
  File "/Users/abhigupta/.pyenv/versions/sdk/lib/python3.10/site-packages/conduc  File "/Users/abhigupta/.

Logs after changes:

abhigupta@Abhisheks-MacBook-Pro python-domain-worker % python3 main.py
2023-12-02 16:10:58,274 urllib3.connectionpool DEBUG    Starting new HTTP connection (1): localhost:8080
2023-12-02 16:10:58,357 urllib3.connectionpool DEBUG    http://localhost:8080 "POST /api/token HTTP/1.1" 200 None
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created TaskRunner processes
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created all processes
2023-12-02 16:10:58,359 [31834] conductor.client.automator.task_handler INFO     Starting worker processes...
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created TaskRunner processes
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created TaskRunner processes
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created all processes
2023-12-02 16:10:58,358 [31834] conductor.client.automator.task_handler INFO     Created all processes
2023-12-02 16:10:58,359 [31834] conductor.client.automator.task_handler INFO     Starting worker processes...
2023-12-02 16:10:58,359 [31834] conductor.client.automator.task_handler INFO     Starting worker processes...
2023-12-02 16:10:58,364 [31859] conductor.client.automator.task_runner INFO     Started worker process for task(s): python_execute_example
2023-12-02 16:10:58,364 [31859] conductor.client.automator.task_runner INFO     Started worker process for task(s): python_execute_example
2023-12-02 16:10:58,366 [31859] conductor.client.automator.task_runner DEBUG    Polling task for: python_execute_example
2023-12-02 16:10:58,364 [31859] conductor.client.automator.task_runner INFO     Started worker process for task(s): python_execute_example
2023-12-02 16:10:58,367 [31834] conductor.client.automator.task_handler INFO     Started TaskRunner processes
2023-12-02 16:10:58,368 [31834] conductor.client.automator.task_handler INFO     Started 
2023-12-02 16:10:59,505 [31860] conductor.client.automator.task_runner DEBUG    Polled task: python_execute_example_2, worker_id: e2, domain: hmmm
2023-12-02 16:10:59,505 [31860] conductor.client.automator.task_runner DEBUG    Sleep for 0.9 seconds
2023-12-02 16:10:59,505 [31860] conductor.client.automator.task_runner DEBUG    Sleep for 0.9 seconds
2023-12-02 16:10:59,505 [31859] conductor.client.automator.task_runner DEBUG    Polled task: python_execute_example, 
....
....
....
2023-12-02 16:11:00,119 [31834] conductor.client.automator.task_handler INFO     KeyboardInterrupt: Stopping all processes
2023-12-02 16:11:00,119 [31834] conductor.client.automator.task_handler DEBUG    Terminating process: 31859
2023-12-02 16:11:00,120 [31834] conductor.client.automator.task_handler DEBUG    Terminating process: 31860
2023-12-02 16:11:00,120 [31834] conductor.client.automator.task_handler DEBUG    Terminating process: 31861
2023-12-02 16:11:00,120 [31834] conductor.client.automator.task_handler DEBUG    Terminating process: 31862
2023-12-02 16:11:00,120 [31834] conductor.client.automator.task_handler INFO     Stopped worker processes...
2023-12-02 16:11:00,120 [31834] conductor.client.automator.task_handler INFO     Stopping logger process...

@coderabhigupta coderabhigupta linked an issue Dec 2, 2023 that may be closed by this pull request
@@ -51,6 +51,13 @@ def get_task_definition_name(self) -> str:
"""
return self.task_definition_name_cache

@property
def task_definition_names(self):
if isinstance(self.task_definition_name, list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will this be a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will happen when the same worker shares polling for different tasks. This was done as an optimization by a customer developer - #219.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document this a little better.

@v1r3n v1r3n changed the base branch from main to sdk_updates December 3, 2023 17:39
@v1r3n v1r3n merged commit 6ca03fd into sdk_updates Dec 3, 2023
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conductor Python workers do not acknowledge SIGTERM
2 participants