Update eager task launching & monitoring #3042

wild-endeavor · 2025-01-07T22:52:44Z

Why are the changes needed?

This change simplifies the runner that kicks off executions for eager tasks by making its main executor function async, which removes the need to handle an explicit loop. The background functions that launch and monitor executions also no longer need to be async.

What changes were proposed in this pull request?

Merged the two classes in the worker_queue file into one.
Updated the add function in Controller, which is called by the call handler in promise.py, to be async. Because of this, the async call handler can now simply await this function.
Changed the functions in the Controller object that actually launch and monitor the executions to be sync instead of async.
- This also means we can remove the separate internal event loop that the Controller was holding onto. Note that this has the side effect of no longer sharing the FlyteContext, since that is stored in a thread-local context var.
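The shape described above, an async `add` that the caller awaits while a plain synchronous thread does the launching and monitoring, can be sketched roughly as follows. This is a minimal illustration, not flytekit's Controller: `MiniController`, `Item`, `Status`, and the polling interval are all hypothetical names and choices, and a zero-argument callable stands in for "launch an execution and watch it".

```python
import asyncio
import threading
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional


class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class Item:
    fn: Callable[[], Any]  # stand-in for launching and monitoring one execution
    status: Status = Status.PENDING
    result: Any = None
    error: Optional[BaseException] = None


class MiniController:
    """Hypothetical stand-in for the Controller; not flytekit's implementation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: List[Item] = []
        self._stop = threading.Event()
        # A plain synchronous worker thread: it needs no event loop of its own.
        self._thread = threading.Thread(target=self._execute, daemon=True, name="controller-thread")
        self._thread.start()

    def _execute(self) -> None:
        # Event.wait doubles as an interruptible sleep between rounds.
        while not self._stop.wait(timeout=0.01):
            with self._lock:
                pending = [i for i in self._items if i.status is Status.PENDING]
            for item in pending:
                try:
                    item.result = item.fn()
                    item.status = Status.SUCCESS  # set status last, after the result
                except Exception as exc:
                    item.error = exc
                    item.status = Status.FAILED

    async def add(self, fn: Callable[[], Any]) -> Any:
        item = Item(fn=fn)
        with self._lock:
            self._items.append(item)
        # Yield to the caller's event loop while the worker thread makes progress.
        while item.status is Status.PENDING:
            await asyncio.sleep(0.01)
        if item.status is Status.FAILED:
            assert item.error is not None
            raise item.error
        return item.result

    def close(self) -> None:
        self._stop.set()
        self._thread.join()


async def main() -> int:
    controller = MiniController()
    try:
        # The async call handler can simply await the controller.
        return await controller.add(lambda: 21 * 2)
    finally:
        controller.close()
```

Running `asyncio.run(main())` drives the sketch from the caller's own event loop. Polling with `asyncio.sleep` is only one way to bridge the thread and the coroutine; it is used here because it keeps the example short.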

How was this patch tested?

Tested using local sandbox and running the internal hpo example.

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR refactors the eager task execution system by simplifying async/sync interaction and streamlining the Controller class. Major enhancements include improved thread safety, error handling, and execution state management. Changes include variable renaming for clarity (wi to work_item), restructured Python interface handling, and enhanced state management through method renaming. Implementation includes comprehensive test coverage and improved logging with context manager implementation.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 4


flyte-bot · 2025-01-07T22:52:57Z

Code Review Agent Run Status

Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].


codecov · 2025-01-10T02:05:34Z

Codecov Report

Attention: Patch coverage is 64.18919% with 53 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (f634d53) to head (2e659ca).
Report is 5 commits behind head on master.

Files with missing lines	Patch %	Lines
flytekit/core/worker_queue.py	64.58%	44 Missing and 7 partials ⚠️
flytekit/core/context_manager.py	66.66%	0 Missing and 1 partial ⚠️
flytekit/core/promise.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3042      +/-   ##
==========================================
- Coverage   82.79%   79.47%   -3.33%     
==========================================
  Files           3      202     +199     
  Lines         186    21390   +21204     
  Branches        0     2756    +2756     
==========================================
+ Hits          154    16999   +16845     
- Misses         32     3616    +3584     
- Partials        0      775     +775



flyte-bot · 2025-01-14T01:58:21Z

Code Review Agent Run #be129b

Actionable Suggestions - 7

tests/flytekit/unit/core/test_worker_queue.py - 2
- Consider adding hash equality verification · Line 251-251
- Consider expanding WorkItem equality test cases · Line 251-251
flytekit/core/worker_queue.py - 3
- Consider moving initialization to post_init · Line 104-105
- Consider more reliable cleanup mechanism · Line 184-184
- Consider keeping event loop cleanup method · Line 229-232
flytekit/core/context_manager.py - 2
- Consider platform signal handling support check · Line 998-999
- Consider thread safety in signal handler · Line 998-999

Additional Suggestions - 3

flytekit/core/worker_queue.py - 3
- Consider configurable sleep duration value · Line 395-395
- Consider using enum comparison instead of strings · Line 113-113
- Consider adding assertion error message · Line 354-354

Review Details

Files reviewed - 5 · Commit Range: 45e68ed..2e659ca
- flytekit/core/context_manager.py
- flytekit/core/promise.py
- flytekit/core/worker_queue.py
- tests/flytekit/integration/remote/test_remote.py
- tests/flytekit/unit/core/test_worker_queue.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful


flyte-bot · 2025-01-14T02:05:17Z

Changelist by Bito

This pull request implements the following key changes.

Key Change: Feature Improvement - Refactor Eager Task Execution System

Files Impacted:
- `context_manager.py` - Added thread safety check for signal handler initialization
- `promise.py` - Simplified async execution by removing explicit loop handling
- `worker_queue.py` - Major refactor of Controller class with improved state management and thread safety
- `test_worker_queue.py` - Added comprehensive tests for new Controller functionality; enhanced test coverage with error handling and work item equality tests
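The "thread safety check for signal handler initialization" listed for `context_manager.py` relates to a general Python rule: `signal.signal()` may only be called from the main thread of the main interpreter and raises `ValueError` elsewhere. A minimal sketch of such a guard is below; the function name `install_handler_if_main_thread` is hypothetical and is not taken from flytekit.

```python
import signal
import threading
from typing import Any, Callable


def install_handler_if_main_thread(signum: int, handler: Callable[..., Any]) -> bool:
    """Install `handler` for `signum` only when called from the main thread.

    Returns True if the handler was installed and False if it was skipped.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() raises ValueError off the main thread, so skip it.
        return False
    signal.signal(signum, handler)
    return True
```

Code that may be constructed from a background thread can call a guard like this instead of calling `signal.signal()` directly.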

flyte-bot · 2025-01-14T02:05:19Z

tests/flytekit/unit/core/test_worker_queue.py

+    wi1 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+    wi2 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+    wi2.uuid = wi1.uuid
+    assert wi1 == wi2


Consider adding hash equality verification

The test case test_work_item_hashing_equality() manually sets the uuid to test equality but doesn't verify hash equality. Consider adding an assertion to verify that hash(wi1) == hash(wi2) since equal objects should have equal hashes.

Code suggestion

Check the AI-generated fix before applying

Suggested change

-    assert wi1 == wi2
+    assert wi1 == wi2
+    assert hash(wi1) == hash(wi2)
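For readers unfamiliar with the contract the bot refers to, here is a small self-contained sketch of a class whose equality and hash both derive from a single identifier. `Item` is a hypothetical stand-in; the thread does not show how `WorkItem` implements `__eq__` or `__hash__`, so treating the uuid as the sole identity is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict
from uuid import UUID, uuid4


@dataclass(eq=False)  # eq=False: use the hand-written __eq__/__hash__ below
class Item:
    """Hypothetical stand-in for WorkItem; identity is the uuid alone (assumption)."""

    name: str
    input_kwargs: Dict[str, Any] = field(default_factory=dict)
    uuid: UUID = field(default_factory=uuid4)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Item):
            return NotImplemented
        return self.uuid == other.uuid

    def __hash__(self) -> int:
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.uuid)
```

With this definition, two items given the same uuid compare equal, hash equal, and collapse to one entry in a set or dict.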

Code Review Run #be129b

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

flyte-bot · 2025-01-14T02:05:21Z

tests/flytekit/unit/core/test_worker_queue.py

+    wi1 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+    wi2 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+    wi2.uuid = wi1.uuid
+    assert wi1 == wi2


Consider expanding WorkItem equality test cases

Consider adding test cases to verify WorkItem equality behavior when input_kwargs or wf_exec differ between instances. The current test only verifies equality for identical objects.

Code suggestion

Check the AI-generated fix before applying

Suggested change

-    assert wi1 == wi2
+    assert wi1 == wi2
+    # Test inequality cases
+    wi3 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={'param': 'value'})
+    wi3.uuid = wi1.uuid
+    assert wi1 != wi3  # Different input_kwargs
+    wi4 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+    assert wi1 != wi4  # Different UUIDs

Code Review Run #be129b

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

flyte-bot · 2025-01-14T02:05:22Z

flytekit/core/worker_queue.py

+    python_interface: typing.Optional[Interface] = None
+    uuid: typing.Optional[uuid.UUID] = None


Consider moving initialization to post_init

Consider initializing python_interface and uuid in __init__ or __post_init__ instead of using class-level defaults, since these are already being set in __post_init__.

Code suggestion

Check the AI-generated fix before applying

Suggested change

-    python_interface: typing.Optional[Interface] = None
-    uuid: typing.Optional[uuid.UUID] = None
+    python_interface: typing.Optional[Interface]
+    uuid: typing.Optional[uuid.UUID]
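As background for this exchange: in a dataclass, a field declared without a default becomes a required argument of the generated `__init__`, so a `= None` default combined with `__post_init__` is the usual way to let callers omit a value that is computed later. The sketch below illustrates that pattern with a hypothetical `Item` class and a made-up `label` field; it is not the `WorkItem` definition.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Item:
    """Hypothetical example of optional fields filled in by __post_init__."""

    name: str
    # The None defaults keep these out of the required constructor arguments.
    label: Optional[str] = None
    uuid: Optional[UUID] = None

    def __post_init__(self) -> None:
        # Fill in anything the caller did not supply.
        if self.label is None:
            self.label = self.name.upper()
        if self.uuid is None:
            self.uuid = uuid4()
```

Callers can then write `Item("t1")` and still override either field explicitly, for example `Item("t1", uuid=some_uuid)`.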

Code Review Run #be129b

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

flyte-bot · 2025-01-14T02:05:23Z

flytekit/core/worker_queue.py

+            target=self._execute, daemon=True, name="controller-thread"
+        )
+        self.__runner_thread.start()
+        atexit.register(self._close, event=self.stopping_condition, runner=self.__runner_thread)


Consider more reliable cleanup mechanism

Consider using weakref.finalize() instead of atexit.register() for cleanup. atexit handlers may not run if the program exits abnormally, while weakref.finalize() provides more reliable cleanup.

Code suggestion

Check the AI-generated fix before applying

Suggested change

-        atexit.register(self._close, event=self.stopping_condition, runner=self.__runner_thread)
+        import weakref
+        weakref.finalize(self, self._close, event=self.stopping_condition, runner=self.__runner_thread)
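The pattern in the diff above (a daemon thread, a stopping event, and an `atexit` hook that receives both) can be sketched in isolation as follows. `Runner` is a hypothetical class, and making `_close` a staticmethod is an assumption; the diff only shows that `_close` is registered with `event` and `runner` keyword arguments.

```python
import atexit
import threading


class Runner:
    """Hypothetical stand-in for the controller's thread handling."""

    def __init__(self) -> None:
        self.stopping_condition = threading.Event()
        self._thread = threading.Thread(target=self._execute, daemon=True, name="controller-thread")
        self._thread.start()
        # _close is a staticmethod here (assumption), so atexit holds only the
        # event and the thread rather than a bound method of this instance.
        atexit.register(self._close, event=self.stopping_condition, runner=self._thread)

    def _execute(self) -> None:
        # Event.wait doubles as an interruptible sleep between polling rounds.
        while not self.stopping_condition.wait(timeout=0.05):
            pass  # launching and polling executions would go here

    @staticmethod
    def _close(event: threading.Event, runner: threading.Thread) -> None:
        # Ask the loop to stop, then give the thread a bounded time to finish.
        event.set()
        runner.join(timeout=1.0)
```

Calling `Runner._close(r.stopping_condition, r._thread)` by hand stops the thread early; the registered hook then runs again at interpreter exit and is harmless, since setting an already-set event and joining a finished thread are both no-ops.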

Code Review Run #be129b

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged


thomasjpfan

Quick glance over


thomasjpfan · 2025-01-14T19:31:19Z

flytekit/core/worker_queue.py

-
-            exc = EagerException(f"Error executing {work.entity.name} with error: {work.wf_exec.closure.error}")
-            work.set_error(exc)
+        return self.status == ItemStatus.SUCCESS or self.status == ItemStatus.FAILED


ready seems like a weird name here. Should this be completed?

left over from when it was an asyncio.Future. Let me change it to is_in_terminal_state
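A minimal sketch of the renamed check is shown below. `ItemStatus.SUCCESS`, `ItemStatus.FAILED`, and `ItemStatus.RUNNING` appear in the diffs in this thread; the `PENDING` member, the string values, the `Item` class, and the choice of a property rather than a method are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ItemStatus(Enum):
    PENDING = "pending"  # assumed member, not shown in the thread
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class Item:
    """Hypothetical stand-in for the work item being tracked."""

    status: ItemStatus = ItemStatus.PENDING

    @property
    def is_in_terminal_state(self) -> bool:
        # Terminal means no further updates are expected for this item.
        return self.status in (ItemStatus.SUCCESS, ItemStatus.FAILED)
```

The name states what the check means for both outcomes, whereas `ready` suggested a result being available, which is not quite true of a failed item.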


thomasjpfan · 2025-01-14T19:36:53Z

flytekit/core/worker_queue.py

+                    elif update.wf_exec.closure.phase == WorkflowExecutionPhase.FAILED:
+                        update.status = ItemStatus.FAILED
+                else:
+                    assert item.status == ItemStatus.RUNNING


Should this end up being a more detailed error just in case this is not true?

replacing with a debug log line, just to capture the other arm of the conditional is all.
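The reply above, swapping the assert for a debug log on the non-terminal arm, could look roughly like this in simplified form. `Phase` is a stand-in for `WorkflowExecutionPhase`, `next_status` is a hypothetical helper, and the real branching in `worker_queue.py` is only partly visible in the diff, so the structure here is an assumption.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class Phase(Enum):  # stand-in for WorkflowExecutionPhase
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class ItemStatus(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def next_status(current: ItemStatus, phase: Phase) -> ItemStatus:
    """Map an observed execution phase onto the item's status."""
    if phase is Phase.SUCCEEDED:
        return ItemStatus.SUCCESS
    if phase is Phase.FAILED:
        return ItemStatus.FAILED
    # Previously an assert; a debug line records the non-terminal arm instead,
    # so an unexpected state is logged rather than crashing the polling thread.
    logger.debug("Execution still in phase %s; item status stays %s", phase.name, current.name)
    return current
```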


flyte-bot · 2025-01-16T22:52:17Z

Code Review Agent Run #bfa426

Actionable Suggestions - 1

flytekit/core/worker_queue.py - 1
- Method rename may affect compatibility · Line 114-114

Review Details

Files reviewed - 3 · Commit Range: 2e659ca..7873024
- flytekit/core/context_manager.py
- flytekit/core/worker_queue.py
- tests/flytekit/unit/core/test_worker_queue.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful


wild-endeavor added 10 commits January 7, 2025 10:35

refactor

45e68ed

Signed-off-by: Yee Hing Tong <[email protected]>

flip order

f6e8450

Signed-off-by: Yee Hing Tong <[email protected]>

use a uuid to link updates and the original object

da7b74c

Signed-off-by: Yee Hing Tong <[email protected]>

add print

1ebc98f

Signed-off-by: Yee Hing Tong <[email protected]>

error

4f67956

Signed-off-by: Yee Hing Tong <[email protected]>

debug

b369032

Signed-off-by: Yee Hing Tong <[email protected]>

skip signal if not on main thread

f2bc7a8

Signed-off-by: Yee Hing Tong <[email protected]>

re-import in constructor

1d6caef

Signed-off-by: Yee Hing Tong <[email protected]>

add prints

079737c

Signed-off-by: Yee Hing Tong <[email protected]>

remove explicit loop and future, add an explicit exit if background thread fails

795bf92

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 5 commits January 7, 2025 14:58

revert test file

9c24a25

Signed-off-by: Yee Hing Tong <[email protected]>

debug

b5e04fa

Signed-off-by: Yee Hing Tong <[email protected]>

logging

cc8a97c

Signed-off-by: Yee Hing Tong <[email protected]>

debug

b24636a

Signed-off-by: Yee Hing Tong <[email protected]>

add assert, uuid

c4a53b9

Signed-off-by: Yee Hing Tong <[email protected]>

add status running to update object

73c0294

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 2 commits January 8, 2025 16:52

lint

3a78c21

Signed-off-by: Yee Hing Tong <[email protected]>

testing thread again

b38546e

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 2 commits January 9, 2025 18:12

stack

f96d97a

Signed-off-by: Yee Hing Tong <[email protected]>

more debug

0bbd368

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 2 commits January 10, 2025 10:50

more debug

ea0d35d

Signed-off-by: Yee Hing Tong <[email protected]>

_execute logs

5a1e301

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 2 commits January 10, 2025 14:08

debugs

e7260db

Signed-off-by: Yee Hing Tong <[email protected]>

more debug

0de3449

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor added 3 commits January 10, 2025 14:32

remote debug

5cf71b1

Signed-off-by: Yee Hing Tong <[email protected]>

remove debugging

8fe1b94

Signed-off-by: Yee Hing Tong <[email protected]>

new lines

2e659ca

Signed-off-by: Yee Hing Tong <[email protected]>

wild-endeavor marked this pull request as ready for review January 14, 2025 00:57

wild-endeavor requested review from kumare3, eapolinario, pingsutw, cosmicBboy, samhita-alla, thomasjpfan and Future-Outlier as code owners January 14, 2025 00:57

flyte-bot reviewed Jan 14, 2025

flytekit/core/worker_queue.py

flyte-bot reviewed Jan 14, 2025

flytekit/core/context_manager.py

flyte-bot reviewed Jan 14, 2025

flytekit/core/context_manager.py

wild-endeavor changed the title ~~Eager watch batch~~ Update eager task launching & monitoring Jan 14, 2025

thomasjpfan reviewed Jan 14, 2025

flytekit/core/worker_queue.py

pr comments

7873024

Signed-off-by: Yee Hing Tong <[email protected]>

flyte-bot reviewed Jan 16, 2025

flytekit/core/worker_queue.py

eapolinario approved these changes Jan 17, 2025

wild-endeavor merged commit a465932 into master Jan 18, 2025
102 of 104 checks passed

shuyingliang pushed a commit to shuyingliang/flytekit that referenced this pull request Jan 22, 2025

Update eager task launching & monitoring (flyteorg#3042)

d4b49bc

Signed-off-by: Yee Hing Tong <[email protected]>
