Update eager task launching & monitoring #3042

Merged: 28 commits, Jan 18, 2025

@wild-endeavor wild-endeavor commented Jan 7, 2025

Why are the changes needed?

This change simplifies the runner that kicks off executions for eager tasks by making its main executor function async, which removes the need to handle an explicit loop. Also, the background functions that launch and monitor executions no longer need to be async.
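The split described above (an async entry point in front of plain synchronous background work) can be sketched roughly as follows. This is a minimal illustration, not flytekit's implementation: `MiniController`, `Item`, `Status`, and the polling interval are all hypothetical names and values chosen for the example.

```python
import asyncio
import threading
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"


@dataclass
class Item:
    fn: Callable[[], Any]  # hypothetical stand-in for "launch an execution"
    status: Status = Status.PENDING
    result: Optional[Any] = None


class MiniController:
    """Hypothetical controller: synchronous worker thread, async add()."""

    def __init__(self) -> None:
        self._items: List[Item] = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._execute, daemon=True)
        self._thread.start()

    def _execute(self) -> None:
        # Plain synchronous loop: this thread owns no asyncio event loop.
        while not self._stop.is_set():
            with self._lock:
                pending = [i for i in self._items if i.status is Status.PENDING]
            for item in pending:
                item.status = Status.RUNNING
                item.result = item.fn()
                item.status = Status.SUCCESS
            time.sleep(0.01)

    async def add(self, fn: Callable[[], Any]) -> Any:
        item = Item(fn)
        with self._lock:
            self._items.append(item)
        # Yield to the caller's event loop while the worker thread does the work.
        while item.status is not Status.SUCCESS:
            await asyncio.sleep(0.01)
        return item.result

    def close(self) -> None:
        self._stop.set()
        self._thread.join()


async def main() -> list:
    controller = MiniController()
    try:
        # Several calls can be awaited concurrently from async code.
        return await asyncio.gather(
            controller.add(lambda: 1 + 1),
            controller.add(lambda: "done"),
        )
    finally:
        controller.close()


results = asyncio.run(main())
```

Because `add` is a coroutine, an async caller can simply `await` it; no second event loop has to be created or driven by the background thread.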

What changes were proposed in this pull request?

  • Merged the two classes in the worker_queue file into one.
  • Updated the add function in Controller, which is called by the call handler in promise.py, to be async. Because of this, the async call handler can now simply await this function.
  • Changed the functions in the Controller object that actually launch and monitor the executions to be sync instead of async.
    • This also means we can remove the separate internal event loop that it was holding onto. Note that this has the side effect of no longer sharing the FlyteContext since that is stored in a thread local context var.
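The side effect on context sharing follows from how context variables behave across threads. The sketch below uses a hypothetical `current_ctx` variable as a stand-in for a context object; it is not the flytekit API. It assumes a CPython build where a new thread starts with an empty context, which is the usual behavior on standard builds.

```python
import contextvars
import threading

# Hypothetical stand-in for a context object held in a context variable.
current_ctx = contextvars.ContextVar("current_ctx", default="default-context")

seen = {}


def worker() -> None:
    # Reads whatever value is visible in this thread's own context.
    seen["worker"] = current_ctx.get()


current_ctx.set("main-thread-context")
seen["main"] = current_ctx.get()

t = threading.Thread(target=worker)
t.start()
t.join()

# To share the caller's values deliberately, copy the context and run inside it.
ctx_copy = contextvars.copy_context()


def worker_with_copy() -> None:
    seen["copied"] = ctx_copy.run(current_ctx.get)


t2 = threading.Thread(target=worker_with_copy)
t2.start()
t2.join()
```

Anything a background thread needs from the caller's context therefore has to be passed in explicitly, for example as an argument or through a copied context as in `worker_with_copy`.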

How was this patch tested?

Tested using local sandbox and running the internal hpo example.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR refactors the eager task execution system by simplifying async/sync interaction and streamlining the Controller class. Major enhancements include improved thread safety, error handling, and execution state management. Changes include variable renaming for clarity (wi to work_item), restructured Python interface handling, and enhanced state management through method renaming. Implementation includes comprehensive test coverage and improved logging with context manager implementation.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 4

Signed-off-by: Yee Hing Tong <[email protected]>
@flyte-bot

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].


codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 64.18919% with 53 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (f634d53) to head (2e659ca).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flytekit/core/worker_queue.py | 64.58% | 44 Missing and 7 partials ⚠️ |
| flytekit/core/context_manager.py | 66.66% | 0 Missing and 1 partial ⚠️ |
| flytekit/core/promise.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3042      +/-   ##
==========================================
- Coverage   82.79%   79.47%   -3.33%     
==========================================
  Files           3      202     +199     
  Lines         186    21390   +21204     
  Branches        0     2756    +2756     
==========================================
+ Hits          154    16999   +16845     
- Misses         32     3616    +3584     
- Partials        0      775     +775     


flyte-bot commented Jan 14, 2025

Code Review Agent Run #be129b

Actionable Suggestions - 7
  • tests/flytekit/unit/core/test_worker_queue.py - 2
    • Consider adding hash equality verification · Line 251-251
    • Consider expanding WorkItem equality test cases · Line 251-251
  • flytekit/core/worker_queue.py - 3
  • flytekit/core/context_manager.py - 2
    • Consider platform signal handling support check · Line 998-999
    • Consider thread safety in signal handler · Line 998-999
Additional Suggestions - 3
  • flytekit/core/worker_queue.py - 3
    • Consider configurable sleep duration value · Line 395-395
    • Consider using enum comparison instead of strings · Line 113-113
    • Consider adding assertion error message · Line 354-354
Review Details
  • Files reviewed - 5 · Commit Range: 45e68ed..2e659ca
    • flytekit/core/context_manager.py
    • flytekit/core/promise.py
    • flytekit/core/worker_queue.py
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/unit/core/test_worker_queue.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

flyte-bot commented Jan 14, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change: Feature Improvement - Refactor Eager Task Execution System
Files Impacted:
  • context_manager.py - Added thread safety check for signal handler initialization
  • promise.py - Simplified async execution by removing explicit loop handling
  • worker_queue.py - Major refactor of Controller class with improved state management and thread safety
  • test_worker_queue.py - Added comprehensive tests for new Controller functionality; enhanced test coverage with error handling and work item equality tests
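The thread safety check for signal handler initialization relates to a Python constraint: `signal.signal` may only be called from the main thread of the main interpreter and raises `ValueError` elsewhere. A minimal sketch of such a guard follows; `install_sigint_handler` is a hypothetical helper, not the code in context_manager.py.

```python
import signal
import threading


def install_sigint_handler(handler) -> bool:
    """Install a SIGINT handler only when called from the main thread."""
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() would raise ValueError here, so skip installation.
        return False
    signal.signal(signal.SIGINT, handler)
    return True


# Calling the helper from a worker thread is a safe no-op.
results = []
worker = threading.Thread(
    target=lambda: results.append(install_sigint_handler(signal.default_int_handler))
)
worker.start()
worker.join()
```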

wi1 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
wi2 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
wi2.uuid = wi1.uuid
assert wi1 == wi2

Consider adding hash equality verification

The test case test_work_item_hashing_equality() manually sets the uuid to test equality but doesn't verify hash equality. Consider adding an assertion to verify that hash(wi1) == hash(wi2) since equal objects should have equal hashes.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- assert wi1 == wi2
+ assert wi1 == wi2
+ assert hash(wi1) == hash(wi2)

Code Review Run #be129b
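The rule behind this suggestion is that objects that compare equal must also hash equal, otherwise sets and dict keys misbehave. The sketch below shows one way to keep the two consistent. `Item` and its `item_id` field are hypothetical; the thread does not show how WorkItem defines `__eq__` or `__hash__`, so keying identity on a UUID is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(eq=False)  # eq=False: use the hand-written __eq__/__hash__ below
class Item:
    name: str
    item_id: UUID = field(default_factory=uuid4)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Item):
            return NotImplemented
        return self.item_id == other.item_id

    def __hash__(self) -> int:
        # Hash the same field that __eq__ compares so the two stay consistent.
        return hash(self.item_id)


a = Item("t1")
b = Item("t1")
distinct_before = a != b  # fresh items get different ids

b.item_id = a.item_id  # mirror the test's manual id assignment
equal_after = a == b
same_hash = hash(a) == hash(b)
unique_count = len({a, b})
```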


wi1 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
wi2 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
wi2.uuid = wi1.uuid
assert wi1 == wi2

Consider expanding WorkItem equality test cases

Consider adding test cases to verify WorkItem equality behavior when input_kwargs or wf_exec differ between instances. The current test only verifies equality for identical objects.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- assert wi1 == wi2
+ assert wi1 == wi2
+ # Test inequality cases
+ wi3 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={'param': 'value'})
+ wi3.uuid = wi1.uuid
+ assert wi1 != wi3 # Different input_kwargs
+ wi4 = WorkItem(entity=t1, wf_exec=fwex, input_kwargs={})
+ assert wi1 != wi4 # Different UUIDs

Code Review Run #be129b


Comment on lines 104 to 105
python_interface: typing.Optional[Interface] = None
uuid: typing.Optional[uuid.UUID] = None

Consider moving initialization to post_init

Consider initializing python_interface and uuid in __init__ or __post_init__ instead of using class-level defaults, since these are already being set in __post_init__.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- python_interface: typing.Optional[Interface] = None
- uuid: typing.Optional[uuid.UUID] = None
+ python_interface: typing.Optional[Interface]
+ uuid: typing.Optional[uuid.UUID]

Code Review Run #be129b
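For context on this suggestion: in a dataclass, a field without a default becomes a required `__init__` argument, so the `= None` defaults are what allow callers to omit values that `__post_init__` fills in later. A minimal sketch of that pattern, with a hypothetical `Job` class and field names:

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Job:
    name: str
    # None defaults keep these optional at construction time.
    label: Optional[str] = None
    job_id: Optional[UUID] = None

    def __post_init__(self) -> None:
        # Derive anything the caller left unset.
        if self.label is None:
            self.label = f"job:{self.name}"
        if self.job_id is None:
            self.job_id = uuid4()


auto = Job("train")  # both optional fields are derived
explicit = Job("train", label="custom")  # caller-supplied value is kept
```

Dropping the defaults, as in the suggested change, would make `Job("train")` raise a `TypeError` for the missing arguments.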


target=self._execute, daemon=True, name="controller-thread"
)
self.__runner_thread.start()
atexit.register(self._close, event=self.stopping_condition, runner=self.__runner_thread)

Consider more reliable cleanup mechanism

Consider using weakref.finalize() instead of atexit.register() for cleanup. atexit handlers may not run if the program exits abnormally, while weakref.finalize() provides more reliable cleanup.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- atexit.register(self._close, event=self.stopping_condition, runner=self.__runner_thread)
+ import weakref
+ weakref.finalize(self, self._close, event=self.stopping_condition, runner=self.__runner_thread)

Code Review Run #be129b
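For readers comparing the two mechanisms: `weakref.finalize` takes the tracked object, a callback, and its arguments, and the callback runs at most once, either when the object is garbage collected, when the finalizer is called explicitly, or at interpreter exit. The Python documentation notes that the callback and its arguments should not hold references to the tracked object, or it will never be collected. The sketch below uses hypothetical names (`Runner`, `_loop`, `_close`) and module-level functions to respect that constraint; it is not the Controller code.

```python
import threading
import weakref


def _loop(stopping: threading.Event) -> None:
    # Idle until asked to stop; wait() returns True once the event is set.
    while not stopping.wait(0.01):
        pass


def _close(stopping: threading.Event, thread: threading.Thread) -> None:
    stopping.set()
    thread.join()


class Runner:
    def __init__(self) -> None:
        self.stopping = threading.Event()
        self.thread = threading.Thread(target=_loop, args=(self.stopping,), daemon=True)
        self.thread.start()
        # Neither _close nor its arguments reference self.
        self.finalizer = weakref.finalize(self, _close, self.stopping, self.thread)


runner = Runner()
thread = runner.thread
runner.finalizer()  # explicit cleanup; later calls are no-ops
stopped = not thread.is_alive()
finalizer_alive = runner.finalizer.alive
```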


@wild-endeavor wild-endeavor changed the title from "Eager watch batch" to "Update eager task launching & monitoring" Jan 14, 2025
Member

@thomasjpfan thomasjpfan left a comment

Quick glance over

exc = EagerException(f"Error executing {work.entity.name} with error: {work.wf_exec.closure.error}")
work.set_error(exc)
return self.status == ItemStatus.SUCCESS or self.status == ItemStatus.FAILED
Member

ready seems like a weird name here. Should this be completed?

Contributor Author

left over from when it was an asyncio.Future. Let me change it to is_in_terminal_state
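A rough sketch of what the renamed check might look like is below. Only `ItemStatus.SUCCESS`, `ItemStatus.FAILED`, and `ItemStatus.RUNNING` appear in this thread; the `PENDING` member, the enum values, and the `Item` class are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ItemStatus(Enum):
    PENDING = "pending"  # assumed member, not shown in the thread
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class Item:
    status: ItemStatus = ItemStatus.PENDING

    @property
    def is_in_terminal_state(self) -> bool:
        # Terminal means no further status updates are expected.
        return self.status in (ItemStatus.SUCCESS, ItemStatus.FAILED)


running = Item(ItemStatus.RUNNING)
failed = Item(ItemStatus.FAILED)
```

The name describes the state being tested rather than borrowing future-style vocabulary, which is the concern raised in the review comment.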

elif update.wf_exec.closure.phase == WorkflowExecutionPhase.FAILED:
update.status = ItemStatus.FAILED
else:
assert item.status == ItemStatus.RUNNING
Member

Should this end up being a more detailed error just in case this is not true?

Contributor Author

replacing with a debug log line, just to capture the other arm of the conditional is all.
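One way to picture the change from an assertion to a debug log is the sketch below. The `Phase` enum, the `next_status` function, and the logger name are hypothetical stand-ins; the real code works with `WorkflowExecutionPhase` and work item objects that are only partly visible in this thread.

```python
import logging
from enum import Enum

logger = logging.getLogger("eager-sketch")  # hypothetical logger name


class Phase(Enum):  # hypothetical stand-in for the execution phase enum
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class ItemStatus(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def next_status(current: ItemStatus, phase: Phase) -> ItemStatus:
    """Map an observed execution phase onto the item's status."""
    if phase is Phase.SUCCEEDED:
        return ItemStatus.SUCCESS
    if phase is Phase.FAILED:
        return ItemStatus.FAILED
    # Non-terminal phase: record it instead of asserting, so an unexpected
    # state shows up in the logs without crashing the monitoring thread.
    logger.debug("Item is %s; execution phase is %s", current.name, phase.name)
    return current


still_running = next_status(ItemStatus.RUNNING, Phase.QUEUED)
finished = next_status(ItemStatus.RUNNING, Phase.SUCCEEDED)
```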

flyte-bot commented Jan 16, 2025

Code Review Agent Run #bfa426

Actionable Suggestions - 1
  • flytekit/core/worker_queue.py - 1
Review Details
  • Files reviewed - 3 · Commit Range: 2e659ca..7873024
    • flytekit/core/context_manager.py
    • flytekit/core/worker_queue.py
    • tests/flytekit/unit/core/test_worker_queue.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

@wild-endeavor wild-endeavor merged commit a465932 into master Jan 18, 2025
102 of 104 checks passed
shuyingliang pushed a commit to shuyingliang/flytekit that referenced this pull request Jan 22, 2025