Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): Cancel and Start again within 1s caused module not found [v2] #5007

Merged
merged 12 commits into from
Jan 3, 2025

Conversation

pyranota
Copy link
Collaborator

@pyranota pyranota commented Jan 3, 2025

when we cancel the job, it has up to 1 second window before actually getting cancelled (Because we pull db every 1s to ping it and find out if job has been cancelled). Thus the directory with wheel in windmill's cache cleaned only after that . If we manage to start new job during that period windmill might see that wanted wheel is already there (because we have not cleaned it yet) and write it to installed wheels, meanwhile previous job will clean that wheel. That's why some users could get "module not found" error if they restart job rapidly


Important

Fix race condition in python_executor.rs by serializing UV installations with a mutex to prevent 'module not found' errors when canceling and restarting jobs quickly.

  • Concurrency Control:
    • Introduces BUSY_WITH_UV_INSTALL mutex in python_executor.rs to serialize UV installations, preventing race conditions when canceling and restarting jobs quickly.
  • Error Handling:
    • Modifies handle_python_reqs() to lock UV installations, ensuring that a new job does not interfere with the cleanup of a previous job's resources.
    • Updates error handling in handle_python_reqs() to properly await stderr before checking process exit status, preventing hang-ups.
  • Misc:
    • Removes unused -q flag from spawn_uv_install() command arguments.

This description was created by Ellipsis for bc71804. It will automatically update as commits are pushed.

Copy link

cloudflare-workers-and-pages bot commented Jan 3, 2025

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: bc71804
Status: ✅  Deploy successful!
Preview URL: https://d9d7b544.windmill.pages.dev
Branch Preview URL: https://fix-py-module-not-found-lock.windmill.pages.dev

View logs

@pyranota pyranota marked this pull request as ready for review January 3, 2025 14:05
@pyranota pyranota requested a review from rubenfiszel as a code owner January 3, 2025 14:05
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to bc71804 in 2 minutes and 1 seconds

More details
  • Looked at 115 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. backend/windmill-worker/src/python_executor.rs:1359
  • Draft comment:
    Consider adding error handling to ensure the lock is released in case of an error during the installation process. This will prevent potential deadlocks.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment shows a misunderstanding of Rust's RAII principles. The Mutex lock will be automatically released when it goes out of scope, regardless of whether the code completes normally or encounters an error. This is handled by Rust's drop system. The _lock binding ensures the lock is held for the entire scope. No additional error handling is needed for lock release.
    Could there be some edge cases where the lock might not be released properly that I'm not considering? What if the process is killed abruptly?
    Even in case of abrupt process termination, Rust's drop system will still run destructors and release the lock. The futures::lock::Mutex is specifically designed for async contexts and handles cleanup properly. The comment is suggesting unnecessary additional error handling.
    The comment should be deleted as it suggests unnecessary error handling. Rust's RAII system already guarantees the lock will be released when it goes out of scope, even in error cases.
2. backend/windmill-worker/src/python_executor.rs:1528
  • Draft comment:
    Consider adding error handling to ensure the lock is released in case of an error during the installation process. This will prevent potential deadlocks.
  • Reason this comment was not posted:
    Marked as duplicate.
3. backend/windmill-worker/src/python_executor.rs:1778
  • Draft comment:
    Consider adding error handling to ensure the lock is released in case of an error during the installation process. This will prevent potential deadlocks.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_sKzt3bTPAw4nCHF5


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@rubenfiszel rubenfiszel merged commit c998d2c into main Jan 3, 2025
7 checks passed
@rubenfiszel rubenfiszel deleted the fix-py-module-not-found-lockless branch January 3, 2025 15:09
@github-actions github-actions bot locked and limited conversation to collaborators Jan 3, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants