Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

achimnol
Copy link
Member

@achimnol achimnol commented Nov 10, 2024

resolves #2177

Key Changes

  • ai.backend.models.resource_policy
    • Added and collected the unified ConcurrencyTracker API and constants related to querying and manipulation of the session concurrency_used tracker, with explicit Redis key prefix.
    • (TODO) Still, we need to eliminate direct references to the Redis keys in all other modules.
  • Switch from a single integer counter to a set of session IDs for checking the number of concurrently running sessions, to avoid duplicate counting when predicate checks are retries multiple times.
    • Previously, there was a "rollback" function of predicates for when the predicate check has failed, recalculating the entire concurrency values, but it didn't work well and caused too much database loads (see next).
    • (TODO) It would be ideal if predicate checks do NOT mutate any state such as concurrency_used values, but the session concurrency limit requires a special attention because the timing of session state change (PENDING → SCHEDULED) differs from the timing of predicate checks, which may result in exceed of concurrency limits when there are multiple sessions scheduled together within a single scheduler loop.
  • Remove unnecessary fullscan and recalculation of agent resources whenever we terminate kernels and sessions.
    • Since the concurrency tracker is for sessions, we don't need to run them with kernel termination events.
    • Signify whether the method performs a full-scan or not with the naming of recalculation methods.
  • Add explicit .with_for_update() query options to reduce serialization failures by explicitly locking the affected rows in update transactions.
    • e.g., when selecting agents to be updated with new resource occupation info

Notice on refactored RootContext in Manager

You may be overwhelmed by the number of changed files, but don't worry — Most changes under src/ai/backend/manager/api are the relocation of root context attributes:

  • root_ctx.{pidx,local_config,shared_config}root_ctx.c.{...} ("c" from configurations)
  • root_ctx.{db,redis_*}root_ctx.h.{...} ("h" from halfstack clients)
  • root_ctx.{concurrency_tracker,event_*,hook_*,*_monitor,...}root_ctx.g.{...} ("g" from global singletons)

For this PR, I had to pass the ConcurrencyTracker object to the AgentRegistry instance, and realized that we should parametrize the halfstack client objects for both. This led to restructuring the root context attributes like this.

This makes it easier to spot the mis-constructed abstraction that mixes .h access and .g access in the manager API layer. The manager model layer should implement abstract operations used by the manager API layer, and .g.concurrency_tracker is a first example to be added from the beginning in this way.

For future reference:

  • We need to move most shared_config methods to the models.config module.
    Actually, the migration of container registries from etcd (shared config) to postgres is in line with this direction.
  • We need to remove usage of .h in the API layer in most cases, replacing them with explicit model-layer APIs written as stateful objects in .g or stateless functions imported from ai.backend.models.
  • Currently AgentRegistry is treated specially; it does not belong to .g but just sits at RootContext directly. I haven't decided where to put it... 🤔

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation

@achimnol achimnol added this to the 24.12 milestone Nov 10, 2024
@achimnol achimnol added comp:manager Related to Manager component urgency:5 It is imperative that action be taken right away. labels Nov 10, 2024
@achimnol achimnol self-assigned this Nov 10, 2024
@achimnol achimnol force-pushed the topic/fix-concurrency_used-tracking branch from 9b93aed to c5b40fb Compare November 10, 2024 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:manager Related to Manager component urgency:5 It is imperative that action be taken right away.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor concurrency_used
1 participant