fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

achimnol · 2024-11-10T19:02:24Z

resolves #2177

Key Changes

ai.backend.models.resource_policy
- Added and collected the unified ConcurrencyTracker API and constants related to querying and manipulation of the session concurrency_used tracker, with explicit Redis key prefix.
- (TODO) Still, we need to eliminate direct references to the Redis keys in all other modules.
Switch from a single integer counter to a set of session IDs for checking the number of concurrently running sessions, to avoid duplicate counting when predicate checks are retries multiple times.
- Previously, there was a "rollback" function of predicates for when the predicate check has failed, recalculating the entire concurrency values, but it didn't work well and caused too much database loads (see next).
- (TODO) It would be ideal if predicate checks do NOT mutate any state such as concurrency_used values, but the session concurrency limit requires a special attention because the timing of session state change (PENDING → SCHEDULED) differs from the timing of predicate checks, which may result in exceed of concurrency limits when there are multiple sessions scheduled together within a single scheduler loop.
Remove unnecessary fullscan and recalculation of agent resources whenever we terminate kernels and sessions.
- Since the concurrency tracker is for sessions, we don't need to run them with kernel termination events.
- Signify whether the method performs a full-scan or not with the naming of recalculation methods.
Add explicit .with_for_update() query options to reduce serialization failures by explicitly locking the affected rows in update transactions.
- e.g., when selecting agents to be updated with new resource occupation info

Notice on refactored `RootContext` in Manager

You may be overwhelmed by the number of changed files, but don't worry — Most changes under src/ai/backend/manager/api are the relocation of root context attributes:

root_ctx.{pidx,local_config,shared_config} → root_ctx.c.{...} ("c" from configurations)
root_ctx.{db,redis_*} → root_ctx.h.{...} ("h" from halfstack clients)
root_ctx.{concurrency_tracker,event_*,hook_*,*_monitor,...} → root_ctx.g.{...} ("g" from global singletons)

For this PR, I had to pass the ConcurrencyTracker object to the AgentRegistry instance, and realized that we should parametrize the halfstack client objects for both. This led to restructuring the root context attributes like this.

This makes it easier to spot the mis-constructed abstraction that mixes .h access and .g access in the manager API layer. The manager model layer should implement abstract operations used by the manager API layer, and .g.concurrency_tracker is a first example to be added from the beginning in this way.

For future reference:

We need to move most shared_config methods to the models.config module.
Actually, the migration of container registries from etcd (shared config) to postgres is in line with this direction.
We need to remove usage of .h in the API layer in most cases, replacing them with explicit model-layer APIs written as stateful objects in .g or stateless functions imported from ai.backend.models.
Currently AgentRegistry is treated specially; it does not belong to .g but just sits at RootContext directly. I haven't decided where to put it... 🤔

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation

- fix: Apply with_for_update() in agent table queries in recalc txns

achimnol added this to the 24.12 milestone Nov 10, 2024

achimnol added comp:manager Related to Manager component urgency:5 It is imperative that action be taken right away. labels Nov 10, 2024

achimnol self-assigned this Nov 10, 2024

achimnol requested review from HyeockJinKim and kyujin-cho November 10, 2024 19:02

achimnol added 6 commits November 11, 2024 04:26

WIP: refactor concurrency tracker

f6089ab

WIP: fix init of root context

1d68053

WIP: update all root_ctx references in manager

5d742e1

fix: Remove unnecessary fullscan and recalc of agent resources

9b94dee

- fix: Apply with_for_update() in agent table queries in recalc txns

test: Update AgentRegistry initialization

818fff0

fix: minor mistakes

c5b40fb

achimnol force-pushed the topic/fix-concurrency_used-tracking branch from 9b93aed to c5b40fb Compare November 10, 2024 19:26

achimnol added 3 commits November 11, 2024 05:39

feat: Implement session-based concurrency update func

217a4c7

feat: Fill implementation of ConcurrencyTracker

d2d84df

feat: Implement session-scoped agent resource recalculation

a03ced5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

achimnol commented Nov 10, 2024 •

edited

Loading

fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

Are you sure you want to change the base?

fix: Miscalc and inefficient db access patterns of session concurrency limits #3064

Conversation

achimnol commented Nov 10, 2024 • edited Loading

Key Changes

Notice on refactored RootContext in Manager

achimnol commented Nov 10, 2024 •

edited

Loading

Notice on refactored `RootContext` in Manager