Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix selection to work also for the transcoding #3188

Merged
merged 4 commits into from
Oct 21, 2024

Conversation

leszko
Copy link
Contributor

@leszko leszko commented Oct 1, 2024

Changes to make the selection logic in ai-video work for transcoding as well (make it compatible with the current master:

  1. Remove setting 0 max price for unused capability:
    • go-livepeer always worked in a way that if you don't set max price (for transcoding), then it'll accept ANY price (not judging if it's good or bad)
    • This change is not strictly required for Studio, because we always set up max price anyway
    • I believe that it's for AI, because for AI we anyway always set up max price per capability (please double check this statement @ad-astra-video @rickstaa )
  2. Keep the selection logic from AI (effectively it will revert selection: Clear known sessions if none of them has good enough latency score #3086 from master)

@leszko leszko requested a review from rickstaa as a code owner October 1, 2024 11:10
@github-actions github-actions bot added the AI Issues and PR related to the AI-video branch. label Oct 1, 2024
@rickstaa
Copy link
Member

I believe that it's for AI, because for AI we anyway always set up max price per capability (please double check this statement @ad-astra-video @rickstaa )

Correct prices for AI jobs are a lot higher and the pools smaller so the risks of somebody gaming the system and slowly draining a Gateway wallet are a lot higher.

(CC: @rickstaa ) => If it's still an issue for AI, then let me know, I'll either extract it to separate selector or just revert this session change for transcoding (it was kind-of optimization anyway).

As discussed this does work out, thanks for the fix 👍🏻.

This commit adds tests that ensure that the Capability pricing related
functions are working correctly. It also ensures that nil values are
handled gracefully.
Copy link
Member

@rickstaa rickstaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leszko, although I verified that this line can be removed, the current logic does not work correctly on the AI side. Here's what happens now:

  1. The AI sets minLS to 0.
  2. When the first request comes in, all sessions are still in the s.unknownSessions map, so an unknown session is selected.
  3. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  4. The next request comes in, and it checks the latency score, which is always greater than 0. As a result, an unknown session is selected again.
  5. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  6. Now, let's assume there are no unknown sessions left. Since the latency score is still greater than 0, the system will attempt to select an unknown session. However, with no unknown sessions remaining, the Gateway software gets stuck.

The old code worked because it used known sessions as a fallback when no unknown sessions were available.

@rickstaa
Copy link
Member

@leszko, although I verified that this line can be removed, the current logic does not work correctly on the AI side. Here's what happens now:

  1. The AI sets minLS to 0.
  2. When the first request comes in, all sessions are still in the s.unknownSessions map, so an unknown session is selected.
  3. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  4. The next request comes in, and it checks the latency score, which is always greater than 0. As a result, an unknown session is selected again.
  5. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  6. Now, let's assume there are no unknown sessions left. Since the latency score is still greater than 0, the system will attempt to select an unknown session. However, with no unknown sessions remaining, the Gateway software gets stuck.

The old code worked because it used known sessions as a fallback when no unknown sessions were available.

The old code worked as follows:

  1. If there are no known sessions, it selects an unknown session.
  2. Once the session is completed, it records the latency and moves the session to the known sessions.
  3. It then selects the next unknown session, completes it, records the latency, and moves it to the known sessions.
  4. When no unknown sessions are left, it selects the known session with the lowest latency.

@leszko
Copy link
Contributor Author

leszko commented Oct 21, 2024

@leszko, although I verified that this line can be removed, the current logic does not work correctly on the AI side. Here's what happens now:

  1. The AI sets minLS to 0.
  2. When the first request comes in, all sessions are still in the s.unknownSessions map, so an unknown session is selected.
  3. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  4. The next request comes in, and it checks the latency score, which is always greater than 0. As a result, an unknown session is selected again.
  5. The request is completed, and the session is moved to the s.knownSessions map with its latency score.
  6. Now, let's assume there are no unknown sessions left. Since the latency score is still greater than 0, the system will attempt to select an unknown session. However, with no unknown sessions remaining, the Gateway software gets stuck.

The old code worked because it used known sessions as a fallback when no unknown sessions were available.

The old code worked as follows:

  1. If there are no known sessions, it selects an unknown session.
  2. Once the session is completed, it records the latency and moves the session to the known sessions.
  3. It then selects the next unknown session, completes it, records the latency, and moves it to the known sessions.
  4. When no unknown sessions are left, it selects the known session with the lowest latency.

I thought about it for a while. Let's go back to the logic we had before [1] (so the logic you're currently using for AI). I think this PR [1] is not good. We should not clear know sessions, but rather increase the refreshThreshold [2] (so effectively increase the size of knownSession pool).

[1] #3086
[2]

var maxRefreshSessionsThreshold = 8.0

@leszko leszko requested a review from rickstaa October 21, 2024 13:14
@leszko leszko merged commit b5d50db into ai-video Oct 21, 2024
8 checks passed
@leszko leszko deleted the rafal/ai-video-fix-transcoding-selection-2 branch October 21, 2024 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AI Issues and PR related to the AI-video branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants