
Implementation of AI Remote Worker (AI-323) (rebased) #3168

Merged

Conversation

ad-astra-video (Collaborator)

What does this pull request do? Explain your changes. (required)
This is PR 3088 rebased onto the current ai-video branch, with the Segment Anything 2 pipeline added. Like PR 3088, this was a community effort with contributions from multiple contributors in the ecosystem, for which I am very grateful. The detailed commits from PR 3088 were squashed during the rebase to make it cleaner and easier to complete. Credit for the contributions to the remote AI worker implementation will be included in the squashed commit on PR approval.

Specific updates (required)

  • See PR 3088 and tech spec that was updated after rebase completed
  • Tests were added for core and server parts of the ai-worker additions.
  • The implementation is backwards compatible with current gateways while laying the foundation to pass JSON job status/results back separately from binary outputs (e.g. images or videos).

How did you test each of these updates (required)
The current ai-worker has been used on the subnet by myself and Pon. I serve I2I, T2I, I2V, and upscale models using 5-6 separate AI workers on mainnet and have completed over 5,000 requests.

Checklist:

  • Read the contribution guide
  • make runs successfully
  • All tests in ./test.sh pass (tests in core/ai_test.go and server/ai_worker_test.go pass)
  • README and other documentation updated
  • Pending changelog updated

ad-astra-video and others added 26 commits September 5, 2024 08:50
…through. small update to aiResults endpoint and related test update
@github-actions github-actions bot added the AI Issues and PR related to the AI-video branch. label Sep 10, 2024
net/lp_rpc.proto (outdated, resolved)
core/livepeernode.go (outdated, resolved)
server/ai_worker.go (outdated, resolved)
This commit ensures that the AImodels startup error is only thrown for
AIWorkers.
This commit applies some small textual changes I noticed during my
review.
@rickstaa (Member) left a comment
LGTM. We can remove some redundant code in subsequent pull requests 👍🏻.

@rickstaa rickstaa merged commit e1fd2f2 into livepeer:ai-video Oct 18, 2024
7 of 8 checks passed
rickstaa added a commit that referenced this pull request Oct 18, 2024
rickstaa added a commit that referenced this pull request Oct 18, 2024
This commit adds a new AI remote worker node which can be used to split worker and orchestrator machines similar to how it is done on the transcoding side.

Co-authored-by: kyriediculous <[email protected]>
Co-authored-by: Reuben Rodrigues <[email protected]>
Co-authored-by: Rafał Leszko <[email protected]>
Co-authored-by: Rick Staa <[email protected]>
@@ -127,3 +132,102 @@ func ParseStepsFromModelID(modelID *string, defaultSteps float64) float64 {

return numInferenceSteps
}

// AddAICapabilities adds AI capabilities to the node.
func (n *LivepeerNode) AddAICapabilities(caps *Capabilities) {
Contributor

This function should not exist

Collaborator (Author)
Not following. Is this not similar to AddCapacity?

Are you saying it would be better to squash this into AddCapacity, without updating tests to cover this addition to the AddCapacity function?

The new function was added because transcoding does not use capabilities in this way right now, and I wanted to avoid adding complexity to a function used by the remote transcoder connection that did not need it.

}
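The per-pipeline/model bookkeeping described in the reply above could look roughly like the sketch below. This is a hypothetical illustration, not the PR's actual implementation; the type and method names (`aiCapacity`, `add`, `remove`) are invented for the example.

```go
package main

import "fmt"

// aiCapacity tracks how many worker slots exist per pipeline/modelID pair,
// the kind of state AddAICapabilities/RemoveAICapabilities would manage.
type aiCapacity struct {
	// capacity[pipeline][modelID] = number of available worker slots
	capacity map[string]map[string]int
}

func newAICapacity() *aiCapacity {
	return &aiCapacity{capacity: make(map[string]map[string]int)}
}

// add registers extra capacity for a pipeline/model, as a connecting
// remote AI worker would.
func (c *aiCapacity) add(pipeline, modelID string, n int) {
	if c.capacity[pipeline] == nil {
		c.capacity[pipeline] = make(map[string]int)
	}
	c.capacity[pipeline][modelID] += n
}

// remove drops capacity when a worker disconnects, deleting empty entries
// so the orchestrator no longer advertises the capability.
func (c *aiCapacity) remove(pipeline, modelID string, n int) {
	models, ok := c.capacity[pipeline]
	if !ok {
		return
	}
	models[modelID] -= n
	if models[modelID] <= 0 {
		delete(models, modelID)
	}
	if len(models) == 0 {
		delete(c.capacity, pipeline)
	}
}

func main() {
	c := newAICapacity()
	c.add("text-to-image", "sd-turbo", 2)
	c.add("text-to-image", "sd-turbo", 1)
	fmt.Println(c.capacity["text-to-image"]["sd-turbo"]) // 3
	c.remove("text-to-image", "sd-turbo", 3)
	fmt.Println(len(c.capacity)) // 0
}
```

This differs from the transcoder's single capacity counter in that capability advertisement is tied to specific models, which is why a separate function avoids complicating AddCapacity.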

// RemoveAICapabilities removes AI capabilities from the node.
func (n *LivepeerNode) RemoveAICapabilities(caps *Capabilities) {
Contributor

This function should not exist

Collaborator (Author)
Not following. Is this not similar to RemoveCapacity?

Are you saying it would be better to squash this into RemoveCapacity, without updating tests to cover this addition to the RemoveCapacity function?

The new function was added because transcoding does not use capabilities in this way right now, and I wanted to avoid adding complexity to a function used by the remote transcoder connection that did not need it.

return fmt.Errorf("failed to reserve AI capability capacity, pipeline does not exist pipeline=%v modelID=%v", pipeline, modelID)
}

func (n *LivepeerNode) ReleaseAICapability(pipeline string, modelID string) error {
Contributor

This function should not exist

Collaborator (Author)
Not following, please explain. The intent here was to start managing capacity for each pipeline/modelID, and also to support workers that have multiple GPUs serving the same pipeline/modelID behind one AI worker.

}
}

func (n *LivepeerNode) ReserveAICapability(pipeline string, modelID string) error {
Contributor

This function should not exist

Collaborator (Author)
Not following, please explain. The intent here was to start managing capacity for each pipeline/modelID, and also to support workers that have multiple GPUs serving the same pipeline/modelID behind one AI worker.

}
}
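The reserve/release pattern discussed in these threads could be sketched as follows. This is a hypothetical illustration under the intent the author describes (one slot per GPU, reserved before dispatch and released when the job finishes), not the PR's actual code; `capacityManager` and its methods are invented names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// capacityManager holds a slot count per pipeline/modelID pair,
// e.g. one slot per GPU serving that model behind an AI worker.
type capacityManager struct {
	mu    sync.Mutex
	slots map[string]int // key: pipeline + "_" + modelID
}

var errNoCapacity = errors.New("no capacity for pipeline/modelID")

// reserve takes a slot before a job is dispatched, failing when every
// GPU serving this pipeline/model is already busy.
func (m *capacityManager) reserve(pipeline, modelID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	key := pipeline + "_" + modelID
	if m.slots[key] <= 0 {
		return fmt.Errorf("%w: pipeline=%v modelID=%v", errNoCapacity, pipeline, modelID)
	}
	m.slots[key]--
	return nil
}

// release returns the slot once the job completes or errors out.
func (m *capacityManager) release(pipeline, modelID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.slots[pipeline+"_"+modelID]++
}

func main() {
	m := &capacityManager{slots: map[string]int{"text-to-image_sd-turbo": 1}}
	fmt.Println(m.reserve("text-to-image", "sd-turbo")) // <nil>: slot taken
	fmt.Println(m.reserve("text-to-image", "sd-turbo") != nil) // true: no slots left
	m.release("text-to-image", "sd-turbo")
	fmt.Println(m.reserve("text-to-image", "sd-turbo")) // <nil>: slot free again
}
```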

type RemoteAIWorkerManager struct {
Contributor

Wrong file structure

Collaborator (Author)
Most AI related things are in separate files right?

Can you help me understand why keeping this in the ai_worker.go files does not help keep AI and transcoding development from interfering with each other for anyone developing on one or the other?


// Called by the aiworker to register to an orchestrator. The orchestrator
// notifies registered aiworkers of jobs as they come in.
rpc RegisterAIWorker(RegisterAIWorkerRequest) returns (stream NotifyAIJob);
Contributor

RegisterAIWorker request doesn't need to be a newly defined type

Collaborator (Author)
A new RegisterAIWorkerRequest was added because, in my opinion, the generic capacity field is not helpful when trying to manage GPUs for AI jobs. For transcoding, one GPU can handle multiple requests at a time and there was only one job type. For AI, my experience is that most models slow down significantly when more than one request is fed to them concurrently.

Do you think a generic capacity field set at launch of the ai-worker would let the orchestrator appropriately manage the ai workers?

Do you think that AI workers and remote transcoders would always have the same requirements when connecting to the orchestrator?
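The distinction the author draws (a single generic capacity number versus per-pipeline/model capacity) can be illustrated with a sketch. The struct and field names below are invented for illustration and are not the PR's actual protobuf definitions.

```go
package main

import "fmt"

// modelCapability advertises capacity for one pipeline/model pair,
// typically one slot per GPU, since most AI models slow down under
// concurrent requests.
type modelCapability struct {
	Pipeline string
	ModelID  string
	Capacity int
}

// registerAIWorkerRequest is a hypothetical registration payload: unlike a
// remote transcoder's single capacity number, it lists what the worker can
// actually serve and at what concurrency.
type registerAIWorkerRequest struct {
	Secret       string
	Capabilities []modelCapability
}

func main() {
	req := registerAIWorkerRequest{
		Secret: "orch-secret",
		Capabilities: []modelCapability{
			{Pipeline: "text-to-image", ModelID: "sd-turbo", Capacity: 2},
			{Pipeline: "image-to-video", ModelID: "svd-xt", Capacity: 1},
		},
	}
	for _, c := range req.Capabilities {
		fmt.Printf("%s/%s: %d slot(s)\n", c.Pipeline, c.ModelID, c.Capacity)
	}
}
```

With this shape the orchestrator can route a job only to workers advertising the exact pipeline/model, rather than assuming any registered worker can take any job.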

"github.com/livepeer/lpms/ffmpeg"
)

var ErrRemoteWorkerTimeout = errors.New("Remote worker took too long")
Contributor

Why is this exported?

Collaborator (Author)
I followed how other similar errors are implemented in the core package. The linked errors and the errors above could probably be changed to not be exported.

var ErrRemoteTranscoderTimeout = errors.New("Remote transcoder took too long")

)

var ErrRemoteWorkerTimeout = errors.New("Remote worker took too long")
var ErrNoCompatibleWorkersAvailable = errors.New("no workers can process job requested")
Contributor

Why is this exported?

Collaborator (Author)
I followed how other similar errors are implemented in the core package. The linked errors and the errors above could probably be changed to not be exported.

var ErrRemoteTranscoderTimeout = errors.New("Remote transcoder took too long")


var ErrRemoteWorkerTimeout = errors.New("Remote worker took too long")
var ErrNoCompatibleWorkersAvailable = errors.New("no workers can process job requested")
var ErrNoWorkersAvailable = errors.New("no workers available")
Contributor

Why is this exported?

Collaborator (Author)
I followed how other similar errors are implemented in the core package. The linked errors and the errors above could probably be changed to not be exported.

var ErrRemoteTranscoderTimeout = errors.New("Remote transcoder took too long")

@ad-astra-video ad-astra-video deleted the ai-video-remoteaiworker-pr-rebase branch October 19, 2024 13:18
rickstaa added a commit that referenced this pull request Oct 21, 2024
This commit adds a new AI remote worker node which can be used to split worker and orchestrator machines similar to how it is done on the transcoding side.

Co-authored-by: Rafał Leszko <[email protected]>
Co-authored-by: Rick Staa <[email protected]>
Labels
AI Issues and PR related to the AI-video branch.
5 participants