Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate training into celery and upload results to s3 #1157

Merged
merged 13 commits into from
May 14, 2024

Conversation

andrewpeng02
Copy link
Contributor

@andrewpeng02 andrewpeng02 commented Apr 9, 2024

Migrate training into celery and upload results to s3

Github Issue Number Here: #1136
What user problem are we solving?

Currently, we do the training during the HTTP request. I plan on changing the train HTTP endpoints by scheduling a train job via Celery and returning the job id in the request. This offers numerous advantages

Long training tasks (>2 min) shouldn't be done in an HTTP request. Scheduling the training job will allow the endpoint to return quickly
Eventually, we can decouple the backend with the training, so that we can use cheaper EC2 instances for the Django backend, and GPU instances for the actual training
Notifying the user will be done in websockets in this issue (#920 (comment)), for now I'll create an HTTP endpoint to retrieve the training results that the user can ping.

What solution does this PR provide?
Training jobs are executed by a celery worker, which polls from the queue set up in aws sqs. Currently you have to run the celery worker locally, but eventually the workers will run in the g4dn.xlarge ec2 instances. In this pr, I added celery endpoints which the backend will call. Then, the frontend will request the training results data from GET /api/training/results/{trainspaceId}, and it'll display the data.

I also moved some files into /celery to make it more clear that the celery worker will operate in there (but it'll also access files in the django app)

Testing Methodology

  1. Install the new backend dependencies
  2. Start the frontend using dlp-cli frontend start
  3. From the training/ folder, start the backend using AWS_PROFILE=dlp docker compose up --build
  4. Start the sst locally, using AWS_PROFILE=sst
  5. Tested training tabular regression and classification data
newmovie.mov
Screenshot 2024-04-09 at 3 47 28 PM

Any other considerations

@andrewpeng02 andrewpeng02 marked this pull request as ready for review April 9, 2024 20:17
@andrewpeng02 andrewpeng02 requested a review from a team as a code owner April 9, 2024 20:17
Copy link

sonarcloud bot commented Apr 9, 2024

Quality Gate Passed Quality Gate passed

Issues
3 New issues
54 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@karkir0003
Copy link
Member

@andrewpeng02 Question RE testing video:

  1. After you submitted the training requests, were the results shown just dummy data for now or was it actual?

Copy link
Member

@karkir0003 karkir0003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed PR here. Main question I had was:

  1. How does the data flow work from frontend sending a train request to backend to passing the data to celery? What will be the data flow in production?
  2. Any high level diagram you could provide/document/link me to in order to understand how celery works and how it's scalable for our use case?

@andrewpeng02
Copy link
Contributor Author

  • The results shown were actual data, not dummy
  • Frontend sends training params to backend -> backend submits training job with training params, user id, trainspace id to the queue -> celery worker picks up a job from the queue and trains -> celery worker saves results to s3, which the frontend can request from sst
  • I believe the architecture diagram here should answer your question

Copy link
Member

@karkir0003 karkir0003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better. followed up on the comments

frontend/src/features/Train/redux/trainspaceApi.ts Outdated Show resolved Hide resolved
frontend/src/pages/train/[train_space_id].tsx Outdated Show resolved Hide resolved
training/training/celeryconfig.py Outdated Show resolved Hide resolved
training/training/celeryconfig.py Outdated Show resolved Hide resolved
training/training/core/celery/worker.py Outdated Show resolved Hide resolved
@andrewpeng02 andrewpeng02 force-pushed the feature-1136-celery branch 2 times, most recently from e7ce7d6 to c6cffd1 Compare May 10, 2024 16:10
…nd request from endpoint

🎨 Auto-generated directory tree for repository in Architecture.md

🎨 Format Python code with psf/black

slight reformatting

add refetch

reduce duplicated code
🎨 Auto-generated directory tree for repository in Architecture.md
Copy link
Member

@karkir0003 karkir0003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded to some of the followups

Copy link
Contributor

@codingwithsurya codingwithsurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just took a look -- this looks awesome!

Copy link
Member

@karkir0003 karkir0003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AWESOME PR @andrewpeng02 . I left a few nits in the PR, but looks really good.

Quick question: Will there be any documentation to understand the development process for say adding a new training type now that we have a celery based structure?

@andrewpeng02
Copy link
Contributor Author

AWESOME PR @andrewpeng02 . I left a few nits in the PR, but looks really good.

Quick question: Will there be any documentation to understand the development process for say adding a new training type now that we have a celery based structure?

It shouldn't be difficult to support new training types, you should be able to figure it out by referencing the existing tabular and image training jobs

Copy link

sonarcloud bot commented May 14, 2024

Quality Gate Passed Quality Gate passed

Issues
2 New issues
42 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.8% Duplication on New Code

See analysis details on SonarCloud

@andrewpeng02 andrewpeng02 added this pull request to the merge queue May 14, 2024
Merged via the queue into nextjs with commit 54a800c May 14, 2024
19 checks passed
@andrewpeng02 andrewpeng02 deleted the feature-1136-celery branch May 14, 2024 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants