
post training CLI #51

Merged
merged 14 commits into main from post_training on Dec 18, 2024

Conversation

@SLR722 (Contributor) commented on Nov 28, 2024

What does this PR do?

Add post-training related CLI commands to the client SDK.

user experience

Since kicking off a supervised fine-tuning job requires setting up several configs and hyper-parameters, we provide an example script under examples/post_training/supervised_fine_tune_client.py to make the user experience friendlier.
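
For orientation, here is a minimal sketch of what such a script could look like. The exact signature of `post_training.supervised_fine_tune` and the shape of the config payloads are assumptions on my part; the real script in examples/post_training/supervised_fine_tune_client.py is the source of truth.

```python
# Illustrative sketch only -- parameter names and config shapes are assumptions,
# not the SDK's confirmed signature.
import sys

from llama_stack_client import LlamaStackClient


def main() -> None:
    # Mirrors the CLI usage below: host, port, job UUID, model
    host, port, job_uuid, model = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]
    client = LlamaStackClient(base_url=f"http://{host}:{port}")

    # The empty dicts are placeholders for the LoRA / optimizer / data configs
    # that the real example script fills in.
    job = client.post_training.supervised_fine_tune(
        job_uuid=job_uuid,
        model=model,
        algorithm_config={},
        training_config={},
        hyperparam_search_config={},
        logger_config={},
    )
    print(job)


if __name__ == "__main__":
    main()
```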

test

kick off training
python supervised_fine_tune_client.py "devgpu018.nha2.facebook.com" 5000 "1236" "meta-llama/Llama-3.2-3B-Instruct"
(screenshot of the training kickoff output)

get job list
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training list

(screenshot of the job list output)

get job status
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training status --job-uuid "1235"

(screenshot of the job status output)

get job artifacts
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training artifacts --job-uuid "1235"
(screenshot of the job artifacts output)
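
The same job-management operations should also be reachable programmatically through the SDK. A rough sketch, assuming the CLI subcommands map onto `post_training.job.list` / `status` / `artifacts` methods (the exact method names are an assumption):

```python
# Sketch only -- the job.* method names are assumed to mirror the CLI subcommands.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://devgpu018.nha2.facebook.com:5000")

jobs = client.post_training.job.list()                           # post_training list
status = client.post_training.job.status(job_uuid="1235")        # post_training status
artifacts = client.post_training.job.artifacts(job_uuid="1235")  # post_training artifacts

for item in (jobs, status, artifacts):
    print(item)
```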

@SLR722 SLR722 changed the title [WIP] post training CLI [WIP] post_training CLI Nov 28, 2024
@SLR722 SLR722 changed the title [WIP] post_training CLI post_training CLI Dec 9, 2024
@SLR722 SLR722 changed the title post_training CLI [WIP] post_training CLI Dec 11, 2024
@SLR722 SLR722 marked this pull request as draft December 11, 2024 05:01
ashwinb pushed a commit to meta-llama/llama-stack that referenced this pull request Dec 13, 2024
### Context 
This is the first of a series of PRs that integrate torchtune with llama-stack as the meta reference post-training implementation. For the MVP, we will focus on single-device LoRA SFT.

Though this PR is still WIP, we want to get early feedback on the high-level design of this skeleton while we keep working on several details.

### Scope
To limit the scope of this PR, we focus on the skeleton of the
implementation.

**What is included?**
- refine the post-training SFT APIs
- skeleton of the supervised_fine_tune implementation. We verified that we can call the supervised_fine_tune API successfully from the llama-stack client SDK (client-side PR: meta-llama/llama-stack-client-python#51)
- a very basic single-device LoRA training recipe based on torchtune core components
- parity check with the torchtune library and post-training API unit tests

**What is not included?**
- implementation of the other job-management and get-training-artifacts APIs (separate PR)
- refactor of the meta reference inference logic to support eval on the finetuned model (separate PR)
- several necessary pieces of functionality in the training recipe, such as logging, validation, etc. (separate PR)
- interop with telemetry for tracing and metrics logging; currently we temporarily log to local disk (separate PR)

### Testing
**e2e test**
Although we haven't added detailed testing and a numerical parity check with torchtune yet, we did a simple E2E test from client to server:
1. Set up the server with `llama stack build --template experimental-post-training --image-type conda` and `llama stack run experimental-post-training`
2. On the client, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training supervised_fine_tune`
3. Training finishes successfully. On the server side, we get the finetuned checkpoints under the output dir. On the client side, we get the job UUID.

server:
![server screenshot](https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d)

client:
![client screenshot](https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838)

**parity check**
The torchtune dataloader output and the llama-stack post-training dataloader output are the same:
![dataloader parity screenshot](https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee)

torchtune LoRA SFT and llama-stack post-training LoRA SFT on the alpaca dataset with the llama3.2 3B instruct model numerically match:

![parity screenshot 1](https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99)

![parity screenshot 2](https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443)
**unit test**
![Uploading Screenshot 2024-12-09 at 1.35.10 PM.png…]()
@SLR722 SLR722 changed the title [WIP] post_training CLI post training CLI Dec 18, 2024
@SLR722 SLR722 marked this pull request as ready for review December 18, 2024 19:44
@ashwinb (Contributor) commented on Dec 18, 2024

I am surprised about the formatting changes to the generated code. We should not be formatting any Stainless-generated code, because doing so will cause perpetual conflicts.

@SLR722 (Contributor, Author) commented on Dec 18, 2024

Let me revert these formatting changes and figure out how to keep the formatter from touching those files.

@SLR722 (Contributor, Author) commented on Dec 18, 2024

Addressed the comment by disabling auto-format-on-save in the IDE.
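
For reference, one way to do that (assuming the IDE in question is VS Code, which the comment does not specify) is to turn off format-on-save in the workspace settings:

```jsonc
// .vscode/settings.json -- illustrative only; assumes VS Code
{
  "editor.formatOnSave": false
}
```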

@SLR722 SLR722 merged commit b982fec into main Dec 18, 2024
3 checks passed
@SLR722 SLR722 deleted the post_training branch December 18, 2024 23:06