
post training CLI #51

Merged
merged 14 commits into main from post_training on Dec 18, 2024

Conversation

@SLR722 (Contributor) commented on Nov 28, 2024

What does this PR do?

Add post-training related CLI commands to the client SDK.

user experience

Since kicking off a supervised fine-tuning job requires setting up several configs and hyper-parameters, we provide an example script under examples/post_training/supervised_fine_tune_client.py to make the user experience friendlier.
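
For orientation, here is a minimal sketch of what such a script could look like. The exact signature of `post_training.supervised_fine_tune` and the shape of the config payloads are assumptions on my part; the real script in examples/post_training/supervised_fine_tune_client.py is the source of truth.

```python
# Illustrative sketch only -- parameter names and config shapes are assumptions,
# not the SDK's confirmed signature.
import sys

from llama_stack_client import LlamaStackClient


def main() -> None:
    # Mirrors the CLI usage below: host, port, job UUID, model
    host, port, job_uuid, model = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]
    client = LlamaStackClient(base_url=f"http://{host}:{port}")

    # The empty dicts are placeholders for the LoRA / optimizer / data configs
    # that the real example script fills in.
    job = client.post_training.supervised_fine_tune(
        job_uuid=job_uuid,
        model=model,
        algorithm_config={},
        training_config={},
        hyperparam_search_config={},
        logger_config={},
    )
    print(job)


if __name__ == "__main__":
    main()
```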

test

kick off training
python supervised_fine_tune_client.py "devgpu018.nha2.facebook.com" 5000 "1236" "meta-llama/Llama-3.2-3B-Instruct"
(screenshot of the training kickoff output)

get job list
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training list

(screenshot of the job list output)

get job status
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training status --job-uuid "1235"

(screenshot of the job status output)

get job artifacts
llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training artifacts --job-uuid "1235"
(screenshot of the job artifacts output)
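
The same job-management operations should also be reachable programmatically through the SDK. A rough sketch, assuming the CLI subcommands map onto `post_training.job.list` / `status` / `artifacts` methods (the exact method names are an assumption):

```python
# Sketch only -- the job.* method names are assumed to mirror the CLI subcommands.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://devgpu018.nha2.facebook.com:5000")

jobs = client.post_training.job.list()                           # post_training list
status = client.post_training.job.status(job_uuid="1235")        # post_training status
artifacts = client.post_training.job.artifacts(job_uuid="1235")  # post_training artifacts

for item in (jobs, status, artifacts):
    print(item)
```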

@SLR722 SLR722 changed the title [WIP] post training CLI [WIP] post_training CLI Nov 28, 2024
@SLR722 SLR722 changed the title [WIP] post_training CLI post_training CLI Dec 9, 2024
@SLR722 SLR722 changed the title post_training CLI [WIP] post_training CLI Dec 11, 2024
@SLR722 SLR722 marked this pull request as draft December 11, 2024 05:01
ashwinb pushed a commit to meta-llama/llama-stack that referenced this pull request Dec 13, 2024
### Context 
This is the first of a series of PRs that integrate torchtune with llama-stack as the meta reference post-training implementation. For the MVP, we will focus on single-device LoRA SFT.

Though this PR is still WIP, we want to get early feedback on the high-level design of this skeleton while we keep working on several details.

### Scope
To limit the scope of this PR, we focus on the skeleton of the
implementation.

**What is included?**
- refine the post-training SFT APIs
- skeleton of the supervised_fine_tune implementation. We verified that we can call the supervised_fine_tune API successfully from the llama-stack client SDK (client-side PR: meta-llama/llama-stack-client-python#51)
- a very basic single-device LoRA training recipe based on torchtune core components
- parity check with the torchtune library and post-training API unit tests

**What is not included?**
- implementation of the other job-management and get-training-artifacts APIs (separate PR)
- refactor of the meta reference inference logic to support eval on the finetuned model (separate PR)
- several necessary pieces of functionality in the training recipe, such as logging, validation, etc. (separate PR)
- interop with telemetry for tracing and metrics logging; currently we temporarily log to local disk (separate PR)

### Testing
**e2e test**
Although we haven't added detailed testing and a numerical parity check with torchtune yet, we did a simple E2E test from client to server:
1. Set up the server with `llama stack build --template experimental-post-training --image-type conda` and `llama stack run experimental-post-training`
2. On the client, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training supervised_fine_tune`
3. Training finishes successfully. On the server side, we get the finetuned checkpoints under the output dir. On the client side, we get the job UUID.

server:
![server screenshot](https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d)

client:
![client screenshot](https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838)

**parity check**
The torchtune dataloader output and the llama-stack post-training dataloader output are the same:
![dataloader parity screenshot](https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee)

torchtune LoRA SFT and llama-stack post-training LoRA SFT on the alpaca dataset with the llama3.2 3B instruct model numerically match:

![parity screenshot 1](https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99)

![parity screenshot 2](https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443)
**unit test**
![Uploading Screenshot 2024-12-09 at 1.35.10 PM.png…]()
@SLR722 SLR722 changed the title [WIP] post_training CLI post training CLI Dec 18, 2024
@SLR722 SLR722 marked this pull request as ready for review December 18, 2024 19:44
@ashwinb (Contributor) commented on Dec 18, 2024

I am surprised about the formatting changes to the generated code. We should not be formatting any Stainless-generated code, because doing so will cause perpetual conflicts.

@SLR722 (Contributor, Author) commented on Dec 18, 2024

Let me revert these formatting changes and figure out how to keep the formatter from touching those files.

@SLR722 (Contributor, Author) commented on Dec 18, 2024

Addressed the comment by disabling auto-format-on-save in the IDE.
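
For reference, one way to do that (assuming the IDE in question is VS Code, which the comment does not specify) is to turn off format-on-save in the workspace settings:

```jsonc
// .vscode/settings.json -- illustrative only; assumes VS Code
{
  "editor.formatOnSave": false
}
```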

@SLR722 SLR722 merged commit b982fec into main Dec 18, 2024
3 checks passed
@SLR722 SLR722 deleted the post_training branch December 18, 2024 23:06