
[1/n] torchtune <> llama-stack integration skeleton #540

Merged: 21 commits merged into main on Dec 13, 2024
Conversation

@SLR722 (Contributor) commented on Nov 27, 2024:

Context

This is the first of a series of PRs that integrate torchtune with llama-stack as the meta-reference post-training implementation. For the MVP, we focus on single-device LoRA SFT.

Though this PR is still WIP, we want early feedback on the high-level design of the skeleton while we keep working on the details.

Scope

To limit the scope of this PR, we focus on the skeleton of the implementation.

What is included?

  • refine the post-training SFT APIs
  • skeleton of the supervised_fine_tune implementation. We verified that we can call the supervised_fine_tune API successfully from the llama-stack client SDK (client-side PR: post training CLI llama-stack-client-python#51)
  • a very basic single-device LoRA training recipe built on torchtune core components (a rough sketch follows this list)
  • parity check against the torchtune library and a post-training API unit test
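For orientation, the sketch below shows roughly what a single-device LoRA SFT step built on torchtune core components looks like. This is not the recipe code in this PR; the model builder, LoRA settings, and training-loop details are illustrative assumptions.

```python
# Hedged sketch of a single-device LoRA SFT step using torchtune core
# components; not the recipe in this PR. Model choice, LoRA target modules,
# and hyperparameters are illustrative.
import torch
from torchtune.models.llama3_2 import lora_llama3_2_3b
from torchtune.modules.peft import get_adapter_params, set_trainable_params

# Build a Llama 3.2 3B model with LoRA adapters on the attention projections.
model = lora_llama3_2_3b(
    lora_attn_modules=["q_proj", "v_proj"],  # illustrative target modules
    lora_rank=8,
    lora_alpha=16,
)

# Freeze base weights and train only the adapter parameters.
set_trainable_params(model, get_adapter_params(model))

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)

def train_step(tokens: torch.Tensor, labels: torch.Tensor) -> float:
    """One SFT step: forward pass, token-level cross-entropy, backprop."""
    logits = model(tokens)  # [batch, seq_len, vocab]
    loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```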

What is not included?

  • implementation of the other job-management and get-training-artifacts APIs (separate PR)
  • refactor of the meta-reference inference logic to support eval on the fine-tuned model (separate PR)
  • several pieces of functionality needed in the training recipe, such as logging and validation (separate PR)
  • interop with telemetry for tracing and metrics logging; for now we temporarily log to local disk (separate PR)

Testing

E2E test
Although we haven't added detailed tests or a numerical parity check with torchtune yet, we ran a simple E2E test from client to server:

  1. Set up the server with `llama stack build --template experimental-post-training --image-type conda` and `llama stack run experimental-post-training`.
  2. On the client, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 post_training supervised_fine_tune` (a Python SDK equivalent is sketched after this list).
  3. Training finishes successfully. On the server side, the fine-tuned checkpoints show up under the output dir; on the client side, we get back the job uuid.
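For reference, the same call can be made programmatically from the llama-stack client SDK. This is a hedged sketch: the method name follows the CLI subcommand above, but the exact parameter names and value shapes are assumptions, not the final API.

```python
# Hedged sketch of invoking the supervised_fine_tune API from the Python
# client SDK; parameter names and value shapes are illustrative assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")  # assumed local server

response = client.post_training.supervised_fine_tune(
    job_uuid="sft-job-0001",                       # assumed client-chosen job id
    model="Llama3.2-3B-Instruct",                  # model used in the parity check below
    algorithm_config={"type": "LoRA", "rank": 8},  # illustrative LoRA settings
    training_config={"n_epochs": 1},               # illustrative training knobs
)
print(response)  # expected to echo back a handle containing the job uuid
```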

server: [screenshot]

client: [screenshot]

parity check
The torchtune dataloader output and the llama-stack post-training dataloader output are the same. [screenshot]

torchtune LoRA SFT and llama-stack post-training LoRA SFT on the alpaca dataset with the Llama 3.2 3B Instruct model are a numerical match. [screenshots]

unit test: [screenshot]

@facebook-github-bot added the CLA Signed label on Nov 27, 2024
@SLR722 changed the title from "Post training v2" to "[WIP] torchtune <> llama-stack integration" on Nov 27, 2024
@SLR722 changed the title to "[1/n] torchtune <> llama-stack integration skeleton" on Dec 3, 2024
@raghotham (Contributor) commented: added some initial comments

@SLR722 marked this pull request as ready for review on December 3, 2024
Inline review comment on the dataset schema definition:

EXPECTED_DATASET_SCHEMA: Dict[str, List[Dict[str, ParamType]]] = {
    "alpaca": [

A reviewer (Contributor) asked:

A few questions:

  1. what do these three options mean?
  2. what does instruction mean? does it mean system_prompt?
  3. do you think we can use the types we have in the rest of our system -- for example, how is a dialog represented? We should be able to re-use the UserMessage, SystemMessage types we have in the rest of the system. Evals uses some of them.

@SLR722 (Author) replied on Dec 11, 2024:

> what do these three options mean?

The three options are the three eligible alpaca dataset schemas: the 'input' and 'text' columns are optional for the alpaca schema (see https://github.com/pytorch/torchtune/blob/9cfa28835246a4c1ac4449e703eae8f49227db55/torchtune/data/_messages.py#L696 and https://huggingface.co/datasets/tatsu-lab/alpaca?row=0). The sketch below spells out one plausible reading.
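A hedged sketch of those three variants; StringType and the exact import path are assumptions rather than the code in this PR.

```python
# One plausible spelling of the three eligible alpaca schemas described above:
# the full column set, then variants with the optional 'text' and 'input'
# columns dropped. StringType and the import path are assumptions.
from llama_stack.apis.common.type_system import StringType

EXPECTED_DATASET_SCHEMA = {
    "alpaca": [
        {  # full alpaca schema
            "instruction": StringType(),
            "input": StringType(),
            "output": StringType(),
            "text": StringType(),
        },
        {  # optional 'text' column omitted
            "instruction": StringType(),
            "input": StringType(),
            "output": StringType(),
        },
        {  # both optional columns omitted
            "instruction": StringType(),
            "output": StringType(),
        },
    ]
}
```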

> what does instruction mean? does it mean system_prompt?

instruction is different from system_prompt here. In the alpaca dataset, 'instruction' pairs with 'input' to form the user prompt (example: https://github.com/pytorch/torchtune/blob/9cfa28835246a4c1ac4449e703eae8f49227db55/torchtune/data/_messages.py#L696).

> do you think we can use the types we have in the rest of our system -- for example, how is a dialog represented? We should be able to re-use the UserMessage, SystemMessage types we have in the rest of the system. Evals uses some of them.

torchtune has its own Message definition in its data transforms (https://github.com/pytorch/torchtune/blob/9cfa28835246a4c1ac4449e703eae8f49227db55/torchtune/data/_messages.py#L724). I lean toward directly importing the torchtune data transform into the stack and reusing its Message type. For dataset schema validation, I'm following how eval does it:

async def validate_eval_input_dataset_schema(self, dataset_id: str) -> None:
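For concreteness, a minimal sketch of what the analogous check could look like on the post-training side, assuming the implementation holds a datasets-API handle; the method, attribute, and helper names below are illustrative, not the code in this PR.

```python
# Hedged sketch of dataset-schema validation in the spirit of the eval method
# referenced above; the datasets-API call and attribute names are assumptions.
async def validate_input_dataset_schema(self, dataset_id: str) -> None:
    dataset_def = await self.datasets_api.get_dataset(dataset_id=dataset_id)
    if not dataset_def.dataset_schema:
        raise ValueError(f"Dataset {dataset_id} does not have a schema defined.")

    # Accept any of the eligible alpaca schema variants (see sketch above).
    if dataset_def.dataset_schema not in EXPECTED_DATASET_SCHEMA["alpaca"]:
        raise ValueError(
            f"Dataset {dataset_id} does not match any expected alpaca schema."
        )
```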

@ashwinb (Contributor) left a comment:

let's get this in!!!!

@ashwinb merged commit aeb7639 into main on Dec 13, 2024 (2 checks passed)
@ashwinb deleted the post_training_v2 branch on December 13, 2024