training UX: automatic generation of make_train_step #8495
Conversation
train_step = interop.jax_jit(train_step, kwargs_for_jax_jit={'donate_argnums': (0, 2)})
Does `donate_argnums` here imply that input buffers are donated to outputs? The `(0, 2)` is pretty cryptic to me. Consider commenting on their meaning. Or better, maybe this could be handled internally? We could jit the function inside `make_train_step`.
You're right. The current issue is that sometimes I want to print out the StableHLO for inspection, so the jax_jit'd object also needs to store the underlying jax function. I'll follow up.
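For context, here is a minimal, self-contained `jax.jit` sketch of what `donate_argnums` does. It is not this PR's actual `train_step` signature; the argument names below are illustrative.

```python
# Minimal sketch of donate_argnums, independent of this PR's train_step.
import jax
import jax.numpy as jnp

def step(params, batch, opt_state):
    new_params = params - 0.1 * batch.mean()
    new_opt_state = opt_state + 1.0
    return new_params, new_opt_state

# donate_argnums=(0, 2) lets XLA reuse the buffers of arguments 0 (params) and
# 2 (opt_state) for the outputs, saving memory; the caller must not read the
# donated inputs after the call.
step = jax.jit(step, donate_argnums=(0, 2))

params, opt_state = jnp.ones((4,)), jnp.zeros((4,))
new_params, new_opt_state = step(params, jnp.arange(4.0), opt_state)
```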
WORKDIR /
RUN git clone https://github.com/pytorch/xla.git
WORKDIR xla/experimental/torch_xla2
RUN git checkout hanq_hybrid_mesh
Is the `hanq_hybrid_mesh` branch intended?
removed.
return jax.make_array_from_single_device_arrays(shape, sharding, x_split)

When running on single-host, `jax.device_put` sufficies. Multi-host need some
nit: suffices
done
When running on single-host, `jax.device_put` sufficies. Multi-host need some
extra encantations so that we split an array to only the shards corresponding
nit: incantations
done.
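For readers following the quoted doc, here is a generic sketch of the pattern being described: on a single host `jax.device_put(x, sharding)` is enough, while on multiple hosts each process builds only the shards it owns and assembles a global array with `jax.make_array_from_single_device_arrays`. Names such as `global_shape` and `per_device_arrays` are illustrative, not from this PR.

```python
# Generic single-process illustration of the multi-host assembly pattern,
# assuming a 1D 'data' mesh over all local devices.
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=('data',))
sharding = NamedSharding(mesh, P('data'))

global_shape = (len(jax.devices()) * 2, 4)
x = np.arange(np.prod(global_shape), dtype=np.float32).reshape(global_shape)

# Each addressable device gets the slice of the global array it owns.
per_device_arrays = [
    jax.device_put(x[index], device)
    for device, index in sharding.addressable_devices_indices_map(global_shape).items()
]
global_x = jax.make_array_from_single_device_arrays(
    global_shape, sharding, per_device_arrays)
```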
jax_optimizer = optax.sgd(0.01)
opt_state = torch_view(jax_optimizer.init(jax_view(jittable_mod.params)))

#opt_state = torch_xla2.interop.call_jax(jax_optimizer.init, jittable_mod.params)
nit: obsolete code
done.
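As background for the `opt_state` being built in the quoted snippet, here is a plain optax flow with no torch_xla2 interop; the parameter tree and gradients are illustrative stand-ins.

```python
# Plain optax flow showing what jax_optimizer.init / .update produce.
import jax
import jax.numpy as jnp
import optax

params = {'w': jnp.ones((3,)), 'b': jnp.zeros(())}
jax_optimizer = optax.sgd(0.01)
opt_state = jax_optimizer.init(params)

grads = jax.tree_util.tree_map(jnp.ones_like, params)   # stand-in gradients
updates, opt_state = jax_optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```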
5. `interop.call_jax` API is used whenever we need something from Jax. Those API can be
wrapped and have the "jaxiness" hidden. However, I don't think we need to do such hidding.

6. Precompile: call to `helpers.precompile_step`. This is not needed. If not used, then
nit: helpers.compile_step_func
done
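In plain JAX terms, "precompiling" a step usually means lowering and compiling it ahead of time so the first real training iteration does not pay compilation latency. The sketch below shows that idea only; `helpers.compile_step_func` in this PR presumably does something analogous for the torch-view `train_step`.

```python
# Sketch of ahead-of-time compilation with jax.jit's lower/compile API.
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    return x * 2.0

sample = jnp.zeros((8,))                 # sample input with the real shape/dtype
compiled = step.lower(sample).compile()  # AOT compile for that signature
out = compiled(sample)                   # later calls skip tracing/compiling
```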
jax_optimizer = optax.sgd(0.01)
opt_state = torch_view(jax_optimizer.init(jax_view(jittable_mod.params)))

#opt_state = torch_xla2.interop.call_jax(jax_optimizer.init, jittable_mod.params)
obsolete code?
ptal
def custom_attention(
    query, key, value, attn_mask=None,
    dropout_p=0.0, is_causal=False,
What happens if `dropout_p` is not zero or when `is_causal` is False? Should we assert that their values match the behavior of `splash_attention`?
This bit is to showcase that you can register your own override of an op. The user is writing this bit, so we assume the user knows which special cases apply to them.
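A hedged sketch of the assertion the reviewer suggests for such a user-registered override. The body below is plain softmax attention standing in for the user's splash_attention kernel; the point is the asserts, which make the override fail loudly outside the configurations it supports.

```python
# Sketch only: softmax attention as a stand-in for splash_attention.
import jax
import jax.numpy as jnp

def custom_attention(query, key, value, attn_mask=None,
                     dropout_p=0.0, is_causal=False):
    assert dropout_p == 0.0, "this override assumes no dropout"
    assert attn_mask is None, "this override does not handle explicit masks"
    scale = query.shape[-1] ** -0.5
    scores = jnp.einsum('...qd,...kd->...qk', query, key) * scale
    if is_causal:
        q_len, k_len = scores.shape[-2], scores.shape[-1]
        causal_mask = jnp.tril(jnp.ones((q_len, k_len), dtype=bool))
        scores = jnp.where(causal_mask, scores, -jnp.inf)
    return jnp.einsum('...qk,...kd->...qd', jax.nn.softmax(scores, axis=-1), value)
```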
def make_train_step(model_fn,
                    loss_fn, optax_optimizer,
                    remat_policy=None,
                    mark_fsdp_sharding_axis=None):
This seems fairly specific to the FSDP sharding scheme. What if my model input and output use different sharding schemes? What if I want my model output to be 2D sharded? What if they are PyTrees?
Instead, I wonder if we could separate the sharding concern from `make_train_step`. For example, what if we had

def shard_input(fn, in_shardings) -> fn
def shard_output(fn, out_shardings) -> fn
# Alternatively, just a `shard` function where `in_shardings` and `out_shardings` semantics match what's in https://jax.readthedocs.io/en/latest/jax.experimental.pjit.html
def shard(fn, in_shardings, out_shardings)

That could internally wrap any function (such as the model code) and then annotate the inputs or outputs with sharding annotations. Then the user could apply whatever sharding they want.
moved sharding annotation to the client.
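A hedged sketch of the reviewer's proposal, assuming the annotations are applied with `jax.lax.with_sharding_constraint`. `shard`, `in_shardings`, and `out_shardings` are illustrative names, not an API from this PR, and their tree structures must match the wrapped function's arguments and outputs.

```python
# Sketch: wrap a function with input/output sharding annotations, keeping
# sharding concerns out of make_train_step.
import jax

def shard(fn, in_shardings, out_shardings):
    def wrapped(*args):
        args = jax.tree_util.tree_map(
            jax.lax.with_sharding_constraint, args, in_shardings)
        out = fn(*args)
        return jax.tree_util.tree_map(
            jax.lax.with_sharding_constraint, out, out_shardings)
    # Typically jitted afterwards, e.g. jax.jit(shard(model_fn, ...)).
    return wrapped
```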
h, *rest = args
newh = torch.func.functional_call(self.c.one_mod, weight, args)
# next layer's input; and residual to be added to list
return (newh, *rest), torch.ones(1, device='jax')
Could the `torch.ones(1, ...)` simply be `None`? `None` is the base case of a PyTree and is what I managed to get working in `torch_xla`'s scan.
done.
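A minimal `jax.lax.scan` sketch of that point: `None` is a valid (empty) PyTree, so a scan body with no per-step output can return `(carry, None)` instead of a dummy tensor.

```python
# Scan body returning (carry, None): nothing is collected per step.
import jax
import jax.numpy as jnp

def body(carry, x):
    new_carry = carry + x
    return new_carry, None

final_carry, ys = jax.lax.scan(body, jnp.zeros(()), jnp.arange(5.0))
# final_carry == 10.0, ys is None
```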
Wondering if I should review this again
Yes, PTAL, thanks!
optax_optimizer: the optimizer from optax library. for example, optax.adam
remat_policy: One of jax.ad_checkpoint.checkpoint_policies, specifies how
  to do gradient checkpointing. If None, then it means checkpoint everything.
mark_fsdp_sharding_axis: str. A string name for marking sharding for
nit: obsolete comment
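For the `remat_policy` argument described in the quoted docstring, here is an illustrative use of such a policy in plain JAX; `layer` is a toy stand-in and the policy choice is only an example.

```python
# A policy from jax.checkpoint_policies controls which intermediates are kept
# versus recomputed in the backward pass.
import jax
import jax.numpy as jnp

def layer(x, w):
    return jnp.tanh(x @ w)

# Keep matmul results, recompute everything else during the backward pass.
ckpt_layer = jax.checkpoint(layer, policy=jax.checkpoint_policies.dots_saveable)

w = jnp.ones((4, 4))
grads = jax.grad(lambda x: ckpt_layer(x, w).sum())(jnp.ones((2, 4)))
```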
This PR creates a function that can generate a `train_step` function based on a model and an optimizer. It also introduces a class that can wrap a ModuleList into a scan loop. Then it changes `examples/train_llama_titan` to showcase usage of these two new tools.
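For readers skimming the thread, a hedged, pure-JAX sketch of what a generated `train_step` conceptually looks like. The actual implementation in this PR operates on torch-view tensors through torch_xla2; the names and argument order below are this sketch's own.

```python
# Conceptual sketch of a generated train_step: forward, loss, grad, optax update.
import jax
import optax

def make_train_step(model_fn, loss_fn, optax_optimizer):
    def train_step(params, opt_state, batch, labels):
        def compute_loss(p):
            return loss_fn(model_fn(p, batch), labels)
        loss, grads = jax.value_and_grad(compute_loss)(params)
        updates, new_opt_state = optax_optimizer.update(grads, opt_state, params)
        new_params = optax.apply_updates(params, updates)
        return new_params, new_opt_state, loss
    # Donate params and opt_state buffers, in the spirit of the
    # donate_argnums=(0, 2) call discussed near the top of this conversation
    # (indices here follow this sketch's own argument order).
    return jax.jit(train_step, donate_argnums=(0, 1))
```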