
Tensor parallel distributed strategy without using deepspeed #280

Merged: 2 commits merged into HabanaAI:habana-main on Jul 15, 2024

Conversation


@kalyanjk kalyanjk commented Jul 2, 2024

Adds a tensor-parallel distributed strategy by extending GaudiLlamaAttention -> TPGaudiLlamaAttention and GaudiLlamaMLP -> TPGaudiLlamaMLP.

Use the parameter --distributed_strategy="tp" to invoke this code path.
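For illustration only, a minimal sketch of the class substitution this description implies: pick the TP subclasses when "tp" is requested. The helper function and the import path are assumptions made for the sketch, not part of this PR's actual wiring.

```python
# Hypothetical sketch, not the PR's code: choose the tensor-parallel subclasses
# added by this PR when the "tp" strategy is requested. The import path is an
# assumption about the module layout.
from optimum.habana.transformers.models.llama.modeling_llama import (
    GaudiLlamaAttention,
    GaudiLlamaMLP,
    TPGaudiLlamaAttention,
    TPGaudiLlamaMLP,
)


def select_llama_block_classes(distributed_strategy: str):
    """Return (attention_cls, mlp_cls) for the requested distributed strategy."""
    if distributed_strategy == "tp":
        return TPGaudiLlamaAttention, TPGaudiLlamaMLP
    return GaudiLlamaAttention, GaudiLlamaMLP
```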


@msinnha1 left a comment


This is a big patch; I'm still reviewing it.

@@ -1013,6 +1161,7 @@ def forward(
global has_fused_rope
has_fused_rope = False



minor: please remove this

@kalyanjk (Author) replied:

Done

[GaudiLlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
layers = []
for i in range(config.num_hidden_layers):

minor: layer_idx in place of 'i'

@kalyanjk (Author) replied:

Done
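For reference, the loop after the suggested rename would read roughly as follows; the appended constructor call is taken from the list comprehension quoted above.

```python
layers = []
for layer_idx in range(config.num_hidden_layers):
    # one decoder layer per index, as in the original list comprehension
    layers.append(GaudiLlamaDecoderLayer(config, layer_idx))
```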

import torch.distributed
from torch import nn

#from optimum.habana.distributed import tp_wrapping


minor: please remove the commented code

@kalyanjk (Author) replied:

Done

pass


class NotDistributed(DistributedStrategy):


Why is the derived class named NotDistributed when the base class is DistributedStrategy? It creates some confusion in readability; maybe it needs a different name?
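For context, a condensed sketch of the strategy interface as it appears in the hunks under review; the bodies and docstrings are my reading of the IBM foundation-model-stack reference this PR follows, not a verbatim extract.

```python
from torch import nn


class DistributedStrategy:
    """Decides how each decoder layer is placed and/or sharded."""

    def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
        raise NotImplementedError


class NotDistributed(DistributedStrategy):
    """No-op strategy: keeps every layer local and unsharded (hence the name)."""

    def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
        return block
```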

def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
device = self.layer_to_device[layer]
if self.from_meta:
# https://github.com/pytorch/pytorch/pull/113647


That PR is closed; we can probably remove references like this that were carried over from the foundation-model-stack repo. #Comment

@kalyanjk (Author) replied:

Done
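For context, the usual pattern for materializing a block that was built on the meta device; this is a generic sketch of what the from_meta branch typically does, not a verbatim copy of the PR's distribute_layer body.

```python
from torch import nn


def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
    # Sketch of a strategy method: place one decoder layer on its target device.
    device = self.layer_to_device[layer]
    if self.from_meta:
        # Meta-device parameters carry no storage; allocate uninitialized
        # storage on the target device before the real weights are loaded.
        block.to_empty(device=device)
    else:
        block = block.to(device)
    return block
```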

)
if par_mod.bias is not None:
par_mod.bias.copy_(torch.split(mod.bias, output_size_per_partition)[rank])
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment

par_mod.bias.copy_(mod.bias)
else:
par_mod.bias.zero_()
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight}, bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment

par_mod.weight.copy_(
torch.split(mod.weight, output_size_per_partition, dim=1)[rank]
)
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment
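For readers following the hunks above: they implement weight sharding, where the full layer's weight (and bias) is split with torch.split and each rank copies only its own slice. The sketch below illustrates that pattern; the function names, the world_size handling, and the rank-0 bias convention are assumptions based on the quoted code, not a verbatim extract.

```python
import torch
from torch import nn


def copy_colwise_shard(mod: nn.Linear, par_mod: nn.Linear, rank: int, world_size: int):
    # Column-parallel linear: split the output dimension (dim 0 of weight) across ranks.
    out_per_rank = mod.out_features // world_size
    with torch.no_grad():
        par_mod.weight.copy_(torch.split(mod.weight, out_per_rank, dim=0)[rank])
        if mod.bias is not None:
            par_mod.bias.copy_(torch.split(mod.bias, out_per_rank)[rank])


def copy_rowwise_shard(mod: nn.Linear, par_mod: nn.Linear, rank: int, world_size: int):
    # Row-parallel linear: split the input dimension (dim 1 of weight); only rank 0
    # keeps the bias so it is added exactly once after the partial results are reduced.
    in_per_rank = mod.in_features // world_size
    with torch.no_grad():
        par_mod.weight.copy_(torch.split(mod.weight, in_per_rank, dim=1)[rank])
        if mod.bias is not None:
            par_mod.bias.copy_(mod.bias if rank == 0 else torch.zeros_like(mod.bias))
```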

# The transposes here are to avoid excessive recompilation due to split()
# specializing the dimension where the all_gather is happening
last_dim = input_.dim() - 1
# Starting PT 2.3, we can go back to funcol.all_gather_tensor


#Comment
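And a sketch of the transpose-then-gather trick the comment above describes: the sharded last dimension is moved to dim 0, gathered there with torch.distributed.all_gather_into_tensor, and moved back, so the gather always happens on the same dimension and split() does not trigger recompilation. Assumes the default process group is initialized and input_ is sharded along its last dimension.

```python
import torch
import torch.distributed as dist


def all_gather_last_dim(input_: torch.Tensor, world_size: int) -> torch.Tensor:
    last_dim = input_.dim() - 1
    # Move the sharded dimension to the front so the gather always concatenates on dim 0.
    x = input_.transpose(0, last_dim).contiguous()
    out = torch.empty(
        (world_size * x.shape[0],) + tuple(x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.all_gather_into_tensor(out, x)
    # Move the gathered dimension back to its original position.
    return out.transpose(0, last_dim).contiguous()
```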

@kalyanjk force-pushed the tp_strategy branch 3 times, most recently from 576860d to 1e82fac on July 15, 2024 07:19
@dvarshney-habana merged commit c6e5f9c into HabanaAI:habana-main on Jul 15, 2024
kalyanjk added a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 15, 2024
…I#280)

* TP reference -  ibm foundation-model-stack

* Code cleanup -removed unused code

---------

Co-authored-by: Kalyan <[email protected]>
dvarshney-habana pushed a commit that referenced this pull request Jul 15, 2024
…299)

* TP reference -  ibm foundation-model-stack

* Code cleanup -removed unused code

---------

Co-authored-by: Kalyan <[email protected]>
@astachowiczhabana

huggingface#1121

kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 31, 2024
kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 31, 2024
dvarshney-habana pushed a commit that referenced this pull request Jul 31, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280) (#299)"

This reverts commit 32c86d3.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Co-authored-by: Kalyan <[email protected]>
dvarshney-habana pushed a commit that referenced this pull request Jul 31, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280)"

This reverts commit c6e5f9c.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Co-authored-by: Kalyan <[email protected]>
astachowiczhabana pushed a commit that referenced this pull request Aug 6, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280)"

This reverts commit c6e5f9c.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Change-Id: Ic30c85e697dbd6a51767e21e1c06c9a20120d9f6
Co-authored-by: Kalyan <[email protected]>