
Tensor parallel distributed strategy without using deepspeed #280

Merged: 2 commits merged into HabanaAI:habana-main on Jul 15, 2024

Conversation


@kalyanjk kalyanjk commented Jul 2, 2024

Adds a tensor-parallel distributed strategy by extending GaudiLlamaAttention -> TPGaudiLlamaAttention and GaudiLlamaMLP -> TPGaudiLlamaMLP.

Use the parameter --distributed_strategy="tp" to invoke this code path.
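For illustration only, a minimal sketch of the class substitution this description implies: pick the TP subclasses when "tp" is requested. The helper function and the import path are assumptions made for the sketch, not part of this PR's actual wiring.

```python
# Hypothetical sketch, not the PR's code: choose the tensor-parallel subclasses
# added by this PR when the "tp" strategy is requested. The import path is an
# assumption about the module layout.
from optimum.habana.transformers.models.llama.modeling_llama import (
    GaudiLlamaAttention,
    GaudiLlamaMLP,
    TPGaudiLlamaAttention,
    TPGaudiLlamaMLP,
)


def select_llama_block_classes(distributed_strategy: str):
    """Return (attention_cls, mlp_cls) for the requested distributed strategy."""
    if distributed_strategy == "tp":
        return TPGaudiLlamaAttention, TPGaudiLlamaMLP
    return GaudiLlamaAttention, GaudiLlamaMLP
```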


@msinnha1 left a comment


This is a big patch; I'm still reviewing it.

@@ -1013,6 +1161,7 @@ def forward(
global has_fused_rope
has_fused_rope = False



minor: please remove this

@kalyanjk (Author) replied:

Done

[GaudiLlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
layers = []
for i in range(config.num_hidden_layers):

minor: layer_idx in place of 'i'

@kalyanjk (Author) replied:

Done
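For reference, the loop after the suggested rename would read roughly as follows; the appended constructor call is taken from the list comprehension quoted above.

```python
layers = []
for layer_idx in range(config.num_hidden_layers):
    # one decoder layer per index, as in the original list comprehension
    layers.append(GaudiLlamaDecoderLayer(config, layer_idx))
```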

import torch.distributed
from torch import nn

#from optimum.habana.distributed import tp_wrapping


minor: please remove the commented code

@kalyanjk (Author) replied:

Done

pass


class NotDistributed(DistributedStrategy):


Why is the derived class named NotDistributed when the base class is DistributedStrategy? It creates some confusion in readability; maybe it needs a different name?
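For context, a condensed sketch of the strategy interface as it appears in the hunks under review; the bodies and docstrings are my reading of the IBM foundation-model-stack reference this PR follows, not a verbatim extract.

```python
from torch import nn


class DistributedStrategy:
    """Decides how each decoder layer is placed and/or sharded."""

    def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
        raise NotImplementedError


class NotDistributed(DistributedStrategy):
    """No-op strategy: keeps every layer local and unsharded (hence the name)."""

    def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
        return block
```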

def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
device = self.layer_to_device[layer]
if self.from_meta:
# https://github.com/pytorch/pytorch/pull/113647


That PR is closed; we can probably remove references like this that were carried over from the foundation-model-stack repo. #Comment

@kalyanjk (Author) replied:

Done
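For context, the usual pattern for materializing a block that was built on the meta device; this is a generic sketch of what the from_meta branch typically does, not a verbatim copy of the PR's distribute_layer body.

```python
from torch import nn


def distribute_layer(self, block: nn.Module, layer: int) -> nn.Module:
    # Sketch of a strategy method: place one decoder layer on its target device.
    device = self.layer_to_device[layer]
    if self.from_meta:
        # Meta-device parameters carry no storage; allocate uninitialized
        # storage on the target device before the real weights are loaded.
        block.to_empty(device=device)
    else:
        block = block.to(device)
    return block
```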

)
if par_mod.bias is not None:
par_mod.bias.copy_(torch.split(mod.bias, output_size_per_partition)[rank])
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment

par_mod.bias.copy_(mod.bias)
else:
par_mod.bias.zero_()
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight}, bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment

par_mod.weight.copy_(
torch.split(mod.weight, output_size_per_partition, dim=1)[rank]
)
# print(f"For rank {rank}, we have the following weights: Base weight {mod.weight} bias {mod.bias}; Par weight {par_mod.weight}, bias {par_mod.bias}")


#Comment
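For readers following the hunks above: they implement weight sharding, where the full layer's weight (and bias) is split with torch.split and each rank copies only its own slice. The sketch below illustrates that pattern; the function names, the world_size handling, and the rank-0 bias convention are assumptions based on the quoted code, not a verbatim extract.

```python
import torch
from torch import nn


def copy_colwise_shard(mod: nn.Linear, par_mod: nn.Linear, rank: int, world_size: int):
    # Column-parallel linear: split the output dimension (dim 0 of weight) across ranks.
    out_per_rank = mod.out_features // world_size
    with torch.no_grad():
        par_mod.weight.copy_(torch.split(mod.weight, out_per_rank, dim=0)[rank])
        if mod.bias is not None:
            par_mod.bias.copy_(torch.split(mod.bias, out_per_rank)[rank])


def copy_rowwise_shard(mod: nn.Linear, par_mod: nn.Linear, rank: int, world_size: int):
    # Row-parallel linear: split the input dimension (dim 1 of weight); only rank 0
    # keeps the bias so it is added exactly once after the partial results are reduced.
    in_per_rank = mod.in_features // world_size
    with torch.no_grad():
        par_mod.weight.copy_(torch.split(mod.weight, in_per_rank, dim=1)[rank])
        if mod.bias is not None:
            par_mod.bias.copy_(mod.bias if rank == 0 else torch.zeros_like(mod.bias))
```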

# The transposes here are to avoid excessive recompilation due to split()
# specializing the dimension where the all_gather is happening
last_dim = input_.dim() - 1
# Starting PT 2.3, we can go back to funcol.all_gather_tensor


#Comment
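And a sketch of the transpose-then-gather trick the comment above describes: the sharded last dimension is moved to dim 0, gathered there with torch.distributed.all_gather_into_tensor, and moved back, so the gather always happens on the same dimension and split() does not trigger recompilation. Assumes the default process group is initialized and input_ is sharded along its last dimension.

```python
import torch
import torch.distributed as dist


def all_gather_last_dim(input_: torch.Tensor, world_size: int) -> torch.Tensor:
    last_dim = input_.dim() - 1
    # Move the sharded dimension to the front so the gather always concatenates on dim 0.
    x = input_.transpose(0, last_dim).contiguous()
    out = torch.empty(
        (world_size * x.shape[0],) + tuple(x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.all_gather_into_tensor(out, x)
    # Move the gathered dimension back to its original position.
    return out.transpose(0, last_dim).contiguous()
```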

@kalyanjk force-pushed the tp_strategy branch 3 times, most recently from 576860d to 1e82fac on July 15, 2024 07:19
@dvarshney-habana merged commit c6e5f9c into HabanaAI:habana-main on Jul 15, 2024
kalyanjk added a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 15, 2024
…I#280)

* TP reference -  ibm foundation-model-stack

* Code cleanup -removed unused code

---------

Co-authored-by: Kalyan <[email protected]>
dvarshney-habana pushed a commit that referenced this pull request Jul 15, 2024
…299)

* TP reference -  ibm foundation-model-stack

* Code cleanup -removed unused code

---------

Co-authored-by: Kalyan <[email protected]>
@astachowiczhabana

huggingface#1121

kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 31, 2024
kalyanjk pushed a commit to kalyanjk/optimum-habana-fork that referenced this pull request Jul 31, 2024
dvarshney-habana pushed a commit that referenced this pull request Jul 31, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280) (#299)"

This reverts commit 32c86d3.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Co-authored-by: Kalyan <[email protected]>
dvarshney-habana pushed a commit that referenced this pull request Jul 31, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280)"

This reverts commit c6e5f9c.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Co-authored-by: Kalyan <[email protected]>
astachowiczhabana pushed a commit that referenced this pull request Aug 6, 2024
* Revert "Tensor parallel  distributed strategy without using deepspeed (#280)"

This reverts commit c6e5f9c.

* Tensor parallel distributed strategy without using deepspeed (huggingface#1121)

Co-authored-by: Kalyan <[email protected]>

---------

Change-Id: Ic30c85e697dbd6a51767e21e1c06c9a20120d9f6
Co-authored-by: Kalyan <[email protected]>