Revise LayerDrop implementation #484
Conversation
drop_p: float = 0.0,
generator: Optional[Generator] = None,
Shouldn't we remove these as well?
You mean the generator? It should still be around in case someone wants to provide a different RNG for LayerDrop.
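For anyone skimming the thread, a minimal sketch of what supplying a different RNG means here; only the `drop_p` and `generator` parameter names come from the diff above, and the commented-out constructor call is hypothetical:

```python
import torch

# A dedicated RNG so that LayerDrop sampling is reproducible and independent
# of the global RNG state; this is what would be passed as `generator=`.
layer_drop_rng = torch.Generator()
layer_drop_rng.manual_seed(2)

# Hypothetical usage (argument names other than `drop_p` and `generator`
# are assumptions):
#
#   encoder = StandardTransformerEncoder(layers, drop_p=0.1, generator=layer_drop_rng)
```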
# compat
@final
class ModuleList(TorchModuleList):
What is the purpose of this class? Why don't we just use torch.nn.ModuleList instead? I don't see it used anywhere; I propose removing this file.
We have several teams using this module for now. Everything tagged with “# compat” will eventually be removed once we migrate those uses (before the v0.3 release).
def backward(ctx: Any, grad_output: Tensor) -> Tuple[Tensor, Tensor]:
    return grad_output, torch.zeros_like(grad_output)
Cleverly done! The gradient with respect to x is simply the gradient of the output, since this is just the identity function when a layer is dropped.
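For readers without the full diff, a minimal sketch of the kind of autograd Function being discussed; only the `backward` body comes from the quoted lines, while the class name, the `forward` signature, and the docstring are assumptions:

```python
from typing import Any, Tuple

import torch
from torch import Tensor
from torch.autograd import Function


class _RecordDropForBackward(Function):  # hypothetical name
    """Passes `x` through unchanged while keeping a dropped layer's output
    attached to the autograd graph with a zero gradient."""

    @staticmethod
    def forward(ctx: Any, x: Tensor, dropped_output: Tensor) -> Tensor:
        # The dropped layer's output is ignored in the forward pass; the
        # alias ensures the output gets its own autograd history.
        return x.view_as(x)

    @staticmethod
    def backward(ctx: Any, grad_output: Tensor) -> Tuple[Tensor, Tensor]:
        # Identity w.r.t. `x`; the zero gradient for the dropped layer's
        # output keeps its parameters in the graph, which is what DDP and
        # FSDP need in order to stay in sync.
        return grad_output, torch.zeros_like(grad_output)
```

A dropped layer would then be applied as something like `seqs = _RecordDropForBackward.apply(seqs, layer_output)`, so the graph has the same shape whether or not the layer is dropped.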
Thanks!
Also, what is the advantage of using this method over PyTorch hooks?
Did this happen because the RNG across the different GPUs was different, causing different layers to be dropped across GPUs?
@@ -204,6 +221,19 @@ def forward(

        return seqs, padding_mask

    def _drop_iter(self) -> Iterator[Tuple[Module, bool]]:
I think this logic (and some other pieces of the forward pass) is used in the decoder as well. Maybe we could consider adding a base component in the future that both the encoder and the decoder would inherit from.
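As a reference point for this discussion, a hedged sketch of what an iterator with the `_drop_iter` signature above might do; only the signature appears in the diff, so the standalone form, the argument names, and the sampling details are assumptions:

```python
from typing import Iterator, Optional, Tuple

import torch
from torch import Generator
from torch.nn import Module, ModuleList


def drop_iter(
    layers: ModuleList,
    drop_p: float,
    training: bool,
    generator: Optional[Generator] = None,
) -> Iterator[Tuple[Module, bool]]:
    """Yield each layer together with a flag saying whether to drop it."""
    if training and drop_p > 0.0:
        # One uniform sample per layer, drawn from the provided RNG so that
        # callers can control reproducibility.
        samples = torch.rand(len(layers), generator=generator)
    else:
        samples = None

    for idx, layer in enumerate(layers):
        dropped = samples is not None and float(samples[idx]) < drop_p
        yield layer, dropped
```

Note that in this PR a dropped layer is still executed and only its output is discarded, which is what keeps the autograd graph constant (see the description below).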
This PR revises the LayerDrop implementation and moves it from `ModuleList` to `StandardTransformerEncoder` and `StandardTransformerDecoder`. Although the original implementation was ideal in principle, both DDP and FSDP had trouble handling its forward/backward passes correctly and either silently failed to sync gradients (DDP) or failed with a cryptic error. The new implementation redundantly computes the dropped layers, but since the autograd graph stays constant, both DDP and FSDP can handle it correctly.
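To make the trade-off concrete, a rough sketch of the forward loop this description implies; the helper names, the layer call signature, and the control flow are assumptions pieced together from the fragments quoted above (`drop_iter` and `_RecordDropForBackward` refer to the earlier sketches), not the exact implementation:

```python
from typing import Optional, Tuple

from torch import Generator, Tensor
from torch.nn import ModuleList


def forward_with_layer_drop(
    layers: ModuleList,
    seqs: Tensor,
    padding_mask: Optional[Tensor],
    drop_p: float,
    training: bool,
    generator: Optional[Generator] = None,
) -> Tuple[Tensor, Optional[Tensor]]:
    for layer, dropped in drop_iter(layers, drop_p, training, generator):
        # The layer is always executed, even when its output will be thrown
        # away, so the autograd graph (and the set of parameters receiving
        # gradients) is identical on every rank and every step. That is what
        # keeps DDP and FSDP in sync, at the cost of redundant compute for
        # dropped layers.
        new_seqs, padding_mask = layer(seqs, padding_mask)

        if dropped:
            # Discard the output but keep it attached to the graph with a
            # zero gradient via the custom Function sketched earlier.
            seqs = _RecordDropForBackward.apply(seqs, new_seqs)
        else:
            seqs = new_seqs

    return seqs, padding_mask
```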