help on understanding the implementation of FSDP. #5441

Open
jq-wei opened this issue Sep 15, 2024 · 0 comments
Labels
pending: This problem is yet to be addressed

Comments

jq-wei commented Sep 15, 2024

I don't understand the implementation of FSDP. I think that in order to use FSDP, the model needs to be wrapped with FullyShardedDataParallel (or FSDP) imported from torch.distributed.fsdp, but I cannot find these two keywords used anywhere in this repo.
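For reference, this is the kind of minimal FSDP wrapping I would expect to see somewhere in the code. It is only a sketch of my understanding, not code from this repo: the Linear model is a placeholder and it assumes a single-node torchrun launch.

```python
# Minimal FSDP wrapping as I understand it (a sketch, not code from this repo).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # rank == local rank on a single node

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
model = FSDP(model)  # parameters are sharded across ranks here

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
optimizer.step()
```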

So I don't think FSDP is implemented natively; is it instead, perhaps, some combination of ZeRO-2 and ZeRO-3 from DeepSpeed? Am I missing something?
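For contrast, my rough picture of the DeepSpeed ZeRO-3 path is the sketch below; the config values and the model are placeholders I made up, not what this repo actually uses.

```python
# Rough sketch of a DeepSpeed ZeRO-3 setup (placeholder config and model,
# not taken from this repo).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # ZeRO-3 shards params, grads, and optimizer states
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```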

In fact, some other issues are related to this question: basically, the model is not sharded, and each GPU holds one full copy of the model. See, for example, issues #5140 and #4912.

Thank you very much in advance for giving some hints.

github-actions bot added the pending label (This problem is yet to be addressed) on Sep 15, 2024