[emulator] feat: veScale correctness emulator #45
This pull request adds the veScale Correctness Emulator, which emulates the results of multi-device execution on a single device.
Why veScale Correctness Emulator?
What is veScale Correctness Emulator?
NCCL Emulation
We are using the NCCL version 2.19.3 code as a reference for our emulation implementation. The code can be found at NVIDIA/nccl.
veScale Correctness Emulator reproduces the results of NCCL collective APIs bitwise. It does so by implementing the same collective algorithms as NCCL and by calculating the same chunk sizes, which reproduces NCCL's computation order.
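Computation order matters here because floating-point addition is not associative: summing the same per-rank values in a different order can change the last bits of the result. The following is a minimal sketch in plain Python, not veScale or NCCL code; the values and the helper `reduce_in_order` are made up for illustration.

```python
# Per-rank contributions to one element of an all-reduce (illustrative values).
rank_values = [0.1, 0.2, 0.3]


def reduce_in_order(values, order):
    """Sum `values` left to right, visiting ranks in the given order."""
    total = values[order[0]]
    for rank in order[1:]:
        total = total + values[rank]
    return total


# Same inputs, two different reduction orders.
sum_from_rank0 = reduce_in_order(rank_values, [0, 1, 2])  # (0.1 + 0.2) + 0.3
sum_from_rank1 = reduce_in_order(rank_values, [1, 2, 0])  # (0.2 + 0.3) + 0.1

# The two sums differ in the last bit, so this prints False.
print(sum_from_rank0 == sum_from_rank1)
```

An emulator that only matched the mathematical sum would therefore not be bitwise identical to NCCL; it also has to visit the ranks in the same order.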
Collective APIs Emulation
These are standalone collective APIs that emulate, on a single device, the results of the NCCL collective APIs.
Supported APIs:
all_reduce
all_gather
reduce_scatter
all_to_all
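To make the semantics of these four collectives concrete, here is a minimal single-process sketch in which each rank's buffer is one entry of a Python list. The function names mirror the list above, but the signatures are assumptions for illustration and are not the emulator's actual API. The sketch uses a plain left-to-right sum, so it shows what each collective computes, not NCCL's bitwise reduction order.

```python
from typing import List

Vector = List[float]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all ranks' inputs."""
    gathered = [x for vec in per_rank for x in vec]
    return [list(gathered) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with the r-th shard of the element-wise sum."""
    world = len(per_rank)
    total = [sum(column) for column in zip(*per_rank)]
    shard = len(total) // world  # assumes the length is divisible by world
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_to_all(per_rank: List[Vector]) -> List[Vector]:
    """Rank r receives the r-th shard from every rank, in rank order."""
    world = len(per_rank)
    shard = len(per_rank[0]) // world  # assumes the length is divisible by world
    return [
        [x for src in per_rank for x in src[dst * shard:(dst + 1) * shard]]
        for dst in range(world)
    ]


# Two emulated ranks, each holding a two-element buffer.
inputs = [[1.0, 2.0], [10.0, 20.0]]
print(all_reduce(inputs))
print(all_gather(inputs))
print(reduce_scatter(inputs))
print(all_to_all(inputs))
```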
Mesh Collective APIs Emulation
These are standalone mesh collective APIs that emulate, on a single device, the results of PyTorch's mesh collective APIs.
Supported APIs:
mesh_all_reduce
mesh_all_gather
mesh_reduce_scatter
mesh_all_to_all
mesh_broadcast
mesh_scatter
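A mesh collective applies a collective independently within each group of ranks along one mesh dimension. The sketch below illustrates this for a 2D mesh written as a nested list of ranks. The helper `mesh_groups` and the signatures of `mesh_all_reduce` and `mesh_broadcast` are hypothetical and only borrow names from the list above; broadcasting from the first rank of each group is an assumption of this sketch.

```python
from typing import Dict, List


def mesh_groups(mesh: List[List[int]], mesh_dim: int) -> List[List[int]]:
    """Rank groups of a 2D mesh whose members differ only along `mesh_dim`."""
    if mesh_dim == 1:
        return [list(row) for row in mesh]   # ranks in the same row
    return [list(col) for col in zip(*mesh)]  # ranks in the same column


def mesh_all_reduce(per_rank: Dict[int, float], mesh: List[List[int]],
                    mesh_dim: int) -> Dict[int, float]:
    """Sum values within each group along `mesh_dim`; every member gets the sum."""
    out = {}
    for group in mesh_groups(mesh, mesh_dim):
        total = sum(per_rank[r] for r in group)
        for r in group:
            out[r] = total
    return out


def mesh_broadcast(per_rank: Dict[int, float], mesh: List[List[int]],
                   mesh_dim: int) -> Dict[int, float]:
    """Copy the value of the first rank of each group to the whole group."""
    out = {}
    for group in mesh_groups(mesh, mesh_dim):
        for r in group:
            out[r] = per_rank[group[0]]
    return out


mesh = [[0, 1], [2, 3]]                       # 2 x 2 mesh of four emulated ranks
values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
print(mesh_all_reduce(values, mesh, mesh_dim=0))
print(mesh_all_reduce(values, mesh, mesh_dim=1))
print(mesh_broadcast(values, mesh, mesh_dim=1))
```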
DTensor Redistribution Function Emulation
These are standalone DTensor redistribution functions that emulate, on a single device, the results of PyTorch's DTensor redistribution functions.
R2R
R2S
S2R
P2R
Coming soon: a full list of emulator DTensor redistribution functions will be added to support nD parallelism, including DP, TP, SP, PP, EP, and OP.
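In DTensor terms, R, S, and P stand for the Replicate, Shard, and Partial placements. The sketch below shows what the four listed transitions compute for a 1D tensor, with each emulated rank's local tensor stored as one list entry. The function names and signatures are hypothetical, sharding is along the only dimension, and Partial is assumed to mean partial sums.

```python
from typing import List

Vector = List[float]


def r2r(replicas: List[Vector]) -> List[Vector]:
    """Replicate -> Replicate: every rank keeps its full copy."""
    return [list(v) for v in replicas]


def r2s(replicas: List[Vector]) -> List[Vector]:
    """Replicate -> Shard: rank r keeps only its own slice of the full tensor."""
    world = len(replicas)
    shard = len(replicas[0]) // world  # assumes the length is divisible by world
    return [replicas[r][r * shard:(r + 1) * shard] for r in range(world)]


def s2r(shards: List[Vector]) -> List[Vector]:
    """Shard -> Replicate: an all-gather rebuilds the full tensor on every rank."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def p2r(partials: List[Vector]) -> List[Vector]:
    """Partial -> Replicate: an all-reduce sums the partial values on every rank."""
    total = [sum(column) for column in zip(*partials)]
    return [list(total) for _ in partials]


full = [1.0, 2.0, 3.0, 4.0]
replicated = [list(full), list(full)]  # two emulated ranks
sharded = r2s(replicated)
print(sharded)
print(s2r(sharded))
print(p2r([[1.0, 2.0], [3.0, 4.0]]))
```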
How does veScale Correctness Emulator work?
veScale Correctness Emulator achieves bitwise correctness when emulating the results of NCCL collective APIs. It does so by implementing the same collective algorithms as NCCL and by modeling NCCL's algorithm and protocol selection function and its chunk size calculation, so that the computation order matches NCCL's.
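To illustrate why the algorithm and the chunk layout determine the computation order, the sketch below emulates a textbook ring all-reduce on a single process. It is not NCCL's actual schedule, and the function `ring_all_reduce` is hypothetical. Each chunk starts its reduction at a different rank, so the chunk an element falls into decides the order in which the per-rank values are added.

```python
from typing import List


def ring_all_reduce(per_rank: List[List[float]]) -> List[List[float]]:
    """Emulate a textbook ring all-reduce (sum) over equally sized buffers."""
    world = len(per_rank)
    chunk = len(per_rank[0]) // world  # assumes the length is divisible by world
    bufs = [list(v) for v in per_rank]

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter phase: in each step, rank r sends chunk (r - step) % world
    # to rank r + 1, which adds it to its own copy of that chunk.
    for step in range(world - 1):
        sends = []
        for r in range(world):  # snapshot all sends before applying any update
            c = (r - step) % world
            sends.append((c, [bufs[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for i, x in zip(span(c), payload):
                bufs[dst][i] = x + bufs[dst][i]

    # Rank r now owns the fully reduced chunk (r + 1) % world.
    # All-gather phase: pass the reduced chunks around the ring.
    for step in range(world - 1):
        sends = []
        for r in range(world):
            c = (r + 1 - step) % world
            sends.append((c, [bufs[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for i, x in zip(span(c), payload):
                bufs[dst][i] = x
    return bufs


# Three emulated ranks; every element has the same per-rank values 0.1, 0.2, 0.3.
out = ring_all_reduce([[0.1] * 3, [0.2] * 3, [0.3] * 3])
print(out[0] == out[1] == out[2])  # all ranks agree on the result
print(out[0][0] == out[0][1])      # elements in different chunks can differ
```

Element 0 is reduced as `(0.1 + 0.2) + 0.3` and element 1 as `(0.2 + 0.3) + 0.1`, which differ in the last bit even though the inputs are identical. Moving an element to a different chunk changes its result, which is why the chunk size calculation has to match.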
Based on the emulation functions for NCCL collectives, veScale Correctness Emulator implements a global-view emulator `ProcessGroup` and `DeviceMesh` that contain all the process groups in the environment, whereas PyTorch's `ProcessGroup` and `DeviceMesh` only view the process groups related to the current rank. Aided by the global-view emulator `ProcessGroup` and `DeviceMesh`, veScale Correctness Emulator can emulate the results of collective APIs, mesh collective APIs, and DTensor redistribution functions on a single device.
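The difference between a global view and a rank-local view can be sketched as follows. The class `GlobalViewMesh` and its methods are hypothetical and are not the emulator's `DeviceMesh`; they only show that a global view stores every group along every mesh dimension, while a single rank only needs the groups it belongs to.

```python
from typing import Dict, List


class GlobalViewMesh:
    """Hypothetical 2D mesh that records every process group along each dim."""

    def __init__(self, mesh: List[List[int]]):
        self.mesh = mesh
        self.groups: Dict[int, List[List[int]]] = {
            0: [list(col) for col in zip(*mesh)],  # groups along mesh dim 0
            1: [list(row) for row in mesh],        # groups along mesh dim 1
        }

    def all_groups(self, mesh_dim: int) -> List[List[int]]:
        """Global view: every group along `mesh_dim`, for all ranks."""
        return self.groups[mesh_dim]

    def groups_of_rank(self, rank: int) -> Dict[int, List[int]]:
        """Rank-local view: only the one group per dim that contains `rank`."""
        return {
            dim: next(g for g in groups if rank in g)
            for dim, groups in self.groups.items()
        }


mesh = GlobalViewMesh([[0, 1, 2], [3, 4, 5]])  # 2 x 3 mesh of six emulated ranks
print(mesh.all_groups(0))
print(mesh.all_groups(1))
print(mesh.groups_of_rank(4))
```

With the global view, a single process can walk over every group and apply an emulated collective to each one, which is what makes single-device emulation of mesh collectives possible.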