[emulator] feat: veScale correctness emulator #45

jiannanWang · 2024-08-09T21:23:23Z

This pull request adds the veScale Correctness Emulator, which emulates the results of multi-device execution on a single device.

Why veScale Correctness Emulator?

Modern frameworks promise a single-device abstraction for nD parallelism, but they still lack a critical component: a way to verify that this abstraction is correct. For example, the loss curve of single-device training differs from the loss curves of 3D-parallel training.
How do we know whether the difference is correct? To what extent is it correct?
- "Correct" differences come from nD Parallelism
  - Communication difference (e.g., ring allreduce)
  - Compute difference (e.g., matmul)
  - Hardware difference (e.g., FP16)
- "Incorrect" differences come from bugs in
  - User configuration
  - User model code
  - System implementation code
  - Data loader
  - Model checkpoint
  - Random seed and offset
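
To illustrate why a communication difference can be "correct", here is a minimal sketch in plain Python. It assumes three hypothetical ranks that each contribute one floating-point value, and it compares two reduction orders. The helper `reduce_in_order` is illustrative only and is not part of veScale or NCCL.

```python
# Three hypothetical ranks each contribute one FP64 value.
contributions = [0.1, 0.2, 0.3]

def reduce_in_order(values, order):
    """Sum `values` by visiting ranks in the sequence given by `order`."""
    total = values[order[0]]
    for rank in order[1:]:
        total = total + values[rank]
    return total

# Order a single device might use: (0.1 + 0.2) + 0.3
single_device = reduce_in_order(contributions, [0, 1, 2])
# Order a ring starting at rank 1 might use: (0.2 + 0.3) + 0.1
ring_from_rank1 = reduce_in_order(contributions, [1, 2, 0])

# Floating-point addition is not associative, so the two results
# differ in the last bits even though neither order is a bug.
print(single_device == ring_from_rank1)
print(abs(single_device - ring_from_rank1))
```

Both results are valid sums of the same inputs. Telling this kind of difference apart from a real bug requires knowing the exact reduction order, which is what the emulator models.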

What is veScale Correctness Emulator?

veScale Correctness Emulator verifies the correctness of nD parallelism by emulating nD parallel training on a single device.
It isolates correctness at different layers and separates differences that come from nD parallelism from differences that come from bugs.
It achieves bitwise correctness at three levels: NCCL collectives, mesh collectives, and DTensor.

NCCL Emulation

We are using the NCCL version 2.19.3 code as a reference for our emulation implementation. The code can be found at NVIDIA/nccl.

veScale Correctness Emulator reproduces the results of NCCL collective APIs bitwise. It does so by implementing the same collective algorithms as NCCL and by modeling NCCL's computation order through calculating the correct chunk size.

Collective APIs Emulation

These are standalone collective APIs which emulate the results from collective APIs of NCCL on a single device.
Supported APIs:

- all_reduce
- all_gather
- reduce_scatter
- all_to_all
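
As a rough illustration of what emulating these collectives on a single device means, the sketch below models each rank's buffer as a Python list and computes what every rank would hold afterwards. The `toy_*` names and signatures are hypothetical and are not the emulator's API. The sketch covers only the data-movement semantics and uses a naive summation order, so it does not reproduce NCCL's bitwise results.

```python
def toy_all_reduce(per_rank):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(col) for col in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def toy_all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' data."""
    gathered = [x for data in per_rank for x in data]
    return [list(gathered) for _ in per_rank]

def toy_reduce_scatter(per_rank):
    """Rank r ends up with the r-th shard of the element-wise sum."""
    world = len(per_rank)
    total = [sum(col) for col in zip(*per_rank)]
    shard = len(total) // world  # assumes the length divides evenly
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

def toy_all_to_all(per_rank):
    """Rank d receives the d-th shard of every rank, in source-rank order."""
    world = len(per_rank)
    shard = len(per_rank[0]) // world  # assumes the length divides evenly
    return [
        [x for src in per_rank for x in src[dst * shard:(dst + 1) * shard]]
        for dst in range(world)
    ]

# Two emulated ranks, each holding two integers.
inputs = [[1, 2], [10, 20]]
print(toy_all_reduce(inputs))      # [[11, 22], [11, 22]]
print(toy_all_gather(inputs))      # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(toy_reduce_scatter(inputs))  # [[11], [22]]
print(toy_all_to_all(inputs))      # [[1, 10], [2, 20]]
```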

Mesh Collective APIs Emulation

These are standalone mesh collective APIs which emulate the results from mesh collective APIs of PyTorch on a single device.
Supported APIs:

- mesh_all_reduce
- mesh_all_gather
- mesh_reduce_scatter
- mesh_all_to_all
- mesh_broadcast
- mesh_scatter
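
A mesh collective runs an ordinary collective independently inside each group along one mesh dimension. The sketch below shows that idea for a sum all-reduce on a 2D mesh, assuming the mesh is given as a nested list of rank ids and each rank holds one value. `toy_mesh_all_reduce` is a hypothetical helper, not the emulator's `mesh_all_reduce` signature, and it ignores reduction order.

```python
def toy_mesh_all_reduce(mesh, values, mesh_dim):
    """Sum all-reduce within each group along `mesh_dim` of a 2D mesh.

    mesh:   nested list of rank ids, e.g. [[0, 1], [2, 3]]
    values: one value per rank, indexed by rank id
    """
    rows, cols = len(mesh), len(mesh[0])
    if mesh_dim == 0:
        # Ranks that differ only in their row index form a group.
        groups = [[mesh[r][c] for r in range(rows)] for c in range(cols)]
    else:
        # Ranks that differ only in their column index form a group.
        groups = [list(row) for row in mesh]
    out = list(values)
    for group in groups:
        total = sum(values[rank] for rank in group)
        for rank in group:
            out[rank] = total
    return out

mesh = [[0, 1], [2, 3]]
values = [1, 2, 3, 4]
print(toy_mesh_all_reduce(mesh, values, mesh_dim=0))  # [4, 6, 4, 6]
print(toy_mesh_all_reduce(mesh, values, mesh_dim=1))  # [3, 3, 7, 7]
```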

DTensor Redistribution Function Emulation

These are standalone DTensor redistribution functions which emulate the results from DTensor redistribution functions of PyTorch on a single device.

- R2R (Replicate to Replicate)
- R2S (Replicate to Shard)
- S2R (Shard to Replicate)
- P2R (Partial to Replicate)
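
The sketch below illustrates what these four redistributions do to per-rank data, assuming a 1-D tensor modeled as a Python list, even sharding along its only dimension, and a sum reduction for partial values. The function names are hypothetical stand-ins, not the emulator's API, and the summation order is naive rather than bitwise-faithful.

```python
def r2r(replicas):
    """Replicate -> Replicate: every rank keeps its full copy."""
    return [list(t) for t in replicas]

def r2s(replicas):
    """Replicate -> Shard: rank r keeps only its r-th slice."""
    world = len(replicas)
    shard = len(replicas[0]) // world  # assumes even sharding
    return [replicas[r][r * shard:(r + 1) * shard] for r in range(world)]

def s2r(shards):
    """Shard -> Replicate: all-gather the shards onto every rank."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def p2r(partials):
    """Partial -> Replicate: all-reduce (sum) the partial values."""
    total = [sum(col) for col in zip(*partials)]
    return [list(total) for _ in partials]

replicas = [[1, 2, 3, 4], [1, 2, 3, 4]]
shards = r2s(replicas)
print(shards)                        # [[1, 2], [3, 4]]
print(s2r(shards) == replicas)       # True
print(p2r([[1, 2], [10, 20]]))       # [[11, 22], [11, 22]]
```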

Coming soon: A full list of emulator DTensor redistribution functions will be added to support nD parallelism, including DP, TP, SP, PP, EP, and OP.

How does veScale Correctness Emulator work?

veScale Correctness Emulator achieves bitwise correctness when emulating the results of NCCL collective APIs. It implements the same collective algorithms as NCCL, and it models NCCL's algorithm and protocol selection function and its chunk size calculation, so that the computation order matches NCCL's.
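
To show why chunking determines computation order, here is a minimal single-device simulation of a textbook ring all-reduce (reduce-scatter followed by all-gather). This is a sketch under simplifying assumptions and is not NCCL's or the emulator's code: it uses exactly one chunk per rank, while NCCL's actual chunk sizes, channels, and protocol selection are not modeled. `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(per_rank):
    """Simulate a textbook ring all-reduce (sum) over per-rank lists."""
    world = len(per_rank)
    n = len(per_rank[0])
    assert n % world == 0, "sketch assumes one equal-sized chunk per rank"
    size = n // world
    bufs = [list(b) for b in per_rank]

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Reduce-scatter: at each step, rank r sends chunk (r - step) to rank r + 1,
    # which adds it to its local copy of that chunk.
    for step in range(world - 1):
        sends = [((r - step) % world, None) for r in range(world)]
        sends = [(c, [bufs[r][i] for i in chunk(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for k, i in enumerate(chunk(c)):
                bufs[dst][i] = payload[k] + bufs[dst][i]

    # All-gather: rank r now owns the fully reduced chunk (r + 1) and
    # forwards finished chunks around the ring, overwriting stale data.
    for step in range(world - 1):
        sends = [(c, [bufs[r][i] for i in chunk(c)])
                 for r, c in enumerate((r + 1 - step) % world for r in range(world))]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % world
            for k, i in enumerate(chunk(c)):
                bufs[dst][i] = payload[k]
    return bufs

# Every element has the same three inputs (0.1, 0.2, 0.3), but each chunk
# is accumulated starting from a different rank.
result = ring_all_reduce([[0.1] * 3, [0.2] * 3, [0.3] * 3])
print(result[0])
print(all(buf == result[0] for buf in result))
```

In this sketch all ranks end with identical buffers, yet elements in different chunks can differ in their last bits because each chunk is summed in a different rank order. An emulator that wants bitwise agreement therefore has to place chunk boundaries exactly where the real library does.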

Building on the emulation functions for NCCL collectives, veScale Correctness Emulator implements a global-view emulator ProcessGroup and DeviceMesh that contain all the process groups in the environment, whereas PyTorch's ProcessGroup and DeviceMesh only see the process groups that involve the current rank.
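
The difference between the two views can be sketched as follows, assuming ranks are laid out row-major over the mesh shape. `all_groups` and `group_of` are hypothetical helpers written for this illustration and are not the emulator's or PyTorch's API.

```python
from itertools import product

def all_groups(mesh_shape, mesh_dim):
    """Global view: every process group (list of ranks) along `mesh_dim`."""
    # Row-major strides, e.g. shape (2, 2, 2) -> strides [4, 2, 1].
    strides = []
    stride = 1
    for size in reversed(mesh_shape):
        strides.insert(0, stride)
        stride *= size
    # Fix the coordinate on `mesh_dim` to 0 and enumerate all the others.
    other = [range(s) if d != mesh_dim else [0] for d, s in enumerate(mesh_shape)]
    groups = []
    for coord in product(*other):
        base = sum(c * st for c, st in zip(coord, strides))
        groups.append(
            [base + i * strides[mesh_dim] for i in range(mesh_shape[mesh_dim])]
        )
    return groups

def group_of(rank, mesh_shape, mesh_dim):
    """Per-rank view: only the single group that contains `rank`."""
    return next(g for g in all_groups(mesh_shape, mesh_dim) if rank in g)

print(all_groups((2, 2), mesh_dim=0))   # [[0, 2], [1, 3]]
print(all_groups((2, 2), mesh_dim=1))   # [[0, 1], [2, 3]]
print(group_of(3, (2, 2), mesh_dim=0))  # [1, 3]
```

A single process that holds the output of `all_groups` for every mesh dimension has enough information to play the role of every rank, which is what a single-device emulation needs.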

Aided by the global-view emulator ProcessGroup and DeviceMesh, veScale Correctness Emulator can emulate the results of collective APIs, mesh collective APIs, and DTensor redistribution functions on a single device.

add emulator

7b9d307

leonardo0lyj approved these changes Aug 9, 2024

MackZackA approved these changes Aug 10, 2024

MackZackA merged commit e439aa9 into volcengine:main Aug 10, 2024
2 checks passed
