[emulator] feat: veScale correctness emulator #45

Merged
merged 1 commit into from
Aug 10, 2024

jiannanWang
Contributor

This pull request adds the veScale Correctness Emulator, which emulates the results of multi-device execution on a single device.

Why veScale Correctness Emulator?

  • Modern frameworks promise a Single-Device Abstraction for nD Parallelism, but a critical component is still missing: one that verifies the correctness of that abstraction. For example, the loss curve of single-device training differs from the loss curves of 3D parallelism training.
  • How do we know the difference is correct? To what extent is it correct?
    • "Correct" differences come from nD Parallelism
      • Communication difference (e.g., ring allreduce)
      • Compute difference (e.g., matmul)
      • Hardware difference (e.g., FP16)
    • "Incorrect" differences come from bugs in
      • User configuration
      • User model code
      • System implementation code
      • Data loader
      • Model checkpoint
      • Random seed and offset
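
The "correct" communication differences above come down to floating-point addition not being associative: a collective that sums the same values in a different order can legitimately produce a different result. The sketch below is only an illustration, not code from this PR; `to_fp32` and `reduce_in_order` are hypothetical helper names, and FP32 rounding is emulated with the standard `struct` module.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def reduce_in_order(values, order):
    """Sum `values` in the given index order, rounding to FP32 after each add."""
    acc = to_fp32(values[order[0]])
    for i in order[1:]:
        acc = to_fp32(acc + to_fp32(values[i]))
    return acc

# Four "ranks", each contributing one value to a sum reduction.
contributions = [1e8, 1.0, -1e8, 1.0]

# Same inputs, two reduction orders, two different FP32 results.
in_rank_order = reduce_in_order(contributions, [0, 1, 2, 3])  # 1.0
reordered = reduce_in_order(contributions, [0, 2, 1, 3])      # 2.0
print(in_rank_order, reordered)
```

Neither result is a bug. Telling this kind of difference apart from a real bug requires reproducing the exact reduction order, which is what the emulator aims to do.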

What is veScale Correctness Emulator?

  • veScale Correctness Emulator verifies nD parallelism correctness by emulating nD parallel training on a single device.
  • veScale Correctness Emulator isolates correctness at different layers and separates differences that come from nD parallelism from differences that come from bugs.
  • veScale Correctness Emulator achieves bitwise correctness at three levels: NCCL collectives, mesh collectives, and DTensor.

NCCL Emulation

We use the NCCL 2.19.3 source code as the reference for our emulation implementation. The code can be found at NVIDIA/nccl.

veScale Correctness Emulator reproduces the results of NCCL collective APIs bitwise. This is achieved by implementing the same collective algorithms as NCCL and modeling NCCL's computation order by calculating the correct chunk size.

Collective APIs Emulation

These standalone collective APIs emulate, on a single device, the results of NCCL's collective APIs.
Supported APIs:

  • all_reduce
  • all_gather
  • reduce_scatter
  • all_to_all
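
To illustrate why chunking determines the computation order, here is a minimal sketch of a textbook ring all-reduce emulated on one device. It is not the emulator's implementation: `emulate_ring_all_reduce` is a hypothetical name, the buffer is split into exactly one chunk per rank, and NCCL's actual chunk sizes, protocols, and algorithm selection are not modeled.

```python
def emulate_ring_all_reduce(rank_inputs):
    """Emulate a sum all-reduce over a ring of ranks on a single device.

    rank_inputs: one equal-length list of floats per emulated rank.
    Returns the per-rank output buffers after the collective.
    """
    world = len(rank_inputs)
    n = len(rank_inputs[0])
    assert n % world == 0, "sketch assumes the buffer splits evenly into chunks"
    chunk = n // world
    bufs = [list(x) for x in rank_inputs]

    def sl(c):
        # Slice covering chunk index c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): at step s, rank r sends chunk (r - s) % world
    # to its ring neighbour, which adds it into its own copy of that chunk.
    for s in range(world - 1):
        sends = [((r - s) % world, bufs[r][sl((r - s) % world)]) for r in range(world)]
        for r, (c, data) in enumerate(sends):
            dst = (r + 1) % world
            bufs[dst][sl(c)] = [a + b for a, b in zip(bufs[dst][sl(c)], data)]

    # Phase 2 (all-gather): rank r now owns the fully reduced chunk (r + 1) % world
    # and the reduced chunks are passed around the ring unchanged.
    for s in range(world - 1):
        sends = [((r + 1 - s) % world, bufs[r][sl((r + 1 - s) % world)]) for r in range(world)]
        for r, (c, data) in enumerate(sends):
            bufs[(r + 1) % world][sl(c)] = data

    return bufs

# Three ranks, each holding the same value in every element.
outputs = emulate_ring_all_reduce([[0.1] * 3, [0.2] * 3, [0.3] * 3])
print(outputs[0])
```

Every rank ends up with the same buffer, but each chunk's reduction starts at a different rank, so elements in different chunks are summed in a different order and can differ in the last bit even though the inputs are identical. This is why the chunk boundaries have to match NCCL's for a bitwise comparison.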

Mesh Collective APIs Emulation

These standalone mesh collective APIs emulate, on a single device, the results of PyTorch's mesh collective APIs.
Supported APIs:

  • mesh_all_reduce
  • mesh_all_gather
  • mesh_reduce_scatter
  • mesh_all_to_all
  • mesh_broadcast
  • mesh_scatter

DTensor Redistribution Function Emulation

These standalone DTensor redistribution functions emulate, on a single device, the results of PyTorch's DTensor redistribution functions.
Supported functions:

  • R2R
  • R2S
  • S2R
  • P2R
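
As a rough illustration of what these redistributions compute, the sketch below represents a distributed 1-D tensor as a list of per-rank local lists, reading R as Replicate, S as Shard, and P as Partial (PyTorch's DTensor placement names). The function names are hypothetical and this is not the emulator's API. Note that `p2r` here sums in plain rank order, which is an assumption for simplicity; a bitwise emulation would need to follow the NCCL reduction order instead.

```python
def r2r(replicas):
    """Replicate -> Replicate: no data movement, every rank keeps its copy."""
    return [list(x) for x in replicas]

def r2s(replicas):
    """Replicate -> Shard: rank r keeps only its own contiguous slice."""
    world = len(replicas)
    chunk = len(replicas[0]) // world
    return [replicas[r][r * chunk:(r + 1) * chunk] for r in range(world)]

def s2r(shards):
    """Shard -> Replicate: all-gather, every rank receives the concatenation."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def p2r(partials):
    """Partial -> Replicate: all-reduce (sum), every rank receives the total."""
    total = list(partials[0])
    for p in partials[1:]:
        total = [a + b for a, b in zip(total, p)]
    return [list(total) for _ in partials]

# Two emulated ranks holding a replicated tensor of four elements.
replicated = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
sharded = r2s(replicated)       # [[1.0, 2.0], [3.0, 4.0]]
round_trip = s2r(sharded)       # back to the replicated layout
reduced = p2r([[1.0, 2.0], [10.0, 20.0]])  # [[11.0, 22.0], [11.0, 22.0]]
```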

Coming soon: a full list of emulator DTensor redistribution functions will be added to support nD parallelism, including DP, TP, SP, PP, EP, and OP.

How does veScale Correctness Emulator work?

veScale Correctness Emulator achieves bitwise correctness in emulating the results of NCCL collective APIs. It implements the same collective algorithms as NCCL and models NCCL's algorithm and protocol selection function and its chunk size calculation, so that the computation order matches NCCL's.

Building on the emulation functions for NCCL collectives, veScale Correctness Emulator implements a global-view emulator ProcessGroup and DeviceMesh that contain all process groups in the environment, whereas PyTorch's ProcessGroup and DeviceMesh only see the process groups that involve the current rank.
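
The global view can be pictured as follows. On a real cluster each rank only knows the groups it belongs to, whereas a single-device emulator has to enumerate every group along a mesh dimension and apply the collective to each of them. The sketch below is a hypothetical illustration, not the emulator's ProcessGroup or DeviceMesh: the names `groups_along_dim` and `emulate_mesh_all_reduce` are invented, ranks are assumed to be laid out row-major over the mesh, and the reduction sums in plain rank order rather than NCCL order.

```python
from itertools import product

def groups_along_dim(mesh_shape, dim):
    """Return every process group (list of global ranks) along one mesh dimension."""
    # Row-major strides: the last mesh dimension varies fastest (assumption).
    strides = [0] * len(mesh_shape)
    stride = 1
    for d in reversed(range(len(mesh_shape))):
        strides[d] = stride
        stride *= mesh_shape[d]

    other_dims = [d for d in range(len(mesh_shape)) if d != dim]
    groups = []
    for coords in product(*(range(mesh_shape[d]) for d in other_dims)):
        base = sum(c * strides[d] for c, d in zip(coords, other_dims))
        groups.append([base + i * strides[dim] for i in range(mesh_shape[dim])])
    return groups

def emulate_mesh_all_reduce(rank_values, mesh_shape, dim):
    """Sum-reduce one scalar per rank within every group along `dim`."""
    out = list(rank_values)
    for group in groups_along_dim(mesh_shape, dim):
        total = rank_values[group[0]]
        for r in group[1:]:
            total = total + rank_values[r]
        for r in group:
            out[r] = total
    return out

# A 2 x 4 mesh over 8 emulated ranks, rank r holding the value float(r).
values = [float(r) for r in range(8)]
print(groups_along_dim((2, 4), 1))                 # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(emulate_mesh_all_reduce(values, (2, 4), 1))  # [6.0, 6.0, 6.0, 6.0, 22.0, 22.0, 22.0, 22.0]
```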

Aided by the global-view emulator ProcessGroup and DeviceMesh, veScale Correctness Emulator can emulate the results of collective APIs, mesh collective APIs, and DTensor redistribution functions on a single device.

@MackZackA MackZackA merged commit e439aa9 into volcengine:main Aug 10, 2024
2 checks passed