UNet end-to-end performance scales poorly in data parallel mode #12961

Open
esmalTT opened this issue Sep 21, 2024 · 2 comments

Comments

@esmalTT
Contributor

esmalTT commented Sep 21, 2024

Summary

On the current main (commit 861fb7e), the single-chip performance of UNet is approx. 329 FPS. Running the same test in data parallel mode on N300 measures only 246 FPS, and the performance does not change if we disable async. Similarly, on T3K we only get 443 FPS end-to-end.

We should investigate why this is the case.

Steps to reproduce

Using UNet Shallow

Build latest main and enable ethernet dispatch cores: export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml.

Run the following steps:

# Run single-chip performance with trace+2CQ
pytest models/experimental/functional_unet/tests/test_unet_trace.py::test_unet_trace_2cq

# Run data parallel on 2 devices with trace+2CQ+async mode
pytest models/experimental/functional_unet/tests/test_unet_trace.py::test_unet_trace_2cq_multi_device

Using isolated test case

The behaviour is also present in the following test:

import time
import pytest
import ttnn
import torch


@pytest.mark.parametrize("device_params", [{"l1_small_size": 16384}], indirect=True)
@pytest.mark.parametrize("layout", [ttnn.TILE_LAYOUT])
@pytest.mark.parametrize("sharded", [True])
@pytest.mark.parametrize("enable_async_mode", (True,), indirect=True)
def test_transfer_md(mesh_device, layout, sharded, use_program_cache, enable_async_mode):
    expected = torch.rand([2, 1, 337920, 1])

    # Shard the input along dim 0 so each device in the mesh receives half the batch
    inputs_mesh_mapper = ttnn.ShardTensorToMesh(mesh_device, dim=0)

    input_tensor = ttnn.from_torch(expected, dtype=ttnn.bfloat16, mesh_mapper=inputs_mesh_mapper)
    input_tensor = ttnn.to_layout(input_tensor, layout)

    # Height-shard the tensor across an 8x8 core grid on each device
    sharded_memory_config = ttnn.create_sharded_memory_config(
        [1, 1, 337920, 32], ttnn.CoreGrid(x=8, y=8), ttnn.ShardStrategy.HEIGHT
    )

    # Warmup
    x = ttnn.to_device(input_tensor, mesh_device, sharded_memory_config)
    x = x.cpu(blocking=False)
    ttnn.synchronize_devices(mesh_device)

    # Issue 32 host->device writes, each followed by a non-blocking readback, then wait for all devices
    iterations = 32
    for _ in range(iterations):
        x = ttnn.to_device(input_tensor, mesh_device, sharded_memory_config if sharded else ttnn.L1_MEMORY_CONFIG)
        x.cpu(blocking=False)
    ttnn.synchronize_devices(mesh_device)
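
As posted, the test relies on the Tracy capture below to show the per-device discrepancy. A minimal host-side timing variant of the same loop (a sketch only, reusing the fixtures and calls from the test above) would report the average transfer time directly:

    start = time.perf_counter()
    iterations = 32
    for _ in range(iterations):
        # Same non-blocking write + readback round trip as in the test body
        x = ttnn.to_device(input_tensor, mesh_device, sharded_memory_config if sharded else ttnn.L1_MEMORY_CONFIG)
        x.cpu(blocking=False)
    ttnn.synchronize_devices(mesh_device)
    elapsed = time.perf_counter() - start
    print(f"avg transfer time: {1000.0 * elapsed / iterations:.2f} ms")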

Viewing the capture in the Tracy profiler shows device 1 being slower than device 0:

[Tracy profiler capture: device 1 readback is slower than device 0]

Investigation

  • Doing some additional testing, I found something interesting: running UNet on chip 1 instead of chip 0 shows readback performance >3x slower when reading from chip 1. Based on this, it seems like the performance degradation comes from inputs going to non-MMIO devices.
@aliuTT
Contributor

aliuTT commented Oct 9, 2024

Copying a thread from Slack. Evan was testing remote vs. MMIO readback performance:

import time
import pytest
import ttnn
import torch


@pytest.mark.parametrize("enable_async_mode", (True,False), indirect=True)
@pytest.mark.parametrize("device_params", [{"l1_small_size": 16384}], indirect=True)
@pytest.mark.parametrize("layout", [ttnn.TILE_LAYOUT])
@pytest.mark.parametrize("device_id", [0, 1])
def test_transfer(mesh_device, layout, device_id, use_program_cache, enable_async_mode):
    # Select a single physical device from the mesh (device 0 is the MMIO chip, device 1 the remote chip)
    device = mesh_device.get_devices()[device_id]

    sharded = True
    H = 337920
    expected = torch.rand([1, 1, H, 1])

    input_tensor = ttnn.from_torch(expected, dtype=ttnn.bfloat16)
    input_tensor = ttnn.to_layout(input_tensor, layout)
    sharded_memory_config = ttnn.create_sharded_memory_config(
        [1, 1, H, 32], ttnn.CoreGrid(x=8, y=8), ttnn.ShardStrategy.HEIGHT
    )

    input_tensor = ttnn.to_device(input_tensor, device, sharded_memory_config if sharded else ttnn.L1_MEMORY_CONFIG)

    # Warmup
    x = input_tensor
    x = x.cpu(blocking=False)
    ttnn.synchronize_devices(device)

    # Queue many non-blocking readbacks of the same device tensor, then wait for completion
    outputs = []
    iterations = 32 * 32
    start = time.time()
    for _ in range(iterations):
        x = input_tensor
        outputs.append(x.cpu(blocking=False))
    ttnn.synchronize_devices(device)
    end = time.time()
    total_time = end - start
    print(f"time: {1000.0 * total_time:.2f} ms")
    print(f"avg time: {1000.0 * total_time / iterations:.2f} ms")

    # Effective readback bandwidth: H * 32 bfloat16 elements moved per iteration
    elem_size = 2
    num_elem = H * 32
    num_devices = 1
    num_bytes = num_elem * elem_size * num_devices * iterations
    transfer_speed = num_bytes / total_time
    print(f"transfer speed: {transfer_speed * 1e-9:.2f} GB/s")

Latency:

For example, when reading from device in a loop:

  • For a tensor of shape (2, 2048, 32):
      • Device 0 takes an average of 0.70 ms per transfer
      • Device 1 takes an average of 0.84 ms per transfer
  • For a tensor of shape (2, 337920, 32):
      • Device 0 takes an average of 3.7 ms per transfer
      • Device 1 takes an average of 11.3 ms per transfer

Bandwidth:

Almost 6 GB/s end-to-end on chip 0, but only 1.9 GB/s on chip 1.
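
For reference, these numbers are consistent with the per-transfer size implied by the sharded config in the test above (a quick sanity check on the arithmetic, not new data):

H, tile_width, elem_size = 337920, 32, 2            # bfloat16 elements per transfer
bytes_per_transfer = H * tile_width * elem_size     # ~21.6 MB
print(bytes_per_transfer / 3.7e-3 * 1e-9)           # ~5.8 GB/s for chip 0 (3.7 ms per transfer)
print(bytes_per_transfer / 11.3e-3 * 1e-9)          # ~1.9 GB/s for chip 1 (11.3 ms per transfer)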

aliuTT assigned aliuTT, tt-aho and pgkeller and unassigned tt-aho and aliuTT on Oct 9, 2024
@esmalTT
Contributor Author

esmalTT commented Oct 16, 2024

Update: @tt-asaigal has provided a fix (similar to this) that significantly improves end-to-end scaling for multiple devices. It removes a bottleneck on the host by ensuring that the worker and completion-queue threads run on completely independent cores.
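
For context, a minimal sketch of the idea behind the fix, using hypothetical names and plain Python threading rather than the actual tt-metal host code: pin the worker thread and the completion-queue reader to disjoint host cores so they stop contending for the same CPU.

import os
import threading

def worker_loop():
    # Hypothetical worker thread: pin to core 2 (Linux-only; pid 0 = the calling thread)
    os.sched_setaffinity(0, {2})
    ...  # dispatch work to the device

def completion_queue_loop():
    # Hypothetical completion-queue reader: pin to a different core so readbacks
    # are never starved by the worker thread
    os.sched_setaffinity(0, {3})
    ...  # drain reads from the completion queue

threading.Thread(target=worker_loop).start()
threading.Thread(target=completion_queue_loop).start()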

Measuring the new end-to-end perf with this fix shows that performance scales much better when using only MMIO chips:

                 MMIO Devices   Remote Devices   FPS    FPS per device   FPS per MMIO device
T3K (MMIO-only)       4               0          1530       382.5              382.5
T3K                   4               4          1366       170.75             341.5
N300                  1               1           377       188.5              377

Poor scaling on remote devices hints that the main bottleneck is now likely the read/write path to the remote chips.
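
A quick check of that using only the per-device numbers from the table above (illustrative arithmetic, no new measurements):

t3k_mmio_only_per_device = 1530 / 4   # 382.5 FPS per device with MMIO chips only
t3k_full_per_device = 1366 / 8        # 170.75 FPS per device with 4 MMIO + 4 remote chips
print(t3k_full_per_device / t3k_mmio_only_per_device)  # ~0.45: adding remote chips roughly halves per-device throughput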
