Missing error handling in benchmark_sdxl_rocm.py #286

Open
ScottTodd opened this issue Jul 10, 2024 · 3 comments

@ScottTodd (Member)

On iree-org/iree#17847, compilation failed and the benchmark job using iree_tests/benchmarks/sdxl/benchmark_sdxl_rocm.py at 3603a45 did not handle that gracefully:

https://github.com/iree-org/iree/actions/runs/9874435266/job/27269408927#step:16:46

```
INFO     root:benchmark_sdxl_rocm.py:31 Command failed with error: b''
INFO     root:benchmark_sdxl_rocm.py:161 Running SDXL ROCm benchmark failed. Exiting
INFO     root:benchmark_sdxl_rocm.py:179 E2E Benchmark Time: None ms (golden time 320.0 ms)

...

>       check.less_equal(benchmark_e2e_mean_time, goldentime_rocm_e2e, "SDXL e2e benchmark time should not regress")
E       TypeError: '<=' not supported between instances of 'NoneType' and 'float'

SHARK-TestSuite/iree_tests/benchmarks/sdxl/benchmark_sdxl_rocm.py:298: TypeError
```
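One way to make this failure mode readable is to guard the comparison before calling `check.less_equal`. The snippet below is only a sketch, not the fix that eventually landed; the helper name `assert_not_regressed` and the surrounding test wiring are assumptions:

```python
# Sketch only: fail with a clear message when the benchmark produced no timing
# data, instead of letting check.less_equal compare None against a float.
import pytest
import pytest_check as check


def assert_not_regressed(benchmark_e2e_mean_time, goldentime_rocm_e2e):
    if benchmark_e2e_mean_time is None:
        pytest.fail("SDXL e2e benchmark produced no timing data; see the logs above")
    check.less_equal(
        benchmark_e2e_mean_time,
        goldentime_rocm_e2e,
        "SDXL e2e benchmark time should not regress",
    )
```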
@ScottTodd (Member Author)

Traced this a bit.

We only log stderr on failure here, but we still return stdout:

```python
def run_iree_command(args: Sequence[str] = ()):
    command = "Exec:", " ".join(args)
    logging.getLogger().info(command)
    proc = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    stdout_v, stderr_v, = proc.stdout, proc.stderr
    return_code = proc.returncode
    if return_code == 0:
        return 0, proc.stdout
    logging.getLogger().info(f"Command failed with error: {proc.stderr}")
    return 1, proc.stdout
```
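For comparison, a stricter helper could raise on a nonzero return code and include both streams in the message so CI logs show the real compiler or runtime error. This is a hypothetical sketch, not the fix that landed in IREE; the name `run_iree_command_checked` is made up:

```python
# Hypothetical sketch: fail loudly on a nonzero return code and surface both
# stdout and stderr instead of silently returning stdout to the caller.
import logging
import subprocess
from typing import Sequence


def run_iree_command_checked(args: Sequence[str] = ()):
    logging.getLogger().info("Exec: %s", " ".join(args))
    proc = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    if proc.returncode != 0:
        raise RuntimeError(
            f"Command failed with return code {proc.returncode}\n"
            f"stdout:\n{proc.stdout.decode()}\n"
            f"stderr:\n{proc.stderr.decode()}"
        )
    return proc.stdout
```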

We get that stdout output here and pass it to job_summary_process:

```python
# e2e benchmark
ret_value, output = run_sdxl_rocm_benchmark(rocm_chip, gpu_number)
benchmark_e2e_mean_time = job_summary_process(ret_value, output)
```

That stdout output is then ignored in job_summary_process when the return value is 1, and the function returns None:

```python
def job_summary_process(ret_value, output):
    if ret_value == 1:
        logging.getLogger().info("Running SDXL ROCm benchmark failed. Exiting")
        return
    bench_lines = output.decode().split("\n")[3:]
    benchmark_results = decode_output(bench_lines)
    logging.getLogger().info(benchmark_results)
    benchmark_mean_time = float(benchmark_results[10].time.split()[0])
    return benchmark_mean_time
```
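A small guard at the call site would also stop that None from flowing into the golden-time comparison. Again, this is just a sketch under the same assumptions, not the change that was actually made:

```python
# Sketch: stop the benchmark run as soon as job_summary_process reports a
# failure, rather than letting a None mean time reach the regression checks.
ret_value, output = run_sdxl_rocm_benchmark(rocm_chip, gpu_number)
benchmark_e2e_mean_time = job_summary_process(ret_value, output)
if benchmark_e2e_mean_time is None:
    raise RuntimeError("SDXL ROCm e2e benchmark failed; no mean time was parsed")
```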

ScottTodd added a commit to iree-org/iree that referenced this issue Jul 12, 2024
Reverts #17847

This broke SDXL rocm pipeline tests on mi300, see
#17847 (comment). The
tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31
Command failed with error: b''`) so I can't easily tell what the issue
is, nod-ai/SHARK-TestSuite#286 is filed to
improve the situation there.
ScottTodd self-assigned this Jul 15, 2024
@ScottTodd (Member Author)

New test coverage with pytest, running prior to the benchmark script, also helps here.

ScottTodd added a commit to iree-org/iree that referenced this issue Jul 15, 2024
Progress on nod-ai/SHARK-TestSuite#286

Tested here:
https://github.com/iree-org/iree/actions/runs/9944277339/job/27470515222?pr=17907#step:7:171
(actually nvm, that failed before this script even ran... errr... well,
it's probably fine lol)

ci-exactly: build_packages,regression_test
@ScottTodd (Member Author)

Landed a fix in IREE. Can copy it to this repo as well or just call this fixed.

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024 (same commit message as the revert above, with "Signed-off-by: Lubo Litchev <[email protected]>").

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024 (same commit message as the "Progress on nod-ai/SHARK-TestSuite#286" commit above, with "Signed-off-by: Lubo Litchev <[email protected]>").