Missing error handling in benchmark_sdxl_rocm.py #286

Open
ScottTodd opened this issue Jul 10, 2024 · 3 comments

@ScottTodd (Member)

On iree-org/iree#17847, compilation failed and the benchmark job using iree_tests/benchmarks/sdxl/benchmark_sdxl_rocm.py at 3603a45 did not handle that gracefully:

https://github.com/iree-org/iree/actions/runs/9874435266/job/27269408927#step:16:46

```
INFO     root:benchmark_sdxl_rocm.py:31 Command failed with error: b''
INFO     root:benchmark_sdxl_rocm.py:161 Running SDXL ROCm benchmark failed. Exiting
INFO     root:benchmark_sdxl_rocm.py:179 E2E Benchmark Time: None ms (golden time 320.0 ms)

...

>       check.less_equal(benchmark_e2e_mean_time, goldentime_rocm_e2e, "SDXL e2e benchmark time should not regress")
E       TypeError: '<=' not supported between instances of 'NoneType' and 'float'

SHARK-TestSuite/iree_tests/benchmarks/sdxl/benchmark_sdxl_rocm.py:298: TypeError
```
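One way to make this failure mode readable is to guard the comparison before calling `check.less_equal`. The snippet below is only a sketch, not the fix that eventually landed; the helper name `assert_not_regressed` and the surrounding test wiring are assumptions:

```python
# Sketch only: fail with a clear message when the benchmark produced no timing
# data, instead of letting check.less_equal compare None against a float.
import pytest
import pytest_check as check


def assert_not_regressed(benchmark_e2e_mean_time, goldentime_rocm_e2e):
    if benchmark_e2e_mean_time is None:
        pytest.fail("SDXL e2e benchmark produced no timing data; see the logs above")
    check.less_equal(
        benchmark_e2e_mean_time,
        goldentime_rocm_e2e,
        "SDXL e2e benchmark time should not regress",
    )
```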
@ScottTodd (Member Author)

Traced this a bit.

We only log stderr on failure here, but we still return stdout:

```python
def run_iree_command(args: Sequence[str] = ()):
    command = "Exec:", " ".join(args)
    logging.getLogger().info(command)
    proc = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    stdout_v, stderr_v, = proc.stdout, proc.stderr
    return_code = proc.returncode
    if return_code == 0:
        return 0, proc.stdout
    logging.getLogger().info(f"Command failed with error: {proc.stderr}")
    return 1, proc.stdout
```
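For comparison, a stricter helper could raise on a nonzero return code and include both streams in the message so CI logs show the real compiler or runtime error. This is a hypothetical sketch, not the fix that landed in IREE; the name `run_iree_command_checked` is made up:

```python
# Hypothetical sketch: fail loudly on a nonzero return code and surface both
# stdout and stderr instead of silently returning stdout to the caller.
import logging
import subprocess
from typing import Sequence


def run_iree_command_checked(args: Sequence[str] = ()):
    logging.getLogger().info("Exec: %s", " ".join(args))
    proc = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    if proc.returncode != 0:
        raise RuntimeError(
            f"Command failed with return code {proc.returncode}\n"
            f"stdout:\n{proc.stdout.decode()}\n"
            f"stderr:\n{proc.stderr.decode()}"
        )
    return proc.stdout
```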

We get that stdout output here and pass it to job_summary_process:

```python
# e2e benchmark
ret_value, output = run_sdxl_rocm_benchmark(rocm_chip, gpu_number)
benchmark_e2e_mean_time = job_summary_process(ret_value, output)
```

That stdout output is then ignored in job_summary_process when the return value is 1, and the function returns None:

```python
def job_summary_process(ret_value, output):
    if ret_value == 1:
        logging.getLogger().info("Running SDXL ROCm benchmark failed. Exiting")
        return
    bench_lines = output.decode().split("\n")[3:]
    benchmark_results = decode_output(bench_lines)
    logging.getLogger().info(benchmark_results)
    benchmark_mean_time = float(benchmark_results[10].time.split()[0])
    return benchmark_mean_time
```
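A small guard at the call site would also stop that None from flowing into the golden-time comparison. Again, this is just a sketch under the same assumptions, not the change that was actually made:

```python
# Sketch: stop the benchmark run as soon as job_summary_process reports a
# failure, rather than letting a None mean time reach the regression checks.
ret_value, output = run_sdxl_rocm_benchmark(rocm_chip, gpu_number)
benchmark_e2e_mean_time = job_summary_process(ret_value, output)
if benchmark_e2e_mean_time is None:
    raise RuntimeError("SDXL ROCm e2e benchmark failed; no mean time was parsed")
```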

ScottTodd added a commit to iree-org/iree that referenced this issue Jul 12, 2024
Reverts #17847

This broke SDXL rocm pipeline tests on mi300, see
#17847 (comment). The
tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31
Command failed with error: b''`) so I can't easily tell what the issue
is, nod-ai/SHARK-TestSuite#286 is filed to
improve the situation there.
ScottTodd self-assigned this Jul 15, 2024
@ScottTodd (Member Author)

New test coverage with pytest, running prior to the benchmark script, also helps here.

ScottTodd added a commit to iree-org/iree that referenced this issue Jul 15, 2024
Progress on nod-ai/SHARK-TestSuite#286

Tested here:
https://github.com/iree-org/iree/actions/runs/9944277339/job/27470515222?pr=17907#step:7:171
(actually nvm, that failed before this script even ran... errr... well,
it's probably fine lol)

ci-exactly: build_packages,regression_test
@ScottTodd (Member Author)

Landed a fix in IREE. Can copy it to this repo as well or just call this fixed.

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024 (same commit message as the revert above, with "Signed-off-by: Lubo Litchev <[email protected]>").

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024 (same commit message as the "Progress on nod-ai/SHARK-TestSuite#286" commit above, with "Signed-off-by: Lubo Litchev <[email protected]>").