Abnormally slow GPU sync after running median filter, and running `get()` on median filter cupy array result? #153

yousefmoazzam · 2024-09-19T09:34:41Z

More info on the apparent issue

All the below is with httomolibgpu at commit ce1c15f on main.

From doing test runs of httomo that make use of padding (from this PR DiamondLightSource/httomo#446), I was running the median_filter method and found that when increasing the value for the kernel_size parameter, the processing of blocks within a section that included media_filter became quite slow.

Initially I thought it was because of the padding feature in httomo, but from some simple profiling of httomo runs using cProfile to see where the time was being spent, it looked to be in:

GPU synchronisation via CUDA events in the cupy API
the get() method on cupy arrays

which made me wonder if this slowness could be reproduced outside of httomo.

I believe I have been able to reproduce the slowness outside of httomo via two standalone scripts that depend only on httomolibgpu: one demonstrating the slowness due to GPU synchronisation, and the other demonstrating the slowness due to calling the .get() method on a cupy array.

I have also seemingly found that this is a peculiarity of median_filter but not of other methods. I have randomly chosen paganin_filter_tomopy as the method to compare against. Each script allows passing in a single arg, median or paganin, to choose which method to use when doing the check that the script does. The result of running both scripts for both median_filter and paganin_filter_tomopy seem to indicate that median_filter exhibits the slowness, but paganin_filter_tomopy does not.

Scripts

GPU synchronisation script

import time
import sys
from typing import Literal

import cupy as cp

from httomolibgpu.misc.corr import median_filter
from httomolibgpu.prep.phase import paganin_filter_tomopy


def main(method: Literal["median", "paganin"]):
    # Generate data to run on
    DATA_SHAPE = (180, 128, 160)
    DTYPE = cp.float32
    data = cp.random.random_sample(size=DATA_SHAPE, dtype=DTYPE)

    # Define CUDA event objects to time how long the CUDA kernel in the median filter method takes
    # to run
    start = cp.cuda.Event()
    end = cp.cuda.Event()

    # Run method
    KERNEL_SIZE = 7
    cpu_including_sync_start_time = time.time()
    cpu_excluding_sync_start_time = time.time()
    start.record()

    if method == "median":
        _ = median_filter(data, kernel_size=KERNEL_SIZE)
    else:
        _ = paganin_filter_tomopy(data)

    end.record()
    cpu_excluding_sync_end_time = time.time()

    # Synchronize the GPU in order to calculate the elapsed time between the two CUDA events
    cpu_sync_start = time.time()
    end.synchronize()
    cpu_sync_end = time.time()

    cpu_including_sync_end_time = time.time()

    # Print elapsed time for GPU kernel
    gpu_elapsed_time = cp.cuda.get_elapsed_time(start, end) * 1e-3
    print(f"GPU kernel elapsed time: {gpu_elapsed_time}")

    # Print elapsed time for CPU excluding GPU synchronisation
    print(
        f"CPU elapsed time excluding sync: {cpu_excluding_sync_end_time - cpu_excluding_sync_start_time}"
    )

    # Print elapsed time for CPU including the GPU synchronisation
    cpu_elapsed_time = cpu_including_sync_end_time - cpu_including_sync_start_time
    print(f"CPU elapsed time including sync: {cpu_elapsed_time}")

    # Print CPU time taken for GPU synchronisation
    print(f"CPU time for CUDA sync to complete: {cpu_sync_end - cpu_sync_start}")


if __name__ == "__main__":
    args = sys.argv[1:]
    if args[0] not in ["median", "paganin"]:
        print("Please provide only one arg: 'median' or 'paganin'")
        sys.exit()

    main(args[0])

`get()` script

import sys
import time
from typing import Literal

import cupy as cp

from httomolibgpu.misc.corr import median_filter
from httomolibgpu.prep.phase import paganin_filter_tomopy


def main(method: Literal["median", "paganin"]):
    # Generate data to run on
    DATA_SHAPE = (180, 128, 160)
    DTYPE = cp.float32
    data = cp.random.random_sample(size=DATA_SHAPE, dtype=DTYPE)

    # Run method
    KERNEL_SIZE = 7
    cpu_including_get_start_time = time.time()
    cpu_excluding_get_start_time = time.time()

    if method == "median":
        res = median_filter(data, kernel_size=KERNEL_SIZE)
    else:
        res = paganin_filter_tomopy(data)

    cpu_excluding_get_end_time = time.time()

    # Transfer data from GPU to CPU
    _ = res.get()
    cpu_including_get_end_time = time.time()

    # Print CPU elapsed time for running method excluding `get()`
    print(
        f"CPU time excluding `get()`: {cpu_excluding_get_end_time - cpu_excluding_get_start_time}"
    )

    # Print CPU elapsed time for running method including `get()`
    print(
        f"CPU time including `get()`: {cpu_including_get_end_time - cpu_including_get_start_time}"
    )


if __name__ == "__main__":
    args = sys.argv[1:]
    if args[0] not in ["median", "paganin"]:
        print("Please provide only one arg: 'median' or 'paganin'")
        sys.exit()

    main(args[0])

Results

Running the GPU synchronisation script for both methods shows the following:

(base) root@492a2a3538d1:/httomo# python cupy-event-sync.py paganin
GPU kernel elapsed time: 0.10271372985839844
CPU elapsed time excluding sync: 0.10265207290649414
CPU elapsed time including sync: 0.10383081436157227
CPU time for CUDA sync to complete: 0.0011780261993408203
(base) root@492a2a3538d1:/httomo# python cupy-event-sync.py median
GPU kernel elapsed time: 47.182890625
CPU elapsed time excluding sync: 0.3372988700866699
CPU elapsed time including sync: 47.18360257148743
CPU time for CUDA sync to complete: 46.84630250930786

Running the get() script for both methods shows the following:

(base) root@492a2a3538d1:/httomo# python cupy-data-get.py paganin
CPU time excluding `get()`: 0.10548996925354004
CPU time including `get()`: 0.11040711402893066
(base) root@492a2a3538d1:/httomo# python cupy-data-get.py median
CPU time excluding `get()`: 0.3312513828277588
CPU time including `get()`: 45.806589126586914

Concluding remarks

I typically develop inside a container so this issue was first found when running inside a container. I have also run the scripts on the host machine outside of a container and can reproduce the slowness of median_filter.

Perhaps there's something wrong with the random data generated in the scripts? I can also try running on the test data that is in the httomo repo, tomo_standard.nxs, and update with results from that experiment.

Being reproducible by people other than me would give me more confidence that this is an issue. Until then, I don't think any conclusions should be made yet.

The text was updated successfully, but these errors were encountered:

dkazanc · 2024-09-19T10:39:21Z

Thanks @yousefmoazzam for spotting this and the provided code. So I looked into cupy-event-sync.py first. I have to say that by increasing the kernel larger than 5 we certainly slow things down, as I believe this particular implementation of median filter is not quite suitable for larger kernels. Also, it is a 3D filter, so the calculation becomes more intense.
For me it was like this for median:

7 x 7 x 7 kernel size : GPU kernel elapsed time: 8.12
5 x 5 x 5 kernel size : GPU kernel elapsed time: 0.91
3 x 3 x 3 kernel size : GPU kernel elapsed time: 0.25

And this is more or less expected, the scaling here in terms of time will be nonlinear (could be exponential) with the increased kernel size.

And for the 2nd script it is more or less the same expected:

7 x 7 x 7
CPU time excluding `get()`: 0.31023693084716797
CPU time including `get()`: 8.148758888244629

I do not think that the users will use anything larger than 5 x 5 x 5 as we will have better denoising methods if you use larger kernels for median.

yousefmoazzam · 2024-09-19T10:58:52Z

Thanks @dkazanc for checking this!

I also tried this on ws448 and a P100 node, the sync time and get() time is much more reasonable even for kernel size 7. It's possible that my workstation's particular GPU just doesn't handle the median filter well for some reason.

I thought that the large difference between including clocking the sync operation and not including it, this part:

CPU elapsed time excluding sync: 0.3372988700866699
CPU elapsed time including sync: 47.18360257148743

was meaning that the method was running in decent time, but something else weird was happening when attempting to sync the GPU or transfer from GPU to CPU.

But. since it's only my machine that is producing the very long sync and get() times, I think this can be closed for now. (If it comes back we can revive it).

yousefmoazzam added the question Further information is requested label Sep 19, 2024

DiamondLightSource deleted a comment Sep 19, 2024

yousefmoazzam closed this as not planned Won't fix, can't repro, duplicate, stale Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Abnormally slow GPU sync after running median filter, and running `get()` on median filter cupy array result? #153

Abnormally slow GPU sync after running median filter, and running `get()` on median filter cupy array result? #153

yousefmoazzam commented Sep 19, 2024 •

edited

Loading

dkazanc commented Sep 19, 2024

yousefmoazzam commented Sep 19, 2024

Abnormally slow GPU sync after running median filter, and running get() on median filter cupy array result? #153

Abnormally slow GPU sync after running median filter, and running get() on median filter cupy array result? #153

Comments

yousefmoazzam commented Sep 19, 2024 • edited Loading

More info on the apparent issue

Scripts

GPU synchronisation script

get() script

Results

Concluding remarks

dkazanc commented Sep 19, 2024

yousefmoazzam commented Sep 19, 2024

Abnormally slow GPU sync after running median filter, and running `get()` on median filter cupy array result? #153

Abnormally slow GPU sync after running median filter, and running `get()` on median filter cupy array result? #153

yousefmoazzam commented Sep 19, 2024 •

edited

Loading

`get()` script