
Grid-stride iteration and ceilfracf #952

Open: wants to merge 2 commits into base: master

Conversation
@jahooker jahooker commented Mar 5, 2023

Hello Sjors, Dari, & co.!

I have come across what I believe to be a bug that affects pixel sampling in cuda_kernel_softMaskBackgroundValue and cuda_kernel_cosineFilter. These functions make use of ceilfracf:

__device__ int ceilfracf(T1 a, T2 b)

So far as I can tell from its name and the places it is called from, ceilfracf(a, b) should return the least integer n such that n * b >= a. But, as is clear from its definition, it does not do that. In cases where b divides a with no remainder,
the present implementation will return the intended result + 1.

template< typename T1, typename T2 >
static inline
__device__ int ceilfracf(T1 a, T2 b)
{
//	return __float2int_ru(__fdividef( (float)a, (float)b ) );
	return (int)(a/b + 1);
}

We want to return not a / b + 1 but ceilf(float(a) / float(b)). Or, without casting to float: a / b + bool(a % b) (assuming T1 and T2 are integral types, which is true at all the present call sites).

template <typename T1, typename T2>
static inline
__device__ int ceilfracf(T1 a, T2 b) {
    return ceilf(float(a) / float(b));
}

Now, why does this matter? As I said, ceilfracf is called from a handful of functions, including cuda_kernel_softMaskBackgroundValue and cuda_kernel_cosineFilter. In these places, it is being used to calculate the number of strides needed to iterate over some image data, given some number of parallel CUDA threads. In the pathological case, when the image size is divisible by the number of threads, there will be too great a gap between where the threads start, and pixels will be missed. For instance, given a 32 × 32 × 32 image, calling cuda_kernel_softMaskBackgroundValue with 128 threads per block and 128 blocks per grid (as is currently done), ~10k out of ~30k pixels will be ignored. The situation is less disastrous for larger images, since as a (the image size) increases, the relative error in ceilfracf(a, b) decreases. Given a 64 × 64 × 64 image, "only" ~15k out of ~260k (6%) get missed.

So, the most obvious fix is to change ceilfracf in the manner described above.

This brings me to my next point. Do we even need ceilfracf? It took me some time to convince myself that (when ceilfracf does what it should) cuda_kernel_softMaskBackgroundValue samples each pixel in the image exactly once. As it stands now, threads iterate over vol in block-sized strides. Each thread block passes over a different subspan of vol, and each thread within a block samples that subspan once every SOFTMASK_BLOCK_SIZE pixels. The loop that controls this iteration increments two things: texel and a separate counter pass, which is unused in the body of the loop. The loop checks two conditions on every iteration: whether texel has gone past the end of vol, and whether pass has gone past texel_pass_num (which, as I have explained, will sometimes be off by 1). There need only be one iterator to increment.

cuda_kernel_softMaskBackgroundValue has been like this since its inception in 2016. It looks to me like the intention is to do a grid-stride loop. So, that is what I have tried to implement. I have taken the opportunity to get rid of the shared-memory array img_pixels, for which I saw no use, and to introduce a closure weight, defined outside the loop body and invoked within it. Now, there is not even any need for ceilfracf. We can dispense with it.

I have similarly rewritten cuda_kernel_softMaskOutsideMap and cuda_kernel_cosineFilter. There are other functions that make use of ceilfracf, but they only do block-stride iteration. I think they are safe.

Best,
James

jahooker added 2 commits March 5, 2023 20:51
Now, `ceilfracf` does what its name suggests.
Incidentally, it is no longer needed.
`cuda_kernel_softMaskBackgroundValue`, `cuda_kernel_cosineFilter`, and `cuda_kernel_softMaskOutsideMap` now do proper grid-stride iteration.
@dkimanius
Contributor

Hi James,

The ceilfracf function is only used in index management. The cases where the division leaves no remainder are unlikely to occur, and even when they do, they add only a tiny performance penalty; the numerical results are unaffected. However, making the suggested change would add a performance penalty that would always be incurred, due to the added floating-point division, as well as reduced numerical accuracy for large numbers.

That function is a common way of doing index management in kernels for this reason.

@jahooker
Author

Thanks Dari,

On review, I must concede that cuda_kernel_softMaskBackgroundValue samples the volume evenly. (Imaginary crisis averted?) But boy is it not obvious! Please do consider my proposal, not for ceilfracf but for the actual kernels. My changes obviate the need for ceilfracf in the first place, and with it, the need for this division. Also, do you not find my version so much more readable? 😉
