Platform-specific algorithm composition #52

Open

stephenswat opened this issue Jul 16, 2021 · 2 comments
Labels: discussion (Let's talk!) · feature (New feature or request)

Comments

@stephenswat (Member)

Right now, our algorithms are structured as function objects (or, as C++ programmers like to confusingly call them, functors), which wrap some algorithmic code. This works fairly well for CPU code, and it will likely work well for other heterogeneous code in the near future, but there is a significant problem with this strategy: it assumes that function composition works the same way on every platform. Since these function objects are called sequentially on the CPU side, each algorithm is followed by an implicit synchronisation point, even on platforms where that is undesirable. For example, CUDA algorithms will wait unnecessarily, and we will not be able to exploit the asynchronous properties of the CUDA programming model.

I am opening this issue because I think that we will, in the close-to-medium future, need to come up with a model for algorithm composition that is flexible towards the properties of specific platforms. We will want to encode in C++ the different behaviours we want for different platforms, preferably with as little additional code as possible. For example:

  • For CPU code, g ∘ f is simply equal to running f, implicitly synchronizing, and then running g.
  • For CUDA code, g ∘ f could be significantly more performant if we create a CUDA stream, add f to it, add g to it, and then synchronize only at the end of that sequence of instructions.
  • Other heterogeneous platforms that support asynchronous execution will need something similar.

I am not sure C++ allows us to solve this problem in a truly elegant way, but we might be able to expand our definition of what constitutes an algorithm by requiring two methods instead of one:

  1. A private asynchronous method that allows us to compose actions across streams and other models of asynchronous computation.
  2. A public synchronous method that implements some sort of platform-specific synchronisation barrier after calling the private asynchronous method, so we retain the right cause-and-effect relationship on the CPU side (see the sketch after this list).

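To make this concrete, here is a minimal sketch of what such a two-method interface might look like. The algorithm, run_sync and run_async names follow the pseudocode below; everything else (the virtual dispatch, the access levels) is an assumption for illustration only:

template<typename I, typename R>
class algorithm {
public:
    virtual ~algorithm() = default;

    // Public, synchronous entry point: safe to call from ordinary CPU
    // code, with the usual cause-and-effect guarantees.
    virtual R run_sync(I & i) = 0;

    // Conceptually private: only composition classes should call this.
    // It is left public in this sketch because a composition needs to
    // call it on its child algorithms; real code would need friendship
    // or some other access scheme.
    virtual R run_async(I & i) = 0;
};

Note that the CUDA variant below needs an extra stream parameter on run_async, which already hints at the type-system friction discussed at the end of this post.
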
Then, on top of that, we would obviously need to formalize the composition itself. Essentially, we would be building a monad. For example, the function composition g ∘ f might look something like this on a CPU (in C++-like pseudocode):

template<typename I, typename M, typename R>
class cpu_composition : public algorithm<I, R> {
public:
    cpu_composition(algorithm<I, M> & f, algorithm<M, R> & g) : f1(f), f2(g) {}

    R run_sync(I & i) {
        // On the CPU, running asynchronously and then synchronizing is a
        // no-op, so the synchronous method is just the asynchronous one.
        return run_async(i);
    }

private:
    R run_async(I & i) {
        // Plain function composition: run f, then feed its result to g.
        M m = f1.run_async(i);
        return f2.run_async(m);
    }

    algorithm<I, M> & f1;
    algorithm<M, R> & f2;
};

...but it would look very different on the CUDA side...

template<typename I, typename M, typename R>
class cuda_composition : public algorithm<I, R> {
public:
    cuda_composition(algorithm<I, M> & f, algorithm<M, R> & g) : f1(f), f2(g) {}

    R run_sync(I & i) {
        // Create a stream, enqueue the whole composition onto it, and
        // synchronize only once at the very end.
        cudaStream_t s;
        cudaStreamCreate(&s);
        R r = run_async(i, s);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return r;
    }

private:
    R run_async(I & i, cudaStream_t & s) {
        f1.run_async(i, s);
        // Passing return data between these two is iffy, because CUDA always
        // uses output parameters which are hard to model, but that is an
        // implementation detail.
        return f2.run_async(..., s);
    }

    algorithm<I, M> & f1;
    algorithm<M, R> & f2;
};
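
Regardless of the platform, the call site would then look the same. A hypothetical usage sketch (the concrete algorithm and container types here are made up for illustration):

// f : cells -> clusters, g : clusters -> measurements, both hypothetical.
clusterization f;
measurement_creation g;

cuda_composition<cells, clusters, measurements> comp(f, g);

// The caller only ever sees the synchronous interface; all the stream
// handling above stays hidden inside the composition.
measurements out = comp.run_sync(input);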

The thing is, I am not sure if this would be the best solution. On the plus side, it requires minimal boilerplate code, which is great. On the other hand, it is a serious pain in the behind for the type system, and it would require us to think long and hard about what we want to decide at compile time and what we want to do at run time, because that will determine a lot about the further design of traccc.

Anyhow, this was mostly a brain dump, and I would be very curious to hear everyone else's opinion on how we should proceed here.

stephenswat added the discussion (Let's talk!) and feature (New feature or request) labels on Jul 16, 2021
@aaronj0 commented Feb 1, 2023

Hello everyone, I'm Aaron and I'm quite interested in contributing to the traccc project.

We can utilize the fact that kernels launched in the same stream are executed sequentially. So if a function g launches multiple kernels with a dependency between them (for example, g calls f2(f1)), we can run the whole of g on one stream; but if a function g1 does not depend on a function g2, we can run the two on separate CUDA streams, as sketched below.
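
A rough sketch of that second case, reusing the hypothetical run_async(input, stream) signature from the original post:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// g1 and g2 are independent, so their kernels may overlap on the device.
g1.run_async(in1, s1);
g2.run_async(in2, s2);

// Synchronize each stream only after its whole pipeline has been enqueued.
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);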

In the example provided, run_async is not asynchronous with respect to the other function calls inside the composition: since they share the same CUDA stream, the kernels will be launched one after the other. It is, however, asynchronous with respect to other cuda_compositions.

I am unsure how we can implement error synchronization, since checking for errors is atomic. If a function results in an error, the kernel will have to be aborted without blocking.
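
For reference, the CUDA runtime does provide non-blocking error probes, which might be a starting point; a minimal sketch, assuming s is a stream with work already enqueued:

// Check for launch/configuration errors without synchronizing, and
// without resetting the sticky error state.
cudaError_t launch_err = cudaPeekAtLastError();

// Poll the stream: cudaSuccess when all enqueued work is done,
// cudaErrorNotReady while it is still running, or an error code if an
// asynchronous failure has surfaced. This never blocks the host.
cudaError_t stream_state = cudaStreamQuery(s);

if (launch_err != cudaSuccess) {
    // React here without forcing a device-wide synchronisation.
}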

@stephenswat (Member, Author) commented Mar 28, 2023

Hi Aaron, sorry for responding to your comment in such a tardy fashion. I appreciate your comment; in the time since I opened this issue we've made relatively little progress on this, but I have started work on composable CUDA graphs (#307), which allow the runtime to restructure the execution graph as it sees fit while keeping dependencies in place. I'd be interested to hear your further opinions.
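
For context, the stream-capture mechanism underlying CUDA graphs looks roughly like this; a minimal sketch, not the actual #307 code, with f and g standing in for algorithms using the hypothetical run_async(input, stream) signature from above:

cudaStream_t s;
cudaStreamCreate(&s);

// Record the enqueued work into a graph instead of executing it directly.
cudaGraph_t graph;
cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
f.run_async(in, s);
g.run_async(mid, s);
cudaStreamEndCapture(s, &graph);

// Instantiate once, then launch; the runtime may schedule independent
// nodes concurrently while preserving the captured dependencies.
cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, s);
cudaStreamSynchronize(s);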

If you still want to contribute, please contact me by e-mail and we can get you started.
