Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cooperative Groups API integration #87

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

thedodd
Copy link
Contributor

@thedodd thedodd commented Sep 16, 2022

This works as follows:

  • Users build their Cuda code via CudaBuilder as normal.
  • If they want to use the cooperative groups API, then in their build.rs, just after building their PTX, they will:
    • Create a cuda_builder::cg::CooperativeGroups instance,
    • Add any needed opts for building the Cooperative Groups API bridge code (-arch=sm_* and so on),
    • Add their newly built PTX code to be linked with the CG API, which can include multiple PTX, cubin or fatbin files,
    • Call .compile(..), which will spit out a fully linked cubin,
  • In the user's main application code, instead of using launch! to schedule their GPU work, they will now use launch_cooperative!.

todo

  • Update cust to expose the cuLaunchCooperativeKernel in a nice interface. We can add the cooperative multi device bits later, along with all of the other bits from the cooperative API.
  • Update cuda_builder crate to expose a nice API around linking PTX code with the Cooperative Groups API bridge code.
  • Remove the Justfile. I was only using it for POC testing.
  • Remove the PTX code from this branch, currently it is just here for reference.

@thedodd
Copy link
Contributor Author

thedodd commented Sep 16, 2022

@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine which brought about this need in the first place.

Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed overall. Let me know what you think, happy to modify things as I go.

Also, for this first pass, I would like to keep focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device and the other cooperative group components later.

@thedodd thedodd force-pushed the cooperative-groups-1 branch 5 times, most recently from 4bbc882 to e44a8bc Compare September 18, 2022 03:28
This works as follows:
- Users build their Cuda code via `CudaBuilder` as normal.
- If they want to use the cooperative groups API, then in
  their `build.rs`, just after building their PTX, they will:
  - Create a `cuda_builder::cg::CooperativeGroups` instance,
  - Add any needed opts for building the Cooperative Groups
    API bridge code (`-arch=sm_*` and so on),
  - Add their newly built PTX code to be linked with the CG API,
    which can include multiple PTX, cubin or fatbin files,
  - Call `.compile(..)`, which will spit out a fully linked `cubin`,
- In the user's main application code, instead of using `launch!` to
  schedule their GPU work, they will now use `launch_cooperative!`.
@thedodd thedodd force-pushed the cooperative-groups-1 branch from e44a8bc to aefa92a Compare September 18, 2022 03:36
@thedodd thedodd marked this pull request as ready for review September 18, 2022 03:36
@thedodd thedodd changed the title Add bridging PTX code for cooperative_groups API Add Cooperative Groups API integration Sep 18, 2022
@RDambrosio016
Copy link
Member

This looks neat, but if im not mistaken, those functions map to single PTX intrinsics directly, wouldn't it be easier to use inline assembly? though i haven't actually looked into this so im not sure if they map to more than one PTX instruction

@thedodd
Copy link
Contributor Author

thedodd commented Sep 21, 2022

wouldn't it be easier to use inline assembly?

I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify in addition to cross-referencing with the PTX ISA spec. However, I will say, many of the interfaces were not as clear, and this seemed to be a potentially more reliable way to generate the needed code.

Perhaps we can replace some of the clear interfaces with some ASM instead. Happy to iterate on this in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants