Add Buildkite GPU pipeline #411

Merged: 25 commits merged into master from buildkite on Jun 20, 2024

Conversation

@rkierulf (Collaborator) commented:

This is an initial implementation of a Buildkite pipeline that tests all four GPU backends (CUDA, AMDGPU, Metal, and oneAPI). As discussed with Carlos, I simplified the KomaMRICore tests so there are no longer separate tests for CPU single-thread, CPU multi-thread, and GPU. Instead, the tests check whether a test argument for "AMDGPU", "CUDA", "Metal", or "oneAPI" was passed, and if so load the corresponding package, which triggers loading the corresponding ext module from KomaMRICore. The tests no longer set the "Nthreads" parameter, so they can be run with a specific number of CPU threads by passing the Julia argument --threads= to Pkg.test(). For now, the SimpleMotion and ArbitraryMotion tests still have the :skipci tag since there is at least one unresolved issue with SimpleMotion on Metal; I still need to check whether it has been fixed by #408.
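
A rough sketch of how the reworked tests can be driven (assumed usage based on the description above, not code copied from the PR):

```julia
using Pkg

# Select the GPU backend via test_args and the CPU thread count via julia_args;
# Pkg.test forwards test_args to ARGS in the test process.
Pkg.test("KomaMRICore"; test_args = ["CUDA"], julia_args = ["--threads=4"])

# For a CPU-only run, omit the backend argument:
Pkg.test("KomaMRICore"; julia_args = ["--threads=4"])
```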

@cncastillo (Member) left a comment:

Amazing! I love the PRs that remove code 😄, well done!!

I added a few code suggestions so we can filter with :skipbuildkite, as some tests are pretty GPU-agnostic. Also, I think "KomaMRIFiles" and "KomaMRIPlots" are not used for these tests, so I suggested removing them.

EDIT: I changed my mind about :skipbuildkite; the test code coverage would be harder to calculate if one part runs on GitHub and the other on Buildkite.

(Review threads on .buildkite/pipeline.yml and KomaMRICore/test/runtests.jl; all resolved.)

@cncastillo (Member) commented, Jun 17, 2024:

Any way we can select which backend to use in VSCode?

@rkierulf (Collaborator, Author) replied:

I don't think so, since there doesn't appear to be a way to pass test arguments through the VSCode test panel. If a user wants to run an individual test on a specific backend, the easiest way might be to add `ARGS = ["CUDA"]` before the line with `include("initialize.jl")`.

@cncastillo (Member) replied:

Mmm maybe using Preferences.jl?

@rkierulf (Collaborator, Author) replied:

I think Preferences.jl requires being inside a module that corresponds to a loaded package, and TestItems.jl creates a separate module for each test item. If I try to use Preferences.jl inside runtests.jl or initialize.jl, I get an error: 'ArgumentError: Module Main does not correspond to a loaded package!'. There might be a way to get this to work, but I'm not sure how at the moment.
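
A small reproduction of the error described above (assuming the standard Preferences.jl API; not code from this PR):

```julia
using Preferences

# Preferences are keyed to a package's UUID, so querying one from a plain
# script or test module that is not itself a loaded package fails:
load_preference(@__MODULE__, "gpu_backend")
# ERROR: ArgumentError: Module Main does not correspond to a loaded package!
```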

@cncastillo (Member) commented, Jun 18, 2024:

Oh, it is fine then. If it tries to load CUDA by default, I think that would be enough, something like `isempty(ARGS) && ARGS = ["CUDA"]`.

EDIT: Ideally, there would be a way to control this locally for the VSCode tests, but I am not sure what the best way is.
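
A minimal sketch of that default (an assumption about how it could be written, not code from the PR). Note that the assignment form needs parentheses to parse inside `&&`; mutating the existing ARGS vector avoids that altogether:

```julia
# Default to the CUDA backend when no test argument was passed.
isempty(ARGS) && push!(ARGS, "CUDA")
```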

Removed KomaFiles & Plots and added `|` to `command` in pipeline.yml
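
For readers unfamiliar with the file being discussed: a Buildkite step with a multi-line `command: |` block might look roughly like the sketch below. This is based on common JuliaGPU Buildkite conventions; the plugin, agent queue, and tags are assumptions, not copied from this repository's pipeline.yml.

```yaml
steps:
  - label: "CUDA: KomaMRICore tests"
    plugins:
      - JuliaCI/julia#v1:
          version: "1.10"
    command: |
      julia --project=KomaMRICore -e '
        using Pkg
        Pkg.test(; test_args = ["CUDA"])'
    agents:
      queue: "juliagpu"   # JuliaGPU-hosted CI agents
      cuda: "*"           # request a machine with an NVIDIA GPU
    timeout_in_minutes: 60
```
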
@cncastillo (Member) left a comment:

I applied the code review changes, but some of the AMDGPU functionality needs Julia 1.10, so unfortunately I think we will need to use Julia 1.10 for that job. oneAPI seems to have the same problem with `cumsum` as Metal.

CUDA and Metal are working! https://buildkite.com/julialang/komamri-dot-jl/builds/593

I saw the CUDA ext being loaded in the AMD job; maybe we need to put the backends as weakdeps in the test Project.toml?

@rkierulf linked an issue (Jun 17, 2024) that may be closed by this pull request
@rkierulf (Collaborator, Author) commented:

I changed the Julia version for the AMD tests to 1.10 and added the same workaround we're using for Metal into oneAPIExt.jl.

For the test Project.toml dependencies, I tried moving the GPU packages to [weakdeps], but Pkg complained about them not being loaded when I ran the tests, even if I loaded them with `using` first. DiffEqGPU.jl adds the packages in CI and then deletes test/Manifest.toml so that none of them become explicit dependencies (https://github.com/SciML/DiffEqGPU.jl/blob/master/.buildkite/runtests.yml). We could follow what they're doing; however, I'm not sure how a user would then run tests on their own GPU without editing test/Project.toml themselves, which may not be desirable.

It looks like AMD now has an issue with the device name function (should be easy to fix), and Intel has an issue with 'unsupported use of double value'.

@rkierulf (Collaborator, Author) commented:

Nice, AMD is now passing!

The Intel error may take a bit more time to resolve; I'll look into it tomorrow.

@cncastillo (Member) commented, Jun 18, 2024:

This PR could be relevant for the test environment:

Also, this: https://fluxml.ai/Flux.jl/stable/guide/gpu/#Selecting-GPU-backend.
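
For reference, the Flux approach linked above stores the backend choice as a package-scoped preference, roughly like this (based on the linked Flux documentation, not on KomaMRI code):

```julia
using Flux

# Persists the choice in LocalPreferences.toml via Preferences.jl;
# it takes effect the next time Flux is loaded.
Flux.gpu_backend!("AMDGPU")
```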

@rkierulf (Collaborator, Author) commented:

I copied the approach from that pull request, so we are no longer installing all GPU backends. I also added a new `gpu` method without a backend parameter that uses the logic in `get_backend()` to determine which backend to use. The oneAPI tests are still failing on this line in BlochSimulationMethod.jl:

Mxy = [M.xy M.xy .* exp.(1im .* ϕ .- tp' ./ p.T2)]

I think a Float64 value is sneaking in somewhere, based on the 'unsupported use of double value' message.
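
A hedged sketch of the "install only the requested backend" idea mentioned in the first paragraph above (assumed shape, not this PR's exact test code):

```julia
using Pkg

# Add and load only the backend requested via test_args, so the other three
# GPU packages are never installed in the test environment.
requested = intersect(ARGS, ("AMDGPU", "CUDA", "Metal", "oneAPI"))
if !isempty(requested)
    backend = first(requested)
    Pkg.add(backend)                 # installed on demand for this test run only
    @eval using $(Symbol(backend))   # triggers the matching KomaMRICore extension
end
```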

@rkierulf (Collaborator, Author) commented:

oneAPI is passing now that I switched the line with:

exp.(exp_argument)

to the equivalent:

exp.(real.(exp_argument)) .* (cos.(imag.(exp_argument)) .+ Complex{T}(0,1) * sin.(imag.(exp_argument)))

I'm pretty sure this is a oneAPI bug, but it's nice that we can work around it for now.
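
The workaround is just Euler's formula, exp(a + bi) = exp(a)(cos(b) + i*sin(b)), written so that only real-valued exp/cos/sin are emitted. A standalone sketch (the helper name is hypothetical, not from the PR):

```julia
# Evaluate exp(z) for complex z using only real-valued functions, avoiding the
# double-precision intrinsics that oneAPI rejected above.
function exp_complex(z::Complex{T}) where {T<:Real}
    return exp(real(z)) * Complex{T}(cos(imag(z)), sin(imag(z)))
end

exp_complex(ComplexF32(0.5f0, 1.0f0))  # ≈ exp(0.5f0 + 1.0f0im)
```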


codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 88.22%. Comparing base (515d915) to head (9068796).

@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   87.40%   88.22%   +0.82%     
==========================================
  Files          49       49              
  Lines        2802     2811       +9     
==========================================
+ Hits         2449     2480      +31     
+ Misses        353      331      -22     

| Flag | Coverage Δ |
|------|------------|
| base | 86.42% <ø> (ø) |
| core | 79.39% <50.00%> (+5.70%) ⬆️ |
| files | 93.70% <ø> (ø) |
| komamri | 93.98% <ø> (ø) |
| plots | 89.27% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|-------|------------|
| KomaMRICore/ext/KomaAMDGPUExt.jl | 41.66% <100.00%> (+41.66%) ⬆️ |
| KomaMRICore/ext/KomaoneAPIExt.jl | 60.00% <100.00%> (+60.00%) ⬆️ |
| KomaMRICore/src/KomaMRICore.jl | 100.00% <ø> (ø) |
| .../src/simulation/Bloch/BlochDictSimulationMethod.jl | 87.50% <100.00%> (ø) |
| ...Core/src/simulation/Bloch/BlochSimulationMethod.jl | 100.00% <100.00%> (ø) |
| KomaMRICore/ext/KomaMetalExt.jl | 0.00% <0.00%> (ø) |
| KomaMRICore/src/simulation/Functors.jl | 32.43% <0.00%> (-20.70%) ⬇️ |

... and 3 files with indirect coverage changes

@cncastillo (Member) commented:

We will be able to use `exp` when this is merged :)

@cncastillo (Member) commented, Jun 19, 2024:

We also got a passing badge!

[Build status badge]

We can specify `?step=` to get a badge for each backend, but I haven't looked into this.

@rkierulf requested a review from cncastillo, June 19, 2024 18:42
@cncastillo mentioned this pull request, Jun 19, 2024
@cncastillo (Member) left a comment:

Due to problems with the codecov keys for Metal, we decided to remove Metal's codecov 😞.

Fixed codecov key, removed Metal codecov
@cncastillo changed the title from "Add Buildkite pipeline" to "Add Buildkite GPU pipeline", Jun 20, 2024
@cncastillo merged commit 3175e71 into master, Jun 20, 2024
16 of 18 checks passed
@cncastillo deleted the buildkite branch, June 20, 2024 17:21
Linked issue that may be closed by this pull request: Using BuildKite for GPU related CI?