distinguish github_actions_labels for CPU vs. CUDA builds #6910
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR.
zip_keys:
  -                             # [unix]
    - c_compiler_version        # [unix]
    - cxx_compiler_version      # [unix]
    - fortran_compiler_version  # [unix]
    - cuda_compiler             # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - cuda_compiler_version     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - github_actions_labels     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
these zip keys have become almost impossible to read.
Not sure what alternative you'd have in mind - unless we completely redesign how conda works (or stop supporting CUDA builds), this is irreducible complexity IMO
I think we can start with: #6917
I would prefer that users of the GPU runners throttle themselves instead of adding extra complexity. The GPU runners aren't necessary in many cases. They also don't really guarantee that the builds don't fail (due to flaky tests). They can "throttle" by falling back to the CPU runners and rerendering.
IMO flaky tests can be skipped, and I'd much rather have an extra entry in the zip than sacrifice testing fidelity. The ideal would be to build on CPU agents and only test the GPU bits on a GPU runner, but given that that's not yet possible and pytorch has chosen to also test the GPU paths, I think the best (= least bad) tradeoff would be what this PR proposes. The CUDA 11.8 discussion is IMO unrelated - I'm not opposed, but I think it should be had on its own merits.
It's partially related. We are doing all this work and jumping through hoops for CUDA builds. Some of the old tricks need to go, to make it possible for us to think of new ones.
The only two feedstocks using GPU runners are pytorch and magma. For magma, I never got around to enabling any tests that need GPUs, so I can move that feedstock to the CPU runners if that helps. Though, TBH, magma is rarely built anymore. My understanding is that the problem you're trying to solve is that pytorch is competing with itself for GPU runners because ALL of the builds are assigned to GPU runners (even the non-GPU builds). Is that correct?
Exactly.
Wonder if it is worth reconsidering. We had tried that before; however, enough is different between then and now that we could do things that were not previously possible. For example, we can store artifacts on CI, so using that for caching is possible (without needing to create it from scratch).
I think this will work, but as you mentioned, it will break any of the 7 existing feedstocks that use the openstack server for CUDA builds. They will need to update their conda_build_config.yaml to have three items in github_actions_labels; otherwise, they will get a rendering error about the length of the zip not being equal.
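For a CUDA-enabled feedstock on that server, such an update could look roughly like the following sketch (the entry count and order must match the cuda_compiler_version zip in the global pinning, assumed here to be None / 11.8 / 12.x; the GPU label name is illustrative):

```yaml
# Hypothetical conda_build_config.yaml override for a CUDA-enabled feedstock;
# entries must line up with the cuda_compiler_version zip in the pinning
# (assumed here: None, 11.8, 12.x). Label names are illustrative.
github_actions_labels:          # [linux]
# non-CUDA build -> CPU runner
- cirun-openstack-cpu-large     # [linux]
# CUDA builds -> GPU runner
- cirun-openstack-gpu-large     # [linux]
- cirun-openstack-gpu-large     # [linux]
```

A CPU-only feedstock would instead repeat its CPU label three times, which is what the jaxlib diff further down ends up doing.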
@hmaarrfk, could you live with this for now?
Which feedstocks are you planning to use it for? If it's a small set, you could do it as a migrator.
Also, is it possible to only set this in the feedstock?
Unfortunately not, because it's not possible to extend an existing zip_key (I tried). Cf. the POC branch that was rendered with the CBC from this PR.
Primarily pytorch and tensorflow for now.
We could just do skip: True  # [linux and (cuda_compiler_version == "None") == (github_actions_labels == "cirun-openstack-gpu-placeholder")] after adding github_actions_labels in the feedstock.
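A sketch of what this workaround presumably amounts to (the elided snippet above is not preserved, so the label list and its placement are assumptions): the labels are added un-zipped, conda-build then expands the full CUDA x label matrix, and the skip prunes the mismatched combinations.

```yaml
# Assumed feedstock conda_build_config.yaml addition (not the original,
# elided snippet): two labels, deliberately NOT part of the zip.
github_actions_labels:               # [linux]
- cirun-openstack-cpu-large          # [linux]
- cirun-openstack-gpu-placeholder    # [linux]
```

and in recipe/meta.yaml:

```yaml
build:
  # prune the CPU-build-on-GPU-label and CUDA-build-on-CPU-label combinations
  skip: True  # [linux and (cuda_compiler_version == "None") == (github_actions_labels == "cirun-openstack-gpu-placeholder")]
```

This is also where the later complaint about superfluous CI workflows comes from: every variant gets a twin with the other label, and that twin only gets skipped once conda-build runs.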
That seems pretty hacky to me (aside from the fact that smithy does not prune the jobs correctly - I tested it). I mean, if the resistance to this proposal is strong, I'm not gonna die on this hill, but to me it seems natural to add the runner type (CPU vs. CUDA) to the same zip that decides whether it's a CPU or a CUDA build.
Looking at Pytorch v2.4.x (conda-forge/pytorch-cpu-feedstock#325):
So to save 6 out of 42 hours, we are adding complexity for anybody zipping keys at conda-forge. It just doesn't seem smart to me. I'm not going to vote for this approach, but I don't want to block it. I'm just a little tired of fighting with these zipped keys. Let's keep it simple and not take precious human resources to discuss this. IMO: just request the CPU runners on pytorch for things like the 2.4.x maintenance branch and one should be fine. We could also create ways to temporarily reduce the CUDA target architectures during debugging sessions. These are "pytorch only" solutions that don't need to be imposed on the rest of conda-forge.
TL;DR: an extra zip entry costs ~nothing, but without it we'd pay the extra runtime for every CI run.

For pytorch main, the calculus is 3x CPU + 3x CUDA, and the aarch CPU build got longer because we're running parts of the test suite in emulation. So it's something like 12h of runtime out of a total of 39, or almost a third. I don't care much what we do on the maintenance branches, but main should be relatively pain-free to rebuild, and IMO we shouldn't accept an avoidable ~50% increase in runtime. The simple fact that the GPU agents are essentially our rarest and most precious CI resource would IMO be enough reason to do this - even if it were only for pytorch (which it isn't, because tensorflow will be in the same situation and has an even bigger matrix).
This affects a grand total of 5 feedstocks in conda-forge, 2 of which would benefit from it. 🤷 Overall, if this is so contentious, then I'll just go with Isuru's suggestion - but then don't start complaining about the superfluous CI workflows that this creates (half of which get skipped as soon as they hit conda-build) -- I find that quite a bit more painful than an extra zip entry, but de gustibus non est disputandum I guess...
FWIW, I'm happy to help out with issues on these things if you ping me. I haven't run into any unsolvable cases yet (well, aside from extending the zip locally, which conda-build doesn't support).
Welp, given the concerns here I was planning to use Isuru's workaround, but on the pytorch feedstock that actually breaks Windows.
Perhaps this is just a question of needing to add … on the Windows server? Any chance we could try to add that @baszalmstra @wolfv?
I don’t want to block this. Please disregard my concerns.
We should probably reduce …
Thanks for the proposition. I tested this in conda-forge/pytorch-cpu-feedstock#332, but even with some more careful changes in conda-forge/conda-smithy#2233, this either shortens the variant configs to the point that the key attributes (is it a CUDA build?) wouldn't be visible from the filename anymore, or still runs into issues with the filename lengths.

Despite the fact that the enthusiasm here is not exactly high, I think this is the right way to go about it conceptually (the zip that determines CUDA/non-CUDA is the canonical place for this distinction), and that the work-arounds are worse (filename length issues, aside from pointlessly doubling the number of linux jobs, etc.) than what they're meant to avoid (a single extra zip entry). So I'll merge this and fix up the handful of affected feedstocks. The changes are minimal, e.g. I tested that this works for jaxlib:

--- a/recipe/conda_build_config.yaml
+++ b/recipe/conda_build_config.yaml
@@ -8,5 +8,8 @@ MACOSX_SDK_VERSION: # [osx and x86_64]
 - '10.14' # [osx and x86_64]
 c_stdlib_version: # [osx and x86_64]
 - '10.14' # [osx and x86_64]
-github_actions_labels:
-- cirun-openstack-cpu-large
+
+github_actions_labels: # [linux]
+- cirun-openstack-cpu-large # [linux]
+- cirun-openstack-cpu-large # [linux]
+- cirun-openstack-cpu-large # [linux]
Please don't merge things like this on your own without approval.
There were other approvals, and Mark withdrew his objections. You proposed alternatives but didn't request changes, much less provide arguments why this would be wrong. I'm willing to do the required clean-ups on the 5 affected feedstocks too, so this is a non-issue, unless you have a very concrete argument why this is the wrong approach.
Also: I've tried all the suggested alternatives beforehand as well, and they were worse than what they were working around.
I still share Mark's objection. The zip_keys for docker_image and related ones are getting out of hand. I'd like to one day remove …
I agree with this goal! And there's nothing standing in the way there AFAICT - as soon as we drop CUDA 11.8, we'll be able to drop …
Sure, but this …
This is a conversation we need to have in an issue, and I think the best course of action is to revert this PR, open an issue, and if needed discuss it in a core meeting.
We recently removed … If we keep using the newest CUDA, we might even get to drop …
Wrote up an issue: #6967
Only a few feedstocks have moved to the cirun open-gpu server, and even fewer are using GPU agents (basically tensorflow & pytorch). Those are the rarest resource (currently also at halved capacity due to some issues with the physical GPUs in the server).
Additionally, we cannot yet separate build and test phases into different agents, c.f. conda-forge/pytorch-cpu-feedstock#314 & conda-forge/conda-smithy#1472
As a consequence, a given pytorch PR currently takes at least 24h to complete, and that's if there are no competing GPU builds. To alleviate the pressure on the rarest resource, I'd like to zip github_actions_labels into the cuda_compiler zip, because this would allow us to at least choose CPU agents for the non-CUDA builds. An example based on this PR can be found on this branch.

This would very likely break the rendering of existing feedstocks on the open-gpu server, but there are currently only 7 of those, of which only 5 are cuda-enabled (and it won't concern CPU-only feedstocks due to being gated on CF_CUDA_ENABLED), so IMO this should be manageable.

@conda-forge/pytorch-cpu
@conda-forge/tensorflow
@conda-forge/jaxlib
@conda-forge/libmagma
@conda-forge/flash-attn
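For illustration, a minimal sketch of the pairing this proposal amounts to in the relevant conda_build_config.yaml (the CUDA versions and label names are assumptions; the real values live in the global pinning and in the per-feedstock overrides):

```yaml
# Sketch only: with github_actions_labels in the same zip as the CUDA axis,
# each zip slot carries its own runner type (values are illustrative).
cuda_compiler_version:
# CPU-only variant
- None
# CUDA variants (versions assumed)
- 11.8
- 12.6
github_actions_labels:
# the non-CUDA build can run on a CPU agent ...
- cirun-openstack-cpu-large
# ... while only the CUDA builds occupy the scarce GPU agents
- cirun-openstack-gpu-large
- cirun-openstack-gpu-large
```

With that pairing in place, an affected feedstock only needs to keep its own github_actions_labels override the same length as the zip, as in the jaxlib diff above.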