[ci] CUDA jobs failing when installing packages #6001
Comments
I just tried another rebuild on #5999, and this is still happening. (build link)
I think fixing this requires some administrative action.
@shiyu1994 since you're the only one with access to the machine these CUDA jobs run on, could you please try the following:
Run the following on that machine, and choose
sudo apt-get update
sudo apt autoremove
sudo apt-get install --no-install-recommends -y \
    curl \
    lsb-release \
    software-properties-common
Then try re-triggering the CUDA jobs, e.g. at https://github.com/microsoft/LightGBM/actions/runs/5619064630/job/15276054852?pr=5999.
@shiyu1994 can you please help? Sorry to keep I just tried re-running again, and the jobs failed the same way: https://github.com/microsoft/LightGBM/actions/runs/5648086657 This problem won't go away on its own.
@jameslamb Sorry for the late response. I'll check the machine.
I just tried re-running again, and the jobs failed the same way: https://github.com/microsoft/LightGBM/actions/runs/5640538711/job/15486941200 @shiyu1994 if you don't have time to help with this, can I just have access to the machine so I can fix it? I want development to restart in the repo as soon as possible.
The issue is fixed, and the CUDA CI jobs now seem OK to run.
And sorry for the delay.
Thank you so much!
Is it possible for me to get access to the machine so I can do things like this in the future? Or for someone else like @guolinke to also have access? These parts of our development where there is only one person who can do something pose a big risk of long disruptions like this.
I'm trying with @guolinke to see if the machine can be safely accessed by other maintainers.
Thank you, that would be helpful! As a general rule, having more than one person for every operational responsibility in the repo would improve the long-term health and sustainability of this project.
This issue was fixed a while ago, and we've moved the discussion about expanding access to administer the machine for these jobs into a private space. This can be closed.
Description
All the CUDA jobs across several PRs (e.g. #5997, #5999) started failing yesterday, with the following errors.
Reproducible example
This is happening on master and all PRs. (example build link)
Additional Comments
Some resources that might be helpful: