[CUDA] consolidate CUDA versions #5677

Merged
merged 21 commits into from
Feb 1, 2023

jameslamb
Collaborator

@jameslamb jameslamb commented Jan 17, 2023

Changes

Proposes removing the CUDA implementation from #3160.

As of this PR, the only CUDA build of LightGBM would be the one we've been calling cuda_exp, which @shiyu1994 started in #4528 and #4630.

Specifically:

  • removes all mentions of "CUDA exp" or "CUDA Experimental" in docs and internal code
  • removes all code specific to the implementation from #3160 ("Add support for CUDA-based GPU build")
  • when using python setup.py --cuda-exp or cmake -DUSE_CUDA_EXP=1, raises a deprecation warning and still builds the version we've been calling "cuda_exp" until now
  • removes 2 CUDA CI jobs, so now there will be three, one each for pip, source, and wheel builds of the CUDA-enabled Python package
  • increases the minimum supported CUDA version from 9.0 to 10.0
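
The deprecation behaviour described in the list above (accept the old option, warn, then build the single remaining CUDA implementation) could be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_cuda_option` is a hypothetical helper, not code from this PR or from LightGBM's setup.py, and the warning text is made up.

```python
import warnings


def resolve_cuda_option(use_cuda: bool, use_cuda_exp: bool) -> bool:
    """Return True if a CUDA build was requested (hypothetical helper).

    The old "cuda_exp" option is still accepted, but it only emits a
    deprecation warning and then behaves exactly like the "cuda" option.
    """
    if use_cuda_exp:
        warnings.warn(
            "'cuda_exp' is deprecated, use 'cuda' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return use_cuda
```

With this shape, existing build scripts that still pass the old option keep working, while new scripts have a single option to use.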

History

(please correct me if I've mischaracterized the history below)

In #3160 (merged September 2020), a team from IBM added a first CUDA implementation of LightGBM because the existing OpenCL-based build didn't support some platforms (namely, IBM Power).

About a year after that, @shiyu1994 and @guolinke (along with others at Microsoft?) started on an "experimental" CUDA implementation.

That "experimental" implementation was first merged in #4630 (March 2022), and since then we've had two CUDA implementations maintained in this repo:

  • cuda = the IBM contribution
  • cuda_exp = the newer implementation from Microsoft

Since then, @shiyu1994 has been working actively on that cuda_exp version, with the plan to include it in a v4.0.0 release (#5153).

The cuda_exp version is still missing some important features, like distributed training (#5076) and on-GPU computation of metrics and loss functions (#5163).

Despite the current limitations, this PR implements the proposal from #5153 (comment) to consolidate down to only one CUDA implementation in LightGBM... the one currently called cuda_exp.

Motivation for this change

In my opinion, LightGBM does not have enough maintainer/contributor availability to maintain two separate CUDA implementations.

Consolidating down to 1 allows the project to more effectively channel the limited attention of its maintainers and contributors towards improving the LightGBM-on-GPU experience, by not duplicating effort across two different builds intended to serve the same purpose.

  • improves development velocity by removing two costly CI jobs
  • reduces confusion for users wanting to run GPU-accelerated LightGBM
  • noticeably simplifies the codebase and reduces its size
  • focuses all feature requests, bug reports, code contributions, etc. on one CUDA implementation

This represents a temporary loss of functionality (e.g. multi-GPU training), but I think it'll help the project to move faster and @shiyu1994 has said that that functionality is actively under development for the cuda_exp implementation.

Notes for Reviewers

I know this is a large change, so tagging in others for their opinions.

@shiyu1994 @guolinke @huanzhang12 @jmoralez @StrikerRUS @btrotta @ChipKerchner @ceseo

👋 Thanks all for your consideration.

@jameslamb jameslamb changed the title from "WIP: consolidate CUDA versions" to "[CUDA] consolidate CUDA versions" Jan 18, 2023
@jameslamb jameslamb marked this pull request as ready for review January 18, 2023 05:40
@jameslamb jameslamb mentioned this pull request Jan 18, 2023
60 tasks
@ChipKerchner
Contributor

ChipKerchner commented Jan 18, 2023

We just want to make sure it still works (compiles, runs, etc.) similarly to the original version. Is there a CI building these approaches (cuda, cuda_exp and combined)?

@jameslamb
Collaborator Author

Is there a CI building these approaches (cuda, cuda_exp and combined)?

@ChipKerchner I don't totally understand what you mean by this question, especially "and combined". I'll try to answer but please let me know if that's not sufficient.

Every commit merged into master in this project in at least the last 6 months has seen the Python package built successfully and its unit tests pass for both the cuda version (from #3160) and cuda_exp version (from #4630 and onwards).

Here's the configuration for that:

include:
  - method: source
    compiler: gcc
    python_version: "3.8"
    cuda_version: "11.7.1"
    task: cuda
  - method: pip
    compiler: clang
    python_version: "3.9"
    cuda_version: "10.0"
    task: cuda
  - method: wheel
    compiler: gcc
    python_version: "3.10"
    cuda_version: "9.0"
    task: cuda
  - method: source
    compiler: gcc
    python_version: "3.8"
    cuda_version: "11.7.1"
    task: cuda_exp
  - method: pip
    compiler: clang
    python_version: "3.9"
    cuda_version: "10.0"
    task: cuda_exp

And build logs for the latest commit to master: https://github.com/microsoft/LightGBM/actions/runs/3935991924

The same CI coverage is preserved in this PR, and we'll continue to block any future PRs that break the CUDA support.
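
As a quick sanity check on the matrix quoted above, the following sketch counts how many CI jobs each task contributes. The tuples are copied by hand from that configuration; the variable names are illustrative and do not come from the repository.

```python
from collections import Counter

# Values copied from the matrix quoted above:
# (method, compiler, python_version, cuda_version, task)
MATRIX_BEFORE = [
    ("source", "gcc", "3.8", "11.7.1", "cuda"),
    ("pip", "clang", "3.9", "10.0", "cuda"),
    ("wheel", "gcc", "3.10", "9.0", "cuda"),
    ("source", "gcc", "3.8", "11.7.1", "cuda_exp"),
    ("pip", "clang", "3.9", "10.0", "cuda_exp"),
]

# Count jobs per task; the task is the last element of each tuple.
jobs_per_task = Counter(task for *_, task in MATRIX_BEFORE)
print(dict(jobs_per_task))  # {'cuda': 3, 'cuda_exp': 2}
```

That is five CUDA jobs in total before this PR, which matches the PR description's statement that removing two jobs leaves three.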

cuda, cuda_exp, and combined

I'm confused by your use of the phrase "and combined" here, so I want to be absolutely sure you understand what's being proposed here. As of this PR, there will only be one CUDA implementation of LightGBM.

@ChipKerchner
Contributor

By "combined" I meant this "one CUDA implementation of LightGBM" approach for this PR.

@jameslamb
Collaborator Author

ah got it! Yes, the CI as of this PR tests that "one CUDA implementation of LightGBM" on Ubuntu 18.04 for the following combinations of CUDA versions, Python versions, and compilers:

  - method: wheel
    compiler: gcc
    python_version: "3.10"
    cuda_version: "11.7.1"
    task: cuda
  - method: source
    compiler: gcc
    python_version: "3.8"
    cuda_version: "10.0"
    task: cuda
  - method: pip
    compiler: clang
    python_version: "3.9"
    cuda_version: "11.7.1"
    task: cuda

you can see the build logs by clicking "Details" next to any of the checks with names starting like "CUDA Version" on this PR, e.g. https://github.com/microsoft/LightGBM/actions/runs/3945899727/jobs/6753199490
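
For readers who want to check the combinations listed in this comment against the PR description, here is a small sketch that finds the lowest CUDA version tested and the build methods covered. The data is copied by hand from the matrix in this comment; `version_key` is a hypothetical helper introduced only for this example.

```python
# Values copied from the post-PR matrix listed in this comment.
MATRIX_AFTER = [
    {"method": "wheel", "compiler": "gcc", "python_version": "3.10", "cuda_version": "11.7.1"},
    {"method": "source", "compiler": "gcc", "python_version": "3.8", "cuda_version": "10.0"},
    {"method": "pip", "compiler": "clang", "python_version": "3.9", "cuda_version": "11.7.1"},
]


def version_key(version: str) -> tuple:
    """Turn '11.7.1' into (11, 7, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Compare versions numerically: as plain strings, "9.0" would sort after "10.0".
lowest_cuda = min((job["cuda_version"] for job in MATRIX_AFTER), key=version_key)
methods = sorted(job["method"] for job in MATRIX_AFTER)

print(lowest_cuda)  # 10.0
print(methods)      # ['pip', 'source', 'wheel']
```

The lowest tested version, 10.0, is consistent with the new minimum supported CUDA version stated in the PR description, and each of the pip, source, and wheel builds is covered once.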

@jameslamb
Collaborator Author

@shiyu1994 @guolinke is there any other information I could provide to help with this?

This type of PR will be difficult to keep up to date with master if other CUDA code is merged, since it touches so many files. I'd really like to do whatever I can to move it forward.

Collaborator

@shiyu1994 shiyu1994 left a comment

Thanks for the great work. I've reviewed all the changes and left a few comments about whether to keep cuda_exp as a valid value for the compilation options and the device parameter, along with a few suggested changes.

Resolved review threads:

  • CMakeLists.txt (2)
  • docs/Parameters.rst
  • python-package/setup.py
  • src/boosting/gbdt.h (3)
  • src/io/config.cpp
  • src/treelearner/tree_learner.cpp (2)

@jameslamb
Collaborator Author

Thanks @shiyu1994 . I just pushed some commits completely removing CUDA_EXP, instead of keeping that configuration option and raising a warning. I think that'll reduce confusion, and it's ok given that it was never officially included in a release.
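
For anyone who tried the development version with the device parameter set to "cuda_exp", a parameter dictionary could be updated along these lines. This is a minimal sketch, not part of this PR: `migrate_device_type` is a hypothetical helper, and it assumes the setting is stored under the `device_type` key rather than under an alias.

```python
def migrate_device_type(params: dict) -> dict:
    """Return a copy of params with 'cuda_exp' replaced by 'cuda'.

    Hypothetical helper for illustration only; it assumes the device is
    given under the 'device_type' key and leaves other values untouched.
    """
    updated = dict(params)
    if updated.get("device_type") == "cuda_exp":
        updated["device_type"] = "cuda"
    return updated


old_params = {"objective": "regression", "device_type": "cuda_exp"}
new_params = migrate_device_type(old_params)
print(new_params["device_type"])  # cuda
```

The helper returns a copy, so the original dictionary is left unchanged. Actually training with the result would still require a CUDA-enabled build of LightGBM.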

@shiyu1994
Collaborator

@jameslamb Thanks. The changes LGTM. It seems that we are encountering some CI issues: one R test fails, and one GPU job in the Azure DevOps CI fails. I tried re-triggering the R test, but it failed again. Maybe we should fix the CI issues first.

@jameslamb
Collaborator Author

Thanks!

@shiyu1994 I've already put up a PR to fix the R CI issues. Can you please review #5689?

I've noticed that the GPU jobs on Azure DevOps have gotten flakier since I merged #5292. It's usually fixed by re-running once or twice. I'll keep doing that. We can turn the Dask tests back off on GPU builds in the future if it gets too annoying or we don't have time to investigate the issues.

@shiyu1994
Collaborator

I've already put up a PR to fix the R CI issues. Can you please review #5689?

Sorry for the delay. I see that PR is already merged. Thanks.

I've noticed that the GPU jobs on Azure DevOps have gotten flakier since I merged #5292.

I agree. But I'll try to spare some time to investigate it if it happens frequently.

@jameslamb
Collaborator Author

no problem, thanks @shiyu1994 ! I just merged in the changes from #5689 (thanks to @jmoralez for reviewing!) here so once CI rebuilds on this PR, and if you approve this PR, I think we can merge it.

Collaborator

@shiyu1994 shiyu1994 left a comment

Thanks. The changes LGTM.

@shiyu1994 shiyu1994 merged commit 4f47547 into master Feb 1, 2023
@shiyu1994 shiyu1994 deleted the remove-cuda-v1 branch February 1, 2023 03:27
@jameslamb
Collaborator Author

awesome, thank you so much @shiyu1994 !!! I'm excited to have shorter CI times and for us to be able to focus on a single CUDA version 😁

And thank you so much @ChipKerchner and your teammates for getting LightGBM started on this CUDA journey back in #3160.

@shiyu1994
Collaborator

Thanks @ChipKerchner and your team for the contribution to LightGBM CUDA version!

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
3 participants