Releases: NVIDIA/cccl

CCCL 2.7.0

06 Jan 22:12
v2.7.0
b5fe509

What’s New

C++

Thrust / CUB

  • Inclusive scan now supports initial value #1940
  • Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
  • New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms #1817
  • New thrust::tabulate_output_iterator fancy iterator #2282
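
To illustrate what an inclusive scan with an initial value computes, here is a minimal host-side sketch using the C++17 standard library rather than Thrust or CUB. The helper name inclusive_scan_with_init is hypothetical, and the exact signature of the Thrust/CUB overload added in #1940 is not shown here; its argument order may differ from std::inclusive_scan.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper for illustration only (not a Thrust/CUB API).
// Computes out[i] = init + in[0] + ... + in[i] on the host
// using std::inclusive_scan from C++17.
std::vector<int> inclusive_scan_with_init(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
    return out;
}
```

With the input {1, 2, 3, 4} and an initial value of 10, every prefix sum is offset by 10, which is the same semantics the device-side scan is expected to provide.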

Libcudacxx

  • Assertions can now be enabled on host and device, depending on the user's choice
  • C++26 inplace_vector has been implemented and backported to C++14
  • Improved support for the extended floating point types __half and __nv_bfloat16, in both the cmath functions and complex
  • cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
  • Reworked our atomics implementation
  • Improved <cuda/std/bit> conformance
  • Implemented <cuda/std/bitset> and backported to C++14
  • Implemented and backported C++20 bit_cast. It is available in all standard modes and is constexpr where the compiler supports it
  • Various backports and constexpr improvements (bool_constant, cuda::std::max)
  • Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
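
As a rough idea of what a pre-C++20 bit_cast backport involves, the sketch below reinterprets an object's bytes with std::memcpy. It is an assumption-laden illustration, not the libcu++ implementation: the name bit_cast_sketch is hypothetical, the function is not constexpr (which, as noted above, requires compiler support), and the usage below assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical memcpy-based bit_cast for illustration (not the libcu++ code).
// Works in C++11 and later, but cannot be used in constant expressions.
template <class To, class From>
To bit_cast_sketch(const From& from) noexcept {
    static_assert(sizeof(To) == sizeof(From), "source and destination sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    To to;  // assumes To is default constructible
    std::memcpy(&to, &from, sizeof(To));  // copy the object representation byte for byte
    return to;
}
```

For example, bit_cast_sketch<std::uint32_t>(1.0f) yields the bit pattern of 1.0f as an unsigned integer, and casting that value back to float round-trips exactly.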

Python

cuda.cooperative

The best practices for using CCCL to make your CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.

Block and warp-level cooperative algorithms are now available in Python #1973.
Experimental versions of reduce, scan, merge and radix sort are available in numba.cuda kernels.

cuda.parallel

Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.

CCCL 2.6.1

10 Sep 18:45
v2.6.1
9019a6a

This release includes backports for PRs #2332 and #2341. Please see release 2.6.0 for the full list of changes included in the release.

What's Changed

Full Changelog: v2.6.0...v2.6.1

CCCL 2.6.0

04 Sep 17:42
c67b1c3

What's Changed

Full Changelog: v2.5.0...v2.6.0

CCCL 2.5.0

17 Jun 18:00
69be18c

What's New

This release includes several notable improvements and new features:

  • CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
  • We have added a new cub::DeviceSelect::FlaggedIf API, which selects items by applying a predicate to their corresponding flags. This addition provides more flexibility and control over item selection.
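
The selection semantics described for FlaggedIf can be sketched on the host as follows. This is a sequential reference under stated assumptions, not the CUB API: the name flagged_if_reference is hypothetical, and the real cub::DeviceSelect::FlaggedIf runs on the device with its own parameter list, which is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not the CUB signature).
// Keeps items[i] whenever pred(flags[i]) is true, preserving input order.
template <class T, class Flag, class Pred>
std::vector<T> flagged_if_reference(const std::vector<T>& items,
                                    const std::vector<Flag>& flags,
                                    Pred pred) {
    assert(items.size() == flags.size());  // one flag per item
    std::vector<T> selected;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (pred(flags[i])) {
            selected.push_back(items[i]);
        }
    }
    return selected;
}
```

The difference from a plain flagged selection is that the flags need not be booleans: the predicate decides which flag values count as "selected".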

v2.4.0

23 Apr 21:30
1c009d2

What’s New

We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, it introduces support for large problem sizes (64-bit offsets) into Thrust algorithms.

CUB

  • cub::DeviceSelect::UniqueByKey now supports an equality operator and large problem sizes.
  • The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach. cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses.
  • Many CUB algorithms now support CUDA graph capture mode.

libcudacxx

  • Added new cuda::ptx namespace with wrappers for inline-PTX instructions
  • Added cuda::std::complex specializations for the CUDA types bfloat and half.

v2.3.2

12 Mar 20:22
64d3a5f

What's Changed

Full Changelog: v2.3.1...v2.3.2

v2.3.1

23 Apr 21:29
299eb62

What's Changed

  • [BACKPORT]: Fix bug in stream_ref::wait by @miscco in #1283
  • Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1286
  • Create patch 2.3.1 by @wmaxey in #1287

Full Changelog: v2.3.0...v2.3.1

CCCL 2.3.0

28 Feb 18:36
c4eda1a

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because it will conflict with the CCCL headers in the CUDA Toolkit (CTK). Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.

TL;DR: You should never see warnings emitted from a CCCL header again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.

Thrust

thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.

CUB

Performance improvements of up to 60% for cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.

libcudacxx

  • The cuda::ptx namespace and the <cuda/ptx> header are now available and provide access to inline PTX functions covering various async memcpy and barrier intrinsics.
  • #379 - Added experimental bulk TMA memcpy under <cuda/barrier>

What's Changed

  • Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
  • Branch/2.2.x by @gevtushenko in #305
  • Tune unique by key on A100 by @gevtushenko in #306
  • Merge branch/2.2.x to main by @jrhemstad in #308
  • Add example cmake project by @jrhemstad in #177
  • Adds catch2 tests for reduce-by-key by @elstehle in #311
  • Tune scan by key on A100 by @gevtushenko in #325
  • Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
  • Fix MSVC / CUB tests build by @gevtushenko in #336
  • gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
  • Add a nvrtc configuration for libcu++ by @miscco in #202
  • GH Infra: project automation and issue template fixes by @jarmak-nv in #297
  • Tune reduce by key on A100 by @gevtushenko in #346
  • Merge commits from 2.2 branch by @miscco in #350
  • Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
  • Assorted fixes for MSVC 2017 by @miscco in #341
  • [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
  • Port cub::DeviceScan tests to catch2 by @elstehle in #347
  • Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
  • Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
  • Work around strange gcc bug by @miscco in #363
  • Implement iter_swap CPO by @miscco in #332
  • Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
  • Add clang16 devcontainer and CI job by @miscco in #362
  • [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
  • [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
  • Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
  • Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
  • Use copy-pr-bot by @ajschmidt8 in #381
  • Implement the permutable concept by @miscco in #367
  • [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
  • Implement identity changes for c++20 by @miscco in #383
  • Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
  • Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
  • Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
  • Add commit information to the test output by @miscco in #401
  • Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
  • Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
  • Add support for tests that should fail at runtime by @ahendriksen in #418
  • Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
  • Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
  • [BUG] Ensure that all headers build on their own by @miscco in #200
  • Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
  • Fix ci-overview example by @gevtushenko in #428
  • Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
  • Add cuda::device::barrier_arrive tx by @ahendriksen in #358
  • Fix CubDebug by @gevtushenko in #430
  • Do not use static member functions to initialize static member variables. by @miscco in #438
  • Implement the projected helper struct by @miscco in #385
  • Add PTX wrapping functions for TMA features by @ahendriksen in #379
  • Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
  • Enable lit to determine the compute architectures by @miscco in #447
  • Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
  • Improve documentation of cuda::barrier by @ahendriksen in #440
  • Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
  • Remove having two install rules for -header-search.cmake by @robertmaynard in #298
  • Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
  • Remove C++03 compatability from unit tests by @Blonck in #378
  • [libcu++] Fix use of __ppc64__ by @miscco in #451
  • Update the README by @jrhemstad in #291
  • [libcu++] Try to avoid gcc misscompilation issues by @miscco in #452
  • Consolidate matrix logic into single script/job by @jrhemstad in #361
  • Implement the indirectly_comparable concept by @miscco in #445
  • Fix compute matrix dropping trailing zeros by @jrhemstad in #466
  • Avoid integer promotion warnings with MSVC by @miscco in #460
  • Implement ranges comparison objects by @miscco in #464
  • Fix CUB/MSVC/RDC tests by @gevtushenko in #469
  • Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
  • Script for Running CUB Benchmarks by @gevtushenko in #472
  • [skip ci] Add list of CCCL users to README by @jrhemstad in #474
  • constexpr all the things by @pb-dseifert in #476
  • Add Gonzalo/Allard to trustees by @jrhemstad in #482
  • Implement the sortable concept by @miscco in #471
  • [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
  • Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
  • Add ci scripts for windows by...

CCCL 2.2.0

07 Sep 19:09
36f379f

(Note that these release notes are not yet finalized. They do not reflect any PRs that were merged to Thrust/CUB/libcudacxx before the migration to the nvidia/cccl repo.)

What's Changed

Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0