Releases: NVIDIA/cccl
CCCL 2.7.0
What’s New
C++
Thrust / CUB
- Inclusive scan now supports initial value #1940
- Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
- New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms #1817
- New thrust::tabulate_output_iterator fancy iterator #2282
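As a rough illustration of what an inclusive scan with an initial value computes, here is a host-side reference in standard C++. It is only a sketch of the semantics (out[i] = init + in[0] + ... + in[i]); the helper names `inclusive_scan_with_init` and `inclusive_sum_with_init` are hypothetical and are not part of Thrust or CUB, and the device-wide API added in #1940 is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Host-side reference for an inclusive scan seeded with an initial value.
// Hypothetical helper for illustration only; not a CCCL function.
template <class T, class BinaryOp>
std::vector<T> inclusive_scan_with_init(const std::vector<T>& in, T init, BinaryOp op)
{
  std::vector<T> out;
  out.reserve(in.size());
  T running = init;
  for (const T& x : in)
  {
    running = op(running, x); // fold the next input into the running result
    out.push_back(running);
  }
  assert(out.size() == in.size()); // one output per input
  return out;
}

// Convenience wrapper for the common case of integer addition.
inline std::vector<int> inclusive_sum_with_init(const std::vector<int>& in, int init)
{
  return inclusive_scan_with_init(in, init, std::plus<int>{});
}
```

Scanning {1, 2, 3, 4} with an initial value of 10 yields {11, 13, 16, 20}, whereas a plain inclusive sum of the same input yields {1, 3, 6, 10}.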
Libcudacxx
- Enable assertions on host and device depending on the user's choice
- C++26 inplace_vector has been implemented and backported to C++14
- Improved support for extended floating point types __half and __nv_bfloat16 both for cmath functions and complex
- cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
- Reworked our atomics implementation
- Improved <cuda/std/bit> conformance
- Implemented <cuda/std/bitset> and backported to C++14
- Implemented and backported C++20 bit_cast. It is available in all standard modes and is constexpr with compiler support
- Various backports and constexpr improvements (bool_constant, cuda::std::max)
- Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
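To make the bit_cast item concrete, the following is a minimal memcpy-based sketch of how a bit_cast-like utility can be written for pre-C++20 standard modes. It is an assumption-laden illustration, not libcu++'s implementation: the names `bit_cast_sketch` and `float_bits` are hypothetical, the sketch is not constexpr, and the example assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reinterpret the object representation of `from` as a To.
// Not the libcu++ implementation, and not usable in constant expressions.
template <class To, class From>
To bit_cast_sketch(const From& from) noexcept
{
  static_assert(sizeof(To) == sizeof(From), "source and destination sizes must match");
  static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
  static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
  static_assert(std::is_default_constructible<To>::value, "this sketch default-constructs To");
  To to;
  std::memcpy(&to, &from, sizeof(To)); // copy the bytes; no type punning through pointers
  return to;
}

// Read the bit pattern of a float (assumes 32-bit IEEE-754 float).
inline std::uint32_t float_bits(float f) noexcept
{
  return bit_cast_sketch<std::uint32_t>(f);
}
```

Under the stated IEEE-754 assumption, `float_bits(1.0f)` is `0x3F800000`, and casting that pattern back with `bit_cast_sketch<float>` recovers `1.0f`.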
Python
cuda.cooperative
CCCL's best practices for making CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.
Block and warp-level cooperative algorithms are now available in Python #1973.
Experimental versions of reduce, scan, merge and radix sort are available in numba.cuda kernels.
cuda.parallel
Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.
What's Changed
- Fix documentation generation for thrust::pair by @bernhardmgruber in #1976
- Correct typo in a launch configuration header name by @pciolkosz in #1972
- Fix thrust::sort for large problem sizes by @gevtushenko in #1952
- Avoid SIGPIPE when truncating verbose output in CI scripts. by @alliepiper in #1971
- Clarify compiler support by @bernhardmgruber in #1970
- Experimental Python cooperative algorithms by @gevtushenko in #1973
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #1928
- Guard against an overflow in sort tests by @bernhardmgruber in #1980
- Remove obsolete Thrust function traits by @bernhardmgruber in #1962
- Python: Add version string & wheel build command by @leofang in #1985
- Add device inclusive scan with init_value by @gonidelis in #1845
- Fix BWUtil report on early exit by @gonidelis in #1994
- Use libcu++ void_t everywhere by @bernhardmgruber in #1977
- Drop zipped_binary_op by @bernhardmgruber in #1988
- Clarify PtxVersion and SmVersion by @bernhardmgruber in #2004
- More simplifications for CUB util_device by @bernhardmgruber in #1948
- fix some typos in <cuda/stream_ref> by @ericniebler in #2003
- Add CI slack notifications. by @alliepiper in #1961
- Allow nightly workflow to be manually invoked. by @alliepiper in #2007
- Need to use a different approach to reuse secrets in reusable workflows vs. actions. by @alliepiper in #2008
- Enable RAPIDS builds for manually dispatched workflows. by @alliepiper in #2009
- clean up complex.inl by @ZelboK in #1655
- Add github token to nightly workflow-results action. by @alliepiper in #2012
- Remove obsolete build system glue from the Thrust/CUB submodule structure. by @alliepiper in #2016
- Benchmark thrust::copy with non-trivially relocatable type by @bernhardmgruber in #1989
- Make bool_constant available in C++11 by @bernhardmgruber in #1997
- Spell value initialization where used in thrust vectors by @bernhardmgruber in #1990
- Do no redefine __ELF__ macro by @miscco in #2018
- Port thrust::merge[_by_key] to CUB by @bernhardmgruber in #1817
- Simplify some pointer traits by @bernhardmgruber in #2020
- Simplify test data setup by @bernhardmgruber in #2023
- Add tests to ensure that we properly propagate common_type for complex types by @miscco in #2025
- Update Thrust CMake README to use CCCL repo. by @alliepiper in #2026
- Include container toolkit in manual prereqs by @bryevdv in #2064
- Avoid ADL issues with thrust::distance by @miscco in #2053
- Simplify thrust::detail::wrapped_function by @bernhardmgruber in #2019
- Add a test for Thrust scan with non-commutative op by @bernhardmgruber in #2024
- Update memory_resource docs by @miscco in #1883
- Temporarily switch nightly H100 CI to build-only. by @alliepiper in #2060
- Do not rely on conversions between float and extended floating point types by @miscco in #2046
- experimental wrapper types for cudaEvent_t that provide a modern C++ interface. by @ericniebler in #2017
- [CUDAX] Add a dummy device struct for now by @pciolkosz in #2066
- Allow (somewhat) different input value types for merge by @bernhardmgruber in #2075
- Avoid ::result_type for partial sums in TBB reduce_by_key by @bernhardmgruber in #1998
- Fix formatting by @bernhardmgruber in #2090
- Rename and refactor transform_iterator_base by @bernhardmgruber in #1987
- Benchmark analysis: Print all top rows when asked for by @bernhardmgruber in #2089
- Makes user-provided functors in our examples use __device__ instead of CUB_RUNTIME_FUNCTION by @elstehle in #2088
- Separate cuda/experimental when sorting includes by @bernhardmgruber in #2094
- add support to cudax::device for querying a device's attributes by @ericniebler in #2084
- [CUDAX] Add experimental owning abstraction for cudaStream_t by @pciolkosz in #2093
- Do not query NVRTC for cuda runtime header by @miscco in #2102
- Cleanup CUB block/thread load and exchange by @bernhardmgruber in #1946
- Improve binary function objects and replace thrust implementation by @srinivasyadav18 in #1872
- Replace _LIBCUDACXX_CPO_ACCESSIBILITY with _CCCL_GLOBAL_VARIABLE by @miscco in #1881
- Add script to update RAPIDS version. by @bdice in #2082
- Update bad links by @bryevdv in #2080
- Fix line break issues that break doxygen code examples by @miscco in #2103
- Add internal wrapper for cuda driver APIs by @pciolkosz in #2070
- Use common_type for complex pow by @miscco in #1800
- [CUDAX] rename device to device_ref, add immovable device as a place to cache properties by @ericniebler in #2110
- Use the float flavors of the cmath functions in the extended floating point fallbacks by @miscco in #2106
- [PoC]: Implement cuda::experimental::uninitialized_buffer by @miscco in #1831
- Ensure that we avoid ABI Version conflics by @miscco in #2137
- Ensure that cuda_memory_resource allocates memory on the proper device by @miscco in #2073
- Clarify compatibility wrt. template specializations by @bernhardmgruber in #2138
- Implement a cudax::get_stream CPO by @miscco in #2135
- Make cuda::std::tuple trivially copyable by @miscco in #2127
- Fix missing copy of docs artifacts by @miscco in #2162
...
CCCL 2.6.1
CCCL 2.6.0
What's Changed
- Restrict active histogram channels to channel count by @bernhardmgruber in #1796
- Cleanup internal thrust CUDA utils by @bernhardmgruber in #1802
- Use variadic interfaces in agent launcher by @bernhardmgruber in #1804
- Use nullptr over NULL by @bernhardmgruber in #1805
- Rework the documentation to be build with sphinx by @miscco in #1753
- Let Catch2 report cudaError descriptions by @bernhardmgruber in #1808
- Check size-querying CUB API invocation in tests by @bernhardmgruber in #1809
- Update docs link by @gevtushenko in #1812
- Add missing inline specifiers by @bernhardmgruber in #1813
- Upgrade actions that use node16 to versions that use node20 by @trxcllnt in #1779
- Document NVTX range behavior during graph capture by @bernhardmgruber in #1814
- Clean up AliasTemporaries by @bernhardmgruber in #1815
- Drop removed clang-tidy option by @bernhardmgruber in #1810
- Exclude docs from cccl infra changes. by @alliepiper in #1821
- Clean up thrust merge unit tests by @bernhardmgruber in #1819
- Fix atomic performance regressions by avoiding use of memcpy with natively supported atomic types. by @wmaxey in #1801
- Clean up merge_by_key and merge_key_value tests by @bernhardmgruber in #1824
- Restore the old thrust api documentation in rst by @miscco in #1818
- Drop all internal implementations of exceptions by @miscco in #1806
- Fix span for non-ranges by @miscco in #1836
- Cleanup thrust test special types by @bernhardmgruber in #1837
- Add inclusive_scan with initial value support (warp/block) by @gonidelis in #1749
- Fix loading from incorrect URI on 404 page. by @wmaxey in #1843
- Port CUB temporary storage layout test to Catch2 by @bernhardmgruber in #1835
- Port CUB thread operators test to Catch2 by @bernhardmgruber in #1834
- Adds ceil_div by @gonzalobg in #1825
- Split workflow into multiple dispatch groups to avoid skipped jobs. by @alliepiper in #1797
- Fix broken CUB doc build and add 404 page to Sphinx. by @wmaxey in #1846
- Port CUB thread sort test to Catch2 by @bernhardmgruber in #1838
- Cleanup CUB temporary storage layout test by @bernhardmgruber in #1848
- Propogate error when docsbuild fails, add docs build to CI. by @alliepiper in #1852
- Cleanup CUB util_macro.cuh by @bernhardmgruber in #1849
- Provide libcu++ transparent functors in C++11 by @bernhardmgruber in #1851
- Roll back upload-pages-artifact to v2. by @alliepiper in #1861
- Port CUB iterator test to Catch2 by @bernhardmgruber in #1822
- Symbol visibility is now invariant in regards to __cuda_std__ definition by @robertmaynard in #1832
- Add dimensions description functionality to CUDA Experimental library by @pciolkosz in #1743
- Document Asynchronous Operations by @gonzalobg in #1781
- Remove cpp11_required.h by @bernhardmgruber in #1860
- Add workflow to build RAPIDS from source with local CCCL by @trxcllnt in #1667
- Refactor CI matrix. by @alliepiper in #1844
- Adds tests for large number of items in cub::DeviceScan by @elstehle in #1830
- Make CUB test launch wrappers functor instances by @bernhardmgruber in #1850
- Improve CUB test overview docs by @bernhardmgruber in #1867
- Skip devcontainer validation jobs if not needed. by @alliepiper in #1853
- Improve CUB device-scope documentation by @bernhardmgruber in #1862
- Make integer sequence et al. available in C++11 by @bernhardmgruber in #1859
- Minimize template instantiations in CUB thread_load by @bernhardmgruber in #1857
- Create major version 2.6.0 by @wmaxey in #1880
- Drop facilities deprecated in CUB 1.x by @bernhardmgruber in #1868
- Make thrust::sort use radix sort with more comparators by @bernhardmgruber in #1884
- Make cuda::ptx::*_multicast pass on all architectures by @ahendriksen in #1874
- Replace typedef by alias declarations in CUB by @bernhardmgruber in #1885
- Remove legacy benchmarks and other dvs/p4 remnants by @alliepiper in #1901
- Qualify call to distance in thrust::async_reduce by @bernhardmgruber in #1904
- Rename CUB uninitialized_copy by @bernhardmgruber in #1913
- Sanitizer fixes by @alliepiper in #1916
- Use c2h::vectors in all non-example CUB tests by @bernhardmgruber in #1914
- Renamed overlooked uninitialized_copy by @bernhardmgruber in #1920
- Add assert implementation for device side testing by @pciolkosz in #1918
- Thrust and CUB: README: Fix copy-paste from libcu++ and links by @pauleonix in #1878
- Follow-up fixes to CUB iterator test by @bernhardmgruber in #1875
- Replace typedef by alias declarations in Thrust by @bernhardmgruber in #1915
- Cleanup CUB util_type.cuh by @bernhardmgruber in #1863
- Fix include for in cub/util_type.cuh by @bernhardmgruber in #1929
- Fix issues with comments in the concept emulation by @miscco in #1931
- Deprecate and reduce use of old functional stuff by @bernhardmgruber in #1925
- Deprecate more nested aliases in thrust functors by @bernhardmgruber in #1932
- Fix various typos in CUB documentation and comments. by @brycelelbach in #1933
- Add BabelStream flavors as thrust::transform benchmarks by @bernhardmgruber in #1921
- Some cleanup in Thrust config headers by @bernhardmgruber in #1934
- Update to CUDA 12.5 containers by @jrhemstad in #1935
- Check that the current version of CMake supports policy 141 before se… by @alliepiper in #1924
- Fix memmove optimization by @miscco in #1937
- Fixes thrust::unique_by_key examples by @elstehle in #1943
- Use only explicit NVTX3 V1 API in CUB by @bernhardmgruber in #1751
- Suppress a clang warning on array size computation by @bernhardmgruber in #1942
- Add a benchmark for thrust::equal by @bernhardmgruber in #1944
- Strip prefix paths to improve doc rendering by @bdice in #1954
- Modernize Thrust's alignment.h and triple_chevron_launch by @bernhardmgruber in #1905
- Restore RAPIDS devcontainer by @bdice in #1955
- Fix for in-place DeviceSelect & thrust::remove_if by @elstehle in #1782
- Drop Thrust's cstdint.h by @bernhardmgruber in #1959
- Use make_devcontainers.sh --clean when validating. by @alliepiper in #1963
- Fix missing binary_pred in thrust::unique_by_key by @bernhardmgruber in #1957
- cuda::launch and launch configuration object with minimal functionality by @pciolkosz in #1950
- Backport PR #2046 - Fixing FP16 conversions. by @wmaxey in #2222
Full Changelog: v2.5.0...v2.6.0
CCCL 2.5.0
What's New
This release includes several notable improvements and new features:
- CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
- We have added the new cub::DeviceSelect::FlaggedIf API, which selects items by applying a predicate to a separate sequence of flags rather than to the items themselves. This gives you more flexibility and control over item selection.
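The selection rule described for FlaggedIf can be sketched on the host in standard C++: an item is kept when the predicate, applied to the flag at the same position, returns true. This is only a reference for the semantics under that reading; `select_flagged_if` and `is_even` are hypothetical names, and the actual device API (temporary storage, iterators, stream arguments) is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference: keep items[i] whenever pred(flags[i]) is true,
// preserving the original order. Hypothetical helper, not a CUB function.
template <class T, class Flag, class Pred>
std::vector<T> select_flagged_if(const std::vector<T>& items,
                                 const std::vector<Flag>& flags,
                                 Pred pred)
{
  assert(items.size() == flags.size()); // one flag per item
  std::vector<T> selected;
  for (std::size_t i = 0; i < items.size(); ++i)
  {
    if (pred(flags[i]))
    {
      selected.push_back(items[i]);
    }
  }
  return selected;
}

// Example predicate on flags: select items whose flag is even.
struct is_even
{
  bool operator()(int flag) const
  {
    return flag % 2 == 0;
  }
};
```

With items {'a', 'b', 'c', 'd'}, flags {1, 2, 3, 4}, and `is_even`, the sketch selects {'b', 'd'}.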
What's Changed
- Clean up libcu++ docs landing page by @jrhemstad in #1492
- PTX: Add cuda::ptx::elect_sync by @ahendriksen in #1537
- Print a summary of all tests sorted by execution time. by @alliepiper in #1539
- Fix unused variable warning for __can_use_complete_tx by @wmaxey in #1547
- Fix usage of naked array with 0 elements in sm90 barrier tests. by @wmaxey in #1546
- Add support for stream operators for complex by @miscco in #1538
- Fix __half for older architectures by @miscco in #1543
- Feat 565 remove redundant thrust dialect conditional by @ZelboK in #566
- fix missing device hint in WarpMergeSort Documentation by @MARD1NO in #1553
- Minor fixes and additions on cub developer guides by @gonidelis in #1559
- Consolidate handling of constexpr and if constexpr by @miscco in #1562
- Ensure that cuda::aligned_size_t is usable in a constexpr context by @miscco in #1564
- Group CUB docs by @gevtushenko in #1565
- Update toolkit to 12.4 by @miscco in #1554
- Work around change in cuTensorMapEncode by @miscco in #1567
- Remove stdlib arg from .clangd. by @alliepiper in #1569
- Add the DeviceSelect::FlaggedIf algorithm by @gonidelis in #1533
- Catch2 segmented sort by @alliepiper in #1484
- Do not emit diagnostic with extended device lambdas with preserved re… by @Revaj in #1495
- Use absolute includes for libcu++ by @miscco in #1560
- [NFC] Modularize <exception> by @miscco in #199
- Add test support for launching kernels with cluster size > 1 by @ahendriksen in #416
- Fix typo in README.md by @bprb in #1574
- [FEA]: Modularize <cuda/memory_resource> by @miscco in #1532
- Cleanup_complex by @miscco in #1555
- Add missing comma in barrier __try_wait by @miscco in #1593
- Segmented sort test fix by @alliepiper in #1591
- Add pre-commit configuration by @bdice in #1596
- Preserve .devcontainer/img/ when cleaning. by @alliepiper in #1604
- Add some documentation for recent additions to libcu++ by @miscco in #1594
- Ensure cuda::std::nullopt is visible in device code by @trxcllnt in #1598
- Fix ordering of alignas and __shared__ by @miscco in #1601
- Update Thrust CI tests. by @alliepiper in #1605
- Implement tuple interface for cuda vector types by @miscco in #1410
- Inspect PR changes to determine if subproject builds are needed. by @alliepiper in #1572
- Apply clang-format to cub by @bdice in #1602
- Add missing non-volatile atomic overloads. by @wmaxey in #1582
- Drop unused libcxx files by @miscco in #1606
- Apply formatting to libcudacxx by @miscco in #1610
- Add conda documentation to the README. by @bdice in #1581
- Allow jobs to be skipped. by @alliepiper in #1611
- Make libcu++ work with exceptions by @miscco in #1607
- Implement cuda::mr::cuda_memory_resource by @miscco in #1578
- Implement cuda::mr::managed_memory_resource by @miscco in #1579
- Apply formatting to thrust by @miscco in #1616
- Update example_device_radix_sort.cu by @eriktedhamre in #1608
- Implement cuda::mr::pinned_memory_resource by @miscco in #1580
- Set the devcontainers to format on save. by @miscco in #1624
- Enable internal use of std::allocator related functionality by @miscco in #1583
- Adds tests for large number of items for cub::DeviceSelect by @elstehle in #1612
- Add pre-commit docs to CONTRIBUTING.md. by @bdice in #1627
- Move visibility attributes to cccl by @miscco in #1595
- Work around thrust/memory.h circular include by @dkolsen-pgi in #1634
- Fix mbarrier.init addressing by @ahendriksen in #1636
- Trim trailing whitespace and normalize newlines. by @bdice in #1633
- Add a git-blame-ignore-revs file by @miscco in #1629
- Revert "PTX: Add cuda::ptx::elect_sync (#1537)" by @ahendriksen in #1638
- Address potential oob in cub when passing in an invalid device counter by @miscco in #1641
- Allow ninja_summary to fail by @jrhemstad in #1644
- Mostly flatten the folder structure of libcu++ by @miscco in #1630
- Make --cmake-options="" always override others. by @alliepiper in #1648
- Fix invalid _CCCL_CUDACC definition for clang cuda by @miscco in #1656
- Add missing #pragma once in some headers by @bernhardmgruber in #1668
- Add NVTX ranges for all CUB algorithms by @bernhardmgruber in #1657
- Implement LWG-3843 and LWG-3940 by @miscco in #1621
- Modularize <memory> by @miscco in #1639
- Expose <cuda/std/numeric> to be publicly available by @miscco in #1671
- Add nsight support for automated debugging by @gonidelis in #1660
- Format core headers by @miscco in #1670
- Guard resource_ref and friends behind feature flag by @miscco in #1675
- Create major version 2.5.0 by @wmaxey in #1677
- Install CUB headers with .hpp extension by @bernhardmgruber in #1687
- Update CMakePresets.json by @alliepiper in #1686
- Fix deprecated status by @gevtushenko in #1692
- Test combined internal/user-side use of NVTX by @bernhardmgruber in #1690
- CI Overhaul, new nightly workflow by @alliepiper in #1654
- Fix CMake option handling. by @alliepiper in #1698
- Fix issues that came up with building cuDF with main by @miscco in #1643
- Drop new properties until we are certain about the design by @miscco in #1681
- Remove more uses of __cuda_std__ by @miscco in #1669
- Fix usage of result_of in thrust by @miscco in #1705
- Fix thrust::optional<T&>::emplace() by @Snektron in #1707
- Remove old f(void) function signatures by @bernhardmgruber in #1708
- Fix code sample in README and docs by @pauleonix in #1652
- Format libcudacxx/include files without extensions by @bdice in #1676
- Several improvements to zip_iterator/zip_function by @bernhardmgruber in #1710
- Expose thrust's contiguous iterator unwrap helpers by @bernhardmgruber in #1717
- Fix flakey heterogeneous tests by @wmaxey in #1712
- Ensure that we can use cuda::std::optional with types that are not __host__ __device__ by @miscco in #1663
- Fix a typo in barrier docs and update the godbolt link by @PointKernel in #1718
- Massively improve test times in heterogeneous atomics tests by @wmaxey in #1719
- Consolidate more common functi...
v2.4.0
What’s New
We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
Thrust
As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) in these Thrust algorithms.
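The arithmetic behind "large problem sizes" can be sketched as follows: a 32-bit signed offset can count at most 2^31 - 1 items, so any larger input needs a 64-bit offset type. The helper `fits_in_offset` is hypothetical and only illustrates this bound; it says nothing about how Thrust or CUB choose offset types internally.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: can an item count of `num_items` be stored in OffsetT
// without overflow? Illustrative only; not part of Thrust or CUB.
template <class OffsetT>
constexpr bool fits_in_offset(std::uint64_t num_items)
{
  return num_items <= static_cast<std::uint64_t>(std::numeric_limits<OffsetT>::max());
}

// 2^31 - 1 items is the largest count a 32-bit signed offset can represent.
static_assert(fits_in_offset<std::int32_t>(2147483647ull), "2^31 - 1 fits in int32_t");
// 2^31 items overflows a 32-bit signed offset but fits in a 64-bit one.
static_assert(!fits_in_offset<std::int32_t>(1ull << 31), "2^31 does not fit in int32_t");
static_assert(fits_in_offset<std::int64_t>(1ull << 31), "2^31 fits in int64_t");
```

The same check applies to any count above 2^31 - 1, which is why 64-bit offsets are needed for such inputs.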
CUB
- cub::DeviceSelect::UniqueByKey now supports equality operator and large problem sizes.
- New cub::DeviceFor family of algorithms goes beyond conventional cub::DeviceFor::ForEach. cub::DeviceFor::ForEachCopy can provide you with additional performance benefits from vectorized memory accesses.
- Many CUB algorithms now support CUDA graph capture mode.
libcudacxx
- Added new cuda::ptx namespace with wrappers for inline-PTX instructions
- cuda::std::complex specializations for CUDA types bfloat and half.
What's Changed
- Implement remaining ranges iterator concepts and modernize array by @miscco in #627
- Fix C++11 support of recently added tests by @ahendriksen in #651
- Update CUDA newest to CTK 12.3 by @jrhemstad in #629
- Add cuda::ptx::* namespace by @ahendriksen in #574
- The test seems to pass just fine by @miscco in #654
- Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
- Reduce benchmarking time by @gevtushenko in #657
- Add CCCL_VERSION and script for updating version by @jrhemstad in #652
- Fixes compiler error for extended fp type data gen by @elstehle in #666
- fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in #664
- Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
- Rework our system header approach to be more error proof by @miscco in #661
- Project automation - fix sync action and draft setting step by @jarmak-nv in #625
- Fix fallback when checking git repo by @wmaxey in #1085
- Currently the verbose option does not work beacuse of a typo in the argument handling by @miscco in #1088
- Adds virtual shared memory helper and tests by @elstehle in #619
- Add cuda::ptx::st_async by @ahendriksen in #1078
- Add cuda::ptx::red_async by @ahendriksen in #1080
- Remove libcudacxx symlinks by @wmaxey in #1075
- Move PTX tests that missed the symlink PR by @wmaxey in #1098
- Fix truncation of constant value by @gevtushenko in #1097
- Add cuda::ptx:mbarrier_{try/test}_wait{_parity} by @ahendriksen in #674
- Initial CUB/NVRTC support by @gevtushenko in #1081
- Fix cuda::ptx::red.async for int32_t types by @ahendriksen in #1102
- Fix local test runs with lit by @miscco in #1108
- Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
- Do not replace the sccache binary for windows by @miscco in #1115
- Test cuda graph capture by @gevtushenko in #1112
- Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
- Introduce CUB transform reduce by @gevtushenko in #1091
- Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
- Fix GCC6 / FP8 warning by @gevtushenko in #1130
- Fix thrust transform reduce bench by @gevtushenko in #1133
- Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in #1132
- Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in #1114
- Allow filtering P0 benchmarks by @gevtushenko in #1135
- Update barrier_arrive_tx.md docs by @gonzalobg in #1147
- Update std iterators by @miscco in #672
- Fix argument name in windows CI by @miscco in #1145
- Fix XFAIL condition for subsumption tests by @miscco in #1144
- Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
- Use rst in block-scope docs by @gevtushenko in #1150
- Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
- Fix icc / cub by @gevtushenko in #1152
- Abort testing on unsupported dialect flags by @wmaxey in #1158
- Run with latest nvbench by @robertmaynard in #583
- Set finer-grain workflow permissions by @jrhemstad in #1163
- Port device docs to rst by @gevtushenko in #1160
- CI log improvements by @jrhemstad in #621
- Setup documentation and corresponding github action by @wmaxey in #1118
- Update Docs links in README.md by @wmaxey in #1169
- Fix GCC 13 by @gevtushenko in #1175
- Add missing exit from run-as-coder by @jrhemstad in #1176
- Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
- Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
- Implement math functions for thrust::complex by @miscco in #1178
- Use anchors in matrix.yaml by @jrhemstad in #1193
- Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
- Fix availability of is_constant_evaluated on old MSVC by @miscco in #1180
- Enable std::variant for libcu++ by @miscco in #1076
- Implement enable_borrowed_range by @miscco in #1196
- Reduce thrust benchmarks noise by @gevtushenko in #1203
- Prepare more algorithms by @miscco in #1161
- Add icc compiler to CI matrix by @jrhemstad in #1159
- Unify handling of dialects by @miscco in #1200
- Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
- Move definitions of execution space macros into cccl by @miscco in #1199
- Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in #1197
- Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
- Fix the example on README.md by @so298 in #1220
- Add missing overloads for thrust::pow by @miscco in #1222
- Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
- Fix examples in reduce docs by @gevtushenko in #1230
- Do not benchmark small problem sizes by @gevtushenko in #1243
- Implement enable_view by @miscco in #1208
- Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in #1245
- Fix merge conflict from incoming PR by @miscco in #1250
- Disable fast-math for ICC by @miscco in #1252
- Fix a typo in thrust-config.cmake by @valgur in #1259
- Implement ranges::{c}begin and ranges::{c}end by @miscco in #1256
- Switch to entropy-based stopping criterion by @gevtushenko in #1280
- Fix a sync bug in stream_ref::wait by @PointKernel in #1238
- Silence some static asserts in ptx helpers by @miscco in #1257
- Restore docs images...
v2.3.2
What's Changed
- [BACKPORT]: Silence some static asserts in ptx helpers (#1257) by @miscco in #1284
- [BACKPORT]: Ensure that pair is trivially copyable (#1249) by @miscco in #1292
- [BACKPORT]: Properly test internal headers (#1258) by @miscco in #1299
- [Backport]: Fix errors when find_package(CCCL) is called twice. (#1157) by @miscco in #1298
- [BACKPORT] Fix MSVC issues (#1261) by @miscco in #1297
- [backport] thrust/mr: fix the case of reuising a block for a smaller alloc. (#1232) by @griwes in #1317
- [BACKPORT]: Fix ptx usage to account for PTX ISA availability (#1359) by @miscco in #1421
- Create patch 2.3.2 by @wmaxey in #1530
Full Changelog: v2.3.1...v2.3.2
v2.3.1
CCCL 2.3.0
What’s New
In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
System Headers and Warnings
Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because it will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.
To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.
TL;DR: You should never see warnings emitted from a CCCL header ever again!
Linkage Issues
Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
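The idea behind the older namespace-wrapping workaround can be sketched in plain C++: if each consumer wraps the library in its own outer namespace, the two copies end up with distinct symbol names and can no longer collide. Everything in this sketch is hypothetical (the `DEFINE_WRAPPED_LIB` macro, the `mylib` namespace, and `build_tag`); it mimics the spirit of THRUST_CUB_WRAPPED_NAMESPACE in a single file and is not how Thrust or CUB are actually configured.

```cpp
#include <string>

// Hypothetical stand-in for a header-only library whose namespace is wrapped
// in a consumer-chosen outer namespace. Each expansion produces a separate
// set of symbols, e.g. lib_a::mylib::build_tag and lib_b::mylib::build_tag.
#define DEFINE_WRAPPED_LIB(WRAPPER, TAG)                          \
  namespace WRAPPER                                               \
  {                                                               \
  namespace mylib                                                 \
  {                                                               \
  inline std::string build_tag()                                  \
  {                                                               \
    return TAG;                                                   \
  }                                                               \
  }                                                               \
  }

// Two consumers (think: two shared libraries) each get their own copy.
DEFINE_WRAPPED_LIB(lib_a, "built-for-lib_a")
DEFINE_WRAPPED_LIB(lib_b, "built-for-lib_b")
```

Because `lib_a::mylib::build_tag` and `lib_b::mylib::build_tag` are different symbols, each consumer is guaranteed to call the copy it was built with, which is the property the wrapped-namespace macro was used to obtain.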
Thrust
thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.
CUB
Up to 60% performance improvements of cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
libcudacxx
- The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to various inline PTX functions that enumerate various async memcpy and barrier intrinsics (#379).
- Added experimental bulk TMA memcpy under <cuda/barrier>
What's Changed
- Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
- Branch/2.2.x by @gevtushenko in #305
- Tune unique by key on A100 by @gevtushenko in #306
- Merge branch/2.2.x to main by @jrhemstad in #308
- Add example cmake project by @jrhemstad in #177
- Adds catch2 tests for reduce-by-key by @elstehle in #311
- Tune scan by key on A100 by @gevtushenko in #325
- Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
- Fix MSVC / CUB tests build by @gevtushenko in #336
- gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
- Add a nvrtc configuration for libcu++ by @miscco in #202
- GH Infra: project automation and issue template fixes by @jarmak-nv in #297
- Tune reduce by key on A100 by @gevtushenko in #346
- Merge commits from 2.2 branch by @miscco in #350
- Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
- Assorted fixes for MSVC 2017 by @miscco in #341
- [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
- Port cub::DeviceScan tests to catch2 by @elstehle in #347
- Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
- Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
- Work around strange gcc bug by @miscco in #363
- Implement iter_swap CPO by @miscco in #332
- Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
- Add clang16 devcontainer and CI job by @miscco in #362
- [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
- [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
- Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
- Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
- Use copy-pr-bot by @ajschmidt8 in #381
- Implement the permutable concept by @miscco in #367
- [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
- Implement identity changes for c++20 by @miscco in #383
- Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
- Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
- Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
- Add commit information to the test output by @miscco in #401
- Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
- Project Automation: set
Roadmap
project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389 - Add support for tests that should fail at runtime by @ahendriksen in #418
- Port
DeviceAdjacentDifference::SubtractRight
tests to catch2 by @miscco in #390 - Project automation - Fix indentation for
continue-on-error
by @jarmak-nv in #425 - [BUG] Ensure that all headers build on their own by @miscco in #200
- Remove
util_device.cuh
from iterator headers to enable online compilation by @leofang in #412 - Fix ci-overview example by @gevtushenko in #428
- Port `cub::DeviceRunLengthEncode` tests to catch2 by @miscco in #411
- Add cuda::device::barrier_arrive tx by @ahendriksen in #358
- Fix CubDebug by @gevtushenko in #430
- Do not use static member functions to initialize static member variables. by @miscco in #438
- Implement the `projected` helper struct by @miscco in #385
- Add PTX wrapping functions for TMA features by @ahendriksen in #379
- Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
- Enable lit to determine the compute architectures by @miscco in #447
- Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
- Improve documentation of `cuda::barrier` by @ahendriksen in #440
- Extend `thrust::complex` unit tests to prepare for upcoming replacement with `std::complex` by @Blonck in #413
- Remove having two install rules for -header-search.cmake by @robertmaynard in #298
- Run `.devcontainer/launch.sh` with bash + add error checking by @wence- in #407
- Remove C++03 compatibility from unit tests by @Blonck in #378
- [libcu++] Fix use of `__ppc64__` by @miscco in #451
- Update the README by @jrhemstad in #291
- [libcu++] Try to avoid gcc miscompilation issues by @miscco in #452
- Consolidate matrix logic into single script/job by @jrhemstad in #361
- Implement the `indirectly_comparable` concept by @miscco in #445
- Fix compute matrix dropping trailing zeros by @jrhemstad in #466
- Avoid integer promotion warnings with MSVC by @miscco in #460
- Implement ranges comparison objects by @miscco in #464
- Fix CUB/MSVC/RDC tests by @gevtushenko in #469
- Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
- Script for Running CUB Benchmarks by @gevtushenko in #472
- [skip ci] Add list of CCCL users to README by @jrhemstad in #474
- `constexpr` all the things by @pb-dseifert in #476
- Add Gonzalo/Allard to trustees by @jrhemstad in #482
- Implement the `sortable` concept by @miscco in #471
- [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
- Refactor `thrust::complex` as a struct derived from `cuda::std::complex` by @Blonck in #454
- Add ci scripts for windows by...
CCCL 2.2.0
(Note that these release notes are not yet finalized. They do not reflect any PRs that were merged to Thrust/CUB/libcudacxx before migrating to the nvidia/cccl repo).
What's Changed
- Add axis for docker builds by @raydouglass in #1
- Docker: Add support for ICPC and NVC++, install newer CMake, and add curl by @brycelelbach in #4
- Update excludes by @raydouglass in #5
- Docker: OS and CUDA upgrades, support for additional configurations by @brycelelbach in #9
- Docker: Add Thrust/CUB documentation toolchain to Ubuntu docker images by @brycelelbach in #15
- Re-enable CentOS images. by @allisonvacanti in #16
- Add sccache to dockerfile by @msadang in #17
- Update base containers. by @allisonvacanti in #18
- Update `sccache` version by @ajschmidt8 in #19
- Build `11.5.1` containers by @ajschmidt8 in #20
- Add ops-bot.yaml by @jrhemstad in #80
- Monorepo workflow by @jrhemstad in #99
- Add devcontainers by @jrhemstad in #105
- Update the libcu++ submodule by @miscco in #109
- Update libcudacxx again by @miscco in #110
- Remove submodules from CI workflow by @jrhemstad in #115
- Fix CUB CI by @senior-zero in #114
- Fix async scan / counting iterator tests by @senior-zero in #118
- Make sccache work locally by @jrhemstad in #113
- Fix compilation of thrust and cub by @miscco in #120
- Fix segfault in cub::CachingDeviceAllocator by @senior-zero in #119
- Initial GH Infra Setup by @jarmak-nv in #23
- Visualize variant space coverage by @senior-zero in #125
- Fix broken issue templates by @jarmak-nv in #124
- Tune scan by key for SM90 by @senior-zero in #121
- Update PR template to more explicitly prompt for a linked issue closed by the PR by @jrhemstad in #134
- Change component section to more general "area" by @jrhemstad in #132
- Try and fix CI for old CTK by @miscco in #116
- Fix `tuple_cat` for `std::` qualified types by @miscco in #144
- Add ccache to lit invocation by @miscco in #147
- Benchmark batched memcpy by @senior-zero in #136
- Properly query `CMAKE_CUDA_COMPILER_LAUNCHER` for ccache support by @miscco in #152
- Implement Three-Way Partition Tuning / Benchmark by @senior-zero in #155
- Port three-way partition to use Catch2 by @senior-zero in #156
- Add gcc-6 to the test matrix by @miscco in #160
- Tune reduce / unique by key for SM90 by @senior-zero in #163
- Remove unused folders by @miscco in #145
- Fix documentation of `atomic_ref` by @miscco in #164
- New iterator traits by @miscco in #158
- Improve implementation of `destructible` by @miscco in #157
- Build script improvements by @jrhemstad in #149
- Fix icpc / denormals by @senior-zero in #185
- Enable tests by @jrhemstad in #167
- Monorepo by @jrhemstad in #194
- Multi-benchmark tuning by @senior-zero in #208
- Fixes universal_vector test failure on CTK 11.1 & gcc-6 by @elstehle in #209
- Delete several directories for older CI infra. by @wmaxey in #218
- Memory-safe radix sort test by @senior-zero in #222
- [FEA] Implement `iter_move` CPO by @miscco in #197
- Build cub benchmarks in build_cub.sh by @jrhemstad in #216
- [skip-tests] Do not run tests when `skip-tests` is part of the latest commit message by @miscco in #224
- Factor out build job logic into a "run-as-coder" reusable workflow. by @jrhemstad in #205
- Fix instances of 'scan' copy-pasted into reduction documentation by @milesvant in #221
- Add clangd to devcontainer by @senior-zero in #225
- Add initial CODEOWNERS file by @jrhemstad in #226
- Attempt to fix codeowners by @jrhemstad in #231
- Make libcudacxx respect CMake options for CUDA archs. by @wmaxey in #235
- Optimize Three-Way Partition by @senior-zero in #228
- [BUG] Rework how we handle feature test macros by @miscco in #195
- Enable use of `cudaMemcpyAsync` for `thrust::copy` by @miscco in #211
- Enable additional arguments in build_common.sh by @wmaxey in #236
- [BUG] Properly uglify all qualifiers in product headers by @miscco in #201
- Port `cub::Device{Select, Partition}` tests to catch2 by @miscco in #229
- Fix CUB tests / MSVC 2022 by @senior-zero in #255
- Ensure that any CMake re-rooting doesn't break our find_file by @miscco in #257
- [BUG] Fix compilation issues with MSVC 2017 by @miscco in #196
- Implement iterator concepts by @miscco in #223
- Tune Histogram on H100 by @senior-zero in #266
- Add WarpExchangeAlgorithm customization for WarpExchange class by @pb-dseifert in #256
- [BUG]: Avoid deprecation warning for `std::aligned_storage` when building with c++23 by @miscco in #258
- Port cub::DeviceReduce tests to catch2 by @elstehle in #267
- Add support for nvcc-specific matrix. by @jrhemstad in #243
- Fix anchor link to cooperative groups in CUDA programming guide by @wence- in #274
- Fix BibTeX syntax in CITATION.md [skip-tests] by @wence- in #276
- Enforce C++17 for benches by @senior-zero in #275
- Project Automation: Move PR and Linked Issues to In Progress by @jarmak-nv in #170
- Update to 23.08 devcontainers and CUDA 12.2 by @jrhemstad in #270
- [skip-tests] CTK 12.2 tuning image by @senior-zero in #282
- Fix single-thread block reduction by @senior-zero in #287
- Tune Select and Partition on A100 by @senior-zero in #289
- Fix CUB tests / MSVC by @senior-zero in #292
- Allow building CUB tests without cuRand by @senior-zero in #250
- Fixup to CUB build - s/curand/cudart/ by @wmaxey in #301
- Fix OOB in `cub::DeviceRunLengthEncode::NonTrivialRuns` by @senior-zero in #294
- Tune RLE on A100 by @senior-zero in #295
- Tune scan on A100 by @senior-zero in #302
- Add new CCCL:: CMake targets by @allisonvacanti in #244
- Fix `cudacc` and `nvcc` mixup. by @wmaxey in #329
- [skip-tests] Use builtin for `destructible` concept on MSVC by @miscco in #333
- Fix merge conflict from two inflight PRs by @miscco in #338
New Contributors
- @raydouglass made their first contribution in #1
- @brycelelbach made their first contribution in #4
- @msadang made their first contribution in #17
- @wmaxey made their first contribution in #218
- @milesvant made their first contribution in #221
- @pb-dseifert made their first contribution in #256
- @wence- made their first contribution in #274
Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0