CHANGELOG

Full Changelog

Features:

  • Introduce new SequentialHostInit view allocation property #7229

Backend and Architecture Enhancements:

CUDA:

  • Experimental support for unified memory mode (intended for Grace-Hopper etc.) #6823

Bug Fixes

  • OpenMP: Fix issue related to the visibility of an internal symbol with shared libraries that affected ScatterView in particular #7284
  • Fix implicit copy assignment operators being deleted in a few AVX2 masks #7296

Full Changelog

Features:

Backend and Architecture Enhancements:

CUDA:

  • nvcc_wrapper: Adding ability to process --disable-warnings flag #6936
  • Use recommended/max team size functions in Cuda ParallelFor and Reduce constructors #6891
  • Improve compile-times when building with Kokkos_ENABLE_DEBUG_BOUNDS_CHECK in Cuda #7013

HIP:

  • Use HIP builtin atomics #6882 #7000
  • Enable user-specified compiler and linker flags for AMD GPUs #7127

SYCL:

  • Add support for Graphs #6912
  • Fix multi-GPU support #6887
  • Improve performance of reduction and scan operations #6562, #6750
  • Fix lock for guarding scratch space in TeamPolicy parallel_reduce #6988
  • Include submission command queue property information into SYCL::print_configuration() #7004

OpenACC:

  • Make TeamPolicy parallel_for execute on the correct async queue #7012

OpenMPTarget:

  • Honor user requested loop ordering in MDRange policy #6925
  • Prevent data races by guarding the scratch space used in parallel_scan #6998

HPX:

  • Work around issue with template argument deduction to support compilation with NVCC #7015

General Enhancements

  • Improve performance of view copies in host parallel regions #6730
  • Harmonize convertibility rules of Kokkos::RandomAccessIterator with Views #6929
  • Add a precondition check for non-overlapping ranges to the adjacent_difference algorithm in debug mode #6922
  • Add deduction guides for TeamPolicy #7030
  • SIMD: Allow flexible vector width for 32 bit types #6802
  • Updates for Kokkos::Array: add kokkos_swap(Array<T, N>) specialization #6943, add Kokkos::to_array #6375, make Kokkos::Array equality-comparable #7148
  • Structured binding support for Kokkos::complex #7040
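
For readers unfamiliar with how a library type opts into structured bindings, the following is a minimal sketch of the C++17 tuple protocol. `Complex` here is a hypothetical stand-in written for illustration; it is not `Kokkos::complex`, and the actual change in #7040 may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for a complex number type (NOT Kokkos::complex).
template <class T>
class Complex {
  T re_{};
  T im_{};

 public:
  constexpr Complex(T re, T im) : re_(re), im_(im) {}
  constexpr T real() const { return re_; }
  constexpr T imag() const { return im_; }

  // get<0>() yields the real part, get<1>() the imaginary part.
  template <std::size_t I>
  constexpr T get() const {
    static_assert(I < 2, "a complex number has exactly two components");
    if constexpr (I == 0) {
      return re_;
    } else {
      return im_;
    }
  }
};

// Specializing tuple_size and tuple_element is what makes
// `auto [re, im] = z;` well-formed for a class with private members.
namespace std {
template <class T>
struct tuple_size<Complex<T>> : integral_constant<size_t, 2> {};

template <size_t I, class T>
struct tuple_element<I, Complex<T>> {
  using type = T;
};
}  // namespace std

// Decomposes a value into its two parts and adds them.
inline double sum_of_parts(const Complex<double>& z) {
  auto [re, im] = z;
  return re + im;
}
```

With this in place, `sum_of_parts(Complex<double>(1.5, 2.5))` decomposes the value into its real and imaginary parts before adding them.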

Build System Changes

  • Do not require OpenMP support for languages other than CXX #6965
  • Update Intel GPU architectures in Makefile #6895
  • Fix use of OpenMP with Cuda or HIP as compile language #6972
  • Define and enforce new minimum compiler versions for C++20 support #7128, #7123
  • Add NVIDIA Grace CPU architecture: Kokkos_ARCH_ARMV9_GRACE #7158
  • Fix Makefile.kokkos for Threads #6896
  • Remove support for NVHPC as CUDA device compiler #6987
  • Fix using CUDAToolkit for CMake 3.28.4 and higher #7062

Incompatibilities (i.e. breaking changes)

  • Drop Kokkos::Array special treatment in Views #6906
  • Drop Experimental::RawMemoryAllocationFailure #7145

Deprecations

  • Remove Experimental::LayoutTiled class template and deprecate is_layouttiled trait #6907
  • Deprecate Kokkos::layout_iterate_type_selector #7076
  • Deprecate specialization of Kokkos::pair for a single element #6947
  • Deprecate deep_copy of UnorderedMap of different size #6812
  • Deprecate trailing Proxy template argument of Kokkos::Array #6934
  • Deprecate implicit conversions of integers to ChunkSize #7151
  • Deprecate implicit conversions to execution spaces #7156
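
As a rough illustration of what deprecating an implicit integer conversion means for call sites, the sketch below uses hypothetical stand-in types (`ChunkSizeSketch`, `RangeSketch`), not the Kokkos classes. It shows the end state in which the conversion is explicit; during a deprecation period the implicit form would typically still compile with a warning.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a chunk-size wrapper (NOT Kokkos::ChunkSize).
struct ChunkSizeSketch {
  int value;
  // `explicit` forbids silently turning a bare integer into a chunk size.
  explicit constexpr ChunkSizeSketch(int v) : value(v) {}
};

// Hypothetical stand-in for a policy that accepts a chunk size.
struct RangeSketch {
  long begin;
  long end;
  int chunk;
  RangeSketch(long b, long e, ChunkSizeSketch c)
      : begin(b), end(e), chunk(c.value) {}
};

// A bare int no longer converts implicitly, but explicit construction works.
static_assert(!std::is_convertible_v<int, ChunkSizeSketch>);
static_assert(std::is_constructible_v<ChunkSizeSketch, int>);
```

A call such as `RangeSketch r(0, 100, ChunkSizeSketch(8));` states the intent at the call site, whereas `RangeSketch r(0, 100, 8);` would be rejected by this stand-in.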

Bug Fixes

  • Do not return a copy of the input functor in Experimental::for_each #6910
  • Fix realloc on views of non-default constructible element types #6993
  • Fix undefined behavior in View initialization or fill with zeros #7014
  • Fix sort_by_key on host execution spaces when building with NVCC #7059
  • Fix using shared libraries and -fvisibility=hidden #7065
  • Fix view reference counting when functor copy constructor throws in parallel dispatch #6289
  • Fix initialize(InitializationSetting) for handling print_configuration setting #7098
  • Thread safety fixes for the Serial and OpenMP backend #7080, #6151

Full Changelog

Backend and Architecture Enhancements:

HIP:

  • Support unified memory on MI300 #6877

Bug Fixes

  • Serial: Use the provided execution space instance in TeamPolicy #6951
  • nvcc_wrapper: bring back support for --fmad option #6931
  • Fix CUDA reduction overflow for RangePolicy #6578

4.3.00 (2024-03-19)

Full Changelog

Features:

  • Add Experimental::sort_by_key(exec, keys, values) algorithm #6801

Backend and Architecture Enhancements:

CUDA:

  • Experimental multi-GPU support (from the same process) #6782
  • Link against CUDA libraries even with KOKKOS_ENABLE_COMPILE_AS_CMAKE_LANGUAGE #6701
  • Don't use the compiler launcher script if the CMake compile language is CUDA. #6704
  • nvcc(wrapper): adding "long" and "short" versions for all flags #6615

HIP:

  • Fix compilation when using amdclang (with ROCm >= 5.7) and RDC #6857
  • Use rocthrust for sorting, when available #6793

SYCL:

  • Only the oneAPI SYCL implementation is supported: add check during initialization
    • Error out on initialization if the backend is different from ext_oneapi_* #6784
    • Filter GPU devices for ext_oneapi_* GPU devices #6758
  • Performance Improvements
    • Avoid unnecessary zero-memset of the scratch flags in SYCL #6739
    • Use host-pinned memory to copy reduction/scan result #6500
  • Address deprecations after oneAPI 2023.2.0 #6577
  • Make sure to call find_dependency for oneDPL if necessary #6870

OpenMPTarget:

  • Use LLVM extensions for dynamic shared memory #6380
  • Guard scratch memory usage in ParallelReduce #6585
  • Update linker flags for Intel GPUs #6735
  • Improve handling of printf on Intel GPUs #6652

OpenACC:

  • Add atomics support #6446
  • Make the OpenACC backend asynchronous #6772

Threads:

  • Add missing broadcast to TeamThreadRange parallel_scan #6601

OpenMP:

  • Improve performance of view initializations and filling with zeros #6573

General Enhancements

  • Improve performance of random number generation when using a normal distribution on GPUs #6556
  • Allocate temporary view with the user-provided execution space instance and do not initialize in unique algorithm #6598
  • Add deduction guide for Kokkos::Array #6373
  • Provide new public headers <Kokkos_Clamp.hpp> and <Kokkos_MinMax.hpp> #6687
  • Fix/improvement to remove_if parallel algorithm: use the provided execution space instance for temporary allocations, drop unnecessary initialization, and avoid evaluating the predicate twice during the final pass #6747
  • Add runtime function to query the number of devices and make device ID consistent with KOKKOS_VISIBLE_DEVICES #6713
  • simd: support vector_aligned_tag #6243
  • Avoid unnecessary allocation when default constructing Bitset #6524
  • Fix constness for views in std algorithms #6813
  • Improve error message on unsafe implicit conversion in MDRangePolicy #6855
  • CTAD (deduction guides) for RangePolicy #6850
  • CTAD (deduction guides) for MDRangePolicy #5516
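
The two CTAD entries above concern class template argument deduction. As a minimal sketch of the mechanism only, the following uses hypothetical stand-ins (`HostExecSketch`, `DeviceExecSketch`, `RangePolicySketch`); the real policies take more template arguments and their guides may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical execution space tags (NOT Kokkos types).
struct HostExecSketch {};
struct DeviceExecSketch {};

// Hypothetical stand-in for a range policy templated on an execution space.
template <class Exec = HostExecSketch>
struct RangePolicySketch {
  std::int64_t begin;
  std::int64_t end;
  RangePolicySketch(std::int64_t b, std::int64_t e) : begin(b), end(e) {}
  RangePolicySketch(const Exec&, std::int64_t b, std::int64_t e)
      : begin(b), end(e) {}
};

// Deduction guides: with no execution space argument, fall back to the
// default; with one, deduce the template argument from it.
RangePolicySketch(std::int64_t, std::int64_t) -> RangePolicySketch<>;

template <class Exec>
RangePolicySketch(const Exec&, std::int64_t, std::int64_t)
    -> RangePolicySketch<Exec>;
```

With such guides, `RangePolicySketch p(0, 10);` and `RangePolicySketch q(DeviceExecSketch{}, 0, 10);` both compile without spelling out template arguments.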

Build System Changes

  • Require Kokkos_ENABLE_ATOMICS_BYPASS option to bypass atomic operation for Serial backend only builds #6692
  • Add support for RISC-V and the Milk-V Pioneer #6773
  • Add C++26 standard to CMake setup #6733
  • Fix Makefile when using gnu_generate_makefile.sh and make >= 4.3 #6606
  • Cuda: Fix configuring with CMake >= 3.28.4 - temporary fallback to internal CudaToolkit.cmake #6898

Incompatibilities (i.e. breaking changes)

  • Remove the DEPRECATED_CODE_3 option and all code that was guarded by it #6523
  • Drop guards to accommodate external code defining KOKKOS_ASSERT #6665
  • Profiling::ProfilingSection(std::string) constructor marked explicit and nodiscard #6690
  • Add bound check preconditions for RangePolicy and MDRangePolicy #6617 #6726
  • Add checks for unsafe implicit conversions in RangePolicy #6754
  • Remove Kokkos::[b]half_t volatile overloads #6579
  • Remove KOKKOS_IMPL_DO_NOT_USE_PRINTF #6593
  • Check matching static extents in View constructor #5190
  • Tools(profiling): fix typo Kokkos_Tools_Optim[i]zationGoal #6642
  • Remove variadic range policy constructor (disallow passing multiple trailing chunk size arguments) #6845
  • Improve message on view out of bounds access and always abort #6861
  • Drop KOKKOS_ENABLE_INTEL_MM_ALLOC macro #6797
  • Remove Kokkos::Experimental::LogicalMemorySpace (without going through deprecation) #6557
  • Remove Experimental::HBWSpace and support for linking against memkind #6791
  • Drop librt TPL and associated KOKKOS_ENABLE_LIBRT macro #6798
  • Drop support for old CPU architectures (ARCH_BGQ, ARCH_POWER7, ARCH_WSM and associated ARCH_SSE4 macro) #6806
  • Drop support for deprecated command-line arguments and environment variables #6744
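
Several entries above add precondition checks on range bounds and on index conversions. The sketch below shows one plausible shape for such checks. All names (`checked_index_cast`, `make_range`, `RangeSketch`) are hypothetical, and exceptions are used only so the behavior is easy to observe; this is not the Kokkos implementation, which may report errors differently.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// True if v is negative; always false for unsigned types.
template <class T>
constexpr bool is_negative(T v) {
  if constexpr (std::is_signed_v<T>) {
    return v < 0;
  } else {
    (void)v;
    return false;
  }
}

// Converts an integer to another integer type and rejects conversions
// that change the value (narrowing or sign change).
template <class To, class From>
To checked_index_cast(From v) {
  static_assert(std::is_integral_v<To> && std::is_integral_v<From>);
  const To converted = static_cast<To>(v);
  const bool round_trips = static_cast<From>(converted) == v;
  const bool same_sign = is_negative(v) == is_negative(converted);
  if (!round_trips || !same_sign) {
    throw std::out_of_range("index conversion changes the value");
  }
  return converted;
}

// Hypothetical half-open range [begin, end).
struct RangeSketch {
  std::int64_t begin;
  std::int64_t end;
};

// Rejects ranges whose lower bound exceeds the upper bound.
inline RangeSketch make_range(std::int64_t b, std::int64_t e) {
  if (b > e) throw std::invalid_argument("range begin exceeds end");
  return RangeSketch{b, e};
}
```

In this sketch, converting `-1` to an unsigned index type or building a range with `begin > end` is reported at construction time rather than surfacing later as an out-of-bounds access.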

Deprecations

  • Provide kokkos_swap as part of Core and deprecate Experimental::swap in Algorithms #6697
  • Deprecate {Cuda,HIP}::detect_device_count() and Cuda::[detect_]device_arch() #6710
  • Deprecate ExecutionSpace::in_parallel() #6582

Bug Fixes

  • Fix team-level MDRange reductions: #6511
  • Fix CUDA and SYCL small value type (16-bit) team reductions #5334
  • Enable {transform_}exclusive_scan in place #6667
  • fill_random overloads that do not take an execution space instance argument should fence #6658
  • HIP, Cuda, OpenMPTarget: Use the provided execution space when copying a host-inaccessible reduction result #6777
  • Fix typo in cuda_func_set_attribute[s]_wrapper preventing proper setting of desired occupancy #6786
  • Avoid undefined behavior due to conversion between signed and unsigned integers in shift_{right, left}_team_impl #6821
  • Fix a bug in Makefile.kokkos when using AMD GPU architectures as AMD_GFXYYY #6892

4.2.01 (2023-12-07)

Full Changelog

Backend and Architecture Enhancements:

CUDA:

  • Add warp sync for parallel_reduce to avoid race condition #6630, #6746

HIP:

  • Fix Graph "multiple definition of" linking error (missing inline specifier) #6624
  • Add support for gfx940 (AMD Instinct MI300 GPU) #6671

Build System

  • CMake: Don't let Kokkos set CMAKE_CXX_FLAGS for Trilinos builds #6742

Bug Fixes

  • Remove deprecation warning for AllocationMechanism for GCC <11.0 #6653
  • Fix early tools finalize bug with non-default host execution instances #6635
  • Fix various issues for MSVC CUDA builds #6659
  • Fix "extra ;" warning with -pedantic flag in <Kokkos_SIMD_Scalar.hpp> #6510

4.2.00 (2023-11-06)

Full Changelog

Features:

  • SIMD: significant improvements to SIMD support and alignment with C++26 SIMD
    • add Kokkos::abs overload for SIMD types #6069
    • add generator constructors #6347
    • convert binary operators to hidden friends #6320
    • add shift operators #6109
    • add float support #6177
    • add remaining gather_from and scatter_to overloads #6220
    • define simd math function overloads in the Kokkos namespace #6465, #6487
    • Kokkos_ENABLE_NATIVE=ON autodetects SIMD types supported #6188
    • fix AVX2 SIMD support for ZEN2 AMD CPU #6238
  • Kokkos::printf #6083
  • Kokkos::sort: support custom comparator #6253
  • half_t and bhalf_t numeric traits #5778
  • half_t and bhalf_t mixed comparisons #6407
  • half_t and bhalf_t mathematical functions #6124
  • TeamThreadRange parallel_scan with return value #6090, #6301, #6302, #6303, #6307
  • ThreadVectorRange parallel_scan with return value #6235, #6242, #6308, #6305, #6292
  • Add team-level std algorithms #6200, #6205, #6207, #6208, #6209, #6210, #6211, #6212, #6213, #6256, #6258, #6350, #6351
  • Serial: Allow for distinct execution space instances #6441

Backend and Architecture Enhancements:

CUDA:

  • Fixed potential data race in Cuda parallel_reduce #6236
  • Use cudaMallocAsync by default #6402
  • Bugfix for using Kokkos from a thread of execution #6299

HIP:

  • New naming convention for AMD GPU: VEGA906, VEGA908, VEGA90A, NAVI1030 to AMD_GFX906, AMD_GFX908, AMD_GFX90A, AMD_GFX1030 #6266
  • Add initial support for gfx942: #6358
  • Improve reduction performance #6229
  • Deprecate HIP(hipStream_t,bool) constructor #6401
  • Add support for Graph #6370
  • Improve reduction performance when using Teams #6284
  • Fix concurrency calculation #6479
  • Fix potential data race in HIP parallel_reduce #6429

SYCL:

  • Enforce external sycl::queues to be in-order #6246
  • Improve reduction performance: #6272 #6271 #6270 #6264
  • Allow using the SYCL execution space on AMD GPUs #6321
  • Allow sorting via native oneDPL to support Views with stride=1 #6322
  • Make in-order queues the default via macro #6189

OpenACC:

  • Support Clacc compiler #6250

General Enhancements

  • Add missing is_*_view traits and is_*_view_v helper variable templates for DynRankView, DynamicView, OffsetView, ScatterView containers #6195
  • Make nvcc_wrapper and compiler_launcher scripts more portable by switching to a #!/usr/bin/env shebang #6357
  • Add an improved Kokkos::malloc / Kokkos::free performance test #6377
  • Ensure Views with size==0 can be used with deep_copy #6273
  • Kokkos::abort is moved to header Kokkos_Abort.hpp #6445
  • KOKKOS_ASSERT, KOKKOS_EXPECTS, KOKKOS_ENSURES are moved to header Kokkos_Assert.hpp #6445
  • Add a permuted-index mode to the gups benchmark #6378
  • Check for overflow during backend initialization #6159
  • Make constraints on Kokkos::sort more visible #6234 and cleanup API #6239
  • Add converting assignment to DualView: #6474

Build System Changes

  • Export Kokkos_CXX_COMPILER_VERSION #6282
  • Disable default oneDPL support in Trilinos #6342

Incompatibilities (i.e. breaking changes)

  • Ensure that Kokkos::complex only gets instantiated for cv-unqualified floating-point types #6251
  • Removed (deprecated-3) support for volatile join operators in reductions #6385
  • Enforce ViewCtorArgs restrictions for create_mirror_view #6304
  • SIMD types for ARM NEON are not autodetected anymore but need Kokkos_ARCH_ARM_NEON or Kokkos_ARCH_NATIVE=ON #6394
  • Remove #include <iostream> from headers where possible #6482

Deprecations

  • Deprecated Kokkos::vector #6252
  • All host allocation mechanisms except for STD_MALLOC have been deprecated #6341

Bug Fixes

  • Missing memory fence in RandomPool::free_state functions #6290
  • Fix for corner case in Kokkos::Experimental::is_partitioned algorithm #6257
  • Fix initialization of scratch lock variables in the Cuda backend #6433
  • Fixes for Kokkos::Array #6372
  • Fixed symlink configure issue for Windows #6241
  • OpenMPTarget init-join fix #6444
  • Fix atomic operations bug for Min and Max #6435
  • Fix implementation for cyl_bessel_i0 #6484
  • Fix various NVCC warnings in BinSort, Array, and bit manipulation function templates #6483

4.1.00 (2023-06-16)

Full Changelog

Features:

  • Add <Kokkos_BitManipulation.hpp> header #4577 #5907 #5967 #6101
  • Add UnorderedMapInsertOpTypes #5877 and documentation #350
  • Add multiple reducers support for team-level parallel reduce #5727

Backend and Architecture Enhancements:

CUDA:

  • Allow NVCC 12 to compile using C++20 flag #5977
  • Remove ability to disable CMake option Kokkos_ENABLE_CUDA_LAMBDA and unconditionally enable CUDA extended lambda support. #5964
  • Drop unnecessary fences around the memory allocation when using CudaUVMSpace in views #6008

HIP:

  • Improve performance for parallel_reduce. Use different parameters for LightWeight kernels #6029 and #6160

SYCL:

  • Only pass one wrapper object in SYCL reductions #6047
  • Improve and simplify parallel_scan implementation #6064
  • Remove workaround for submit_barrier not being enqueued properly #5504
  • Fix guards for using scratch space with SYCL #6003
  • Fix compiling SYCL with KOKKOS_IMPL_DO_NOT_USE_PRINTF_USAGE #6219

OpenMPTarget:

  • Improve hierarchical parallelism for Intel architectures #6043
  • Enable Cray compiler for the OpenMPTarget backend. #5889

HPX:

  • Update HPX backend to use HPX's sender/receiver functionality #5628
  • Increase minimum required HPX version to 1.8.0 #6132
  • Implement HPX::in_parallel #6143

General Enhancements

  • Export CMake Kokkos_{CUDA,HIP}_ARCHITECTURES variables #5919 #5925
  • Add Kokkos::Profiling::ScopedRegion #5959 #5972
  • Add support for View::rank[_dynamic]() #5870
  • Detect incompatible relocatable device code mode to prevent ODR violations #5991
  • Add (experimental) support for 32-bit Darwin and PPC #5916
  • Add missing half and bhalf specialization of the infinity numeric trait #6055
  • Add is_dual_view trait and align further with regular view #6120
  • Allow templated functors in parallel_for, parallel_reduce and parallel_scan #5976
  • Define KOKKOS_COMPILER_INTEL_LLVM and only define at most one KOKKOS_COMPILER* macro #5906
  • Allow linking against build tree #6078
  • Allow passing a temporary std::vector to partition_space #6167
  • Kokkos can be used as an external dependency in Trilinos #6142, #6157 #6163
  • Left align demangled stacktrace output #6191
  • Improve OpenMP affinity warning to include MPI concerns #6185
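
The `Kokkos::Profiling::ScopedRegion` entry above refers to a scope-based way of marking profiling regions. The following is a minimal sketch of that RAII idea, assuming the region is opened on construction and closed on destruction. Every name here (`region_stack`, `push_region_sketch`, `pop_region_sketch`, `ScopedRegionSketch`) is a hypothetical stand-in and not part of the Kokkos Tools interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a profiler's stack of currently open regions.
inline std::vector<std::string>& region_stack() {
  static std::vector<std::string> stack;
  return stack;
}

inline void push_region_sketch(const std::string& name) {
  region_stack().push_back(name);
}

inline void pop_region_sketch() { region_stack().pop_back(); }

// RAII guard: opens a region on construction and closes it on destruction,
// so early returns and exceptions cannot leave a region open.
class ScopedRegionSketch {
 public:
  explicit ScopedRegionSketch(const std::string& name) {
    push_region_sketch(name);
  }
  ~ScopedRegionSketch() { pop_region_sketch(); }

  // Copying would pop the same region twice, so it is disabled.
  ScopedRegionSketch(const ScopedRegionSketch&) = delete;
  ScopedRegionSketch& operator=(const ScopedRegionSketch&) = delete;
};
```

Nested guards close in reverse order of construction, which matches how manually paired push and pop calls are expected to nest.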

Build System Changes

  • Drop Kokkos_ENABLE_LAUNCH_COMPILER option which had no effect #6148
  • Export variables for relevant Kokkos options with CMake #6142

Incompatibilities (i.e. breaking changes)

  • Desul atomics always enabled #5801
  • Drop KOKKOS_ENABLE_CUDA_ASM* and KOKKOS_ENABLE_*_ATOMICS macros #5940
  • Drop KOKKOS_ENABLE_RFO_PREFETCH macro #5944
  • Deprecate Kokkos_ENABLE_CUDA_LAMBDA configuration option and force it to ON #5964
  • Remove TriBITS Kokkos subpackages #6104
  • Cuda: Remove unused attach_texture_object #6129
  • Drop Kokkos_ENABLE_PROFILING_LOAD_PRINT configuration option #6150
  • Drop pointless Kokkos{Algorithms,Containers}_config.h files #6108

Deprecations

  • Deprecate BinSort, BinOp1D, and BinOp3D default constructors #6131

Bug Fixes

  • Fix SYCLTeamMember to take arguments for scratch sizes as std::size_t #5981
  • Fix Kokkos_SIMD with AVX2 on 64-bit architectures #6075
  • Fix incorrectly returned size for SIMD uint64_t in AVX2 #6004
  • Fix missing avx512 header file with gcc versions before 10 #6183
  • Fix incorrect results of parallel_reduce of types smaller than int on CUDA and HIP: #5745
  • CMake: update package compatibility mode when building within Trilinos #6012
  • Fix warnings generated from internal uses of ALL_t rather than Kokkos::ALL_t #6028
  • Fix bug in hpcbind script: check for correct Slurm variable #6116
  • KokkosTools: Don't call callbacks before backends are initialized #6114
  • Fix global fence in Kokkos::resize(DynRankView) #6184
  • Fix BinSort support for strided views #6081
  • Fix missing is_*_view traits in containers #6195
  • Fix broken OpenMP target on NVHPC #6171
  • Sorting an empty view should exit early and not fail #6130

4.0.01 (2023-04-14)

Full Changelog

Backend and Architecture Enhancements:

CUDA:

  • Allow NVCC 12 to compile using C++20 flag #6020
  • Add CUDA Ada architecture support #6022

HIP:

  • Add support for AMDGPU target NAVI31 / RX 7900 XT(X): gfx1100 #6021
  • HIP: Fix warning from std::memcpy #6019

SYCL:

  • Fix SYCLTeamMember to take arguments for scratch sizes as std::size_t #5986

General Enhancements

  • Fixup 4.0 change log #6023

Build System Changes

  • Cherry-pick TriBITS update from Trilinos #6037
  • CMake: update package compatibility mode when building within Trilinos #6013

Bug Fixes

  • Fix incorrectly returned size for SIMD uint64_t in AVX2 #6011
  • Desul atomics: wrong value for desul::Impl::numeric_limits_max<uint64_t> #6018
  • Fix warning in some user code when using std::memcpy #6000
  • Fix excessive build times using Makefile.kokkos #6068

4.0.0 (2023-02-21)

Full Changelog

Features:

  • Allow value types without default constructor in Kokkos::View with Kokkos::WithoutInitializing #5307
  • parallel_scan with View as result type. #5146
  • Introduced SharedSpace, an alias for a MemorySpace that is accessible by every ExecutionSpace. The memory is moved and then accessed locally. #5289
  • Introduced SharedHostPinnedSpace, an alias for a MemorySpace that is accessible by every ExecutionSpace. The memory is pinned to the host and accessed via zero-copy access. #5405
  • Add team- and thread-level sort, sort_by_key algorithms. #5317
  • Groundwork for MDSpan integration. #4973 and #5304
  • Introduced MD version of hierarchical parallelism: TeamThreadMDRange, ThreadVectorMDRange and TeamVectorMDRange. #5238

Backend and Architecture Enhancements:

CUDA:

  • Allow CUDA PTX forward compatibility #3612 #5536 #5527
  • Add support for NVIDIA Hopper GPU architecture #5538
  • Don't rely on synchronization behavior of default stream in CUDA and HIP #5391
  • Improve CUDA cache config settings #5706

HIP:

  • Move HIP, HIPSpace, HIPHostPinnedSpace, and HIPManagedSpace out of the Experimental namespace #5383
  • Don't rely on synchronization behavior of default stream in CUDA and HIP #5391
  • Export AMD architecture flag when using Trilinos #5528
  • Fix linking error (see OLCF issue) when using amdclang: #5539
  • Remove support for MI25 and add support for Navi 1030 #5522
  • Fix race condition when using HSA_XNACK=1 #5755
  • Add parameter to force using GlobalMemory launch mechanism. This can be used when encountering compiler bugs with ROCm 5.3 and 5.4 #5796

SYCL:

  • Delegate choice of workgroup size for parallel_reduce with RangePolicy to the compiler. #5227
  • SYCL RangePolicy: manually specify workgroup size through chunk size #4875

OpenMPTarget:

  • Select the right device #5492

OpenMP:

  • Add partition_space #5105

General Enhancements

  • Implement OffsetView constructor taking pairs and ViewCtorProp #5303
  • Promote math constants to Kokkos::numbers namespace #5434
  • Add overloads of hypot math function that take 3 arguments #5341
  • Add fma fused multiply-add math function #5428
  • Views using MemoryTraits::Atomic don't need volatile overloads for the value type anymore. #5455
  • Added is_team_handle trait #5375
  • Refactor desul atomics to support compiling CUDA with NVC++ #5431 #5497 #5498
  • Support finding libquadmath with native compiler support #5286
  • Add architecture flags for MSVC #5673
  • SIMD backend for ARM NEON #5829
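
The three-argument `hypot` and `fma` entries above have direct counterparts in the C++17 standard library. The sketch below uses the `std::` versions only, on the assumption that the Kokkos functions follow the same semantics; it does not call Kokkos, and the helper names `diagonal` and `scaled_add` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Length of the 3D vector (x, y, z), using the three-argument std::hypot
// overload added in C++17.
inline double diagonal(double x, double y, double z) {
  return std::hypot(x, y, z);
}

// Computes a * x + y with a single rounding step.
inline double scaled_add(double a, double x, double y) {
  return std::fma(a, x, y);
}
```

For example, `diagonal(1.0, 2.0, 2.0)` evaluates the length of the vector (1, 2, 2), and `scaled_add(2.0, 3.0, 4.0)` evaluates 2 * 3 + 4.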

Build System Changes

  • Let CMake determine OpenMP flags. #4105
  • Update minimum compiler versions. #5323
  • Makefile and CMake support for C++23 #5283
  • Do not add -cuda to the link line with NVHPC compiler when the CUDA backend is not actually enabled #5485
  • Only add -latomic in generated GNU makefiles when OpenMPTarget backend is enabled #5501 #5537 (3.7 patch release candidate)
  • Kokkos_ENABLE_CUDA_LAMBDA now ON by default with NVCC #5580
  • Fix enabling of relocatable device code when using CUDA as CMake language #5564
  • Fix cmake configuration with CUDA 12 #5691

Incompatibilities (i.e. breaking changes)

  • Require C++17 #5277
  • Turn setting Kokkos_CXX_STANDARD into an error #5293
  • Remove all deprecations in Kokkos 3 #5297
  • Remove KOKKOS_COMPILER_CUDA_VERSION #5430
  • Drop reciprocal_overflow_threshold numeric trait #5326
  • Move reduction_identity out of <Kokkos_NumericTraits.hpp> into a new <Kokkos_ReductionIdentity.hpp> header #5450
  • Reduction and scan routines will report an error if the join() operator they would use takes volatile-qualified parameters #5409
  • ENABLE_CUDA_UVM is dropped in favor of using SharedSpace as MemorySpace explicitly #5608
  • Remove Kokkos_ENABLE_CUDA_LDG_INTRINSIC option #5623
  • Don't rely on synchronization behavior of default stream in CUDA and HIP - this potentially will break unintended implicit synchronization with other libraries such as MPI #5391
  • Make ExecutionSpace::concurrency() a non-static member function #5655 and related PRs
  • Remove code guarded by KOKKOS_ENABLE_DEPRECATED_CODE_3

Deprecations

  • Deprecate CudaUVMSpace::available() which always returned true #5614
  • Deprecate volatile-qualified members from Kokkos::pair and Kokkos::complex #5412
  • Deprecate KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_* macros #5824 (oversight in 3.6)

Bug Fixes

  • Avoid allocating memory for UniqueToken #5300
  • Fix pragma ivdep in Kokkos_OpenMP_Parallel.hpp #5356
  • Fix configuring with Threads support when rerunning CMake #5486
  • Fix View assignment between LayoutLeft and LayoutRight with static extents #5535 (3.7 patch release candidate)
  • Add fence() calls to sorting routine overloads that don't take an execution space parameter #5389
  • ClockTic changed to 64 bit to fix overflow on Power #5577 (incl. in 3.7.01 patch release)
  • Fix incorrect offset in CUDA and HIP parallel_scan for < 4 byte types #5555 (3.7 patch release candidate)
  • Fix incorrect alignment behavior of scratch allocations in some corner cases (e.g. very small allocations) #5687 (3.7 patch release candidate)
  • Add missing ReductionIdentity<char> specialization #5798
  • Don't install standard algorithms headers multiple times #5670
  • Fix max scratch size calculation for level 0 scratch in CUDA and HIP #5718

3.7.02 (2023-05-17)

Full Changelog

Backends and Archs Enhancements:

CUDA

  • Add Hopper support and update nvcc_wrapper to work with CUDA-12 #5693

General Enhancements:

  • sprintf -> snprintf #5787

Build System:

  • Add error message when not using hipcc and when CMAKE_CXX_STANDARD is not set #5945

Bug Fixes:

  • Fix Scratch allocation alignment issues #5692
  • Fix Intel Classic Compiler ICE #5710
  • Don't install std algorithm headers multiple times #5711
  • Fix static init order issue in InitializationSettings #5721
  • Fix src/dst Properties in deep_copy(DynamicView,View) #5732
  • Fix build on Fedora Rawhide #5782
  • Finalize HIP lock arrays #5694
  • Fix CUDA lock arrays for current Desul #5812
  • Set the correct device/context in InterOp tests #5701

3.7.01 (2022-12-01)

Full Changelog

Bug Fixes:

  • Add fences to all sorting routines not taking an execution space instance argument #5547
  • Fix repeated team_reduce without barrier #5552
  • Fix memory spaces in create_mirror_view overloads using view_alloc #5521
  • Allow as_view_of_rank_n() to be overloaded for "special" scalar types #5553
  • Fix warning about calling a __host__ function from a __host__ __device__ function in View::as_view_of_rank_n #5591
  • OpenMPTarget: adding implementation to set device id. #5557
  • Use Kokkos::atomic_load to correct a race condition that caused segfaults in OpenMP tests #5559
  • cmake: define KOKKOS_ARCH_A64FX #5561
  • Only link against libatomic in gnu-make OpenMPTarget build #5565
  • Fix static extents assignment for LayoutLeft/LayoutRight assignment #5566
  • Do not add -cuda to the link line with NVHPC compiler when the CUDA backend is not actually enabled #5569
  • Export the flags in KOKKOS_AMDGPU_OPTIONS when using Trilinos #5571
  • Add support for detecting MPI local rank with MPICH and PMI #5570 #5582
  • Remove listing of undefined TPL dependencies #5573
  • ClockTic changed to 64 bit to fix overflow on Power #5592
  • Fix incorrect offset in CUDA and HIP parallel scan for < 4 byte types #5607
  • Fix initialization of Cuda lock arrays #5622

3.7.00 (2022-08-22)

Full Changelog

Features:

  • Use non-volatile join() member functions and operator+= in parallel_reduce/scan #4931 #4954 #4951
  • Add SIMD sub package (requires C++17) #5016
  • Add is_finalized() #5247
  • Promote mathematical functions from namespace Kokkos::Experimental to namespace Kokkos #4791
  • Promote min, max, clamp, minmax functions from namespace Kokkos::Experimental to namespace Kokkos #5170
  • Add round, logb, nextafter, copysign, and signbit math functions #4768
  • Add HIPManagedSpace, similar to CudaUVMSpace #5112
  • Accept view construction allocation properties in create_mirror[_view,_view_and_copy] and resize/realloc #5125 #5095 #5035 #4805 #4844
  • Allow MemorySpace::allocate() to be called with execution space #4826
  • Experimental: Compile time view subscriber #4197
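
The first entry in this list concerns reduction functors whose `join()` takes non-volatile arguments. As a rough sketch of what such a functor can look like, the example below defines a hypothetical `MaxFunctorSketch` and a serial driver, `reduce_in_two_chunks`, that stands in for a parallel reduction. Neither is Kokkos code, and the driver only imitates combining two partial results.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical max-reduction functor with non-volatile init() and join().
struct MaxFunctorSketch {
  using value_type = double;
  const std::vector<double>* data;

  // Per-element contribution to a partial result.
  void operator()(std::size_t i, value_type& update) const {
    if ((*data)[i] > update) update = (*data)[i];
  }

  // Identity element for a max reduction.
  void init(value_type& v) const {
    v = std::numeric_limits<double>::lowest();
  }

  // Combines two partial results; note the absence of volatile qualifiers.
  void join(value_type& dst, const value_type& src) const {
    if (src > dst) dst = src;
  }
};

// Serial stand-in for a parallel reduction: reduces two halves
// independently, then merges the partial results with join().
template <class F>
typename F::value_type reduce_in_two_chunks(std::size_t n, const F& f) {
  typename F::value_type left;
  typename F::value_type right;
  f.init(left);
  f.init(right);
  const std::size_t mid = n / 2;
  for (std::size_t i = 0; i < mid; ++i) f(i, left);
  for (std::size_t i = mid; i < n; ++i) f(i, right);
  f.join(left, right);
  return left;
}
```

The driver calls `join()` on ordinary references, which is the calling convention this sketch assumes for a non-volatile join.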

Backends and Archs Enhancements:

  • Add support for Sapphire Rapids Intel architecture #5015
  • Add support for ICX, SKL and ICL Intel architectures #5013 #4929
  • Add arch flags for Intel GPU Ponte Vecchio #4932
  • SYCL: require GPU if GPU architecture was set at configuration time (i.e. do not allow fallback to CPU device) #5264 #5222
  • SYCL: Add SYCL::sycl_queue() for interoperability #5241
  • SYCL: Loosen restriction for using built-in sycl::group_broadcast #4552
  • SYCL: preserve address space #4396
  • OpenMPTarget: Adding a workaround for team scan #5219
  • OpenMPTarget: Adding logic to skip the kernel launch if league_size=0 #5067
  • OpenMPTarget: Make sure Kokkos::abort() causes abnormal program termination when called on the host-side #4808
  • HIP: Make HIPHostPinnedSpace coarse-grained #5152
  • Refactor OpenMP parallel_for implementation to use more native OpenMP constructs #4664
  • Add option to optimize for local CPU architecture Kokkos_ARCH_NATIVE #4930

Implemented enhancements

  • Add command line argument/environment variable to print the configuration #5233
  • Improve error message in view memory access violations #4950
  • Remove unnecessary fences in View initialization #4823
  • Make View::shmem_size() device-callable #4936
  • Update numerics support for __float128 #5081
  • Add log10 overload for Kokkos::complex #5009
  • Add [[nodiscard]] to ScopeGuard #5224
  • Add structured binding support for Kokkos::Array #4962
  • Enable accessing Kokkos::Array elements in constant expressions #4916
  • Mark as_view_of_rank_n as KOKKOS_FUNCTION #5248
  • Cleanup/rework fence overloads #5148
  • Assert that Layout construction from extents is valid in functions taking integer extents #5209
  • Add fill_random overload that takes an execution space as first argument #5181
  • Avoid some unnecessary fences in parallel_reduce/scan #5154
  • Include KOKKOS_ENABLE_LIBDL in options when printing configuration #5086
  • DynRankView: make layout() return the same as a corresponding static View #5026
  • Use _mm_malloc for icpx #5012
  • Avoid forcing matching execution spaces in BinSort constructor and sort() #4919
  • Check number of bins in BinSort #4890
  • Improve performance in parallel STL-like algorithms #4887 #4886
  • Disable memset on A64FX and launch parallel_for instead (performance) #4884
  • Allow non-power-of-two team sizes for team reductions and scans #4809

Harmonization of Kokkos execution environment initialization:

  • Warn when unable to detect local MPI rank and user explicitly asked for it #5263
  • Refactor parsing of command line arguments and environment variables #5221
  • Refactor device selection at initialization #5211
  • Rename tools settings for consistency #5201
  • Print help only once #5128
  • Update precedence rule in initialization #5130
  • Warn instead of just ignoring user settings when kokkos-tools is disabled #5088
  • Drop numa args in threads backend initialization #5127
  • Warn users when a flag prefixed with -[-]kokkos is not recognized and do not remove it #5256
  • Give back to Core what belongs to Core (aka moving tune_internals option from Tools back to Core) #5202

Build system updates:

  • nvcc_wrapper: filter out -pedantic-errors from nvcc options #5235
  • nvcc_wrapper: add known nvcc option --source-in-ptx #5052
  • Link libdl as interface library #5179
  • Only show GPU architectures with enabled corresponding backend #5119
  • Enable optional external desul build #5021 #5132
  • Export Kokkos_CXX_STANDARD variable with CMake #5068
  • Suppress warnings with nvc++ #5031
  • Disallow multiple host architectures in CMake #4996
  • Do not include compiler warning flags in the compile option of the cmake target #4989
  • AOT flags for OpenMPTarget targeting Intel GPUs #4915
  • Repurpose Kokkos_ARCH_INTEL_GEN for SYCL to mean JIT to be conforming with OMPT #4894
  • Replace amdgpu-target with offload-arch #4874
  • Do not enable kokkos_launch_compiler when CMAKE_CXX_COMPILER_LAUNCHER is set #4870
  • Move CMake version check up #4797

Incompatibilities:

  • Remove KOKKOS_THREAD_LOCAL #5064
  • Remove KOKKOS_ENABLE_POSIX_MEMALIGN #5011
  • Remove unused KOKKOS_ENABLE_TM #4995
  • Remove unused cmakedefine KOKKOS_ENABLE_COMPILER_WARNINGS #4883
  • Remove unused KOKKOS_ENABLE_DUALVIEW_MODIFY_CHECK #4882
  • Drop Instruction Set Architecture (ISA) macros #4981
  • Warn in ScopeGuard about illegal usage #5250

Deprecations:

  • Guard against non-public header inclusion #5178
  • Raise deprecation warnings if non empty WorkTag class is used #5230
  • Deprecate parallel_* overloads taking the label as trailing argument #5141
  • Deprecate nested types in functional #5185
  • Deprecate InitArguments struct and replace it with InitializationSettings #5135
  • Deprecate finalize_all() #5134
  • Deprecate command line arguments (other than --help) that are not prefixed with kokkos-* #5120
  • Deprecate --[kokkos-]numa cmdline arg and KOKKOS_NUMA env var #5117
  • Deprecate --[kokkos-]threads command line argument in favor of --[kokkos-]num-threads #5111
  • Deprecate Kokkos::is_reducer_type #4957
  • Deprecate OffsetView constructors taking index_list_type #4810
  • Deprecate overloads of Kokkos::sort taking a parameter bool always_use_kokkos_sort #5382
  • Warn about parallel_reduce cases that call join() with volatile-qualified arguments #5215

Bug Fixes:

  • CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer #4855
  • Work around Intel compiler bug #5301
  • Avoid allocating memory for UniqueToken #5300
  • DynamicView: Properly resize mirror instances after construction #5276
  • Remove Kokkos::Rank limit of 6 ranks #5271
  • Do not forget to set last element to nullptr when removing a flag in Kokkos::initialize #5272
  • Fix CUDA+MSVC build issue #5261
  • Fix DynamicView::resize_serial #5220
  • Fix cmake default compiler flags for unknown compiler #5217
  • Fix move_backward #5191
  • Fixing issue 5196 - missing symbol with intel compiler #5207
  • Preserve KOKKOS_INVALID_INDEX in ViewDimension and ArrayLayout construction #5188
  • Finalize deep_copy_space early avoiding printing to std::cerr for Cuda #5151
  • Use correct policy in Threads MDRange parallel_reduce #5123
  • Fix building with NVCC as the CXX compiler while the CUDA backend is not enabled #5115
  • OpenMPTarget Index range fix for MDRange. #5089
  • Fix bug with CUDA's team reduction for empty ranges #5079
  • Fix using ZeroMemset for Serial #5077
  • Fix Kokkos::Vector::push_back for default execution space #5047
  • ScatterView: Fix ScatterMin/ScatterMax to use proper atomics #5045
  • Fix calling ZeroMemset in deep_copy #5040
  • Make View self-assignment not produce double-free #5024
  • Guard against unrecognized pragma with intel compilers #5019
  • Fix race condition in HIPParallelLaunch #5008
  • KokkosP: Fix device_id in profiling #4997
  • Fix for Kokkos::vector::insert into empty vector with begin and end iterators #4988
  • Fix Core header files installation #4984
  • Fix bounds errors with Kokkos::sort #4980
  • Fixup let RangePolicy::set_chunk_size return a reference to self #4918
  • Fix allocating large Views #4907
  • Fix combined reductions with Kokkos::View #4896
  • Fixed _CUDA_ARCH__ to __CUDA_ARCH__ for CUDA LDG #4893
  • Fixup View::access() truncate parameter pack #4876
  • Fix abort with HIP backend for ROCm 5.0.2 and beyond #4873
  • Fix HIP version when printing the configuration #4872
  • Fix scratch lock array when using scratch level 1 #4871
  • Fix Makefile.kokkos to work with fujitsu compiler #4867
  • cmake: Correct THREADS link option #4854
  • UniqueToken impl_acquire function should be device only #4819
  • Fix example calls to non existing static print_configuration #4806
  • Fix requests for large team scratch sizes #4728

3.6.01 (2022-05-23)

Full Changelog

Bug Fixes:

  • Fix Threads: Fix serial resizing scratch space (3.6.01 cherry-pick) #5109
  • Fix ScatterMin/ScatterMax to use proper atomics (3.6.01 cherry-pick) #5046
  • Fix allocating large Views #4907
  • Fix bounds errors with Kokkos::sort #4980
  • Fix HIP version when printing the configuration #4872
  • Fixed _CUDA_ARCH__ to __CUDA_ARCH__ for CUDA LDG #4893
  • Fixed an incorrect struct initialization #5028
  • Fix race condition in HIPParallelLaunch #5008
  • Avoid deprecation warnings with OpenMPExec::validate_partition #4982
  • Make View self-assignment not produce double-free #5024

3.6.00 (2022-02-18)

Full Changelog

Features:

  • Add C++ standard algorithms #4315
  • Implement fill_random for DynRankView #4763
  • Add bhalf_t #4543 #4653
  • Add mathematical constants #4519
  • Allow Kokkos::{create_mirror*,resize,realloc} to be used with WithoutInitializing #4486 #4337
  • Implement KOKKOS_IF_ON_{HOST,DEVICE} macros #4660
  • Allow setting the CMake language for Kokkos #4323

Perf bug fix

  • Desul: Add ScopeCaller #4690
  • Enable Desul atomics by default when using Makefiles #4606
  • Unique token improvement #4741 #4748

Other improvements:

  • Add math function long double overload on the host side #4712

Deprecations:

  • Array reductions with pointer return types #4756
  • Deprecate partition_master, validate_partition #4737
  • Deprecate Kokkos_ENABLE_PTHREAD in favor of Kokkos_ENABLE_THREADS #4619 ** pair with use std::threads **
  • Deprecate log2(unsigned) -> int (removing in next release) #4595
  • Deprecate Kokkos::Impl::is_view #4592
  • Deprecate KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_* macros and the ActiveExecutionMemorySpace alias #4668

Backends and Archs Enhancements:

SYCL:

  • Update required SYCL compiler version #4749
  • Cap vector size to kernel maximum for SYCL #4704
  • Improve check for compatibility of vector size and subgroup size in SYCL #4579
  • Provide chunk_size for SYCL #4635
  • Use host-pinned memory for SYCL kernel memory #4627
  • Use shuffle-based algorithm for scalar reduction #4608
  • Implement pool of USM IndirectKernelMemory #4596
  • Provide valid default team size for SYCL #4481

CUDA:

  • Add checks for shmem usage in parallel_reduce #4548

HIP:

  • Add support for fp16 in the HIP backend #4688
  • Disable multiple kernel instantiations when using HIP (configure with -DKokkos_ENABLE_HIP_MULTIPLE_KERNEL_INSTANTIATIONS=ON to use) #4644
  • Fix HIP scratch use per instance #4439
  • Change allocation header to 256B alignment for AMD VEGA architecture #4753
  • Add generic KOKKOS_ARCH_VEGA macro #4782
  • Require ROCm 4.5 #4689

HPX:

  • Adapt to HPX 1.7.0 which is now required #4241

OpenMP:

  • Fix thread deduction for OpenMP for thread_count==0 #4541

OpenMPTarget:

  • Update memory space size_type to improve performance (size_t -> unsigned) #4779

Other Improvements:

  • Improve NVHPC support #4599
  • Add Kokkos::Experimental::{min,max,minmax,clamp} #4629 #4506
  • Use device type as template argument in Containers and Algorithms #4724 #4675
  • Implement Kokkos::sort with execution space #4490
  • Kokkos::resize always error out for mismatch in runtime rank #4681
  • Print current call stack when calling Kokkos::abort() from the host #4672 #4671
  • Detect mismatch of execution spaces in functors #4655
  • Improve view label access on host #4647
  • Error out for const scalar return type in reduction #4645
  • Don't allow calling UnorderedMap::value_at for a set #4639
  • Add KOKKOS_COMPILER_NVHPC macro, disable quiet_NaN and signaling_NaN #4586
  • Improve performance of local_deep_copy #4511
  • Improve performance when sorting integers #4464
  • Add missing numeric traits (denorm_min, reciprocal_overflow_threshold, {quiet,silent}_NaN) and make them work on cv-qualified types #4466 #4415 #4473 #4443

Implemented enhancements BuildSystem:

  • Manually compute IntelLLVM compiler version for older CMake versions #4760
  • Add Xptxas without = to nvcc_wrapper #4646
  • Use external GoogleTest optionally #4563
  • Silence warnings about multiple optimization flags with nvcc_wrapper #4502
  • Use the same flags in Makefile.kokkos for POWER7/8/9 as for CMake #4483
  • Fix support for A64FX architecture #4745

Incompatibilities:

  • Drop KOKKOS_ARCH_HIP macro when using generated GNU makefiles #4786
  • Remove gcc-toolchain auto add for clang in Makefile.kokkos #4762

Bug Fixes:

  • Lock constant memory in Cuda/HIP kernel launch with a mutex (thread safety) #4525
  • Fix overflow for large requested scratch allocation #4551
  • Fix Windows build in mingw #4564
  • Fix kokkos_launch_compiler: escape $ character #4769 #4703
  • Fix math functions with NVCC and GCC 5 as host compiler #4733
  • Fix shared build with Intel19 #4725
  • Do not install empty desul/src/ directory #4714
  • Fix wrong device_id computation in identifier_from_devid (Profiling Interface) #4694
  • Fix a bug in CUDA scratch memory pool (abnormally high memory consumption) #4673
  • Remove eval of command args in hpcbind #4630
  • SYCL fix to run when no GPU is detected #4623
  • Fix layout_strides::span for rank-0 views #4605
  • Fix SYCL atomics for local memory #4585
  • Hotfix mdrange_large_deep_copy for SYCL #4581
  • Fix bug when sorting integer using the HIP backend #4570
  • Fix compilation error when using HIP with RDC #4553
  • DynamicView: Fix deallocation extent #4533
  • SYCL fix running parallel_reduce with TeamPolicy for large ranges #4532
  • Fix bash syntax error in nvcc_wrapper #4524
  • OpenMPTarget team_policy reduce fixes for init/join reductions #4521
  • Avoid hangs in the Threads backend #4499
  • OpenMPTarget fix reduction bug in parallel_reduce for TeamPolicy #4491
  • HIP fix scratch space per instance #4439
  • OpenMPTarget fix team scratch allocation #4431

3.5.00 (2021-10-19)

Full Changelog

Features:

  • Add support for quad-precision math functions/traits #4098
  • Adding ExecutionSpace partitioning function #4096
  • Improve Python Interop Capabilities #4065
  • Add half_t Kokkos::rand specialization #3922
  • Add math special functions: erf, erfcx, expint1, Bessel functions, Hankel functions #3920
  • Add missing common mathematical functions #4043 #4036 #4034
  • Let the numeric traits be SFINAE-friendly #4038
  • Add Desul atomics - enabling memory-order and memory-scope parameters #3247
  • Add detection idiom from the C++ standard library extension version 2 #3980
  • Fence Profiling Support in all backends #3966 #4304 #4258 #4232
  • Significant SYCL enhancements (see below)

Deprecations:

  • Deprecate CUDA_SAFE_CALL and HIP_SAFE_CALL #4249
  • Deprecate Kokkos::Impl::Timer (Kokkos::Timer has been available for a long time) #4201
  • Deprecate Experimental::MasterLock #4094
  • Deprecate Kokkos_TaskPolicy.hpp (headers got reorganized, doesn't remove functionality) #4011
  • Deprecate backward compatibility features #3978
  • Update and deprecate is_space::host_memory/execution/mirror_space #3973

Backends and Archs Enhancements:

  • Enabling constbitset constructors in kernels #4296
  • Use ZeroMemset in View constructor to improve performance #4226
  • Use memset in deep_copy #3944
  • Add missing fence() calls in resize(View) that effectively do deep_copy(resized, orig) #4212
  • Avoid allocations in resize and realloc #4207
  • StaticCsrGraph: use device type instead of execution space to construct views #3991
  • Consider std::sort when view is accessible from host #3929
  • Fix CPP20 warnings except for volatile #4312

SYCL:

  • Introduce SYCLHostUSMSpace #4268
  • Implement SYCL TeamPolicy for vector_size > 1 #4183
  • Enable 64bit ranges for SYCL #4211
  • Don't print SYCL device info in execution space initialization #4168
  • Improve SYCL MDRangePolicy performance #4161
  • Use sub_groups in SYCL parallel_scan #4147
  • Implement subgroup reduction for SYCL RangePolicy parallel_reduce #3940
  • Use DPC++ broadcast extension in SYCL team_broadcast #4103
  • Only fence in SYCL parallel_reduce for non-device-accessible result_ptr #4089
  • Improve fencing behavior in SYCL backend #4088
  • Fence all registered SYCL queues before deallocating memory #4086
  • Implement SYCL::print_configuration #3992
  • Reuse scratch memory in parallel_scan and TeamPolicy (decreases memory footprint) #3899 #3889

CUDA:

  • Cuda improve heuristic for blocksize #4271
  • Don't use [[deprecated]] for nvcc #4229
  • Improve error message for NVHPC as host compiler #4227
  • Update support for cuda reductions to work with types < 4bytes #4156
  • Fix incompatible team size deduction in rare cases parallel_reduce #4142
  • Remove UVM usage in DynamicView #4129
  • Remove dependency between core and containers #4114
  • Adding opt-in CudaMallocSync support when using CUDA version >= 11.2 #4026 #4233
  • Fix a potential race condition in the CUDA backend #3999

HIP:

  • Implement new blocksize deduction method for HIP Backend #3953
  • Add multiple LaunchMechanism #3820
  • Make HIP backend thread-safe #4170

Serial:

  • Refactor Serial backend and fix thread-safety issue #4053

OpenMPTarget:

  • OpenMPTarget: support array reductions in RangePolicy #4040
  • OpenMPTarget: add MDRange parallel_reduce #4032
  • OpenMPTarget: Fix bug in the case of a reducer #4044
  • OpenMPTarget: verify process fix #4041

Implemented enhancements BuildSystem:

Important BuildSystem Updates:

  • Use hipcc architecture autodetection when Kokkos_ARCH is not set #3941
  • Introduce Kokkos_ENABLE_DEPRECATION_WARNINGS and remove deprecated code with Kokkos_ENABLE_DEPRECATED_CODE_3 #4106 #3855

Other Improvements:

  • Add allow-unsupported-compiler flag to nvcc-wrapper #4298
  • nvcc_wrapper: fix errors in argument handling #3993
  • Adds support for -time= and -time in nvcc_wrapper #4015
  • nvcc_wrapper: suppress duplicates of GPU architecture and RDC flags #3968
  • Fix TMPDIR support in nvcc_wrapper #3792
  • NVHPC: update PGI compiler arch flags #4133
  • Replace PGI with NVHPC (works for both) #4196
  • Make sure that KOKKOS_CXX_HOST_COMPILER_ID is defined #4235
  • Add options to Makefile builds for deprecated code and warnings #4215
  • Use KOKKOS_CXX_HOST_COMPILER_ID for identifying CPU arch flags #4199
  • Added support for Cray Clang to Makefile.kokkos #4176
  • Add XLClang as compiler #4120
  • Keep quoted compiler flags when passing to Trilinos #3987
  • Add support for AMD Zen3 CPU architecture #3972
  • Rename IntelClang to IntelLLVM #3945
  • Add cppcoreguidelines-pro-type-cstyle-cast to clang-tidy #3522
  • Add sve bit size definition for A64FX #3947 #3946
  • Remove KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES #4150

Other Changes:

Tool Enhancements:

  • Retrieve original value from a point in a MultidimensionalSparseTuningProblem #3977
  • Allow extension of built-in tuners with additional tuning axes #3961
  • Added a categorical tuner #3955

Miscellaneous:

  • hpcbind: Use double quotes around $@ when invoking user command #4284
  • Add file and line to error message #3985
  • Fix compiler warnings when compiling with nvc++ #4198
  • Add OpenMPTarget CI build on AMD GPUs #4055
  • CI: icpx is now part of intel container #4002

Incompatibilities:

  • Remove pre CUDA 9 KOKKOS_IMPL_CUDA_* macros #4138

Bug Fixes:

  • UnorderedMap::clear() should zero the size() #4130
  • Add memory fence for HostSharedPtr::cleanup() #4144
  • SYCL: Fix race conditions in TeamPolicy::parallel_reduce #4418
  • Adding missing memory fence to serial exec space fence. #4292
  • Fix using external SYCL queues in tests #4291
  • Fix digits10 bug #4281
  • Fixes constexpr errors with frounding-math on gcc < 10. #4278
  • Fix compiler flags for PGI/NVHPC #4264
  • Fix Zen2/3 also implying Zen Arch with Makefiles #4260
  • Kokkos_Cuda.hpp: Fix shadow warning with cuda/11.0 #4252
  • Fix issue w/ static initialization of function attributes #4242
  • Disable long double hypot test on Power systems #4221
  • Fix false sharing in random pool #4218
  • Fix a missing memory_fence for debug shared alloc code #4216
  • Fix two xl issues #4179
  • Makefile.kokkos: fix (standard_in) 1: syntax error #4173
  • Fixes for query_device example #4172
  • Fix a bug when using HIP atomic with Kokkos::Complex #4159
  • Fix mistaken logic in pthread creation #4157
  • Define KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION when requesting Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION=ON #4107
  • Fix compilation with latest MSVC version #4102
  • Fix incorrect macro definitions when compiling with Intel compiler on Windows #4087
  • Fixup global buffer overflow in hand rolled string manipulation #4070
  • Fixup heap buffer overflow in cmd line args parsing unit tests #4069
  • Only add quotes in compiler flags for Trilinos if necessary #4067
  • Fixed invocation of tools init callbacks #4061
  • Work around SYCL JIT compiler issues with static variables #4013
  • Fix TestDetectionIdiom.cpp test inclusion for Trilinos/TriBITS #4010
  • Fixup allocation headers with OpenMPTarget backend #4003
  • Add missing specialization for OMPT to Kokkos Random #3967
  • Disable hypot long double test on power arches #3962
  • Use different EBO workaround for MSVC (rebased) #3924
  • Fix SYCL Kokkos::Profiling::(de)allocateData calls #3928

3.4.01 (2021-05-19)

Full Changelog

Bug Fixes:

  • Windows: Remove atomic_compare_exchange_strong overload conflicts with Windows #4024
  • OpenMPTarget: Fixup allocation headers with OpenMPTarget backend #4020
  • OpenMPTarget: Add missing specialization for OMPT to Kokkos Random #4022
  • AMD: Add support for AMD Zen3 CPU architecture #4021
  • SYCL: Implement SYCL::print_configuration #4012
  • Containers: staticcsrgraph: use device type instead of execution space to construct views #3998
  • nvcc_wrapper: fix errors in argument handling, suppress duplicates of GPU architecture and RDC flags #4006
  • CI: Add icpx testing to intel container #4004
  • CMake/TRIBITS: Keep quoted compiler flags when passing to Trilinos #4007
  • CMake: Rename IntelClang to IntelLLVM #3945

3.4.00 (2021-04-25)

Full Changelog

Highlights:

  • SYCL Backend Almost Feature Complete
  • OpenMPTarget Backend Almost Feature Complete
  • Performance Improvements for HIP backend
  • Require CMake 3.16 or newer
  • Tool Callback Interface Enhancements
  • cmath wrapper functions available now in Kokkos::Experimental

Features:

  • Implement parallel_scan with ThreadVectorRange and Reducer #3861
  • Implement SYCL Random #3849
  • OpenMPTarget: Adding Implementation for nested reducers #3845
  • Implement UniqueToken for SYCL #3833
  • OpenMPTarget: UniqueToken::Global implementation #3823
  • DualView sync's on ExecutionSpaces #3822
  • SYCL outer TeamPolicy parallel_reduce #3818
  • SYCL TeamPolicy::team_scan #3815
  • SYCL MDRangePolicy parallel_reduce #3801
  • Enable use of execution space instances in ScatterView #3786
  • SYCL TeamPolicy nested parallel_reduce #3783
  • OpenMPTarget: MDRange with TagType for parallel_for #3781
  • Adding OpenMPTarget parallel_scan #3655
  • SYCL basic TeamPolicy #3654
  • OpenMPTarget: scratch memory implementation #3611

Implemented enhancements Backends and Archs:

  • SYCL choose a specific GPU #3918
  • [HIP] Lock access to scratch memory when using Teams #3916
  • [HIP] fix multithreaded access to get_next_driver #3908
  • Forward declare HIPHostPinnedSpace and SYCLSharedUSMSpace #3902
  • Let SYCL USMObjectMem use SharedAllocationRecord #3898
  • Implement clock_tic for SYCL #3893
  • Don't use a static variable in HIPInternal::scratch_space #3866
  • Reuse memory for SYCL parallel_reduce #3873
  • Update SYCL compiler in CI #3826
  • Introduce HostSharedPtr to manage m_space_instance for Cuda/HIP/SYCL #3824
  • [HIP] Use shuffle for range reduction #3811
  • OpenMPTarget: Changes to the hierarchical parallelism #3808
  • Remove ExtendedReferenceWrapper for SYCL parallel_reduce #3802
  • Eliminate sycl_indirect_launch #3777
  • OpenMPTarget: scratch implementation for parallel_reduce #3776
  • Allow initializing SYCL execution space from sycl::queue and SYCL::impl_static_fence #3767
  • SYCL TeamPolicy scratch memory alternative #3763
  • Alternative implementation for SYCL TeamPolicy #3759
  • Unify handling of synchronous errors in SYCL #3754
  • core/Cuda: Half_t updates for cgsolve #3746
  • Unify HIPParallelLaunch structures #3733
  • Improve performance for SYCL parallel_reduce #3732
  • Use consistent types in Kokkos_OpenMPTarget_Parallel.hpp #3703
  • Implement non-blocking kernel launches for HIP backend #3697
  • Change SYCLInternal::m_queue std::unique_ptr -> std::optional #3677
  • Use alternative SYCL parallel_reduce implementation #3671
  • Use runtime values in KokkosExp_MDRangePolicy.hpp #3626
  • Clean up AnalyzePolicy #3564
  • Changes for indirect launch of SYCL parallel reduce #3511

Implemented enhancements BuildSystem:

  • Also require C++14 when building gtest #3912
  • Fix compiling SYCL with OpenMP #3874
  • Require C++17 for SYCL (at configuration time) #3869
  • Add COMPILE_DEFINITIONS argument to kokkos_create_imported_tpl #3862
  • Do not pass arch flags to the linker with no rdc #3846
  • Try compiling C++14 check with C++14 support and print error message #3843
  • Enable HIP with Cray Clang #3842
  • Add an option to disable header self containment tests #3834
  • CMake check for C++14 #3809
  • Prefer -std=* over --std=* #3779
  • Kokkos launch compiler updates #3778
  • Updated comments and enabled no-op for kokkos_launch_compiler #3774
  • Apple's Clang not correctly recognised #3772
  • kokkos_launch_compiler + CUDA auto-detect arch #3770
  • Add Spack test support for Kokkos #3753
  • Split SYCL tests for aot compilation #3741
  • Use consistent OpenMP flag for IntelClang #3735
  • Add support for -Wno-deprecated-gpu-targets #3722
  • Add configuration to target CUDA compute capability 8.6 #3713
  • Added VERSION and SOVERSION to KOKKOS_INTERNAL_ADD_LIBRARY #3706
  • Add fast-math to known NVCC flags #3699
  • Add MI-100 arch string #3698
  • Require CMake >=3.16 #3679
  • KokkosCI.cmake, KokkosCTest.cmake.in, CTestConfig.cmake.in + CI updates #2844

Implemented enhancements Tools:

  • Improve readability of the callback invocation in profiling #3860
  • V1.1 Tools Interface: incremental, action-based #3812
  • Enable launch latency simulations #3721
  • Added metadata callback to tools interface #3711
  • MDRange Tile Size Tuning #3688
  • Added support for command-line args for kokkos-tools #3627
  • Query max tile sizes for an MDRangePolicy, and set tile sizes on an existing policy #3481

Implemented enhancements Other:

  • Try detecting ndevices in get_gpu #3921
  • Use strcmp to compare names() #3909
  • Add execution space arguments for constructor overloads that might allocate a new underlying View #3904
  • Prefix labels in internal use of kokkos_malloc #3891
  • Prefix labels for internal uses of SharedAllocationRecord #3890
  • Add missing hypot math function #3880
  • Unify algorithm unit tests to avoid code duplication #3851
  • DualView.template view() better matches for Devices in UVMSpace cases #3857
  • More extensive disentangling of Policy Traits #3829
  • Replaced nanosleep and sched_yield with STL routines #3825
  • Constructing Atomic Subviews #3810
  • Metadata Declaration in Core #3729
  • Allow using tagged final functor in parallel_reduce #3714
  • Major duplicate code removal in SharedAllocationRecord specializations #3658

Fixed bugs:

  • Provide forward declarations in Kokkos_ViewLayoutTiled.hpp for XL #3911
  • Fixup absolute value of floating points in Kokkos complex #3882
  • Address intel 17 ICE #3881
  • Add missing pow(Kokkos::complex) overloads #3868
  • Fix bug {pow, log}(Kokkos::complex) #3866
  • Cleanup writing to output streams in Cuda #3859
  • Fixup cache CUDA fallback execution space instance used by DualView::sync #3856
  • Fix cmake warning with pthread #3854
  • Fix typo FOUND_CUDA_{DRIVVER -> DRIVER} #3852
  • Fix bug in SYCL team_reduce #3848
  • Atrocious bug in MDRange tuning #3803
  • Fix compiling SYCL with Kokkos_ENABLE_TUNING=ON #3800
  • Fixed command line parsing bug #3797
  • Workaround race condition in SYCL parallel_reduce #3782
  • Fix Atomic{Min,Max} for Kepler30 #3780
  • Fix SYCL typo #3755
  • Fixed Kokkos_install_additional_files macro #3752
  • Fix a typo for Kokkos_ARCH_A64FX #3751
  • OpenMPTarget: fixes and workarounds to work with "Release" build type #3748
  • Fix parsing bug for number of devices command line argument #3724
  • Avoid more warnings with clang and C++20 #3719
  • Fix gcc-10.1 C++20 warnings #3718
  • Fix cuda cache config not being set correctly #3712
  • Fix dualview deepcopy perftools #3701
  • use drand instead of frand in drand #3696

Incompatibilities:

  • Remove unimplemented member functions of SYCLDevice #3919
  • Replace cl::sycl #3896
  • Get rid of SYCL workaround in Kokkos_Complex.hpp #3884
  • Replace most uses of if_c #3883
  • Remove Impl::enable_if_type #3863
  • Remove HostBarrier test #3847
  • Avoid (void) interface #3836
  • Remove VerifyExecutionCanAccessMemorySpace #3813
  • Avoid duplicated code in ScratchMemorySpace #3793
  • Remove superfluous FunctorFinal specialization #3788
  • Rename cl::sycl -> sycl in Kokkos_MathematicalFunctions.hpp #3678
  • Remove integer_sequence backward compatibility implementation #3533

Enabled tests:

  • Fixup re-enable core performance tests #3903
  • Enable more SYCL tests #3900
  • Restrict MDRange Policy tests for Intel GPUs #3853
  • Disable death tests for rawhide #3844
  • OpenMPTarget: Block unit tests that do not pass with the nvidia compiler #3839
  • Enable Bitset container test for SYCL #3830
  • Enable some more SYCL tests #3744
  • Enable SYCL atomic tests #3742
  • Enable more SYCL perf_tests #3692
  • Enable examples for SYCL #3691

3.3.01 (2021-01-06)

Full Changelog

Bug Fixes:

  • Fix severe performance bug in DualView which added memcpys for sync and modify #3693
  • Fix performance bug in CUDA backend, where the cuda Cache config was not set correctly.

3.3.00 (2020-12-16)

Full Changelog

Features:

  • Require C++14 as minimum C++ standard. C++17 and C++20 are supported too.
  • HIP backend is nearly feature complete. Kokkos Dynamic Task Graphs are missing.
  • Major update for OpenMPTarget: many capabilities now work. For details contact us.
  • Added DPC++/SYCL backend: primary capabilities are working.
  • Added Kokkos Graph API analogous to CUDA Graphs.
  • Added parallel_scan support with TeamThreadRange #3536
  • Added Logical Memory Spaces #3546
  • Added initial half precision support #3439
  • Experimental feature: control cuda occupancy #3379

Implemented enhancements Backends and Archs:

  • Add a64fx and fujitsu Compiler support #3614
  • Adding support for AMD gfx908 architecture #3375
  • SYCL parallel_for MDRangePolicy #3583
  • SYCL add parallel_scan #3577
  • SYCL custom reductions #3544
  • SYCL Enable container unit tests #3550
  • SYCL feature level 5 #3480
  • SYCL Feature level 4 (parallel_for) #3474
  • SYCL feature level 3 #3451
  • SYCL feature level 2 #3447
  • OpenMPTarget: Hierarchical reduction for + operator on scalars #3504
  • OpenMPTarget hierarchical #3411
  • HIP Add Impl::atomic_[store,load] #3440
  • HIP enable global lock arrays #3418
  • HIP Implement multiple occupancy paths for various HIP kernel launchers #3366

Implemented enhancements Policies:

  • MDRangePolicy: Let it be semiregular #3494
  • MDRangePolicy: Check narrowing conversion in construction #3527
  • MDRangePolicy: CombinedReducers support #3395
  • Kokkos Graph: Interface and Default Implementation #3362
  • Kokkos Graph: add Cuda Graph implementation #3369
  • TeamPolicy: implemented autotuning of team sizes and vector lengths #3206
  • RangePolicy: Initialize all data members in default constructor #3509

Implemented enhancements BuildSystem:

  • Auto-generate core test files for all backends #3488
  • Avoid rewriting test files when calling cmake #3548
  • RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK system for nvcc_wrapper #3136
  • Adding -include as a known argument to nvcc_wrapper #3434
  • Install hpcbind script #3402
  • cmake/kokkos_tribits.cmake: add parsing for args #3457

Implemented enhancements Tools:

  • Changed namespacing of Kokkos::Tools::Impl::Impl::tune_policy #3455
  • Delegate to an impl allocate/deallocate method to allow specifying a SpaceHandle for MemorySpaces #3530
  • Use the Kokkos Profiling interface rather than the Impl interface #3518
  • Runtime option for tuning #3459
  • Dual View Tool Events #3326

Implemented enhancements Other:

  • Abort on errors instead of just printing #3528
  • Enable C++14 macros unconditionally #3449
  • Make ViewMapping trivially copyable #3436
  • Rename struct ViewMapping to class #3435
  • Replace enums in Kokkos_ViewMapping.hpp (removes -Wextra) #3422
  • Use bool for enums representing bools #3416
  • Fence active instead of default execution space instances #3388
  • Refactor parallel_reduce fence usage #3359
  • Moved Space EBO helpers to Kokkos_EBO #3357
  • Add remove_cvref type trait #3340
  • Adding identity type traits and update definition of identity_t alias #3339
  • Add is_specialization_of type trait #3338
  • Make ScratchMemorySpace semi-regular #3309
  • Optimize min/max atomics with early exit on no-op case #3265
  • Refactor Backend Development #2941

Fixed bugs:

  • Fixup MDRangePolicy construction from Kokkos arrays #3591
  • Add atomic functions for unsigned long long using gcc built-in #3588
  • Fixup silent pointless comparison with zero in checked_narrow_cast (compiler workaround) #3566
  • Fixes for ROCm 3.9 #3565
  • Fix windows build issues which crept in for the CUDA build #3532
  • HIP Fix atomics of large data types and clean up lock arrays #3529
  • Pthreads fix exception resulting from 0 grain size #3510
  • Fixup do not require atomic operation to be default constructible #3503
  • Fix race condition in HIP backend #3467
  • Replace KOKKOS_DEBUG with KOKKOS_ENABLE_DEBUG #3458
  • Fix multi-stream team scratch space definition for HIP #3398
  • HIP fix template deduction #3393
  • Fix compiling with HIP and C++17 #3390
  • Fix sigFPE in HIP blocksize deduction #3378
  • Type alias change: replace CS with CTS to avoid conflicts with NVSHMEM #3348
  • Clang compilation of CUDA backend on Windows #3345
  • Fix HBW support #3343
  • Added missing fences to unique token #3260

Incompatibilities:

  • Remove unused utilities (forward, move, and expand_variadic) from Kokkos::Impl #3535
  • Remove unused traits #3534
  • HIP: Remove old HCC code #3301
  • Prepare for deprecation of ViewAllocateWithoutInitializing #3264
  • Remove ROCm backend #3148

3.2.01 (2020-11-17)

Full Changelog

Fixed bugs:

  • Disallow KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE in shared library builds #3332
  • Do not install libprinter-tool when testing is enabled #3313
  • Fix restrict/alignment following refactor #3373
    • Intel fix: workaround compiler issue with using statement #3383
  • Fix zero-length reductions #3364
    • Pthread zero-length reduction fix #3452
    • HPX zero-length reduction fix #3470
    • cuda/9.2 zero-length reduction fix #3580
  • Fix multi-stream scratch #3269
  • Guard KOKKOS_ALL_COMPILE_OPTIONS if Cuda is not enabled #3387
  • Do not include link flags for Fortran linkage #3384
  • Fix NVIDIA GPU arch macro with autodetection #3473
  • Fix libdl/test issues with Trilinos #3543
    • Register Pthread as Tribits option to be enabled with Trilinos #3558

Implemented enhancements:

  • Separate Cuda timing-based tests into their own executable #3407

3.2.00 (2020-08-19)

Full Changelog

Implemented enhancements:

  • HIP: Enable stream in HIP #3163
  • HIP: Add support for shuffle reduction for the HIP backend #3154
  • HIP: Add implementations of missing HIPHostPinnedSpace methods for LAMMPS #3137
  • HIP: Require HIP 3.5.0 or higher #3099
  • HIP: WorkGraphPolicy for HIP #3096
  • OpenMPTarget: Significant update to the new experimental backend. Requires C++17, works on Intel GPUs, reference counting fixes. #3169
  • Windows Cuda support #3018
  • Pass -Wext-lambda-captures-this to NVCC when support for __host__ __device__ lambda is enabled from CUDA 11 #3241
  • Use explicit staging buffer for constant memory kernel launches and cleanup host/device synchronization #3234
  • Various fixups to policies, including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable #3202, #3203, #3196
  • Annotations for DefaultExecutionSpace and DefaultHostExecutionSpace to use in static analysis #3189
  • Add documentation on using Spack to install Kokkos and developing packages that depend on Kokkos #3187
  • Add OpenMPTarget backend flags for NVC++ compiler #3185
  • Move deep_copy/create_mirror_view on Experimental::OffsetView into Kokkos:: namespace #3166
  • Allow for larger block size in HIP #3165
  • View: Added names of Views to the different View initialize/free kernels #3159
  • Cuda: Caching cudaFunctorAttributes and whether L1/Shmem prefer was set #3151
  • BuildSystem: Improved performance in default configuration by defaulting to Release build #3131
  • Cuda: Update CUDA occupancy calculation #3124
  • Vector: Adding data() to Vector #3123
  • BuildSystem: Add CUDA Ampere configuration support #3122
  • General: Apply [[noreturn]] to Kokkos::abort when applicable #3106
  • TeamPolicy: Validate storage level argument passed to TeamPolicy::set_scratch_size() #3098
  • BuildSystem: Make kokkos_has_string() function in Makefile.kokkos case insensitive #3091
  • Modify KOKKOS_FUNCTION macro for clang-tidy analysis #3087
  • Move allocation profiling to allocate/deallocate calls #3084
  • BuildSystem: FATAL_ERROR when attempting in-source build #3082
  • Change enums in ScatterView to types #3076
  • HIP: Changes for new compiler/runtime #3067
  • Extract and use get_gpu #3061, #3048
  • Add is_allocated to View-like containers #3059
  • Combined reducers for scalar references #3052
  • Add configurable capacity for UniqueToken #3051
  • Add installation testing #3034
  • HIP: Add UniqueToken #3020
  • Autodetect number of devices #3013

Fixed bugs:

  • Check error code from cudaStreamSynchronize in CUDA fences #3255
  • Fix issue with C++ standard flags when using nvcc_wrapper with PGI #3254
  • Add missing threadfence in lock-based atomics #3208
  • Fix dedup of linker flags for shared lib on CMake <=3.12 #3176
  • Fix memory leak with CUDA streams #3170
  • BuildSystem: Fix OpenMP Target flags for Cray #3161
  • ScatterView: Fix for OpenMPTarget: remove inheritance from reducers #3162
  • BuildSystem: Set OpenMP flags according to host compiler #3127
  • OpenMP: Fix logic for nested omp in partition_master bug #3101
  • nvcc_wrapper: send --cudart to nvcc instead of host compiler #3092
  • BuildSystem: Fixes for Cuda/11 and c++17 #3085
  • HIP: Fix print_configuration #3080
  • Conditionally define get_gpu #3072
  • Fix bounds for ranges in random number generator #3069
  • Fix Cuda minor arch check #3035
  • BuildSystem: Add -expt-relaxed-constexpr flag to nvcc_wrapper #3021

Incompatibilities:

  • Remove ETI support #3157
  • Remove KOKKOS_INTERNAL_ENABLE_NON_CUDA_BACKEND #3147
  • Remove core/unit_test/config #3146
  • Removed the preprocessor branch for KOKKOS_ENABLE_PROFILING #3115
  • Disable profiling with MSVC #3066

Closed issues:

  • Silent error (Validate storage level arg to set_scratch_size) #3097
  • Remove KOKKOS_ENABLE_PROFILING Option #3095
  • Cuda 11 -> allow C++17 #3083
  • In source build failure not explained #3081
  • Allow naming of Views for initialization kernel #3070
  • DefaultInit tests failing when using CTest resource allocation feature #3040
  • Add installation testing. #3037
  • nvcc_wrapper needs to handle -expt-relaxed-constexpr flag #3017
  • CPU core oversubscription warning on macOS with OpenMP backend #2996
  • Default behavior of KOKKOS_NUM_DEVICES to use all devices available #2975
  • Assert blocksize > 0 #2974
  • Add ability to assign kokkos profile function from executable #2973
  • ScatterView Support for the pre/post increment operator #2967
  • Compiler issue: Cuda build with clang 10 has errors with the atomic unit tests #3237
  • Incompatibility of flags for C++ standard with PGI v20.4 on Power9/NVIDIA V100 system #3252
  • Error configuring as subproject #3140
  • CMake fails with Nvidia compilers when the GPU architecture option is not supplied (Fix configure with OMPT and Cuda) #3207
  • PGI compiler being passed the gcc -fopenmp flag #3125
  • Cuda: Memory leak when using CUDA stream #3167
  • RangePolicy has an implicitly deleted assignment operator #3192
  • MemorySpace::allocate needs to have memory pool counting. #3064
  • Missing write fence for lock based atomics on CUDA #3038
  • CUDA compute capability version check problem #3026
  • Make DynRankView fencing consistent #3014
  • nvcc_wrapper can't handle -Xcompiler -o out.o #2993
  • Reductions of non-trivial types of size 4 fail in CUDA shfl operations #2990
  • complex_double misalignment in reduce, clang+CUDA #2989
  • Span of degenerated (zero-length) subviews is not zero in some special cases #2979
  • Rank 1 custom layouts don't work as expected. #2840

3.1.01 (2020-04-14)

Full Changelog

Fixed bugs:

  • Fix complex_double misalignment in reduce, clang+CUDA #2989
  • Fix compilation fails when profiling disabled and CUDA enabled #3001
  • Fix cuda reduction of non-trivial scalars of size 4 #2990
  • Configure and install version file when building in Trilinos #2957
  • Fix OpenMPTarget build missing include and namespace #3000
  • Fix typo in KOKKOS_SET_EXE_PROPERTY() #2959
  • Fix non-zero span subviews of zero sized subviews #2979

3.1.00 (2020-04-14)

Full Changelog

Features:

  • HIP Support for AMD
  • OpenMPTarget Support with clang
  • Windows VS19 (Serial) Support #1533

Implemented enhancements:

  • generate_makefile.bash should allow tests to be disabled #2886
  • clang/7+cuda/9 build -Werror-unused parameter error in nightly test #2884
  • ScatterView memory space is not user settable #2826
  • clang/8+cuda/10.0 build error with c++17 #2809
  • warnings.... #2805
  • Kokkos version in cpp define #2787
  • Remove Defunct QThreads Backend #2751
  • Improve Kokkos::fence behavior with multiple execution spaces #2659
  • polylithic(?) initialization of Kokkos #2658
  • Unnecessary(?) check for host execution space initialization from Cuda initialization #2652
  • Kokkos error reporting failures with CUDA GPUs in exclusive mode #2471
  • atomicMax equivalent (and other atomics) #2401
  • Fix alignment for Kokkos::complex #2255
  • Warnings with Cuda 10.1 #2206
  • dual view with Kokkos::ViewAllocateWithoutInitializing #2188
  • Check error code from cudaOccupancyMaxActiveBlocksPerMultiprocessor #2172
  • Add non-member Kokkos::resize/realloc for DualView #2170
  • Construct DualView without initialization #2046
  • Expose is_assignable to determine if one view can be assigned to another #1936
  • profiling label #1935
  • team_broadcast of bool failed on CUDA backend #1908
  • View static_extent #660
  • Misleading Kokkos::Cuda::initialize ERROR message when compiled for wrong GPU architecture #1944
  • Cryptic Error When Malloc Fails #2164
  • Drop support for intermediate standards in CMake #2336

Fixed bugs:

  • DualView sync_device with length zero creates cuda errors #2946
  • building with nvcc and clang (or clang based XL) as host compiler: "Kokkos::atomic_fetch_min(volatile int *, int)" has already been defined #2903
  • Cuda 9.1,10.1 debug builds failing due to -Werror=unused-parameter #2880
  • clang -Werror: Kokkos_FixedBufferMemoryPool.hpp:140:28: error: unused parameter 'alloc_size' #2869
  • intel/16.0.1, intel/17.0.1 nightly build failures with debugging enabled #2867
  • intel/16.0.1 debug build errors #2863
  • xl/16.1.1 with cpp14, openmp build, nightly test failures #2856
  • Intel nightly test failures: team_vector #2852
  • Kokkos Views with intmax/2<N<intmax can hang during construction #2850
  • workgraph_fib test seg-faults with threads backend and hwloc #2797
  • cuda.view_64bit test hangs on Power8+Kepler37 system - develop and 2.9.00 branches #2771
  • device_type for Kokkos_Random ? #2693
  • "More than one tag given" error in Experimental::require() #2608
  • Segfault on Marvell from our finalization stack #2542

3.0.00 (2020-01-27)

Full Changelog

Implemented enhancements:

  • BuildSystem: Standalone Modern CMake Support #2104
  • StyleFormat: ClangFormat Style #2157
  • Documentation: Document build system and CMake philosophy #2263
  • BuildSystem: Add Alias with Namespace Kokkos:: to Internal Libraries #2530
  • BuildSystem: Universal Kokkos find_package #2099
  • BuildSystem: Dropping support for Kokkos_{DEVICES,OPTIONS,ARCH} in CMake #2329
  • BuildSystem: Set Kokkos_DEVICES and Kokkos_ARCH variables in exported CMake configuration #2193
  • BuildSystem: Drop support for CUDA 7 and CUDA 8 #2489
  • BuildSystem: Drop CMake option SEPARATE_TESTS #2266
  • BuildSystem: Support expt-relaxed-constexpr same as expt-extended-lambda #2411
  • BuildSystem: Add Xnvlink to command line options allowed in nvcc_wrapper #2197
  • BuildSystem: Install Kokkos config files and target files to lib/cmake/Kokkos #2162
  • BuildSystem: nvcc_wrappers and c++ 14 #2035
  • BuildSystem: Kokkos version major/version minor (Feature request) #1930
  • BuildSystem: CMake namespaces (and other modern cmake cleanup) #1924
  • BuildSystem: Remove capability to install Kokkos via GNU Makefiles #2332
  • Documentation: Remove PDF ProgrammingGuide in Kokkos replace with link #2244
  • View: Add Method to Resize View without Initialization #2048
  • Vector: implement “insert” method for Kokkos_Vector (as a serial function on host) #2437

Fixed bugs:

  • ParallelScan: Kokkos::parallel_scan fix race condition seen in inter-block fence #2681
  • OffsetView: Kokkos::OffsetView missing constructor which takes pointer #2247
  • OffsetView: Kokkos::OffsetView: allow offset=0 #2246
  • DeepCopy: Missing DeepCopy instrumentation in Kokkos #2522
  • nvcc_wrapper: --host-only fails with multiple -W* flags #2484
  • nvcc_wrapper: taking first -std option is counterintuitive #2553
  • Subview: Error taking subviews of views with static_extents of min rank #2448
  • TeamPolicy: reducers with valuetypes without += broken on CUDA #2410
  • Libs: Fix inconsistency of Kokkos library names in Kokkos and Trilinos #1902
  • Complex: operator>> for complex<T> uses std::ostream, not std::istream #2313
  • Macros: Restrict not honored for non-intel compilers #1922

2.9.00 (2019-06-24)

Full Changelog

Implemented enhancements:

  • Capability: CUDA Streams #1723
  • Capability: CUDA Stream support for parallel_reduce #2061
  • Capability: Feature Request: TeamVectorRange #713
  • Capability: Adding HPX backend #2080
  • Capability: TaskScheduler to have multiple queues #565
  • Capability: Support for additional reductions in ScatterView #1674
  • Capability: Request: deep_copy within parallel regions #689
  • Capability: Feature Request: create_mirror_view_without_initializing #1765
  • View: Use SFINAE to restrict possible View type conversions #2127
  • Deprecation: Deprecate ExecutionSpace::fence() as static function and make it non-static #2140
  • Deprecation: Deprecate LayoutTileLeft #2122
  • Macros: KOKKOS_RESTRICT defined for non-Intel compilers #2038

Fixed bugs:

  • Cuda: TeamThreadRange loop count on device is passed by reference to host static constexpr #1733
  • Cuda: Build error with relocatable device code with CUDA 10.1 GCC 7.3 #2134
  • Cuda: cudaFuncSetCacheConfig is setting CachePreferShared too often #2066
  • Cuda: TeamPolicy doesn't throw when created with non-viable vector length and also doesn't backscale to viable one #2020
  • Cuda: cudaMemcpy error for large league sizes on V100 #1991
  • Cuda: illegal warp sync in parallel_reduce by functor on Turing 75 #1958
  • TeamThreadRange: Inconsistent results from TeamThreadRange reduction #1905
  • Atomics: atomic_fetch_oper & atomic_oper_fetch don't build for complex<float> #1964
  • Views: Kokkos randomread Views leak memory #2155
  • ScatterView: LayoutLeft overload currently non-functional #2165
  • KNL: With intel 17.2.174 illegal instruction in random number test #2078
  • Bitset: Enable copy constructor on device #2094
  • Examples: do not compile due to template deduction error (multi_fem) #1928

2.8.00 (2019-02-05)

Full Changelog

Implemented enhancements:

  • Capability, Tests: C++14 support and testing #1914
  • Capability: Add environment variables for all command line arguments #1798
  • Capability: --kokkos-ndevices not working for Slurm #1920
  • View: Undefined behavior when deep copying from and to an empty unmanaged view #1967
  • BuildSystem: nvcc_wrapper should stop immediately if nvcc is not in PATH #1861

Fixed bugs:

  • Cuda: Fix Volta Issues 1 Non-deterministic behavior on Volta, runs fine on Pascal #1949
  • Cuda: Fix Volta Issues 2 CUDA Team Scan gives wrong values on Volta with -G compile flag #1942
  • Cuda: illegal warp sync in parallel_reduce by functor on Turing 75 #1958
  • Threads: Pthreads backend does not handle RangePolicy with offset correctly #1976
  • Atomics: atomic_fetch_oper has no case for Kokkos::complex<double> or other 16-byte types #1951
  • MDRangePolicy: Fix zero-length range #1948
  • TeamThreadRange: TeamThreadRange MaxLoc reduce doesn't compile #1909

2.7.24 (2018-11-04)

Full Changelog

Implemented enhancements:

  • DualView: Add non-templated functions for sync, need_sync, view, modify #1858
  • DualView: Avoid needlessly allocating and initializing modify_host and modify_device flag views #1831
  • DualView: Incorrect deduction of "not device type" #1659
  • BuildSystem: Add KOKKOS_ENABLE_CXX14 and KOKKOS_ENABLE_CXX17 #1602
  • BuildSystem: Installed kokkos_generated_settings.cmake contains build directories instead of install directories #1838
  • BuildSystem: KOKKOS_ARCH: add ticks to printout of improper arch setting #1649
  • BuildSystem: Make core/src/Makefile for Cuda use needed nvcc_wrapper #1296
  • Build: Support PGI as host compiler for NVCC #1828
  • Build: Many warnings fixed, e.g. #1786
  • Capability: OffsetView with non-zero begin index #567
  • Capability: Reductions into device side view #1788
  • Capability: Add max_size to Kokkos::Array #1760
  • Capability: View Assignment: LayoutStride -> LayoutLeft and LayoutStride -> LayoutRight #1594
  • Capability: Atomic function allow implicit conversion of update argument #1571
  • Capability: Add team_size_max with tagged functors #663
  • Capability: Fix alignment of views from Kokkos_ScratchSpace (should use different alignment) #1700
  • Capability: create_mirror_view_and_copy for DynRankView #1651
  • Capability: DeepCopy HBWSpace / HostSpace #548
  • ROCm: support team vector scan #1645
  • ROCm: Merge from rocm-hackathon2 #1636
  • ROCm: Add ParallelScanWithTotal #1611
  • ROCm: Implement MDRange in ROCm #1314
  • ROCm: Implement Reducers for Nested Parallelism Levels #963
  • ROCm: Add asynchronous deep copy #959
  • Tests: Memory pool test seems to allocate 8GB #1830
  • Tests: Add unit_test for team_broadcast #734

Fixed bugs:

  • BuildSystem: Makefile.kokkos gets gcc-toolchain wrong if gcc is cached #1841
  • BuildSystem: kokkos_generated_settings.cmake placement is inconsistent #1771
  • BuildSystem: Invalid escape sequence \. in kokkos_functions.cmake #1661
  • BuildSystem: Problem in Kokkos generated cmake file #1770
  • BuildSystem: invalid file names on windows #1671
  • Tests: reducers min/max_loc test fails randomly due to multiple min values and thus multiple valid locations #1681
  • Tests: cuda.scatterview unit test causes "Bus error" when force_uvm and enable_lambda are enabled #1852
  • Tests: cuda.cxx11 unit test fails when force_uvm and enable_lambda are enabled #1850
  • Tests: threads.reduce_device_view_range_policy failing with Cuda/8.0.44 and RDC #1836
  • Build: compile error when compiling Kokkos with hwloc 2.0.1 (on OSX 10.12.6, with g++ 7.2.0) #1506
  • Build: dual_view.view broken with UVM #1834
  • Build: White cuda/9.2 + gcc/7.2 warnings triggering errors #1833
  • Build: warning: enum constant in boolean context #1813
  • Capability: Fix overly conservative max_team_size thingy #1808
  • DynRankView: Ctors taking ViewAllocateWithoutInitializing broken #1783
  • Cuda: Apollo cuda.team_broadcast test fail with clang-6.0 #1762
  • Cuda: Clang spurious test failure in impl_view_accessible #1753
  • Cuda: Kokkos::complex<double> atomic deadlocks with Clang 6 Cuda build with -O0 #1752
  • Cuda: LayoutStride Test fails for UVM as default memory space #1688
  • Cuda: Scan wrong values on Volta #1676
  • Cuda: Kokkos::deep_copy error with CudaUVM and Kokkos::Serial spaces #1652
  • Cuda: cudaErrorInvalidConfiguration with debug build #1647
  • Cuda: parallel_for with TeamPolicy::team_size_recommended with launch bounds not working -- reported by Daniel Holladay #1283
  • Cuda: Using KOKKOS_CLASS_LAMBDA in a class with Kokkos::Random_XorShift64_Pool member data #1696
  • Long Build Times on Darwin #1721
  • Capability: Typo in Kokkos_Sort.hpp - BinOp3D - wrong comparison #1720
  • Buffer overflow in SharedAllocationRecord in Kokkos_HostSpace.cpp #1673
  • Serial unit test failure #1632

2.7.00 (2018-05-24)

Full Changelog

Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.7

Implemented enhancements:

  • Deprecate team_size auto adjusting to maximal value possible #1618
  • DynamicView - remove restrictions to std::is_trivial types and value_type is power of two #1586
  • Kokkos::StaticCrsGraph does not propagate memory traits (e.g., Unmanaged) #1581
  • Adding ETI for DeepCopy / ViewFill etc. #1578
  • Deprecate all the left over KOKKOS_HAVE_ Macros and Kokkos_OldMacros.hpp #1572
  • Error if Kokkos_ARCH set in CMake #1555
  • Deprecate ExecSpace::initialize / ExecSpace::finalize #1532
  • New API for TeamPolicy property setting #1531
  • clang 6.0 + cuda debug out-of-memory test failure #1521
  • Cuda UniqueToken interface not consistent with other backends #1505
  • Move Reducers out of Experimental namespace #1494
  • Provide scope guard for initialize/finalize #1479
  • Check Kokkos::is_initialized in SharedAllocationRecord dtor #1465
  • Remove static list of allocations #1464
  • Makefiles: Support single compile/link line use case #1402
  • ThreadVectorRange with a range #1400
  • Exclusive scan + last value API #1358
  • Install kokkos_generated_settings.cmake #1348
  • Kokkos arrays (not views!) don't do bounds checking in debug mode #1342
  • Expose round-robin GPU assignment outside of initialize(int, char**) #1318
  • DynamicView misses use_count and label function #1298
  • View constructor should check arguments #1286
  • False Positive on Oversubscription Warning #1207
  • Allow (require) execution space for 1st arg of VerifyExecutionCanAccessMemorySpace #1192
  • ROCm: Add ROCmHostPinnedSpace #958
  • power of two functions #656
  • CUDA 8 has 64bit __shfl #361
  • Add TriBITS/CMake configure information about node types #243

Fixed bugs:

  • CUDA atomic_fetch_sub for doubles is hitting CAS instead of intrinsic #1624
  • Bug: use of ballot on Volta #1612
  • Kokkos::deep_copy memory access failures #1583
  • g++ -std option doubly set for cmake project #1548
  • ViewFill for 1D Views of larger 32bit entries fails #1541
  • CUDA Volta another warpsync bug #1520
  • triple_nested_parallelism fails with KOKKOS_DEBUG and CUDA #1513
  • Jenkins errors in Kokkos_SharedAlloc.cpp with debug build #1511
  • Kokkos::Sort out-of-bounds with empty bins #1504
  • Get rid of deprecated functions inside Kokkos #1484
  • get_work_partition casts int64_t to int, causing a seg fault #1481
  • NVCC bug with __device__ on defaulted function #1470
  • CMake example broken with CUDA backend #1468

2.6.00 (2018-03-07)

Full Changelog

Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.6

Implemented enhancements:

  • Support NVIDIA Volta microarchitecture #1466
  • Kokkos - Define empty functions when profiling disabled #1424
  • Don't use __constant__ cache for lock arrays, enable once per run update instead of once per call #1385
  • task dag enhancement. #1354
  • Cuda task team collectives and stack size #1353
  • Replace View operator acceptance of more than rank integers with 'access' function #1333
  • Interoperability: Do not shut down backend execution space runtimes upon calling finalize. #1305
  • shmem_size for LayoutStride #1291
  • Kokkos::resize performs poorly on 1D Views #1270
  • stride() is inconsistent with dimension(), extent(), etc. #1214
  • Kokkos::sort defaults to std::sort on host #1208
  • DynamicView with host size grow #1206
  • Unmanaged View with Anonymous Memory Space #1175
  • Sort subset of Kokkos::DynamicView #1160
  • MDRange policy doesn't support lambda reductions #1054
  • Add ability to set hook on Kokkos::finalize #714
  • Atomics with Serial Backend - Default should be Disable? #549
  • KOKKOS_ENABLE_DEPRECATED_CODE #1359

Fixed bugs:

  • cuda_internal_maximum_warp_count returns 8, but I believe it should return 16 for P100 #1269
  • Cuda: level 1 scratch memory bug (reported by Stan Moore) #1434
  • MDRangePolicy Reduction requires value_type typedef in Functor #1379
  • Kokkos DeepCopy between empty views fails #1369
  • Several issues with new CMake build infrastructure (reported by Eric Phipps) #1365
  • deep_copy between rank-1 host/device views of differing layouts without UVM no longer works (reported by Eric Phipps) #1363
  • Profiling can't be disabled in CMake, and a parallel_for is missing for tasks (reported by Kyungjoo Kim) #1349
  • get_work_partition int overflow (reported by berryj5) #1327
  • Kokkos::deep_copy must fence even if the two views are the same #1303
  • CudaUVMSpace::allocate/deallocate must fence #1302
  • ViewResize on CUDA fails in Debug because of too many resources requested #1299
  • Cuda 9 and intrepid2 calls from Panzer. #1183
  • Slowdown due to tracking_enabled() in 2.04.00 (found by Albany app) #1016
  • Bounds checking fails with zero-span Views (reported by Stan Moore) #1411

2.5.00 (2017-12-15)

Full Changelog

Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.5

Implemented enhancements:

  • Provide Makefile.kokkos logic for CMake and TriBITS #878
  • Add Scatter View #825
  • Drop gcc 4.7 and intel 14 from supported compiler list #603
  • Enable construction of unmanaged view using common_view_alloc_prop #1170
  • Unused Function Warning with XL #1267
  • Add memory pool parameter check #1218
  • CUDA9: Fix warning for unsupported long double #1189
  • CUDA9: fix warning on defaulted function marking #1188
  • CUDA9: fix warnings for deprecated warp level functions #1187
  • Add CUDA 9.0 nightly testing #1174
  • {OMPI,MPICH}_CXX hack breaks nvcc_wrapper use case #1166
  • KOKKOS_HAVE_CUDA_LAMBDA became KOKKOS_CUDA_USE_LAMBDA #1274

Fixed bugs:

  • MinMax Reducer with tagged operator doesn't compile #1251
  • Reducers for Tagged operators give wrong answer #1250
  • Kokkos not Compatible with Big Endian Machines? #1235
  • Parallel Scan hangs forever on BG/Q #1234
  • Threads backend doesn't compile with Clang on OS X #1232
  • $(shell date) needs quote #1264
  • Unqualified parallel_for call conflicts with user-defined parallel_for #1219
  • KokkosAlgorithms: CMake issue in unit tests #1212
  • Intel 18 Error: "simd pragma has been deprecated" #1210
  • Memory leak in Kokkos::initialize #1194
  • CUDA9: compiler error with static assert template arguments #1190
  • Kokkos::Serial::is_initialized returns always true #1184
  • Triple nested parallelism still fails on bowman #1093
  • OpenMP openmp.range on Develop Runs Forever on POWER7+ with RHEL7 and GCC4.8.5 #995
  • Rendezvous performance at global scope #985

2.04.11 (2017-10-28)

Full Changelog

Implemented enhancements:

  • Add Subview pattern. #648
  • Add Kokkos "global" is_initialized #1060
  • Add create_mirror_view_and_copy #1161
  • Add KokkosConcepts SpaceAccessibility function #1092
  • Option to Disable Initialize Warnings #1142
  • Mature task-DAG capability #320
  • Promote Work DAG from experimental #1126
  • Implement new WorkGraph push/pop #1108
  • Kokkos_ENABLE_Cuda_Lambda should default ON #1101
  • Add multidimensional parallel for example and improve unit test #1064
  • Fix ROCm: Performance tests not building #1038
  • Make KOKKOS_ALIGN_SIZE a configure-time option #1004
  • Make alignment consistent #809
  • Improve subview construction on Cuda backend #615

Fixed bugs:

  • Kokkos::vector fixes for application #1134
  • DynamicView non-power of two value_type #1177
  • Memory pool bug #1154
  • Cuda launch bounds performance regression bug #1140
  • Significant performance regression in LAMMPS after updating Kokkos #1139
  • CUDA compile error #1128
  • MDRangePolicy neg idx test failure in debug mode #1113
  • subview construction on Cuda backend #615

2.04.04 (2017-09-11)

Full Changelog

Implemented enhancements:

  • OpenMP partition: set number of threads on nested level #1082
  • Add StaticCrsGraph row() method #1071
  • Enhance Kokkos complex operator overloading #1052
  • Tell Trilinos packages about host+device lambda #1019
  • Function markup for defaulted class members #952
  • Add deterministic random number generator #857

Fixed bugs:

  • Fix reduction_identity<T>::max for floating point numbers #1048
  • Fix MD iteration policy ignores lower bound on GPUs #1041
  • (Experimental) HBWSpace Linking issues in KokkosKernels #1094
  • (Experimental) ROCm: algorithms/unit_tests test_sort failing with segfault #1070

2.04.00 (2017-08-16)

Full Changelog

Implemented enhancements:

  • Added ROCm backend to support AMD GPUs
  • Kokkos::complex<T> behaves slightly differently from std::complex<T> #1011
  • Kokkos::Experimental::Crs constructor arguments were in the wrong order #992
  • Work graph construction ease-of-use (one lambda for count and fill) #991
  • when_all returns pointer of futures (improved interface) #990
  • Allow assignment of LayoutLeft to LayoutRight or vice versa for rank-0 Views #594
  • Changed the meaning of Kokkos_ENABLE_CXX11_DISPATCH_LAMBDA #1035

Fixed bugs:

  • memory pool default constructor does not properly set member variables. #1007

2.03.13 (2017-07-27)

Full Changelog

Implemented enhancements:

  • Disallow enabling both OpenMP and Threads in the same executable #406
  • Make Kokkos::OpenMP respect OMP environment even if hwloc is available #630
  • Improve Atomics Performance on KNL/Broadwell where PREFETCHW/RFO is Available #898
  • Kokkos::resize should test whether dimensions have changed before resizing #904
  • Develop performance-regression/acceptance tests #737
  • Make the deep_copy Profiling hook a start/end system #890
  • Add deep_copy Profiling hook #843
  • Append tag name to parallel construct name for Profiling #842
  • Add view label to View bounds error message for CUDA backend #870
  • Disable printing the loaded profiling library #824
  • "Declared but never referenced" warnings #853
  • Warnings about lock_address_cuda_space #852
  • WorkGraph execution policy #771
  • Simplify makefiles by guarding compilation with appropriate KOKKOS_ENABLE_### macros #716
  • Cmake build: wrong include install directory #668
  • Derived View type and allocation #566
  • Fix Compiler warnings when compiling core unit tests for Cuda #214

Fixed bugs:

  • Out-of-bounds read in Kokkos_Layout.hpp #975
  • CudaClang: Fix failing test with Clang 4.0 #941
  • Respawn when memory pool allocation fails (not available memory) #940
  • Memory pool aborts on zero allocation request, returns NULL for < minimum #939
  • Error with TaskScheduler query of underlying memory pool #917
  • Profiling::*Callee static variables declared in header #863
  • calling *Space::name() causes compile error #862
  • bug in Profiling::deallocateData #860
  • task_depend test failing, CUDA 8.0 + Pascal + RDC #829
  • [develop branch] Standalone cmake issues #826
  • Kokkos CUDA fails to compile with OMPI_CXX and MPICH_CXX wrappers #776
  • Task Team reduction on Pascal #767
  • CUDA stack overflow with TaskDAG test #758
  • TeamVector test on Cuda #670
  • Clang 4.0 Cuda Build broken again #560

2.03.05 (2017-05-27)

Full Changelog

Implemented enhancements:

  • Harmonize Custom Reductions over nesting levels #802
  • Prevent users directly including KokkosCore_config.h #815
  • DualView aborts on concurrent host/device modify (in debug mode) #814
  • Abort when running on an NVIDIA CC5.0 or higher architecture with code compiled for CC < 5.0 #813
  • Add "name" function to ExecSpaces #806
  • Allow null Future in task spawn dependences #795
  • Add Unit Tests for Kokkos::complex #785
  • Add pow function for Kokkos::complex #784
  • Square root of a complex #729
  • Command line processing of --threads argument prevents users from having any commandline arguments starting with --threads #760
  • Protected deprecated API with appropriate macro #756
  • Allow task scheduler memory pool to be used by tasks #747
  • View bounds checking on host-side performance: constructing a std::string #723
  • Add check for AppleClang as compiler distinct from check for Clang. #705
  • Uninclude source files for specific configurations to prevent link warning. #701
  • Add --small option to snapshot script #697
  • CMake Standalone Support #674
  • CMake build unit test and install #808
  • CMake: Fix having kokkos as a subdirectory in a pure cmake project #629
  • Tribits macro assumes build directory is in top level source directory #654
  • Use bin/nvcc_wrapper, not config/nvcc_wrapper #562
  • Allow MemoryPool::allocate() to be called from multiple threads per warp. #487
  • Move OpenMP 4.5 OpenMPTarget backend into Develop #456
  • Testing on ARM testbed #288

Fixed bugs:

  • Fix label in OpenMP parallel_reduce verify_initialized #834
  • TeamScratch Level 1 on Cuda hangs #820
  • [bug] memory pool. #786
  • Some Reduction Tests fail on Intel 18 with aggressive vectorization on #774
  • Error copying dynamic view on copy of memory pool #773
  • CUDA stack overflow with TaskDAG test #758
  • ThreadVectorRange Customized Reduction Bug #739
  • set_scratch_size overflows #726
  • Get wrong results for compiler checks in Makefile on OS X. #706
  • Fix check if multiple host architectures enabled. #702
  • Threads Backend Does not Pass on Cray Compilers #609
  • Rare bug in memory pool where allocation can finish on superblock in empty state #452
  • LDFLAGS in core/unit_test/Makefile: potential "undefined reference" to pthread lib #148

2.03.00 (2017-04-25)

Full Changelog

Implemented enhancements:

  • UnorderedMap: make it accept Devices or MemorySpaces #711
  • sort to accept DynamicView and [begin,end) indices #691
  • ENABLE Macros should only be used via #ifdef or #if defined #675
  • Remove impl/Kokkos_Synchronic_* #666
  • Turning off IVDEP for Intel 14. #638
  • Using an installed Kokkos in a target application using CMake #633
  • Create Kokkos Bill of Materials #632
  • MDRangePolicy and tagged evaluators #547
  • Add PGI support #289

Fixed bugs:

  • Output from PerTeam fails #733
  • Cuda: architecture flag not added to link line #688
  • Getting large chunks of memory for a thread team in a universal way #664
  • Kokkos RNG normal() function hangs for small seed value #655
  • Kokkos Tests Errors on Shepard/HSW Builds #644

2.02.15 (2017-02-10)

Full Changelog

Implemented enhancements:

  • Containers: Adding block partitioning to StaticCrsGraph #625
  • Kokkos Make System can induce Errors on Cray Volta System #610
  • OpenMP: error out if KOKKOS_HAVE_OPENMP is defined but not _OPENMP #605
  • CMake: fix standalone build with tests #604
  • Change README (that GitHub shows when opening Kokkos project page) to tell users how to submit PRs #597
  • Add correctness testing for all operators of Atomic View #420
  • Allow assignment of Views with compatible memory spaces #290
  • Build only one version of Kokkos library for tests #213
  • Clean out old KOKKOS_HAVE_CXX11 macros clauses #156
  • Harmonize Macro names #150

Fixed bugs:

  • Cray and PGI: Kokkos_Parallel_Reduce #634
  • Kokkos Make System can induce Errors on Cray Volta System #610
  • Normal() function random number generator doesn't give the expected distribution #592

2.02.07 (2016-12-16)

Full Changelog

Implemented enhancements:

  • Add CMake option to enable Cuda Lambda support #589
  • Add CMake option to enable Cuda RDC support #588
  • Add Initial Intel Sky Lake Xeon-HPC Compiler Support to Kokkos Make System #584
  • Building Tutorial Examples #582
  • Internal way for using ThreadVectorRange without TeamHandle #574
  • Testing: Add testing for uvm and rdc #571
  • Profiling: Add Memory Tracing and Region Markers #557
  • nvcc_wrapper not installed with Kokkos built with CUDA through CMake #543
  • Improve DynRankView debug check #541
  • Benchmarks: Add Gather benchmark #536
  • Testing: add spot_check option to test_all_sandia #535
  • Deprecate Kokkos::Impl::VerifyExecutionCanAccessMemorySpace #527
  • Add AtomicAdd support for 64bit float for Pascal #522
  • Add Restrict and Aligned memory trait #517
  • Kokkos Tests are Not Run using Compiler Optimization #501
  • Add support for clang 3.7 w/ openmp backend #393
  • Provide an error throw class #79

Fixed bugs:

  • Cuda UVM Allocation test broken with UVM as default space #586
  • Bug (develop branch only): multiple tests are now failing when forcing uvm usage. #570
  • Error in generate_makefile.sh for Kokkos when Compiler is Empty String/Fails #568
  • XL 13.1.4 incorrect C++11 flag #553
  • Improve DynRankView debug check #541
  • Installing Library on MAC broken due to cp -u #539
  • Intel Nightly Testing with Debug enabled fails #534

2.02.01 (2016-11-01)

Full Changelog

Implemented enhancements:

  • Add Changelog generation to our process. #506

Fixed bugs:

  • Test scratch_request fails in Serial with Debug enabled #520
  • Bug In BoundsCheck for DynRankView #516

2.02.00 (2016-10-30)

Full Changelog

Implemented enhancements:

  • Add PowerPC assembly for grabbing clock register in memory pool #511
  • Add GCC 6.x support #508
  • Test install and build against installed library #498
  • Makefile.kokkos adds expt-extended-lambda to cuda build with clang #490
  • Add top-level makefile option to just test kokkos-core unit-test #485
  • Split and harmonize Object Files of Core UnitTests to increase build parallelism #484
  • LayoutLeft to LayoutLeft subview for 3D and 4D views #473
  • Add official Cuda 8.0 support #468
  • Allow C++1Z Flag for Class Lambda capture #465
  • Add Clang 4.0+ compilation of Cuda code #455
  • Possible Issue with Intel 17.0.098 and GCC 6.1.0 in Develop Branch #445
  • Add name of view to "View bounds error" #432
  • Move Sort Binning Operators into Kokkos namespace #421
  • TaskPolicy - generate error when attempt to use uninitialized #396
  • Import WithoutInitializing and AllowPadding into Kokkos namespace #325
  • TeamThreadRange requires begin, end to be the same type #305
  • CudaUVMSpace should track # allocations, due to CUDA limit on # UVM allocations #300
  • Remove old View and its infrastructure #259

Fixed bugs:

  • Bug in TestCuda_Other.cpp: most likely assembly inserted into Device code #515
  • Cuda Compute Capability check of GPU is outdated #509
  • multi_scratch test with hwloc and pthreads seg-faults. #504
  • generate_makefile.bash: "make install" is broken #503
  • make clean in Out of Source Build/Tests Does Not Work Correctly #502
  • Makefiles for test and examples have issues in Cuda when CXX is not explicitly specified #497
  • Dispatch lambda test directly inside GTEST macro doesn't work with nvcc #491
  • UnitTests with HWLOC enabled fail if run with mpirun bound to a single core #489
  • Failing Reducer Test on Mac with Pthreads #479
  • make test Dumps Error with Clang Not Found #471
  • OpenMP TeamPolicy member broadcast not using correct volatile shared variable #424
  • TaskPolicy - generate error when attempt to use uninitialized #396
  • New task policy implementation is pulling in old experimental code. #372
  • MemoryPool unit test hangs on Power8 with GCC 6.1.0 #298

2.01.10 (2016-09-27)

Full Changelog

Implemented enhancements:

  • Enable Profiling by default in Tribits build #438
  • parallel_reduce(0), parallel_scan(0) unit tests #436
  • data()==NULL after realloc with LayoutStride #351
  • Fix tutorials to track new Kokkos::View #323
  • Rename team policy set_scratch_size. #195

Fixed bugs:

  • Possible Issue with Intel 17.0.098 and GCC 6.1.0 in Develop Branch #445
  • Makefile spits syntax error #435
  • Kokkos::sort fails for view with all the same values #422
  • Generic Reducers: can't accept inline constructed reducer #404
  • data()==NULL after realloc with LayoutStride #351
  • const subview of const view with compile time dimensions on Cuda backend #310
  • Kokkos (in Trilinos) Causes Internal Compiler Error on CUDA 8.0.21-EA on POWER8 #307
  • Core Oversubscription Detection Broken? #159

2.01.06 (2016-09-02)

Full Changelog

Implemented enhancements:

  • Add "standard" reducers for lambda-supportable customized reduce #411
  • TaskPolicy - single thread back-end execution #390
  • Kokkos master clone tag #387
  • Query memory requirements from task policy #378
  • Output order of test_atomic.cpp is confusing #373
  • Missing testing for atomics #341
  • Feature request for Kokkos to provide Kokkos::atomic_fetch_max and atomic_fetch_min #336
  • TaskPolicy<Cuda> performance requires teams mapped to warps #218

Fixed bugs:

  • Reduce with Teams broken for custom initialize #407
  • Failing Kokkos build on Debian #402
  • Failing Tests on NVIDIA Pascal GPUs #398
  • Algorithms: fill_random assumes dimensions fit in unsigned int #389
  • Kokkos::subview with RandomAccess Memory Trait #385
  • Build warning (signed / unsigned comparison) in Cuda implementation #365
  • wrong results for a parallel_reduce with CUDA8 / Maxwell50 #352
  • Hierarchical parallelism - 3 level unit test #344
  • Can I allocate a View w/ both WithoutInitializing & AllowPadding? #324
  • subview View layout determination #309
  • Unit tests with Cuda - Maxwell #196

2.01.00 (2016-07-21)

Full Changelog

Implemented enhancements:

  • Edit ViewMapping so assigning Views with the same custom layout compiles when const casting #327
  • DynRankView: Performance improvement for operator() #321
  • Interoperability between static and dynamic rank views #295
  • subview member function? #280
  • Interoperability between View and DynRankView. #245
  • (Trilinos) build warning in atomic_assign, with Kokkos::complex #177
  • View<>::shmem_size should runtime check for number of arguments equal to rank #176
  • Custom reduction join via lambda argument #99
  • DynRankView with 0 dimensions passed in at construction #293
  • Inject view_alloc and friends into Kokkos namespace #292
  • Less restrictive TeamPolicy reduction on Cuda #286
  • deep_copy using remap with source execution space #267
  • Suggestion: Enable opt-in L1 caching via nvcc-wrapper #261
  • More flexible create_mirror functions #260
  • Rename View::memory_span to View::required_allocation_size #256
  • Use of subviews and views with compile-time dimensions #237
  • Kokkos::Timer #234
  • Fence CudaUVMSpace allocations #230
  • View::operator() accept std::is_integral and std::is_enum #227
  • Allocating zero size View #216
  • Thread scalable memory pool #212
  • Add a way to disable memory leak output #194
  • Kokkos exec space init should init Kokkos profiling #192
  • Runtime rank wrapper for View #189
  • Profiling Interface #158
  • Fix View assignment (of managed to unmanaged) #153
  • Add unit test for assignment of managed View to unmanaged View #152
  • Check for oversubscription of threads with MPI in Kokkos::initialize #149
  • Dynamic resizeable 1-dimensional view #143
  • Develop TaskPolicy for CUDA #142
  • New View : Test Compilation Downstream #138
  • New View Implementation #135
  • Add variant of subview that lets users add traits #134
  • NVCC-WRAPPER: Add --host-only flag #121
  • Address gtest issue with TriBITS Kokkos build outside of Trilinos #117
  • Make tests pass with -expt-extended-lambda on CUDA #108
  • Dynamic scheduling for parallel_for and parallel_reduce #106
  • Runtime or compile time error when reduce functor's join is not properly specified as const member function or with volatile arguments #105
  • Error out when the number of threads is modified after kokkos is initialized #104
  • Porting to POWER and remove assumption of X86 default #103
  • Dynamic scheduling option for RangePolicy #100
  • SharedMemory Support for Lambdas #81
  • Recommended TeamSize for Lambdas #80
  • Add Aggressive Vectorization Compilation mode #72
  • Dynamic scheduling team execution policy #53
  • UVM allocations in multi-GPU systems #50
  • Synchronic in Kokkos::Impl #44
  • index and dimension types in for loops #28
  • Subview assign of 1D Strided with stride 1 to LayoutLeft/Right #1

Fixed bugs:

  • misspelled variable name in Kokkos_Atomic_Fetch + missing unit tests #340
  • seg fault Kokkos::Impl::CudaInternal::print_configuration #338
  • Clang compiler error with named parallel_reduce, tags, and TeamPolicy. #335
  • Shared Memory Allocation Error at parallel_reduce #311
  • DynRankView: Fix resize and realloc #303
  • Scratch memory and dynamic scheduling #279
  • MemoryPool infinite loop when out of memory #312
  • Kokkos DynRankView changes break Sacado and Panzer #299
  • MemoryPool fails to compile on non-cuda non-x86 #297
  • Random Number Generator Fix #296
  • View template parameter ordering Bug #282
  • Serial task policy broken. #281
  • deep_copy with LayoutStride should not memcpy #262
  • DualView::need_sync should be a const method #248
  • Arbitrary-sized atomics on GPUs broken; loop forever #238
  • boolean reduction value_type changes answer #225
  • Custom init() function for parallel_reduce with array value_type #210
  • unit_test Makefile is Broken - Recursively Calls itself until Machine Apocalypse. #202
  • nvcc_wrapper Does Not Support -Xcompiler <compiler option> #198
  • Kokkos exec space init should init Kokkos profiling #192
  • Kokkos Threads Backend impl_shared_alloc Broken on Intel 16.1 (Shepard Haswell) #186
  • pthread back end hangs if used uninitialized #182
  • parallel_reduce of size 0, not calling init/join #175
  • Bug in Threads with OpenMP enabled #173
  • KokkosExp_SharedAlloc, m_team_work_index inaccessible #166
  • 128-bit CAS without Assembly Broken? #161
  • fatal error: Cuda/Kokkos_Cuda_abort.hpp: No such file or directory #157
  • Power8: Fix OpenMP backend #139
  • Data race in Kokkos OpenMP initialization #131
  • parallel_launch_local_memory and cuda 7.5 #125
  • Resize can fail with Cuda due to asynchronous dispatch #119
  • Qthread taskpolicy initialization bug. #92
  • Windows: sys/mman.h #89
  • Windows: atomic_fetch_sub() #88
  • Windows: snprintf #87
  • Parallel_Reduce with TeamPolicy and league size of 0 returns garbage #85
  • Throw with Cuda when using (2D) team_policy parallel_reduce with less than a warp size #76
  • Scalar views don't work with Kokkos::Atomic memory trait #69
  • Reduce the number of threads per team for Cuda #63
  • Named Kernels fail for reductions with CUDA #60
  • Kokkos View dimension_() for long returning unsigned int #20
  • atomic test hangs with LLVM #6
  • OpenMP Test should set omp_set_num_threads to 1 #4

Closed issues:

  • develop branch broken with CUDA 8 and --expt-extended-lambda #354
  • --arch=KNL with Intel 2016 build failure #349
  • Error building with Cuda when passing -DKOKKOS_CUDA_USE_LAMBDA to generate_makefile.bash #343
  • Can I safely use int indices in a 2-D View with capacity > 2B? #318
  • Kokkos::ViewAllocateWithoutInitializing is not working #317
  • Intel build on Mac OS X #277
  • deleted #271
  • Broken Mira build #268
  • 32-bit build #246
  • parallel_reduce with RDC crashes linker #232
  • build of Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o fails if you use nvcc and have cuda disabled #209
  • Kokkos Serial execution space is not tested with TeamPolicy. #207
  • Unit test failure on Hansen KokkosCore_UnitTest_Cuda_MPI_1 #200
  • nvcc compiler warning: calling a __host__ function from a __host__ __device__ function is not allowed #180
  • Intel 15 build error with defaulted "move" operators #171
  • missing libkokkos.a during Trilinos 12.4.2 build, yet other libkokkos*.a libs are there #165
  • Tie atomic updates to execution space or even to thread team? (speculation) #144
  • New View: Compiletime/size Test #137
  • New View : Performance Test #136
  • Signed/unsigned comparison warning in CUDA parallel #130
  • Kokkos::complex: Need op* w/ std::complex & real #126
  • Use uintptr_t for casting pointers #110
  • Default thread mapping behavior between P and Q threads. #91
  • Windows: Atomic_Fetch_Exchange() return type #90
  • Synchronic unit test is way too long #84
  • nvcc_wrapper -> $(NVCC_WRAPPER) #42
  • Check compiler version and print helpful message #39
  • Kokkos shared memory on Cuda uses a lot of registers #31
  • Can not pass unit test cuda.space without a GT 720 #25
  • Makefile.kokkos lacks bounds checking option that CMake has #24
  • Kokkos can not complete unit tests with CUDA UVM enabled #23
  • Simplify teams + shared memory histogram example to remove vectorization #21
  • Kokkos needs to refer to ${PROJECT_NAME}_ENABLE_CXX11 not Trilinos_ENABLE_CXX11 #17
  • Kokkos Base Makefile adds AVX to KNC Build #16
  • MS Visual Studio 2013 Build Errors #9
  • subview(X, ALL(), j) for 2-D LayoutRight View X: should it view a column? #5

End_C++98 (2015-04-15)

* This Change Log was automatically generated by github_changelog_generator