ECP Milestone 1.7: Kokkos contribution

Introduction

In this report we develop and port two QMCPACK kernels of the miniqmc miniapp (spline and Jastrow evaluations) to the Kokkos C++ programming model, and at least one CPU and one GPU based architecture. We make an initial assessment of:

Capability (expressiveness, ability to achieve required levels of parallelization)
Performance (achieved time to solution on multiple architectures)
Portability (extent of modification required for performance on multiple architectures)
Suitability (deviation from desired application abstractions required to port or to obtain performance, complexity, difficulty of learning curve)
Support (compiler, tools, documentation)

Desired improvements and extensions to improve the objective (1-3) and subjective (4,5) measures will be identified and communicated with the Kokkos team.

In this initial assessment we concentrate on objective (1,3,4) with only a partial performance evaluation. Objective 5 is at this stage not a specific issue, since only production level compilers and tools are required for using Kokkos. A particular goal of this first porting approach to Kokkos is to leave the code structure of QMCPACK intact, and provide a minimally invasive implementation. As a consequence some possible optimization strategies, such as kernels which work on multiple walkers at the same time were not attempted. If this allows ultimately for satisfying performance, the approach would minimize the effort necessary to adopt Kokkos in QMCPACK. The work corresponds to the second stage porting effort outlined in the Sandia Technical Report INSERT REPORT LINK HERE for an initial portable version.

Implementation

Build System

Integrating Kokkos into the existing CMake build system is straight forward. Kokkos supports inclusion of its own CMake files into existing projects.

In the top level CMakeLists.txt we add:

######################################################
#KOKKOS Interaction
######################################################
set(CMAKE_CXX_EXTENSIONS OFF)

SET( KOKKOS_PATH "/home/crtrott/Kokkos/kokkos" )
add_subdirectory( ${KOKKOS_PATH} ${qmcpack_BINARY_DIR}/kokkos )
include_directories(${Kokkos_INCLUDE_DIRS_RET})
LINK_LIBRARIES(kokkos)

We also need to remove QMCPACKs internal settings for architecture flags, which

didn't work on all platforms in the first place
are now provided through Kokkos' CMake logic.

The net effect is a reduction of complexity of the build system. To provide architecture setting simply use the Kokkos options such as:

-DKOKKOS_HOST_ARCH=Power8
-DKOKKOS_GPU_ARCH=Pascal60

Currently for CUDA we require UVM as the memory space, which allows us to leave memory staging to the GPU memory to the driver runtime system. To configure on the Summit like testbeds use:

cmake -DKOKKOS_ENABLE_CUDA=true -DKOKKOS_ENABLE_OPENMP=false -DKOKKOS_ENABLE_CUDA_UVM=true -DCMAKE_CXX_COMPILER=${KOKKOS_PATH}/bin/nvcc_wrapper -DKOKKOS_GPU_ARCH=Pascal60 -DKOKKOS_HOST_ARCH=Power8 ../

On KNL use:

cmake -DKOKKOS_ENABLE_OPENMP=true -DCMAKE_CXX_COMPILER=icpc -DKOKKOS_HOST_ARCH=KNL ../

Walker Parallelization Strategy

For the walker parallelization strategy a new capability in Kokkos called partitioning was utilized. It allows splitting of resources owned by the process over multiple master threads. For the Kokkos OpenMP backend this capability is implemented using OpenMP nested parallel execution. The outer level is utilized for walkers, while the inner levels are doing the parallelization for kernels acting on the data of each individual walker. Currently this approach is only available for the OpenMP backend, we expect it to be ready for CUDA in mid 2018. The fallback for CUDA compilation is to use a single walker per process, but launching multiple processes per GPU using the NVIDIA MPS server. The drawback of this approach is that otherwise shared data such as the spline arrays are duplicated in each process, similar to MPI only parallelization in the current QMCPACK production application.

The code change to achieve the partitioning is to take the openMP parallel section in main() and change it into a C++11 lambda which is provided to the partition_master function of Kokkos:

// Old code
int main () {
  ...
  #pragma omp parallel 
  {
     // Walker Code Body 
  }
  ...
}

// New code
int main () {
  ...
  auto main_function = [&] (int partition_id, int num_partitions) {
     // Walker Code Body 
  }

  Kokkos::partition_master(main_function,num_threads/ncrews,ncrews);
}

Due to the above mentioned limitations in Kokkos the current partition_master call is implemented as:

#if defined(KOKKOS_ENABLE_OPENMP) && !defined(KOKKOS_ENABLE_CUDA)
  int num_threads = Kokkos::OpenMP::thread_pool_size();
  printf("Partitioning\n");
  Kokkos::OpenMP::partition_master(main_function,num_threads/ncrews,ncrews);
#else
  main_function(0,1);
#endif

Spline kernel

Each Monte Carlo step requires the evaluation of single particle orbitals (SPOs). The miniqmc miniapp uses a 3D tricubic B-spline based orbital representation for the evaluation of the SPOs. The 3D tricubic B-spline basis is a highly efficient representation for SPOs requiring only 64 elements at any given point in space. The Kokkos implementation of the spline kernel requires modifications to the data structures and expression of parallelism in the mini-app.

Data structures

The spline kernel evaluates the wavefunction, its gradient (a 3D vector containing the first derivative with respect to particle positions) and its hessian (a 3D tensor containing the second derivative with respect to particle positions). We convert several data structures related to these into Views which is the Kokkos abstraction for multidimensional arrays with template parameters controlling in which memory space (host or device) the data resides and which data layout is imposed. Generally all aligned_vector objects which are accesses inside of parallel_for regions need to be replaced. The same is true for the VectorSoaContainer which is replaced with a 2D View. Note that the Layout of the 2D View is fixed to correspond with the legacy data layout for interoperability purposes during incremental Kokkos adoption.

For example in einspline_spo.hpp we replace:

  using vContainer_type = aligned_vector<T>;
  using gContainer_type = VectorSoaContainer<T, 3>;
  using hContainer_type = VectorSoaContainer<T, 6>;

with

  using vContainer_type = Kokkos::View<T*>;
  using gContainer_type = Kokkos::View<T*[3],Kokkos::LayoutLeft>;
  using hContainer_type = Kokkos::View<T*[6],Kokkos::LayoutLeft>;

Furthermore the spline class utilizes vectors of vectors for the psi grad and hess members, which are now replaced with Views of Views.

// Old
  aligned_vector<vContainer_type *> psi;
  aligned_vector<gContainer_type *> grad;
  aligned_vector<hContainer_type *> hess;

// New
  Kokkos::View<vContainer_type*> psi;
  Kokkos::View<gContainer_type*> grad;
  Kokkos::View<hContainer_type*> hess;

This necessitates a particular way of allocation though using a placement new strategy:

// Old
 psi.resize(nBlocks);
 grad.resize(nBlocks);
 hess.resize(nBlocks);
 #pragma omp parallel for
 for (int i = 0; i < nBlocks; ++i)
 {
   psi[i]  = new vContainer_type(nSplinesPerBlock);
   grad[i] = new gContainer_type(nSplinesPerBlock);
   hess[i] = new hContainer_type(nSplinesPerBlock);
 }
// New
 psi = Kokkos::View<vContainer_type*>("Psi",nBlocks) ;
 grad = Kokkos::View<gContainer_type*>("Grad",nBlocks) ;
 hess = Kokkos::View<hContainer_type*>("Hess",nBlocks);
 for(int i=0; i<psi.extent(0); i++) {
   new (&psi(i)) vContainer_type("Psi_i",nSplinesPerBlock);
   new (&grad(i)) gContainer_type("Grad_i",nSplinesPerBlock);
   new (&hess(i)) hContainer_type("Hess_i",nSplinesPerBlock);
 }

For the actual bspline struct we add an additional View member, but keep the old data handles available for interoperability purposes during an incremental Kokkos adoption. This again necessitates fixing the layout to the legacy layout, which in this case is not an issue since it is already optimal for the chosen parallelization strategy, which explicitly exploits vectorization on all architectures.

// Old
typedef struct
{
  ...
  double *restrict coefs;
  ...
} multi_UBspline_3d_d;


// New
typedef struct
{
  ...
  typedef Kokkos::View<double****, Kokkos::LayoutRight> coefs_view_t;
  coefs_view_t coefs_view;
  double *restrict coefs;
  ...
} multi_UBspline_3d_d;

To allocate the splines we allocate the View first and then extract the raw data pointer:

  spline->coefs_view = multi_UBspline_3d_d::coefs_view_t("Multi_UBspline_3d_d",Nx,Ny,Nz,N);
  spline->coefs = spline->coefs_view.data();

Deallocation generally relies on Kokkos' reference counting, and thus simply involves assigning a default constructure View:

// Old
    class Allocator
    {
      template <typename SplineType> void destroy(SplineType *spline)
      {
        einspline_free(spline->coefs);
        free(spline);
      }
    }
// New
    class Allocator
    {
      template <typename SplineType> void destroy(SplineType *spline)
      {
        // Delete View by setting it to an empty view and rely on reference counting of views.
        spline->coefs_view = typename SplineType::coefs_view_t();
        free(spline);
      }
    }

Replace inline with KOKKOS_INLINE_FUNCTION and add KOKKOS_INLINE_FUNCTION to constructors and destructors in src/Lattice/CrystalLattice.h, src/OhmmsPETE/Tensor.h, src/OhmmsPETE/TensorOps.h, src/OhmmsPETE/TinyVector.h, src/OhmmsPETE/TinyVectorOps.h.

In src/spline2/MultiBspline.hpp:

  //Master header file to define MultiBspline
  //
  #ifndef QMCPLUSPLUS_MULTIEINSPLINE_COMMON_HPP
  #define QMCPLUSPLUS_MULTIEINSPLINE_COMMON_HPP
  //
  #include <iostream>
  #include <spline2/MultiBsplineData.hpp>
  #include <stdlib.h>
  // 
  //
  //
  //
  //
  //
  template <typename T> struct MultiBspline
  {
    // define the einsplie object type
    using spliner_type = typename bspline_traits<T, 3>::SplineType;
    //
    MultiBspline() {}
    //MultiBspline(const MultiBspline &in) = delete;
    MultiBspline &operator=(const MultiBspline &in) = delete;
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
    //
      void evaluate_v(const spliner_type *restrict spline_m, T x, T y, T z, 
                      T *restrict vals, size_t num_splines) const;
    //
    //
      void evaluate_vgl(const spliner_type *restrict spline_m, T x, T y, T z, 
                        T *restrict vals, T *restrict grads, T *restrict lapl, 
                        size_t num_splines) const;
    //  
    //
    //
      void evaluate_vgh(const spliner_type *restrict spline_m, T x, T y, T z, 
                        T *restrict vals, T *restrict grads, T *restrict hess, 
                        size_t num_splines) const;
  };
  //
  template <typename T>
  //
  //
  inline void MultiBspline<T>::evaluate_v(const spliner_type *restrict spline_m,
                                          T x, T y, T z, T *restrict vals,
                                          size_t num_splines) const;
                                          //
  {
    x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    T tx, ty, tz;
    int ix, iy, iz;
    //
    //
    SplineBound<T>::get(x * spline_m->x_grid.delta_inv, tx, ix,
                        spline_m->x_grid.num - 1);
    SplineBound<T>::get(y * spline_m->y_grid.delta_inv, ty, iy,
                        spline_m->y_grid.num - 1);
    SplineBound<T>::get(z * spline_m->z_grid.delta_inv, tz, iz,
                        spline_m->z_grid.num - 1);
    T a[4], b[4], c[4];
    //
    MultiBsplineData<T>::compute_prefactors(a, tx);
    MultiBsplineData<T>::compute_prefactors(b, ty);
    MultiBsplineData<T>::compute_prefactors(c, tz);
    //
    const intptr_t xs = spline_m->x_stride;
    const intptr_t ys = spline_m->y_stride;
    const intptr_t zs = spline_m->z_stride;
    //
    CONSTEXPR T zero(0);
    ASSUME_ALIGNED(vals);
    std::fill(vals, vals + num_splines, zero);
  //
  //
  //  
    for (size_t i = 0; i < 4; i++)
      for (size_t j = 0; j < 4; j++)
      {
        const T pre00 = a[i] * b[j];
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        //#pragma omp simd
        for (size_t n = 0; n < num_splines; n++)
          vals[n] +=
              pre00 * (c[0] * coefs[n] + c[1] * coefs[n + zs] +
                       c[2] * coefs[n + 2 * zs] + c[3] * coefs[n + 3 * zs]);
        //
     }
  }
  template <typename T>
  inline void
  MultiBspline<T>::evaluate_vgl(const spliner_type *restrict spline_m,
                                T x, T y, T z, T *restrict vals,
                                T *restrict grads, T *restrict lapl,
                                size_t num_splines) const
  {
    x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    T tx, ty, tz;
    int ix, iy, iz;
    SplineBound<T>::get(x * spline_m->x_grid.delta_inv, tx, ix,
                        spline_m->x_grid.num - 1);
    SplineBound<T>::get(y * spline_m->y_grid.delta_inv, ty, iy,
                        spline_m->y_grid.num - 1);
    SplineBound<T>::get(z * spline_m->z_grid.delta_inv, tz, iz,
                        spline_m->z_grid.num - 1);
    //
    T a[4], b[4], c[4], da[4], db[4], dc[4], d2a[4], d2b[4], d2c[4];
    //
    MultiBsplineData<T>::compute_prefactors(a, da, d2a, tx);
    MultiBsplineData<T>::compute_prefactors(b, db, d2b, ty);
    MultiBsplineData<T>::compute_prefactors(c, dc, d2c, tz);
    //
    const intptr_t xs = spline_m->x_stride;
    const intptr_t ys = spline_m->y_stride;
    const intptr_t zs = spline_m->z_stride;
    //
    const size_t out_offset = spline_m->num_splines;
    //
    ASSUME_ALIGNED(vals);
    T *restrict gx = grads;
    ASSUME_ALIGNED(gx);
    T *restrict gy = grads + out_offset;
    ASSUME_ALIGNED(gy);
    T *restrict gz = grads + 2 * out_offset;
    ASSUME_ALIGNED(gz);
    T *restrict lx = lapl;
    ASSUME_ALIGNED(lx);
    T *restrict ly = lapl + out_offset;
    ASSUME_ALIGNED(ly);
    T *restrict lz = lapl + 2 * out_offset;
    ASSUME_ALIGNED(lz);
    //
    std::fill(vals, vals + num_splines, T());
    std::fill(gx, gx + num_splines, T());
    std::fill(gy, gy + num_splines, T());
    std::fill(gz, gz + num_splines, T());
    std::fill(lx, lx + num_splines, T());
    std::fill(ly, ly + num_splines, T());
    std::fill(lz, lz + num_splines, T());
    //
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
      {
        //
        const T pre20 = d2a[i] * b[j];
        const T pre10 = da[i] * b[j];
        const T pre00 = a[i] * b[j];
        const T pre11 = da[i] * db[j];
        const T pre01 = a[i] * db[j];
        const T pre02 = a[i] * d2b[j];
        //
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        const T *restrict coefszs = coefs + zs;
        ASSUME_ALIGNED(coefszs);
        const T *restrict coefs2zs = coefs + 2 * zs;
        ASSUME_ALIGNED(coefs2zs);
        const T *restrict coefs3zs = coefs + 3 * zs;
        ASSUME_ALIGNED(coefs3zs);
        //
  #pragma noprefetch
  #pragma omp simd
        for (int n = 0; n < num_splines; n++)
        {
          const T coefsv    = coefs[n];
          const T coefsvzs  = coefszs[n];
          const T coefsv2zs = coefs2zs[n];
          const T coefsv3zs = coefs3zs[n];
          //
          T sum0 = c[0] * coefsv + c[1] * coefsvzs + c[2] * coefsv2zs +
                   c[3] * coefsv3zs;
          T sum1 = dc[0] * coefsv + dc[1] * coefsvzs + dc[2] * coefsv2zs +
                   dc[3] * coefsv3zs;
          T sum2 = d2c[0] * coefsv + d2c[1] * coefsvzs + d2c[2] * coefsv2zs +
                   d2c[3] * coefsv3zs;
          gx[n] += pre10 * sum0;
          gy[n] += pre01 * sum0;
          gz[n] += pre00 * sum1;
          lx[n] += pre20 * sum0;
          ly[n] += pre02 * sum0;
          lz[n] += pre00 * sum2;
          vals[n] += pre00 * sum0;
        }
      }
    //
    const T dxInv = spline_m->x_grid.delta_inv;
    const T dyInv = spline_m->y_grid.delta_inv;
    const T dzInv = spline_m->z_grid.delta_inv;
    //
    const T dxInv2 = dxInv * dxInv;
    const T dyInv2 = dyInv * dyInv;
    const T dzInv2 = dzInv * dzInv;
    //
  #pragma omp simd
    for (int n = 0; n < num_splines; n++)
    {
      gx[n] *= dxInv;
       gy[n] *= dyInv;
       gz[n] *= dzInv;
       lx[n] = lx[n] * dxInv2 + ly[n] * dyInv2 + lz[n] * dzInv2;
    }
  }
  //
  template <typename T>
  inline void
  MultiBspline<T>::evaluate_vgh(const spliner_type *restrict spline_m,
                                T x, T y, T z, T *restrict vals,
                                T *restrict grads, T *restrict hess,
                                size_t num_splines) const
  {
    int ix, iy, iz;
    T tx, ty, tz;
    T a[4], b[4], c[4], da[4], db[4], dc[4], d2a[4], d2b[4], d2c[4];
  
    x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    SplineBound<T>::get(x * spline_m->x_grid.delta_inv, tx, ix,
                        spline_m->x_grid.num - 1);
    SplineBound<T>::get(y * spline_m->y_grid.delta_inv, ty, iy,
                        spline_m->y_grid.num - 1);
    SplineBound<T>::get(z * spline_m->z_grid.delta_inv, tz, iz,
                        spline_m->z_grid.num - 1);
    //
    MultiBsplineData<T>::compute_prefactors(a, da, d2a, tx);
    MultiBsplineData<T>::compute_prefactors(b, db, d2b, ty);
    MultiBsplineData<T>::compute_prefactors(c, dc, d2c, tz);
    //
    //
    std::fill(vals, vals + num_splines, T());
    std::fill(gx, gx + num_splines, T());
    std::fill(gy, gy + num_splines, T());
    std::fill(gz, gz + num_splines, T());
    std::fill(hxx, hxx + num_splines, T());
    std::fill(hxy, hxy + num_splines, T());
    std::fill(hxz, hxz + num_splines, T());
    std::fill(hyy, hyy + num_splines, T());
    std::fill(hyz, hyz + num_splines, T());
    std::fill(hzz, hzz + num_splines, T());
    //
    //
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
      {
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        const T *restrict coefszs = coefs + zs;
        ASSUME_ALIGNED(coefszs);
        const T *restrict coefs2zs = coefs + 2 * zs;
        ASSUME_ALIGNED(coefs2zs);
        const T *restrict coefs3zs = coefs + 3 * zs;
        ASSUME_ALIGNED(coefs3zs);
    //
        const T pre20 = d2a[i] * b[j];
        const T pre10 = da[i] * b[j];
        const T pre00 = a[i] * b[j];
        const T pre11 = da[i] * db[j];
        const T pre01 = a[i] * db[j];
        const T pre02 = a[i] * d2b[j];
      //
        const int iSplitPoint = num_splines;
        #pragma omp simd
        for (int n = 0; n < iSplitPoint; n++)
        {
          T coefsv    = coefs[n];
          T coefsvzs  = coefszs[n];
          T coefsv2zs = coefs2zs[n];
          T coefsv3zs = coefs3zs[n];
          T sum0 = c[0] * coefsv + c[1] * coefsvzs + c[2] * coefsv2zs +
                   c[3] * coefsv3zs;
          T sum1 = dc[0] * coefsv + dc[1] * coefsvzs + dc[2] * coefsv2zs +
                   dc[3] * coefsv3zs;
          T sum2 = d2c[0] * coefsv + d2c[1] * coefsvzs + d2c[2] * coefsv2zs +
                   d2c[3] * coefsv3zs;
          hxx[n] += pre20 * sum0;
          hxy[n] += pre11 * sum0;
          hxz[n] += pre10 * sum1;
          hyy[n] += pre02 * sum0;
          hyz[n] += pre01 * sum1;
          hzz[n] += pre00 * sum2;
          gx[n] += pre10 * sum0;
          gy[n] += pre01 * sum0;
          gz[n] += pre00 * sum1;
          vals[n] += pre00 * sum0;
        }
      }
  \\
    #pragma omp simd
    for (int n = 0; n < num_splines; n++)
    {
      gx[n] *= dxInv;
      gy[n] *= dyInv;
      gz[n] *= dzInv;
      hxx[n] *= dxx;
      hyy[n] *= dyy;
      hzz[n] *= dzz;
      hxy[n] *= dxy;
      hxz[n] *= dxz;
      hyz[n] *= dyz;
    }
  }

  //Master header file to define MultiBspline
  //
  #ifndef QMCPLUSPLUS_MULTIEINSPLINE_COMMON_HPP
  #define QMCPLUSPLUS_MULTIEINSPLINE_COMMON_HPP
  //
  #include <iostream>
  #include <spline2/MultiBsplineData.hpp>
  #include <stdlib.h>
  //
  #ifdef __CUDA_ARCH__
  #undef ASSUME_ALIGNED
  #define ASSUME_ALIGNED(x)
  #endif
  //
  template <typename T> struct MultiBspline
  {
    // define the einsplie object type
    using spliner_type = typename bspline_traits<T, 3>::SplineType;
    //
    MultiBspline() {}
    //MultiBspline(const MultiBspline &in) = delete;
    MultiBspline &operator=(const MultiBspline &in) = delete;
    //
    T A44[16];
    T dA44[16];
    T d2A44[16];
    //T d3A44[16];
    //
    void copy_A44() {
      for(int i=0; i<16; i++) {
        A44[i] = MultiBsplineData<T>::A44[i];
        dA44[i] = MultiBsplineData<T>::dA44[i];
        d2A44[i] = MultiBsplineData<T>::d2A44[i];
        //d3A44[i] = MultiBsplineData<T>::d3A44[i];
      }
    }
  //
    KOKKOS_INLINE_FUNCTION void compute_prefactors(T a[4], T tx) const
    {
      a[0] = ((A44[0] * tx + A44[1]) * tx + A44[2]) * tx + A44[3];
      a[1] = ((A44[4] * tx + A44[5]) * tx + A44[6]) * tx + A44[7];
      a[2] = ((A44[8] * tx + A44[9]) * tx + A44[10]) * tx + A44[11];
      a[3] = ((A44[12] * tx + A44[13]) * tx + A44[14]) * tx + A44[15];
    }
  //
    KOKKOS_INLINE_FUNCTION void compute_prefactors(T a[4], T da[4], T d2a[4], T tx) const
    {
      a[0]   = ((A44[0] * tx + A44[1]) * tx + A44[2]) * tx + A44[3];
      a[1]   = ((A44[4] * tx + A44[5]) * tx + A44[6]) * tx + A44[7];
      a[2]   = ((A44[8] * tx + A44[9]) * tx + A44[10]) * tx + A44[11];
      a[3]   = ((A44[12] * tx + A44[13]) * tx + A44[14]) * tx + A44[15];
      da[0]  = ((dA44[0] * tx + dA44[1]) * tx + dA44[2]) * tx + dA44[3];
      da[1]  = ((dA44[4] * tx + dA44[5]) * tx + dA44[6]) * tx + dA44[7];
      da[2]  = ((dA44[8] * tx + dA44[9]) * tx + dA44[10]) * tx + dA44[11];
      da[3]  = ((dA44[12] * tx + dA44[13]) * tx + dA44[14]) * tx + dA44[15];
      d2a[0] = ((d2A44[0] * tx + d2A44[1]) * tx + d2A44[2]) * tx + d2A44[3];
      d2a[1] = ((d2A44[4] * tx + d2A44[5]) * tx + d2A44[6]) * tx + d2A44[7];
      d2a[2] = ((d2A44[8] * tx + d2A44[9]) * tx + d2A44[10]) * tx + d2A44[11];
      d2a[3] = ((d2A44[12] * tx + d2A44[13]) * tx + d2A44[14]) * tx + d2A44[15];
    }
  //
  #define MYMAX(a,b) (a<b?b:a)
  #define MYMIN(a,b) (a>b?b:a)
   KOKKOS_INLINE_FUNCTION void get(T x, T &dx, int &ind, int ng) const
   {
     T ipart;
     dx  = std::modf(x, &ipart);
     ind = MYMIN(MYMAX(int(0), static_cast<int>(ipart)), ng);
   }
  #undef MYMAX
  #undef MYMIN
    //
    template<class TeamType>
    KOKKOS_INLINE_FUNCTION
      void evaluate_v(const TeamType& team, const spliner_type *restrict spline_m, T x, T y, T z, 
                      T *restrict vals, size_t num_splines) const;
    //
    KOKKOS_INLINE_FUNCTION
      void evaluate_vgl(const spliner_type *restrict spline_m, T x, T y, T z, 
                        T *restrict vals, T *restrict grads, T *restrict lapl, 
                        size_t num_splines) const;
    //
    template<class TeamType>
    KOKKOS_INLINE_FUNCTION
      void evaluate_vgh(const TeamType& team, const spliner_type *restrict spline_m, T x, T y, T z, 
                        T *restrict vals, T *restrict grads, T *restrict hess, 
                        size_t num_splines) const;
  };
  //
  template <typename T>
  template<class TeamType>
  KOKKOS_INLINE_FUNCTION
  void MultiBspline<T>::evaluate_v(const TeamType& team, 
                                   const spliner_type *restrict spline_m,
                                   T x, T y, T z, T *restrict vals,
                                   size_t num_splines) const;
  {
    x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    T tx, ty, tz;
    int ix, iy, iz;
    // Make SplineBound and MultiBsplineData actual members, 
    // or copy functionality over into MultiBspline
    get(x * spline_m->x_grid.delta_inv, tx, ix,
                      spline_m->x_grid.num - 1);
    get(y * spline_m->y_grid.delta_inv, ty, iy,
                      spline_m->y_grid.num - 1);
    get(z * spline_m->z_grid.delta_inv, tz, iz,
                      spline_m->z_grid.num - 1);
    T a[4], b[4], c[4];
    // 
    compute_prefactors(a, tx);
    compute_prefactors(b, ty);
    compute_prefactors(c, tz);
    //
    const intptr_t xs = spline_m->x_stride;
    const intptr_t ys = spline_m->y_stride;
    const intptr_t zs = spline_m->z_stride;
    //
    CONSTEXPR T zero(0);
    ASSUME_ALIGNED(vals);
    Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,num_splines), [&](const int& i) {
      vals[i] = T();
    });
  //  
    for (size_t i = 0; i < 4; i++)
      for (size_t j = 0; j < 4; j++)
      {
        const T pre00 = a[i] * b[j];
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        //#pragma omp simd
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,num_splines), [&](const int& n) {
          vals[n] +=
              pre00 * (c[0] * coefs[n] + c[1] * coefs[n + zs] +
                       c[2] * coefs[n + 2 * zs] + c[3] * coefs[n + 3 * zs]);
        });
      }
  }
  template <typename T>
  KOKKOS_INLINE_FUNCTION
  void MultiBspline<T>::evaluate_vgl(const spliner_type *restrict spline_m,
                                     T x, T y, T z, T *restrict vals,
                                     T *restrict grads, T *restrict lapl,
                                     size_t num_splines) const
  {
    /*x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    T tx, ty, tz;
    int ix, iy, iz;
    get(x * spline_m->x_grid.delta_inv, tx, ix,
                      spline_m->x_grid.num - 1);
    get(y * spline_m->y_grid.delta_inv, ty, iy,
                      spline_m->y_grid.num - 1);
    get(z * spline_m->z_grid.delta_inv, tz, iz,
                      spline_m->z_grid.num - 1);
    // 
    T a[4], b[4], c[4], da[4], db[4], dc[4], d2a[4], d2b[4], d2c[4];
    //
    MultiBsplineData<T>::compute_prefactors(a, da, d2a, tx);
    MultiBsplineData<T>::compute_prefactors(b, db, d2b, ty);
    MultiBsplineData<T>::compute_prefactors(c, dc, d2c, tz);
    //
    const intptr_t xs = spline_m->x_stride;
    const intptr_t ys = spline_m->y_stride;
    const intptr_t zs = spline_m->z_stride;
    //
    const size_t out_offset = spline_m->num_splines;
    //
    ASSUME_ALIGNED(vals);
    T *restrict gx = grads;
    ASSUME_ALIGNED(gx);
    T *restrict gy = grads + out_offset;
    ASSUME_ALIGNED(gy);
    T *restrict gz = grads + 2 * out_offset;
    ASSUME_ALIGNED(gz);
    T *restrict lx = lapl;
    ASSUME_ALIGNED(lx);
    T *restrict ly = lapl + out_offset;
    ASSUME_ALIGNED(ly);
    T *restrict lz = lapl + 2 * out_offset;
    ASSUME_ALIGNED(lz);
    //
    std::fill(vals, vals + num_splines, T());
    std::fill(gx, gx + num_splines, T());
    std::fill(gy, gy + num_splines, T());
    std::fill(gz, gz + num_splines, T());
    std::fill(lx, lx + num_splines, T());
    std::fill(ly, ly + num_splines, T());
    std::fill(lz, lz + num_splines, T());
    //
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
      {
        //
        const T pre20 = d2a[i] * b[j];
        const T pre10 = da[i] * b[j];
        const T pre00 = a[i] * b[j];
        const T pre11 = da[i] * db[j];
        const T pre01 = a[i] * db[j];
        const T pre02 = a[i] * d2b[j];
        //
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        const T *restrict coefszs = coefs + zs;
        ASSUME_ALIGNED(coefszs);
        const T *restrict coefs2zs = coefs + 2 * zs;
        ASSUME_ALIGNED(coefs2zs);
        const T *restrict coefs3zs = coefs + 3 * zs;
        ASSUME_ALIGNED(coefs3zs);
        //
  #pragma noprefetch
  #pragma omp simd
        for (int n = 0; n < num_splines; n++)
        {
          const T coefsv    = coefs[n];
          const T coefsvzs  = coefszs[n];
          const T coefsv2zs = coefs2zs[n];
          const T coefsv3zs = coefs3zs[n];
          //
          T sum0 = c[0] * coefsv + c[1] * coefsvzs + c[2] * coefsv2zs +
                   c[3] * coefsv3zs;
          T sum1 = dc[0] * coefsv + dc[1] * coefsvzs + dc[2] * coefsv2zs +
                   dc[3] * coefsv3zs;
          T sum2 = d2c[0] * coefsv + d2c[1] * coefsvzs + d2c[2] * coefsv2zs +
                   d2c[3] * coefsv3zs;
          gx[n] += pre10 * sum0;
          gy[n] += pre01 * sum0;
          gz[n] += pre00 * sum1;
          lx[n] += pre20 * sum0;
          ly[n] += pre02 * sum0;
          lz[n] += pre00 * sum2;
          vals[n] += pre00 * sum0;
        }
      }
    //
    const T dxInv = spline_m->x_grid.delta_inv;
    const T dyInv = spline_m->y_grid.delta_inv;
    const T dzInv = spline_m->z_grid.delta_inv;
    //
    const T dxInv2 = dxInv * dxInv;
    const T dyInv2 = dyInv * dyInv;
    const T dzInv2 = dzInv * dzInv;
    //
  #pragma omp simd
    for (int n = 0; n < num_splines; n++)
    {
      gx[n] *= dxInv;
       gy[n] *= dyInv;
       gz[n] *= dzInv;
       lx[n] = lx[n] * dxInv2 + ly[n] * dyInv2 + lz[n] * dzInv2;
    }*/
  }
  template<class TeamType>
  template <typename T>
  KOKKOS_INLINE_FUNCTION
  void MultiBspline<T>::evaluate_vgh(const TeamType& team, const spliner_type *restrict spline_m,
                                T x, T y, T z, T *restrict vals,
                                T *restrict grads, T *restrict hess,
                                size_t num_splines) const
  {
    int ix, iy, iz;
    T tx, ty, tz;
    T a[4], b[4], c[4], da[4], db[4], dc[4], d2a[4], d2b[4], d2c[4];
  
    x -= spline_m->x_grid.start;
    y -= spline_m->y_grid.start;
    z -= spline_m->z_grid.start;
    get(x * spline_m->x_grid.delta_inv, tx, ix,
                      spline_m->x_grid.num - 1);
    get(y * spline_m->y_grid.delta_inv, ty, iy,
                      spline_m->y_grid.num - 1);
    get(z * spline_m->z_grid.delta_inv, tz, iz,
                      spline_m->z_grid.num - 1);
    // 
    compute_prefactors(a, da, d2a, tx);
    compute_prefactors(b, db, d2b, ty);
    compute_prefactors(c, dc, d2c, tz);
    // 
    //
  Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,num_splines), [&](const int& i) {
      vals[i] = T();
      gx[i] = T();
      gy[i] = T();
      gz[i] = T();
      hxx[i] = T();
      hxy[i] = T();
      hxz[i] = T();
      hyy[i] = T();
      hyz[i] = T();
      hzz[i] = T();
    });
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
      {
        const T *restrict coefs =
            spline_m->coefs + ((ix + i) * xs + (iy + j) * ys + iz * zs);
        ASSUME_ALIGNED(coefs);
        const T *restrict coefszs = coefs + zs;
        ASSUME_ALIGNED(coefszs);
        const T *restrict coefs2zs = coefs + 2 * zs;
        ASSUME_ALIGNED(coefs2zs);
        const T *restrict coefs3zs = coefs + 3 * zs;
        ASSUME_ALIGNED(coefs3zs);
    //
        const T pre20 = d2a[i] * b[j];
        const T pre10 = da[i] * b[j];
        const T pre00 = a[i] * b[j];
        const T pre11 = da[i] * db[j];
        const T pre01 = a[i] * db[j];
        const T pre02 = a[i] * d2b[j];
      //
      //
      //
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,num_splines), [&](const int& n) {
      //
          T coefsv    = coefs[n];
          T coefsvzs  = coefszs[n];
          T coefsv2zs = coefs2zs[n];
          T coefsv3zs = coefs3zs[n];
          T sum0 = c[0] * coefsv + c[1] * coefsvzs + c[2] * coefsv2zs +
                   c[3] * coefsv3zs;
          T sum1 = dc[0] * coefsv + dc[1] * coefsvzs + dc[2] * coefsv2zs +
                   dc[3] * coefsv3zs;
          T sum2 = d2c[0] * coefsv + d2c[1] * coefsvzs + d2c[2] * coefsv2zs +
                   d2c[3] * coefsv3zs;
          hxx[n] += pre20 * sum0;
          hxy[n] += pre11 * sum0;
          hxz[n] += pre10 * sum1;
          hyy[n] += pre02 * sum0;
          hyz[n] += pre01 * sum1;
          hzz[n] += pre00 * sum2;
          gx[n] += pre10 * sum0;
          gy[n] += pre01 * sum0;
          gz[n] += pre00 * sum1;
          vals[n] += pre00 * sum0;
        });
      }
  \\
  \\
    Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,num_splines), [&](const int& n) {
      \\
      gx[n] *= dxInv;
      gy[n] *= dyInv;
      gz[n] *= dzInv;
      hxx[n] *= dxx;
      hyy[n] *= dyy;
      hzz[n] *= dzz;
      hxy[n] *= dxy;
      hxz[n] *= dxz;
      hyz[n] *= dyz;
    });
  }

In src/einspline/multi_bspline_structs.h:

Add Kokkos header file
```
//
#include <Kokkos_Core.hpp>
//
```

Create Kokkos views

typedef struct
{
  //
  //
  spline_code spcode;
  type_code tcode;
  double *restrict coefs;
  intptr_t x_stride, y_stride, z_stride;
  Ugrid x_grid, y_grid, z_grid;
  BCtype_d xBC, yBC, zBC;
  int num_splines;
  size_t coefs_size;
} multi_UBspline_3d_d;

typedef struct
{
  typedef Kokkos::View<double****, Kokkos::LayoutRight> coefs_view_t;
  spline_code spcode;
  type_code tcode;
  double *restrict coefs;
  coefs_view_t coefs_view;
  intptr_t x_stride, y_stride, z_stride;
  Ugrid x_grid, y_grid, z_grid;
  BCtype_d xBC, yBC, zBC;
  int num_splines;
  size_t coefs_size;
} multi_UBspline_3d_d;

In src/spline2/bspline_allocator.cpp:

Remove functions.

extern "C" {    
//
void set_multi_UBspline_3d_d(multi_UBspline_3d_d *spline, int spline_num, double *data);
void set_multi_UBspline_3d_s(multi_UBspline_3d_s *spline, int spline_num, float *data);
//
multi_UBspline_3d_s *einspline_create_multi_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                          BCtype_s xBC, BCtype_s yBC, BCtype_s zBC, 
                                                          int num_splines);
//
UBspline_3d_s *einspline_create_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                              BCtype_s xBC, BCtype_s yBC, BCtype_s zBC, 
                                              float *data);
//
multi_UBspline_3d_d *einspline_create_multi_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                          BCtype_d xBC, BCtype_d yBC, BCtype_d zBC, 
                                                          int num_splines);
//
UBspline_3d_d *einspline_create_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                              BCtype_d xBC, BCtype_d yBC, BCtype_d zBC, 
                                              double *data);
}

extern "C" {
//    
void set_multi_UBspline_3d_d(multi_UBspline_3d_d *spline, int spline_num, double *data);
void set_multi_UBspline_3d_s(multi_UBspline_3d_s *spline, int spline_num, float *data);
//  
//
//
//
//
//
//
//
//
//  
//
//  
//  
//  
//  
//  
}

Changes in deallocation.

void Allocator::set(double *indata, multi_UBspline_3d_s *spline, int i)
{
  //
  einspline_free(singleS->coefs);
  //
}

void Allocator::set(double *indata, multi_UBspline_3d_s *spline, int i)
{
  //
  einspline_free((void*)singleS->coefs);
  //
}

In src/spline2/bspline_allocator.hpp:

Delete Kokkos view by setting it to an empty view and relying on reference counting of views.

class Allocator
{
  //
  template <typename SplineType> void destroy(SplineType *spline)
  {
    //
    einspline_free(spline->coefs);
    free(spline);
  }
  //
}

class Allocator
{
  //
  template <typename SplineType> void destroy(SplineType *spline)
  {
    // Delete View by setting it to an empty view and rely on reference counting of views.
    spline->coefs_view = typename SplineType::coefs_view_t();
    free(spline);
  }
  //
}

In src/spline2/einspline_allocator.cpp (Renamed src/spline2/einspline_allocator.c):

Add Kokkos header file
```
#include <Kokkos_Core.hpp>
```
Remove inline specifier for all einspline_alloc and einspline_free.

Modify multi_Bspline creation:

multi_UBspline_3d_s *
einspline_create_multi_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid,
                                     BCtype_s xBC, BCtype_s yBC, BCtype_s zBC,
                                      int num_splines)
{
   // Create new spline
   multi_UBspline_3d_s *restrict spline = malloc(sizeof(multi_UBspline_3d_s));
   //
}
//
//
multi_UBspline_3d_d *
einspline_create_multi_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid,
                                     BCtype_d xBC, BCtype_d yBC, BCtype_d zBC,
                                     int num_splines)
{
  // Create new spline
  multi_UBspline_3d_d *restrict spline = malloc(sizeof(multi_UBspline_3d_d));
  //
}
//

multi_UBspline_3d_s *
einspline_create_multi_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid,
                                     BCtype_s xBC, BCtype_s yBC, BCtype_s zBC,
                                      int num_splines)
{
  // Create new spline
  multi_UBspline_3d_s *restrict spline = (multi_UBspline_3d_s *)malloc(sizeof(multi_UBspline_3d_s));
  //
}
//
//
multi_UBspline_3d_d *
einspline_create_multi_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid,
                                     BCtype_d xBC, BCtype_d yBC, BCtype_d zBC,
                                     int num_splines)
{
  // Create new spline
  multi_UBspline_3d_d *restrict spline = new multi_UBspline_3d_d;
  //
}
//

Create Kokkos view

//
//
spline->coefs_size = (size_t)Nx * spline->x_stride;
//
//
spline->coefs = (double *)einspline_alloc(sizeof(double) * spline->coefs_size, QMC_CLINE);    
//
//
//
//
//
//
//
//
//
//
//

// Create KokkosView
spline->coefs_view = multi_UBspline_3d_d::coefs_view_t("Multi_UBspline_3d_d",Nx,Ny,Nz,N);
// Check data layout is as expected
int strides[4];
spline->coefs_view.stride(strides);
if(spline->x_stride!=strides[0] ||
  spline->y_stride!=strides[1] ||
  spline->z_stride!=strides[2] ||
  1!=strides[3])
  fprintf(stderr, "Kokkos View has non-compatible strides %i %i | %i %i | %i %i\n",
      spline->x_stride,strides[0],
      spline->y_stride,strides[1],
      spline->z_stride,strides[2]
      );
spline->coefs = spline->coefs_view.data();

Modify Bspline creation

UBspline_3d_d *einspline_create_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid,
                                              Ugrid z_grid, BCtype_d xBC,
                                              BCtype_d yBC, BCtype_d zBC,
                                              double *data)
{
  // Create new spline
  UBspline_3d_d *restrict spline = malloc(sizeof(UBspline_3d_d));
  //
}
//
//
UBspline_3d_s *einspline_create_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid,
                                              Ugrid z_grid, BCtype_s xBC,
                                              BCtype_s yBC, BCtype_s zBC,
                                              float *data)
{
  // Create new spline
  UBspline_3d_s *spline = malloc(sizeof(UBspline_3d_s));
  //
}
//

UBspline_3d_d *einspline_create_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid,
                                              Ugrid z_grid, BCtype_d xBC,
                                              BCtype_d yBC, BCtype_d zBC,
                                              double *data)
{
  // Create new spline
  UBspline_3d_d *restrict spline = (UBspline_3d_d *)malloc(sizeof(UBspline_3d_d));
  //
}
//
//
UBspline_3d_s *einspline_create_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid,
                                              Ugrid z_grid, BCtype_s xBC,
                                              BCtype_s yBC, BCtype_s zBC,
                                              float *data)
{
  // Create new spline
  UBspline_3d_s *spline = (UBspline_3d_s *)malloc(sizeof(UBspline_3d_s));
  //
}
//

In src/spline2/einspline_allocator.h:

Add multi_Bspline and Bspline functions

extern "C" {
  #endif
  //
  void *einspline_alloc(size_t size, size_t alignment);
  //
  void einspline_free(void *ptr); 
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  #ifdef __cplusplus
}

extern "C" {
  //   
  //
  void *einspline_alloc(size_t size, size_t alignment);
  //
  void einspline_free(void *ptr);
  //
  multi_UBspline_3d_s *einspline_create_multi_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                            BCtype_s xBC, BCtype_s yBC, BCtype_s zBC, 
                                                            int num_splines);
  //
  multi_UBspline_3d_d *einspline_create_multi_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                            BCtype_d xBC, BCtype_d yBC, BCtype_d zBC, 
                                                            int num_splines);
  //
  UBspline_3d_s *einspline_create_UBspline_3d_s(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                BCtype_s xBC, BCtype_s yBC, BCtype_s zBC, 
                                                float *data);
  //
  UBspline_3d_d *einspline_create_UBspline_3d_d(Ugrid x_grid, Ugrid y_grid, Ugrid z_grid, 
                                                BCtype_d xBC, BCtype_d yBC, BCtype_d zBC, 
                                                double *data);
  //
}

In src/spline2/einspline_spo.hpp:

Modify data structures

struct einspline_spo
{
  struct EvaluateVGHTag {};
  struct EvaluateVTag {};
  typedef Kokkos::TeamPolicy<EvaluateVGHTag> policy_vgh_t;
  typedef Kokkos::TeamPolicy<EvaluateVTag> policy_v_t;
  //
  typedef typename Kokkos::TeamPolicy<>::member_type team_t;
  //
  // using spline_type=MultiBspline<T>;
  // define the einspline data object type
  using spline_type = typename bspline_traits<T, 3>::SplineType;
  using pos_type        = TinyVector<T, 3>;
  using vContainer_type = aligned_vector<T>;
  using gContainer_type = VectorSoaContainer<T, 3>;
  using hContainer_type = VectorSoaContainer<T, 6>;
  //
  // mark instance as copy
  bool is_copy;
  // 
  aligned_vector<spline_type *> einsplines;
  aligned_vector<vContainer_type *> psi;
  aligned_vector<gContainer_type *> grad;
  aligned_vector<hContainer_type *> hess;
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  // using spline_type=MultiBspline<T>;
  using pos_type        = TinyVector<T, 3>;
  using vContainer_type = aligned_vector<T>;
  //
  //
  lattice_type Lattice;
  //
  //
  //
  //
  //
  //
  //
  aligned_vector<spline_type *> einsplines;
  //
  // default constructor
  einspline_spo()
      : nBlocks(0), nSplines(0), firstBlock(0), lastBlock(0), Owner(false)
      //
  {
  }
  // disable copy constructor
  einspline_spo(const einspline_spo &in) = delete;
  // disable copy operator
  einspline_spo &operator=(const einspline_spo &in) = delete;
  //
  einspline_spo(einspline_spo &in, int ncrews, int crewID)
      : Owner(false), Lattice(in.Lattice)
  {
    //
    //
    nSplines         = in.nSplines;
    nSplinesPerBlock = in.nSplinesPerBlock;
    nBlocks          = (in.nBlocks + ncrews - 1) / ncrews;
    firstBlock       = nBlocks * crewID;
    lastBlock        = std::min(in.nBlocks, nBlocks * (crewID + 1));
    nBlocks          = lastBlock - firstBlock;
    einsplines.resize(nBlocks);
    for (int i = 0, t = firstBlock; i < nBlocks; ++i, ++t)
      einsplines[i] = in.einsplines[t];
    resize();
  }
  //
  einspline_spo(einspline_spo_ref<T,compute_engine_type> &in, int ncrews, int crewID)
       : Owner(false), Lattice(in.Lattice), is_copy(false)
  {
    //
    //
    nSplines         = in.nSplines;
    nSplinesPerBlock = in.nSplinesPerBlock;
    nBlocks          = (in.nBlocks + ncrews - 1) / ncrews;
    firstBlock       = nBlocks * crewID;
    lastBlock        = std::min(in.nBlocks, nBlocks * (crewID + 1));
    nBlocks          = lastBlock - firstBlock;
    einsplines.resize(nBlocks);
      for (int i = 0, t = firstBlock; i < nBlocks; ++i, ++t)
    einsplines[i] = in.einsplines[t];
    resize();
  }
  //
  // destructors
  ~einspline_spo()
  {
  //  
    if (psi.size()) clean();
    if (Owner)
      for (int i = 0; i < nBlocks; ++i)
        delete einsplines[i];
  }
  void clean()
  {
    const int n = psi.size();
    for (int i = 0; i < n; ++i)
    {
      delete psi[i];
      delete grad[i];
      delete hess[i];
    }
    //
    //
    //
    //
    //
  }
  // resize the containers
  void resize()
  {
    if (nBlocks > psi.size())
    {
      clean();
      psi.resize(nBlocks);
      grad.resize(nBlocks);
      hess.resize(nBlocks);
#pragma omp parallel for
      for (int i = 0; i < nBlocks; ++i)
      {
        psi[i]  = new vContainer_type(nSplinesPerBlock);
        grad[i] = new gContainer_type(nSplinesPerBlock);
        hess[i] = new hContainer_type(nSplinesPerBlock);
      }
    }
  }
  // fix for general num_splines
  void set(int nx, int ny, int nz, int num_splines, int nblocks,
           bool init_random = true)
  {
    nSplines         = num_splines;
    nBlocks          = nblocks;
    nSplinesPerBlock = num_splines / nblocks;
    firstBlock       = 0;
    lastBlock        = nBlocks;
    if (einsplines.empty())
    {
      Owner = true;
      TinyVector<int, 3> ng(nx, ny, nz);
      pos_type start(0);
      pos_type end(1);
      einsplines.resize(nBlocks);
      RandomGenerator<T> myrandom(11);
      Array<T, 3> data(nx, ny, nz);
      std::fill(data.begin(), data.end(), T());
      myrandom.generate_uniform(data.data(), data.size());
        for (int i = 0; i < nBlocks; ++i)
        {
          einsplines[i] = myAllocator.createMultiBspline(T(0), start, end, ng, PERIODIC, nSplinesPerBlock);
          if (init_random)
            for (int j = 0; j < nSplinesPerBlock; ++j)
              myAllocator.set(data.data(), einsplines[i], j);
        }
      }
      resize();
    }
   //
  }
}

struct einspline_spo
{
  struct EvaluateVGHTag {};
  struct EvaluateVTag {};
  typedef Kokkos::TeamPolicy<EvaluateVGHTag> policy_vgh_t;
  typedef Kokkos::TeamPolicy<Kokkos::Serial,EvaluateVTag> policy_v_t;
  //
  typedef typename policy_v_t::member_type team_v_t;
  //
  // using spline_type=MultiBspline<T>;
  // define the einspline data object type
  using spline_type = typename bspline_traits<T, 3>::SplineType;
  using pos_type        = TinyVector<T, 3>;
  using vContainer_type = Kokkos::View<T*>;
  using gContainer_type = Kokkos::View<T*[3],Kokkos::LayoutLeft>;
  using hContainer_type = Kokkos::View<T*[6],Kokkos::LayoutLeft>;
  //
  // mark instance as copy
  bool is_copy;
  // 
  Kokkos::View<spline_type *> einsplines;
  Kokkos::View<vContainer_type*> psi;
  Kokkos::View<gContainer_type*> grad;
  Kokkos::View<hContainer_type*> hess;
  //
  struct EvaluateVTag {};
  struct EvaluateVGHTag {};
  typedef Kokkos::TeamPolicy<Kokkos::Serial,EvaluateVGHTag> policy_vgh_serial_t;
  typedef Kokkos::TeamPolicy<EvaluateVGHTag> policy_vgh_parallel_t;
  typedef Kokkos::TeamPolicy<Kokkos::Serial,EvaluateVTag> policy_v_serial_t;
  typedef Kokkos::TeamPolicy<EvaluateVTag> policy_v_parallel_t;
  //
  typedef typename policy_vgh_serial_t::member_type team_vgh_serial_t;
  typedef typename policy_vgh_parallel_t::member_type team_vgh_parallel_t;
  typedef typename policy_v_serial_t::member_type team_v_serial_t;
  typedef typename policy_v_parallel_t::member_type team_v_parallel_t;
  // 
  // Whether to use Serial evaluation or not
  int nSplinesSerialThreshold_V;
  int nSplinesSerialThreshold_VGH;    
  //
  // using spline_type=MultiBspline<T>;
  using pos_type        = TinyVector<T, 3>;
  using vContainer_type = aligned_vector<T>;
  //
  //
  lattice_type Lattice;
  //
  // Needed to be stored in class since we use it as the functor
  pos_type pos;
  //
  // mark instance as copy
  bool is_copy;
  //
  aligned_vector<spline_type *> einsplines;
  //
  // default constructor
  einspline_spo()
      : nSplinesSerialThreshold_V(512),nSplinesSerialThreshold_VGH(128),nBlocks(0), nSplines(0), 
        firstBlock(0), lastBlock(0), Owner(false), is_copy(false)
  {
  }
  // disable copy constructor
  // einspline_spo(const einspline_spo &in) = delete;
  // disable copy operator
  einspline_spo &operator=(const einspline_spo &in) = delete;
  //
  einspline_spo(einspline_spo &in, int ncrews, int crewID)
      : Owner(false), Lattice(in.Lattice)
  {
    nSplinesSerialThreshold_V = in.nSplinesSerialThreshold_V;
    nSplinesSerialThreshold_VGH = in.nSplinesSerialThreshold_VGH;
    nSplines         = in.nSplines;
    nSplinesPerBlock = in.nSplinesPerBlock;
    nBlocks          = (in.nBlocks + ncrews - 1) / ncrews;
    firstBlock       = nBlocks * crewID;
    lastBlock        = std::min(in.nBlocks, nBlocks * (crewID + 1));
    nBlocks          = lastBlock - firstBlock;
    einsplines       = Kokkos::View<spline_type*>("einsplines",nBlocks);
    for (int i = 0, t = firstBlock; i < nBlocks; ++i, ++t)
      einsplines[i] = in.einsplines[t];
    resize();
  }
  //
  einspline_spo(einspline_spo_ref<T,compute_engine_type> &in, int ncrews, int crewID)
       : Owner(false), Lattice(in.Lattice), is_copy(false)
  {
    nSplinesSerialThreshold_V = 512;
    nSplinesSerialThreshold_VGH = 128;
    nSplines         = in.nSplines;
    nSplinesPerBlock = in.nSplinesPerBlock;
    nBlocks          = (in.nBlocks + ncrews - 1) / ncrews;
    firstBlock       = nBlocks * crewID;
    lastBlock        = std::min(in.nBlocks, nBlocks * (crewID + 1));
    nBlocks          = lastBlock - firstBlock;
    einsplines       = Kokkos::View<spline_type*>("einsplines",nBlocks);
      for (int i = 0, t = firstBlock; i < nBlocks; ++i, ++t)
    einsplines[i] = in.einsplines[t];
    resize();
  }
  //
  // destructors
  ~einspline_spo()
  {
    if(! is_copy) {
    if (psi.extent(0)) clean();
    einsplines = Kokkos::View<spline_type *>() ;
    //
    }
  }
  void clean()
  //
  //
  //
  {
    for(int i=0; i<psi.extent(0); i++) {
      psi(i) = vContainer_type();
      grad(i) = gContainer_type();
      hess(i) = hContainer_type();
    }
    psi = Kokkos::View<vContainer_type*>()  ;
    grad = Kokkos::View<gContainer_type*>()  ;
    hess = Kokkos::View<hContainer_type*>() ;      
    //
  }
  // resize the containers
  void resize()
  {
    if (nBlocks > psi.extent(0))
    {
      clean();
      psi = Kokkos::View<vContainer_type*>("Psi",nBlocks) ;
      grad = Kokkos::View<gContainer_type*>("Grad",nBlocks) ;
      hess = Kokkos::View<hContainer_type*>("Hess",nBlocks);
      for(int i=0; i<psi.extent(0); i++) {
        new (&psi(i)) vContainer_type("Psi_i",nSplinesPerBlock);
        new (&grad(i)) gContainer_type("Grad_i",nSplinesPerBlock);
        new (&hess(i)) hContainer_type("Hess_i",nSplinesPerBlock);
      }
    }
    //
    //      
  }
  // fix for general num_splines
  void set(int nx, int ny, int nz, int num_splines, int nblocks,
           bool init_random = true)
  {
    nSplines         = num_splines;
    nBlocks          = nblocks;
    nSplinesPerBlock = num_splines / nblocks;
    firstBlock       = 0;
    lastBlock        = nBlocks;
    if (einsplines.extent(0)==0)
    {
      Owner = true;
      TinyVector<int, 3> ng(nx, ny, nz);
      pos_type start(0);
      pos_type end(1);
      einsplines       = Kokkos::View<spline_type*>("einsplines",nBlocks);
      RandomGenerator<T> myrandom(11);
      Array<T, 3> data(nx, ny, nz);
      std::fill(data.begin(), data.end(), T());
      myrandom.generate_uniform(data.data(), data.size());
      for (int i = 0; i < nBlocks; ++i)
      {
        einsplines[i] = *myAllocator.createMultiBspline(T(0), start, end, ng, PERIODIC, nSplinesPerBlock);
        if (init_random)
          for (int j = 0; j < nSplinesPerBlock; ++j)
            myAllocator.set(data.data(), &einsplines[i], j);
      }
    }
    resize();
  }
  //
}

Parallel execution

In src/QMCWaveFunctions/einspline_spo.hpp:

Add Kokkos::parallel_for to evaluate_v, evaluate_vgl and tidy up data structures in evaluate_v, evaluate_vgh, evaluate_vgl

struct einspline_spo
{
  //
  /** evaluate psi */
  inline void evaluate_v(const pos_type &p)
  {
    //
    auto u = Lattice.toUnit(p);
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_v(einsplines[i], u[0], u[1], u[2], psi[i]->data(), psi[i]->size());
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
  }
  /** evaluate psi */
  inline void evaluate_v_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_v(einsplines[i], u[0], u[1], u[2], psi[i]->data(), psi[i]->size());
  }
  /** evaluate psi, grad and lap */
  inline void evaluate_vgl(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgl(einsplines[i], u[0], u[1], u[2],
                                  psi[i]->data(), grad[i]->data(), hess[i]->data(),
                                  psi[i]->size());
  }
  /** evaluate psi, grad and lap */
  inline void evaluate_vgl_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgl(einsplines[i], u[0], u[1], u[2],
                                  psi[i]->data(), grad[i]->data(), hess[i]->data(),
                                  psi[i]->size());
  }    
  //
  /** evaluate psi, grad and hess */
  inline void evaluate_vgh(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    //
    //
    for (int i = 0; i < nBlocks; ++i)
      einsplines[i]->evaluate_vgh(u, *psi[i], *grad[i], *hess[i]);
      //     
      //
      //
      //      
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
      //
  }
  /** evaluate psi, grad and hess */
  inline void evaluate_vgh_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgh(einsplines[i], u[0], u[1], u[2],
                                  psi[i]->data(), grad[i]->data(), hess[i]->data(),
                                  psi[i]->size());
  }
  //
}

struct einspline_spo
{
  //
  /** evaluate psi */
  inline void evaluate_v(const pos_type &p)
  {
    pos = p;
    is_copy = true;
    compute_engine.copy_A44();
    if(nSplines > nSplinesSerialThreshold_V)
      Kokkos::parallel_for("EinsplineSPO::evalute_v_parallel",policy_v_parallel_t(nBlocks,1,32),*this);
    else
      Kokkos::parallel_for("EinsplineSPO::evalute_v_serial",policy_v_serial_t(nBlocks,1,32),*this);
    is_copy = false;
  }
  KOKKOS_INLINE_FUNCTION
  void operator() (const EvaluateVTag&, const team_v_serial_t& team ) const {
    int block = team.league_rank();
    // Need KokkosInlineFunction on Tensor and TinyVector ....
    auto u = Lattice.toUnit(pos);
    //
    compute_engine.evaluate_v(team,&einsplines[block], u[0], u[1], u[2],
                              psi(block).data(), psi(block).extent(0));
  }
  //
  KOKKOS_INLINE_FUNCTION
  void operator() (const EvaluateVTag&, const team_v_parallel_t& team ) const {
    int block = team.league_rank();
    // Need KokkosInlineFunction on Tensor and TinyVector ....
    auto u = Lattice.toUnit(pos);
  //
   compute_engine.evaluate_v(team,&einsplines[block], u[0], u[1], u[2],
                             psi(block).data(), psi(block).extent(0));
  }
  /** evaluate psi */
  inline void evaluate_v_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_v(&einsplines[i], u[0], u[1], u[2], &psi(i,0), psi.extent(0));
  }
  /** evaluate psi, grad and lap */
  inline void evaluate_vgl(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgl(&einsplines[i], u[0], u[1], u[2],
                                  psi(i).data(), grad(i).data(), hess(i).data(),
                                  psi(i).extent(0));
  }
  /** evaluate psi, grad and lap */
  inline void evaluate_vgl_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgl(&einsplines[i], u[0], u[1], u[2],
                                  psi(i).data(), grad(i).data(), hess(i).data(),
                                  psi(i).extent(0));
  }    
  //    
  /** evaluate psi, grad and hess */
  inline void evaluate_vgh(const pos_type &p)
  {
    pos = p;
    is_copy = true;
    compute_engine.copy_A44();
    if(nSplines > nSplinesSerialThreshold_VGH)
      Kokkos::parallel_for("EinsplineSPO::evalute_vgh",policy_vgh_parallel_t(nBlocks,1,32),*this);
    else
      Kokkos::parallel_for("EinsplineSPO::evalute_vgh",policy_vgh_serial_t(nBlocks,1,32),*this);
    is_copy = false;
  }
  //
  KOKKOS_INLINE_FUNCTION
  void operator() (const EvaluateVGHTag&, const team_vgh_parallel_t& team ) const {
    int block = team.league_rank();
    auto u = Lattice.toUnit(pos);
    compute_engine.evaluate_vgh(team,&einsplines[block], u[0], u[1], u[2],
                                      psi(block).data(), grad(block).data(), hess(block).data(),
                                      psi(block).extent(0));
  }
  //
  KOKKOS_INLINE_FUNCTION
  void operator() (const EvaluateVGHTag&, const team_vgh_serial_t& team ) const {
    int block = team.league_rank();
    auto u = Lattice.toUnit(pos);
    compute_engine.evaluate_vgh(team,&einsplines[block], u[0], u[1], u[2],
                                     psi(block).data(), grad(block).data(), hess(block).data(),
                                     psi(block).extent(0));
  }    
  /** evaluate psi, grad and hess */
  inline void evaluate_vgh_pfor(const pos_type &p)
  {
    auto u = Lattice.toUnit(p);
    #pragma omp for nowait
    for (int i = 0; i < nBlocks; ++i)
      compute_engine.evaluate_vgh(&einsplines[i], u[0], u[1], u[2],
                                  psi(i).data(), grad(i).data(), hess(i).data(),
                                  psi(i).extent(0));
  }
  //
}

Jastrow kernel

Data structures

In src/Numerics/LinearFit.h: Add new function LinearFit for Kokkos::vector.

//
//
#include "Numerics/DeterminantOperators.h"
//
//
namespace qmcplusplus
{
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //
  //  
  //  
  //  
  //  
  //  
  //  
  //  
  //  
  //    
  //  
  //  
  //  
  //    
  //  
  //  
  //  
  //  
  //  
  //
}
//

//
//
#include "Numerics/DeterminantOperators.h"
#include <Kokkos_Vector.hpp>
//
namespace qmcplusplus
{
  template <typename T>
  inline void LinearFit(std::vector<T> &y, Matrix<T> &A, Kokkos::vector<T> &coefs)
  {
    int N = A.size(0);
    int M = A.size(1);
    if (y.size() != N)
    {
      app_error() << "Differernt number of rows in basis functions that in data "
                  << "points in LinearFit.  Exitting.\n";
      abort();
    }
    // Construct alpha matrix
    Matrix<T> alpha(M, M);
    alpha = 0.0;
    for (int j = 0; j < M; j++)
      for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
          alpha(k, j) += A(i, j) * A(i, k);
    // Next, construct beta vector
    std::vector<T> beta(M, 0.0);
    for (int k = 0; k < M; k++)
      for (int i = 0; i < N; i++)
        beta[k] += y[i] * A(i, k);
    invert_matrix(alpha, false);
    coefs.resize(M, 0.0);
    for (int i = 0; i < M; i++)
      for (int j = 0; j < M; j++)
        coefs[i] += alpha(i, j) * beta[j];
  }
}
//

In src/QMCWaveFunctions/Jastrow/BsplineFunctor.h:

#include <cstdio>
//
//  
 /*!
  * @file BsplineFunctor.h
 */
//
namespace qmcplusplus
{
//
template <class T> struct BsplineFunctor : public OptimizableFunctorBase
{
  typedef real_type value_type;
  int NumParams;
  int Dummy;
  const TinyVector<real_type, 16> A, dA, d2A, d3A;
  aligned_vector<real_type> SplineCoefs;
  //
  // static const real_type A[16], dA[16], d2A[16];
  real_type DeltaR, DeltaRInv;
  real_type CuspValue;
  real_type Y, dY, d2Y;
  // Stores the derivatives w.r.t. SplineCoefs
  // of the u, du/dr, and d2u/dr2
  std::vector<TinyVector<real_type, 3>> SplineDerivs;
  std::vector<real_type> Parameters;
  std::vector<std::string> ParameterNames;
  std::string elementType, pairType;
  std::string fileName;
  //
  int ResetCount;
  bool notOpt;
  bool periodic;
  //
  //
  //
  void evaluateVGL(const int iStart, const int iEnd, 
                   const T* _distArray,  T* restrict _valArray,
                   T* restrict _gradArray, T* restrict _laplArray, 
                   T* restrict distArrayCompressed, int* restrict distIndices ) const;
};
//
//
}

#include <cstdio>
#include <Kokkos_Vector.hpp>
//  
 /*!
  * @file BsplineFunctor.h
 */
//
namespace qmcplusplus
{
//
template <class T> struct BsplineFunctor : public OptimizableFunctorBase
{
  typedef real_type value_type;
  int NumParams;
  int Dummy;
  TinyVector<real_type, 16> A, dA, d2A, d3A;
  Kokkos::vector<real_type> SplineCoefs;
  //
  // static const real_type A[16], dA[16], d2A[16];
  real_type DeltaR, DeltaRInv;
  real_type CuspValue;
  real_type Y, dY, d2Y;
  // Stores the derivatives w.r.t. SplineCoefs
  // of the u, du/dr, and d2u/dr2
  Kokkos::vector<TinyVector<real_type, 3>> SplineDerivs;
  Kokkos::vector<real_type> Parameters;
  Kokkos::vector<std::string> ParameterNames;
  std::string elementType, pairType;
  std::string fileName;
  //
  int ResetCount;
  bool notOpt;
  bool periodic;
  //
  //
  template <typename TeamType>
  void evaluateVGL(const TeamType& team, const int iStart, const int iEnd,
                   const T* _distArray,  T* restrict _valArray,
                   T* restrict _gradArray, T* restrict _laplArray, 
                   T* restrict distArrayCompressed, int* restrict distIndices ) const;
};
//
//
}

Parallel execution

Assessment

We assess our Kokkos implementation of the spline and Jastrow kernels on the following architectures.

Many-core architecture: Intel Xeon Phi (KNL)
GPU: Nvidia K40

Spline kernel

Jastrow kernel

Conclusions and Future Work

Acknowledgements

This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations—the Office of Science and the National Nuclear Security Administration responsible for the planning and preparation of a capable exascale ecosystem including software, applications, hardware, advanced system engineering, and early testbed platforms to support the nation’s exascale computing imperative.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ECP Milestone 1.7: Kokkos contribution

Introduction

Implementation

Build System

Walker Parallelization Strategy

Spline kernel

Data structures

Parallel execution

Jastrow kernel

Data structures

Parallel execution

Assessment

Spline kernel

Jastrow kernel

Conclusions and Future Work

Acknowledgements

Clone this wiki locally