
One Code Branch

Peter Doak edited this page May 2, 2019 · 2 revisions

The original goal was for all four back-end combinations to work from one code branch: CPU, CUDA, KOKKOS CPU, and KOKKOS CUDA. Currently only the following combinations work:

CPU + CUDA

Tested on x86 Broadwell.

Environment

  1) env/cades-cnms                   5) hdf5-1.10.4-gcc-6.5.0-4gmsnjn    9) gcc-6.5.0-gcc-8.2.0-egooyqw
  2) mpich-3.3-gcc-6.5.0-6zgajlw      6) cmake-3.13.4-gcc-6.5.0-q76ndqk  10) cuda/9.2
  3) ninja-1.6.0-gcc-4.8.5-gzwd46m    7) git-2.12.1-gcc-5.3.0-kibjjo6
  4) emacs-25.3-gcc-5.3.0-qp7x25b     8) fftw-3.3.8-gcc-6.5.0-kpdartc

Build and run

your-prompt$ mkdir build_cpu_cuda
your-prompt$ cd build_cpu_cuda
your-prompt$ rm -rf *
your-prompt$ export CUDA_DIR=/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/cuda/9.2/centos7.2_binary
your-prompt$ LDFLAGS="-L${CUDA_DIR}/lib64 -Wl,-rpath,${CUDA_DIR}/lib64" cmake \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCUDA_TOOLKIT_ROOT_DIR=${CUDA_DIR} \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
    -DQMC_USE_CUDA=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -GNinja \
    -DCUDA_NVCC_FLAGS="-std=c++14;-arch=sm_60;-Drestrict=__restrict__;-DNO_CUDA_MAIN;-O3;--default-stream=per-thread;-Xptxas;-v" \
    -DENABLE_TIMERS=1 \
    ..
your-prompt$ ninja
your-prompt$ bin/miniqmc -h
usage:
  miniqmc   [-bhjvV] [-g "n0 n1 n2"] [-m meshfactor]
            [-n steps] [-N substeps] [-x rmax]
            [-r AcceptanceRatio] [-s seed] [-w walkers]
            [-a tile_size] [-t timer_level]
options:
  -a  splines per spline block       default: num of orbs
  -b  use reference implementations  default: off
  -g  set the 3D tiling.             default: 1 1 1
  -h  print help and exit
  -j  enable three body Jastrow      default: off
  -m  meshfactor                     default: 1.0
  -M  Crowd implementation          default: off
  -n  number of MC steps             default: 5
  -N  number of MC substeps          default: 1
  -p  pack size for batching         default: 1
  -r  set the acceptance ratio.      default: 0.5
  -s  set the random seed.           default: 11
  -t  timer level: coarse or fine    default: fine
  -w  number of walker(movers)       default: 1
  -v  verbose output
  -V  print version information and exit
  -x  set the Rmax.                  default: 1.7
  -z  number of crews for walker partitioning.   default: 1
  -d  device implementation.         default: CPU
      Available devices:
                         0.  CPU
                         1.  CUDA
your-prompt$ bin/miniqmc -d0 -p2 -w16 -M -g '2 1 1' -a 128
... (lengthy trace/debug output omitted, apologies)
========== Throughput ============

Total throughput ( N_walkers * N_elec^3 / Total time ) = 1.86866e+09
Diffusion throughput ( N_walkers * N_elec^3 / Diffusion time ) = 2.23139e+09
Pseudopotential throughput ( N_walkers * N_elec^2 / Pseudopotential time ) = 4.75486e+06

Stack timer profile in seconds
Timer                             Inclusive_time  Exclusive_time  Calls       Time_per_call
Total                                7.7572     0.6206              1       7.757153988
  Diffusion                          6.4962     0.0077              5       1.299234009
    Current Gradient                 0.0006     0.0006           3840       0.000000148
    Kinetic Energy                   0.0002     0.0001             10       0.000017881
      Determinant                    0.0000     0.0000             10       0.000000095
      OneBodyJastrow                 0.0001     0.0001             10       0.000006056
      TwoBodyJastrow                 0.0000     0.0000             10       0.000002980
    Make move                        0.2208     0.2208           7680       0.000028749
    New Gradient                     0.1545     0.0034           3840       0.000040241
      Determinant                    0.0336     0.0336           7680       0.000004376
      OneBodyJastrow                 0.0063     0.0063           7680       0.000000823
      TwoBodyJastrow                 0.1112     0.1112           7680       0.000014482
    Pseudopotential                  3.9695     0.0000              5       0.793898249
      Value                          3.9695     0.0572              5       0.793898058
        Determinant                  0.3329     0.3329          80052       0.000004158
        Make move                    2.2903     2.2903          80052       0.000028610
        OneBodyJastrow               0.0344     0.0344          80052       0.000000429
        Single-Particle Orbitals     0.9400     0.0096          80052       0.000011743
          Eval V                     0.9304     0.9304          80052       0.000011622
        TwoBodyJastrow               0.3148     0.3148          80052       0.000003932
    Set active                       0.2263     0.2263           7680       0.000029465
    Spline Hessian Evaluation        0.6379     0.0018           3840       0.000166115
      Single-Particle Orbitals       0.6361     0.6361           7680       0.000082825
    Update                           1.2787     0.0040           7680       0.000166502
      Accept move                    0.0042     0.0042           3837       0.000001095
      Determinant                    1.2045     1.2045           3837       0.000313920
      OneBodyJastrow                 0.0005     0.0005           3837       0.000000131
      TwoBodyJastrow                 0.0656     0.0656           3837       0.000017087
  Initialization                     0.6404     0.6404              1       0.640361071
your-prompt$ bin/miniqmc -d1 -p2 -w16 -M -g '2 1 1' -a 128
========== Throughput ============

Total throughput ( N_walkers * N_elec^3 / Total time ) = 2.55167e+09
Diffusion throughput ( N_walkers * N_elec^3 / Diffusion time ) = 4.55671e+09
Pseudopotential throughput ( N_walkers * N_elec^2 / Pseudopotential time ) = 2.75715e+07

Stack timer profile in seconds
Timer                          Inclusive_time  Exclusive_time  Calls       Time_per_call
Total                             5.6808     0.6387              1       5.680799961
  Diffusion                       3.1811     0.0076              5       0.636226749
    Current Gradient              0.0005     0.0005           3840       0.000000137
    Kinetic Energy                0.0002     0.0001             10       0.000015736
      Determinant                 0.0000     0.0000             10       0.000000119
      OneBodyJastrow              0.0000     0.0000             10       0.000004649
      TwoBodyJastrow              0.0000     0.0000             10       0.000004220
    Make move                     0.2204     0.2204           7680       0.000028701
    New Gradient                  0.1568     0.0034           3840       0.000040824
      Determinant                 0.0338     0.0338           7680       0.000004407
      OneBodyJastrow              0.0065     0.0065           7680       0.000000843
      TwoBodyJastrow              0.1130     0.1130           7680       0.000014720
    Pseudopotential               0.6846     0.0000              5       0.136911964
      Value                       0.6846     0.0599              5       0.136911011
        Determinant               0.3159     0.3159          80052       0.000003946
        OneBodyJastrow            0.0201     0.0201          80052       0.000000251
        TwoBodyJastrow            0.2886     0.2886          80052       0.000003605
    Set active                    0.2265     0.2265           7680       0.000029486
    Spline Hessian Evaluation     0.6064     0.6064           3840       0.000157919
    Update                        1.2783     0.0049           7680       0.000166440
      Accept move                 0.0045     0.0045           3837       0.000001165
      Determinant                 1.2021     1.2021           3837       0.000313279
      FinishUpdate                0.0005     0.0005           7680       0.000000067
      OneBodyJastrow              0.0005     0.0005           3837       0.000000121
      TwoBodyJastrow              0.0659     0.0659           3837       0.000017177
  Initialization                  1.8610     1.8610              1       1.860973835
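
To make the two profiles above easier to compare, here is a minimal Python sketch that parses a handful of rows copied from the CPU (-d0) and CUDA (-d1) runs and prints the ratio of inclusive times. The helper names parse_profile and speedups are illustrative only and are not part of miniqmc. The sketch assumes the five-column layout shown above and timer names that are unique within the sample; in a full profile, names such as Determinant repeat at several nesting levels and would need the indentation tracked as well.

```python
# Sketch: compare selected timers from the two profiles quoted above.
# Rows are copied verbatim from the -d0 (CPU) and -d1 (CUDA) runs.
# parse_profile and speedups are hypothetical helpers, not miniqmc code.

CPU_ROWS = """\
Total                                7.7572     0.6206              1       7.757153988
  Diffusion                          6.4962     0.0077              5       1.299234009
    Pseudopotential                  3.9695     0.0000              5       0.793898249
    Spline Hessian Evaluation        0.6379     0.0018           3840       0.000166115
    Update                           1.2787     0.0040           7680       0.000166502
  Initialization                     0.6404     0.6404              1       0.640361071
"""

CUDA_ROWS = """\
Total                             5.6808     0.6387              1       5.680799961
  Diffusion                       3.1811     0.0076              5       0.636226749
    Pseudopotential               0.6846     0.0000              5       0.136911964
    Spline Hessian Evaluation     0.6064     0.6064           3840       0.000157919
    Update                        1.2783     0.0049           7680       0.000166440
  Initialization                  1.8610     1.8610              1       1.860973835
"""


def parse_profile(text):
    """Return {timer name: inclusive time in seconds}.

    Assumes each timer name appears only once in the given text.
    """
    timers = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Timer names can contain spaces, so peel the four numeric
        # columns off from the right-hand side.
        name, inclusive, _exclusive, _calls, _per_call = line.rsplit(None, 4)
        timers[name.strip()] = float(inclusive)
    return timers


def speedups(cpu, cuda):
    """Return {timer name: CPU time / CUDA time} for timers in both runs."""
    return {
        name: cpu[name] / cuda[name]
        for name in cpu
        if name in cuda and cuda[name] > 0.0
    }


cpu = parse_profile(CPU_ROWS)
cuda = parse_profile(CUDA_ROWS)
for name, ratio in speedups(cpu, cuda).items():
    print(f"{name:28s} {cpu[name]:8.4f} {cuda[name]:8.4f} {ratio:6.2f}x")
```

On the numbers quoted above, the gain comes almost entirely from the Pseudopotential section (3.9695 s down to 0.6846 s), while Update is essentially unchanged and Initialization takes longer with CUDA (0.6404 s versus 1.8610 s).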

CPU + KOKKOS (OpenMP): next up