One Code Branch
Peter Doak edited this page May 2, 2019 · 2 revisions
The original goal was for the CPU, CUDA, KOKKOS CPU, and KOKKOS CUDA builds to work. Currently only the following combinations work:
On x86 Broadwell, with these modules loaded:
1) env/cades-cnms 5) hdf5-1.10.4-gcc-6.5.0-4gmsnjn 9) gcc-6.5.0-gcc-8.2.0-egooyqw
2) mpich-3.3-gcc-6.5.0-6zgajlw 6) cmake-3.13.4-gcc-6.5.0-q76ndqk 10) cuda/9.2
3) ninja-1.6.0-gcc-4.8.5-gzwd46m 7) git-2.12.1-gcc-5.3.0-kibjjo6
4) emacs-25.3-gcc-5.3.0-qp7x25b 8) fftw-3.3.8-gcc-6.5.0-kpdartc
your-prompt$ mkdir build_cpu_cuda
your-prompt$ cd build_cpu_cuda
your-prompt$ rm -rf *
your-prompt$ export CUDA_DIR=/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/cuda/9.2/centos7.2_binary
your-prompt$ LDFLAGS="-L${CUDA_DIR}/lib64 -Wl,-rpath,${CUDA_DIR}/lib64" cmake \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCUDA_TOOLKIT_ROOT_DIR=${CUDA_DIR} \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
    -DQMC_USE_CUDA=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -GNinja \
    -DCUDA_NVCC_FLAGS="-std=c++14;-arch=sm_60;-Drestrict=__restrict__;-DNO_CUDA_MAIN;-O3;--default-stream=per-thread;-Xptxas;-v" \
    -DENABLE_TIMERS=1 \
    ..
your-prompt$ ninja
your-prompt$ bin/miniqmc -h
usage:
miniqmc [-bhjvV] [-g "n0 n1 n2"] [-m meshfactor]
[-n steps] [-N substeps] [-x rmax]
[-r AcceptanceRatio] [-s seed] [-w walkers]
[-a tile_size] [-t timer_level]
options:
-a splines per spline block default: num of orbs
-b use reference implementations default: off
-g set the 3D tiling. default: 1 1 1
-h print help and exit
-j enable three body Jastrow default: off
-m meshfactor default: 1.0
-M Crowd implementation default: off
-n number of MC steps default: 5
-N number of MC substeps default: 1
-p pack size for batching default: 1
-r set the acceptance ratio. default: 0.5
-s set the random seed. default: 11
-t timer level: coarse or fine default: fine
-w number of walker(movers) default: 1
-v verbose output
-V print version information and exit
-x set the Rmax. default: 1.7
-z number of crews for walker partitioning. default: 1
-d device implementation. default: CPU
Available devices:
0. CPU
1. CUDA
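The two benchmark runs on this page differ only in the -d device index. As a minimal sketch, assuming the flag meanings printed in the help text above, the command line can be assembled like this. The helper name miniqmc_args and its keyword names are hypothetical and not part of miniQMC.

```python
import shlex


def miniqmc_args(device, pack=2, walkers=16, tiling=(2, 1, 1),
                 tile_size=128, crowd=True, exe="bin/miniqmc"):
    """Build an argv list for one miniqmc run (hypothetical helper)."""
    args = [exe, f"-d{device}", f"-p{pack}", f"-w{walkers}"]
    if crowd:
        args.append("-M")  # -M: Crowd implementation
    args += ["-g", " ".join(str(n) for n in tiling)]  # -g: 3D tiling
    args += ["-a", str(tile_size)]  # -a: splines per spline block
    return args


if __name__ == "__main__":
    # Device indices follow the "Available devices" list: 0 = CPU, 1 = CUDA.
    for device in (0, 1):
        print(shlex.join(miniqmc_args(device)))
```

Returning a list rather than a single string means the result can be passed to subprocess.run without shell quoting concerns; shlex.join is used only to print it.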
your-prompt$ bin/miniqmc -d0 -p2 -w16 -M -g '2 1 1' -a 128
... (extensive trace/debug output omitted, apologies)
========== Throughput ============
Total throughput ( N_walkers * N_elec^3 / Total time ) = 1.86866e+09
Diffusion throughput ( N_walkers * N_elec^3 / Diffusion time ) = 2.23139e+09
Pseudopotential throughput ( N_walkers * N_elec^2 / Pseudopotential time ) = 4.75486e+06
Stack timer profile in seconds
Timer Inclusive_time Exclusive_time Calls Time_per_call
Total 7.7572 0.6206 1 7.757153988
Diffusion 6.4962 0.0077 5 1.299234009
Current Gradient 0.0006 0.0006 3840 0.000000148
Kinetic Energy 0.0002 0.0001 10 0.000017881
Determinant 0.0000 0.0000 10 0.000000095
OneBodyJastrow 0.0001 0.0001 10 0.000006056
TwoBodyJastrow 0.0000 0.0000 10 0.000002980
Make move 0.2208 0.2208 7680 0.000028749
New Gradient 0.1545 0.0034 3840 0.000040241
Determinant 0.0336 0.0336 7680 0.000004376
OneBodyJastrow 0.0063 0.0063 7680 0.000000823
TwoBodyJastrow 0.1112 0.1112 7680 0.000014482
Pseudopotential 3.9695 0.0000 5 0.793898249
Value 3.9695 0.0572 5 0.793898058
Determinant 0.3329 0.3329 80052 0.000004158
Make move 2.2903 2.2903 80052 0.000028610
OneBodyJastrow 0.0344 0.0344 80052 0.000000429
Single-Particle Orbitals 0.9400 0.0096 80052 0.000011743
Eval V 0.9304 0.9304 80052 0.000011622
TwoBodyJastrow 0.3148 0.3148 80052 0.000003932
Set active 0.2263 0.2263 7680 0.000029465
Spline Hessian Evaluation 0.6379 0.0018 3840 0.000166115
Single-Particle Orbitals 0.6361 0.6361 7680 0.000082825
Update 1.2787 0.0040 7680 0.000166502
Accept move 0.0042 0.0042 3837 0.000001095
Determinant 1.2045 1.2045 3837 0.000313920
OneBodyJastrow 0.0005 0.0005 3837 0.000000131
TwoBodyJastrow 0.0656 0.0656 3837 0.000017087
Initialization 0.6404 0.6404 1 0.640361071
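For comparing runs it can help to pull the numbers out of a report like the one above. The following is a rough sketch that assumes the line layout shown here (a "throughput ( ... ) = value" line, and timer rows ending in four numeric columns); the function name parse_report is hypothetical, and the patterns may need adjusting for other miniqmc versions.

```python
import re

# "<Name> throughput ( ... ) = <value>"
THROUGHPUT_RE = re.compile(r"^(\w+) throughput \(.*\) = (\S+)$")
# "<timer name> <inclusive> <exclusive> <calls> <time per call>"
TIMER_RE = re.compile(
    r"^(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)\s+(\d+\.\d+)$")


def parse_report(text):
    """Return (throughput dict, list of timer rows) from report text."""
    throughput, timers = {}, []
    for line in text.splitlines():
        line = line.strip()
        m = THROUGHPUT_RE.match(line)
        if m:
            throughput[m.group(1)] = float(m.group(2))
            continue
        m = TIMER_RE.match(line)
        if m:
            name, incl, excl, calls, per_call = m.groups()
            timers.append((name, float(incl), float(excl),
                           int(calls), float(per_call)))
    return throughput, timers


# A few lines copied from the CPU run above.
sample = """\
Total throughput ( N_walkers * N_elec^3 / Total time ) = 1.86866e+09
Timer Inclusive_time Exclusive_time Calls Time_per_call
Total 7.7572 0.6206 1 7.757153988
Single-Particle Orbitals 0.9400 0.0096 80052 0.000011743
"""

if __name__ == "__main__":
    throughput, timers = parse_report(sample)
    print(throughput)
    for row in timers:
        print(row)
```

Timer rows are kept in a list rather than a dict because names such as Determinant appear several times in one profile, once per parent timer.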
your-prompt$ bin/miniqmc -d1 -p2 -w16 -M -g '2 1 1' -a 128
========== Throughput ============
Total throughput ( N_walkers * N_elec^3 / Total time ) = 2.55167e+09
Diffusion throughput ( N_walkers * N_elec^3 / Diffusion time ) = 4.55671e+09
Pseudopotential throughput ( N_walkers * N_elec^2 / Pseudopotential time ) = 2.75715e+07
Stack timer profile in seconds
Timer Inclusive_time Exclusive_time Calls Time_per_call
Total 5.6808 0.6387 1 5.680799961
Diffusion 3.1811 0.0076 5 0.636226749
Current Gradient 0.0005 0.0005 3840 0.000000137
Kinetic Energy 0.0002 0.0001 10 0.000015736
Determinant 0.0000 0.0000 10 0.000000119
OneBodyJastrow 0.0000 0.0000 10 0.000004649
TwoBodyJastrow 0.0000 0.0000 10 0.000004220
Make move 0.2204 0.2204 7680 0.000028701
New Gradient 0.1568 0.0034 3840 0.000040824
Determinant 0.0338 0.0338 7680 0.000004407
OneBodyJastrow 0.0065 0.0065 7680 0.000000843
TwoBodyJastrow 0.1130 0.1130 7680 0.000014720
Pseudopotential 0.6846 0.0000 5 0.136911964
Value 0.6846 0.0599 5 0.136911011
Determinant 0.3159 0.3159 80052 0.000003946
OneBodyJastrow 0.0201 0.0201 80052 0.000000251
TwoBodyJastrow 0.2886 0.2886 80052 0.000003605
Set active 0.2265 0.2265 7680 0.000029486
Spline Hessian Evaluation 0.6064 0.6064 3840 0.000157919
Update 1.2783 0.0049 7680 0.000166440
Accept move 0.0045 0.0045 3837 0.000001165
Determinant 1.2021 1.2021 3837 0.000313279
FinishUpdate 0.0005 0.0005 7680 0.000000067
OneBodyJastrow 0.0005 0.0005 3837 0.000000121
TwoBodyJastrow 0.0659 0.0659 3837 0.000017177
Initialization 1.8610 1.8610 1 1.860973835
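The two profiles can be compared directly by dividing the CPU inclusive times by the CUDA ones. A small sketch, with the numbers copied by hand from the -d0 and -d1 runs above; the function name speedups is hypothetical.

```python
# Inclusive times in seconds, copied from the two timer profiles above.
cpu = {"Total": 7.7572, "Diffusion": 6.4962, "Pseudopotential": 3.9695,
       "Update": 1.2787, "Initialization": 0.6404}
cuda = {"Total": 5.6808, "Diffusion": 3.1811, "Pseudopotential": 0.6846,
        "Update": 1.2783, "Initialization": 1.8610}


def speedups(baseline, candidate):
    """Ratio baseline/candidate for every timer present in both runs."""
    return {name: baseline[name] / candidate[name]
            for name in baseline if name in candidate}


if __name__ == "__main__":
    for name, ratio in speedups(cpu, cuda).items():
        print(f"{name:16s} {ratio:5.2f}x")
```

With these numbers the ratios come out to roughly 1.37 for Total, 2.04 for Diffusion, and 5.80 for Pseudopotential. Update is essentially unchanged at about 1.00, and Initialization is about 0.34, meaning the CUDA run spends longer initializing.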
Next up: CPU + KOKKOS (OpenMP).