- Jastrow 2-body kernel, distance tables kernel(s), 1D bspline.
- 3D bspline kernel.
- easy-to-use self-checking capability (compile-time switch okay, runtime is nicer)
- timings for the individual kernels (not necessarily nested, not switchable - always on)
- shrink the codebase as much as possible - ONLY code necessary to run the kernels of interest
- inverse update kernel
- Jastrow 1-body, 3-body(e-e-I) kernels
- high-level (physics) description for each kernel (we can probably recruit help for this)
- doxygen comments in the code
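To make the kernel list concrete, here is a minimal sketch of what a two-body Jastrow evaluation looks like: a pair-correlation function u(r) summed over all electron pairs. All names (`Vec3`, `jastrow2_logpsi`) and the simple Pade-like u(r) are illustrative only; the actual miniapp kernels use 1D B-spline functors, distance tables, and also accumulate gradients and Laplacians.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative particle position type; the real code uses its own containers.
struct Vec3 { double x, y, z; };

inline double dist(const Vec3& a, const Vec3& b)
{
  const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// u(r) = A/r * (1 - exp(-r/F)): a simple Pade-like form used here only for
// illustration; the miniapp evaluates u(r) through 1D B-spline functors.
inline double u_of_r(double r, double A, double F)
{
  return A / r * (1.0 - std::exp(-r / F));
}

// log of the two-body Jastrow factor: log J2 = -sum_{i<j} u(r_ij)
double jastrow2_logpsi(const std::vector<Vec3>& R, double A, double F)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < R.size(); ++i)
    for (std::size_t j = i + 1; j < R.size(); ++j)
      sum -= u_of_r(dist(R[i], R[j]), A, F);
  return sum;
}
```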
- Evaluate programming models for performance portability
- Explore alternative algorithms and data structures for critical kernels
- Collaborate with non-QMCPACK developers, e.g. vendors and ECP-ST projects
- Solve reintegration issues within miniapp before bringing new developments back to QMCPACK
- Make an initial release/handoff to the OpenMP and Kokkos assessors quickly enough that the ECP milestones can be delivered by the end of September 2017.
- Evolve the capabilities iteratively over a series of releases
- Continue to explore new language features, libraries, and high-risk ideas efficiently
- Benchmark new hardware/software platforms.
- Serve as a new level of tests above the unit tests.
- Collaborate with non-QMCPACK developers
- Easier training for new project members and students
- As simple to use as possible to ensure maximum productivity
- Simple build system based on flat Makefiles
- Reasonable default options and an easy-to-use command-line interface - no input files, for easier scriptability
- Self timing with breakdown by kernel
- Output in an easily parseable text format, e.g. gnuplot columns or CSV
- No large data or auxiliary files
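The self-timing and parseable-output requirements above can be sketched together: accumulate per-kernel wall time and print one CSV row per kernel for post-processing with gnuplot, Python, etc. The class and method names (`KernelTimers`, `ScopedTimer`, `report`) are hypothetical, not the miniapp's actual API.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-kernel timing accumulator with CSV output.
class KernelTimers
{
public:
  void add(const std::string& kernel, double seconds) { total_[kernel] += seconds; }

  double seconds(const std::string& kernel) const
  {
    auto it = total_.find(kernel);
    return it == total_.end() ? 0.0 : it->second;
  }

  // One CSV row per kernel: trivially parseable by gnuplot, Python, etc.
  void report(std::FILE* out) const
  {
    std::fprintf(out, "kernel,seconds\n");
    for (const auto& kv : total_)
      std::fprintf(out, "%s,%.6f\n", kv.first.c_str(), kv.second);
  }

private:
  std::map<std::string, double> total_;
};

// RAII helper: times a scope and charges it to one kernel.
class ScopedTimer
{
public:
  ScopedTimer(KernelTimers& t, std::string kernel)
      : timers_(t), kernel_(std::move(kernel)), start_(std::chrono::steady_clock::now())
  {}
  ~ScopedTimer()
  {
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start_;
    timers_.add(kernel_, dt.count());
  }

private:
  KernelTimers& timers_;
  std::string kernel_;
  std::chrono::steady_clock::time_point start_;
};
```

Because the timers are always on (per the requirement above), the accumulation path should be cheap: one map lookup per timed scope, with no I/O until `report()` is called at the end of the run.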
- As simple to develop as possible to ensure maximum productivity
- Commented source code, doxygen headers on functions.
- Fully internal self-checking for the most efficient development workflow
- Can be switched on/off at runtime
- When switched off, no residual effects on timings or memory usage
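A minimal sketch of a runtime-switchable self-check satisfying the two points above: the comparison against a reference value runs only when checking is enabled, so a disabled check costs a single branch and leaves timings and memory untouched. `CheckConfig` and `check_close` are illustrative names, not the miniapp's actual interface.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical runtime switch for self-checking, e.g. set from a
// command-line flag; off by default so production timings are unaffected.
struct CheckConfig
{
  bool enabled = false;
  double tol = 1e-10;
};

// Compares a kernel result against a reference value.
// Returns true if the check passes (or is disabled).
inline bool check_close(const CheckConfig& cfg, const char* what, double value, double reference)
{
  if (!cfg.enabled)
    return true; // switched off: one branch, no residual work or storage

  const bool ok = std::fabs(value - reference) <= cfg.tol * std::max(1.0, std::fabs(reference));
  if (!ok)
    std::fprintf(stderr, "self-check FAILED: %s value=%g reference=%g\n", what, value, reference);
  return ok;
}
```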
- Minimal/No library dependency
- To keep building/maintenance as simple as possible
- BLAS is okay
- Avoid HDF5, FFTW, BOOST, libXML etc.
- No MPI parallelization
- As small as possible - only has the code necessary to run the kernels of interest
- Minimal functionality, e.g. only value and combined value-gradient-Laplacian for wavefunction components.
- Kernel orthogonality
- Each is hackable/tunable without getting into details of the others
- This is required for novel hardware and software technology assessments
- Keep significant implementation out of the coupling/shared code
- Driver and shared data structures rarely/never need changes
- Easy for newcomers to work on isolated parts of the code
- C++ productivity without over-engineering
- Abstractions hide data and implementation for safety and convenience
- Loosely coupled instead of deep hierarchies for hackability
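The kernel-orthogonality and loose-coupling points above can be sketched as a small abstract interface: each wavefunction component implements it, and the driver sums over components without seeing any implementation details, so one kernel can be rewritten without touching the others. All types here (`WaveFunctionComponent`, `ConstantComponent`, `total_log`) are illustrative, not the QMCPACK API.

```cpp
#include <memory>
#include <vector>

// Placeholder for the driver-owned shared data (positions, distance
// tables, ...); the coupling code rarely/never needs changes.
struct ParticleSet {};

// Minimal interface: value only here; the miniapp would also expose a
// combined value-gradient-Laplacian entry point.
class WaveFunctionComponent
{
public:
  virtual ~WaveFunctionComponent() = default;
  virtual double evaluateLog(const ParticleSet& P) = 0;
};

// Trivial component used only to demonstrate the interface.
class ConstantComponent : public WaveFunctionComponent
{
public:
  explicit ConstantComponent(double v) : value_(v) {}
  double evaluateLog(const ParticleSet&) override { return value_; }

private:
  double value_;
};

// Driver-side loop: never changes when a kernel implementation changes.
double total_log(const std::vector<std::unique_ptr<WaveFunctionComponent>>& comps,
                 const ParticleSet& P)
{
  double sum = 0.0;
  for (const auto& c : comps)
    sum += c->evaluateLog(P);
  return sum;
}
```

A flat interface like this, rather than a deep class hierarchy, is what makes each kernel hackable in isolation for hardware and software technology assessments.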
- Communicates our intentions and needs from C++, libraries, etc.
- Uses C++ features - write the code how we would like to see it
- Best possible algorithms for computational and memory-footprint scaling, but easy-to-understand implementations without optimization
- Main entry point for new collaborators
- Starting point for exploring optimizations
- No premature optimization or lowered abstractions
- Make the empirical data tell us where to compromise
- Generally represents QMCPACK OOP design and QMC algorithms
- The app must be accessible enough that little/no supervision is required to use it
- Must ensure maximum flexibility for new algorithms and data structures
- Have necessary physics abstractions for functional flexibility
- Wavefunction components (det, J1/2/3).
- Boundary conditions (Open, PPP, PPN, PNN). Initially fully periodic only (PPP).
- Numerical functor (B-spline, polynomial). Only B-spline J1/J2 and polynomial J3 in the first version, with no functor abstraction.
- Start from QMCPACK API, but simplify as much as possible
- Can break consistency if necessary (minimize if possible)
- During miniapp development, reintegration concerns should require no extra effort
- Use the same call sequence as real simulations
- Enables interprocedural algorithmic exploration
- Flexible to change the benchmark system and system size
- Only ECP-NiO, CORAL-graphite
- System sizes from ~100 to ~10k electrons
- Via command line
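A sketch of command-line size scaling under the constraints above: a single tiling argument multiplies a base cell, so runs scale from ~100 to ~10k electrons with no input files. The flag name (`-g`), the parser, and the per-cell electron count are all illustrative assumptions, not the miniapp's actual interface.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical run parameters derived purely from the command line.
struct SimParams
{
  int tiling = 1;              // supercell tiling factor, e.g. from "-g 8"
  int electrons_per_cell = 48; // base-cell electron count (illustrative number)
  int electrons() const { return tiling * electrons_per_cell; }
};

// Tiny argv parser: recognizes "-g <tiling>" and ignores everything else,
// keeping the tool trivially scriptable.
inline SimParams parse_args(int argc, char** argv)
{
  SimParams p;
  for (int i = 1; i + 1 < argc; ++i)
    if (std::string(argv[i]) == "-g")
      p.tiling = std::atoi(argv[i + 1]);
  return p;
}
```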
- Minimal MPI - no MPI-based parallelization
- Only the minimum MPI needed for tools/job wrappers to interface (e.g. MPI_Init()/MPI_Finalize())
- No introduction of additional build dependency on MPI
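One way to satisfy both points: MPI appears only behind a compile-time guard in an init/finalize wrapper, so a build without MPI has no dependency on it and the kernels never see it. The guard macro name (`HAVE_MPI`) and wrapper names are illustrative assumptions.

```cpp
// Optional MPI shim: the only place MPI symbols may appear.
#ifdef HAVE_MPI
#include <mpi.h>
#endif

inline void comm_init(int* argc, char*** argv)
{
#ifdef HAVE_MPI
  MPI_Init(argc, argv); // lets job wrappers/tools see a well-formed MPI program
#else
  (void)argc;
  (void)argv; // serial build: nothing to do, no MPI dependency introduced
#endif
}

inline void comm_finalize()
{
#ifdef HAVE_MPI
  MPI_Finalize();
#endif
}
```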
- Documentation
- Need a high-level, non-code-specific explanation of kernels/algorithms
- Governing equations, operations, etc.
- README included with source code
- How to run, check, time, etc.
- How to scale inputs
- Small: single process
- Medium: single gpu, node, etc.
- Big: rack
- Huge: challenge problem
- List of compilers and platforms we’ve checked
- What is in/out of scope for optimization
- No optimization into corners, magic numbers, etc.
- Specifics about physical realities, restrictions, etc.
- Contact info: support, patches, changes, etc.