Skip to content

miniQMC computational overview

Ye Luo edited this page Oct 22, 2018 · 6 revisions

The most important real-space approach is the diffusion Monte Carlo algorithm (DMC), where an ensemble of walkers (population) is propagated stochastically from generation to generation; each walker is represented by a set of 3D particle positions, R. In each propagation step, the walkers are moved through position space by a drift-diffusion process. At the end of each step, each walker may reproduce itself, be killed, or continue unchanged (branching process), depending on the value of E_L, the local energy, at that point in space. The walkers in an ensemble are tightly coupled through a trial energy E_T and a feedback mechanism to keep the average population at the target 10^5-10^6 walkers. This leads to a fluctuating population and necessitates shuffling of walkers among processing units to maintain load balance at an appropriate interval.

Most of the computational time is spent on evaluations of ratios (L7), derivatives (L8), energies (L13), and state updates when Monte Carlo moves are accepted (L10). Profiling the QMCPACK code shows the dominant kernels are the evaluation of the particle-based Jastrow functions and 3D B-Splines called from L7,8,13 of the pseudo code, and the inverse update from L10. The latter kernel is in fact using most of the computational resources for both the CPU and GPU versions of the code. For DMC and the real space approach, the selected mini-apps will naturally focus on the inverse matrix update, the splines, the Jastrow functions and the distance tables.

The pseudocode for each step of diffusion Monte Carlo (DMC) at a high-level goes as:

1.  for MC generation = 1...M do
2.    for walker = 1...Nw do
3.      let R = {r1...rN}
4.      for particle i = 1...N do
5.        set r'_i = r_i + del
6.        let R' = {r_1...r'_i...r_N}
7.        ratio p = Psi_T(R')/ Psi_T(R) (Jastrow factors, 3D B-Spline, Determinants)
8.        derivatives \nabla_i log(Psi_T), \nalba^2_i log(Psi_T) (Jastrow factors, 3D B-Spline, Determinants)
9.        if r --> r' is accepted then
10.         update state of a walker (Inverse Update)
11.       end if
12.     end for {particle}
13.     local energy E_L = H Psi_T(R)/Psi_T(R) (Jastrow factors, 3D B-Spline)
14.     reweight and branch walkers
15.   end for {walker}
16.   update E_T and load balance
17. endfor {MCgeneration}

miniQMC currently implements only a single walker to focus on single-node performance explorations, and so discards the two outer-most loops shown above. miniQMC mimics QMCPACk in the abstraction of the wavefunction into its components, which are the 1,2,3-body Jastrow functions and the determinant matrix; each of these pieces derive from a common Wavefunction base class and have their own implementation of functions such as ratio computation and gradient evaluation.

Important kernels

Determinant update (inverse matrix update)

As electrons are moved in the Monte Carlo, DMC uses the Sherman-Morrison algorithm to perform column-by-column updates of the Slater determinant (matrix) component of the wave- functions. Updating the Slater matrix is the source of the cubic scaling of our calculations. This algorithm relies on BLAS2 functions. In current calculations it is 30-50% of computation time and is therefore a major optimization target. In addition to searching for improved performance portable implementations, we will investigate using a delayed update method which reduces this bottleneck.

There are various algorithms and implementations that compute the updated determinant matrix generally based on the Sherman-Morrison formula. The one currently used by the main path of QMCPACK is as given in Eqn (26) of Fahy et al. 1990. An alternate variation is described in K. P. Esler, J. Kim, D. M. Ceperley, and L. Shulenburger: Accelerating Quantum Monte Carlo Simulations of Real Materials on GPU Clusters. Computing in Science & Engineering, Vol. 14, No. 1, Jan/Feb 2012, pp. 40-51 (section "Inverses and Updates"). The reference version implemented by miniQMC in src/QMCWaveFunctions/DeterminantRef.h uses the Fahy variant.

Given the following notation:

  • D: the incoming determinant matrix
  • Dinv: the inverse of 'D'
  • Dtil: \tilde{D}, the transpose of 'Dinv'
  • p: the index of the row (particle) in Dinv to be udpated
  • b: the new vector to update Dinv

we can most basically describe the task of the determinant update kernel as

  1. replacing the p'th row of D with the vector b
  2. inverting the matrix D to get Dinv

Roughly, the general pseudocode steps for computing the inverse update using the Fahy formula variant are:

  1. ratio = Dtil[p,:].b (vector dot product)
  2. ratio_inv = 1.0 / ratio (scalar arithmetic)
  3. vtemp = ratio_inv * Dtil[p,:] * b (DGEMV)
  4. vtemp[p] = 1.0 - ratio_inv (scalar update)
  5. rcopy = Dtil[p,:] (vector update)
  6. Dtil = Dtil - rcopy X v_temp (DGER outer product)

For Sherman-Morrison, the steps go as:

  1. delta_b = b - D[p,:] (vector subtraction)
  2. delDinv = delta_b * Dinv (DGEMV)
  3. ratio_inv = -1 / (1.0 + delDinv[p]) (scalar arithmetic)
  4. temp = Dinv[:,p] (vector update)
  5. temp = Dinv[:,p] * delta_Dinv (this and following two lines done with DGER)
  6. temp = temp * -1 / (1 + delta_Dinv[p])
  7. result = temp + Dinv

Be aware, this algorithm returns the updated inverse which is different than the inverse transpose that Fahy operates on/returns.

The determinant matrix is filled with the current values of the single particle orbitals. Specifically

In general, there is one determinant matrix for each spin species, up and down:


Splines

For every proposed electron move the value, gradient and laplacian of N orbitals are evaluated at a random particle position, where N is the number of electrons. Additional evaluations are performed for the energy. For this reason, the choice of the orbital representation is crucial for efficiency. This mini-app evaluates 3D tri-cubic B-splines which have been shown to be among the most efficient choices in condensed phases. Because many splines are evaluated simultaneously, data layout and memory hierarchy considerations are critical for high performance.

The large memory footprint of the spline data also motivates use of alternative representations and may be an important area for collaboration with mathematical efforts. A more compact representations may fit in faster/higher parts of the memory hierarchy, and therefore be preferred even if the evaluation cost is greater due to overall higher performance. Physics-based alternative representations are being explored in current BES-funded work, e.g. optimizing the grid spacing based on the nature of the orbitals.

B-splines form a basis that is used to represent the single particle orbitals and also terms in the Jastrow factors discussed next. A single (or cardinal) cubic B-spline is given by



A basis of splines in one dimension is comprised of a set of shifted copies of the cardinal B-spline:

A three dimensional B-spline basis is formed by direct product:

with . The full basis size is with .

Each orbital entering the determinant parts of the wavefunction is expressed in the 3D B-spline basis as:

where is a matrix with row vectors related to the simulation cell lattice vectors by .
At each point in the simulation cell, only 64 B-spline basis functions are non-zero and so obtaining the value and derivatives of a given orbital at a given point in space requires loading in a set of 64 coefficients/parameters (from ) along with evaluations of shifted copies of the cardinal spline.

Jastrow factors (1-,2-, and 3-body)

Jastrow factors are simple functions of inter-particle distances, such as the distances between electrons and between electrons and atoms. Although their purpose is very different, they are very similar to the terms that occur in classical molecular dynamics force fields and related particle methods. For exascale, the challenge is to evaluate these functions efficiently for the expected problem size and to expose additional parallelism. For example, the one, two, and three body terms in the Jastrow can be coscheduled, either by a tasking or stream programming approach, or by a runtime. The amount of work in each term varies with physical system and chosen parameterization, so a fixed decomposition will not be optimal.

Jastrow factors represent electronic correlation, or in other words, the way electrons move around atoms/ions or other electrons in a solid. The correlations are decomposed into 1-body (electron-ion), 2-body (electron-electron), and 3-body (electron-electron-ion) terms.

The one- and two-body terms implemented in the miniapp are similar in form:


Here represents the electron spin (up or down), refers to the ionic species (e.g. Ni or O for nickel oxide), is the number of electrons having a given spin, and is the number of atoms/ions of species . Electron positions are denoted and atom/ion positions are denoted . Terms with spin labels exchanged are treated identically (i.e. ).

Each one dimensional function (, ) is represented with a one dimensional basis of cubic B-splines

Here is a cutoff radius and .

The three-body Jastrow is more complex in form, as it depends on both the electron-ion and electron-electron distances explicitly.

In both QMCPACK and miniQMC, the three-body correlation function is represented as a general polynomial in these distances:

Here is a set of parameters and is a cutoff radius.

Distance tables

As a particle-based method, many operations in DMC require knowledge of the distance between electrons or between electrons and atoms. This is quadratically scaling work, and as the system size increases this becomes time consuming. Under periodic boundary conditions, minimum image conventions must be applied to every distance computation. In this mini-app, we explore the required technique to perform this task efficiently and portably: the nature of the algorithms leads to a strong sensitivity to data layout. Currently the threaded CPU implementation maintains explicit lists of particle distances, while the GPU implementation computes distances on the fly. As noted for the Jastrow factor evaluation, many techniques have been developed for classical molecular dynamics codes, and we will evaluate their algorithms where pertinent. However, we note that our particle counts are typically much smaller and minimum image convention is more important.

Distance tables are basically matrices composed of all pairs of particle distances. The distances are split into two tables, one for electron-electron distances and one for electron-ion distances, as follows:


Here are electron positions and is an atom/ion position. The distance tables are updated following each successful single electron Monte Carlo move.

As mentioned above, these distances are defined in the minimum image convention. If are the simulation cell lattice vectors, then each distance is defined as the shortest of 27 possible "images" (arising from all neighboring cells in periodic boundary conditions):

Here denotes the usual 2-norm (Euclidean norm). Distances defined in this way (left side above) are used for all pairwise distances in the miniapp, including all Jastrow factors.