LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

jngrad · 2024-01-12T21:20:49Z

TL;DR: the seed argument of system.thermostat.set_lb() doesn't actually set the random number generator (RNG) seed of the LB particle coupling code. Instead, it sets the lag time in the RNG sequence. Thus two independent simulations with a different seed will exhibit correlated dynamics. The LB fluid seed is not affected by this issue.

Background

We use Philox counters to generate reproducible and cross-platform random number sequences. These are used to calculate the friction noise during thermalization. The RNG sequence is fully determined by 4 input arguments:

the RNG salt, which is a unique id for each kind of thermostat (LB particle coupling, Langevin, Brownian, Stokesian, etc.)
the RNG seed, which the user has control over
two keys, which are typically particle ids (for LB particle coupling, the second key is always zero)

Running the same simulation twice with the same RNG seed produces the exact same trajectory, up to machine precision. Running the same simulation twice with a different RNG seed produces a different trajectory. For the latter, the magnitude of the difference between the two seeds is irrelevant; the RNG seed "decorrelates" RNG sequences, just like the RNG salt does. This is achieved by internally concatenating the salt and seed into a single 64bit integer.

The Philox RNG has an internal counter which is incremented at each time step. This counter essentially controls the current position in the RNG sequence. We do not provide a convenient way for the user to initialize this value. This counter value is saved during checkpointing.

Problem statement

For historical reasons, thermostats in ESPResSo were all implemented differently, and didn't share a common base class. Yet they shared a unified Python interface, which could give users the false impression that they were related. This lead to bad user experience in the 4.1 line of ESPResSo where one thermostat was using the Mersenne Twister RNG instead of the Philox RNG.

The LB particle coupling thermostat in particular was implemented with a hardcoded seed=0, regardless of which seed the user gives as argument to system.thermostat.set_lb(). Yet the particle forces are different when giving a different seed, because the value of the seed argument is stored as the internal counter of the Philox RNG. Thus the LB particle coupling RNG sequence is always the same for all simulations, and the seed argument actually controls the initial time lag in that RNG sequence. This issue is specific to the LB particle coupling code, and does not affect the LB fluid code.

As a result, two independent simulations that differ only in the LB particle coupling seed with not truly be independent: the particle forces from simulation A will be correlated to the particle forces from simulation B with a time lag. This correlation is weak, since it's "diluted" by the LB fluid noise and all other interactions in the system.

It would be useful to know how weak this correlation truly is, and whether it can affect some observables more than other, for example Green-Kubo relations. To that end, I adapted the polymers tutorial to investigate how the correlation between independent simulations that differ only by the LB particle coupling seed affects simulation results. I measured the diffusion coefficient of a polymer with 4 beads connected by FENE bonds, repelled by a WCA potential, and coupled to a thermalized fluid in a 8x8x8 grid in units of the WCA sigma. The simulation was repeated 48 times for two scenarios: one where the Philox seed is 0 and the Philox counter initial value is controlled by the user ("current behavior"), and one where the Philox seed is controlled by the user and the Philox counter initial value is always 0 ("proposed behavior").

Figure 1 plots the mean and standard error of the mean of the diffusion coefficient as a function of the simulation time for both scenarios. We can see in the proposed behavior scenario that the error band tapers for longer simulation times, as expected. In the current behavior scenario, the error band is extremely narrow right at the beginning, and tapers more slowly. Figure 2 plots the standard error of the mean as a function of time to help better visualize how the correlation between independent measurements affects the magnitude of the standard error of the mean.

In my opinion, this is a design flaw. Data production is affected in two ways:

error bars in plots are artificially shrunk by a factor 2.5, because the underlying assumption of independent samples in the formula of the standard of the mean is not satisfied
users might decide to simulate for shorter times, because these independent simulations converge to the same value much faster than they should

Point 1 doesn't affect the samples mean, thus the calculated data point (or the shape of the curve for time-dependent observables) converges to the real value. Point 2 can affect the sample mean due to insufficient sampling time.

In the light of the data produced by this experiment, I'm not convinced we can simply simulate for longer to get the LB fluid RNG to introduce enough entropy to fully decorrelate independent simulations. The current behavior scenario was run again with a hardcoded seed of 1 (data not shown) to make sure my observations were not due to a statistical fluke with the seed of 0, and the same trend emerged.

Figure 1: Convergence analysis of the polymer diffusion coefficient.

Figure 2: Comparison of the standard error of the mean in both scenarios.

Proposed solution

Make the seed parameter set the Philox seed, instead of the Philox counter. This was implemented by commit 9550aa1 in #4845. This fix could be backported to the 4.2 line of ESPResSo with an extra MPI callback and 3 lines of Cython code.

Supporting information

Simulation script

import sys
import numpy as np
import scipy.optimize

import espressomd
import espressomd.analyze
import espressomd.accumulators
import espressomd.observables
import espressomd.polymer
import espressomd.lb

espressomd.assert_features(['WCA', 'WALBERLA'])

def build_polymer(system, n_monomers, polymer_params, fene):
    positions = espressomd.polymer.linear_polymer_positions(
        beads_per_chain=n_monomers, **polymer_params)
    p_previous = None
    for i, pos in enumerate(positions[0]):
        p = system.part.add(pos=pos)
        if p_previous is not None:
            p.add_bond((fene, p_previous))
        p_previous = p

def correlator_gk(pids_monomers, tau_max):
    com_vel = espressomd.observables.ComVelocity(ids=pids_monomers)
    com_vel_cor = espressomd.accumulators.Correlator(
        obs1=com_vel, tau_lin=16, tau_max=tau_max, delta_N=10,
        corr_operation="scalar_product", compress1="discard1")
    return com_vel_cor

def gk_integration(N, tau, acf):
    x = np.arange(2, 28)
    y = [1 / 3 * np.trapz(acf[:j], tau[:j]) for j in x]
    diffusion_coef = np.mean(y[10:])
    return diffusion_coef, tau[x], y

# Setup constants
BOX_L = 8.0
TIME_STEP = 0.01
LOOPS = 100
STEPS = 100
KT = 1.0
GAMMA = 5.0
POLYMER_PARAMS = {'n_polymers': 1, 'bond_length': 1, 'seed': 42, 'min_distance': 0.9}

# System setup
system = espressomd.System(box_l=3 * [BOX_L])
system.cell_system.skin = 0.4

# Lennard-Jones interaction
system.non_bonded_inter[0, 0].wca.set_params(epsilon=1.0, sigma=1.0)

# Fene interaction
fene = espressomd.interactions.FeneBond(k=7, r_0=1, d_r_max=2)
system.bonded_inter.add(fene)

com_vel_tau_results = []
com_vel_acf_results = []

N = 4
seed = int(sys.argv[-2])
counter = int(sys.argv[-1])
build_polymer(system, N, POLYMER_PARAMS, fene)
# energy minimization
system.time_step = 0.002
system.integrator.set_steepest_descent(
    f_max=1.0,
    gamma=10,
    max_displacement=0.01)
system.integrator.run(2000)
# thermalization with Langevin
system.integrator.set_vv()
system.time_step = TIME_STEP
system.thermostat.set_langevin(kT=1.0, gamma=50, seed=42)
system.integrator.run(2000)
system.thermostat.turn_off()
# thermalization with LB
lbf = espressomd.lb.LBFluidWalberla(kT=KT, seed=42, agrid=1, density=1,
                                    kinematic_viscosity=5, tau=system.time_step,
                                    single_precision=True)
system.lb = lbf
system.thermostat.set_lb(LB_fluid=lbf, gamma=GAMMA, seed=seed, philox_counter=counter)
system.integrator.run(1000)

# configure Green-Kubo correlator
com_vel_cors = []
n_loops = [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 256, 384, 512]
n_steps = []
for i in n_loops:
    com_vel_cor = correlator_gk(np.arange(N), LOOPS * STEPS)
    system.auto_update_accumulators.add(com_vel_cor)
    com_vel_cors.append(com_vel_cor)

def save_data(n_steps, com_vel_tau_results, com_vel_acf_results):
    com_vel_tau_results = np.array(com_vel_tau_results)
    com_vel_acf_results = np.reshape(com_vel_acf_results, [len(n_steps), -1])

    diffusion_gk = []
    diffusion_time = []
    for length, tau, acf in zip(n_steps, com_vel_tau_results, com_vel_acf_results):
        diffusion_coef, xdata, ydata = gk_integration(N, tau, acf)
        diffusion_gk.append(diffusion_coef)
        diffusion_time.append(length * TIME_STEP)

    np.save(f"lb_seed_{seed}_counter_{counter}.npy", [diffusion_gk, diffusion_time])

last_sim_steps = 0
for i, n in enumerate(n_loops):
    com_vel_cor = com_vel_cors[i]
    sim_total = n * LOOPS * STEPS
    system.integrator.run(sim_total - last_sim_steps)
    system.auto_update_accumulators.remove(com_vel_cor)
    last_sim_steps = sim_total
    com_vel_cor.finalize()
    com_vel_tau_results.append(com_vel_cor.lag_times())
    com_vel_acf_results.append(com_vel_cor.result())
    n_steps.append(sim_total)
    save_data(n_steps, com_vel_tau_results, com_vel_acf_results)

Launch script

for counter in {1..16}; do ./pypresso mwe.py 0 $counter & done
for seed in {1..16}; do ./pypresso mwe.py $seed 0 & done

Data analysis

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 16})

def parse(n, pat, burn=0):
    root = "./"
    matrix = []
    for i in range(n):
        diffusion_gk, diffusion_time = np.load(f"{root}/{pat.format(i+1)}")
        matrix.append(diffusion_gk)
    n = min(len(x) for x in matrix)
    return diffusion_time[burn:n], np.array([x[burn:n] for x in matrix])

df_old_time, df_old_coef = parse(48, "lb_seed_0_counter_{}.npy", 2)
df_new_time, df_new_coef = parse(48, "lb_seed_{}_counter_0.npy", 2)

df_old_mean = np.mean(df_old_coef, axis=0)
df_old_stderr = np.std(df_old_coef, axis=0) / np.sqrt(df_old_coef.shape[1])
df_new_mean = np.mean(df_new_coef, axis=0)
df_new_stderr = np.std(df_new_coef, axis=0) / np.sqrt(df_new_coef.shape[1])

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(11, 4))
ax2.fill_between([], [], [])
ax1.fill_between(df_old_time, df_old_mean - df_old_stderr, df_old_mean + df_old_stderr, alpha=0.5)
ax2.fill_between(df_new_time, df_new_mean - df_new_stderr, df_new_mean + df_new_stderr, alpha=0.5)
ax1.plot(df_old_time, df_old_mean, label="Current behavior")
ax2.plot([], [])
ax2.plot(df_new_time, df_new_mean, label="Proposed behavior")
ax1.set_xlabel("Number of time steps")
ax2.set_xlabel("Number of time steps")
ax1.set_ylabel(r"Diffusion coefficient [$\sigma^2/t$]")
ax1.legend(loc="lower right")
ax2.legend(loc="lower right")
plt.tight_layout()
plt.show()

fig = plt.figure(figsize=(8, 4))
plt.plot(df_old_time, df_old_stderr, label="Current behavior")
plt.plot(df_new_time, df_new_stderr, label="Proposed behavior")
#plt.yscale('log')
#plt.xscale('log')
plt.legend()
plt.xlabel("Number of time steps")
plt.ylabel("Standard error of\n" r"the mean [$\sigma^2/t$]")
plt.tight_layout()
plt.show()

The text was updated successfully, but these errors were encountered:

jngrad · 2024-01-12T21:37:16Z

This topic was actually already brought up for all other thermostats in #3585. In fact, the solution implemented in #4845 is exactly the same as the solution we implemented back in 2020. I don't remember why we chose to leave the LB particle coupling code out of the bugfix PR back then.

jngrad · 2024-01-19T13:06:33Z

Offline discussion with @RudolfWeeber: this is indeed a bug, albeit not a serious one, since it only affects ensemble runs.

I looked into the possibility of backporting the bugfix to 4.2, but the LB CPU and LB GPU use different Philox kernels while sharing the same RNG struct, thus significant alterations to the LB GPU code would be required. We will not backport it to 4.2.

jngrad mentioned this issue Jan 12, 2024

Thermostats refactoring #4845

Merged

jngrad added Bug Core labels Jan 19, 2024

jngrad added this to the ESPResSo 4.3.0 milestone Jan 19, 2024

kodiakhq bot closed this as completed in #4845 Jan 19, 2024

kodiakhq bot closed this as completed in da47129 Jan 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

jngrad commented Jan 12, 2024

jngrad commented Jan 12, 2024

jngrad commented Jan 19, 2024

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

Comments

jngrad commented Jan 12, 2024

Background

Problem statement

Proposed solution

Supporting information

jngrad commented Jan 12, 2024

jngrad commented Jan 19, 2024