Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LB particle coupling thermostat seed doesn't fully decorrelate independent simulations #4848

Closed
jngrad opened this issue Jan 12, 2024 · 2 comments · Fixed by #4845
Closed

Comments

@jngrad
Copy link
Member

jngrad commented Jan 12, 2024

TL;DR: the seed argument of system.thermostat.set_lb() doesn't actually set the random number generator (RNG) seed of the LB particle coupling code. Instead, it sets the lag time in the RNG sequence. Thus two independent simulations with a different seed will exhibit correlated dynamics. The LB fluid seed is not affected by this issue.

Background

We use Philox counters to generate reproducible and cross-platform random number sequences. These are used to calculate the friction noise during thermalization. The RNG sequence is fully determined by 4 input arguments:

  • the RNG salt, which is a unique id for each kind of thermostat (LB particle coupling, Langevin, Brownian, Stokesian, etc.)
  • the RNG seed, which the user has control over
  • two keys, which are typically particle ids (for LB particle coupling, the second key is always zero)

Running the same simulation twice with the same RNG seed produces the exact same trajectory, up to machine precision. Running the same simulation twice with a different RNG seed produces a different trajectory. For the latter, the magnitude of the difference between the two seeds is irrelevant; the RNG seed "decorrelates" RNG sequences, just like the RNG salt does. This is achieved by internally concatenating the salt and seed into a single 64bit integer.

The Philox RNG has an internal counter which is incremented at each time step. This counter essentially controls the current position in the RNG sequence. We do not provide a convenient way for the user to initialize this value. This counter value is saved during checkpointing.

Problem statement

For historical reasons, thermostats in ESPResSo were all implemented differently, and didn't share a common base class. Yet they shared a unified Python interface, which could give users the false impression that they were related. This lead to bad user experience in the 4.1 line of ESPResSo where one thermostat was using the Mersenne Twister RNG instead of the Philox RNG.

The LB particle coupling thermostat in particular was implemented with a hardcoded seed=0, regardless of which seed the user gives as argument to system.thermostat.set_lb(). Yet the particle forces are different when giving a different seed, because the value of the seed argument is stored as the internal counter of the Philox RNG. Thus the LB particle coupling RNG sequence is always the same for all simulations, and the seed argument actually controls the initial time lag in that RNG sequence. This issue is specific to the LB particle coupling code, and does not affect the LB fluid code.

As a result, two independent simulations that differ only in the LB particle coupling seed with not truly be independent: the particle forces from simulation A will be correlated to the particle forces from simulation B with a time lag. This correlation is weak, since it's "diluted" by the LB fluid noise and all other interactions in the system.

It would be useful to know how weak this correlation truly is, and whether it can affect some observables more than other, for example Green-Kubo relations. To that end, I adapted the polymers tutorial to investigate how the correlation between independent simulations that differ only by the LB particle coupling seed affects simulation results. I measured the diffusion coefficient of a polymer with 4 beads connected by FENE bonds, repelled by a WCA potential, and coupled to a thermalized fluid in a 8x8x8 grid in units of the WCA sigma. The simulation was repeated 48 times for two scenarios: one where the Philox seed is 0 and the Philox counter initial value is controlled by the user ("current behavior"), and one where the Philox seed is controlled by the user and the Philox counter initial value is always 0 ("proposed behavior").

Figure 1 plots the mean and standard error of the mean of the diffusion coefficient as a function of the simulation time for both scenarios. We can see in the proposed behavior scenario that the error band tapers for longer simulation times, as expected. In the current behavior scenario, the error band is extremely narrow right at the beginning, and tapers more slowly. Figure 2 plots the standard error of the mean as a function of time to help better visualize how the correlation between independent measurements affects the magnitude of the standard error of the mean.

In my opinion, this is a design flaw. Data production is affected in two ways:

  1. error bars in plots are artificially shrunk by a factor 2.5, because the underlying assumption of independent samples in the formula of the standard of the mean is not satisfied
  2. users might decide to simulate for shorter times, because these independent simulations converge to the same value much faster than they should

Point 1 doesn't affect the samples mean, thus the calculated data point (or the shape of the curve for time-dependent observables) converges to the real value. Point 2 can affect the sample mean due to insufficient sampling time.

In the light of the data produced by this experiment, I'm not convinced we can simply simulate for longer to get the LB fluid RNG to introduce enough entropy to fully decorrelate independent simulations. The current behavior scenario was run again with a hardcoded seed of 1 (data not shown) to make sure my observations were not due to a statistical fluke with the seed of 0, and the same trend emerged.


Plot of the diffusion coefficient as a function of the simulation time, for both the current behavior of ESPResSo, and the new proposed behavior. In the current behavior, the standard error of the mean is extremely narrow and much smaller in magnitude than the magnitude of the fluctuations of the mean between two at short sampling times. In the proposed behavior, the standard error of the marge is more reasonable and contains the natural fluctuation of the mean at short simulation times, and tapers progressively towards longer simulation times.

Figure 1: Convergence analysis of the polymer diffusion coefficient.


Plot of the standard error of the mean of the diffusion coefficient as a function of the simulation time. The proposed behavior is initially 3 times larger in magnitude compared to the current behavior, and 2.2 times larger towards large simulation times.

Figure 2: Comparison of the standard error of the mean in both scenarios.

Proposed solution

Make the seed parameter set the Philox seed, instead of the Philox counter. This was implemented by commit 9550aa1 in #4845. This fix could be backported to the 4.2 line of ESPResSo with an extra MPI callback and 3 lines of Cython code.

Supporting information

Simulation script
import sys
import numpy as np
import scipy.optimize

import espressomd
import espressomd.analyze
import espressomd.accumulators
import espressomd.observables
import espressomd.polymer
import espressomd.lb

espressomd.assert_features(['WCA', 'WALBERLA'])

def build_polymer(system, n_monomers, polymer_params, fene):
    positions = espressomd.polymer.linear_polymer_positions(
        beads_per_chain=n_monomers, **polymer_params)
    p_previous = None
    for i, pos in enumerate(positions[0]):
        p = system.part.add(pos=pos)
        if p_previous is not None:
            p.add_bond((fene, p_previous))
        p_previous = p

def correlator_gk(pids_monomers, tau_max):
    com_vel = espressomd.observables.ComVelocity(ids=pids_monomers)
    com_vel_cor = espressomd.accumulators.Correlator(
        obs1=com_vel, tau_lin=16, tau_max=tau_max, delta_N=10,
        corr_operation="scalar_product", compress1="discard1")
    return com_vel_cor

def gk_integration(N, tau, acf):
    x = np.arange(2, 28)
    y = [1 / 3 * np.trapz(acf[:j], tau[:j]) for j in x]
    diffusion_coef = np.mean(y[10:])
    return diffusion_coef, tau[x], y

# Setup constants
BOX_L = 8.0
TIME_STEP = 0.01
LOOPS = 100
STEPS = 100
KT = 1.0
GAMMA = 5.0
POLYMER_PARAMS = {'n_polymers': 1, 'bond_length': 1, 'seed': 42, 'min_distance': 0.9}

# System setup
system = espressomd.System(box_l=3 * [BOX_L])
system.cell_system.skin = 0.4

# Lennard-Jones interaction
system.non_bonded_inter[0, 0].wca.set_params(epsilon=1.0, sigma=1.0)

# Fene interaction
fene = espressomd.interactions.FeneBond(k=7, r_0=1, d_r_max=2)
system.bonded_inter.add(fene)

com_vel_tau_results = []
com_vel_acf_results = []

N = 4
seed = int(sys.argv[-2])
counter = int(sys.argv[-1])
build_polymer(system, N, POLYMER_PARAMS, fene)
# energy minimization
system.time_step = 0.002
system.integrator.set_steepest_descent(
    f_max=1.0,
    gamma=10,
    max_displacement=0.01)
system.integrator.run(2000)
# thermalization with Langevin
system.integrator.set_vv()
system.time_step = TIME_STEP
system.thermostat.set_langevin(kT=1.0, gamma=50, seed=42)
system.integrator.run(2000)
system.thermostat.turn_off()
# thermalization with LB
lbf = espressomd.lb.LBFluidWalberla(kT=KT, seed=42, agrid=1, density=1,
                                    kinematic_viscosity=5, tau=system.time_step,
                                    single_precision=True)
system.lb = lbf
system.thermostat.set_lb(LB_fluid=lbf, gamma=GAMMA, seed=seed, philox_counter=counter)
system.integrator.run(1000)

# configure Green-Kubo correlator
com_vel_cors = []
n_loops = [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 256, 384, 512]
n_steps = []
for i in n_loops:
    com_vel_cor = correlator_gk(np.arange(N), LOOPS * STEPS)
    system.auto_update_accumulators.add(com_vel_cor)
    com_vel_cors.append(com_vel_cor)

def save_data(n_steps, com_vel_tau_results, com_vel_acf_results):
    com_vel_tau_results = np.array(com_vel_tau_results)
    com_vel_acf_results = np.reshape(com_vel_acf_results, [len(n_steps), -1])

    diffusion_gk = []
    diffusion_time = []
    for length, tau, acf in zip(n_steps, com_vel_tau_results, com_vel_acf_results):
        diffusion_coef, xdata, ydata = gk_integration(N, tau, acf)
        diffusion_gk.append(diffusion_coef)
        diffusion_time.append(length * TIME_STEP)

    np.save(f"lb_seed_{seed}_counter_{counter}.npy", [diffusion_gk, diffusion_time])

last_sim_steps = 0
for i, n in enumerate(n_loops):
    com_vel_cor = com_vel_cors[i]
    sim_total = n * LOOPS * STEPS
    system.integrator.run(sim_total - last_sim_steps)
    system.auto_update_accumulators.remove(com_vel_cor)
    last_sim_steps = sim_total
    com_vel_cor.finalize()
    com_vel_tau_results.append(com_vel_cor.lag_times())
    com_vel_acf_results.append(com_vel_cor.result())
    n_steps.append(sim_total)
    save_data(n_steps, com_vel_tau_results, com_vel_acf_results)
Launch script
for counter in {1..16}; do ./pypresso mwe.py 0 $counter & done
for seed in {1..16}; do ./pypresso mwe.py $seed 0 & done
Data analysis
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 16})

def parse(n, pat, burn=0):
    root = "./"
    matrix = []
    for i in range(n):
        diffusion_gk, diffusion_time = np.load(f"{root}/{pat.format(i+1)}")
        matrix.append(diffusion_gk)
    n = min(len(x) for x in matrix)
    return diffusion_time[burn:n], np.array([x[burn:n] for x in matrix])

df_old_time, df_old_coef = parse(48, "lb_seed_0_counter_{}.npy", 2)
df_new_time, df_new_coef = parse(48, "lb_seed_{}_counter_0.npy", 2)

df_old_mean = np.mean(df_old_coef, axis=0)
df_old_stderr = np.std(df_old_coef, axis=0) / np.sqrt(df_old_coef.shape[1])
df_new_mean = np.mean(df_new_coef, axis=0)
df_new_stderr = np.std(df_new_coef, axis=0) / np.sqrt(df_new_coef.shape[1])

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(11, 4))
ax2.fill_between([], [], [])
ax1.fill_between(df_old_time, df_old_mean - df_old_stderr, df_old_mean + df_old_stderr, alpha=0.5)
ax2.fill_between(df_new_time, df_new_mean - df_new_stderr, df_new_mean + df_new_stderr, alpha=0.5)
ax1.plot(df_old_time, df_old_mean, label="Current behavior")
ax2.plot([], [])
ax2.plot(df_new_time, df_new_mean, label="Proposed behavior")
ax1.set_xlabel("Number of time steps")
ax2.set_xlabel("Number of time steps")
ax1.set_ylabel(r"Diffusion coefficient [$\sigma^2/t$]")
ax1.legend(loc="lower right")
ax2.legend(loc="lower right")
plt.tight_layout()
plt.show()

fig = plt.figure(figsize=(8, 4))
plt.plot(df_old_time, df_old_stderr, label="Current behavior")
plt.plot(df_new_time, df_new_stderr, label="Proposed behavior")
#plt.yscale('log')
#plt.xscale('log')
plt.legend()
plt.xlabel("Number of time steps")
plt.ylabel("Standard error of\n" r"the mean [$\sigma^2/t$]")
plt.tight_layout()
plt.show()
@jngrad
Copy link
Member Author

jngrad commented Jan 12, 2024

This topic was actually already brought up for all other thermostats in #3585. In fact, the solution implemented in #4845 is exactly the same as the solution we implemented back in 2020. I don't remember why we chose to leave the LB particle coupling code out of the bugfix PR back then.

@jngrad
Copy link
Member Author

jngrad commented Jan 19, 2024

Offline discussion with @RudolfWeeber: this is indeed a bug, albeit not a serious one, since it only affects ensemble runs.

I looked into the possibility of backporting the bugfix to 4.2, but the LB CPU and LB GPU use different Philox kernels while sharing the same RNG struct, thus significant alterations to the LB GPU code would be required. We will not backport it to 4.2.

@jngrad jngrad added this to the ESPResSo 4.3.0 milestone Jan 19, 2024
@kodiakhq kodiakhq bot closed this as completed in #4845 Jan 19, 2024
@kodiakhq kodiakhq bot closed this as completed in da47129 Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant