Fix race condition between NATS sync and reload
This fixes a race condition between the monit file check that is used to reload the NATS server configuration and the BOSH NATS sync process that rewrites that configuration.

The BOSH NATS sync process rewrites `/var/vcap/data/nats/auth.json` with the list of clients allowed to connect to NATS. It adds a client for each BOSH-deployed VM agent. It also adds entries for the BOSH Director and Health Monitor processes in cases where they don't use the standard "Org=Cloud Foundry" certificate subjects.

monit watches `auth.json` and sends a reload signal to the NATS server when the checksum changes, allowing the NATS server to trust the updated list of clients.

However, it's possible for the NATS sync process to write `auth.json` before monit begins its file checks. When this happens, the checksum never changes from monit's perspective, so it never sends the reload signal to the NATS server. When using certificates whose subject does not have Org "Cloud Foundry", this means that Health Monitor will never be able to connect to NATS and thus never reach a "running" state with monit. This causes `bosh create-env` to fail.

Now, the bosh_nats_sync monit job is marked as depending on the nats_auth_conf file check. This should ensure that monit starts the file check before it starts NATS sync, so that monit detects the initial change to `auth.json` and sends the reload signal.

As an additional measure, the nats_auth_conf check has been moved to immediately after the nats job, as monit also _seems_ to start checks in the order they are declared.
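
For illustration, the dependency and ordering described above might look roughly like the following monit configuration. This is a sketch, not the actual diff: the check names `nats`, `nats_auth_conf`, and `bosh_nats_sync` and the `auth.json` path come from this message, while every `/path/to/...` entry is a hypothetical placeholder for the real pidfile, start/stop, and reload commands.

```
# Sketch only: all /path/to/... values are hypothetical placeholders.
check process nats
  with pidfile /path/to/nats.pid
  start program "/path/to/start-nats"
  stop program "/path/to/stop-nats"

# Declared immediately after the nats job so the file check starts early.
check file nats_auth_conf path /var/vcap/data/nats/auth.json
  if changed checksum then exec "/path/to/reload-nats"

# NATS sync is only started once the nats_auth_conf check is active.
check process bosh_nats_sync
  with pidfile /path/to/bosh_nats_sync.pid
  start program "/path/to/start-bosh-nats-sync"
  stop program "/path/to/stop-bosh-nats-sync"
  depends on nats_auth_conf
```

With `depends on`, monit should start monitoring `nats_auth_conf` before it starts `bosh_nats_sync`, so the first rewrite of `auth.json` is seen as a checksum change.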