Fix race condition between NATS sync and reload
This fixes a race condition between the monit file check that is used to
reload the NATS server configuration, and the BOSH NATS sync process
that rewrites that configuration.

The BOSH NATS sync process is used to rewrite
`/var/vcap/data/nats/auth.json` with the list of clients allowed to
connect to NATS. This adds a client for each BOSH-deployed VM agent.
It also adds entries for the BOSH Director and Health Monitor processes,
in cases where they don't use the standard "Org=Cloud Foundry"
certificate subjects. monit watches `auth.json` and sends a reload
signal to the NATS server when the checksum changes, allowing the NATS
server to trust the updated list of clients.

However, it's possible for the NATS sync process to start writing
`auth.json` before monit begins its file checks. When this happens, the
checksum never changes from monit's perspective and it never sends the
reload signal to the NATS server. When using certificates that do not
have a subject with Org "Cloud Foundry", this means that Health Monitor
will never be able to connect to NATS and thus never reach a "running"
state with monit. This causes `bosh create-env` to fail.
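
The race described above can be illustrated with a small simulation. This is a simplified model of a checksum-based file check, not monit's actual implementation: the `ChecksumWatcher` class, the `simulate` helper, and the file contents are all hypothetical. The model assumes that the first check only records a baseline checksum and reports no change.

```python
import hashlib
import os
import tempfile


class ChecksumWatcher:
    """Hypothetical model of a checksum-based file check."""

    def __init__(self, path):
        self.path = path
        self.baseline = None  # no checksum is recorded until the first check runs

    def _checksum(self):
        with open(self.path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def check(self):
        """Return True if the file changed since the previous check."""
        current = self._checksum()
        if self.baseline is None:
            # First check: record the baseline; there is nothing to compare against.
            self.baseline = current
            return False
        changed = current != self.baseline
        self.baseline = current
        return changed


def simulate(writer_runs_first):
    """Return True if the watcher would trigger a reload."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "auth.json")
        with open(path, "w") as f:
            f.write("{}")  # placeholder initial contents
        watcher = ChecksumWatcher(path)

        def sync_writes():
            # Stand-in for the NATS sync process rewriting the file (made-up contents).
            with open(path, "w") as f:
                f.write('{"users": ["director", "hm"]}')

        if writer_runs_first:
            sync_writes()
            watcher.check()         # baseline already includes the new contents
            return watcher.check()  # nothing has changed since the baseline
        watcher.check()             # baseline taken before the write
        sync_writes()
        return watcher.check()      # the write is detected as a change
```

In this model, `simulate(writer_runs_first=True)` reports no change, which corresponds to the reload signal never being sent, while `simulate(writer_runs_first=False)` reports the change.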

Now, the bosh_nats_sync monit job is marked as depending on the
nats_auth_conf file check. This should ensure that the file check is
started by monit before NATS sync is started, guaranteeing that monit
will detect the initial change to `auth.json` and send the reload
signal. As an additional measure, the nats_auth_conf check has been
moved immediately after the nats job, as monit also _seems_ to start
checks in the order they are declared.
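
The intended effect of the new dependency can be sketched as a dependency-ordered start. This is only an illustration of the ordering constraint, not a description of how monit schedules its checks internally; the `DEPENDS_ON` mapping and the `start_order` helper are hypothetical, and the sketch assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified view of the entries in jobs/nats/monit after this
# change: each key maps to the set of entries it depends on.
DEPENDS_ON = {
    "nats": set(),
    "nats_auth_conf": set(),
    "bosh_nats_sync": {"nats_auth_conf"},
}


def start_order(depends_on):
    """Return an order in which every entry comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = start_order(DEPENDS_ON)
# The file check is always placed ahead of the sync process.
print(order.index("nats_auth_conf") < order.index("bosh_nats_sync"))
```

Any order that respects the declared dependency places `nats_auth_conf` ahead of `bosh_nats_sync`, which is the property the commit relies on.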
ystros committed Sep 9, 2022
1 parent fdfba25 commit 59a3a13
Showing 1 changed file with 5 additions and 4 deletions.
9 changes: 5 additions & 4 deletions jobs/nats/monit
@@ -4,12 +4,13 @@ check process nats
   stop program "/var/vcap/jobs/bpm/bin/bpm stop nats"
   group vcap

+check file nats_auth_conf
+  with path /var/vcap/data/nats/auth.json
+  if changed checksum then exec "/var/vcap/packages/nats/bin/nats-server --signal reload=/var/vcap/sys/run/bpm/nats/nats.pid"
+
 check process bosh_nats_sync
   with pidfile /var/vcap/sys/run/bpm/nats/bosh_nats_sync.pid
   start program "/var/vcap/jobs/bpm/bin/bpm start nats -p bosh_nats_sync"
   stop program "/var/vcap/jobs/bpm/bin/bpm stop nats -p bosh_nats_sync"
   group vcap
-
-check file nats_auth_conf
-  with path /var/vcap/data/nats/auth.json
-  if changed checksum then exec "/var/vcap/packages/nats/bin/nats-server --signal reload=/var/vcap/sys/run/bpm/nats/nats.pid"
+  depends on nats_auth_conf
