Production - Server Restart and Restore #192
-
Update from the restart after ~12 hours of downtime on March 13th, 2024. All of the above worked great, with the exception that now (as documented) the …
-
From a fresh state, this is correct data-management logic. DO NOT just blindly run this; check (with …
-
Update!
Again, make sure …
-
Unfortunately zfs does not seem to support overlayfs mounts (at least on Linux 5.4; I believe modern Linux+zfs does). So, we make it ext4 with a simple bind mount: …
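The command block is elided above. A minimal sketch of this kind of ext4 bind-mount setup, assuming a hypothetical spare partition `/dev/sdb1` and the `/mnt/docker-data` and `/data/docker` paths that appear later in this thread:

```
# One-time: format the backing partition as ext4 (hypothetical device)
mkfs.ext4 /dev/sdb1

# Mount it, then bind it over the docker data directory
mkdir -p /mnt/docker-data /data/docker
mount /dev/sdb1 /mnt/docker-data
mount --bind /mnt/docker-data /data/docker
```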
-
Our first server reboot of the new Fall 2024 semester! These commands brought us back:
…
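The command block is elided above. Based on the restart sequences documented elsewhere in this thread, it was presumably something like the following; the exact volume and port flags for this particular node are assumptions:

```
# Rebuild and relaunch the dojo container after a reboot
# (pattern taken from the other restart notes in this thread;
#  exact volume/port flags for this node are assumptions)
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && \
  docker run --privileged -d \
    -v $PWD:/opt/pwn.college -v /data:/data \
    -p 22:22 -p 80:80 -p 443:443 \
    --name dojo pwncollege/dojo
```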
-
In order to encourage the server not to sleep:
…
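The command block is elided above. On a systemd-based host, one common way to do this (an assumption about what was actually run here) is to mask the sleep-related targets:

```
# Stop systemd from ever suspending or hibernating the machine
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```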
-
On October 11th, 2024, new users (and/or users that hadn't been active for a while) were getting "Docker Failed" errors. For some reason new mounts inside "/data/homes/mounts" were not possible. While stracing the autofs process during an attempt to access a new mount in there, I saw: …

Maybe this is part of the root cause, maybe it's not; unsure. ChatGPT and Google didn't reveal any quick fixes/discussion surrounding this after searching around for 5 minutes. So, the simplest solution was to restart. I opted for a soft manual restart of services, rather than multiple full machine restarts.

What I did

First I brought down the site on the main node to make sure that nobody could cause some weird state: …

On each node, I aggressively stopped all workspaces: …

This didn't actually work successfully for me, so I had to get even more aggressive: …

Back on the main node, I wanted to flush all the mounts (because new users were no longer able to mount their homes, and so were getting "Docker failed" errors): …

I then brought the site back up: …

Going Forward

It seems like we aren't getting homes automatically unmounted in our "/data/homes/mounts" autofs, and even worse, after ~6200 homes, things just broke. Unmounting something and then mounting a new thing worked, so it seems to be count based (but it's not clear what that count is; it certainly doesn't seem to be a power of 2). Fortunately, we are migrating away from our current homes setup, and so the current issues likely won't be very relevant in the future version.
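The individual command blocks above are elided. A minimal sketch of that kind of soft restart, using the `dojo compose` wrapper that appears elsewhere in this thread; the service name, the docker-in-docker layout, and the lazy-unmount step are assumptions, not the commands that were actually run:

```
# On the main node (inside the dojo container): stop the web frontend so
# nobody can create new state. The service name "ctfd" is an assumption.
docker exec dojo dojo compose stop ctfd

# On each workspace node: aggressively stop all workspace containers.
# Assumption: workspaces run under the docker-in-docker daemon inside "dojo".
docker exec dojo sh -c 'docker ps -q | xargs -r docker kill'

# Back on the main node: flush everything stuck under the autofs mountpoint.
# Assumption: a lazy unmount of the per-user entries is sufficient.
umount -l /data/homes/mounts/*

# Bring the site back up.
docker exec dojo dojo compose up -d
```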
-
Once again, we hit the …

Resolution

On main node: …

On each workspace node: …

On main node: …

On each workspace node: …

On main node: …
-
Node 192.168.42.2 went down this morning. I suspect that the root cause is homefs concurrency issues, but the real problem set in during the restart of the outer docker. The d-in-d daemon kept trying to access the homefs plugin, because it had volumes that were using it, but the homefs docker container had not started yet, because this was during docker initialization. This led to a deadlock. To resolve this, I moved the following directories out of …

I'm not sure all of them were necessary, but I am sure that `volumes` and `containers` were. After this, a restart of the outer docker cleared everything up.
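The directory list and parent path are truncated above; only `volumes` and `containers` are named as definitely needed. A hedged sketch of this kind of workaround, assuming the inner (d-in-d) docker data lives under `/data/docker` as elsewhere in this thread, and treating the exact directory set and quarantine location as assumptions:

```
# Move the directories that reference the homefs plugin out of the inner
# docker data directory so dockerd can initialize without the plugin.
# (Paths and directory set are assumptions based on the description above.)
mkdir -p /data/docker-quarantine
mv /data/docker/volumes    /data/docker-quarantine/
mv /data/docker/containers /data/docker-quarantine/

# Restart the outer docker (assumption: it runs as the host's systemd service).
systemctl restart docker
```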
-
We had a power outage that killed all nodes; fortunately they came back (after several hours). The database automatically came back on the database node. BTRFS came back on the workspace nodes, because they had /etc/fstab entries to do so, but did not on the main node. This was resolved by running …

The main node was still missing its ext4 override for docker data. We previously changed the workspace nodes to use ext4 for docker data via a `mkdir` … As a note, we have already configured the main node/workspace nodes; this is just for a restart (the correct data is already in config.env).

On the main node:

```
mount /dev/disk/by-uuid/829fdf9a-dd00-4081-9307-8c306e971806 /mnt/docker-data
mount --bind /mnt/docker-data /data/docker
```

On each workspace node:

```
mount --bind /mnt/docker-data /data/docker
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && docker run --privileged -d -v $PWD:/opt/pwn.college -v /data:/data --name dojo pwncollege/dojo
```

Unfortunately the homefs docker plugin messes everything up on the bad restart (need to fix this...):

```
docker exec -it dojo bash
rm -r /data/docker/containers/*; rm /data/docker/volumes/metadata.db; kill -9 $(pgrep dockerd)
dojo compose up -d
```

Back on the main node:

```
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && docker run --privileged -d -v $PWD:/opt/pwn.college -v /data:/data -p 22:22 -p 80:80 -p 443:443 -p 51820:51820/udp --name dojo pwncollege/dojo
```
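Not part of the original notes, but as a quick sanity check after this kind of restart (standard tooling; the inner `docker ps` assumes the docker-in-docker layout described elsewhere in this thread):

```
# Confirm the ext4 override is actually bind-mounted over the docker data dir
findmnt /data/docker

# Confirm the outer dojo container is up, then check the inner docker daemon
docker ps --filter name=dojo
docker exec dojo docker ps
```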
-
The pwn.college server went down on July 19, 2023 and came back up on July 20, 2023, for ~24 hours of downtime.
Here were the minimal steps taken to restore all service.

ZFS

The user homes are currently stored in ZFS.

- `zpool import` showed that our `data` pool still existed in its expected mirror state.
- `zpool import data` brought our data back into `/data`.
- `zpool status` revealed no detected issues with our ZFS data.

Currently we bind mount this into the dojo data.

- `mount --bind /data/homes /opt/dojo/data/homes/data` correctly restored home data.

Starting

- `docker run --privileged -d -v /opt/dojo:/opt/pwn.college -p 22:22 -p 80:80 -p 443:443 pwncollege/dojo` brought the infrastructure back up.

SENSAI

We were receiving errors in the `sensai-sense-tty` container that looked like `ERROR: Unknown struct/union: 'struct tty_struct'`. This was probably due to the container no longer having correct Linux header files, probably because the reboot caused a kernel upgrade.

- `cd sensai; docker compose build sense-tty --no-cache` rebuilt the tty sensor container.
- `cd sensai; docker compose up -d --force-recreate sense-tty` restarted the tty sensor container.
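For convenience, the same restore sequence collected into a single block (these are just the commands from the steps above, in order; nothing new):

```
# ZFS: re-import the pool and confirm it is healthy
zpool import        # the "data" pool should show up in its expected mirror state
zpool import data   # bring the pool back under /data
zpool status        # verify no detected issues

# Bind mount the ZFS homes into the dojo data
mount --bind /data/homes /opt/dojo/data/homes/data

# Bring the infrastructure back up
docker run --privileged -d -v /opt/dojo:/opt/pwn.college -p 22:22 -p 80:80 -p 443:443 pwncollege/dojo

# SENSAI: rebuild and restart the tty sensor (needed after a kernel upgrade)
cd sensai
docker compose build sense-tty --no-cache
docker compose up -d --force-recreate sense-tty
```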