Production - Server Restart and Restore #192
-
Update from the restart after ~12 hours of downtime on March 13th, 2024. All of the above worked great, with the exception that now (as documented) the …
-
From a fresh state, this is correct data-management logic. DO NOT just blindly run this; check (with …
-
Update!
Again, make sure …
-
Unfortunately zfs does not seem to support overlayfs mounts (at least on Linux 5.4; I believe modern Linux+zfs does). So, we make it ext4 with a simple bind mount: …
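The command block is elided above. A minimal sketch of this kind of ext4 bind-mount setup, assuming a hypothetical spare partition `/dev/sdb1` and the `/mnt/docker-data` and `/data/docker` paths that appear later in this thread:

```
# One-time: format the backing partition as ext4 (hypothetical device)
mkfs.ext4 /dev/sdb1

# Mount it, then bind it over the docker data directory
mkdir -p /mnt/docker-data /data/docker
mount /dev/sdb1 /mnt/docker-data
mount --bind /mnt/docker-data /data/docker
```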
-
Our first server reboot of the new Fall 2024 semester! These commands brought us back:
…
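The command block is elided above. Based on the restart sequences documented elsewhere in this thread, it was presumably something like the following; the exact volume and port flags for this particular node are assumptions:

```
# Rebuild and relaunch the dojo container after a reboot
# (pattern taken from the other restart notes in this thread;
#  exact volume/port flags for this node are assumptions)
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && \
  docker run --privileged -d \
    -v $PWD:/opt/pwn.college -v /data:/data \
    -p 22:22 -p 80:80 -p 443:443 \
    --name dojo pwncollege/dojo
```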
-
In order to encourage the server not to sleep:
…
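The command block is elided above. On a systemd-based host, one common way to do this (an assumption about what was actually run here) is to mask the sleep-related targets:

```
# Stop systemd from ever suspending or hibernating the machine
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```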
-
On October 11th, 2024, new users (and/or users that hadn't been active for a while) were getting "Docker Failed" errors. For some reason new mounts inside "/data/homes/mounts" were not possible. While stracing the autofs process during an attempt to access a new mount in there, I saw: …

Maybe this is part of the root cause, maybe it's not; unsure. ChatGPT and Google didn't reveal any quick fixes/discussion surrounding this after searching around for 5 minutes. So, the simplest solution was to restart. I opted for a soft manual restart of services, rather than multiple full machine restarts.

What I did

First I brought down the site on the main node to make sure that nobody could cause some weird state: …

On each node, I aggressively stopped all workspaces: …

This didn't actually work successfully for me, so I had to get even more aggressive: …

Back on the main node, I wanted to flush all the mounts (because new users were no longer able to mount their homes, and so were getting "Docker failed" errors): …

I then brought the site back up: …

Going Forward

It seems like we aren't getting homes automatically unmounted in our "/data/homes/mounts" autofs, and even worse, after ~6200 homes, things just broke. Unmounting something and then mounting a new thing worked, so it seems to be count based (but it's not clear what that count is; it certainly doesn't seem to be a power of 2). Fortunately, we are migrating away from our current homes setup, and so the current issues likely won't be very relevant in the future version.
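The individual command blocks above are elided. A minimal sketch of that kind of soft restart, using the `dojo compose` wrapper that appears elsewhere in this thread; the service name, the docker-in-docker layout, and the lazy-unmount step are assumptions, not the commands that were actually run:

```
# On the main node (inside the dojo container): stop the web frontend so
# nobody can create new state. The service name "ctfd" is an assumption.
docker exec dojo dojo compose stop ctfd

# On each workspace node: aggressively stop all workspace containers.
# Assumption: workspaces run under the docker-in-docker daemon inside "dojo".
docker exec dojo sh -c 'docker ps -q | xargs -r docker kill'

# Back on the main node: flush everything stuck under the autofs mountpoint.
# Assumption: a lazy unmount of the per-user entries is sufficient.
umount -l /data/homes/mounts/*

# Bring the site back up.
docker exec dojo dojo compose up -d
```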
-
Once again, we hit the …

Resolution

On main node: …

On each workspace node: …

On main node: …

On each workspace node: …

On main node: …
-
Node 192.168.42.2 went down this morning. I suspect that the root cause is homefs concurrency issues, but the real problem set in during the restart of the outer docker. The d-in-d daemon kept trying to access the homefs plugin, because it had volumes that were using it, but the homefs docker container had not started yet, because this was during docker initialization. This led to a deadlock. To resolve this, I moved the following directories out of …

I'm not sure all of them were necessary, but I am sure that `volumes` and `containers` were. After this, a restart of the outer docker cleared everything up.
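The directory list and parent path are truncated above; only `volumes` and `containers` are named as definitely needed. A hedged sketch of this kind of workaround, assuming the inner (d-in-d) docker data lives under `/data/docker` as elsewhere in this thread, and treating the exact directory set and quarantine location as assumptions:

```
# Move the directories that reference the homefs plugin out of the inner
# docker data directory so dockerd can initialize without the plugin.
# (Paths and directory set are assumptions based on the description above.)
mkdir -p /data/docker-quarantine
mv /data/docker/volumes    /data/docker-quarantine/
mv /data/docker/containers /data/docker-quarantine/

# Restart the outer docker (assumption: it runs as the host's systemd service).
systemctl restart docker
```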
-
We had a power outage that killed all nodes; fortunately they came back (after several hours). The database automatically came back on the database node. BTRFS came back on the workspace nodes, because they had /etc/fstab entries to do so, but did not on the main node. This was resolved by running …

The main node was still missing its ext4 override for docker data. We previously changed the workspace nodes to use ext4 for docker data via a `mkdir` … As a note, we have already configured the main node/workspace nodes; this is just for a restart (the correct data is already in config.env).

On the main node:

```
mount /dev/disk/by-uuid/829fdf9a-dd00-4081-9307-8c306e971806 /mnt/docker-data
mount --bind /mnt/docker-data /data/docker
```

On each workspace node:

```
mount --bind /mnt/docker-data /data/docker
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && docker run --privileged -d -v $PWD:/opt/pwn.college -v /data:/data --name dojo pwncollege/dojo
```

Unfortunately the homefs docker plugin messes everything up on the bad restart (need to fix this...):

```
docker exec -it dojo bash
rm -r /data/docker/containers/*; rm /data/docker/volumes/metadata.db; kill -9 $(pgrep dockerd)
dojo compose up -d
```

Back on the main node:

```
docker rm dojo
modprobe br_netfilter
git pull && docker build -t pwncollege/dojo . && docker run --privileged -d -v $PWD:/opt/pwn.college -v /data:/data -p 22:22 -p 80:80 -p 443:443 -p 51820:51820/udp --name dojo pwncollege/dojo
```
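Not part of the original notes, but as a quick sanity check after this kind of restart (standard tooling; the inner `docker ps` assumes the docker-in-docker layout described elsewhere in this thread):

```
# Confirm the ext4 override is actually bind-mounted over the docker data dir
findmnt /data/docker

# Confirm the outer dojo container is up, then check the inner docker daemon
docker ps --filter name=dojo
docker exec dojo docker ps
```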
-
The pwn.college server went down on July 19, 2023 and came back up on July 20, 2023, for ~24 hours of downtime.
Here were the minimal steps taken to restore all service.

ZFS

The user homes are currently stored in ZFS.

- `zpool import` showed that our `data` pool still existed in its expected mirror state.
- `zpool import data` brought our data back into `/data`.
- `zpool status` revealed no detected issues with our ZFS data.

Currently we bind mount this into the dojo data.

- `mount --bind /data/homes /opt/dojo/data/homes/data` correctly restored home data.

Starting

- `docker run --privileged -d -v /opt/dojo:/opt/pwn.college -p 22:22 -p 80:80 -p 443:443 pwncollege/dojo` brought the infrastructure back up.

SENSAI

We were receiving errors in the `sensai-sense-tty` container that looked like `ERROR: Unknown struct/union: 'struct tty_struct'`. This was probably due to the container no longer having correct Linux header files, probably because the reboot caused a kernel upgrade.

- `cd sensai; docker compose build sense-tty --no-cache` rebuilt the tty sensor container.
- `cd sensai; docker compose up -d --force-recreate sense-tty` restarted the tty sensor container.
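For convenience, the same restore sequence collected into a single block (these are just the commands from the steps above, in order; nothing new):

```
# ZFS: re-import the pool and confirm it is healthy
zpool import        # the "data" pool should show up in its expected mirror state
zpool import data   # bring the pool back under /data
zpool status        # verify no detected issues

# Bind mount the ZFS homes into the dojo data
mount --bind /data/homes /opt/dojo/data/homes/data

# Bring the infrastructure back up
docker run --privileged -d -v /opt/dojo:/opt/pwn.college -p 22:22 -p 80:80 -p 443:443 pwncollege/dojo

# SENSAI: rebuild and restart the tty sensor (needed after a kernel upgrade)
cd sensai
docker compose build sense-tty --no-cache
docker compose up -d --force-recreate sense-tty
```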