Setting up ZFS for PlanB

This is not a full-blown ZFS setup guide. But it will provide some tips to get a zpool up and running for PlanB.

Here, the most common setup using raidz2 is described with a concise explanation of the parameters. For more information, tips and tweaks, and why you should not skimp by using non-ECC memory, the author refers you to The Internet™.

This how-to assumes you're using ZFS 0.8.x on Linux, but it will likely work on other versions with slight adaptations.

  1. Selecting/preparing disks
  2. Using native ZFS encryption
  3. Setting up the zpool
  4. Explanation of zpool attributes
  5. Replacing faulty disks

Selecting/preparing disks

So, start with a bunch of disks. Let's say 34 10TB disks:

# cat /proc/partitions  | grep ' sd[a-z]*$' | sort -Vk4
   8        0 9766436864 sda
  65      160 9766436864 sdaa
  65      176 9766436864 sdab
  65      128 9766436864 sdy
  65      144 9766436864 sdz

The disks don't need to have the same size, but it helps (and hot spares will need to match the largest, for obvious reasons). For the common setup, you'll use the entire disk and not a partition. (ZFS will do its own partitioning, but you don't need to worry about that.)

You will want to triple check which disks you're using. You don't want to overwrite your operating system (OS) or some other important data. (In my case, the OS is on separate nvme drives, so I can safely use all sdX drives.)

These 34 disks will go into three ZFS vdevs plus a set of hot spares (see vdevs and raidz2 below):

  1. 10 disks
  2. 10 disks
  3. 10 disks
  4. 4 hot spares

A pro tip here is to use the device identifiers instead of the kernel-generated names. I don't think ZFS will have a problem finding the right device if the kernel renames sda to sdb, but when you're swapping defective disks, you'll be happy that you can match scsi-SSEAGATE_ST10000NM0226_0123 to the identifier printed on the physical disk.

So, step 1, find the drives:

# ls -go /dev/disk/by-id/ | grep '/sda$'
lrwxrwxrwx 1 10 Jun 24 08:29 scsi-35000c500af2fd4df -> ../../sda
lrwxrwxrwx 1 10 Jun 24 08:29 scsi-SSEAGATE_ST10000NM0226_0123 -> ../../sda
lrwxrwxrwx 1 10 Jun 24 08:29 wwn-0x5000c500af2fd4df -> ../../sda

# ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$'
lrwxrwxrwx 1  9 Jun 24 08:29 scsi-SSEAGATE_ST10000NM0226_0101 -> ../../sdac
lrwxrwxrwx 1 10 Jun 24 08:29 scsi-SSEAGATE_ST10000NM0226_0123 -> ../../sda
lrwxrwxrwx 1 10 Jun 24 08:29 scsi-SSEAGATE_ST10000NM0226_0226 -> ../../sde

# ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$' | wc -l

Drop them in a file somewhere:

# ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$' |
    awk '{print $7}'

# ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$' |
    awk '{print $7}' >disks

However, now they are sorted by serial number. I don't know if the serials are generated incrementally, but if they are, those with similar numbers may be part of a bad batch. We don't want all bad disks to end up on the same vdev. If a vdev fails, all data is lost.

So, to counter that, a simple shuf (shuffle) of the data is sufficient to ease my paranoia.

# ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$' |
    awk '{print $7}' | shuf >disks

Okay, now that the disks are shuffled, open an editor on the created disks file and prepend labels: 0 (and a space) before the first 10 disks, 1 before the next 10, 2 before the 10 after that, and lastly S before the 4 spares. Your file now looks like this:

0 scsi-SSEAGATE_ST10000NM0226_6351
0 scsi-SSEAGATE_ST10000NM0226_0226
0 scsi-SSEAGATE_ST10000NM0226_8412
1 scsi-SSEAGATE_ST10000NM0226_0123
S scsi-SSEAGATE_ST10000NM0226_8412

That's nice, because now we can quickly get the chosen disks from that file. For example, match the lines starting with "S " to get the 4 spares:

# awk '/^S /{print "disk/by-id/" $2}' disks
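The manual labelling step can also be scripted. This is only a minimal sketch, assuming the 10/10/10/4 split used in this how-to; label_disks is a hypothetical helper name, not something PlanB or ZFS provides:

```shell
# Hypothetical helper: label a list of disk IDs read from stdin.
# Assumes three vdevs of 10 disks each; everything after that becomes a spare.
label_disks() {
    awk -v per_vdev=10 -v vdevs=3 '{
        group = int((NR - 1) / per_vdev)       # 0 for lines 1-10, 1 for 11-20, ...
        label = (group < vdevs) ? group : "S"  # lines past the last vdev are spares
        print label, $0
    }'
}

# Usage (not run here): shuffle and label in one go.
#   ls -go /dev/disk/by-id/ | grep 'scsi-[^ ]*_.*/sd[a-z]*$' |
#       awk '{print $7}' | shuf | label_disks >disks
```

With a different number of disks per vdev, adjust per_vdev and vdevs accordingly; the result is the same file format as the hand-edited one.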

Using native ZFS encryption

If you're using ZFS on Linux 0.8.x or higher, you can use native encryption. You should enable this on the pool directly, so that all child datasets inherit it.

Don't worry about the key just yet. You can always change it, as it is a wrapping key only; that is, the key is used to decrypt the real key which never changes.

For now, start out with a passphrase key:

# pwgen -s 512 1
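If pwgen is not installed, a passphrase can also be generated from /dev/urandom with standard tools. A minimal sketch: gen_passphrase is a hypothetical helper, and the 64-byte default is an arbitrary choice. It yields 128 hex characters, which should sit comfortably within the 8 to 512 bytes that ZFS accepts for keyformat=passphrase.

```shell
# Hypothetical helper: print a random hex passphrase.
# $1 = number of random bytes to read (default 64, giving 128 hex characters).
gen_passphrase() {
    nbytes=${1:-64}
    # od prints the bytes as hex pairs; tr strips the spaces and newlines.
    head -c "$nbytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
}

# Usage:
#   gen_passphrase        # 128 hex characters
#   gen_passphrase 32     # 64 hex characters
```

Hex output carries less entropy per character than pwgen -s, so a hex passphrase needs to be longer for the same strength.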

Setting up the zpool

If you prepared which disks you'll be using according to the method described above, you now have a disks file with a destination "number" and a disk identifier.
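Since zpool create is not something you want to get wrong, it can be worth checking the disks file first. A minimal sketch, assuming the "label, space, disk identifier" format described above; check_disks_file is a hypothetical helper name:

```shell
# Hypothetical helper: sanity-check a labelled disks file before creating the pool.
check_disks_file() {
    file=$1
    # Print how many disks carry each label (0, 1, 2, S).
    awk '{count[$1]++} END {for (k in count) print k, count[k]}' "$file" | sort
    # Fail if any disk identifier appears more than once.
    dups=$(awk '{print $2}' "$file" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "$dups" | sed 's/^/duplicate disk ID: /' >&2
        return 1
    fi
}

# Usage:
#   check_disks_file disks
```

For the layout in this how-to you would expect a count of 10 for each of 0, 1 and 2, a count of 4 for S, and no duplicates reported.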

Setting up three vdevs and a set of spares is then as easy as this:

# zpool create -o ashift=12 \
    -O canmount=off -O xattr=sa \
    -O compression=lz4 -O encryption=aes-256-gcm \
    -O keylocation=prompt -O keyformat=passphrase \
    tank raidz2 \
    $(awk '/^0 /{print "disk/by-id/" $2}' disks)

# zpool add tank raidz2 $(awk '/^1 /{print "disk/by-id/" $2}' disks)

# zpool add tank raidz2 $(awk '/^2 /{print "disk/by-id/" $2}' disks)

# zpool add tank spare $(awk '/^S /{print "disk/by-id/" $2}' disks)

Check the zpool status:

# zpool status
  pool: tank
 state: ONLINE
  scan: none requested

  NAME                                  STATE
  tank                                  ONLINE
    raidz2-0                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_6351  ONLINE
      scsi-SSEAGATE_ST10000NM0226_0226  ONLINE
      scsi-SSEAGATE_ST10000NM0226_8412  ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    raidz2-1                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_0123  ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    raidz2-2                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    scsi-SSEAGATE_ST10000NM0226_9866    AVAIL
    scsi-SSEAGATE_ST10000NM0226_5992    AVAIL
    scsi-SSEAGATE_ST10000NM0226_5900    AVAIL
    scsi-SSEAGATE_ST10000NM0226_8412    AVAIL

Nice and shiny! We now have:

  • readable device IDs instead of kernel-generated sdX numbers;
  • shuffled disks to reduce the chance of a batch of bad disks ending up on the same vdev.

Explanation of zpool attributes

vdevs and raidz2

A ZFS pool consists of one or more vdevs (and optionally spares, cache and log disks).

Every vdev itself must be redundant. If one of the vdevs fails, your entire pool fails. A minimal safe setup would be:

  • one vdev with two mirrored disks.

By adding more vdevs, ZFS will do a form of striping on those (i.e. more read/write speed). It makes sense to make all vdevs equal in size, but it is not mandatory.

Our setup uses:

  • three vdevs with ten raidz2 disks per vdev and four spare disks.

raidz2 is the ZFS equivalent of RAID 6; in our case that means 8 disks' worth of data and 2 disks' worth of parity per vdev. Up to two disks in the same vdev may fail without data loss. And upon failure, a spare is automatically activated.
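To get a feel for what this layout costs in capacity, the arithmetic is simple enough to do in the shell. A rough sketch only: raidz_data_tb is a hypothetical helper, and the result ignores ZFS metadata, padding and the TB versus TiB difference, so real usable space will be lower.

```shell
# Hypothetical helper: rough data capacity of a pool of equal raidz vdevs.
# Arguments: number of vdevs, disks per vdev, parity disks per vdev, disk size in TB.
raidz_data_tb() {
    vdevs=$1 disks_per_vdev=$2 parity=$3 disk_tb=$4
    echo $(( vdevs * (disks_per_vdev - parity) * disk_tb ))
}

# Our setup: three raidz2 vdevs of ten 10TB disks each.
raidz_data_tb 3 10 2 10   # prints 240
```

So of the 340TB of raw disk in this example, roughly 240TB holds data, 60TB holds parity and 40TB sits idle as hot spares.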

zpool create options

In the create commands above, we use ashift=12, canmount=off, xattr=sa, compression=lz4 and encryption=aes-256-gcm:

  • ashift=12: Many newer disks report 512-byte sectors (ashift=9) for compatibility, but in reality have 4K sectors (ashift=12). Setting this explicitly avoids misaligned writes and the performance hit that comes with them.
  • canmount=off: Because we don't want to write in the root dataset.
  • xattr=sa: Lets us add extended attributes in inodes. We don't use them for now, but they can be nice to have later.
  • compression=lz4: LZ4 is a relatively fast compression scheme that gives you better performance, and improves the security of the encryption (because of the increased entropy). (Note that we'll consider CRIME-based attacks (using partial compression to attack encryption) irrelevant on the local system.)
  • encryption=aes-256-gcm: Yes. We want the best native encryption we can get now.
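The ashift value is the base-2 logarithm of the sector size, which is why 512-byte sectors map to 9 and 4K sectors to 12. A small sketch to make that concrete; ashift_for is a hypothetical helper and assumes the size is a power of two:

```shell
# Hypothetical helper: compute the ashift for a sector size given in bytes.
# Assumes the size is a power of two.
ashift_for() {
    size=$1
    bits=0
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        bits=$(( bits + 1 ))
    done
    echo "$bits"
}

ashift_for 512    # prints 9
ashift_for 4096   # prints 12

# On Linux, the sector size a disk reports can be read from sysfs (not run here):
#   cat /sys/block/sda/queue/physical_block_size
```

Keep in mind that a disk may report 512 there while using 4K sectors internally, which is the reason for setting ashift=12 by hand.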

Replacing faulty disks

When you're dealing with a DEGRADED array, you'll want to use the zpool replace command. Mark the failing disk offline using zpool offline and replace that disk with the new one.

# zpool offline tank scsi-SSEAGATE_ST10000NM0226_0123
# ledctl locate=/dev/disk/by-id/scsi-SSEAGATE_ST10000NM0226_0123

Swap the disks, and replace:

# zpool replace tank scsi-SSEAGATE_ST10000NM0226_0123 \
    /dev/disk/by-id/scsi-NEW_DISK
# ledctl locate_off=/dev/disk/by-id/scsi-NEW_DISK

If you have trouble finding the original disk, use zdb to list the disks. You can then use the guid as the old disk when replacing.
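With 34 disks, the zpool status output gets long, and it helps to filter it down to the devices that need attention. A minimal sketch, assuming the usual layout where the device name is the first column and the state the second; unhealthy_devices is a hypothetical helper name and the list of states is not necessarily exhaustive:

```shell
# Hypothetical helper: filter `zpool status` output (on stdin) down to
# pools, vdevs and disks that are not healthy.
unhealthy_devices() {
    # Skip header lines such as "state: DEGRADED"; match on the STATE column.
    awk '$1 !~ /:$/ && $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ {print $1, $2}'
}

# Usage (not run here):
#   zpool status tank | unhealthy_devices
```

The pool and the affected vdev show up as DEGRADED as well, so the disk to replace is the innermost entry in the list.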

hot spares

When the hot spares work as intended, a failing disk will have been substituted by a spare already. But you'll still need to manually swap them out to make the array completely online.

# zpool status
  NAME                                  STATE
  tank                                  DEGRADED
    raidz2-0                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_6351  ONLINE
      scsi-SSEAGATE_ST10000NM0226_0226  ONLINE
      scsi-SSEAGATE_ST10000NM0226_8412  ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    raidz2-1                            DEGRADED
      spare-1                           DEGRADED
        scsi-SSEAGATE_ST10000NM0226_0123  UNAVAIL
        scsi-SSEAGATE_ST10000NM0226_9866  ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    raidz2-2                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE
    scsi-SSEAGATE_ST10000NM0226_9866    INUSE
    scsi-SSEAGATE_ST10000NM0226_5992    AVAIL
    scsi-SSEAGATE_ST10000NM0226_5900    AVAIL
    scsi-SSEAGATE_ST10000NM0226_8412    AVAIL

This requires some manual action:

# zpool detach tank scsi-SSEAGATE_ST10000NM0226_0123

Now the array should be ONLINE again, and scsi-SSEAGATE_ST10000NM0226_9866 will be gone from the spares list.

Use ledctl to find the broken disk and physically replace it with a new one. Lastly, run zpool add tank spare NEW_DISK to bring the number of spares back up.