Tester edited this page Nov 22, 2015 · 1 revision

enl.global filesystem

System physical drives

  • one OCZ-VERTEX2 3.5 SSD 110 G : /dev/sdb
  • one WDC WD20NPVX-00EA4T0 external HD 568 G : /dev/sdd

Partition scheme

  • a btrfs HD partition for miscellaneous data
  • a btrfs HD partition for DB, with one layer:
    1. a backing device. This HDD partition will be cached by a caching device
  • an SSD Btrfs partition for the OS
  • an SSD bcache partition for caching

Note:

  • some articles recommend building a btrfs filesystem on top of LVM partitions. This makes no sense and should be avoided.
  • the use of bcache with Btrfs has been reported to lead to file corruption. This setup is considered hazardous, thus a tight backup policy is implemented.
  • by default, btrfs performs CoW (Copy on Write) for all files, at all times. This can negatively affect performance with large files, and it is recommended to disable CoW for database files.

Initial SSD drive set up

SSD memory cell clearing

This resets the SSD's memory cells to the same virgin state they were in when the drive was manufactured.

Step 1 - Check if drive security is frozen

# hdparm -I /dev/sdb
.....................
Security: 
		supported
	not	enabled
	not	locked
		frozen

If the command output shows "frozen" as above, one cannot continue. In this case, we must suspend the system first:

# systemctl suspend

On resume, the hdparm command will show that the device is no longer frozen.
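
The frozen check can be scripted. The following is a minimal sketch, assuming the hdparm -I output keeps the layout shown above; the function name is_frozen is hypothetical.

```shell
# Read `hdparm -I` output on stdin.
# Exit status 0 if the Security section reports "frozen",
# 1 if it reports "not frozen" (or nothing at all).
is_frozen() {
    awk '$1 == "frozen" { found = 1 } END { exit !found }'
}

# Intended use (as root):
#   hdparm -I /dev/sdb | is_frozen && systemctl suspend
```

The test on the first field is what separates the "frozen" line from the "not frozen" line, where the first field is "not".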

Step 2 - Enable security

Choose any password, as it is only temporary: the password will eventually be set back to NULL.

# hdparm --user-master u --security-set-pass Trollolo /dev/sdb

When issuing the # hdparm -I /dev/sdb command again, the output shall now display enabled.

Security: 
		supported
		enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
        Security level high
	400min for SECURITY ERASE UNIT. 400min for ENHANCED SECURITY ERASE UNIT. 

Step 3 - Issue the ATA Secure Erase command

# hdparm --user-master u --security-erase Trollolo /dev/sdb

The drive is now erased. The output of the # hdparm -I /dev/sdb command will look like this:

Security: 
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	400min for SECURITY ERASE UNIT. 400min for ENHANCED SECURITY ERASE UNIT. 

SSD partitioning

Partition alignment

In the past, proper alignment required manual calculation and intervention when partitioning. We will use one partition for our root filesystem.

Modern method - Using GPT

The disk uses a GUID Partition Table (GPT). With GPT, the utility for editing the partition table is called gdisk.

# gdisk /dev/sdb
............................................................
Disk /dev/sdb: 224674128 sectors, 107.1 GiB
   1            2048       167774207   80.0 GiB    8300  poppy-root
   2       167774208       224674094   27.1 GiB    8300  poppy-cache

Ensure our partition is aligned:

# blockdev --getalignoff /dev/sdb
0
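
The alignment can also be cross-checked by hand from the start sectors in the gdisk listing. This is a minimal sketch; the 512-byte logical sector size and the 1 MiB alignment boundary are assumptions, and is_aligned is a hypothetical helper name.

```shell
# Succeed if a partition starting at START_SECTOR begins on an ALIGN-byte boundary.
# Defaults are assumptions: 512-byte logical sectors, 1 MiB alignment.
is_aligned() {
    start_sector=$1
    sector_size=${2:-512}
    align=${3:-1048576}
    [ $(( start_sector * sector_size % align )) -eq 0 ]
}

# Start sectors taken from the gdisk listing above
for s in 2048 167774208; do
    if is_aligned "$s"; then
        echo "sector $s: aligned"
    else
        echo "sector $s: NOT aligned"
    fi
done
```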

Note:

  • poppy-root hosts the OS
  • poppy-cache is the caching partition

HD partitioning

# gdisk /dev/sdd
......................................................................
Disk /dev/sdd: 3907029168 sectors, 1.8 TiB
   3      2081282048      2853033983   368.0 GiB   8300  poppy-storage
   4      2853033984      3274092543   200.8 GiB   8300  poppy-encrypt
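
The sizes reported by gdisk follow directly from the sector ranges. A short worked check, assuming 512-byte sectors (part_size_gib is a hypothetical helper name):

```shell
# Print the size in GiB of a partition given its first and last sector
# (both inclusive), assuming 512-byte sectors.
part_size_gib() {
    awk -v first="$1" -v last="$2" \
        'BEGIN { printf "%.1f\n", (last - first + 1) * 512 / (1024 * 1024 * 1024) }'
}

part_size_gib 2081282048 2853033983   # poppy-storage
part_size_gib 2853033984 3274092543   # poppy-encrypt
```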

Make filesystem

Warning: two layers, LUKS and Bcache, will be used, with one filesystem, btrfs. This setup can quickly become a mess, and the order in which things are done is important. Best practice is to let the encrypted containers reside below the layers stacked on top of them (in our case Btrfs).

1- Bcache

Bcache is a Linux kernel block layer cache. It allows one fast disk drive, such as a flash-based solid state drive (SSD), to act as a caching device for one or more slower hard disk drives, the backing devices. Bcache migrates data from the caching device to the backing device to persist data and make room for new data in the cache.

bcache scheme

Note:

  • the caching device shall be a whole device, whereas backing devices can be any partitions on any HD.
  • if an error makes it necessary to run the format command again, it is best to run # wipefs -a on the partitions first, otherwise the kernel may accidentally misdetect filesystems that are no longer there.
  • the backing device has to be formatted as a bcache block device. For an existing formatted partition, one can try blocks to-bcache.

# make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C /dev/sdb2 --discard

Note: the above command sets

  • a bucket size of 2 MB. The bucket size should match the erase block size of the caching device, with the intent of reducing write amplification. 2 MB is a safe value for an SSD
  • discard, since most modern SSDs work better with it
  • a block size of 4k. The block size should match the backing device's sector size, which will usually be either 512 or 4k
  • automatic attachment: since all associated devices are formatted at once, bcache will automatically attach the "-B backing stores" to the "-C ssd cache"

Register the caching device against the backing device.

# echo 93d1664f-8a3a-4a0d-9616-8cf422073438 > /sys/block/bcache1/bcache/attach

Enable write-back mode to gain maximum performance

# echo writeback > /sys/block/bcache1/bcache/cache_mode
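
To confirm the change took effect, one can read the same sysfs file back: bcache lists all modes and marks the active one with square brackets. A minimal sketch (active_cache_mode is a hypothetical helper name):

```shell
# Read the contents of a bcache cache_mode file on stdin and print the
# active mode, i.e. the one shown between square brackets.
active_cache_mode() {
    tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'
}

# Intended use:
#   active_cache_mode < /sys/block/bcache1/bcache/cache_mode
```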

Management

To see the state of the bcache device:

$ ls /sys/block/sdb/sdb2/bcache
block_size  btree_written  bucket_size  cache_replacement_policy  clear_stats  discard  io_errors  metadata_written  nbuckets  priority_stats  set@  written

Then

# cat /sys/block/sdb/sdb2/bcache/...
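
Rather than reading each attribute one by one, the whole directory can be dumped in a loop. A minimal sketch, where dump_attrs is a hypothetical helper name:

```shell
# Print "name: value" for every regular file directly under the given directory.
# Subdirectories (such as the set link) are skipped; attributes that cannot
# be read (such as clear_stats) print an empty value.
dump_attrs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(cat "$f" 2>/dev/null)"
    done
}

# Intended use:
#   dump_attrs /sys/block/sdb/sdb2/bcache
```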

2 - Dmcrypt + LUKS

NOTE: unfortunately, an nspawn container does not see a mapped encrypted device. It is thus impossible to mount and use an encrypted database. The settings below are kept for the record.

As of linux kernel 3.2, it is now considered safe to have btrfs on top of dmcrypt. One must encrypt the partition before running mkfs.btrfs.

Configure LUKS partition

This is done with the cryptsetup command, which manages plain dm-crypt and LUKS encrypted volumes. LUKS encrypted partitions are usually preferred, as they are less prone to user error.

  • fill the device with random data. The command below is much faster than dd if=/dev/random of=/dev/mapper/container but provides lower quality random data.
# badblocks -c 10240 -s -w -t random /dev/sdd4
...............................................
Testing with random pattern: done                                                 
Reading and comparing: done  
  • create a keyfile. A keyfile is any file whose contents are used as the passphrase to unlock an encrypted volume. Best practice is to store the keyfile somewhere other than on the computer itself (remote host, USB device...).
# mkdir /etc/keys
# dd if=/dev/urandom of=/etc/keys/poppy.luks bs=512 count=4
4+0 records in
4+0 records out
2048 bytes (2.0 kB) copied, 0.000511517 s, 4.0 MB/s
  • format the partition with the keyfile. Here we are using the default AES cipher in XTS mode with effective 256-bit encryption (in XTS mode, the 512-bit key requested with -s 512 is split into two 256-bit halves).
# cryptsetup -s 512 luksFormat /dev/bcache0 /etc/keys/poppy.luks
  • check the partition is encrypted
# cryptsetup luksDump /dev/bcache0
 LUKS header information for /dev/bcache0

Version:       	1
Cipher name:   	aes
Cipher mode:   	xts-plain64
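
Since the keyfile acts as the passphrase, it makes sense to keep it unreadable by other users. The following is a minimal sketch of the keyfile creation step with restrictive permissions; the 0400 mode and the function name make_keyfile are assumptions, not part of the original setup.

```shell
# Create a 2048-byte random keyfile readable by its owner only.
# The 0400 mode is an assumption; the original setup does not specify one.
make_keyfile() {
    (
        umask 077    # no group/other access while the file is being written
        dd if=/dev/urandom of="$1" bs=512 count=4 2>/dev/null
    ) && chmod 0400 "$1"
}

# Intended use (as root):
#   make_keyfile /etc/keys/poppy.luks
```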

Unlock/Map LUKS partition

To access the device's decrypted contents, a mapping must be established using the kernel device-mapper. We will use sdd4_crypt as the mapper name.

# cryptsetup --key-file /etc/keys/poppy.luks luksOpen /dev/sdd4 sdd4_crypt
$ ls /dev/mapper
.................
sdd4_crypt@
------------------------------------------------
$ lsblk -o NAME,KNAME,MAJ:MIN,FSTYPE,LABEL
...............................................
└─sdd4                                       sdd4    8:52  crypto_LUKS 
  └─sdd4_crypt dm-8  253:8   

Remove the mapping:

# cryptsetup --key-file /etc/keys/poppy.luks luksClose sdd4_crypt

Remove key

# cryptsetup --key-file /etc/keys/poppy.luks luksRemoveKey /dev/bcache0 

3- Make Btrfs on Luks encrypted sdd4

# mkfs.btrfs -L poppy-encrypt /dev/mapper/sdd4_crypt
$ lsblk -o NAME,KNAME,MAJ:MIN,FSTYPE,LABEL
...............................................
└─sdd4                   sdd4      8:52  bcache      
  └─bcache0              bcache0 254:0   crypto_LUKS 
    └─sdd4_crypt         dm-7    253:7   btrfs       poppy-encrypt

Make Btrfs on SSD partition sdb1

# mkfs.btrfs -f -L poppy-root /dev/sdb1

Create subvolume

# mkdir /mnt/btrfs
# mount -t btrfs /dev/sdb1 /mnt/btrfs 
# cd /mnt/btrfs

# btrfs subvolume create var
# btrfs subvolume create etc
# btrfs subvolume create rootvol

Check everything is correct

# btrfs subvolume list .
ID 266 gen 39 top level 5 path rootvol
ID 268 gen 41 top level 5 path var
ID 269 gen 42 top level 5 path etc
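
When scripting against this listing, only the subvolume paths are usually needed. A minimal sketch, assuming the output format shown above (subvol_paths is a hypothetical helper name):

```shell
# Read `btrfs subvolume list` output on stdin and print only the paths.
subvol_paths() {
    sed -n 's/^ID [0-9]* gen [0-9]* top level [0-9]* path //p'
}

# Intended use (as root):
#   btrfs subvolume list /mnt/btrfs | subvol_paths
```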

Create subvolume for DB

# mount /dev/dm-7 /mnt/btrfs
# btrfs subvolume create db

Create subvolume for storage

# mount /dev/sdd3 /mnt/btrfs
# btrfs subvolume create storage

Verify the overall Btrfs filesystem:

# btrfs filesystem show
Label: 'poppy-root'  uuid: ef1b44cd-e7b0-4166-b933-e7d4d20a1171
	Total devices 1 FS bytes used 915.97MiB
	devid    1 size 80.00GiB used 3.01GiB path /dev/sdb1

Label: 'poppy-snapshots'  uuid: 89979a3d-d8eb-4464-a6c6-5f514d766643
	Total devices 1 FS bytes used 112.00KiB
	devid    1 size 162.74GiB used 2.04GiB path /dev/sda2

Label: 'poppy-storage'  uuid: a1e643cf-5bb3-42db-b977-8b9a105c72f7
	Total devices 1 FS bytes used 400.00KiB
	devid    1 size 368.00GiB used 2.04GiB path /dev/sdd3

Label: 'poppy-db'  uuid: de093502-cbc7-4f88-a564-eaf845953742
	Total devices 1 FS bytes used 400.00KiB
	devid    1 size 200.78GiB used 2.04GiB path /dev/bcache0

Note:

  • even though a subvolume looks like an ordinary subdirectory (as shown by the ls command), the filesystem treats it as if it were on a separate physical device.
  • contrary to what can be read here and there, it is bad practice to create nested subvolumes (e.g. subvol/myPartition).

Mount option

SSD

TRIM

In order to sustain long-term performance, the TRIM command must be run regularly. There is no need to add the discard mount flag, as fstrim is run periodically by a systemd service.

To verify that the SSD supports TRIM, run the following command:

# hdparm -I /dev/sda | grep TRIM
        *    Data Set Management TRIM supported (limit 1 block)
        *    Deterministic read data after TRIM
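
The TRIM check and the periodic trim can be tied together in a script. This is a minimal sketch: supports_trim is a hypothetical helper name, and the use of the fstrim.timer unit shipped with util-linux is an assumption about which systemd unit does the periodic run.

```shell
# Read `hdparm -I` output on stdin; succeed if the drive advertises TRIM.
supports_trim() {
    grep -q 'TRIM supported'
}

# Intended use (as root); fstrim.timer is assumed to be the periodic unit:
#   hdparm -I /dev/sdb | supports_trim && systemctl enable --now fstrim.timer
```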

Use of tmpfs

  • In order to speed things up and save read/write cycles, one can relocate some heavily used directories to a tmpfs filesystem. This is achieved with anything-sync-daemon.
  • Under systemd, /tmp is automatically mounted as a tmpfs even though you have no entry for that in your /etc/fstab.

Disable CoW for Data Base

By default, btrfs performs CoW for all files. CoW comes with some advantages, but can negatively affect performance with large files that have small random writes because it will fragment them (even if no "copy" is ever performed!). It is recommended to disable CoW for database files.

To do so, use the nodatacow mount option.

FSTAB

Btrfs

  • in /etc/fstab, the mount option subvol="subvolume-name" has to be specified, and the fsck setting in the last field has to be 0.
  • btrfs filesystems can make use of zlib (default) and lzo compression, which means that compressible files will be stored in compressed form on the drive, saving space. Using compression, especially lzo compression, can improve throughput.

SSD

  • noatime : fully disables writing file access times to the drive every time you read a file.
  • autodefrag : will detect random writes into existing files and kick off background defragging.
  • ssd : turn on some of the SSD optimized behaviour within btrfs

NOTE: discard (Enables discard/TRIM on freed blocks) is not used as trim is run periodically

/etc/fstab
---------------------------------------------------------
LABEL=poppy-root                                 /var/lib/machines/poppy                btrfs           rw,noatime,autodefrag,compress=lzo,ssd,subvol=rootvol   0       0
LABEL=poppy-root                                 /var/lib/machines/poppy/etc            btrfs           rw,noatime,autodefrag,compress=lzo,ssd,subvol=etc       0       0
LABEL=poppy-root                                 /var/lib/machines/poppy/var            btrfs           rw,noatime,autodefrag,compress=lzo,ssd,subvol=var       0       0
LABEL=poppy-storage                              /var/lib/machines/poppy/storage        btrfs           rw,noatime,autodefrag,compress=lzo,nodatacow,subvol=storage       0       0
LABEL=poppy-db                                   /var/lib/machines/poppy/db             btrfs           rw,noatime,autodefrag,compress=lzo,subvol=db            0       0
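
The two Btrfs rules stated above (a subvol= option, and 0 in the last field) can be checked mechanically. A minimal sketch, where check_btrfs_fstab is a hypothetical helper name:

```shell
# Read fstab-formatted lines on stdin and report btrfs entries that lack a
# subvol= option or whose sixth field (fsck pass) is not 0.
# Exit status is 1 if any problem was reported, 0 otherwise.
check_btrfs_fstab() {
    awk '
        /^[ \t]*#/ { next }    # skip comment lines
        $3 == "btrfs" {
            if ($4 !~ /(^|,)subvol=/) { print $2 ": missing subvol="; bad = 1 }
            if ($6 != "0")            { print $2 ": fsck pass field is not 0"; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Intended use:
#   check_btrfs_fstab < /etc/fstab
```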

Crypttab

The /etc/crypttab (or, encrypted device table) file contains a list of encrypted devices that are to be unlocked when the system boots, similar to fstab. It is read before fstab, so that dm-crypt containers can be unlocked before the filesystem inside is mounted.

/etc/crypttab
sdd4_crypt     UUID=c5514aef-28cc-4b1a-aefe-25f8ac1d128b    /etc/keys/poppy.luks

Resources

btrfs howtoforge

Marc Merlin btrfs blog

btrfs on LWM

btrfs wiki

btrfs tips on Archwiki

Archlinux on Btrfs - bitloom blog

Bcache Archwiki

Bcache official documentation

Bcache documentation at Kernel.org

pommi nethuis blog

blocks to-bcache GitHub