Filesystem
- one OCZ-VERTEX2 3.5 SSD, 110 G: /dev/sdb
- one WDC WD20NPVX-00EA4T0 external HD, 568 G: /dev/sdd
- a Btrfs HD partition for miscellaneous data
- a Btrfs HD partition for the DB, with one layer:
- a backing device. This partition on the HDD will be cached by a caching device
- an SSD Btrfs partition for the OS
- an SSD bcache partition for caching
Note:
- some articles recommend building the Btrfs filesystem on top of LVM partitions. This makes no sense and should be avoided.
- the use of bcache under Btrfs has been reported to lead to file corruption. This setup is considered hazardous, so a tight backup policy is implemented.
- by default, Btrfs performs CoW (Copy on Write) for all files, at all times. This can negatively affect performance with large files, and it is recommended to disable CoW for database files.
A secure erase resets the SSD's cells to the same virgin state they were in when manufactured.
# hdparm -I /dev/sdb
.....................
Security:
supported
not enabled
not locked
frozen
If the command output shows "frozen" as above, the erase cannot proceed. In this case, suspend the machine first, then resume it.
# systemctl suspend
On resume, the hdparm command will show that the device is marked as not frozen.
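The frozen check can be scripted. The following is a minimal sketch, assuming the Security block of `hdparm -I` looks like the excerpt above; `is_frozen` is a hypothetical helper name, and the sample text stands in for real drive output.

```shell
#!/bin/sh
# is_frozen: read `hdparm -I` output on stdin.
# Exit 0 if a line consists of the single word "frozen", exit 1 otherwise.
# Hypothetical helper, not part of hdparm.
is_frozen() {
    # Assigning $1 to itself makes awk rebuild the line with single spaces,
    # so indented "frozen" and "not   frozen" lines compare cleanly.
    awk '{ $1 = $1 } $0 == "frozen" { found = 1 } END { exit !found }'
}

# Sample text copied from the excerpt above (stands in for a real drive).
sample='Security:
  supported
  not enabled
  not locked
  frozen'

if printf '%s\n' "$sample" | is_frozen; then
    echo "drive is frozen: suspend and resume before continuing"
else
    echo "drive is not frozen"
fi
```

On a real system the function would be fed directly, for example `hdparm -I /dev/sdb | is_frozen`, which requires root.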
Choose any password, as it is only temporary: it is set back to NULL once the erase completes.
# hdparm --user-master u --security-set-pass Trollolo /dev/sdb
When issuing the # hdparm -I /dev/sdb command again, the output shall now display enabled.
Security:
supported
enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
Security level high
400min for SECURITY ERASE UNIT. 400min for ENHANCED SECURITY ERASE UNIT.
# hdparm --user-master u --security-erase Trollolo /dev/sdb
The drive shall now be erased. The output of the # hdparm -I /dev/sdb command will look like this:
Security:
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
400min for SECURITY ERASE UNIT. 400min for ENHANCED SECURITY ERASE UNIT.
In the past, proper alignment required manual calculation and intervention when partitioning. We will use one partition for our root filesystem.
The disks use GUID Partition Tables (GPT). With GPT, the utility for editing the partition table is called gdisk.
# gdisk /dev/sdb
............................................................
Disk /dev/sdb: 224674128 sectors, 107.1 GiB
1 2048 167774207 80.0 GiB 8300 poppy-root
2 167774208 224674094 27.1 GiB 8300 poppy-cache
Ensure our partition is aligned:
# blockdev --getalignoff /dev/sdb
0
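The alignment can also be checked by hand from the start sectors that gdisk prints. The sketch below assumes 512-byte logical sectors (consistent with 224674128 sectors for 107.1 GiB) and a 1 MiB alignment boundary; `is_aligned` is a hypothetical helper, and both defaults are assumptions that can be overridden.

```shell
#!/bin/sh
# is_aligned START_SECTOR [SECTOR_BYTES] [BOUNDARY_BYTES]
# Succeeds when the partition's first byte falls exactly on the boundary.
# Defaults: 512-byte sectors, 1 MiB (1048576-byte) boundary (assumptions).
is_aligned() {
    [ $(( $1 * ${2:-512} % ${3:-1048576} )) -eq 0 ]
}

# Start sectors of poppy-root and poppy-cache, taken from the gdisk listing.
for start_sector in 2048 167774208; do
    if is_aligned "$start_sector"; then
        echo "sector $start_sector: aligned"
    else
        echo "sector $start_sector: NOT aligned"
    fi
done
```

Both start sectors are multiples of 2048, so each partition begins on a 1 MiB boundary.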
Note:
- poppy-root hosts the OS
- poppy-cache is the caching partition
# gdisk /dev/sdd
......................................................................
Disk /dev/sdd: 3907029168 sectors, 1.8 TiB
3 2081282048 2853033983 368.0 GiB 8300 poppy-storage
4 2853033984 3274092543 200.8 GiB 8300 poppy-encrypt
Warning: two layers, LUKS and bcache, will be used, with one filesystem, Btrfs. This setup can quickly become a mess, and the order in which things are done is important. Best practice is to let the encrypted containers reside below stacked block devices (in our case Btrfs).
Bcache is a Linux kernel block layer cache. It allows a fast disk drive, such as a flash-based solid state drive (SSD), to act as a caching device for one or more slower hard disk drives, the backing devices. Bcache migrates data from the caching device to the backing device to persist data and make room for new data in the cache.
Note:
- the caching device shall be a whole device, whereas backing devices can be any partitions on any HD.
- in case of an error that requires creating the filesystem again, it is best to first run
# wipefs -a
on the partitions, otherwise the kernel may accidentally misdetect filesystems that are no longer there.
- the backing device has to be formatted as a bcache block device. For an existing formatted partition, one must try blocks to-bcache.
make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C /dev/sdb2 --discard
Note: the above command sets
- a bucket size of 2 MB. The bucket size should match the erase block size of the caching device, with the intent of reducing write amplification. 2 MB is a safe value for an SSD
- discard, since most modern SSDs work better with it
- the block size, which should match the sector size of the backing devices, usually either 512 or 4k
- all associated disks at once, so bcache will automatically attach the "-B" backing stores to the "-C" SSD cache
Register the caching device against the backing device.
# echo 93d1664f-8a3a-4a0d-9616-8cf422073438 > /sys/block/bcache1/bcache/attach
Enable write-back mode to gain maximum performance
# echo writeback > /sys/block/bcache1/bcache/cache_mode
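To confirm that the mode change took effect, the same sysfs file can be read back. The sketch below assumes the file lists every available mode on one line with the active one in square brackets; `active_cache_mode` is a hypothetical helper, and the sample line stands in for the real file content.

```shell
#!/bin/sh
# active_cache_mode: read the content of a bcache cache_mode file on stdin
# and print the mode enclosed in square brackets.
# Hypothetical helper; assumes a line like the sample below.
active_cache_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Sample content (assumed format), standing in for
# /sys/block/bcache1/bcache/cache_mode
sample='writethrough [writeback] writearound none'

printf '%s\n' "$sample" | active_cache_mode
```

On the machine itself this would be `active_cache_mode < /sys/block/bcache1/bcache/cache_mode`.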
To see the state of the bcache device:
$ ls /sys/block/sdb/sdb2/bcache
block_size btree_written bucket_size cache_replacement_policy clear_stats discard io_errors metadata_written nbuckets priority_stats set@ written
Then
# cat /sys/block/sdb/sdb2/bcache/...
NOTE: unfortunately, an nspawn container does not see a mapped encrypted device. It is thus impossible to mount and use an encrypted database. The settings below are kept for the record.
As of Linux kernel 3.2, it is considered safe to have Btrfs on top of dm-crypt. One must encrypt the partition before running mkfs.btrfs.
This is done with the cryptsetup command, which manages plain dm-crypt and LUKS encrypted volumes. LUKS encrypted partitions are usually preferred, as they leave less room for user error.
- fill the device with random data
The badblocks command below is much faster than dd if=/dev/random of=/dev/mapper/container, but provides lower quality random data.
# badblocks -c 10240 -s -w -t random /dev/sdd4
...............................................
Testing with random pattern: done
Reading and comparing: done
- create a keyfile
A keyfile is any file whose contents are used as the passphrase to unlock an encrypted volume. Best practice is to store the keyfile somewhere other than the computer itself (remote host, USB device...).
# mkdir /etc/keys
# dd if=/dev/urandom of=/etc/keys/poppy.luks bs=512 count=4
4+0 records in
4+0 records out
2048 bytes (2.0 kB) copied, 0.000511517 s, 4.0 MB/s
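Since the keyfile acts as the passphrase, it is worth creating it with restrictive permissions from the start. The following is a minimal sketch of that idea, not the procedure used above: `make_keyfile` is a hypothetical helper, the 2048-byte default mirrors the size shown in the dd output, and the demonstration writes to a temporary directory rather than /etc/keys.

```shell
#!/bin/sh
# make_keyfile PATH [BYTES]
# Write BYTES random bytes (default 2048) to PATH, readable by the owner only.
# Hypothetical helper; head -c is used in place of dd for brevity.
make_keyfile() {
    (
        umask 077                               # never group/world readable
        head -c "${2:-2048}" /dev/urandom > "$1"
    ) && chmod 0400 "$1"                        # then drop write permission
}

# Demonstration in a throwaway directory (assumption: mktemp is available).
demo_dir=$(mktemp -d)
make_keyfile "$demo_dir/demo.luks"
ls -l "$demo_dir/demo.luks"
rm -rf "$demo_dir"
```

The subshell keeps the umask change from leaking into the calling shell.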
- format the partition with the keyfile
Here we are using the default AES cipher in XTS mode. XTS splits the 512-bit key (-s 512) in two, giving an effective 256-bit encryption.
# cryptsetup -s 512 luksFormat /dev/bcache0 /etc/keys/poppy.luks
- check the partition is encrypted
# cryptsetup luksDump /dev/bcache0
LUKS header information for /dev/bcache0
Version: 1
Cipher name: aes
Cipher mode: xts-plain64
To access the device's decrypted contents, a mapping must be established using the kernel device-mapper. We will use sdd4_crypt for mapper name.
# cryptsetup --key-file /etc/keys/poppy.luks luksOpen /dev/sdd4 sdd4_crypt
$ ls /dev/mapper
.................
sdd4_crypt@
------------------------------------------------
$ lsblk -o NAME,KNAME,MAJ:MIN,FSTYPE,LABEL
...............................................
└─sdd4 sdd4 8:52 crypto_LUKS
└─sdd4_crypt dm-8 253:8
Remove the mapping:
# cryptsetup --key-file /etc/keys/poppy.luks luksClose sdd4_crypt
Remove key
# cryptsetup --key-file /etc/keys/poppy.luks luksRemoveKey /dev/bcache0
# mkfs.btrfs -L poppy-encrypt /dev/mapper/sdd4_crypt
$ lsblk -o NAME,KNAME,MAJ:MIN,FSTYPE,LABEL
...............................................
└─sdd4 sdd4 8:52 bcache
└─bcache0 bcache0 254:0 crypto_LUKS
└─sdd4_crypt dm-7 253:7 btrfs poppy-encrypt
# mkfs.btrfs -f -L poppy-root /dev/sdb1
# mkdir /mnt/btrfs
# mount -t btrfs /dev/sdb1 /mnt/btrfs
# cd /mnt/btrfs
# btrfs subvolume create var
# btrfs subvolume create etc
# btrfs subvolume create rootvol
Check everything is correct
# btrfs subvolume list .
ID 266 gen 39 top level 5 path rootvol
ID 268 gen 41 top level 5 path var
ID 269 gen 42 top level 5 path etc
Create subvolume for DB
# mount /dev/dm-7 /mnt/btrfs
# btrfs subvolume create db
Create subvolume for storage
# mount /dev/sdd3 /mnt/btrfs
# btrfs subvolume create storage
Verify the overall Btrfs filesystem:
# btrfs filesystem show
Label: 'poppy-root' uuid: ef1b44cd-e7b0-4166-b933-e7d4d20a1171
Total devices 1 FS bytes used 915.97MiB
devid 1 size 80.00GiB used 3.01GiB path /dev/sdb1
Label: 'poppy-snapshots' uuid: 89979a3d-d8eb-4464-a6c6-5f514d766643
Total devices 1 FS bytes used 112.00KiB
devid 1 size 162.74GiB used 2.04GiB path /dev/sda2
Label: 'poppy-storage' uuid: a1e643cf-5bb3-42db-b977-8b9a105c72f7
Total devices 1 FS bytes used 400.00KiB
devid 1 size 368.00GiB used 2.04GiB path /dev/sdd3
Label: 'poppy-db' uuid: de093502-cbc7-4f88-a564-eaf845953742
Total devices 1 FS bytes used 400.00KiB
devid 1 size 200.78GiB used 2.04GiB path /dev/bcache0
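When several filesystems are listed, a short label-to-device summary is easier to compare against the partition plan. The sketch below parses text shaped like the output above; `fs_devices` is a hypothetical helper, the sample is a trimmed copy of that output, and labels containing spaces are not handled.

```shell
#!/bin/sh
# fs_devices: read `btrfs filesystem show` output on stdin and print one
# "label device" pair per devid line. Hypothetical helper; assumes labels
# without spaces, as in the listing above.
fs_devices() {
    awk -v q="'" '
        $1 == "Label:" { label = $2; gsub(q, "", label) }  # strip the quotes
        $1 == "devid"  { print label, $NF }                # last field = path
    '
}

# Trimmed copy of the output shown above.
sample="Label: 'poppy-root' uuid: ef1b44cd-e7b0-4166-b933-e7d4d20a1171
Total devices 1 FS bytes used 915.97MiB
devid 1 size 80.00GiB used 3.01GiB path /dev/sdb1
Label: 'poppy-db' uuid: de093502-cbc7-4f88-a564-eaf845953742
Total devices 1 FS bytes used 400.00KiB
devid 1 size 200.78GiB used 2.04GiB path /dev/bcache0"

printf '%s\n' "$sample" | fs_devices
```

On a live system the input would come from `btrfs filesystem show | fs_devices`, run as root.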
Note:
- even though a subvolume looks like an ordinary subdirectory (as returned by the ls command), the filesystem treats it as if it were on a separate physical device.
- contrary to what can be read here and there, it is bad practice to create nested subvolumes (i.e. subvol/myPartition).
In order to sustain long-term performance, the TRIM command must be run regularly. There is no need to add the discard mount flag, as fstrim is run periodically by a systemd service.
To verify that the SSD supports TRIM, run the following command:
# hdparm -I /dev/sda | grep TRIM
* Data Set Management TRIM supported (limit 1 block)
* Deterministic read data after TRIM
- In order to speed things up and save read/write cycles, one can relocate some heavily used directories to a tmpfs filesystem. This is achieved with anything-sync-daemon.
- Under systemd, /tmp is automatically mounted as a tmpfs, even if there is no entry for it in /etc/fstab.
By default, Btrfs performs CoW for all files. CoW comes with some advantages, but can negatively affect performance with large files that have small random writes, because it will fragment them (even if no "copy" is ever performed!). It is recommended to disable CoW for database files.
Use the nodatacow option.
- in /etc/fstab, the mount option subvol="subvolume-name" has to be specified, and the fsck setting in the last field has to be 0.
- Btrfs filesystems can use zlib (default) or lzo compression: compressible files are stored in compressed form on the drive, which saves space. Compression, especially lzo, can also improve throughput.
- noatime: fully disables writing file access times to the drive every time a file is read.
- autodefrag: detects random writes into existing files and kicks off background defragmentation.
- ssd: turns on some of the SSD-optimized behaviour within Btrfs.
NOTE: discard (enables discard/TRIM on freed blocks) is not used, as TRIM is run periodically.
/etc/fstab
---------------------------------------------------------
LABEL=poppy-root /var/lib/machines/poppy btrfs rw,noatime,autodefrag,compress=lzo,ssd,subvol=rootvol 0 0
LABEL=poppy-root /var/lib/machines/poppy/etc btrfs rw,noatime,autodefrag,compress=lzo,ssd,subvol=etc 0 0
LABEL=poppy-root /var/lib/machines/poppy/var btrfs rw,noatime,autodefrag,compress=lzo,ssd,subvol=var 0 0
LABEL=poppy-storage /var/lib/machines/poppy/storage btrfs rw,noatime,autodefrag,compress=lzo,nodatacow,subvol=storage 0 0
LABEL=poppy-db /var/lib/machines/poppy/db btrfs rw,noatime,autodefrag,compress=lzo,subvol=db 0 0
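The two fstab rules stated earlier (a subvol= option on every Btrfs entry, and 0 in the fsck field) are easy to check mechanically. The sketch below is one possible way to do so; `check_fstab` is a hypothetical helper, and the sample holds two of the entries above.

```shell
#!/bin/sh
# check_fstab: read fstab-formatted text on stdin. For each btrfs entry,
# report a missing subvol= option or a non-zero fsck field (6th column).
# Exit status is 0 when nothing was reported. Hypothetical helper.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }                  # skip comments and blanks
        $3 == "btrfs" {
            if ($4 !~ /(^|,)subvol=[^,]+/) { print "missing subvol=: " $2; bad = 1 }
            if ($6 != 0)                   { print "fsck field not 0: " $2; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Two entries copied from the fstab above.
sample='LABEL=poppy-root /var/lib/machines/poppy btrfs rw,noatime,autodefrag,compress=lzo,ssd,subvol=rootvol 0 0
LABEL=poppy-db /var/lib/machines/poppy/db btrfs rw,noatime,autodefrag,compress=lzo,subvol=db 0 0'

if printf '%s\n' "$sample" | check_fstab; then
    echo "btrfs entries follow both rules"
fi
```

Against the real file this would be `check_fstab < /etc/fstab`.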
The /etc/crypttab (encrypted device table) file contains a list of encrypted devices that are to be unlocked when the system boots, similar to fstab. It is read before fstab, so that dm-crypt containers can be unlocked before the filesystem inside is mounted.
/etc/crypttab
sdd4_crypt UUID=c5514aef-28cc-4b1a-aefe-25f8ac1d128b /etc/keys/poppy.luks
Archlinux on Btrfs - bitloom blog