blk-switch is a redesign of the Linux kernel storage stack that achieves μs-scale latencies while saturating 100Gbps links, even when tens of applications are colocated on a server. The key insight in blk-switch is that the multi-queue storage and network I/O architecture makes storage stacks conceptually similar to network switches. The main design contributions of blk-switch are:
- Multiple egress queues: enable isolating request types (prioritization)
- Request steering: avoids transient congestion (load balancing)
- Application steering: avoids persistent congestion (switch scheduling)
- block/ includes the blk-switch core implementation.
- drivers/nvme/ includes remote storage stack kernel modules such as i10 and NVMe-over-TCP (nvme-tcp).
- include/linux/ includes small changes to a few headers for blk-switch and i10.
- osdi21_artifact/ includes scripts for OSDI21 artifact evaluation.
- scripts/ includes scripts for the getting started instructions.
For simplicity, we assume that users have two physical servers (Host and Target) connected over the network. The Target server has the actual storage devices (e.g., RAM block device, NVMe SSD, etc.), and the Host server accesses the Target-side storage devices via a remote storage stack (e.g., i10, nvme-tcp) over the network. The Host server runs latency-sensitive applications (L-apps) and throughput-bound applications (T-apps) using standard I/O APIs (e.g., Linux AIO), while blk-switch provides μs-scale latency and high throughput at the kernel block device layer.
Through the following three sections, we provide getting started instructions to install blk-switch and to run experiments.
- Build blk-switch Kernel (10 human-mins + 30 compute-mins + 5 reboot-mins):
  blk-switch is currently implemented in the core of the Linux kernel storage stack (blk-mq at the block device layer), so it requires kernel compilation and a system reboot into the blk-switch kernel. This section covers how to build the blk-switch kernel and the i10/nvme-tcp kernel modules.
- Setup Remote Storage Devices (5 human-mins):
  This section covers how to set up remote storage devices using the i10/nvme-tcp kernel modules.
- Run Toy-experiments (5-10 compute-mins):
  This section covers how to run experiments with the blk-switch kernel.
The detailed instructions to reproduce all individual results presented in our OSDI21 paper are provided in the "osdi21_artifact" directory.
blk-switch has been successfully tested on Ubuntu 16.04 LTS with kernel 5.4.43. Building the blk-switch kernel should be done on both Host and Target servers.
(Don't forget to be root)
1. Download Linux kernel source tree:

       sudo -s
       cd ~
       wget https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.43.tar.gz
       tar xzvf linux-5.4.43.tar.gz
2. Download blk-switch source code and copy to the kernel source tree:

       git clone https://github.com/resource-disaggregation/blk-switch.git
       cd blk-switch
       cp -rf block drivers include ~/linux-5.4.43/
       cd ~/linux-5.4.43/
3. Update kernel configuration:

       cp /boot/config-x.x.x .config
       make olddefconfig

   "x.x.x" is a kernel version. It can be your current kernel version or the latest version available on your system. Type "uname -r" to see your current kernel version.

   Edit the ".config" file to include your name in the kernel version:

       vi .config

       (in the file)
       ...
       CONFIG_LOCALVERSION="-jaehyun"
       ...

   Save the .config file and exit.
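   If you prefer not to open an editor, the same edit can be scripted; a minimal sketch, assuming the ".config" generated above and using the example suffix "-jaehyun":

       # Set the local version suffix in .config without opening an editor
       sed -i 's/^CONFIG_LOCALVERSION=.*/CONFIG_LOCALVERSION="-jaehyun"/' .config
       # Confirm the change
       grep CONFIG_LOCALVERSION .config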
4. Make sure i10 and nvme-tcp modules are included in the kernel configuration:

       make menuconfig

   - Device Drivers ---> NVME Support ---> <M> NVM Express over Fabrics TCP host driver
   - Device Drivers ---> NVME Support ---> <M> NVMe over Fabrics TCP target support
   - Device Drivers ---> NVME Support ---> <M> i10: A New Remote Storage I/O Stack (host)
   - Device Drivers ---> NVME Support ---> <M> i10: A New Remote Storage I/O Stack (target)

   Press "Save" and "Exit".
5. Compile and install:

       (See NOTE below for '-j24')
       make -j24 bzImage
       make -j24 modules
       make modules_install
       make install

   NOTE: The number 24 is the number of threads created for compilation. Set it to the total number of cores of your system to reduce the compilation time. Type "lscpu | grep 'CPU(s)'" to see the total number of cores:

       CPU(s):                24
       On-line CPU(s) list:   0-23
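   Equivalently, you can let the shell pick the thread count instead of hard-coding 24:

       # nproc prints the number of available cores
       make -j$(nproc) bzImage
       make -j$(nproc) modules
       make modules_install
       make install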
6. Edit "/etc/default/grub" to boot with your new kernel by default. For example:

       ...
       #GRUB_DEFAULT=0
       GRUB_DEFAULT="1>Ubuntu, with Linux 5.4.43-jaehyun"
       ...
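   If you are unsure of the exact menu entry title, one common way to list the titles in grub.cfg is shown below (the path may differ across distributions). The "1>" prefix in GRUB_DEFAULT selects the submenu at index 1 (usually "Advanced options for Ubuntu") and then the entry title inside it:

       # Print the submenu and menuentry titles known to grub
       grep -E "submenu |menuentry " /boot/grub/grub.cfg | cut -d"'" -f2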
7. Update the grub configuration and reboot into the new kernel:

       update-grub && reboot
8. Do the same steps 1--7 for both Host and Target servers.
9. When the systems have rebooted, check the kernel version by typing "uname -r". It should be "5.4.43-(your name)".
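   You can also confirm that the i10/nvme-tcp modules were installed for the new kernel. The module file names below are an assumption based on the menuconfig entries above; adjust the patterns if they match nothing:

       uname -r
       # Look for the nvme-tcp and i10 module files under the new kernel
       find /lib/modules/$(uname -r) -name 'nvme-tcp*' -o -name '*i10*'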
We will compare two systems in the toy-experiment section, "blk-switch" and "Linux". We implemented a part of blk-switch (multi-egress support of i10) in the nvme-tcp kernel module. Therefore, we use
- nvme-tcp module for "blk-switch"
- i10 module for "Linux"
We now configure a RAM null-blk device as a remote storage device at the Target server.
(In step 4, we provide a script that covers steps 1--3, so please go to step 4 if you want to skip steps 1--3)
1. Create null-block devices (10GB):

       sudo -s
       modprobe null-blk gb=10 bs=4096 irqmode=1 hw_queue_depth=1024 submit-queues=24

   Use the number of cores of your system for "submit-queues".
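   To pick the "submit-queues" value and confirm that the device came up (the device name /dev/nullb0 assumes the default single null-blk instance):

       # Number of cores, for the submit-queues parameter above
       nproc
       # The null-blk device should appear as a 10GB block device
       lsblk /dev/nullb0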
2. Load i10/nvme-tcp target kernel modules:

       modprobe nvmet
       modprobe nvmet-tcp
       modprobe i10-target
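   To confirm the target-side modules are loaded (dashes in module names show up as underscores in lsmod output):

       lsmod | grep -e nvmet -e i10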
3. Create an nvme-of subsystem:

       mkdir /sys/kernel/config/nvmet/subsystems/(subsystem name)
       cd /sys/kernel/config/nvmet/subsystems/(subsystem name)
       echo 1 > attr_allow_any_host
       mkdir namespaces/10
       cd namespaces/10
       echo -n (device name) > device_path
       echo 1 > enable
       mkdir /sys/kernel/config/nvmet/ports/1
       cd /sys/kernel/config/nvmet/ports/1
       echo xxx.xxx.xxx.xxx > addr_traddr
       echo (protocol name) > addr_trtype
       echo 4420 > addr_trsvcid
       echo ipv4 > addr_adrfam
       ln -s /sys/kernel/config/nvmet/subsystems/(subsystem name) /sys/kernel/config/nvmet/ports/1/subsystems/(subsystem name)

   - device name: "/dev/nullb0" for null-blk, "/dev/nvme0n1" for NVMe SSD
   - xxx.xxx.xxx.xxx: Target IP address
   - protocol name: "tcp", "i10", etc.
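   To sanity-check the configuration afterwards, you can read the values back out of configfs:

       cat /sys/kernel/config/nvmet/ports/1/addr_traddr
       cat /sys/kernel/config/nvmet/ports/1/addr_trtype
       cat /sys/kernel/config/nvmet/subsystems/*/namespaces/10/device_path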
4. Or, you can use our script for a quick setup (for RAM null-blk devices):

       sudo -s
       cd ~/blk-switch/scripts/
       (see NOTE below)
       ./target_null.sh

   NOTE: please edit ~/blk-switch/scripts/system_env.sh to specify the Target IP address, the Network Interface name associated with the Target IP address, and the number of cores before running target_null.sh. You can type "lscpu | grep 'CPU(s)'" to get the number of cores of your system.
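   A couple of commands that help you fill in system_env.sh (the variable names inside that script are not listed here; see the file itself):

       # Interface names with their IP addresses
       ip -br addr show
       # Number of cores
       nproc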
(Go to step 4 if you want to skip steps 2--3. You still need to run step 1.)
1. Install NVMe utility (nvme-cli):

   (If the "nvme list" command works, you can skip this step.)

       sudo -s
       cd ~
       git clone https://github.com/linux-nvme/nvme-cli.git
       cd nvme-cli
       make
       make install
2. Load i10/nvme-tcp host kernel modules:

       modprobe nvme-tcp
       modprobe i10-host
3. Connect to the target subsystem:

       nvme connect -t (protocol name) -n (subsystem name) -a (target IP address) -s 4420 -q nvme_tcp_host -W (num of cores)
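   For example, with placeholder values (the subsystem name "nvme_null", the address 192.168.1.100, and the core count 24 are illustrative; use the values from your Target setup):

       nvme connect -t tcp -n nvme_null -a 192.168.1.100 -s 4420 -q nvme_tcp_host -W 24
       # To tear the connection down later:
       nvme disconnect -n nvme_null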
4. Or, you can use our script for a quick setup:

       cd ~/blk-switch/scripts/
       (see NOTE below)
       ./host_tcp_null.sh

   NOTE: please edit ~/blk-switch/scripts/system_env.sh to specify the Target IP address, the Network Interface name associated with the Target IP address, and the number of cores before running host_tcp_null.sh. You can type "lscpu | grep 'CPU(s)'" to get the number of cores of your system.
5. Check the remote storage device name you just created (e.g., /dev/nvme0n1):

       nvme list
1. At Both Target and Host:

   Please make sure that you edited system_env.sh correctly during the host configuration.

       sudo -s
       cd ~/blk-switch/scripts/
       ./system_setup.sh

   NOTE: system_setup.sh enables aRFS on Mellanox ConnectX-5 NICs. For different NICs, you may need to follow a different procedure to enable aRFS (please refer to the NIC documentation). If the NIC does not support aRFS, the results that you observe could be significantly different. (We have not experimented with setups where aRFS is disabled.)

   The below error messages from system_setup.sh are normal. Please ignore them:

       Cannot get device udp-fragmentation-offload settings: Operation not supported
       Cannot get device udp-fragmentation-offload settings: Operation not supported
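   For NICs other than ConnectX-5, the usual manual aRFS recipe looks roughly like the sketch below (this is not what system_setup.sh does verbatim; "eth0" and the table sizes are placeholders):

       # Enable ntuple filtering (required for accelerated RFS)
       ethtool -K eth0 ntuple on
       # Size the global RFS flow table
       echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
       # Per-RX-queue flow counts
       for f in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
           echo 2048 > $f
       done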
At Host: we run FIO to test blk-switch using the remote null-blk device (/dev/nvme0n1).
2. Install FIO:

   (Type "fio -v". If FIO is already installed, you can skip this step.)

       sudo -s
       apt-get install fio

   Or refer to https://github.com/axboe/fio to install the latest version.
3. Run one L-app and one T-app on a core:

       cd ~/blk-switch/scripts/
       (see NOTE below)
       ./toy_example_blk-switch.sh

   NOTE: Edit toy_example_blk-switch.sh if your remote null-blk device created above for blk-switch is not /dev/nvme0n1.
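   To give a sense of what the script does, the sketch below runs one latency-sensitive job and one throughput-bound job pinned to the same core. The block sizes, queue depths, and runtime are illustrative assumptions; the actual parameters live in toy_example_blk-switch.sh:

       # L-app: small random reads at queue depth 1 (latency-sensitive)
       fio --name=lapp --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
           --rw=randread --bs=4k --iodepth=1 --cpus_allowed=0 \
           --runtime=30 --time_based --output=output_blk-switch_lapp &
       # T-app: large reads at high queue depth (throughput-bound), same core
       fio --name=tapp --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
           --rw=randread --bs=128k --iodepth=32 --cpus_allowed=0 \
           --runtime=30 --time_based --output=output_blk-switch_tapp &
       wait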
4. Compare with Linux (pure i10 without blk-switch):

       cd ~/blk-switch/scripts/
       ./host_i10_null.sh
       (see NOTE below)
       ./toy_example_linux.sh

   NOTE: Check the remote storage device name newly added after executing host_i10_null.sh. We assume it is /dev/nvme1n1. Edit toy_example_linux.sh if not.
5. Validate results (see the output files in the same directory):

   If the system has multiple cores per socket:

   - The L-app is isolated by blk-switch, achieving lower latency than Linux. Check the unit of the 99.00th latency (us or ns) with the last grep command below.

         cd ~/blk-switch/scripts/
         grep 'clat (' output_linux_lapp output_blk-switch_lapp
         grep '99.00th' output_linux_lapp output_blk-switch_lapp
         grep 'clat percentiles' output_linux_lapp output_blk-switch_lapp

   - While achieving low latency, blk-switch also achieves throughput comparable to Linux.

         cd ~/blk-switch/scripts/
         grep IOPS output_linux_tapp output_blk-switch_tapp
To continue the artifact evaluation: please go to "osdi21_artifact/".