udev block device rule #26

Draft: wants to merge 1 commit into master

@missinglink (Member) commented Feb 8, 2022

add a udev rule to ensure block device symlinks exist for modern NVMe EBS mappings

for a while now I've noticed intermittent startup failures on smaller Elasticsearch machines. I was never able to put my finger on the issue, but I suspected a race between kernel tasks and cloud-init.

today I made some progress, mainly because upgrading to Ubuntu Focal (20.04) seems to make it fail more consistently.

what I think is going on is that either there's a race between cloud-init and udev, or udev isn't working properly because python2.7 isn't available on modern Ubuntu releases.

but backing up a second, what's the issue?

well, on some more modern AWS instance types you request an EBS block device mapping of something like /dev/sdb, but when the machine boots it's actually available as /dev/nvme2n1 or something similar 🤷‍♂️

it's kind of odd, but I believe this is because 'Nitro' instances expose EBS volumes as NVMe devices.

so to get around this glaring issue, AWS encodes some 'vendor info' in the vendor-specific area of the NVMe controller's identify data, which contains the device name you actually requested in the mapping.

a udev rule (the last line below) is then responsible for reading this info and creating a symlink:

cat /etc/udev/rules.d/10-aws.rules
KERNEL=="xvd*", PROGRAM="/sbin/ec2udev-vbd %k", SYMLINK+="%c"
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/sbin/ebsnvme-id -u /dev/%k", SYMLINK+="%c"
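For reference, here's a rough Python approximation of which kernel device names the two rules above would fire on. It uses `fnmatch.fnmatchcase` as a stand-in for udev's KERNEL== glob matching (an assumption: udev has its own glob code, which may differ in edge cases), and the helper names are made up for illustration. The interesting bit is that the NVMe glob also matches partitions like nvme0n1p1; it's the ENV{DEVTYPE}=="disk" key that filters them out.

```python
from fnmatch import fnmatchcase

# Globs and model string copied from /etc/udev/rules.d/10-aws.rules above.
XVD_GLOB = "xvd*"
NVME_GLOB = "nvme[0-9]*n[0-9]*"
EBS_MODEL = "Amazon Elastic Block Store"


def nvme_rule_applies(kernel_name: str, devtype: str, model: str) -> bool:
    """Hypothetical helper approximating the NVMe rule's match keys.

    Every key must match, like the comma-separated keys in the udev rule.
    udev ignores trailing whitespace in sysfs attribute values when
    matching, so the model is right-stripped before comparing.
    """
    return (
        fnmatchcase(kernel_name, NVME_GLOB)
        and devtype == "disk"
        and model.rstrip() == EBS_MODEL
    )


def xvd_rule_applies(kernel_name: str) -> bool:
    """Hypothetical helper approximating KERNEL=="xvd*" in the Xen rule."""
    return fnmatchcase(kernel_name, XVD_GLOB)


if __name__ == "__main__":
    # "*" can absorb "p1", so partitions match the glob too...
    print(fnmatchcase("nvme0n1p1", NVME_GLOB))                      # True
    # ...and DEVTYPE=="disk" is what excludes them.
    print(nvme_rule_applies("nvme0n1p1", "partition", EBS_MODEL))   # False
    print(nvme_rule_applies("nvme2n1", "disk", EBS_MODEL))          # True
    # Assumed instance-store model string (from the by-id names in this thread).
    print(nvme_rule_applies("nvme1n1", "disk",
                            "Amazon EC2 NVMe Instance Storage"))    # False
    print(xvd_rule_applies("xvdb"))                                 # True
```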

when this symlink isn't created, or isn't created YET, things break:

mke2fs 1.45.5 (07-Jan-2020)
The file /dev/sdb does not exist and no size was specified.
waiting for elasticsearch service to come up
..............................

Elasticsearch did not come up, check configuration
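If the only problem were the race (the symlink showing up late), a boot script could wait for the device before running mke2fs. Here's a minimal Python sketch of that, with a hypothetical `wait_for_path` helper. On a real box `udevadm settle` is the usual way to wait for the udev event queue to drain. Note that neither approach helps if the rule never creates the symlink at all.

```python
import os
import time


def wait_for_path(path: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until `path` exists or `timeout` seconds pass (hypothetical helper).

    Returns True if the path appeared and False on timeout. os.path.exists
    follows symlinks, so a dangling /dev/sdb link still counts as missing.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

e.g. call `wait_for_path("/dev/sdb", timeout=60)` before formatting, and fail loudly with a clear message if it returns False, rather than letting mke2fs and the Elasticsearch health check fail later.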

the udev rule installed by default seems to be broken on modern Ubuntu: it runs /sbin/ebsnvme-id, which requires python2.7, and python2.7 isn't installed 😢

I tried installing python2.7 and triggering the rules, and it still doesn't work, so 🤷‍♂️

that's when I found this article https://opensource.creativecommons.org/blog/entries/2020-04-03-nvmee-on-debian-on-aws/ which pointed me to https://github.com/oogali/ebs-automatic-nvme-mapping

missinglink force-pushed the nvme-udev-rule branch 2 times, most recently from 5018484 to fc048fd (February 8, 2022 15:26)

@orangejulius (Member)

Nice, yeah, the current system here is a bit brittle and complicated.

In some Geocode Earth infra we use a simpler method, which is basically to look at the symlinks in /dev/disk/by-id:

$ ls -lh /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS222D6E08AD542C2D4 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-Amazon_Elastic_Block_Store_vol0d00dbaf264800849 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 15:28 nvme-Amazon_Elastic_Block_Store_vol0d00dbaf264800849-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-nvme.1d0f-4157533232324436453038414435343243324434-416d617a6f6e20454332204e564d6520496e7374616e63652053746f72616765-00000001 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-nvme.1d0f-766f6c3064303064626166323634383030383439-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 15:28 nvme-nvme.1d0f-766f6c3064303064626166323634383030383439-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1 -> ../../nvme0n1p1

That doesn't require any tools, seems to be populated instantly, and makes it quite clear which volumes are EBS and which are NVMe instance storage. Maybe we simplify and use that?

missinglink force-pushed the nvme-udev-rule branch 2 times, most recently from b5e4115 to 916bd51 (February 8, 2022 16:00)

@missinglink (Member, Author) commented Feb 8, 2022

I had a look at that, and unfortunately it doesn't seem possible. The AWS docs say there's no guarantee that the NVMe device numbers (nvme0n1, nvme1n1, etc.) correspond to the order the volumes were defined in the block device mapping. The only consistent way seems to be to read the vendor-specific data in the block device's binary identify header, where the requested mapping path is encoded.

This is what that command looks like on a t4g.xlarge ARM instance:

ls -lh /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 16 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8-part15 -> ../../nvme0n1p15
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0eeb851c145fc2b4d -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0fd2cdbaccc423f58 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 16 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part15 -> ../../nvme0n1p15
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3065656238353163313435666332623464-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3066643263646261636363343233663538-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme1n1

the scripts in this repo create these symlinks, and I don't see a way to derive them from the information above:

lrwxrwxrwx  1 root root           7 Feb  8 16:15 sda1 -> nvme0n1
lrwxrwxrwx  1 root root           7 Feb  8 16:15 sdb -> nvme2n1
lrwxrwxrwx  1 root root           7 Feb  8 16:15 sdc -> nvme1n1
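To make that concrete: the long nvme-nvme.1d0f entries appear to be hex-encoded identify fields. Here's a Python sketch that decodes one, using a hypothetical `decode_by_id_name` helper. The layout (including the namespace ID being hex) is inferred from the names in this thread rather than from documentation. What comes out is the vendor ID (0x1d0f, same as vid in the id-ctrl output), the serial number (for EBS, the volume ID without its hyphen), the model string and a namespace ID. Nothing in there says sdb.

```python
def decode_by_id_name(name: str) -> dict:
    """Decode 'nvme-nvme.<vid>-<hex serial>-<hex model>-<nsid>[-partN]'.

    Hypothetical helper; the layout is inferred from the by-id listings
    in this thread, not from any spec.
    """
    prefix = "nvme-nvme."
    if not name.startswith(prefix):
        raise ValueError(f"unexpected by-id name: {name}")
    vid, serial_hex, model_hex, nsid_hex, *rest = name[len(prefix):].split("-")
    serial = bytes.fromhex(serial_hex).decode("ascii")
    # EBS reports the volume ID in the serial field, minus the hyphen.
    volume_id = "vol-" + serial[3:] if serial.startswith("vol") else None
    return {
        "vendor_id": "0x" + vid,
        "serial": serial,
        "volume_id": volume_id,
        "model": bytes.fromhex(model_hex).decode("ascii"),
        "nsid": int(nsid_hex, 16),  # assumption: namespace ID printed as hex
        "partition": rest[0] if rest else None,
    }


if __name__ == "__main__":
    # The by-id entry pointing at nvme2n1 (requested as sdb) on the t4g.xlarge.
    name = (
        "nvme-nvme.1d0f-766f6c3065656238353163313435666332623464-"
        "416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001"
    )
    for key, value in decode_by_id_name(name).items():
        print(f"{key}: {value}")
```

For that entry this prints the volume ID vol-0eeb851c145fc2b4d and the model Amazon Elastic Block Store, which identifies the volume but not the sdb name it was requested as.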

@missinglink (Member, Author)

The script we have, which selects the first available disk matching a pattern using head -n1, is error-prone: there are multiple matching disks and no guarantee that head -n1 picks the correct one.
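A toy illustration of the difference, using the device and requested-name pairs from the symlinks listed earlier as sample data. `by_pattern_first` is a hypothetical stand-in for the current approach (the real pattern in our script isn't shown here, so `nvme[1-9]n1` is made up), and `by_requested_name` is roughly what the udev rule achieves: pick the device whose encoded name matches the requested mapping.

```python
from fnmatch import fnmatchcase

# Sample data from the t4g.xlarge: kernel device -> name requested in the
# EBS block device mapping (as encoded in the NVMe vendor data).
REQUESTED = {"nvme0n1": "sda1", "nvme1n1": "sdc", "nvme2n1": "sdb"}


def by_pattern_first(devices, pattern="nvme[1-9]n1"):
    """Hypothetical stand-in for `... | grep ... | head -n1`.

    Returns whichever matching device sorts first, regardless of what
    was requested in the block device mapping.
    """
    matches = sorted(d for d in devices if fnmatchcase(d, pattern))
    return matches[0] if matches else None


def by_requested_name(requested, wanted):
    """Return the kernel device whose requested mapping name is `wanted`."""
    for device, name in requested.items():
        if name == wanted:
            return device
    return None


if __name__ == "__main__":
    print(by_pattern_first(REQUESTED))          # nvme1n1, which is actually sdc
    print(by_requested_name(REQUESTED, "sdb"))  # nvme2n1
```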

@missinglink (Member, Author)

I've tested this on a t4g.xlarge running an AMI tagged dev-es7.16-arm and after a few iterations it's working great 🎉

Before we consider merging this, we should replace the cURL commands I'm using to fetch the scripts from GitHub with actual files committed to this repo, for security reasons.

@missinglink (Member, Author)

for reference, this is what the binary-encoded vendor data looks like (note: sdb is encoded in the first bytes):

sudo nvme id-ctrl --vendor-specific /dev/nvme2n1
NVME Identify Controller:
vid     : 0x1d0f
ssvid   : 0x1d0f
sn      : vol0eeb851c145fc2b4d
mn      : Amazon Elastic Block Store
fr      : 1.0
rab     : 32
ieee    : a002dc
cmic    : 0
mdts    : 6
cntlid  : 0
ver     : 10000
rtd3r   : 0
rtd3e   : 0
oaes    : 0x100
ctratt  : 0
oacs    : 0
acl     : 4
aerl    : 0
frmw    : 0x3
lpa     : 0
elpe    : 63
npss    : 0
avscc   : 0x1
apsta   : 0
wctemp  : 343
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
edstt   : 0
dsto    : 0
fwug    : 0
kas     : 0
hctma   : 0
mntmt   : 0
mxtmt   : 0
sanicap : 0
hmminds : 0
hmmaxd  : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0
fuses   : 0
fna     : 0
vwc     : 0
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
subnqn  :
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:0.01W operational enlat:1000000 exlat:1000000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
vs[]:
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 73 64 62 20 20 20 20 20 20 20 20 20 20 20 20 20 "sdb............."
0010: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
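The name sits at the very start of the vendor-specific area, padded with spaces (0x20). In the raw 4096-byte Identify Controller structure, the NVMe spec puts the vendor-specific area at bytes 3072 to 4095, so extracting the name is just a slice. Here's a minimal Python sketch, assuming you already have the raw bytes (e.g. saved with `sudo nvme id-ctrl --raw-binary /dev/nvme2n1 > idctrl.bin`). The 32-byte field width and the handling of an optional /dev/ prefix are assumptions based on the dump above and on what the rule needs for SYMLINK+="%c", not on AWS documentation; `requested_name` is a hypothetical helper.

```python
VS_OFFSET = 3072  # start of the vendor-specific area in Identify Controller data
BDEV_LEN = 32     # assumption: name field width (0x00-0x1f in the dump above)


def requested_name(id_ctrl: bytes) -> str:
    """Extract the block device name EBS encodes in the vendor-specific area.

    Returns the name without a leading /dev/, i.e. the form a udev
    PROGRAM would print for SYMLINK+="%c".
    """
    if len(id_ctrl) < VS_OFFSET + BDEV_LEN:
        raise ValueError("expected the full 4096-byte Identify Controller data")
    field = id_ctrl[VS_OFFSET:VS_OFFSET + BDEV_LEN]
    # Strip the space padding (and any NULs) around the name.
    name = field.decode("ascii").strip(" \x00")
    if name.startswith("/dev/"):
        name = name[len("/dev/"):]
    return name


if __name__ == "__main__":
    # Rebuild the bytes from the dump: "sdb" + spaces, everything else zeroed.
    raw = bytearray(4096)
    raw[VS_OFFSET:VS_OFFSET + BDEV_LEN] = b"sdb".ljust(BDEV_LEN, b" ")
    print(requested_name(bytes(raw)))  # sdb
```

With a file saved as above, `requested_name(open("idctrl.bin", "rb").read())` should give the same result, assuming the EBS layout matches this dump.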
