invalid GPT signature when booting with AWS instance store #1581

crawford · 2024-11-13T23:34:27Z

Description

We've been running into trouble while trying to boot Flatcar on AWS instances which have an instance store. I've seen ignition-disks.service fail and I've seen GRUB itself fail to select a partition to boot. We've been using the same version of Flatcar for months and only started seeing these failures after switching to instances with an instance store. I'm filing this here mainly for visibility. We're switching back to EBS-only instances since the performance of these stores is, uh, not great (p99.9 latency of 17 s).

Environment and steps to reproduce

Boot Flatcar 3941.1.0 on an m6gd.medium or c5d.large AWS instance in us-west-2

Expected behavior

It boots successfully.

Actual behavior

There are two different manifestations. Sometimes we see:

error: file `/flatcar/grub/arm64-efi/all_video.mod' not found.
error: no such device: OEM.

error: invalid GPT signature.
Reading or updating the GPT failed!
Please file a bug with any messages above to Flatcar:

 https://issues.flatcar.org/

Aborted. Press enter to exit GRUB.

And if GRUB finishes, we sometimes get stuck in the initrd with the following failure:

# systemctl status --failed --no-pager -l
× ignition-disks.service - Ignition (disks)
     Loaded: loaded (/usr/lib/systemd/system/ignition-disks.service; static)
     Active: failed (Result: signal) since Wed 2024-11-13 19:49:53 UTC; 8min ago
       Docs: https://github.com/coreos/ignition
    Process: 1195 ExecStart=/usr/bin/ignition --root=/sysroot --platform=${PLATFORM_ID} --stage=disks (code=killed, signal=TERM)
   Main PID: 1195 (code=killed, signal=TERM)

Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: created device alias for "/dev/nvme1n1": "/run/ignition/dev_aliases/dev/nvme1n1" -> "/dev/nvme1n1"
Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: op(2): [started]  partitioning "/run/ignition/dev_aliases/dev/nvme1n1"
Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: op(2): wiping partition table requested on "/run/ignition/dev_aliases/dev/nvme1n1"
Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: op(2): running sgdisk with options: [--zap-all /run/ignition/dev_aliases/dev/nvme1n1]
Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: op(2): op(3): [started]  deleting 0 partitions and creating 0 partitions on "/run/ignition/dev_aliases/dev/nvme1n1"
Nov 13 19:49:53 localhost ignition[1195]: disks: createPartitions: op(2): op(3): executing: "sgdisk" "--zap-all" "/run/ignition/dev_aliases/dev/nvme1n1"
Nov 13 19:49:53 localhost systemd[1]: ignition-disks.service: Main process exited, code=killed, status=15/TERM
Nov 13 19:49:53 localhost systemd[1]: ignition-disks.service: Failed with result 'signal'.
Nov 13 19:49:53 localhost systemd[1]: Stopped ignition-disks.service - Ignition (disks).
Nov 13 19:49:53 localhost systemd[1]: ignition-disks.service: Triggering OnFailure= dependencies.

(/dev/nvme1n1 is the instance store)

I'd estimate that we see one of these two failures 10% of the time.
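
For anyone triaging the "invalid GPT signature" message above: the primary GPT header sits at LBA 1 and begins with the 8-byte magic `EFI PART`. Below is a minimal sketch of checking for that magic. It assumes 512-byte logical sectors, and the helper name `has_gpt_signature` is made up for illustration. It does not reproduce what GRUB does internally, and it does not validate the header CRC.

```python
import io

GPT_SIGNATURE = b"EFI PART"  # 8-byte magic at the start of the GPT header


def has_gpt_signature(stream, sector_size=512):
    """Return True if the primary GPT header at LBA 1 starts with the magic.

    `stream` is any seekable binary file object; `sector_size` is an
    assumption and must match the device's logical sector size.
    """
    stream.seek(sector_size)  # LBA 0 is the protective MBR; the header is at LBA 1
    return stream.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE


# Synthetic two-sector images standing in for a real block device.
blank_disk = io.BytesIO(bytes(1024))
gpt_disk = io.BytesIO(bytes(512) + GPT_SIGNATURE + bytes(504))

print(has_gpt_signature(blank_disk))  # False
print(has_gpt_signature(gpt_disk))    # True
```

On a real instance the same function could be pointed at a device opened read-only as root, for example `open("/dev/nvme1n1", "rb")`, to see whether a wiped or never-partitioned instance store is presenting a disk without a GPT header.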


tormath1 · 2024-11-14T12:55:21Z

Thanks @crawford for the report. Do you by any chance have a Terraform snippet for a repro? I'd be interested to see if we can reproduce this with Fedora CoreOS too, as I guess there are two issues: one with Ignition and the other with the boot itself.

crawford · 2024-11-15T20:10:46Z

@tormath1 I don't, unfortunately. I think my days of Terraform are behind me. Here's the relevant portion of the Ignition config that we've been using though (sorry for forgetting to include it):

{
  "ignition": {
    "version": "3.4.0"
  },
  "storage": {
    "disks": [{
      "device": "/dev/nvme1n1",
      "partitions": [{
        "label": "SWAP",
        "sizeMiB": 10240
      }, {
        "label": "DOCKER"
      }],
      "wipeTable": true
    }],
    "filesystems": [{
      "device": "/dev/disk/by-partlabel/SWAP",
      "format": "swap",
      "label": "SWAP",
      "wipeFilesystem": true
    }, {
      "device": "/dev/disk/by-partlabel/DOCKER",
      "format": "btrfs",
      "label": "DOCKER",
      "wipeFilesystem": true
    }]
  },
  "systemd": {
    "units": [{
      "contents": "\n[Mount]\nWhat=/dev/disk/by-label/DOCKER\nWhere=/var/lib/docker\n\n[Install]\nWantedBy=local-fs.target\n",
      "enabled": true,
      "name": "var-lib-docker.mount"
    }, {
      "mask": true,
      "name": "update-engine.service"
    }, {
      "mask": true,
      "name": "locksmithd.service"
    }, {
      "mask": true,
      "name": "sshkeys.service"
    }, {
      "mask": true,
      "name": "amazon-ssm-agent.service"
    }]
  }
}
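
One cheap thing to rule out with a config like this is a mismatch between the partition labels and the `by-partlabel` paths the filesystems reference. The following is a rough sketch of that cross-check, run against the storage section above. The helper `missing_partlabels` is hypothetical and not part of Ignition. It only inspects the JSON and says nothing about the runtime failures described in this issue.

```python
import json

PARTLABEL_PREFIX = "/dev/disk/by-partlabel/"


def missing_partlabels(config):
    """Return by-partlabel names used by filesystems but defined by no disk."""
    storage = config.get("storage", {})
    defined = {
        part["label"]
        for disk in storage.get("disks", [])
        for part in disk.get("partitions", [])
        if "label" in part
    }
    referenced = {
        fs["device"][len(PARTLABEL_PREFIX):]
        for fs in storage.get("filesystems", [])
        if fs["device"].startswith(PARTLABEL_PREFIX)
    }
    return sorted(referenced - defined)


# Storage section copied from the config in this thread.
config = json.loads("""
{
  "ignition": {"version": "3.4.0"},
  "storage": {
    "disks": [{
      "device": "/dev/nvme1n1",
      "partitions": [{"label": "SWAP", "sizeMiB": 10240}, {"label": "DOCKER"}],
      "wipeTable": true
    }],
    "filesystems": [
      {"device": "/dev/disk/by-partlabel/SWAP", "format": "swap",
       "label": "SWAP", "wipeFilesystem": true},
      {"device": "/dev/disk/by-partlabel/DOCKER", "format": "btrfs",
       "label": "DOCKER", "wipeFilesystem": true}
    ]
  }
}
""")

print(missing_partlabels(config))  # [] -> every referenced label is defined
```

An empty list here means the labels line up, so a typo in the config is not the explanation for the failures.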

crawford added the kind/bug Something isn't working label Nov 13, 2024

github-project-automation bot added this to Flatcar tactical, release planning, and roadmap Nov 13, 2024

github-project-automation bot moved this to 📝 Needs Triage in Flatcar tactical, release planning, and roadmap Nov 13, 2024

tormath1 added the platform/AWS label Nov 14, 2024
