"Kernel Bug" error on host #235

Open
glitch003 opened this issue Sep 19, 2024 · 2 comments

Comments

@glitch003

I see this error in the host dmesg logs, and my VM failed to boot. The failure is intermittent: sometimes the VM works and sometimes it doesn't.

We are using the latest repo state as of Friday 07/12: ovmf 4b6ee06a09, kernel-guest a38297e3fb01, kernel-host 05b10142ac6a, and qemu fb924a5139, with the AmdSevX64 VM BIOS.

When this happens, we also see the error reported in #221 in the guest. The host dmesg output is:

[ 1039.036906] kvm: Invalid SPTE change: cannot replace a present leaf
               SPTE with another present leaf SPTE mapping a
               different PFN!
               as_id: 0 gfn: 24805 old_spte: 68b805e67 new_spte: 1a9069e27 level: 1
[ 1039.036937] ------------[ cut here ]------------
[ 1039.036991] kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:475!
[ 1039.037012] invalid opcode: 0000 [#1] SMP NOPTI
[ 1039.037024] CPU: 19 PID: 3244 Comm: qemu-system-x86 Not tainted 6.9.0-rc7-snp-host-05b10142ac6a #2
[ 1039.037043] Hardware name: Dell Inc. PowerEdge R6515/068NXX, BIOS 2.15.3 05/15/2024
[ 1039.037058] RIP: 0010:handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.037119] Code: e9 8d fc ff ff 48 8b 15 33 6d 98 d0 e9 35 fc ff ff 8b 74 24 30 41 89 d9 4d 89 d8 48 89 e9 48 c7 c7 58 1b f5 c0 e8 75 80 4a cf <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[ 1039.037142] RSP: 0018:ffffb6bcc8737a00 EFLAGS: 00010246
[ 1039.037153] RAX: 00000000000000b9 RBX: 0000000000000001 RCX: 0000000000000000
[ 1039.037165] RDX: 0000000000000000 RSI: ffff8907befa1700 RDI: ffff8907befa1700
[ 1039.037176] RBP: 000000068b805e67 R08: 0000000000000000 R09: 0000000000000003
[ 1039.037187] R10: ffffb6bcc87378a0 R11: ffff8907ff0797a8 R12: 0000000000000001
[ 1039.037198] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000800
[ 1039.037210] FS:  00007fb699a006c0(0000) GS:ffff8907bef80000(0000) knlGS:0000000000000000
[ 1039.037223] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1039.037233] CR2: 0000000000000000 CR3: 000000010bf16001 CR4: 0000000000770ef0
[ 1039.037245] PKRU: 55555554
[ 1039.037619] Call Trace:
[ 1039.037946]  <TASK>
[ 1039.038263]  ? die+0x32/0x80
[ 1039.038562]  ? do_trap+0xd9/0x100
[ 1039.038844]  ? handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.039156]  ? do_error_trap+0x6a/0x90
[ 1039.039431]  ? handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.039733]  ? exc_invalid_op+0x4c/0x60
[ 1039.040002]  ? handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.040297]  ? asm_exc_invalid_op+0x16/0x20
[ 1039.040577]  ? handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.040865]  ? __entry_text_end+0x1025ca/0x1025cd
[ 1039.041123]  kvm_tdp_mmu_map+0x352/0x4f0 [kvm]
[ 1039.041406]  kvm_tdp_page_fault+0x12d/0x150 [kvm]
[ 1039.041689]  kvm_mmu_do_page_fault+0x1c8/0x270 [kvm]
[ 1039.041967]  kvm_mmu_page_fault+0x8e/0x680 [kvm]
[ 1039.042240]  ? fire_user_return_notifiers+0x2c/0x50
[ 1039.042482]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.042719]  ? syscall_exit_to_user_mode+0x7a/0x210
[ 1039.042956]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.043188]  ? do_syscall_64+0x8c/0x190
[ 1039.043416]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.043642]  ? tomoyo_init_request_info+0x95/0xc0
[ 1039.043868]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.044089]  ? tomoyo_path_number_perm+0x88/0x200
[ 1039.044315]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.044551]  ? kvm_release_page_clean+0x83/0xb0 [kvm]
[ 1039.044807]  npf_interception+0x8b/0x120 [kvm_amd]
[ 1039.045038]  kvm_arch_vcpu_ioctl_run+0x692/0x1590 [kvm]
[ 1039.045301]  kvm_vcpu_ioctl+0x285/0x6d0 [kvm]
[ 1039.045603]  __x64_sys_ioctl+0x93/0xd0
[ 1039.045854]  do_syscall_64+0x80/0x190
[ 1039.046097]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.046321]  ? syscall_exit_to_user_mode+0x7a/0x210
[ 1039.046534]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.046744]  ? do_syscall_64+0x8c/0x190
[ 1039.046948]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.047143]  ? kvm_on_user_return+0x60/0x90 [kvm]
[ 1039.047369]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.047558]  ? fire_user_return_notifiers+0x2c/0x50
[ 1039.047748]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.047935]  ? syscall_exit_to_user_mode+0x7a/0x210
[ 1039.048121]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.048312]  ? do_syscall_64+0x8c/0x190
[ 1039.048504]  ? do_syscall_64+0x8c/0x190
[ 1039.048681]  ? do_syscall_64+0x8c/0x190
[ 1039.048853]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1039.049024]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1039.049199] RIP: 0033:0x7fb6a2469c5b
[ 1039.049373] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1039.049753] RSP: 002b:00007fb6999ff6e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1039.049953] RAX: ffffffffffffffda RBX: 000055ffa23a3610 RCX: 00007fb6a2469c5b
[ 1039.050153] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001a
[ 1039.050354] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[ 1039.050553] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1039.050752] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 1039.050983]  </TASK>
[ 1039.051182] Modules linked in: veth nf_conntrack_netlink xfrm_user xfrm_algo tun cpuid nbd overlay bridge stp llc nft_chain_nat amd_atl intel_rapl_msr intel_rapl_common xt_nat amd64_edac edac_mce_amd xt_MASQUERADE kvm_amd nf_nat xt_addrtype xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables binfmt_misc nfnetlink nls_ascii nls_cp437 vfat fat ipmi_ssif kvm ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel snd_pcm crypto_simd snd_timer rfkill cryptd video dell_smbios snd dcdbas soundcore rapl dell_wmi_descriptor wmi_bmof acpi_cpufreq pcspkr mgag200 drm_shmem_helper i2c_algo_bit evdev ccp sp5100_tco acpi_ipmi k10temp watchdog ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter button fuse loop dm_mod efi_pstore configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod bochs drm_vram_helper
[ 1039.051276]  drm_kms_helper drm_ttm_helper ttm drm nvme ahci xhci_pci mpt3sas nvme_core libahci xhci_hcd raid_class crc32_pclmul t10_pi libata scsi_transport_sas crc32c_intel crc64_rocksoft usbcore tg3 crc64 scsi_mod crc_t10dif bnxt_en crct10dif_generic crct10dif_pclmul crct10dif_common usb_common i2c_piix4 scsi_common wmi
[ 1039.054044] ---[ end trace 0000000000000000 ]---
[ 1039.136766] RIP: 0010:handle_changed_spte+0x9fb/0xa00 [kvm]
[ 1039.137183] Code: e9 8d fc ff ff 48 8b 15 33 6d 98 d0 e9 35 fc ff ff 8b 74 24 30 41 89 d9 4d 89 d8 48 89 e9 48 c7 c7 58 1b f5 c0 e8 75 80 4a cf <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[ 1039.137739] RSP: 0018:ffffb6bcc8737a00 EFLAGS: 00010246
[ 1039.138021] RAX: 00000000000000b9 RBX: 0000000000000001 RCX: 0000000000000000
[ 1039.138304] RDX: 0000000000000000 RSI: ffff8907befa1700 RDI: ffff8907befa1700
[ 1039.138586] RBP: 000000068b805e67 R08: 0000000000000000 R09: 0000000000000003
[ 1039.138870] R10: ffffb6bcc87378a0 R11: ffff8907ff0797a8 R12: 0000000000000001
[ 1039.139155] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000800
[ 1039.139443] FS:  00007fb699a006c0(0000) GS:ffff8907bef80000(0000) knlGS:0000000000000000
[ 1039.139733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1039.140024] CR2: 0000000000000000 CR3: 000000010bf16001 CR4: 0000000000770ef0
[ 1039.140327] PKRU: 55555554
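
For anyone reading the first line of the trace: the message says KVM tried to replace a present leaf SPTE with another present leaf SPTE that maps a different PFN. The following is a minimal Python sketch for pulling the two PFNs out of the logged `old_spte` and `new_spte` values. It assumes the standard x86-64 page-table entry layout (physical frame number in bits 12 to 51, flag bits below bit 12) and that the kernel prints these values in hexadecimal. The helper names are hypothetical and are not part of KVM.

```python
# Sketch only: decode the SPTE values from the "Invalid SPTE change" line.
# Assumption: x86-64 entry layout, frame number in bits 12..51.
PAGE_SHIFT = 12
PHYS_ADDR_BITS = 52
PFN_MASK = ((1 << PHYS_ADDR_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)


def spte_pfn(spte: int) -> int:
    """Return the page frame number encoded in an SPTE (hypothetical helper)."""
    return (spte & PFN_MASK) >> PAGE_SHIFT


def spte_low_bits(spte: int) -> int:
    """Return the low 12 bits of an SPTE, which hold flag bits (hypothetical helper)."""
    return spte & ((1 << PAGE_SHIFT) - 1)


# Values copied from the log line above, read as hexadecimal.
OLD_SPTE = 0x68B805E67
NEW_SPTE = 0x1A9069E27

if __name__ == "__main__":
    print(f"old pfn: {spte_pfn(OLD_SPTE):#x}")  # old pfn: 0x68b805
    print(f"new pfn: {spte_pfn(NEW_SPTE):#x}")  # new pfn: 0x1a9069
    # XOR of the low bits shows which flag bits differ between the two entries.
    print(f"low-bit diff: {spte_low_bits(OLD_SPTE) ^ spte_low_bits(NEW_SPTE):#x}")  # 0x40
```

Under these assumptions the two entries point at different frames (0x68b805 and 0x1a9069), which matches the "different PFN" wording in the message. This only restates what the log already reports and says nothing about the cause.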
@mdroth
Collaborator

mdroth commented Sep 19, 2024

Thanks for the report. This helps explain why #221 causes the host to freeze up: KVM hits a fatal BUG() and will no longer allow itself to run after that point. This looks more like a generic race condition in the KVM MMU code than something in the SNP-specific paths, so I think it would be a good idea to try to reproduce with the latest snp-latest build, which now builds a 6.11 kernel containing significant changes and general refactoring of the KVM MMU code.

@glitch003
Author

Thanks, will try that and update here!
