Update to linux-6.10-rc2 from kvm/next + SVSM host/guest support + direct VMSA #6

roy-hopkins · 2024-07-19T11:32:41Z

This version of Linux works as both a host and guest kernel for COCONUT-SVSM. It is based on https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next with the addition of these latest patch series:

SEV-SNP host support (already in kvm/next)
SEV-SNP guest requests
VMPL2 guest support patches
SVSM patches

The SVSM patches required some modifications due to significant changes since the previous rebase. In particular, detecting an SVSM via init-flags is no longer supported. Therefore, an additional patch has been applied that allows the VMSA to be set directly for each vCPU, removing the need for KVM to track that an SVSM is present.

You need the corresponding QEMU that supports IGVM and direct setting of the VMSA to work with this kernel. This can be found in this PR: coconut-svsm/qemu#15

Matthieu Baerts says: ==================== selftests: mptcp: mark unstable subtests as flaky Some subtests can be unstable, failing once every X runs. Fixing them can take time: there could be an issue in the kernel or in the subtest, and it is then important to do a proper analysis, not to hide real bugs. To avoid creating noises on the different CIs where tests are more unstable than on our side, some subtests have been marked as flaky. As a result, errors with these subtests (if any) are ignored. Note that the MPTCP CI will continue to track these flaky subtests. All these unstable subtests are also tracked by our bug tracker. These are fixes for the -net tree, because the instabilities are visible there. The first patch introducing the flake support has no 'Fixes' tags, mainly because it requires recent and important refactoring done in all MPTCP selftests. Backporting that to old versions where the flaky tests have been introduced would be too difficult, and probably not worth it. The other patches, adding MPTCP_LIB_SUBTEST_FLAKY=1, have a Fixes tag, simply to ease the backport of the future fixes removing them along with the proper fix. ==================== Link: https://lore.kernel.org/r/20240524-upstream-net-20240524-selftests-mptcp-flaky-v1-0-a352362f3f8e@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>

s/of/off/ Signed-off-by: Thorsten Blum <[email protected]> Fixes: e110ba6 ("docs: netdev: add note about Changes Requested and revising commit messages") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Polling is initially attempted with timeout_base_ms enabled for preemption, and if it exceeds this timeframe, another attempt is made without preemption, allowing an additional 50 ms before timing out. v2 - Rebase v3 - Move warnings to separate patch (Lucas) Cc: Lucas De Marchi <[email protected]> Cc: Rodrigo Vivi <[email protected]> Signed-off-by: Himal Prasad Ghimiray <[email protected]> Fixes: 7dc9b92 ("drm/xe: Remove i915_utils dependency from xe_pcode.") Reviewed-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]> (cherry picked from commit c81858e) Signed-off-by: Thomas Hellström <[email protected]>

The GuC context scheduling queue is 2 entries deep, thus it is possible for a migration job to be stuck behind a fault if migration exec queue shares engines with user jobs. This can deadlock as the migrate exec queue is required to service page faults. Avoid deadlock by only using reserved BCS instances for usm migrate exec queue. Fixes: a043fba ("drm/xe/pvc: Use fast copy engines as migrate engine on PVC") Cc: Matt Roper <[email protected]> Cc: Niranjana Vishwanathapura <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Reviewed-by: Brian Welty <[email protected]> (cherry picked from commit 04f4a70) Signed-off-by: Thomas Hellström <[email protected]>

Release the submission_state lock if alloc_guc_id() fails. v2: Add Fixes tag and CC stable kernel Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: <[email protected]> # v6.8+ Signed-off-by: Niranjana Vishwanathapura <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Signed-off-by: José Roberto de Souza <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 40672b7) Signed-off-by: Thomas Hellström <[email protected]>

The TPM SPI transfer mechanism uses MAX_SPI_FRAMESIZE for computing the maximum transfer length and the size of the transfer buffer. As such, it does not account for the 4 bytes of header that are prepended to the SPI data frame. This can result in out-of-bounds accesses and was confirmed with KASAN. Introduce SPI_HDRSIZE to account for the header and use it to allocate the transfer buffer. Fixes: a86a42a ("tpm_tis_spi: Add hardware wait polling") Signed-off-by: Matthew R. Ochs <[email protected]> Tested-by: Carol Soto <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Jarkko Sakkinen <[email protected]>
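
The buffer sizing described above can be modelled in a few lines. This is only a sketch of the arithmetic, not the driver code: the value 64 for MAX_SPI_FRAMESIZE is an assumption, and `build_frame` is a hypothetical helper introduced for illustration.

```python
# Illustrative model of the transfer-buffer sizing; not the kernel implementation.
MAX_SPI_FRAMESIZE = 64  # assumed maximum payload bytes per transfer
SPI_HDRSIZE = 4         # header bytes in front of the data frame (from the commit message)


def build_frame(header: bytes, payload: bytes) -> bytes:
    """Place header and payload in a buffer that is sized for both."""
    if len(header) != SPI_HDRSIZE:
        raise ValueError("header must be exactly SPI_HDRSIZE bytes")
    if len(payload) > MAX_SPI_FRAMESIZE:
        raise ValueError("payload exceeds MAX_SPI_FRAMESIZE")
    # Sizing the buffer as MAX_SPI_FRAMESIZE alone would leave no room
    # for the header when the payload has the maximum length.
    buf = bytearray(SPI_HDRSIZE + MAX_SPI_FRAMESIZE)
    buf[:SPI_HDRSIZE] = header
    buf[SPI_HDRSIZE:SPI_HDRSIZE + len(payload)] = payload
    return bytes(buf[:SPI_HDRSIZE + len(payload)])
```

With a full-size payload the frame is 68 bytes long, which is the case a 64-byte buffer cannot hold.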

With only a single call site, this makes no sense (it slipped under the radar during the review). Open code and document the action directly at the call site, to make it more readable. Fixes: 1b6d7f9 ("tpm: add session encryption protection to tpm2_get_random()") Signed-off-by: Jarkko Sakkinen <[email protected]>

sk_psock_get will return NULL if the refcount of psock has gone to 0, which will happen when the last call of sk_psock_put is done. However, sk_psock_drop may not have finished yet, so the close callback will still point to sock_map_close despite psock being NULL. This can be reproduced with a thread deleting an element from the sock map, while the second one creates a socket, adds it to the map and closes it. That will trigger the WARN_ON_ONCE: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Modules linked in: CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02 RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293 RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000 RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0 RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3 R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840 R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870 FS: 000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0 Call Trace: <TASK> unix_release+0x87/0xc0 net/unix/af_unix.c:1048 __sock_release net/socket.c:659 [inline] sock_close+0xbe/0x240 net/socket.c:1421 __fput+0x42b/0x8a0 fs/file_table.c:422 __do_sys_close fs/open.c:1556 [inline] __se_sys_close fs/open.c:1541 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1541 do_syscall_x64 arch/x86/entry/common.c:52 [inline] 
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb37d618070 Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070 RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004 RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Use sk_psock, which will only check that the pointer is not been set to NULL yet, which should only happen after the callbacks are restored. If, then, a reference can still be gotten, we may call sk_psock_stop and cancel psock->work. As suggested by Paolo Abeni, reorder the condition so the control flow is less convoluted. After that change, the reproducer does not trigger the WARN_ON_ONCE anymore. Suggested-by: Paolo Abeni <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355 Fixes: aadb2bb ("sock_map: Fix a potential use-after-free in sock_map_close()") Fixes: 5b4a79b ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself") Cc: [email protected] Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>

Rename and document TPM2_OA_TMPL, as originally requested in the patch set review, but left unaddressed without any appropriate reasoning. The new name is TPM2_OA_NULL_KEY; it is documented and is local to tpm2-sessions.c. Link: https://lore.kernel.org/linux-integrity/ddbeb8111f48a8ddb0b8fca248dff6cc9d7079b2.camel@HansenPartnership.com/ Link: https://lore.kernel.org/linux-integrity/CZCKTWU6ZCC9.2UTEQPEVICYHL@suppilovahvero/ Signed-off-by: Jarkko Sakkinen <[email protected]>

Given the performance issues on non-x86 platforms, which have not been fully root-caused, enable the feature by default only for x86-64. That is the platform where it brings the most value and where it has gone through the most QA. This can be reconsidered later, and the feature can still be enabled as an opt-in on any arch. Link: https://lore.kernel.org/linux-integrity/[email protected]/#t Signed-off-by: Jarkko Sakkinen <[email protected]>

When a UMP packet is converted between MIDI1 and MIDI2 protocols, the bank selection may be lost. The conversion from MIDI1 to MIDI2 needs the encoding of the bank into UMP_MSG_STATUS_PROGRAM bits, while the conversion from MIDI2 to MIDI1 needs the extraction from that instead. This patch implements the missing bank selection mechanism in those conversions. Fixes: e9e0281 ("ALSA: seq: Automatic conversion of UMP events") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>

The current code to convert from a legacy sequencer event to UMP MIDI2 clears the bank selection each time a program change is submitted. This is confusing and may lead to incorrect bank values being transmitted to the destination in the end. Drop the line to clear the bank info and keep the provided values. Fixes: e9e0281 ("ALSA: seq: Automatic conversion of UMP events") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>

The information on PCI class/subclass was interesting in the Skylake timeframe, since the DSP was only enabled on a limited number of platforms. Now most Intel platforms do enable the DSP, so the information is less interesting to log. When a DSP driver is used, the common helper may be called multiple times due to deferred probes, but there's no reason to print the same information multiple times. Using dev_info_once() covers all the existing usages for internal cards with DSPs. External cards don't rely on DSPs so far. Signed-off-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Bard Liao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>

…put_matches is called In this patch, a software bug has been fixed. rtq2208_ldo_match is no longer a local variable. It prevents invalid memory access when devm_of_regulator_put_matches is called. Signed-off-by: Alina Yu <[email protected]> Link: https://msgid.link/r/4ce8c4f16f1cf3aa4e5f36c0694dd3c5ccf3cd1c.1716870419.git.alina_yu@richtek.com Signed-off-by: Mark Brown <[email protected]>

When changing the maximum number of open zones, print that number instead of the total number of zones. Fixes: dc4d137 ("null_blk: add support for max open/active zone limit for zoned devices") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Niklas Cassel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

sd can set a max_sectors value that is lower than the max_hw_sectors limit based on the block limits VPD page. While this is rather unusual, it used to work until the max_user_sectors field was split out to cleanly deal with conflicting hardware and user limits when the hardware limit changes. Also set max_user_sectors to ensure the limit can properly be stacked. Fixes: 4f563a6 ("block: add a max_user_discard_sectors queue limit") Reported-by: Mike Snitzer <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Mike Snitzer <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

The max_user_sectors is one of the three factors determining the actual max_sectors limit for READ/WRITE requests. Because of that it needs to be stacked at least for the device mapper multi-path case where requests are directly inserted on the lower device. For SCSI disks this is important because the sd driver actually sets its own advisory limit that is lower than max_hw_sectors based on the block limits VPD page. While this is a bit odd and unusual, the same effect can happen if a user or udev script tweaks the value manually. Fixes: 4f563a6 ("block: add a max_user_discard_sectors queue limit") Reported-by: Mike Snitzer <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Mike Snitzer <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

The logical block size needs to be smaller than the max_hw_sector setting, otherwise we can't even transfer a single LBA. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: John Garry <[email protected]> Signed-off-by: Jens Axboe <[email protected]>

Currently, if gc is running when the allocator finds that free_inc is empty, the allocator has to wait for gc to finish. Until then, IO is blocked. But there may be some buckets that are reclaimable before gc, and gc will never mark this kind of bucket as unreclaimable. So we can put these buckets into free_inc while gc is running to avoid blocking IO. Signed-off-by: Dongsheng Yang <[email protected]> Signed-off-by: Mingzhe Zou <[email protected]> Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

If extremely heavy write I/O continuously hits a relatively small cache device (512GB in my testing), the counter c->gc_stats.in_use can keep increasing until it exceeds CUTOFF_CACHE_ADD. If 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all following write requests will bypass the cache device because check_should_bypass() returns 'true'. Because all writes bypass the cache device, the counter c->sectors_to_gc has no chance to become negative, and the garbage collection thread won't be woken up even after the whole cache becomes clean once writeback has completed. The result is that all write I/Os go directly to the backing device even though the cache device is clean. To avoid the above situation, this patch uses a quite conservative fix: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, only wake up the garbage collection thread when the whole cache device is clean. Before the fix, the writes-always-bypass situation happened after 10+ hours of write I/O pressure on 512GB Intel optane memory acting as the cache device. After this fix, the situation does not occur after 36+ hours of testing. Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

In __bch_bucket_alloc_set() the lines after label 'err:' do nothing useful now that support for multiple cache devices has been removed from the bcache code. This cleanup patch drops the useless code to save a few CPU cycles. Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

The start counter for FT1 filter is wrongly set to 0 in the driver. FT1 is used for source address violation (SAV) check and source address starts at Byte 6 not Byte 0. Fix this by changing start counter to ETH_ALEN in icssg_ft1_set_mac_addr(). Fixes: e9b4ece ("net: ti: icssg-prueth: Add Firmware config and classification APIs.") Signed-off-by: MD Danish Anwar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
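
The offset in question follows from the Ethernet header layout, where the destination address occupies bytes 0 to 5 and the source address bytes 6 to 11. A minimal sketch of that layout, with `source_mac` as a hypothetical helper that is unrelated to the driver code:

```python
ETH_ALEN = 6  # length of an Ethernet MAC address in bytes


def source_mac(frame: bytes) -> bytes:
    """Return the source MAC, which starts at byte ETH_ALEN of the frame."""
    if len(frame) < 2 * ETH_ALEN:
        raise ValueError("frame too short to contain both MAC addresses")
    return frame[ETH_ALEN:2 * ETH_ALEN]
```

A filter that starts matching at byte 0 would compare against the destination address instead, which is why a source address check needs a start offset of ETH_ALEN.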

It can be useful to know the exact byte offset within a btree node where an error occurred. Signed-off-by: Kent Overstreet <[email protected]>

This function is used for finding the hash seed (which is the same in all versions of an inode in different snapshots): if an inode has been deleted in a child snapshot we need to iterate until we find a live version. Signed-off-by: Kent Overstreet <[email protected]>

We now track whether a transaction is locked, and verify that we don't have nodes locked when the transaction isn't locked; reorder relocks to not pop the new assert. Signed-off-by: Kent Overstreet <[email protected]>

Consolidate per-key work into delete_dead_snapshots_process_key(), so we now walk all keys once, not twice. Signed-off-by: Kent Overstreet <[email protected]>

delete_dead_snapshots now runs before the main fsck.c passes which check for keys for invalid snapshots; thus, it needs those checks as well. Signed-off-by: Kent Overstreet <[email protected]>

bch2_check_version_downgrade() was setting c->sb.version, which bch2_sb_set_downgrade() expects to be at the previous version; and it shouldn't even have been set directly because c->sb.version is updated by write_super(). Signed-off-by: Kent Overstreet <[email protected]>

Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o Signed-off-by: Jeff Johnson <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>

Compatibility fix - we no longer have a separate table for which order gc walks btrees in, and special case the stripes btree directly. Signed-off-by: Kent Overstreet <[email protected]>

The function kvm_reset_dirty_gfn() may be called with the parameters cur_slot, cur_offset and mask all set to zero, which does not represent a real dirty page. There is no need to clear any dirty page in this case. In addition, the return value of the macro __fls(), which is called in kvm_reset_dirty_gfn(), is undefined if mask is zero. Just return in this case. Signed-off-by: Bibo Mao <[email protected]> Message-ID: <[email protected]> [Move the conditional inside kvm_reset_dirty_gfn; suggested by Sean Christopherson. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
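
The reason for the early return can be illustrated with a small model of the bit helpers. This is a sketch in Python rather than the kernel code; `ffs_index`, `fls_index` and `dirty_span` are hypothetical names, and the model raises an error where the kernel macro would simply have an undefined result.

```python
def ffs_index(mask: int) -> int:
    """Index of the lowest set bit; meaningless for a zero mask."""
    if mask == 0:
        raise ValueError("no bit set")
    return (mask & -mask).bit_length() - 1


def fls_index(mask: int) -> int:
    """Index of the highest set bit; meaningless for a zero mask."""
    if mask == 0:
        raise ValueError("no bit set")
    return mask.bit_length() - 1


def dirty_span(mask: int):
    """Return (first, last) dirty bit indexes, or None when nothing is dirty."""
    if mask == 0:
        # Mirrors the early return: an empty mask describes no dirty page,
        # so the bit helpers must not be consulted at all.
        return None
    return ffs_index(mask), fls_index(mask)
```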

kvm_gmem_populate() is a potentially lengthy operation that can involve multiple calls to the firmware. Interrupt it if a signal arrives. Fixes: 1f6c06b ("KVM: guest_memfd: Add interface for populating gmem pages with user data") Cc: Isaku Yamahata <[email protected]> Cc: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>

The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace the SPTE value and returns -EBUSY on failure. The caller must check the return value and retry. Add __must_check to it, as well as to two more functions that forward the return value of __tdp_mmu_set_spte_atomic to their caller. Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com> Acked-by: Kai Huang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>

Version 2 of the GHCB specification added support for the SNP Guest Request Message NAE event. The event allows an SEV-SNP guest to make requests to the SEV-SNP firmware through the hypervisor using the SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification. This is used by guests primarily to request attestation reports from firmware. Other request types are available as well, but the specifics of what guest requests are being made are opaque to the hypervisor, which only serves as a proxy for the guest requests and firmware responses. Implement handling for these events. Co-developed-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: ensure FW command failures are indicated to guest, drop extended request handling to be re-written as separate patch, massage commit] Signed-off-by: Michael Roth <[email protected]>

These commands can be used to create a transaction such that commands that update the reported TCB, such as SNP_SET_CONFIG/SNP_COMMIT, and updates to userspace-supplied certificates, can be handled atomically relative to any extended guest requests issued by any SNP guests while the updates are taking place. Without this interface, there is a risk that a guest will be given certificate information that does not correspond to the VCEK/VLEK used to sign a particular attestation report unless all the running guests are paused in advance, which would cause disruption to all guests in the system even if no attestation requests are being made. Even then, care is needed to ensure that KVM does not pass along certificate information that was fetched from userspace in advance of the guest being paused. This interface also provides some versatility with how similar firmware maintenance activity can be handled in the future without passing unnecessary management complexity on to userspace. Signed-off-by: Michael Roth <[email protected]>

Version 2 of GHCB specification added support for the SNP Extended Guest Request Message NAE event. This event serves a nearly identical purpose to the previously-added SNP_GUEST_REQUEST event, but allows for additional certificate data to be supplied via an additional guest-supplied buffer to be used mainly for verifying the signature of an attestation report as returned by firmware. This certificate data is supplied by userspace, so unlike with SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first forwarded to userspace via a KVM_EXIT_VMGEXIT exit type, and then the firmware request is made only afterward. Implement handling for these events. Since there is a potential for race conditions where the userspace-supplied certificate data may be out-of-sync relative to the reported TCB that firmware will use when signing attestation reports, make use of the transaction/synchronization mechanisms added by the SNP_SET_CONFIG_{START,END} SEV device ioctls such that the guest will be told to retry the request when an update to reported TCB or userspace-supplied certificates may have occurred or is in progress while an extended guest request is being processed. Signed-off-by: Michael Roth <[email protected]>

Print additional information, in the form of the old and new versions of the SEV firmware, so that it can be seen what the base firmware was before the upgrade. Signed-off-by: Tom Lendacky <[email protected]>

There are situations where SEV-ES and/or SEV-SNP guests cannot be run but the feature flags are still advertised. Add additional checks to the current SME and SEV checks to clear these CPU feature flags under these conditions: - CPUID 0x8000001f/edx reports the minimum SEV ASID, which is used to determine the number of SEV-ES/SEV-SNP ASIDs that are available. If the value reported is <= 1, then SEV-ES and SEV-SNP guests can't be run and the SEV-ES and SEV-SNP feature flags should be cleared. - SEV-SNP support relies on BIOS allocation of the RMP table. If the table hasn't been allocated or allocated properly, the SEV-SNP feature flag should be cleared. Signed-off-by: Tom Lendacky <[email protected]>
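
The two conditions above can be summarised as a small decision function. This is a simplified model only: the feature names are plain strings chosen for illustration, and `usable_sev_features` is a hypothetical helper rather than a kernel function.

```python
def usable_sev_features(features, min_sev_asid, rmp_table_ok):
    """Return the advertised features minus those that cannot actually be used."""
    usable = set(features)
    # A minimum SEV ASID of 1 or less leaves no ASIDs for SEV-ES/SEV-SNP guests.
    if min_sev_asid <= 1:
        usable -= {"sev_es", "sev_snp"}
    # SEV-SNP additionally needs a properly allocated RMP table.
    if not rmp_table_ok:
        usable.discard("sev_snp")
    return usable
```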

Update AP creation to support ADD/DESTROY of VMSAs at levels other than VMPL0 in order to run under an SVSM at VMPL1 or lower. To maintain backwards compatibility, the VMPL is specified in bits 16 to 19 of the AP Creation request in SW_EXITINFO1 of the GHCB. In order to track the VMSAs at different levels, create arrays for the VMSAs, GHCBs, registered GHCBs and others. When switching VMPL levels, these entries will be used to set the VMSA and GHCB physical addresses in the VMCB for the VMPL level. In order to ensure that the proper responses are returned in the proper GHCB, the GHCB must be unmapped at the current level and saved for restoration later when switching back to that VMPL level. Additional checks are applied to prevent a non-VMPL0 vCPU from being able to perform an AP creation request. Additionally, a vCPU cannot replace its own VMSA. Signed-off-by: Tom Lendacky <[email protected]>
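
Extracting the VMPL from the request is a plain bitfield operation. The sketch below assumes only the bit positions stated in the commit message; the constant and function names are hypothetical.

```python
VMPL_SHIFT = 16  # the VMPL occupies bits 16 to 19 of SW_EXITINFO1
VMPL_MASK = 0xF


def ap_creation_vmpl(sw_exitinfo1: int) -> int:
    """Extract the target VMPL from an AP Creation request value."""
    return (sw_exitinfo1 >> VMPL_SHIFT) & VMPL_MASK
```

A request that leaves bits 16 to 19 clear decodes to VMPL0, which is presumably how requests from guests unaware of the field keep their previous meaning.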

Implement the GET_APIC_IDS NAE event to gather and return the list of APIC IDs for all vCPUs in the guest. Signed-off-by: Tom Lendacky <[email protected]>

Implement the RUN_VMPL NAE event and MSR protocol to allow a guest to request a different VMPL level VMSA be run for the vCPU. This allows the guest to "call" an SVSM to process an SVSM request. Signed-off-by: Tom Lendacky <[email protected]>

Update the hypervisor supported features to indicate that the SVSM feature is supported. The SVSM feature consists of: - APIC ID retrieval support - Multi-VMPL support - AP Creation at specified VMPL - Run VMPL level support Signed-off-by: Tom Lendacky <[email protected]>

Add helper function has_snp_feature() to determine if an SNP guest has a given feature. This will be particularly useful when new checks and/or SNP features are added in the future. Signed-off-by: Carlos Bilbao <[email protected]>

Make struct kvm_sev_info maintain separate SEV features per VMPL, allowing distinct SEV features depending on the VM's privilege level. Signed-off-by: Carlos Bilbao <[email protected]>

Prevent injection of exceptions/interrupts when restricted injection is active. This is not full support for restricted injection, but the SVSM is not expecting any injections at all. Signed-off-by: Tom Lendacky <[email protected]>

Allow Restricted Injection to be set in SEV_FEATURES. When set, attempts to inject any interrupts other than #HV will make VMRUN fail. This is done to further reduce the security exposure within the SVSM. Signed-off-by: Carlos Bilbao <[email protected]>

During early boot phases, check for the presence of an SVSM when running as an SEV-SNP guest. An SVSM is present if the 64-bit value at offset 0x148 into the secrets page is non-zero. If an SVSM is present, save the SVSM Calling Area address (CAA), located at offset 0x150 into the secrets page, and set the VMPL level of the guest, which should be non-zero, to indicate the presence of an SVSM. Signed-off-by: Tom Lendacky <[email protected]>
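
The detection step can be sketched as reading two 64-bit fields from the secrets page. The offsets come from the commit message; little-endian encoding and the names below are assumptions made for this illustration.

```python
import struct

SVSM_PRESENT_OFFSET = 0x148  # non-zero 64-bit value here means an SVSM is present
SVSM_CAA_OFFSET = 0x150      # SVSM Calling Area address


def detect_svsm(secrets_page: bytes):
    """Return the CAA if the secrets page indicates an SVSM, else None."""
    (marker,) = struct.unpack_from("<Q", secrets_page, SVSM_PRESENT_OFFSET)
    if marker == 0:
        return None
    (caa,) = struct.unpack_from("<Q", secrets_page, SVSM_CAA_OFFSET)
    return caa
```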

The SVSM Calling Area (CA) is used to communicate between Linux and the SVSM. Since the firmware supplied CA for the BSP is likely to be in reserved memory, switch off that CA to a kernel provided CA so that access and use of the CA is available during boot. The CA switch is done using the SVSM core protocol SVSM_CORE_REMAP_CAA call. An SVSM call is executed by filling out the SVSM CA and setting the proper register state as documented by the SVSM protocol. The SVSM is invoked by requesting the hypervisor to run VMPL0. Once it is safe to allocate/reserve memory, allocate a CA for each CPU. After allocating the new CAs, the BSP will switch from the boot CA to the per-CPU CA. The CA for an AP is identified to the SVSM when creating the VMSA in preparation for booting the AP. Signed-off-by: Tom Lendacky <[email protected]>

The PVALIDATE instruction can only be performed at VMPL0. An SVSM will be present when running at VMPL1 or a lower privilege level. When an SVSM is present, use the SVSM_CORE_PVALIDATE call to perform memory validation instead of issuing the PVALIDATE instruction directly. The validation of a single 4K page is now explicitly identified as such in the function name, pvalidate_4k_page(). The pvalidate_pages() function is used for validating 1 or more pages at either 4K or 2M in size. Each function, however, determines whether it can issue the PVALIDATE directly or whether the SVSM needs to be invoked. Signed-off-by: Tom Lendacky <[email protected]>

Using the RMPADJUST instruction, the VMSA attribute can only be changed at VMPL0. An SVSM will be present when running at VMPL1 or a lower privilege level. When an SVSM is present, use the SVSM_CORE_CREATE_VCPU call or the SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the VMPL level supplied by the SVSM within the VMSA and when starting the AP. Signed-off-by: Tom Lendacky <[email protected]>

The SVSM specification documents an alternative method of discovery for the SVSM using a reserved CPUID bit and a reserved MSR. For the CPUID support, the #VC handler of an SEV-SNP guest should modify the returned value in the EAX register for the 0x8000001f CPUID function by setting bit 28 when an SVSM is present. For the MSR support, new reserved MSR 0xc001f000 has been defined. A #VC should be generated when accessing this MSR. The #VC handler is expected to ignore writes to this MSR and return the physical calling area address (CAA) on reads of this MSR. Signed-off-by: Tom Lendacky <[email protected]>
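
The expected #VC handler behaviour can be modelled as two small functions. This is a sketch of the rules stated above, not the kernel handler; every name is hypothetical and only the CPUID function, the bit number and the MSR number come from the commit message.

```python
SVSM_CPUID_FUNCTION = 0x8000001F
SVSM_CPUID_EAX_BIT = 28
SVSM_CAA_MSR = 0xC001F000


def fixup_cpuid_eax(function: int, eax: int, svsm_present: bool) -> int:
    """Set bit 28 of EAX for CPUID 0x8000001f when an SVSM is present."""
    if function == SVSM_CPUID_FUNCTION and svsm_present:
        return eax | (1 << SVSM_CPUID_EAX_BIT)
    return eax


def access_svsm_msr(is_write: bool, caa_pa: int, value: int = 0):
    """Model MSR 0xc001f000: writes are ignored, reads return the CAA."""
    if is_write:
        return None  # the written value is dropped
    return caa_pa
```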

Requesting an attestation report from userspace involves providing the VMPL level for the report. Currently any value from 0-3 is valid because Linux enforces running at VMPL0. When an SVSM is present, though, Linux will not be running at VMPL0 and only VMPL values starting at the VMPL level Linux is running at to 3 are valid. In order to allow userspace to determine the minimum VMPL value that can be supplied to an attestation report, create a sysfs entry that can be used to retrieve the current VMPL level of Linux. Signed-off-by: Tom Lendacky <[email protected]>
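
The range of acceptable values follows directly from the current VMPL. A minimal sketch, in which `valid_report_vmpls` is a hypothetical helper and the current VMPL is assumed to have been read from the sysfs entry by the caller:

```python
MAX_VMPL = 3  # VMPL values range from 0 to 3


def valid_report_vmpls(current_vmpl: int):
    """VMPL values that may be supplied in an attestation report request."""
    if not 0 <= current_vmpl <= MAX_VMPL:
        raise ValueError("VMPL must be between 0 and 3")
    return list(range(current_vmpl, MAX_VMPL + 1))
```

Without an SVSM, Linux runs at VMPL0 and all four values remain valid; at VMPL2, for example, only 2 and 3 are.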

Currently, the sev-guest driver uses the vmpck-0 key by default. When an SVSM is present the kernel is running at a VMPL other than 0 and the vmpck-0 key is no longer available. So choose the vmpck key based on the active VMPL level. Signed-off-by: Tom Lendacky <[email protected]>

When an SVSM is present, the guest can also request attestation reports from the SVSM. These SVSM attestation reports can be used to attest the SVSM and any services running within the SVSM. Extend the config-fs attestation support to allow for an SVSM attestation report. This involves creating four (4) new config-fs attributes: - 'svsm' (input) This attribute is used to determine whether the attestation request should be sent to the SVSM or to the SEV firmware. - 'service_guid' (input) Used for requesting the attestation of a single service within the SVSM. A null GUID implies that the SVSM_ATTEST_SERVICES call should be used to request the attestation report. A non-null GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used. - 'service_version' (input) Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version represents a specific service manifest version to be used for the attestation report. - 'manifestblob' (output) Used to return the service manifest associated with the attestation report. Signed-off-by: Tom Lendacky <[email protected]>
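
How the input attributes select the attestation path can be sketched as a small dispatch function. The call names come from the commit message; the function itself, its return strings and the use of Python's `uuid` module are illustrative assumptions.

```python
import uuid

NULL_GUID = uuid.UUID(int=0)


def select_attestation_call(svsm: bool, service_guid: uuid.UUID = NULL_GUID) -> str:
    """Pick the attestation path implied by the 'svsm' and 'service_guid' inputs."""
    if not svsm:
        return "SEV firmware"
    if service_guid == NULL_GUID:
        return "SVSM_ATTEST_SERVICES"
    return "SVSM_ATTEST_SINGLE_SERVICE"
```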

To allow execution at a level other than VMPL0, an SVSM must be present. Allow the SEV-SNP guest to continue booting if an SVSM is detected and the hypervisor supports the SVSM feature as indicated in the GHCB hypervisor features bitmap. Signed-off-by: Tom Lendacky <[email protected]>

The VMSA containing the initial CPU state for an SEV-SNP guest is measured as part of the launch process. Currently, KVM does this automatically during the call to KVM_SEV_SNP_LAUNCH_FINISH, where the CPU state is synchronised to the VMSA for every vCPU and measured, and then the guest is launched.

This poses a problem for guests that want full control over the number and contents of VMSAs, such as when using an SVSM module or paravisor. In that case, for example, you may want only the BSP VMSA to be provided, or want full control over non-synced registers.

As soon as the VMSA is measured it is encrypted by hardware, so KVM immediately loses sight of and control over the contents. With this in mind, there is no need to keep the VMSA in sync with KVM's view of register state. Therefore it makes sense to bypass the sync completely and provide a way for the VMSA to be directly specified from userspace if required.

This commit extends the KVM_SEV_SNP_LAUNCH_UPDATE ioctl to allow VMSA pages to be updated. When a VMSA page is encountered, this modifies the behaviour of KVM_SEV_SNP_LAUNCH_FINISH to prevent the sync and measurement of CPU state. This allows both the legacy functionality and the new functionality to co-exist. Signed-off-by: Roy Hopkins <[email protected]>
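The interaction between the two ioctls can be modelled as a toy state machine. This sketch is not KVM code: the class, method names and the `"vmsa"` page-type label are hypothetical, and it assumes a directly supplied VMSA is measured at update time like any other launch-update page.

```python
from dataclasses import dataclass, field


@dataclass
class SnpLaunchModel:
    """Toy model of the SNP launch flow described above."""
    vcpu_count: int
    direct_vmsas: dict = field(default_factory=dict)  # vcpu index -> VMSA bytes
    measured: list = field(default_factory=list)      # (page_type, vcpu or None)

    def launch_update(self, page_type: str, data: bytes, vcpu: int = 0) -> None:
        """Model of LAUNCH_UPDATE: record the page as measured."""
        if page_type == "vmsa":
            # Userspace supplies the VMSA itself (assumed measured here).
            self.direct_vmsas[vcpu] = data
            self.measured.append(("vmsa", vcpu))
        else:
            self.measured.append((page_type, None))

    def launch_finish(self) -> str:
        """Model of LAUNCH_FINISH: sync and measure only in legacy mode."""
        if self.direct_vmsas:
            return "direct"  # sync and measurement of CPU state are skipped
        for vcpu in range(self.vcpu_count):
            self.measured.append(("vmsa", vcpu))  # legacy: one VMSA per vCPU
        return "synced"
```

A four-vCPU guest that supplies only the BSP VMSA ends up with a single measured VMSA in this model, whereas the legacy path measures one per vCPU.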

kuba-moo and others added 30 commits May 27, 2024 17:13

bcachefs: Plumb bkey into __btree_err()

1292bc2

It can be useful to know the exact byte offset within a btree node where an error occurred. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix locking assert

218e5e0

We now track whether a transaction is locked, and verify that we don't have nodes locked when the transaction isn't locked; reorder relocks to not pop the new assert. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Refactor delete_dead_snapshots()

82af5ce

Consolidate per-key work into delete_dead_snapshots_process_key(), so we now walk all keys once, not twice. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Run check_key_has_snapshot in snapshot_delete_keys()

08f5000

delete_dead_snapshots now runs before the main fsck.c passes which check for keys for invalid snapshots; thus, it needs those checks as well. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: add missing MODULE_DESCRIPTION()

b413107

Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o Signed-off-by: Jeff Johnson <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: btree_gc can now handle unknown btrees

088d0de

Compatibility fix - we no longer have a separate table for which order gc walks btrees in, and special case the stripes btree directly. Signed-off-by: Kent Overstreet <[email protected]>

bibo-mao and others added 27 commits June 20, 2024 17:20

Merge branch 'kvm-6.10-fixes' into HEAD

02b0d3b

crypto: ccp: Add additional information about an SEV firmware upgrade

9752c2a

Print additional information, in the form of the old and new versions of the SEV firmware, so that it can be seen what the base firmware was before the upgrade. Signed-off-by: Tom Lendacky <[email protected]>

KVM: SVM: Implement GET_AP_APIC_IDS NAE event

ec292a8

Implement the GET_APIC_IDS NAE event to gather and return the list of APIC IDs for all vCPUs in the guest. Signed-off-by: Tom Lendacky <[email protected]>

KVM: SVM: Add auxiliary function has_snp_feature()

73156ab

Add a helper function, has_snp_feature(), to determine whether an SNP guest has a given feature. This will be particularly useful when new checks and/or SNP features are added in the future. Signed-off-by: Carlos Bilbao <[email protected]>

KVM: SVM: Maintain per-VMPL SEV features in kvm_sev_info

1587851

Make struct kvm_sev_info maintain separate SEV features per VMPL, allowing distinct SEV features depending on the VM's privilege level. Signed-off-by: Carlos Bilbao <[email protected]>
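The idea of a per-VMPL feature store queried through a single helper can be sketched as follows. The layout (one bitmask per VMPL), the helper's signature and the feature bit constants are all assumptions for illustration; they are not taken from the patch.

```python
from typing import List

NUM_VMPLS = 4

# Illustrative feature bits only; the real values live in the kernel headers.
FEAT_SNP_ACTIVE = 1 << 0
FEAT_RESTRICTED_INJECTION = 1 << 3


def has_snp_feature(features_per_vmpl: List[int], vmpl: int, feature: int) -> bool:
    """Return True if every bit in `feature` is set for the given VMPL."""
    if not 0 <= vmpl < NUM_VMPLS:
        raise ValueError(f"VMPL out of range: {vmpl}")
    return (features_per_vmpl[vmpl] & feature) == feature


# Hypothetical guest: the feature set at VMPL0 differs from the other levels.
features = [
    FEAT_SNP_ACTIVE,
    FEAT_SNP_ACTIVE,
    FEAT_SNP_ACTIVE | FEAT_RESTRICTED_INJECTION,
    FEAT_SNP_ACTIVE,
]
```

Keeping the check in one helper means a new feature or an extra condition only has to be handled in one place.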

roy-hopkins mentioned this pull request Jul 19, 2024

Prepare for update to COCONUT linux host and QEMU 9.0 coconut-svsm/svsm#415

Open

osteffenrh mentioned this pull request Jul 22, 2024

Update to QEMU 9.0 including IGVM v4 patch series + direct VMSA coconut-svsm/qemu#15

Open

Freax13 mentioned this pull request Sep 5, 2024

State of SVSM with Linux 6.11? coconut-svsm/svsm#449

Open


roy-hopkins commented Jul 19, 2024