Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to linux-6.10-rc2 from kvm/next + SVSM host/guest support + direct VMSA #6

Open
wants to merge 10,000 commits into
base: master
Choose a base branch
from

Conversation

roy-hopkins
Copy link

This version of Linux works as both a host and guest kernel for COCONUT-SVSM. It is based on https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next with the addition of these latest patch series:

  • SEV-SNP host support (already in kvm/next)
  • SEV-SNP guest requests
  • VMPL2 guest support patches
  • SVSM patches

The SVSM patches required some modifications due to significant changes since previous rebase. In particular, the method for determining an SVSM is via init-flags is not supported anymore. Therefore an additional patch has been applied that allows direct setting of the VMSA for each vCPU removing the need for KVM to track that an SVSM is present.

You need the corresponding QEMU that supports IGVM and direct setting of the VMSA to work with this kernel. This can be found in this PR: coconut-svsm/qemu#15

kuba-moo and others added 30 commits May 27, 2024 17:13
Matthieu Baerts says:

====================
selftests: mptcp: mark unstable subtests as flaky

Some subtests can be unstable, failing once every X runs. Fixing them
can take time: there could be an issue in the kernel or in the subtest,
and it is then important to do a proper analysis, not to hide real bugs.

To avoid creating noises on the different CIs where tests are more
unstable than on our side, some subtests have been marked as flaky. As a
result, errors with these subtests (if any) are ignored.

Note that the MPTCP CI will continue to track these flaky subtests. All
these unstable subtests are also tracked by our bug tracker.

These are fixes for the -net tree, because the instabilities are visible
there. The first patch introducing the flake support has no 'Fixes'
tags, mainly because it requires recent and important refactoring done
in all MPTCP selftests. Backporting that to old versions where the flaky
tests have been introduced would be too difficult, and probably not
worth it. The other patches, adding MPTCP_LIB_SUBTEST_FLAKY=1, have a
Fixes tag, simply to ease the backport of the future fixes removing them
along with the proper fix.
====================

Link: https://lore.kernel.org/r/20240524-upstream-net-20240524-selftests-mptcp-flaky-v1-0-a352362f3f8e@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
s/of/off/

Signed-off-by: Thorsten Blum <[email protected]>
Fixes: e110ba6 ("docs: netdev: add note about Changes Requested and revising commit messages")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Polling is initially attempted with timeout_base_ms enabled for
preemption, and if it exceeds this timeframe, another attempt is made
without preemption, allowing an additional 50 ms before timing out.

v2
- Rebase

v3
- Move warnings to separate patch (Lucas)

Cc: Lucas De Marchi <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Signed-off-by: Himal Prasad Ghimiray <[email protected]>
Fixes: 7dc9b92 ("drm/xe: Remove i915_utils dependency from xe_pcode.")
Reviewed-by: Lucas De Marchi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit c81858e)
Signed-off-by: Thomas Hellström <[email protected]>
The GuC context scheduling queue is 2 entires deep, thus it is possible
for a migration job to be stuck behind a fault if migration exec queue
shares engines with user jobs. This can deadlock as the migrate exec
queue is required to service page faults. Avoid deadlock by only using
reserved BCS instances for usm migrate exec queue.

Fixes: a043fba ("drm/xe/pvc: Use fast copy engines as migrate engine on PVC")
Cc: Matt Roper <[email protected]>
Cc: Niranjana Vishwanathapura <[email protected]>
Signed-off-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Brian Welty <[email protected]>
(cherry picked from commit 04f4a70)
Signed-off-by: Thomas Hellström <[email protected]>
Release the submission_state lock if alloc_guc_id() fails.

v2: Add Fixes tag and CC stable kernel

Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: <[email protected]> # v6.8+
Signed-off-by: Niranjana Vishwanathapura <[email protected]>
Reviewed-by: Nirmoy Das <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 40672b7)
Signed-off-by: Thomas Hellström <[email protected]>
The TPM SPI transfer mechanism uses MAX_SPI_FRAMESIZE for computing the
maximum transfer length and the size of the transfer buffer. As such, it
does not account for the 4 bytes of header that prepends the SPI data
frame. This can result in out-of-bounds accesses and was confirmed with
KASAN.

Introduce SPI_HDRSIZE to account for the header and use to allocate the
transfer buffer.

Fixes: a86a42a ("tpm_tis_spi: Add hardware wait polling")
Signed-off-by: Matthew R. Ochs <[email protected]>
Tested-by: Carol Soto <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
With only single call site, this makes no sense (slipped out of the
radar during the review). Open code and document the action directly
to the site, to make it more readable.

Fixes: 1b6d7f9 ("tpm: add session encryption protection to tpm2_get_random()")
Signed-off-by: Jarkko Sakkinen <[email protected]>
sk_psock_get will return NULL if the refcount of psock has gone to 0, which
will happen when the last call of sk_psock_put is done. However,
sk_psock_drop may not have finished yet, so the close callback will still
point to sock_map_close despite psock being NULL.

This can be reproduced with a thread deleting an element from the sock map,
while the second one creates a socket, adds it to the map and closes it.

That will trigger the WARN_ON_ONCE:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Modules linked in:
CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02
RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293
RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000
RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0
RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3
R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840
R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870
FS:  000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 unix_release+0x87/0xc0 net/unix/af_unix.c:1048
 __sock_release net/socket.c:659 [inline]
 sock_close+0xbe/0x240 net/socket.c:1421
 __fput+0x42b/0x8a0 fs/file_table.c:422
 __do_sys_close fs/open.c:1556 [inline]
 __se_sys_close fs/open.c:1541 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1541
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb37d618070
Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070
RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Use sk_psock, which will only check that the pointer is not been set to
NULL yet, which should only happen after the callbacks are restored. If,
then, a reference can still be gotten, we may call sk_psock_stop and cancel
psock->work.

As suggested by Paolo Abeni, reorder the condition so the control flow is
less convoluted.

After that change, the reproducer does not trigger the WARN_ON_ONCE
anymore.

Suggested-by: Paolo Abeni <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355
Fixes: aadb2bb ("sock_map: Fix a potential use-after-free in sock_map_close()")
Fixes: 5b4a79b ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
Cc: [email protected]
Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]>
Acked-by: Jakub Sitnicki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Rename and document TPM2_OA_TMPL, as originally requested in the patch
set review, but left unaddressed without any appropriate reasoning. The
new name is TPM2_OA_NULL_KEY, has a documentation and is local only to
tpm2-sessions.c.

Link: https://lore.kernel.org/linux-integrity/ddbeb8111f48a8ddb0b8fca248dff6cc9d7079b2.camel@HansenPartnership.com/
Link: https://lore.kernel.org/linux-integrity/CZCKTWU6ZCC9.2UTEQPEVICYHL@suppilovahvero/
Signed-off-by: Jarkko Sakkinen <[email protected]>
Given the not fully root caused performance issues on non-x86 platforms,
enable the feature by default only for x86-64. That is the platform it
brings the most value and has gone most of the QA. Can be reconsidered
later and can be obviously opt-in enabled too on any arch.

Link: https://lore.kernel.org/linux-integrity/[email protected]/#t
Signed-off-by: Jarkko Sakkinen <[email protected]>
When a UMP packet is converted between MIDI1 and MIDI2 protocols, the
bank selection may be lost.  The conversion from MIDI1 to MIDI2 needs
the encoding of the bank into UMP_MSG_STATUS_PROGRAM bits, while the
conversion from MIDI2 to MIDI1 needs the extraction from that
instead.

This patch implements the missing bank selection mechanism in those
conversions.

Fixes: e9e0281 ("ALSA: seq: Automatic conversion of UMP events")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
The current code to convert from a legacy sequencer event to UMP MIDI2
clears the bank selection at each time the program change is
submitted.  This is confusing and may lead to incorrect bank values
tranmitted to the destination in the end.

Drop the line to clear the bank info and keep the provided values.

Fixes: e9e0281 ("ALSA: seq: Automatic conversion of UMP events")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
The information on PCI class/subclass was interesting in the Skylake
timeframe, since the DSP was only enabled on a limited number of
platforms. Now most Intel platforms do enable the DSP, so the
information is less interesting to log.

When a DSP driver is used, the common helper may be called multiple
times due to deferred probes, but there's no reason to print the same
information multiple times. Using dev_info_once() covers all the
existing usages for internal cards with DSPs. External cards don't
rely on DSPs so far.

Signed-off-by: Pierre-Louis Bossart <[email protected]>
Reviewed-by: Bard Liao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
…put_matches is called

In this patch, a software bug has been fixed.
rtq2208_ldo_match is no longer a local variable.
It prevents invalid memory access when devm_of_regulator_put_matches
 is called.

Signed-off-by: Alina Yu <[email protected]>
Link: https://msgid.link/r/4ce8c4f16f1cf3aa4e5f36c0694dd3c5ccf3cd1c.1716870419.git.alina_yu@richtek.com
Signed-off-by: Mark Brown <[email protected]>
When changing the maximum number of open zones, print that number
instead of the total number of zones.

Fixes: dc4d137 ("null_blk: add support for max open/active zone limit for zoned devices")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
sd can set a max_sectors value that is lower than the max_hw_sectors
limit based on the block limits VPD page.   While this is rather unusual,
it used to work until the max_user_sectors field was split out to cleanly
deal with conflicting hardware and user limits when the hardware limit
changes.  Also set max_user_sectors to ensure the limit can properly be
stacked.

Fixes: 4f563a6 ("block: add a max_user_discard_sectors queue limit")
Reported-by: Mike Snitzer <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
The max_user_sectors is one of the three factors determining the actual
max_sectors limit for READ/WRITE requests.  Because of that it needs to
be stacked at least for the device mapper multi-path case where requests
are directly inserted on the lower device.  For SCSI disks this is
important because the sd driver actually sets it's own advisory limit
that is lower than max_hw_sectors based on the block limits VPD page.
While this is a bit odd an unusual, the same effect can happen if a
user or udev script tweaks the value manually.

Fixes: 4f563a6 ("block: add a max_user_discard_sectors queue limit")
Reported-by: Mike Snitzer <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
The logical block size need to be smaller than the max_hw_sector
setting, otherwise we can't even transfer a single LBA.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Currently, if the gc is running, when the allocator found free_inc
is empty, allocator has to wait the gc finish. Before that, the
IO is blocked.

But actually, there would be some buckets is reclaimable before gc,
and gc will never mark this kind of bucket to be unreclaimable.

So we can put these buckets into free_inc in gc running to avoid
IO being blocked.

Signed-off-by: Dongsheng Yang <[email protected]>
Signed-off-by: Mingzhe Zou <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
If there are extreme heavy write I/O continuously hit on relative small
cache device (512GB in my testing), it is possible to make counter
c->gc_stats.in_use continue to increase and exceed CUTOFF_CACHE_ADD.

If 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all following write
requests will bypass the cache device because check_should_bypass()
returns 'true'. Because all writes bypass the cache device, counter
c->sectors_to_gc has no chance to be negative value, and garbage
collection thread won't be waken up even the whole cache becomes clean
after writeback accomplished. The aftermath is that all write I/Os go
directly into backing device even the cache device is clean.

To avoid the above situation, this patch uses a quite conservative way
to fix: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, only wakes
up garbage collection thread when the whole cache device is clean.

Before the fix, the writes-always-bypass situation happens after 10+
hours write I/O pressure on 512GB Intel optane memory which acts as
cache device. After this fix, such situation doesn't happen after 36+
hours testing.

Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
In __bch_bucket_alloc_set() the lines after lable 'err:' indeed do
nothing useful after multiple cache devices are removed from bcache
code. This cleanup patch drops the useless code to save a bit CPU
cycles.

Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
The start counter for FT1 filter is wrongly set to 0 in the driver.
FT1 is used for source address violation (SAV) check and source address
starts at Byte 6 not Byte 0. Fix this by changing start counter to
ETH_ALEN in icssg_ft1_set_mac_addr().

Fixes: e9b4ece ("net: ti: icssg-prueth: Add Firmware config and classification APIs.")
Signed-off-by: MD Danish Anwar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
It can be useful to know the exact byte offset within a btree node where
an error occured.

Signed-off-by: Kent Overstreet <[email protected]>
This function is used for finding the hash seed (which is the same in
all versions of an inode in different snapshots): ff an inode has been
deleted in a child snapshot we need to iterate until we find a live
version.

Signed-off-by: Kent Overstreet <[email protected]>
We now track whether a transaction is locked, and verify that we don't
have nodes locked when the transaction isn't locked; reorder relocks to
not pop the new assert.

Signed-off-by: Kent Overstreet <[email protected]>
Consolidate per-key work into delete_dead_snapshots_process_key(), so we
now walk all keys once, not twice.

Signed-off-by: Kent Overstreet <[email protected]>
delete_dead_snapshots now runs before the main fsck.c passes which check
for keys for invalid snapshots; thus, it needs those checks as well.

Signed-off-by: Kent Overstreet <[email protected]>
bch2_check_version_downgrade() was setting c->sb.version, which
bch2_sb_set_downgrade() expects to be at the previous version; and it
shouldn't even have been set directly because c->sb.version is updated
by write_super().

Signed-off-by: Kent Overstreet <[email protected]>
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o

Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.

Signed-off-by: Kent Overstreet <[email protected]>
bibo-mao and others added 27 commits June 20, 2024 17:20
Function kvm_reset_dirty_gfn may be called with parameters cur_slot /
cur_offset / mask are all zero, it does not represent real dirty page.
It is not necessary to clear dirty page in this condition. Also return
value of macro __fls() is undefined if mask is zero which is called in
funciton kvm_reset_dirty_gfn(). Here just return.

Signed-off-by: Bibo Mao <[email protected]>
Message-ID: <[email protected]>
[Move the conditional inside kvm_reset_dirty_gfn; suggested by
 Sean Christopherson. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
kvm_gmem_populate() is a potentially lengthy operation that can involve
multiple calls to the firmware.  Interrupt it if a signal arrives.

Fixes: 1f6c06b ("KVM: guest_memfd: Add interface for populating gmem pages with user data")
Cc: Isaku Yamahata <[email protected]>
Cc: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace
the SPTE value and returns -EBUSY on failure.  The caller must check the
return value and retry.  Add __must_check to it, as well as to two more
functions that forward the return value of __tdp_mmu_set_spte_atomic to
their caller.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Acked-by: Kai Huang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Version 2 of GHCB specification added support for the SNP Guest Request
Message NAE event. The event allows for an SEV-SNP guest to make
requests to the SEV-SNP firmware through hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.

This is used by guests primarily to request attestation reports from
firmware. There are other request types are available as well, but the
specifics of what guest requests are being made are opaque to the
hypervisor, which only serves as a proxy for the guest requests and
firmware responses.

Implement handling for these events.

Co-developed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: ensure FW command failures are indicated to guest, drop extended
 request handling to be re-written as separate patch, massage commit]
Signed-off-by: Michael Roth <[email protected]>
These commands can be used to create a transaction such that commands
that update the reported TCB, such as SNP_SET_CONFIG/SNP_COMMIT, and
updates to userspace-supplied certificates, can be handled atomically
relative to any extended guest requests issued by any SNP guests while
the updates are taking place.

Without this interface, there is a risk that a guest will be given
certificate information that does not correspond to the VCEK/VLEK used
to sign a particular attestation report unless all the running guests
are paused in advance, which would cause disruption to all guests in the
system even if no attestation requests are being made. Even then, care
is needed to ensure that KVM does not pass along certificate information
that was fetched from userspace in advance of the guest being paused.

This interface also provides some versatility with how similar firmware
maintenance activity can be handled in the future without passing
unnecessary management complexity on to userspace.

Signed-off-by: Michael Roth <[email protected]>
Version 2 of GHCB specification added support for the SNP Extended Guest
Request Message NAE event. This event serves a nearly identical purpose
to the previously-added SNP_GUEST_REQUEST event, but allows for
additional certificate data to be supplied via an additional
guest-supplied buffer to be used mainly for verifying the signature of
an attestation report as returned by firmware.

This certificate data is supplied by userspace, so unlike with
SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
forwarded to userspace via a KVM_EXIT_VMGEXIT exit type, and then the
firmware request is made only afterward.

Implement handling for these events.

Since there is a potential for race conditions where the
userspace-supplied certificate data may be out-of-sync relative to the
reported TCB that firmware will use when signing attestation reports,
make use of the transaction/synchronization mechanisms added by the
SNP_SET_CONFIG_{START,END} SEV device ioctls such that the guest will be
told to retry the request when an update to reported TCB or
userspace-supplied certificates may have occurred or is in progress
while an extended guest request is being processed.

Signed-off-by: Michael Roth <[email protected]>
Print additional information, in the form of the old and new versions of
the SEV firmware, so that it can be seen what the base firmware was before
the upgrade.

Signed-off-by: Tom Lendacky <[email protected]>
There are situations where SEV-ES and/or SEV-SNP guests cannot be run
but the feature flags are still advertised. Add additional checks to the
current SME and SEV checks to clear these CPU feature flags under these
conditions:

  - CPUID 0x8000001f/edx reports the minimum SEV ASID, which is used to
    determine the number of SEV-ES/SEV-SNP ASIDs that are available. If
    the value reported is <= 1, then SEV-ES and SEV-SNP guests can't be
    run and the SEV-ES and SEV-SNP feature flags should be cleared.

  - SEV-SNP support relies on BIOS allocation of the RMP table. If the
    table hasn't been allocated or allocated properly, the SEV-SNP feature
    flag should be cleared.

Signed-off-by: Tom Lendacky <[email protected]>
Update AP creation to support ADD/DESTROY of VMSAs at levels other than
VMPL0 in order to run with under an SVSM at VMPL1 or lower. To maintain
backwards compatibility, the VMPL is specified in bits 16 to 19 of the
AP Creation request in SW_EXITINFO1 of the GHCB.

In order to track the VMSAs at different levels, create arrays for the
VMSAs, GHCBs, registered GHCBs and others. When switching VMPL levels,
these entries will be used to set the VMSA and GHCB physical addresses
in the VMCB for the VMPL level.

In order ensure that the proper responses are returned in the proper GHCB,
the GHCB must be unmapped at the current level and saved for restoration
later when switching back to that VMPL level.

Additional checks are applied to prevent a non-VMPL0 vCPU from being able
to perform an AP creation request. Additionally, a vCPU cannot replace its
own VMSA.

Signed-off-by: Tom Lendacky <[email protected]>
Implement the GET_APIC_IDS NAE event to gather and return the list of
APIC IDs for all vCPUs in the guest.

Signed-off-by: Tom Lendacky <[email protected]>
Implement the RUN_VMPL NAE event and MSR protocol to allow a guest to
request a different VMPL level VMSA be run for the vCPU. This allows
the guest to "call" an SVSM to process an SVSM request.

Signed-off-by: Tom Lendacky <[email protected]>
Update the hypervisor supported features to indicate that the SVSM feature
is supported. The SVSM feature consists of:
	- APIC ID retrieval support
	- Multi-VMPL support
		- AP Creation at specified VMPL
		- Run VMPL level support

Signed-off-by: Tom Lendacky <[email protected]>
Add helper function has_snp_feature() to determine if a SNP guest has a
given feature. This will be particularly positive for when new checks
and/or SNP features are added in the future.

Signed-off-by: Carlos Bilbao <[email protected]>
Make struct kvm_sev_info maintain separate SEV features per VMPL, allowing
distinct SEV features depending on VMs privilege level.

Signed-off-by: Carlos Bilbao <[email protected]>
Prevent injection of exceptions/interrupts when restricted injection is
active. This is not full support for restricted injection, but the SVSM
is not expecting any injections at all.

Signed-off-by: Tom Lendacky <[email protected]>
Allow an Restricted Injection to be set in SEV_FEATURES. When set,
attempts to inject any interrupts other than #HV will make VMRUN fail.
This is done to further reduce the security exposure within the SVSM.

Signed-off-by: Carlos Bilbao <[email protected]>
During early boot phases, check for the presence of an SVSM when running
as an SEV-SNP guest.

An SVSM is present if the 64-bit value at offset 0x148 into the secrets
page is non-zero. If an SVSM is present, save the SVSM Calling Area
address (CAA), located at offset 0x150 into the secrets page, and set
the VMPL level of the guest, which should be non-zero, to indicate the
presence of an SVSM.

Signed-off-by: Tom Lendacky <[email protected]>
The SVSM Calling Area (CA) is used to communicate between Linux and the
SVSM. Since the firmware supplied CA for the BSP is likely to be in
reserved memory, switch off that CA to a kernel provided CA so that access
and use of the CA is available during boot. The CA switch is done using
the SVSM core protocol SVSM_CORE_REMAP_CAA call.

An SVSM call is executed by filling out the SVSM CA and setting the proper
register state as documented by the SVSM protocol. The SVSM is invoked by
by requesting the hypervisor to run VMPL0.

Once it is safe to allocate/reserve memory, allocate a CA for each CPU.
After allocating the new CAs, the BSP will switch from the boot CA to the
per-CPU CA. The CA for an AP is identified to the SVSM when creating the
VMSA in preparation for booting the AP.

Signed-off-by: Tom Lendacky <[email protected]>
The PVALIDATE instruction can only be performed at VMPL0. An SVSM will
be present when running at VMPL1 or a lower privilege level.

When an SVSM is present, use the SVSM_CORE_PVALIDATE call to perform
memory validation instead of issuing the PVALIDATE instruction directly.

The validation of a single 4K page is now explicitly identified as such
in the function name, pvalidate_4k_page(). The pvalidate_pages() function
is used for validating 1 or more pages at either 4K or 2M in size. Each
function, however, determines whether it can issue the PVALIDATE directly
or whether the SVSM needs to be invoked.

Signed-off-by: Tom Lendacky <[email protected]>
Using the RMPADJUST instruction, the VSMA attribute can only be changed
at VMPL0. An SVSM will be present when running at VMPL1 or a lower
privilege level.

When an SVSM is present, use the SVSM_CORE_CREATE_VCPU call or the
SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the
VMPL level supplied by the SVSM within the VMSA and when starting the
AP.

Signed-off-by: Tom Lendacky <[email protected]>
The SVSM specification documents an alternative method of discovery for
the SVSM using a reserved CPUID bit and a reserved MSR.

For the CPUID support, the #VC handler of an SEV-SNP guest should modify
the returned value in the EAX register for the 0x8000001f CPUID function
by setting bit 28 when an SVSM is present.

For the MSR support, new reserved MSR 0xc001f000 has been defined. A #VC
should be generated when accessing this MSR. The #VC handler is expected
to ignore writes to this MSR and return the physical calling area address
(CAA) on reads of this MSR.

Signed-off-by: Tom Lendacky <[email protected]>
Requesting an attestation report from userspace involves providing the
VMPL level for the report. Currently any value from 0-3 is valid because
Linux enforces running at VMPL0.

When an SVSM is present, though, Linux will not be running at VMPL0 and
only VMPL values starting at the VMPL level Linux is running at to 3 are
valid. In order to allow userspace to determine the minimum VMPL value
that can be supplied to an attestation report, create a sysfs entry that
can be used to retrieve the current VMPL level of Linux.

Signed-off-by: Tom Lendacky <[email protected]>
Currently, the sev-guest driver uses the vmpck-0 key by default. When an
SVSM is present the kernel is running at a VMPL other than 0 and the
vmpck-0 key is no longer available. So choose the vmpck key based on the
active VMPL level.

Signed-off-by: Tom Lendacky <[email protected]>
When an SVSM is present, the guest can also request attestation reports
from the SVSM. These SVSM attestation reports can be used to attest the
SVSM and any services running within the SVSM.

Extend the config-fs attestation support to allow for an SVSM attestation
report. This involves creating four (4) new config-fs attributes:

  - 'svsm' (input)
    This attribute is used to determine whether the attestation request
    should be sent to the SVSM or to the SEV firmware.

  - 'service_guid' (input)
    Used for requesting the attestation of a single service within the
    SVSM. A null GUID implies that the SVSM_ATTEST_SERVICES call should
    be used to request the attestation report. A non-null GUID implies
    that the SVSM_ATTEST_SINGLE_SERVICE call should be used.

  - 'service_version' (input)
    Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version
    represents a specific service manifest version be used for the
    attestation report.

  - 'manifestblob' (output)
    Used to return the service manifest associated with the attestation
    report.

Signed-off-by: Tom Lendacky <[email protected]>
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.

Signed-off-by: Tom Lendacky <[email protected]>
The VMSA containing the initial CPU state for an SEV-SNP guest is
measured as part of the launch process. Currently, KVM does this
automatically during the call to KVM_SEV_SNP_LAUNCH_FINISH where the CPU
state is synchronised to the VMSA for every vCPU, measured then the
guest is launched.

This poses a problem for guests that want to have full control over the
number and contents of VMSAs, such as when using an SVSM module or
paravisor. In which case, for example, you may only want the BSP VMSA to
be provided, or have full control over non-synced registers.

As soon as the VMSA is measured it is encrypted by hardware so KVM
immediately loses sight and control over the contents. With this in
mind, there is no need to keep the VMSA in sync with KVMs view of
register state. Therefore it makes sense to bypass the sync completely
and provide a way for the VMSA to be directly specified from userspace
if required.

This commit extends the KVM_SEV_SNP_LAUNCH_UPDATE ioctl to allow VMSA
pages to be updated. When encountered, this modifies the behaviour of
KVM_SEV_SNP_LAUNCH_FINISH to prevent the sync and measurement of CPU
state. This allows for both legacy functionaly and new functionality to
co-exist.

Signed-off-by: Roy Hopkins <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet