Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVIDIA: SAUCE: iommu/arm-smmu-v3: Allow default substream bypass with… #22

Open
wants to merge 128 commits into
base: 24.04_linux-nvidia
Choose a base branch
from

Conversation

jamieNguyenNVIDIA
Copy link

… a pasid support

When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would skip the allocation of a CD table and set the CONFIG field of the STE to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have one substream, i.e. PASID disabled.

However, there could be a use case, for a pasid capable device, that allows bypassing the translation at the default substream while still enabling the pasid feature, which means the driver should not skip the allocation of a CD table nor simply bypass the CONFIG field. Instead, the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING.

Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the finalise() to support this use case.

Also, according to "13.5 Summary of attribute/permission configuration fields" in the reference manual, the SHCFG field value is irrelevant. So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING for simplification.

Reviewed-by: Pritesh Raithatha [email protected]

Tested-by: Matt Ochs [email protected]

ianmay81 and others added 30 commits July 17, 2024 11:15
…dversion"

This reverts commit 47d27f2.

We need to revert this to avoid regressing any modules used in Jammy.

Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Paolo Pisati <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2055128
Properties: no-test-build
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
jacobmartin0 and others added 23 commits July 17, 2024 11:15
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2068306
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071967
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2069777

Use u8, u32 and u64 instead of respective uintN_t types.
Remove unnecessary newlines for function argument lists.

Signed-off-by: Shravan Kumar Ramani <[email protected]>
Link: https://lore.kernel.org/r/39be055af3506ce6f843d11e45d71620f2a96e26.1707808180.git.shravankr@nvidia.com
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
(cherry picked from commit fd23023)
Signed-off-by: David Thompson <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2069777

Use unsigned integer types for register values and array indices.
Use %u instead of %d accordingly.

Signed-off-by: Shravan Kumar Ramani <[email protected]>
Link: https://lore.kernel.org/r/d8548c70339a29258a906b2b518e5c48f669795c.1707808180.git.shravankr@nvidia.com
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
(cherry picked from commit 1ae9ffd)
Signed-off-by: David Thompson <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
…ptional

BugLink: https://bugs.launchpad.net/bugs/2069777

The mlxbf_pmc_event_list() function returns a pointer to an array of
supported events and the array size. The array size is returned via
a pointer passed as an argument, which is mandatory.

However, we want to be able to use mlxbf_pmc_event_list() just to check
if a block name is implemented/supported. For this usage passing the size
argument is not necessary so let's make it optional.

Signed-off-by: Luiz Capitulino <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Link: https://lore.kernel.org/r/182de8ec6b9c33152f2ba6b248c35b0311abf5e4.1708635408.git.luizcap@redhat.com
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
(cherry picked from commit 0d46439)
Signed-off-by: David Thompson <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2069777

Currently, the driver has two behaviors to deal with new & unsupported
performance blocks reported by the firmware:

 1. For register and unknown block types, the driver will fail to load
    with the following error message:

    [ 4510.956369] mlxbf-pmc: probe of MLNXBFD2:00 failed with error -22

 2. For counter and crspace blocks, the driver will load and sysfs files
    will be created but getting the contents of event_list or trying to
    setup the counter will fail

Instead, let's ignore and log unsupported blocks. This means the driver
will always load and unsupported blocks will never show up in sysfs.

Signed-off-by: Luiz Capitulino <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Link: https://lore.kernel.org/r/f8e2e6210b43e825b69824b420c801cd513d401d.1708635408.git.luizcap@redhat.com
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
(cherry picked from commit c0459ee)
Signed-off-by: David Thompson <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2069777

These need to be signed for the error handling to work.  The
mlxbf_pmc_get_event_num() function returns int so int type is correct.

Fixes: 1ae9ffd ("platform/mellanox: mlxbf-pmc: Cleanup signed/unsigned mix-up")
Signed-off-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
(cherry picked from commit 7c8772f)
Signed-off-by: David Thompson <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071654

We enumerate devices by attempting config reads to the Vendor ID of each
possible device.  On conventional PCI, if no device responds, the read
terminates with a Master Abort (PCI r3.0, sec 6.1).  On PCIe, the config
read is terminated as an Unsupported Request (PCIe r6.0, sec 2.3.2,
7.5.1.3.7).  In either case, if the read addressed a device below a bridge,
it is logged by setting "Received Master Abort" in the bridge Secondary
Status register.

Clear any errors logged in the Secondary Status register after enumeration.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vidya Sagar <[email protected]>
[bhelgaas: simplify commit log]
Signed-off-by: Bjorn Helgaas <[email protected]>
(cherry picked from commit 7bf9d2a)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071655

Start switching iomap_copy routines over to use #define and arch provided
inline/macro functions instead of weak symbols.

Inline functions allow more compiler optimization and this is often a
driver hot path.

x86 has the only weak implementation for __iowrite32_copy(), so replace it
with a static inline containing the same single instruction inline
assembly. The compiler will generate the "mov edx,ecx" in a more optimal
way.

Remove iomap_copy_64.S

Link: https://lore.kernel.org/r/[email protected]
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 20516d6)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071655

It is trivial to implement an inline to do this, so provide it in the s390
headers. Like the 64 bit version it should just invoke zpci_memcpy_toio()
with the correct size.

Link: https://lore.kernel.org/r/[email protected]
Acked-by: Niklas Schnelle <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 6ae798c)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071655

Complete switching the __iowriteXX_copy() routines over to use #define and
arch provided inline/macro functions instead of weak symbols.

S390 has an implementation that simply calls another memcpy
function. Inline this so the callers don't have to do two jumps.

Link: https://lore.kernel.org/r/[email protected]
Acked-by: Niklas Schnelle <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit e7bc47b)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071655

The kernel provides driver support for using write combining IO memory
through the __iowriteXX_copy() API which is commonly used as an optional
optimization to generate 16/32/64 byte MemWr TLPs in a PCIe environment.

iomap_copy.c provides a generic implementation as a simple 4/8 byte at a
time copy loop that has worked well with past ARM64 CPUs, giving a high
frequency of large TLPs being successfully formed.

However modern ARM64 CPUs are quite sensitive to how the write combining
CPU HW is operated and a compiler generated loop with intermixed
load/store is not sufficient to frequently generate a large TLP. The CPUs
would like to see the entire TLP generated by consecutive store
instructions from registers. Compilers like gcc tend to intermix loads and
stores and have poor code generation, in part, due to the ARM64 situation
that writeq() does not codegen anything other than "[xN]". However even
with that resolved compilers like clang still do not have good code
generation.

This means on modern ARM64 CPUs the rate at which __iowriteXX_copy()
successfully generates large TLPs is very small (less than 1 in 10,000)
tries), to the point that the use of WC is pointless.

Implement __iowrite32/64_copy() specifically for ARM64 and use inline
assembly to build consecutive blocks of STR instructions. Provide direct
support for 64/32/16 large TLP generation in this manner. Optimize for
common constant lengths so that the compiler can directly inline the store
blocks.

This brings the frequency of large TLP generation up to a high level that
is comparable with older CPU generations.

As the __iowriteXX_copy() family of APIs is intended for use with WC
incorporate the DGH hint directly into the function.

Link: https://lore.kernel.org/r/[email protected]
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit ead7911)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2071655

Now that the ARM64 arch implementation does the DGH as part of
__iowrite64_copy() there is no reason to open code this in drivers.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jijie Shao<[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 2b7a5e1)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2072186
Properties: no-test-build
Signed-off-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2073811

PCIe ACS settings control the level of isolation and the possible P2P paths
between devices. With greater isolation the kernel will create smaller
iommu_groups and with less isolation there is more HW that can achieve P2P
transfers. From a virtualization perspective all devices in the same
iommu_group must be assigned to the same VM as they lack security
isolation.

There is no way for the kernel to automatically know the correct ACS
settings for any given system and workload. Existing command line options
(e.g., disable_acs_redir) allow only for large scale change, disabling all
isolation, but this is not sufficient for more complex cases.

Add a kernel command-line option 'config_acs' to directly control all the
ACS bits for specific devices, which allows the operator to setup the right
level of isolation to achieve the desired P2P configuration.  The
definition is future proof; when new ACS bits are added to the spec the
open syntax can be extended.

ACS needs to be setup early in the kernel boot as the ACS settings affect
how iommu_groups are formed. iommu_group formation is a one time event
during initial device discovery, so changing ACS bits after kernel boot can
result in an inaccurate view of the iommu_groups compared to the current
isolation configuration.

ACS applies to PCIe Downstream Ports and multi-function devices.  The
default ACS settings are strict and deny any direct traffic between two
functions. This results in the smallest iommu_group the HW can support.
Frequently these values result in slow or non-working P2PDMA.

ACS offers a range of security choices controlling how traffic is
allowed to go directly between two devices. Some popular choices:

  - Full prevention

  - Translated requests can be direct, with various options

  - Asymmetric direct traffic, A can reach B but not the reverse

  - All traffic can be direct

Along with some other less common ones for special topologies.

The intention is that this option would be used with expert knowledge of
the HW capability and workload to achieve the desired configuration.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Vidya Sagar <[email protected]>
[bhelgaas: add example, tidy printk formats]
Signed-off-by: Bjorn Helgaas <[email protected]>
(cherry picked from commit 47c8846)
Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Noah Wager <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2075396

Commit 3bd786f ("mm: convert do_set_pte() to set_pte_range()")
replaced do_set_pte() with set_pte_range() and that introduced a
regression in the following faulting path of non-anonymous vmas which
caused the PTE for the faulting address to be marked as old instead of
young.

handle_pte_fault()
  do_pte_missing()
    do_fault()
      do_read_fault() || do_cow_fault() || do_shared_fault()
        finish_fault()
          set_pte_range()

The polarity of prefault calculation is incorrect.  This leads to prefault
being incorrectly set for the faulting address.  The following check will
incorrectly mark the PTE old rather than young.  On some architectures
this will cause a double fault to mark it young when the access is
retried.

    if (prefault && arch_wants_old_prefaulted_pte())
        entry = pte_mkold(entry);

On a subsequent fault on the same address, the faulting path will see a
non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE
will then be correctly marked young in handle_pte_fault() itself.

Due to this bug, performance degradation in the fault handling path will
be observed due to unnecessary double faulting.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3bd786f ("mm: convert do_set_pte() to set_pte_range()")
Signed-off-by: Ram Tummala <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(backported from commit 4cd7ba1)
[context changes]
Signed-off-by: Jamie Nguyen <[email protected]>
Tested-by: Carol Soto <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Noah Wager <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
… a pasid support

When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would
skip the allocation of a CD table and set the CONFIG field of the STE
to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have
one substream, i.e. PASID disabled.

However, there could be a use case, for a pasid capable device, that
allows bypassing the translation at the default substream while still
enabling the pasid feature, which means the driver should not skip the
allocation of a CD table nor simply bypass the CONFIG field. Instead,
the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the
SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING.

Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the
finalise() to support this use case.

Also, according to "13.5 Summary of attribute/permission configuration
fields" in the reference manual, the SHCFG field value is irrelevant.
So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING
for simplification.

Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Pritesh Raithatha <[email protected]>
Signed-off-by: Jamie Nguyen <[email protected]>
Tested-by: Matt Ochs <[email protected]>
@nvmochs
Copy link
Collaborator

nvmochs commented Aug 7, 2024

Without SAUCE patch, CUDA apps fail with iommu.passthrough=y:

root@ubuntu:~# uname -a
Linux ubuntu 6.8.0-1010-nvidia-64k #10-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 20:12:13 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@ubuntu:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-6.8.0-1010-nvidia-64k root=/dev/mapper/ubuntu--vg-ubuntu--lv ro iommu.passthrough=y
root@ubuntu:~# nvidia-smi
Tue Aug  6 20:53:39 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.10                 Driver Version: 565.10         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             Off |   00000009:01:00.0 Off |                    0 |
| N/A   62C    P0            239W /  700W |       2MiB /  97871MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/gflops/gflops
Running GFLOPs test...
&&&& gflops test FAILED

root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/uvmConformance/uvmConformance -t texture_simple
(thread 251711430748352 [t0]) At /dvs/p4/build/sw/gpgpu/cuda/apps/common/dfontaine/devicefilter.cpp:529:
    Value of : cudaGetDeviceCount(&deviceCount)
    Which is : 1 (cudaErrorInvalidValue)
    Should be: cudaSuccess
    Which is : 0 (cudaSuccess)
&&&& uvmConformance test FAILED

root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host 
(thread 260805548927168 [t0]) At /dvs/p4/build/sw/gpgpu/cuda/apps/common/dfontaine/devicefilter.cpp:529:
    Value of : cudaGetDeviceCount(&deviceCount)
    Which is : 1 (cudaErrorInvalidValue)
    Should be: cudaSuccess
    Which is : 0 (cudaSuccess)
&&&& uvmConformance test FAILED


With SAUCE patch, CUDA apps pass with iommu.passthrough=y:

root@ubuntu:~# uname -a
Linux ubuntu 6.8.0-2011-nvidia-64k #11~pasid SMP PREEMPT_DYNAMIC Mon Aug  5 19:46:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@ubuntu:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-6.8.0-2011-nvidia-64k root=/dev/mapper/ubuntu--vg-ubuntu--lv ro iommu.passthrough=y
root@ubuntu:~# nvidia-smi
Tue Aug  6 21:18:54 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.10                 Driver Version: 565.10         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             Off |   00000009:01:00.0 Off |                    0 |
| N/A   59C    P0            237W /  700W |       2MiB /  97871MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+                                                             

root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/gflops/gflops
Running GFLOPs test...
&&&& PERF GFLOPs 0
&&&& gflops test PASSED

root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/uvmConformance/uvmConformance -t texture_simple
Device 0: NVIDIA GH200 120GB
Driver version: 12070
Runtime version: 12070
Dispatcher pid: 2249
Running test texture_simple (pid: 2282)
^^^^ PASS: texture_simple (566.5ms)
Total time: 567ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED

root@ubuntu:~/mochs/r565/internal_r565_cuda12.7_with_nvbug_4318299_fix# tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host 
Device 0: NVIDIA GH200 120GB
Driver version: 12070
Runtime version: 12070
Dispatcher pid: 2286
Running test ats_malloc_host (pid: 2318)
^^^^ PASS: ats_malloc_host (557.6ms)
Total time: 558ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED

nvidia-bfigg pushed a commit that referenced this pull request Aug 17, 2024
BugLink: https://bugs.launchpad.net/bugs/2073603

[ Upstream commit 769e6a1 ]

ui_browser__show() is capturing the input title that is stack allocated
memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:
$ sudo bash -c 'rm /tmp/asan.log*; export
ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a
sleep 1; /tmp/perf/perf mem report'
I then go to the perf annotate view and quit. This triggers the asan
error (from the log file):
```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address
0x7f2813331920 at pc 0x7f28180
65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen
../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string
(/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring
(/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId:
84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is
inside this variable
HINT: this may be a false positive if your program uses some custom
stack unwind mechanism, swapcontext or vfork
```
hist_browser__run isn't on the stack so the asan error looks legit.
There's no clean init/exit on struct ui_browser so I may be trading a
use-after-return for a memory leak, but that seems look a good trade
anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Portia Stephens <[email protected]>
Signed-off-by: Roxana Nicolescu <[email protected]>
nvidia-bfigg pushed a commit that referenced this pull request Aug 17, 2024
…uddy pages

BugLink: https://bugs.launchpad.net/bugs/2073788

commit 8cf360b upstream.

When I did memory failure tests recently, below panic occurs:

page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:1009!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__del_page_from_free_list+0x151/0x180
RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __rmqueue_pcplist+0x23b/0x520
 get_page_from_freelist+0x26b/0xe40
 __alloc_pages_noprof+0x113/0x1120
 __folio_alloc_noprof+0x11/0xb0
 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
 __alloc_fresh_hugetlb_folio+0xe7/0x140
 alloc_pool_huge_folio+0x68/0x100
 set_max_huge_pages+0x13d/0x340
 hugetlb_sysctl_handler_common+0xe8/0x110
 proc_sys_call_handler+0x194/0x280
 vfs_write+0x387/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xc2/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff916114887
RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
 </TASK>
Modules linked in: mce_inject hwpoison_inject
---[ end trace 0000000000000000 ]---

And before the panic, there had an warning about bad page state:

BUG: Bad page state in process page-types  pfn:8cee00
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
page_type: 0xffffff7f(buddy)
raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
Call Trace:
 <TASK>
 dump_stack_lvl+0x83/0xa0
 bad_page+0x63/0xf0
 free_unref_page+0x36e/0x5c0
 unpoison_memory+0x50b/0x630
 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
 debugfs_attr_write+0x42/0x60
 full_proxy_write+0x5b/0x80
 vfs_write+0xcd/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xc2/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f189a514887
RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
 </TASK>

The root cause should be the below race:

 memory_failure
  try_memory_failure_hugetlb
   me_huge_page
    __page_handle_poison
     dissolve_free_hugetlb_folio
     drain_all_pages -- Buddy page can be isolated e.g. for compaction.
     take_page_off_buddy -- Failed as page is not in the buddy list.
	     -- Page can be putback into buddy after compaction.
    page_ref_inc -- Leads to buddy page with refcnt = 1.

Then unpoison_memory() can unpoison the page and send the buddy page back
into buddy list again leading to the above bad page state warning.  And
bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy
page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to
allocate this page.

Fix this issue by only treating __page_handle_poison() as successful when
it returns 1.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ceaf8fb ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Portia Stephens <[email protected]>
Signed-off-by: Roxana Nicolescu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.