Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE #11

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

jamieNguyenNVIDIA
Copy link
Contributor

The current calculation of min_free_kbytes only uses ZONE_DMA and ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will also divide part of min_free_kbytes.This will cause the min watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE.

__GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in __setup_per_zone_wmarks().

On my testing machine with 16GB of memory (transparent hugepage is turned off by default, and movablecore=12G is configured) The following is a comparative test data of watermark_min

	no patch	add patch

ZONE_DMA 1 8
ZONE_DMA32 151 709
ZONE_NORMAL 233 1113
ZONE_MOVABLE 1434 128
min_free_kbytes 7288 7326

Link: https://lkml.kernel.org/r/[email protected]

Reviewed-by: "Huang, Ying" [email protected]

(cherry picked from commit 416ef04 linux)

ianmay81 and others added 30 commits December 4, 2023 11:14
…dversion"

This reverts commit 47d27f2.

We need to revert this to avoid regressing any modules used in Jammy.

Signed-off-by: Ian May <[email protected]>
This reverts commit b2638e9.

This feature is not targeted for Jammy.

Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
… a pasid support

BugLink: https://bugs.launchpad.net/bugs/2031320

When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would
skip the allocation of a CD table and set the CONFIG field of the STE
to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have
one substream, i.e. PASID disabled.

However, there could be a use case, for a pasid capable device, that
allows bypassing the translation at the default substream while still
enabling the pasid feature, which means the driver should not skip the
allocation of a CD table nor simply bypass the CONFIG field. Instead,
the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the
SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING.

Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the
finalise() to support this use case.

Also, according to "13.5 Summary of attribute/permission configuration
fields" in the reference manual, the SHCFG field value is irrelevant.
So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING
for simplification.

Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Pritesh Raithatha <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Nicolin Chen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
The current version of nvidia-fs(2.17.4-0) is failing to compile
with the 6.5 kernel.  Drop inclusion until an updated version
is available.

Signed-off-by: Ian May <[email protected]>
…rnel

BugLink: https://bugs.launchpad.net/bugs/2043132

With this change, the NVMe and NVMeOF driver would be enabled to support GPUDirectStorage(GDS).
The change is around nvme/nvme rdma map_data()
and unmap_data(), where the IO request is
first intercepted to check for GDS pages and
if it is a GDS page then the request is served
by GDS driver component called nvidia-fs,
else the request would be served by the standard NVMe driver code

Signed-off-by: Sourab Gupta <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2043132

With this change, the NFS driver would be enabled to support GPUDirectStorage(GDS).
The change is around frwr_map and frwr_unmap in the NFS driver, where the IO request
is first intercepted to check for GDS pages and if it is a GDS page then the
request is served by GDS driver component called nvidia-fs,
else the request would be served by the standard NFS driver code.

Signed-off-by: Sourab Gupta <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
This reverts commit b2638e9.

This feature is not targeted for Jammy.

Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Ian May <[email protected]>
…ng packages that do not apply to this kernel.

Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Ian May <[email protected]>
Add support for exposing rprovides data for standalone modules
too. Switch to exposing provides as a shared debian/substvar file
and use that in the templates.

Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Ian May <[email protected]>
ianmay81 and others added 16 commits January 11, 2024 11:31
Ignore: yes
Signed-off-by: Ian May <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

Add support of "Thermal fast Sampling Period (_TFP)" for passive
cooling.

As per the ACPI specification (ACPI 6.5, Section 11.4.17 "_TFP (Thermal
fast Sampling Period)", _TFP overrides _TSP ("Thermal Sampling Period"
if both are present in a Thermal zone.

Signed-off-by: Jeff Brasen <[email protected]>
Co-developed-by: Sumit Gupta <[email protected]>
Signed-off-by: Sumit Gupta <[email protected]>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Sumit Gupta <[email protected]>
(backported from commit a2ee758 linux-next)
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

Current implementation of processor_thermal performs software throttling
in fixed steps of "20%" which can be too coarse for some platforms.
We observed some performance gain after reducing the throttle percentage.
Change the CPUFREQ thermal reduction percentage and maximum thermal steps
to be configurable. Also, update the default values of both for Nvidia
Tegra241 (Grace) SoC. The thermal reduction percentage is reduced to "5%"
and accordingly the maximum number of thermal steps are increased as they
are derived from the reduction percentage.

Signed-off-by: Srikar Srimath Tirumala <[email protected]>
Co-developed-by: Sumit Gupta <[email protected]>
Signed-off-by: Sumit Gupta <[email protected]>
Acked-by: Sudeep Holla <[email protected]>
Acked-by: Hanjun Guo <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Sumit Gupta <[email protected]>
(cherry picked from commit 310293a linux-next)
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
…arlier

BugLink: https://bugs.launchpad.net/bugs/2049537

Allocate and device assign 'struct etmv4_drvdata' earlier during the driver
probe, ensuring that it can be retrieved in power management based runtime
callbacks if required. This will also help in dropping iomem base address
argument from the function etm4_probe() later.

Cc: Mathieu Poirier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 3095e90 linux)
Signed-off-by: Tushar Dave <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

'struct etm4_drvdata' itself can carry the base address before etm4_probe()
gets called. Just drop that redundant argument from etm4_probe().

Cc: Mathieu Poirier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: James Clark <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 4e3b9a6 linux)
Signed-off-by: Tushar Dave <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

Coresight device pid can be retrieved from its iomem base address, which is
stored in 'struct etm4x_drvdata'. This drops pid argument from etm4_probe()
and 'struct etm4_init_arg'. Instead etm4_check_arch_features() derives the
coresight device pid with a new helper coresight_get_pid(), right before it
is consumed in etm4_hisi_match_pid().

Cc: Mathieu Poirier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 5a1c709 linux)
Signed-off-by: Tushar Dave <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

Add support for handling MMIO based devices via platform driver. We need to
make sure that :

1) The APB clock, if present is enabled at probe and via runtime_pm ops
2) Use the ETM4x architecture or CoreSight architecture registers to
   identify a device as CoreSight ETM4x, instead of relying a white list of
   "Peripheral IDs"

The driver doesn't get to handle the devices yet, until we wire the ACPI
changes to move the devices to be handled via platform driver than the
etm4_amba driver.

Cc: Mathieu Poirier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Sudeep Holla <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 73d779a linux)
Signed-off-by: Tushar Dave <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

Drop ETM4X ACPI ID from the AMBA ACPI device list, and instead just move it
inside the new ACPI devices list detected and used via platform driver.

Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mathieu Poirier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Sudeep Holla <[email protected]> (for ACPI specific changes)
Acked-by: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Tested-by: Tanmay Jagdale <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 134124a linux)
Signed-off-by: Tushar Dave <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2049537

The peer_memory_client scheme allows a driver to register with the ib_umem
system that it has the ability to understand user virtual address ranges
that are not compatible with get_user_pages(). For instance VMAs created
with io_remap_pfn_range(), or other driver special VMA.

For ranges the interface understands it can provide a DMA mapped sg_table
for use by the ib_umem, allowing user virtual ranges that cannot be
supported by get_user_pages() to be used as umems for RDMA.

This is designed to preserve the kABI, no functions or structures are
changed, only new symbols are added:

 ib_register_peer_memory_client
 ib_unregister_peer_memory_client
 ib_umem_activate_invalidation_notifier
 ib_umem_get_peer

And a bitfield in struct ib_umem uses more bits.

This interface is compatible with the two out of tree GPU drivers:
 https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/amdkfd/kfd_peerdirect.c
 https://github.com/Mellanox/nv_peer_memory/blob/master/nv_peer_mem.c

Signed-off-by: Yishai Hadas <[email protected]>
Signed-off-by: Feras Daoud <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Acked-by: Ian May <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
The current calculation of min_free_kbytes only uses ZONE_DMA and
ZONE_NORMAL pages,but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will
also divide part of min_free_kbytes.This will cause the min watermark of
ZONE_NORMAL to be too small in the presence of ZONE_MOVEABLE.

__GFP_HIGH and PF_MEMALLOC allocations usually don't need movable zone
pages, so just like ZONE_HIGHMEM, cap pages_min to a small value in
__setup_per_zone_wmarks().

On my testing machine with 16GB of memory (transparent hugepage is turned
off by default, and movablecore=12G is configured) The following is a
comparative test data of watermark_min

		no patch	add patch
ZONE_DMA	1		8
ZONE_DMA32	151		709
ZONE_NORMAL	233		1113
ZONE_MOVABLE	1434		128
min_free_kbytes	7288		7326

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: liuq <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 416ef04 linux)
Signed-off-by: Jamie Nguyen <[email protected]>
Signed-off-by: Jan Vesely <[email protected]>
@jamieNguyenNVIDIA
Copy link
Contributor Author

jamieNguyenNVIDIA commented Mar 1, 2024

Internal bug: 4452840

Test case:

  • Download and extract tests-Linux-aarch64.tar from //builds/linuxbuilds/release/display/aarch64/<GPU_DRIVER_VER>/tests-Linux-aarch64.tar
  • Load UVM with tests enabled:
sudo nvidia-smi -pm 0
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm uvm_enable_builtin_tests=1 uvm_perf_access_counter_mimc_migration_enable=1
sudo nvidia-smi -pm 1
  • Run ./tests-Linux-aarch64/uvm/uvm_test/uvm_test -t oversubscription_multiprocess

It will fail with the following output without this change:

$ ./uvm_test -t oversubscription_multiprocess
^^^^ RUNNING oversubscription_multiprocess
pid: 200601; started at: Fri Jan 12 02:44:53 UTC 2024
repro cmd: oversubscription --printKmsg -n -t multiprocess --seed 2863276720
(thread 281465001818176 [t0]) At oversubscription.cu:671:
   Value of : cudaMalloc(&dummy, mapGpuEvictAllocSize[gpu.cudaId])
   Which is : 2 (cudaErrorMemoryAllocation)
   Should be: cudaSuccess
   Which is : 0 (cudaSuccess) free_mem = 9557 MB
Test state is:
 Thread 281465001818176 [t0]:
   oversubscription.cu:668: alloc_size = 5768 MB
   oversubscription.cu:657: gpu = { GH100GL-B  GPU-77083a13-fb9f-2097-eac0-81ee8034773d CUDA ID  0 }
   oversubscription.cu:588: child = 0
Waiting 100ms for CUDA API calls in other threads to finish.
Failed to run cleanup action: wait for CUDA API calls.
Forcing termination due to failed cleanup actions.
(thread 281465001818176 [t0]) At utils.cpp:1310:
   expected !spawnedChildDied spawned child 200602 died unexpectedly

It should pass with this patch applied.

@jamieNguyenNVIDIA
Copy link
Contributor Author

This was broken in 6.3+ kernels that contain the following patch:

[1ebbb21811b76c3b932959787f37985af36f62fa] mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves

nvidia-bfigg pushed a commit that referenced this pull request Apr 24, 2024
BugLink: https://bugs.launchpad.net/bugs/2059068

commit f546c42 upstream.

[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:

- "unable to find chunk map" errors
  BTRFS info (device vdb): scrub: started on devid 1
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056

  This would lead to unrepariable errors.

- Use-after-free KASAN reports:
  ==================================================================
  BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
  Read of size 8 at addr ffff8881013c9040 by task btrfs/909
  CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x43/0x60
   print_report+0xcf/0x640
   kasan_report+0xa6/0xd0
   __blk_rq_map_sg+0x18f/0x7c0
   virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   blk_mq_flush_plug_list.part.0+0x780/0x860
   __blk_flush_plug+0x1ba/0x220
   blk_finish_plug+0x3b/0x60
   submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   __x64_sys_ioctl+0xbd/0x100
   do_syscall_64+0x5d/0xe0
   entry_SYSCALL_64_after_hwframe+0x63/0x6b
  RIP: 0033:0x7f47e5e0952b

- Crash, mostly due to above use-after-free

[CAUSE]
The converted fs has the following data chunk layout:

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
        length 86016 owner 2 stripe_len 65536 type DATA|single

For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).

This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.

However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.

[FIX]
- Make sure scrub_read_endio() only updates bits in its range
  Since we may read less than 64K at the end of the chunk, we should not
  touch the bits beyond chunk boundary.

- Make sure scrub_submit_initial_read() only to read the chunk range
  This is done by calculating the real number of sectors we need to
  read, and add sector-by-sector to the bio.

Thankfully the scrub read repair path won't need extra fixes:

- scrub_stripe_submit_repair_read()
  With above fixes, we won't update error bit for range beyond chunk,
  thus scrub_stripe_submit_repair_read() should never submit any read
  beyond the chunk.

Reported-by: Rongrong <[email protected]>
Fixes: e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
[ Use min_t() to fix a compiling error due to difference types ]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Portia Stephens <[email protected]>
Signed-off-by: Stefan Bader <[email protected]>
nvidia-bfigg pushed a commit that referenced this pull request May 25, 2024
BugLink: https://bugs.launchpad.net/bugs/2059068

commit f546c42 upstream.

[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:

- "unable to find chunk map" errors
  BTRFS info (device vdb): scrub: started on devid 1
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056

  This would lead to unrepariable errors.

- Use-after-free KASAN reports:
  ==================================================================
  BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
  Read of size 8 at addr ffff8881013c9040 by task btrfs/909
  CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x43/0x60
   print_report+0xcf/0x640
   kasan_report+0xa6/0xd0
   __blk_rq_map_sg+0x18f/0x7c0
   virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   blk_mq_flush_plug_list.part.0+0x780/0x860
   __blk_flush_plug+0x1ba/0x220
   blk_finish_plug+0x3b/0x60
   submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   __x64_sys_ioctl+0xbd/0x100
   do_syscall_64+0x5d/0xe0
   entry_SYSCALL_64_after_hwframe+0x63/0x6b
  RIP: 0033:0x7f47e5e0952b

- Crash, mostly due to above use-after-free

[CAUSE]
The converted fs has the following data chunk layout:

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
        length 86016 owner 2 stripe_len 65536 type DATA|single

For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).

This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.

However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.

[FIX]
- Make sure scrub_read_endio() only updates bits in its range
  Since we may read less than 64K at the end of the chunk, we should not
  touch the bits beyond chunk boundary.

- Make sure scrub_submit_initial_read() only to read the chunk range
  This is done by calculating the real number of sectors we need to
  read, and add sector-by-sector to the bio.

Thankfully the scrub read repair path won't need extra fixes:

- scrub_stripe_submit_repair_read()
  With above fixes, we won't update error bit for range beyond chunk,
  thus scrub_stripe_submit_repair_read() should never submit any read
  beyond the chunk.

Reported-by: Rongrong <[email protected]>
Fixes: e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
[ Use min_t() to fix a compiling error due to difference types ]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Portia Stephens <[email protected]>
Signed-off-by: Stefan Bader <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants