Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v6.10-rc4-scx1 #31

Merged
merged 650 commits into from
Jun 21, 2024
Merged

v6.10-rc4-scx1 #31

merged 650 commits into from
Jun 21, 2024

Conversation

Byte-Lab
Copy link
Contributor

Sorry, the last PR was merging into the wrong branch. Fixed.

q2ven and others added 30 commits June 6, 2024 12:57
net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be
changed concurrently.

Let's use READ_ONCE() in unix_create1().

Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Once sk->sk_state is changed to TCP_LISTEN, it never changes.

unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.

It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().

Thus, we need to use unix_recvq_full_lockless() to avoid data-race.

Now we remove unix_recvq_full() as no one uses it.

Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:

  (1) For SOCK_DGRAM, it is a written-once field in unix_create1()

  (2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
      listener's unix_state_lock() in unix_listen(), and we hold
      the lock in unix_stream_connect()

Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().

However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock.  Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().

Thues, unix_state_lock() does not protect qlen.

Let's use skb_queue_empty_lockless() in unix_release_sock().

Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
We can dump the socket queue length via UNIX_DIAG by specifying
UDIAG_SHOW_RQLEN.

If sk->sk_state is TCP_LISTEN, we return the recv queue length,
but here we do not hold recvq lock.

Let's use skb_queue_len_lockless() in sk_diag_show_rqlen().

Fixes: c9da99e ("unix_diag: Fixup RQLEN extension report")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock().

Let's use READ_ONCE() to read sk->sk_shutdown.

Fixes: e4e541a ("sock-diag: Report shutdown for inet and unix sockets (v2)")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
…ields'

Kuniyuki Iwashima says:

====================
af_unix: Fix lockless access of sk->sk_state and others fields.

The patch 1 fixes a bug where SOCK_DGRAM's sk->sk_state is changed
to TCP_CLOSE even if the socket is connect()ed to another socket.

The rest of this series annotates lockless accesses to the following
fields.

  * sk->sk_state
  * sk->sk_sndbuf
  * net->unx.sysctl_max_dgram_qlen
  * sk->sk_receive_queue.qlen
  * sk->sk_shutdown

Note that with this series there is skb_queue_empty() left in
unix_dgram_disconnected() that needs to be changed to lockless
version, and unix_peer(other) access there should be protected
by unix_state_lock().

This will require some refactoring, so another series will follow.

Changes:
  v2:
    * Patch 1: Fix wrong double lock

  v1: https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
…ange

This is a leftover from commit ce1fc93 ("kconfig: do not clear
SYMBOL_DEF_USER when the value is out of range").

This code is now redundant because if a user-supplied value is out
of range, the value adjusted by sym_validate_range() differs, and
conf_unsaved has already been incremented a few lines above.

Signed-off-by: Masahiro Yamada <[email protected]>
Currently, the initial state of the "Save" button is always active.

If none of the CONFIG options are changed while loading the .config
file, the "Save" button should be greyed out.

This can be fixed by calling conf_read() after widget initialization.

Signed-off-by: Masahiro Yamada <[email protected]>
This sentence does not make sense due to a typo. Fix it.

Fixes: def2fbf ("kconfig: allow symbols implied by y to become m")
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Documentation/kbuild/kconfig-language.rst explains the behavior of
'select' as follows:

  reverse dependencies can be used to force a lower limit of
  another symbol. The value of the current menu symbol is used as the
  minimal value <symbol> can be set to.

This is not true when the 'select' property is followed by 'if'.

[Test Code]

    config MODULES
            def_bool y
            modules

    config A
            def_tristate y
            select C if B

    config B
            def_tristate m

    config C
            tristate

[Result]

    CONFIG_MODULES=y
    CONFIG_A=y
    CONFIG_B=m
    CONFIG_C=m

If "the value of A is used as the minimal value C can be set to",
C must be 'y'.

The actual behavior is "C is selected by (A && B)". The lower limit of
C is downgraded due to B being 'm'.

This behavior is kind of weird, and this has arisen several times in
the mailing list.

I do not know whether it is a bug or intended behavior. Anyway, it is
not feasible to change it now because many Kconfig files are written
based on this behavior. The same applies to 'imply'.

Document this (but reserve the possibility for a future change).

Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
syzbot found a race in __fib6_drop_pcpu_from() [1]

If compiler reads more than once (*ppcpu_rt),
second read could read NULL, if another cpu clears
the value in rt6_get_pcpu_route().

Add a READ_ONCE() to prevent this race.

Also add rcu_read_lock()/rcu_read_unlock() because
we rely on RCU protection while dereferencing pcpu_rt.

[1]

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097]
CPU: 0 PID: 7543 Comm: kworker/u8:17 Not tainted 6.10.0-rc1-syzkaller-00013-g2bfcfd584ff5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Workqueue: netns cleanup_net
 RIP: 0010:__fib6_drop_pcpu_from.part.0+0x10a/0x370 net/ipv6/ip6_fib.c:984
Code: f8 48 c1 e8 03 80 3c 28 00 0f 85 16 02 00 00 4d 8b 3f 4d 85 ff 74 31 e8 74 a7 fa f7 49 8d bf 90 00 00 00 48 89 f8 48 c1 e8 03 <80> 3c 28 00 0f 85 1e 02 00 00 49 8b 87 90 00 00 00 48 8b 0c 24 48
RSP: 0018:ffffc900040df070 EFLAGS: 00010206
RAX: 0000000000000012 RBX: 0000000000000001 RCX: ffffffff89932e16
RDX: ffff888049dd1e00 RSI: ffffffff89932d7c RDI: 0000000000000091
RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000006 R12: ffff88807fa080b8
R13: fffffbfff1a9a07d R14: ffffed100ff41022 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8880b9200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32c26000 CR3: 000000005d56e000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  __fib6_drop_pcpu_from net/ipv6/ip6_fib.c:966 [inline]
  fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1027 [inline]
  fib6_purge_rt+0x7f2/0x9f0 net/ipv6/ip6_fib.c:1038
  fib6_del_route net/ipv6/ip6_fib.c:1998 [inline]
  fib6_del+0xa70/0x17b0 net/ipv6/ip6_fib.c:2043
  fib6_clean_node+0x426/0x5b0 net/ipv6/ip6_fib.c:2205
  fib6_walk_continue+0x44f/0x8d0 net/ipv6/ip6_fib.c:2127
  fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2175
  fib6_clean_tree+0xd7/0x120 net/ipv6/ip6_fib.c:2255
  __fib6_clean_all+0x100/0x2d0 net/ipv6/ip6_fib.c:2271
  rt6_sync_down_dev net/ipv6/route.c:4906 [inline]
  rt6_disable_ip+0x7ed/0xa00 net/ipv6/route.c:4911
  addrconf_ifdown.isra.0+0x117/0x1b40 net/ipv6/addrconf.c:3855
  addrconf_notify+0x223/0x19e0 net/ipv6/addrconf.c:3778
  notifier_call_chain+0xb9/0x410 kernel/notifier.c:93
  call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:1992
  call_netdevice_notifiers_extack net/core/dev.c:2030 [inline]
  call_netdevice_notifiers net/core/dev.c:2044 [inline]
  dev_close_many+0x333/0x6a0 net/core/dev.c:1585
  unregister_netdevice_many_notify+0x46d/0x19f0 net/core/dev.c:11193
  unregister_netdevice_many net/core/dev.c:11276 [inline]
  default_device_exit_batch+0x85b/0xae0 net/core/dev.c:11759
  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
  cleanup_net+0x5b7/0xbf0 net/core/net_namespace.c:640
  process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231
  process_scheduled_works kernel/workqueue.c:3312 [inline]
  worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393
  kthread+0x2c1/0x3a0 kernel/kthread.c:389
  ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

Fixes: d52d399 ("ipv6: Create percpu rt6_info")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
expr_trans_bool() performs an incorrect transformation.

[Test Code]

    config MODULES
            def_bool y
            modules

    config A
            def_bool y
            select C if B != n

    config B
            def_tristate m

    config C
            tristate

[Result]

    CONFIG_MODULES=y
    CONFIG_A=y
    CONFIG_B=m
    CONFIG_C=m

This output is incorrect because CONFIG_C=y is expected.

Documentation/kbuild/kconfig-language.rst clearly explains the function
of the '!=' operator:

    If the values of both symbols are equal, it returns 'n',
    otherwise 'y'.

Therefore, the statement:

    select C if B != n

should be equivalent to:

    select C if y

Or, more simply:

    select C

Hence, the symbol C should be selected by the value of A, which is 'y'.

However, expr_trans_bool() wrongly transforms it to:

    select C if B

Therefore, the symbol C is selected by (A && B), which is 'm'.

The comment block of expr_trans_bool() correctly explains its intention:

  * bool FOO!=n => FOO
    ^^^^

If FOO is bool, FOO!=n can be simplified into FOO. This is correct.

However, the actual code performs this transformation when FOO is
tristate:

    if (e->left.sym->type == S_TRISTATE) {
                             ^^^^^^^^^^

While it can be fixed to S_BOOLEAN, there is no point in doing so
because expr_tranform() already transforms FOO!=n to FOO when FOO is
bool. (see the "case E_UNEQUAL" part)

expr_trans_bool() is wrong and unnecessary.

Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Clang static checker (scan-build) warning:
net/ethtool/ioctl.c:line 2233, column 2
Called function pointer is null (null dereference).

Return '-EOPNOTSUPP' when 'ops->get_ethtool_phy_stats' is NULL to fix
this typo error.

Fixes: 201ed31 ("net/ethtool/ioctl: split ethtool_get_phy_stats into multiple helpers")
Signed-off-by: Su Hui <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Hariprasad Kelam <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
The pata_macio driver advertises a max_segment_size of 0xff00, because
the hardware doesn't cope with requests >= 64K.

However the SCSI core requires max_segment_size to be at least
PAGE_SIZE, which is a problem for pata_macio when the kernel is built
with 64K pages.

In older kernels the SCSI core would just increase the segment size to
be equal to PAGE_SIZE, however since the commit tagged below it causes a
warning and the device fails to probe:

  WARNING: CPU: 0 PID: 26 at block/blk-settings.c:202 .blk_validate_limits+0x2f8/0x35c
  CPU: 0 PID: 26 Comm: kworker/u4:1 Not tainted 6.10.0-rc1 #1
  Hardware name: PowerMac7,2 PPC970 0x390202 PowerMac
  ...
  NIP .blk_validate_limits+0x2f8/0x35c
  LR  .blk_alloc_queue+0xc0/0x2f8
  Call Trace:
    .blk_alloc_queue+0xc0/0x2f8
    .blk_mq_alloc_queue+0x60/0xf8
    .scsi_alloc_sdev+0x208/0x3c0
    .scsi_probe_and_add_lun+0x314/0x52c
    .__scsi_add_device+0x170/0x1a4
    .ata_scsi_scan_host+0x2bc/0x3e4
    .async_port_probe+0x6c/0xa0
    .async_run_entry_fn+0x60/0x1bc
    .process_one_work+0x228/0x510
    .worker_thread+0x360/0x530
    .kthread+0x134/0x13c
    .start_kernel_thread+0x10/0x14
  ...
  scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

Although the hardware can't cope with a 64K segment, the driver
already deals with that internally by splitting large requests in
pata_macio_qc_prep(). That is how the driver has managed to function
until now on 64K kernels.

So fix the driver to advertise a max_segment_size of 64K, which avoids
the warning and keeps the SCSI core happy.

Fixes: afd53a3 ("scsi: core: Initialize scsi midlayer limits before allocating the queue")
Reported-by: Guenter Roeck <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Doru Iorgulescu <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218858
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: John Garry <[email protected]>
Signed-off-by: Niklas Cassel <[email protected]>
If errexit is enabled ('set -e'), loopy_wait -- or busywait and others
using it -- will stop after the first failure.

Note that if the returned status of loopy_wait is checked, and even if
errexit is enabled, Bash will not stop at the first error.

Fixes: 25ae948 ("selftests/net: add lib.sh")
Cc: [email protected]
Acked-by: Geliang Tang <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-1-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
If there is an error to create the first netns with 'setup_ns()',
'cleanup_ns()' will be called with an empty string as first parameter.

The consequences is that 'cleanup_ns()' will try to delete an invalid
netns, and wait 20 seconds if the netns list is empty.

Instead of just checking if the name is not empty, convert the string
separated by spaces to an array. Manipulating the array is cleaner, and
calling 'cleanup_ns()' with an empty array will be a no-op.

Fixes: 25ae948 ("selftests/net: add lib.sh")
Cc: [email protected]
Acked-by: Geliang Tang <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-2-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
Without this, the 'i' variable declared before could be overridden by
accident, e.g.

  for i in "${@}"; do
      __ksft_status_merge "${i}"  ## 'i' has been modified
      foo "${i}"                  ## using 'i' with an unexpected value
  done

After a quick look, it looks like 'i' is currently not used after having
been modified in __ksft_status_merge(), but still, better be safe than
sorry. I saw this while modifying the same file, not because I suspected
an issue somewhere.

Fixes: 596c881 ("selftests: forwarding: Have RET track kselftest framework constants")
Acked-by: Geliang Tang <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-3-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
Matthieu Baerts says:

====================
selftests: net: lib: small fixes

While looking at using 'lib.sh' for the MPTCP selftests [1], we found
some small issues with 'lib.sh'. Here they are:

- Patch 1: fix 'errexit' (set -e) support with busywait. 'errexit' is
  supported in some functions, not all. A fix for v6.8+.

- Patch 2: avoid confusing error messages linked to the cleaning part
  when the netns setup fails. A fix for v6.8+.

- Patch 3: set a variable as local to avoid accidentally changing the
  value of a another one with the same name on the caller side. A fix
  for v6.10-rc1+.

Link: https://lore.kernel.org/mptcp/[email protected]/T/ [1]
====================

Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-0-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
…ux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - Ensure that .discard sections are really discarded in the EFI zboot
   image build

 - Return proper error numbers from efi-pstore

 - Add __nocfi annotations to EFI runtime wrappers

* tag 'efi-fixes-for-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Add missing __nocfi annotations to runtime wrappers
  efi: pstore: Return proper errors on UEFI failures
  efi/libstub: zboot.lds: Discard .discard sections
Pull tomoyo fixlet from Tetsuo Handa:
 "Single patch to update project links, no behavior changes"

* tag 'tomoyo-pr-20240606' of git://git.code.sf.net/p/tomoyo/tomoyo:
  tomoyo: update project links
…/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF and big collection of fixes for WiFi core and
  drivers.

  Current release - regressions:

   - vxlan: fix regression when dropping packets due to invalid src
     addresses

   - bpf: fix a potential use-after-free in bpf_link_free()

   - xdp: revert support for redirect to any xsk socket bound to the
     same UMEM as it can result in a corruption

   - virtio_net:
      - add missing lock protection when reading return code from
        control_buf
      - fix false-positive lockdep splat in DIM
      - Revert "wifi: wilc1000: convert list management to RCU"

   - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config

  Previous releases - regressions:

   - rtnetlink: make the "split" NLM_DONE handling generic, restore the
     old behavior for two cases where we started coalescing those
     messages with normal messages, breaking sloppily-coded userspace

   - wifi:
      - cfg80211: validate HE operation element parsing
      - cfg80211: fix 6 GHz scan request building
      - mt76: mt7615: add missing chanctx ops
      - ath11k: move power type check to ASSOC stage, fix connecting to
        6 GHz AP
      - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
      - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
      - iwlwifi: mvm: fix a crash on 7265

  Previous releases - always broken:

   - ncsi: prevent multi-threaded channel probing, a spec violation

   - vmxnet3: disable rx data ring on dma allocation failure

   - ethtool: init tsinfo stats if requested, prevent unintentionally
     reporting all-zero stats on devices which don't implement any

   - dst_cache: fix possible races in less common IPv6 features

   - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED

   - ax25: fix two refcounting bugs

   - eth: ionic: fix kernel panic in XDP_TX action

  Misc:

   - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"

* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
  selftests: net: lib: set 'i' as local
  selftests: net: lib: avoid error removing empty netns name
  selftests: net: lib: support errexit with busywait
  net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
  ipv6: fix possible race in __fib6_drop_pcpu_from()
  af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
  af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
  af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
  af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
  af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
  af_unix: Annotate data-races around sk->sk_sndbuf.
  af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
  af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
  af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
  af_unix: Annotate data-race of sk->sk_state in unix_accept().
  af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
  af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
  af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
  af_unix: Annodate data-races around sk->sk_state for writers.
  af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
  ...
memblock_set_node() warns about using MAX_NUMNODES, see

  e0eec24 ("memblock: make memblock_set_node() also warn about use of MAX_NUMNODES")

for details.

Reported-by: Narasimhan V <[email protected]>
Signed-off-by: Jan Beulich <[email protected]>
Cc: [email protected]
[bp: commit message]
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:

  BUG: Bad page state in process kswapd0  pfn:d6e840
  page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
  pfn:0xd6e840
  aops:btree_aops ino:1
  flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
  page_type: 0xffffffff()
  raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
  raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: non-NULL mapping

[CAUSE]
Commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.

Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).

This can lead to the following race:

Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).

           Thread A                |                 Thread B
-----------------------------------+-------------------------------------
                                   | btree_release_folio()
				   | | This is for the page at X + 4K,
				   | | Not page X.
				   | |
alloc_extent_buffer()              | |- release_extent_buffer()
|- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
|  page at bytenr X (the first     | |  |
|  page).                          | |  |
|  Which returned -EEXIST.         | |  |
|                                  | |  |
|- filemap_lock_folio()            | |  |
|  Returned the first page locked. | |  |
|                                  | |  |
|- grab_extent_buffer()            | |  |
|  |- atomic_inc_not_zero()        | |  |
|  |  Returned false               | |  |
|  |- folio_detach_private()       | |  |- folio_detach_private() for X
|     |- folio_test_private()      | |     |- folio_test_private()
      |  Returned true             | |     |  Returned true
      |- folio_put()               |       |- folio_put()

Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.

The condition is not that easy to hit:

- The release must be triggered for the middle page of an eb
  If the release is on the same first page of an eb, page lock would kick
  in and prevent the race.

- folio_detach_private() has a very small race window
  It's only between folio_test_private() and folio_clear_private().

That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.

At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.

[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.

Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.

To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):

  #!/bin/sh
  drop_caches () {
	  while(true); do
		  echo 3 > /proc/sys/vm/drop_caches
		  echo 1 > /proc/sys/vm/compact_memory
	  done
  }

  run_tar () {
	  while(true); do
		  for x in `seq 1 80` ; do
			  tar cf /dev/zero /mnt > /dev/null &
		  done
		  wait
	  done
  }

  mkfs.btrfs -f -d single -m single /dev/vda
  mount -o noatime /dev/vda /mnt
  # create 200,000 files, 1K each
  ./simoop -n 200000 -E -f 1k /mnt
  drop_caches &
  (run_tar)

Reported-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <[email protected]>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reported-by: [email protected]
Fixes: 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: [email protected] # 6.8+
CC: Chris Mason <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Because I left Intel, I'm removing myself from the list
of Xe driver maintainers.

Signed-off-by: Oded Gabbay <[email protected]>
Acked-by: Lucas De Marchi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Lucas De Marchi <[email protected]>
(cherry picked from commit 8de6625)
Signed-off-by: Lucas De Marchi <[email protected]>
Add Rodrigo Vivi as an Xe driver maintainer.

v2:
- Cc also Lucas De Marchi (Rodrigo vivi)
- Remove a blank line in commit the commit message (Lucas De Marchi)

Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Lucas De Marchi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Thomas Hellström <[email protected]>
Acked-by: Rodrigo Vivi <[email protected]>
Acked-by: Lucas De Marchi <[email protected]>
Acked-by: Jani Nikula <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Lucas De Marchi <[email protected]>
…ernel/git/pci/pci

Pull pci fix from Bjorn Helgaas:

 - Revert lockdep checking on locking that protects device resets from
   user-space config accesses; it exposed issues for which fixes are in
   the works but are too risky for this cycle (Dan Williams)

* tag 'pci-v6.10-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI: Revert the cfg_access_lock lockdep mechanism
…it/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "The core change is to detect unusually large number of VPD pages
  (caused by device manufacturers having an endiannes issue) and reject
  them rather than trying to parse a huge non-existent array.

  The remaining fixes are in drivers the most user visible of which is
  the ALUA state transition recognition (leads to intermittent I/O
  errors in some situations otherwise)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: mcq: Fix error output and clean up ufshcd_mcq_abort()
  scsi: core: Handle devices which return an unusually large VPD page count
  scsi: mpt3sas: Add missing kerneldoc parameter descriptions
  scsi: qedf: Set qed_slowpath_params to zero before use
  scsi: qedf: Wait for stag work during unload
  scsi: qedf: Don't process stag work during unload and recovery
  scsi: sr: Fix unintentional arithmetic wraparound
  scsi: core: alua: I/O errors for ALUA state transitions
  scsi: mpi3mr: Use proper format specifier in mpi3mr_sas_port_add()
…op.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.10-2024-06-06:

amdgpu:
- Fix shutdown issues on some SMU 13.x platforms
- Silence some UBSAN flexible array warnings

Signed-off-by: Dave Airlie <[email protected]>

From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
…rg/drm/misc/kernel into drm-fixes

drm-misc-fixes for v6.10-rc3:
- Robustness fixes for vmwgfx.
- Error check for of_drm_get_panel_orientation failing in
  sitronix-st7789v.

Signed-off-by: Dave Airlie <[email protected]>
From: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
…top.org/drm/misc/kernel into drm-fixes

drm-misc-next-fixes for v6.10-rc3:
- Single unused struct removal that should have been in -fixes.

Signed-off-by: Dave Airlie <[email protected]>
From: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
torvalds and others added 26 commits June 15, 2024 12:00
Pull smb server fixes from Steve French:
 "Two small smb3 server fixes:

   - set xatttr fix

   - pathname parsing check fix"

* tag '6.10-rc3-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix missing use of get_write in in smb2_set_ea()
  ksmbd: move leading slash check to smb2_get_name()
…fs-linux

Pull xfs fix from Chandan Babu:
 "Ensure xfs incore superblock's allocated inode counter, free inode
  counter, and free data block counter are all zero or positive when
  they are copied over from xfs_mount->m_[icount,ifree,fdblocks]
  respectively"

* tag 'xfs-6.10-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: make sure sb_fdblocks is non-negative
… translation

The currently used normalized address format is not applicable to all
MI300 systems. This leads to incorrect results during address
translation.

Drop the fixed layout and construct the normalized address from system
settings.

Fixes: 87a6123 ("RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support")
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
…inux/kernel/git/andi.shyti/linux into i2c/for-current

Two fixes from Jean aim to correctly report i2c functionality,
specifically ensuring that I2C_FUNC_SLAVE is reported when a
device operates solely as a slave interface.
… trigger the default trigger"

Commit 66601a2 ("leds: class: If no default trigger is given, make
hw_control trigger the default trigger") causes ledtrig-netdev to get
set as default trigger on various network LEDs.

This causes users to hit a pre-existing AB-BA deadlock issue in
ledtrig-netdev between the LED-trigger locks and the rtnl mutex,
resulting in hung tasks in kernels >= 6.9.

Solving the deadlock is non trivial, so for now revert the change to
set the hw_control trigger as default trigger, so that ledtrig-netdev
no longer gets activated automatically for various network LEDs.

The netdev trigger is not needed because the network LEDs are usually under
hw-control and the netdev trigger tries to leave things that way so setting
it as the active trigger for the LED class device is a no-op.

Fixes: 66601a2 ("leds: class: If no default trigger is given, make hw_control trigger the default trigger")
Reported-by: Genes Lists <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Johannes Wüller <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Cc: [email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Acked-by: Lee Jones <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
…inux/kernel/git/ieee1394/linux1394

Pull firewire fixes from Takashi Sakamoto:

 - Update tracepoints events introduced in v6.10-rc1 so that it includes
   the numeric identifier of host card in which the event happens

 - replace wiki URL with the current website URL in Kconfig

* tag 'firewire-fixes-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: record card index in bus_reset_handle tracepoints event
  firewire: core: record card index in tracepoinrts events derived from bus_reset_arrange_template
  firewire: core: record card index in async_phy_inbound tracepoints event
  firewire: core: record card index in async_phy_outbound_complete tracepoints event
  firewire: core: record card index in async_phy_outbound_initiate tracepoints event
  firewire: core: record card index in tracepoinrts events derived from async_inbound_template
  firewire: core: record card index in tracepoinrts events derived from async_outbound_initiate_template
  firewire: core: record card index in tracepoinrts events derived from async_outbound_complete_template
  firewire: fix website URL in Kconfig
…/linux/kernel/git/ras/ras

Pull EDAC fixes from Borislav Petkov:

 - Fix two issues with MI300 address translation logic

* tag 'edac_urgent_for_v6.10_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  RAS/AMD/ATL: Use system settings for MI300 DRAM to normalized address translation
  RAS/AMD/ATL: Fix MI300 bank hash
…/git/libata/linux

Pull ata fix from Niklas Cassel:
 "Fix a bug where the SCSI Removable Media Bit (RMB) was incorrectly set
  for hot-plug capable (and eSATA) ports.

  The RMB bit means that the media is removable (e.g. floppy or CD-ROM),
  not that the device server is removable. If the RMB bit is set, SCSI
  will set the removable media sysfs attribute.

  If the removable media sysfs attribute is set on a device,
  GNOME/udisks will automatically mount the device on boot.

  We only want to set the SCSI RMB bit (and thus the removable media
  sysfs attribute) for devices where the ATA removable media device bit
  is set"

* tag 'ata-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-scsi: Set the RMB bit only for removable media devices
…kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are a number of small char/misc and iio driver fixes for
  6.10-rc4. Included in here are the following:

   - iio driver fixes for a bunch of reported problems.

   - mei driver fixes for a number of reported issues.

   - amiga parport driver build fix.

   - .editorconfig fix that was causing lots of unintended whitespace
     changes to happen to files when they were being edited. Unless we
     want to sweep the whole tree and remove all trailing whitespace at
     once, this is needed for the .editorconfig file to be able to be
     used at all. This change is required because the original
     submitters never touched older files in the tree.

   - jfs bugfix for a buffer overflow

  The jfs bugfix is in here as I didn't know where else to put it, and
  it's been ignored for a while as the filesystem seems to be abandoned
  and I'm tired of seeing the same issue reported in multiple places.

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (25 commits)
  .editorconfig: remove trim_trailing_whitespace option
  jfs: xattr: fix buffer overflow for invalid xattr
  misc: microchip: pci1xxxx: Fix a memory leak in the error handling of gp_aux_bus_probe()
  misc: microchip: pci1xxxx: fix double free in the error handling of gp_aux_bus_probe()
  parport: amiga: Mark driver struct with __refdata to prevent section mismatch
  mei: vsc: Fix wrong invocation of ACPI SID method
  mei: vsc: Don't stop/restart mei device during system suspend/resume
  mei: me: release irq in mei_me_pci_resume error path
  mei: demote client disconnect warning on suspend to debug
  iio: inkern: fix channel read regression
  iio: imu: inv_mpu6050: stabilized timestamping in interrupt
  iio: adc: ad7173: Fix sampling frequency setting
  iio: adc: ad7173: Clear append status bit
  iio: imu: inv_icm42600: delete unneeded update watermark call
  iio: imu: inv_icm42600: stabilized timestamp in interrupt
  iio: invensense: fix odr switching to same value
  iio: adc: ad7173: Remove index from temp channel
  iio: adc: ad7173: Add ad7173_device_info names
  iio: adc: ad7173: fix buffers enablement for ad7176-2
  iio: temperature: mlx90635: Fix ERR_PTR dereference in mlx90635_probe()
  ...
…x/kernel/git/gregkh/driver-core

Pull driver core and sysfs fixes from Greg KH:
 "Here are three small changes for 6.10-rc4 that resolve reported
  problems, and finally drop an unused api call. These are:

   - removal of devm_device_add_groups(), all the callers of this are
     finally gone after the 6.10-rc1 merge (changes came in through
     different trees), so it's safe to remove.

   - much reported sysfs build error fixed up for systems that did not
     have sysfs enabled

   - driver core sync issue fix for a many reported issue over the years
     that no one really paid much attention to, until Dirk finally
     tracked down the real issue and made the "obviously correct and
     simple" fix for it.

  All of these have been in linux-next for over a week with no reported
  problems"

* tag 'driver-core-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  drivers: core: synchronize really_probe() and dev_uevent()
  sysfs: Unbreak the build around sysfs_bin_attr_simple_read()
  driver core: remove devm_device_add_groups()
…rnel/git/gregkh/staging

Pull staging driver fix from Greg KH:
 "Here is a single staging driver fix, for the vc04 driver. It resolves
  a reported problem that showed up in the merge window set of changes.

  It's been in linux-next for over a week with no reported problems"

* tag 'staging-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: vchiq_debugfs: Fix NPD in vchiq_dump_state
…/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
 "Here are some small tty and serial driver fixes that resolve som
  reported problems. Included in here are:

   - n_tty lookahead buffer bugfix

   - WARN_ON() removal where it was not needed

   - 8250_dw driver bugfixes

   - 8250_pxa bugfix

   - sc16is7xx Kconfig fixes for reported build issues

  All of these have been in linux-next for over a week with no reported
  problems"

* tag 'tty-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: drop debugging WARN_ON_ONCE() from uart_write()
  serial: sc16is7xx: re-add Kconfig SPI or I2C dependency
  serial: sc16is7xx: rename Kconfig CONFIG_SERIAL_SC16IS7XX_CORE
  serial: port: Don't block system suspend even if bytes are left to xmit
  serial: 8250_pxa: Configure tx_loadsz to match FIFO IRQ level
  serial: 8250_dw: Revert "Move definitions to the shared header"
  serial: 8250_dw: Don't use struct dw8250_data outside of 8250_dw
  tty: n_tty: Fix buffer offsets when lookahead is used
…/git/gregkh/usb

Pull USB / Thunderbolt fixes from Greg KH:
 "Here are some small USB and Thunderbolt driver fixes for 6.10-rc4.
  Included in here are:

   - thunderbolt debugfs bugfix

   - USB typec bugfixes

   - kcov usb bugfix

   - xhci bugfixes

   - usb-storage bugfix

   - dt-bindings bugfix

   - cdc-wdm log message spam bugfix

  All of these, except for the last cdc-wdm log level change, have been
  in linux-next for a while with no reported problems. The cdc-wdm
  bugfix has been tested by syzbot and proved to fix the reported cpu
  lockup issues when the log is constantly spammed by a broken device"

* tag 'usb-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: class: cdc-wdm: Fix CPU lockup caused by excessive log messages
  xhci: Handle TD clearing for multiple streams case
  xhci: Apply broken streams quirk to Etron EJ188 xHCI host
  xhci: Apply reset resume quirk to Etron EJ188 xHCI host
  xhci: Set correct transferred length for cancelled bulk transfers
  usb-storage: alauda: Check whether the media is initialized
  usb: typec: ucsi: Ack also failed Get Error commands
  kcov, usb: disable interrupts in kcov_remote_start_usb_softirq
  dt-bindings: usb: realtek,rts5411: Add missing "additionalProperties" on child nodes
  usb: typec: tcpm: Ignore received Hard Reset in TOGGLING state
  usb: typec: tcpm: fix use-after-free case in tcpm_register_source_caps
  USB: xen-hcd: Traverse host/ when CONFIG_USB_XEN_HCD is selected
  usb: typec: ucsi: glink: increase max ports for x1e80100
  Revert "usb: chipidea: move ci_ulpi_init after the phy initialization"
  thunderbolt: debugfs: Fix margin debugfs node creation condition
…rnel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "Two fixes to correctly report i2c functionality, ensuring that
  I2C_FUNC_SLAVE is reported when a device operates solely as a slave
  interface"

* tag 'i2c-for-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: designware: Fix the functionality flags of the slave-only interface
  i2c: at91: Fix the functionality flags of the slave-only interface
…/kernel/git/deller/parisc-linux

Pull parisc fix from Helge Deller:
 "On parisc we have suffered since years from random segfaults which
  seem to have been triggered due to cache inconsistencies. Those
  segfaults happened more often on machines with PA8800 and PA8900 CPUs,
  which have much bigger caches than the earlier machines.

  Dave Anglin has worked over the last few weeks to fix this bug. His
  patch has been successfully tested by various people on various
  machines and with various kernels (6.6, 6.8 and 6.9), and the debian
  buildd servers haven't shown a single random segfault with this patch.

  Since the cache handling has been reworked, the patch is slightly
  bigger than I would like in this stage, but the greatly improved
  stability IMHO justifies the inclusion now"

* tag 'parisc-for-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Try to fix random segmentation faults in package builds
The @to field in ops.yield() can be NULL (when the callback is invoked)
on the yield_task_scx() path). Right now we're not communicating to the
verifier that the arg can be NULL, so if callers don't check for NULL it
can crash the host. Let's update ext.c to check. A subsequent patch will
add tests.

Signed-off-by: David Vernet <[email protected]>
(cherry picked from commit 3f4e6620ea04556b144710c195706dcbe79d6d43)
Now that we've told the verifier about the second argument to
ops.yield(), let's add a test to verify the behavior.

Signed-off-by: David Vernet <[email protected]>
(cherry picked from commit bc5f49cd6e37bfc0f2f826c332eeec458230452e)
…74b5c9b3)

(cherry picked from commit 389f282a8f1940d2a22cd5a6501254869af70e3e)
- Whitespace and comment adj.

- Use sched_class_above() instead of naked comparison.

- Reorder hotplug functions.

- s/promote_op_nth_arg()/set_arg_maybe_null()/. Relocate @arg_n after @op
  and remove promote_op_arg().

- Reorder declarations and dummy defs in kernel/sched/ext.h.

No functional changes.

(cherry picked from commit 159d0cf55f62ecd7f1dfb9c4e44ce442f13c78a6)
Quick fix of spelling in dispatch_to_local_dsq_lock comment.

Signed-off-by: Daniel Hodges <[email protected]>
(cherry picked from commit acb3ddb20fda0dd44ca3c22a54840ab0ac72e66f)
…556cd3cf67a1)

(cherry picked from commit e1f6f964f4017466a0757af96322d88ebe2a9a8c)
__scx_bpf_consume_task() should test whether the task is still on the DSQ
that's being iterated before erroring out if it's on a local DSQ. Otherwise,
another racing consumer can move the task to a local DSQ triggering a
spurious error. Relocate the error condition.

(cherry picked from commit e2c8568eb3075a187de5130fae6a9b3b664766a6)
__COMPAT_SCX_OPS_SWITCH_PARTIAL is no more. Use SCX_OPS_SWITCH_PARTIAL
instead.

(cherry picked from commit 9cd61b5803e426fe199b03124eb72dd147be414d)
Signed-off-by: David Vernet <[email protected]>
@Byte-Lab Byte-Lab requested a review from htejun June 21, 2024 19:24
@Byte-Lab Byte-Lab merged commit 55a1396 into scx-6.10rc.y Jun 21, 2024
1 check failed
@Byte-Lab Byte-Lab deleted the scx-6.10rc.4 branch June 21, 2024 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.