Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter() #16896

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ryao
Copy link
Contributor

@ryao ryao commented Dec 22, 2024

Motivation and Context

The other day, a friend complained about slow rm -r operations on a pool that had two mirrored spinning disks. He was unlinking a large number of files, so I suggested he do a parallel rm and what should have taken a few more minutes to complete ended in seconds:

find /path/to/subtree -name -type f | parallel -j250 rm --
rm -r /path/to/subtree

I have long known that our VFS operations are slower than other filesystems such as ext4, and I have long attributed that to our use of range locks for scaling versus the spin locks that ext4 uses. Since scalability is more important than single threaded performance, I made no attempt to change this. Today, I had a flash of inspiration after writing a comment about this in a hacker news discussion:

https://news.ycombinator.com/item?id=42486279

It occurred to me that the uncontended case could be made faster. My initial idea was to try to make an adaptive range lock that would minimize work done in uncontended cases to approach the performance of ext4's spin locks. However, upon looking at the range lock code, I realized that the amount of work being done in the uncontended case is already somewhat minimal while significant improvement in all cases would be possible by changing how we allocate memory for zfs_locked_range_t.

Instead of allocating memory for zfs_locked_range_t as part of zfs_rangelock_{try,}enter(), we can allocate it externally either on the stack or as part of another structure. This is possible with trivial refactoring in all cases except vdev_raidz_io_start(), where slightly more extensive refactoring is needed to allocate zfs_locked_range_t as part of raidz_map_t.

I recall seeing flame graphs from close to the start of ZFSOnLinux showing significant time spent in kmem_alloc() when doing range locks. Doing this refactoring eliminates that entirely. This should make our VFS and zvol operations slightly faster. Some RAID-Z operations will also become slightly faster.

Description

We do trivial refactoring to allocate zfs_locked_range_t either on the stack or as part of other structures in all but vdev_raidz_io_start(), where non-trival refactoring is needed.

How Has This Been Tested?

I have only done build tests so far. I am letting the buildbot do runtime testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if I saw it as a problem, but I can believe that allocations are not great. Though modern allocations are pretty well cached and I think we are talking about one allocation per operation here.

The patch looks OK to me, but my worry here is a stack consumption. I haven't checked on Linux, but on FreeBSD zfs_locked_range_t takes 96 bytes, which now will be on stack instead of 8 byte pointer. It would be nice to try to compact it, if possible, once you are already changing the API. For example, lr_rangelock seems to be used only in 2 places, and I guess we could just pass it directly to zfs_rangelock_exit() and zfs_rangelock_reduce() to save 8 bytes. May be lr_type could be reduced to uint8_t, which is not very clean, but could save another 8 byte.

@ryao
Copy link
Contributor Author

ryao commented Dec 22, 2024

I don't remember if I saw it as a problem, but I can believe that allocations are not great. Though modern allocations are pretty well cached and I think we are talking about one allocation per operation here.

It was never huge, but it was non-negligible as per my recollection. When I looked into why ext4 was so much faster at unlinking than we were in the past, I had blamed the range locks and dropped the matter since range locks were important for scalability. It occurs to me that was not an entirely accurate assessment, but it was the conclusion I made at the time.

The patch looks OK to me, but my worry here is a stack consumption.

The VFS and block layer on the zvol side have never been stack constrained as far as I know, so the only place where this is potentially a problem is in raidz_reflow_scratch_sync(), which is executed in txg_sync_thread. I am not aware of that one being stack constrained either (correct me if I am wrong), so we should be okay here.

How about we subject this to testing to see if this is a problem and if it passes, let it be? If it is a problem, then we can switch back to kmem_alloc() for raidz_reflow_scratch_sync().

I haven't checked on Linux, but on FreeBSD zfs_locked_range_t takes 96 bytes, which now will be on stack instead of 8 byte pointer. It would be nice to try to compact it, if possible, once you are already changing the API. For example, lr_rangelock seems to be used only in 2 places, and I guess we could just pass it directly to zfs_rangelock_exit() and zfs_rangelock_reduce() to save 8 bytes.

In multiple places in the code, the zfs_locked_range_t is passed to another thread and freed elsewhere, and we depend on ->lr_rangelock to find the original range lock since the pointer to it is gone. In order to do what you suggest, we will need to pass a pointer to the original range lock alongside the zfs_locked_range_t. A few places in the raidz code and {zfs,zvol,ztest}_get_done would need to change. It is cleaner to do it the current way, although I don't mind changing this if it results in a measurable improvement. However, I am not sure if this will result in any savings as you will see below.

May be lr_type could be reduced to uint8_t, which is not very clean, but could save another 8 byte.

It takes 208 bytes on my Linux machine:

$ pahole $(find . -name zfs.ko) -C zfs_locked_range
struct zfs_locked_range {
        zfs_rangelock_t *          lr_rangelock;         /*     0     8 */
        avl_node_t                 lr_node;              /*     8    24 */
        uint64_t                   lr_offset;            /*    32     8 */
        uint64_t                   lr_length;            /*    40     8 */
        uint_t                     lr_count;             /*    48     4 */
        zfs_rangelock_type_t       lr_type;              /*    52     4 */
        kcondvar_t                 lr_write_cv;          /*    56    72 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        kcondvar_t                 lr_read_cv;           /*   128    72 */
        /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
        uint8_t                    lr_proxy;             /*   200     1 */
        uint8_t                    lr_write_wanted;      /*   201     1 */
        uint8_t                    lr_read_wanted;       /*   202     1 */

        /* size: 208, cachelines: 4, members: 11 */
        /* padding: 5 */
        /* last cacheline: 16 bytes */
};

This is because the Linux SPL's kcondvar_t is huge:

$ pahole $(find . -name spl.ko) -C kcondvar_t
typedef struct {
        int                cv_magic;             /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        wait_queue_head_t  cv_event;             /*     8    24 */
        wait_queue_head_t  cv_destroy;           /*    32    24 */
        atomic_t           cv_refs;              /*    56     4 */
        atomic_t           cv_waiters;           /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        kmutex_t *         cv_mutex;             /*    64     8 */

        /* size: 72, cachelines: 2, members: 6 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* last cacheline: 8 bytes */
} kcondvar_t;

Interestingly, if I try to save some space by switching lr_type to uint8_t and move it to the end, nothing changes due to compiler added padding:

$ pahole $(find . -name zfs.ko) -C zfs_locked_range
struct zfs_locked_range {
        zfs_rangelock_t *          lr_rangelock;         /*     0     8 */
        avl_node_t                 lr_node;              /*     8    24 */
        uint64_t                   lr_offset;            /*    32     8 */
        uint64_t                   lr_length;            /*    40     8 */
        uint_t                     lr_count;             /*    48     4 */

        /* XXX 4 bytes hole, try to pack */

        kcondvar_t                 lr_write_cv;          /*    56    72 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        kcondvar_t                 lr_read_cv;           /*   128    72 */
        /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
        uint8_t                    lr_proxy;             /*   200     1 */
        uint8_t                    lr_write_wanted;      /*   201     1 */
        uint8_t                    lr_read_wanted;       /*   202     1 */
        uint8_t                    lr_type;              /*   203     1 */

        /* size: 208, cachelines: 4, members: 11 */
        /* sum members: 200, holes: 1, sum holes: 4 */
        /* padding: 4 */
        /* last cacheline: 16 bytes */
};

This yields no improvement. I wonder if FreeBSD sees any savings.

This also makes me wonder if removing ->lr_rangelock is just going to cause the hole to become bigger. If it does, the additional work to pass the pointer externally from zfs_locked_range_t in the RAID-Z code would paradoxically result in us using more memory.

@amotin
Copy link
Member

amotin commented Dec 22, 2024

It takes 208 bytes on my Linux machine

It makes me worry about stack usage even more. We don't know from what context VFS calls us, and if it is not directly from user-space, I worry there might be some unpleasant surprises.

This is because the Linux SPL's kcondvar_t is huge

Yea. On FreeBSD it takes only 16 bytes, that makes other optimizations more noticeable.

if I try to save some space by switching lr_type to uint8_t and move it to the end

Why do you move it to the end if there is 4-byte hole where it was before after lr_count?

@ryao
Copy link
Contributor Author

ryao commented Dec 22, 2024

It takes 208 bytes on my Linux machine

It makes me worry about stack usage even more. We don't know from what context VFS calls us, and if it is not directly from user-space, I worry there might be some unpleasant surprises.

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

I am not very concerned about this on Linux. In 2014, Linux switched from 8KB stacks where stack space had been painful for us to 16KB stacks. We still support a few of the kernels that have 8KB stacks, but stack space when called from the VFS was never a problem even in the 8KB stack days, so adding this to the stack should not create an issue.

if I try to save some space by switching lr_type to uint8_t and move it to the end

Why do you move it to the end if there is 4-byte hole where it was before after lr_count?

The enum is an int per the C standard and int is 4-bytes in our memory model. Changing the enum to a uint8_t would create a 3-byte hole, which I expected to be padded to ensure the following member had at least 4-byte alignment, so I moved it to the end where there was already padding. To my surprise, the compiler is padding to give at least 8-byte alignment, such that a 4-byte hole emerged from the move of lr_type to the end.

Even if I shrink lr_type to 1 byte from 4 bytes, the saved bytes just become padding, no matter where I put it after making it 1 byte. Things might be different on 32-bit systems, but I don't think many people use those anymore to make it worth doing this just for them. It is counter-intuitive, but there is no savings from shrinking lr_type.

@amotin
Copy link
Member

amotin commented Dec 23, 2024

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

May be. But even if so, I guess there might be some more complicate structures like nullfs and similar mounts, etc, increasing call depth.

To my surprise, the compiler is padding to give at least 8-byte alignment,

It is not a surprise that structure is padded to 8 bytes, since it includes 8 byte elements that must be aligned. That is why I thought about reusing the already existing hole after lr_count. Than structure would end just after lr_read_cv.

@ryao
Copy link
Contributor Author

ryao commented Dec 23, 2024

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

May be. But even if so, I guess there might be some more complicate structures like nullfs and similar mounts, etc, increasing call depth.

Linux bind mounts would be the equivalent of the FreeBSD nullfs driver. Those are handled by the Linux VFS if I recall. The main things that would call into the VFS would be the loop driver for mounting filesystems from files and FUSE. However, none of this ever been a problem even on the kernels with 8KB stacks. Linux actually had a push to move to 4KB stacks:

https://lwn.net/Articles/279229/

However, practicality won and in 2014, Linux went to 16KB stacks:

https://lwn.net/Articles/600644/

However the 4KB stack enablement work did plenty to minimize stack usage in these paths in upstream Linux. Linux also has a tool for seeing maximum stack utilization, although I have never had to use them as the stack space issues in the project were already fixed by @behlendorf before I started contributing.

To my surprise, the compiler is padding to give at least 8-byte alignment,

It is not a surprise that structure is padded to 8 bytes, since it includes 8 byte elements that must be aligned. That is why I thought about reusing the already existing hole after lr_count. Than structure would end just after lr_read_cv.

There is no existing hole after lr_count. That hole was made after I tried shrinking lr_type and moved it to the end. See the first pahole $(find . -name zfs.ko) -C zfs_locked_range output for what it currently is. The second one reflected your suggested change and made the hole. The size of both is the same, regardless of whether lr_type is left where it was or is moved to the end after shrinking it.

We might be able to save 8 bytes part of the time by removing ->lr_rangelock as you suggested, but it is a bunch of refactoring. I have done about half of the work needed to do that locally so far, but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here. The code is cleaner the current way.

@amotin
Copy link
Member

amotin commented Dec 23, 2024

but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here.

When compared to 208 bytes, may be not. But I am not buying that we are not stack limited here or anywhere in kernel. I'll let others comment.

…,}enter()

Typically, the memory allocated by kmem_alloc() can be trivially
allocated from either the stack or as part of another structure. The
only case where it cannot is in vdev_raidz_io_start(), although some
further refactoring should be able to eliminate that case too.
Allocating from the stack or as part of another data structure is faster
as it gives us this memory for free, so there is little reason not to do
it.

This eliminates a non-neligible amount of CPU time that I have seen in
flame graphs going back to the early days of OpenZFS when the tree was
the ZFSOnLinux tree. This should make our VFS and zvol operations
slightly faster. Some RAID-Z operations will also become slightly
faster.

Signed-off-by: Richard Yao <[email protected]>
@ryao
Copy link
Contributor Author

ryao commented Dec 23, 2024

but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here.

When compared to 208 bytes, may be not. But I am not buying that we are not stack limited here or anywhere in kernel. I'll let others comment.

We are stack limited in the ZIO threads, which this does not touch. I am not aware of stack limitations elsewhere. If it helps, I could set aside time to try measuring.

If we were to decide not to stack allocate things, this API change would still be beneficial since it coalesces allocations in a number of places, such as in the RAID-Z code and {zfs,zvol,ztest}_get_done().

That being said, there is something subtly wrong here. From qemu-x86 (freebsd13-3r):

Fatal trap 12: page fault while in kernel mode
  cpuid = 1; apic id = 01
  fault virtual address	= 0x22710010
  fault code		= supervisor read data, page not present
  instruction pointer	= 0x20:0xffffffff8280012a
  stack pointer	        = 0x28:0xfffffe009a2d6860
  frame pointer	        = 0x28:0xfffffe009a2d6870
  code segment		= base 0x0, limit 0xfffff, type 0x1b
  			= DPL 0, pres 1, long 1, def32 0, gran 1
  processor eflags	= interrupt enabled, resume, IOPL = 0
  current process		= 1273 (txg_thread_enter)
  trap number		= 12
  panic: page fault
  cpuid = 1
  time = [173](https://github.com/openzfs/zfs/actions/runs/12457044088/job/34771163635?pr=16896#step:11:175)4896301
  KDB: stack backtrace:
  #0 0xffffffff80c514c5 at kdb_backtrace+0x65
  #1 0xffffffff80c04e22 at vpanic+0x152
  #2 0xffffffff80c04cc3 at panic+0x43
  #3 0xffffffff810c206d at trap_fatal+0x38d
  #4 0xffffffff810c20bf at trap_pfault+0x4f
  #5 0xffffffff81099c18 at calltrap+0x8
  #6 0xffffffff82a5772e at zfs_rangelock_enter_impl+0x1ee
  #7 0xffffffff82a5ba57 at zfs_get_data+0x207
  #8 0xffffffff82a66620 at zil_lwb_write_issue+0x2c0
  #9 0xffffffff82a620eb at zil_commit_impl+0x3cb
  #10 0xffffffff82a59c03 at zfs_fsync+0x63
  #11 0xffffffff828568bc at zfs_file_fsync+0x7c
  #12 0xffffffff8284a3a8 at vdev_file_io_start+0x48
  #13 0xffffffff82a74f77 at zio_vdev_io_start+0x367
  #14 0xffffffff82a6b70c at zio_nowait+0x10c
  #15 0xffffffff829a5753 at vdev_config_sync+0x173
  #16 0xffffffff82969f1a at spa_sync+0x118a
  #17 0xffffffff82984381 at txg_sync_thread+0x4c1
  Uptime: 8m10s
  Automatic reboot in 15 seconds - press a key on the console to abort
  Rebooting...
  cpu_reset: Restarting BSP
  cpu_reset_proxy: Stopped CPU 1

Unfortunately, I am not sure what. This case is not even stack allocated, as it is part of the zgd_t allocation. I am going to move the position of the structure member to the top to see how the buildbot reacts on the chance that this is somehow alignment related, although this backtrace does not make sense to me unless there is some other zgd_t definition on FreeBSD that I am not seeing in the tree.

@amotin
Copy link
Member

amotin commented Dec 26, 2024

@ryao Please note that there are ranges allocated by zfs_rangelock_proxify(), zfs_rangelock_split() and zfs_rangelock_new_proxy(). I wonder if you properly differentiate what is allocated how to free them respectively.

@ryao
Copy link
Contributor Author

ryao commented Dec 26, 2024

@ryao Please note that there are ranges allocated by zfs_rangelock_proxify(), zfs_rangelock_split() and zfs_rangelock_new_proxy(). I wonder if you properly differentiate what is allocated how to free them respectively.

The pointers for those are stored internally and are never given back to the caller of zfs_rangelock_{try,}enter(), so the handling of them should not have changed at all. My machine is currently compiling updates. I will look at this more after it finishes.

Also, it appears that we recently dropped support for the Linux kernels with 8KB stacks, so we are under less stack pressure on Linux than we used to be.

@tonyhutter
Copy link
Contributor

I gave this a spin on a Fedora 40 VM with 4 vCPUs. The pool was backed by a single virtual block device with all default options. I timed deleting the Linux kernel source tree from the pool:

$ for i in {1..3} ; do tar -xf ~/v6.11-rc1.tar.gz -C /tank && time rm -fr /tank/* ; done
...
# Without PR
real	0m1.024s
user	0m0.026s
sys	0m0.975s

real	0m1.056s
user	0m0.015s
sys	0m1.017s

real	0m1.035s
user	0m0.017s
sys	0m0.993s

# With PR
real	0m1.018s
user	0m0.017s
sys	0m0.983s

real	0m1.052s
user	0m0.014s
sys	0m1.017s

real	0m1.065s
user	0m0.013s
sys	0m1.030s

So times are about the same.

The parallel delete @ryao mentioned did preform really well though (independent of this PR's change):

$ tar -xf ~/v6.11-rc1.tar.gz -C /tank && time bash -c 'find /tank/ -maxdepth 2 -mindepth 1 -type d | parallel -j4 rm -fr --'

real	0m0.604s
user	0m0.115s
sys	0m1.778s

@ryao
Copy link
Contributor Author

ryao commented Dec 26, 2024

@tonyhutter Thanks for trying it. It is clear now that whatever that is making us slow on unlinking is elsewhere and my original inspiration for this likely is wrong, but I would like to explore this a little more before I make a call on whether to withdraw this.

My current thinking is to measure the latencies for zfs_rangelock_{tryenter,enter,exit} to see if this makes a difference in the distribution, after I figure out why the buildbot is unhappy with it. If there is no difference, then I am happy to withdraw this.

@ryao ryao changed the title Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter() [WIP] Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter() Dec 26, 2024
@ryao ryao marked this pull request as draft December 26, 2024 20:07
@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Dec 26, 2024
@tonyhutter
Copy link
Contributor

@ryao heh you could always go the opposite route and add a new rm --parallel option into upstream coreutils. I'm sure millions of Linux users would thank you!

@ryao
Copy link
Contributor Author

ryao commented Jan 1, 2025

@ryao heh you could always go the opposite route and add a new rm --parallel option into upstream coreutils. I'm sure millions of Linux users would thank you!

This time of the year has been a little hectic, but since I started this experiment, I plan to finish it in a few days. Afterward, I might try doing that rm --parallel option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants