[WIP] Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter() #16896
base: master
I don't remember if I saw it as a problem, but I can believe that allocations are not great. Though modern allocations are pretty well cached and I think we are talking about one allocation per operation here.
The patch looks OK to me, but my worry here is stack consumption. I haven't checked on Linux, but on FreeBSD zfs_locked_range_t takes 96 bytes, which will now be on the stack instead of an 8-byte pointer. It would be nice to try to compact it, if possible, since you are already changing the API. For example, lr_rangelock seems to be used in only 2 places, and I guess we could just pass it directly to zfs_rangelock_exit() and zfs_rangelock_reduce() to save 8 bytes. Maybe lr_type could be reduced to uint8_t, which is not very clean, but could save another 8 bytes.
It was never huge, but it was non-negligible as per my recollection. When I looked into why ext4 was so much faster at unlinking than we were in the past, I had blamed the range locks and dropped the matter since range locks were important for scalability. It occurs to me that was not an entirely accurate assessment, but it was the conclusion I made at the time.
The VFS and block layer on the zvol side have never been stack constrained as far as I know, so the only place where this is potentially a problem is in How about we subject this to testing to see if this is a problem and if it passes, let it be? If it is a problem, then we can switch back to
In multiple places in the code, the
It takes 208 bytes on my Linux machine:
This is because the Linux SPL's kcondvar_t is huge:
Interestingly, if I try to save some space by switching lr_type to uint8_t and move it to the end, nothing changes due to compiler added padding:
This yields no improvement. I wonder if FreeBSD sees any savings. This also makes me wonder if removing
It makes me worry about stack usage even more. We don't know from what context VFS calls us, and if it is not directly from user-space, I worry there might be some unpleasant surprises.
Yea. On FreeBSD it takes only 16 bytes, which makes other optimizations more noticeable.
Why did you move it to the end if there is a 4-byte hole where it was before, after lr_count?
Very few places in the Linux kernel call into the VFS. It is generally discouraged. I am not very concerned about this on Linux. In 2014, Linux switched from 8KB stacks where stack space had been painful for us to 16KB stacks. We still support a few of the kernels that have 8KB stacks, but stack space when called from the VFS was never a problem even in the 8KB stack days, so adding this to the stack should not create an issue.
The enum is an int per the C standard and int is 4 bytes in our memory model. Changing the enum to a uint8_t would create a 3-byte hole, which I expected to be padded to ensure the following member had at least 4-byte alignment, so I moved it to the end where there was already padding. To my surprise, the compiler is padding to give at least 8-byte alignment, such that a 4-byte hole emerged from the move of lr_type to the end. Even if I shrink lr_type from 4 bytes to 1 byte, the saved bytes just become padding, no matter where I put it. Things might be different on 32-bit systems, but I don't think enough people use those anymore to make it worth doing this just for them. It is counter-intuitive, but there are no savings from shrinking lr_type.
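The padding effect described in this comment can be reproduced with a toy structure. The sketch below is not the real zfs_locked_range_t; every name in it is hypothetical, and the single uint64_t member cv merely stands in for an 8-byte-aligned member such as a condition variable. It assumes a 64-bit ABI where uint64_t has 8-byte alignment (for example x86-64 or aarch64); on a 32-bit ABI the numbers can differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in layout: a 4-byte "type" sits next to a 4-byte count. */
struct toy_original {
	uint64_t offset;
	uint32_t count;
	int32_t type;		/* stands in for the 4-byte enum */
	uint64_t cv;		/* stands in for an 8-byte-aligned member */
	uint8_t flags[3];
};

/* Variant 1: shrink "type" to one byte but leave it where it was. */
struct toy_shrunk_in_place {
	uint64_t offset;
	uint32_t count;
	uint8_t type;		/* 3 bytes of padding follow, before cv */
	uint64_t cv;
	uint8_t flags[3];
};

/* Variant 2: shrink "type" and move it to the end, next to the flags. */
struct toy_shrunk_at_end {
	uint64_t offset;
	uint32_t count;		/* a 4-byte hole opens here, before cv */
	uint64_t cv;
	uint8_t flags[3];
	uint8_t type;
};

size_t
toy_size_original(void)
{
	return (sizeof (struct toy_original));
}

size_t
toy_size_in_place(void)
{
	return (sizeof (struct toy_shrunk_in_place));
}

size_t
toy_size_at_end(void)
{
	return (sizeof (struct toy_shrunk_at_end));
}

/* Bytes of padding between the end of count and the start of cv. */
size_t
toy_hole_after_count(void)
{
	return (offsetof(struct toy_shrunk_at_end, cv) -
	    (offsetof(struct toy_shrunk_at_end, count) + sizeof (uint32_t)));
}
```

Under the stated assumption, all three size helpers return the same value: the bytes saved by shrinking the field reappear as padding, either in front of the 8-byte-aligned member or at the tail of the structure. Running pahole on the real structure is the reliable way to check whether the same holds for zfs_locked_range_t on a given platform.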
Maybe. But even if so, I guess there might be some more complicated structures like nullfs and similar mounts, etc., increasing call depth.
It is not a surprise that the structure is padded to 8 bytes, since it includes 8-byte elements that must be aligned. That is why I thought about reusing the already existing hole after lr_count. Then the structure would end just after lr_read_cv.
Linux bind mounts would be the equivalent of the FreeBSD nullfs driver. Those are handled by the Linux VFS if I recall. The main things that would call into the VFS would be the loop driver for mounting filesystems from files and FUSE. However, none of this has ever been a problem even on the kernels with 8KB stacks. Linux actually had a push to move to 4KB stacks: https://lwn.net/Articles/279229/ However, practicality won and in 2014, Linux went to 16KB stacks: https://lwn.net/Articles/600644/ However, the 4KB stack enablement work did plenty to minimize stack usage in these paths in upstream Linux. Linux also has a tool for seeing maximum stack utilization, although I have never had to use it as the stack space issues in the project were already fixed by @behlendorf before I started contributing.
There is no existing hole after lr_count. That hole was made after I tried shrinking lr_type and moved it to the end. See the first We might be able to save 8 bytes part of the time by removing
When compared to 208 bytes, maybe not. But I am not buying that we are not stack limited here or anywhere in the kernel. I'll let others comment.
…,}enter() Typically, the memory allocated by kmem_alloc() can be trivially allocated from either the stack or as part of another structure. The only case where it cannot is in vdev_raidz_io_start(), although some further refactoring should be able to eliminate that case too. Allocating from the stack or as part of another data structure is faster as it gives us this memory for free, so there is little reason not to do it. This eliminates a non-negligible amount of CPU time that I have seen in flame graphs going back to the early days of OpenZFS when the tree was the ZFSOnLinux tree. This should make our VFS and zvol operations slightly faster. Some RAID-Z operations will also become slightly faster. Signed-off-by: Richard Yao <[email protected]>
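The difference between the two allocation styles described in the commit message can be sketched in userland C. This is only an illustration under loud assumptions: all toy_* names are hypothetical, malloc()/free() stand in for kmem_alloc()/kmem_free(), and the "lock" is just a counter with no real range tracking or blocking. It is not the OpenZFS API and does not reproduce the signatures in this PR.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy types; not the OpenZFS definitions. */
typedef struct toy_lock {
	int held;		/* number of ranges currently "locked" */
} toy_lock_t;

typedef struct toy_range {
	toy_lock_t *owner;
	uint64_t off;
	uint64_t len;
} toy_range_t;

/* Old style: enter() allocates the range and hands back a pointer. */
toy_range_t *
toy_enter_alloc(toy_lock_t *l, uint64_t off, uint64_t len)
{
	toy_range_t *r = malloc(sizeof (*r));

	if (r == NULL)
		return (NULL);
	r->owner = l;
	r->off = off;
	r->len = len;
	l->held++;
	return (r);
}

void
toy_exit_alloc(toy_range_t *r)
{
	r->owner->held--;
	free(r);
}

/* New style: the caller owns the storage, so no allocator call is needed. */
void
toy_enter_external(toy_lock_t *l, toy_range_t *r, uint64_t off, uint64_t len)
{
	r->owner = l;
	r->off = off;
	r->len = len;
	l->held++;
}

void
toy_exit_external(toy_range_t *r)
{
	r->owner->held--;
}

/* The storage can also live inside a larger, already allocated object. */
typedef struct toy_request {
	toy_range_t range;	/* embedded, freed together with the request */
	uint64_t bytes_done;
} toy_request_t;
```

A caller of the external variant would declare a toy_range_t on its stack (or embed one, as in toy_request_t), pass its address to toy_enter_external(), and call toy_exit_external() before the storage goes out of scope. The trade-off raised in the review applies directly: the allocator call disappears, but the full structure, rather than a pointer to it, now occupies the caller's stack frame.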
We are stack limited in the ZIO threads, which this does not touch. I am not aware of stack limitations elsewhere. If it helps, I could set aside time to try measuring. If we were to decide not to stack allocate things, this API change would still be beneficial since it coalesces allocations in a number of places, such as in the RAID-Z code and That being said, there is something subtly wrong here. From qemu-x86 (freebsd13-3r):
Unfortunately, I am not sure what. This case is not even stack allocated, as it is part of the zgd_t allocation. I am going to move the position of the structure member to the top to see how the buildbot reacts on the chance that this is somehow alignment related, although this backtrace does not make sense to me unless there is some other zgd_t definition on FreeBSD that I am not seeing in the tree.
@ryao Please note that there are ranges allocated by
The pointers for those are stored internally and are never given back to the caller of Also, it appears that we recently dropped support for the Linux kernels with 8KB stacks, so we are under less stack pressure on Linux than we used to be.
I gave this a spin on a Fedora 40 VM with 4 vCPUs. The pool was backed by a single virtual block device with all default options. I timed deleting the Linux kernel source tree from the pool:
So times are about the same. The parallel delete @ryao mentioned did perform really well though (independent of this PR's change):
@tonyhutter Thanks for trying it. It is clear now that whatever is making us slow on unlinking is elsewhere and my original inspiration for this is likely wrong, but I would like to explore this a little more before I make a call on whether to withdraw this. My current thinking is to measure the latencies for
@ryao heh you could always go the opposite route and add a new
This time of the year has been a little hectic, but since I started this experiment, I plan to finish it in a few days. Afterward, I might try doing that rm --parallel option.
Motivation and Context
The other day, a friend complained about slow rm -r operations on a pool that had two mirrored spinning disks. He was unlinking a large number of files, so I suggested he do a parallel rm and what should have taken a few more minutes to complete ended in seconds:
I have long known that our VFS operations are slower than other filesystems such as ext4, and I have long attributed that to our use of range locks for scaling versus the spin locks that ext4 uses. Since scalability is more important than single threaded performance, I made no attempt to change this. Today, I had a flash of inspiration after writing a comment about this in a Hacker News discussion:
https://news.ycombinator.com/item?id=42486279
It occurred to me that the uncontended case could be made faster. My initial idea was to try to make an adaptive range lock that would minimize work done in uncontended cases to approach the performance of ext4's spin locks. However, upon looking at the range lock code, I realized that the amount of work being done in the uncontended case is already somewhat minimal while significant improvement in all cases would be possible by changing how we allocate memory for zfs_locked_range_t.
Instead of allocating memory for zfs_locked_range_t as part of zfs_rangelock_{try,}enter(), we can allocate it externally either on the stack or as part of another structure. This is possible with trivial refactoring in all cases except vdev_raidz_io_start(), where slightly more extensive refactoring is needed to allocate zfs_locked_range_t as part of raidz_map_t.
I recall seeing flame graphs from close to the start of ZFSOnLinux showing significant time spent in kmem_alloc() when doing range locks. Doing this refactoring eliminates that entirely. This should make our VFS and zvol operations slightly faster. Some RAID-Z operations will also become slightly faster.
Description
We do trivial refactoring to allocate zfs_locked_range_t either on the stack or as part of other structures in all but vdev_raidz_io_start(), where non-trivial refactoring is needed.
How Has This Been Tested?
I have only done build tests so far. I am letting the buildbot do runtime testing.