
Documenting that the new (3.14) pathlib copy functionality uses Copy-on-Write #124985

Open
opk12 opened this issue Oct 4, 2024 · 6 comments
Labels
docs Documentation in the Doc dir topic-pathlib

Comments

@opk12

opk12 commented Oct 4, 2024

Documentation


PRs 119058 and 122369 added pathlib.Path.copy() with copy-on-write (CoW) support. CoW should be documented, because it has distinctive, user-requested properties for huge files:

  • Copying is instantaneous, does negligible I/O, and requires no disk space for the data and negligible disk space for the metadata.
  • Reading both the original and the copy does half the I/O and requires half the RAM (page cache) of a traditional copy (think of booting VM images).
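
For readers who want to see what this looks like from user code, here is a minimal sketch. It assumes Python 3.14 or later for Path.copy(); on older versions it falls back to shutil.copyfile(), which is a stand-in chosen for this example and not part of the feature. The helper name snapshot_image is hypothetical. Whether the copy is actually CoW depends on the filesystem and is not visible from this code.

```python
import shutil
from pathlib import Path


def snapshot_image(image, target):
    """Copy a disk image to `target` and return the target path.

    Hypothetical helper for illustration only.
    """
    image, target = Path(image), Path(target)
    if hasattr(Path, "copy"):
        # Python 3.14+: pathlib's own copy method.
        image.copy(target)
    else:
        # Older Python: plain file copy (assumption made for this sketch).
        shutil.copyfile(image, target)
    return target
```

The calling code is the same either way, which is why the CoW behavior has to be stated in the documentation for users to rely on it.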

In the context of a Linux VM manager, CoW is an explicitly desired property. Copying the disk image is the slowest part of snapshotting a VM. Users now expect CoW snapshots and intentionally set up a CoW filesystem for the disk image directory.

For clarity, I'm not asking to mention FICLONE specifically. I'm not asking to mention copy_file_range, a micro-optimization on the traditional copy algorithm. Instead, my point is that switching from O(file size) to zero is a user-visible feature.
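
To make the distinction concrete, here is a rough sketch of a reflink attempt with a fallback to a traditional copy. It is not how pathlib is implemented; it only illustrates the difference between the two strategies. Assumptions: Linux only, the FICLONE request value 0x40049409 is used when the fcntl module does not expose the constant (Python before 3.12), and the helper name reflink_or_copy is hypothetical. For simplicity, any OSError triggers the fallback; real code would be more selective.

```python
import shutil
import sys

try:
    import fcntl
except ImportError:  # platforms without fcntl
    fcntl = None

# Assumed fallback value of the Linux FICLONE ioctl request.
FICLONE = getattr(fcntl, "FICLONE", 0x40049409) if fcntl else None


def reflink_or_copy(src, dst):
    """Try a CoW clone of src to dst; return "reflink" or "copy"."""
    if fcntl is not None and sys.platform.startswith("linux"):
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            try:
                # Ask the filesystem to share src's data blocks with dst.
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
                return "reflink"
            except OSError:
                # Filesystem (or kernel) cannot clone; fall through.
                pass
    # Traditional copy: reads and writes every byte, O(file size).
    shutil.copyfile(src, dst)
    return "copy"
```

The return value shows which path was taken, which is the user-visible difference discussed above.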

Keywords: reflink copy


@opk12 opk12 added the docs Documentation in the Doc dir label Oct 4, 2024
@picnixz
Contributor

picnixz commented Oct 4, 2024

I'm not sure I follow what you want to do. What I understood is that you want us to document how we support CoW filesystems (by the way, the project you linked only supports Python 3.7 according to PyPI, and I don't know why it would be a "keyword").

cc @barneygale as the PR's author

@picnixz picnixz added pending The issue will be closed if no feedback is provided topic-pathlib labels Oct 4, 2024
@opk12 opk12 changed the title from "Reflink mention in the pathlib doc" to "Documenting that the new (3.14) pathlib copy functionality uses Copy-on-Write" Oct 5, 2024
@opk12
Author

opk12 commented Oct 5, 2024

Please document that copies are CoW if the filesystem is CoW and list the supported filesystems, so that one is certain to provide CoW to the end user. Please leave out the implementation details (the exact syscalls).

Sorry for the confusion. "Reflink copy" / "shallow copy" is jargon for CoW, and the library was named after that. I have now reworked the ticket title.

@picnixz picnixz removed the pending The issue will be closed if no feedback is provided label Oct 5, 2024
@barneygale
Contributor

barneygale commented Oct 5, 2024

I'm cautious about touting the speed benefits of copy(), because the implementation of PathBase.copy() makes too many system calls when walking directories. We ought to add an implementation of Path.copy() that uses os.DirEntry.is_symlink() and is_dir(), rather than calling these methods on path objects. Or we could open a large can of worms, and consider whether PathBase.iterdir() might be allowed to generate path objects with a special dir_entry attribute that grants public access to an os.DirEntry-like object, and then implement this in Path.iterdir(), and consult the attribute from PathBase.copy().
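
As an illustration of the first idea (not the actual pathlib code), the sketch below classifies directory entries using os.scandir() and the os.DirEntry methods, which can usually answer from information returned by the directory listing itself instead of issuing a separate stat call per entry. The function name classify_entries is hypothetical.

```python
import os


def classify_entries(directory):
    """Group the names in `directory` into symlinks, dirs and files."""
    result = {"symlink": [], "dir": [], "file": []}
    with os.scandir(directory) as entries:
        for entry in entries:
            # Check for symlinks first so links to directories are
            # not reported as directories.
            if entry.is_symlink():
                kind = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kind = "dir"
            else:
                kind = "file"
            result[kind].append(entry.name)
    return result
```

A copy routine built this way would decide per entry whether to recurse, recreate a symlink, or copy a file, without calling is_symlink() and is_dir() on freshly built path objects.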

@opk12
Author

opk12 commented Oct 6, 2024

Very good point. Without touching on copy()'s performance, how about only saying that it does CoW, and on which filesystems? As said above, CoW is an explicit user request, with much lower free-space requirements and better future read performance.

@opk12
Author

opk12 commented Oct 6, 2024

(Off-topic: there was even an actual proposal for btrfs, the most used CoW filesystem on Linux, to CoW-copy an arbitrary directory with a single syscall. The proposal stalled, but it showed that this is doable.)

@barneygale
Contributor

#125419 will solve the known copy() performance problems. When it lands I'll start working on this issue.
