Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use poll implementation to fix file descriptor leak #72

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Stealthii
Copy link

Using select() here causes a file descriptor leak. We've replaced this with a poll() implementation to avoid this.

The polling object needs to be opened with an eventmask that stops it from blocking paramiko.

Fixes #51.

mcontoliniproofpoint and others added 3 commits April 12, 2022 14:55
Using select() here causes a file descriptor leak. We've replaced this
with a poll() implementation to avoid this.

The polling object needs to be opened with an eventmask that stops it
from blocking paramiko.

Fixes ploxiln#51

Co-authored-by: Dan Porter <[email protected]>
Mapping has been added to define the sender and receiver for each event.
Additionally we continue with other events to combat the case where a
sender has sent data and also hung up unexpectedly.
@jealouscloud
Copy link
Contributor

jealouscloud commented Apr 20, 2022

Correct me if I'm wrong, but I think file descriptors (pipes) are leaking due to the issue in this PR:
#74

And the issue is manifesting in issues with select calls.

With that said, here's the docs for the select syscall that backs select.select
https://man7.org/linux/man-pages/man2/select.2.html

WARNING: select() can monitor only file descriptors numbers that
       are less than FD_SETSIZE (1024)—an unreasonably low limit for
       many modern applications—and this limitation will not change.
       All modern applications should instead use [poll(2)](https://man7.org/linux/man-pages/man2/poll.2.html) or [epoll(7)](https://man7.org/linux/man-pages/man7/epoll.7.html),
       which do not suffer this limitation.

select seems to be a problematic syscall, and "all modern applications should use poll/epoll" so replacing it with select.poll seems like the right play even if it isn't the cause of the leak.

@Stealthii
Copy link
Author

@jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under select() when running Fabric against a large number of hosts in parallel (5k+).

The above code for io.py has been tested in a production capacity, but I extended the poll implementation to the forwarder in context_managers.py after seeing #51.

@jealouscloud
Copy link
Contributor

@jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under select() when running Fabric against a large number of hosts in parallel (5k+).

One of the ways I troubleshot this issue was taking a ls -l look at /proc/$pid/fd where $pid is the process ID of the primary/parent fabric process.

Before noticing the leak cause of the leak for #74 I recall noticing there was a big difference in fd allocation when I used --eagerly-disconnect. That may or may not have helped your case but if it did, that might have been part of intended functionality instead of an unintended leak of unused resources. That said, fd count absolutely scales with your -z value too, depending on what that might have been.

I'm not saying there aren't potentially multiple leaks but I do want to add what i've seen to the conversation.

All that said, I still think this is a good/necessary change, leak or no.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Something is opening file descriptors and not closing them
3 participants