-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize lag_keepalive by crafting the LACPDU packet ourselves #3170
base: master
Are you sure you want to change the base?
Conversation
Instead of waiting for a LACPDU packet to be sent and capturing that (which involves waiting roughly 30 seconds), get the necessary information from teamd and craft it ourselves. This means that the 60-second wait in making sure that a LACPDU packet is captured and the keepalive script is ready can be largely eliminated (this has been reduced to 10 seconds to make sure the script has a chance to craft the packets and send some LACPDUs). Signed-off-by: Saikrishna Arcot <[email protected]>
/azp run Azure.sonic-utilities |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run Azure.sonic-utilities |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Saikrishna Arcot <[email protected]>
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
scripts/fast-reboot
Outdated
# give the lag_keepalive script a chance to get ready (30s) and collect one lacpdu before going down (30s) | ||
sleep 60 | ||
# give the lag_keepalive script a chance to send some LACPDUs | ||
sleep 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach still uses a wait approach, and the assumption is that the LACPDUs will be sent.
In mostly all platforms, sending these LACPDUs has become a prerequisite. So, may be we should do a quick capture here to ensure that LACPDUs are being sent out via (all?) LAGs. If not, abort warm-reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So doing a capture here might make things a bit messy. Instead, I added logic in the Python script to fork into the background after collecting information about the LAGs. That way, if there's an issue at this stage, then it'll get caught.
|
||
|
||
def craft_lacp_packet(portChannelConfig, portName): | ||
portConfig = portChannelConfig["ports"][portName] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a possibility of any of the following key accesses from the dict will fail with KeyError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add key presence check wherever needed, and default values (??) where keys are missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a try-except in main()
that catches everything and retries immediately.
|
||
def get_port_channel_config(portChannelName): | ||
(processStdout, _) = getCmdOutput(["teamdctl", portChannelName, "state", "dump"]) | ||
return json.loads(processStdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled via main()
try-catch.
Signed-off-by: Saikrishna Arcot <[email protected]>
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Start lag_keepalive before pausing orchagent, so that there's less of a delay between when orchagent is paused and when kexec happens, and so that fewer events/changes aren't handled by orchagent. Additionally, add an option into the lag_keepalive script to fork into the background after generating the LACPDUs and opening sockets, but before sending the actual packets. This serves as a sort-of error check to make sure that it is at least able to send LACPDU packets, and didn't bail out early. Signed-off-by: Saikrishna Arcot <[email protected]>
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
Instead of waiting for a LACPDU packet to be sent and capturing that (which involves waiting roughly 30 seconds), get the necessary information from teamd and craft it ourselves. This means that the 60-second wait in making sure that a LACPDU packet is captured and the keepalive script is ready can be largely eliminated (this has been reduced to 10 seconds to make sure the script has a chance to craft the packets and send some LACPDUs).
Fixes sonic-net/sonic-buildimage#17416.
How I did it
How to verify it
Verified on KVM that the lag_keepalive script started correctly. Also verified that during a warm-reboot, the script correctly starts and send LACPDU packets every second (the delay between starting the script and the message where it said it is ready was 2 seconds).
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)