Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize lag_keepalive by crafting the LACPDU packet ourselves #3170

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

saiarcot895
Copy link
Contributor

@saiarcot895 saiarcot895 commented Feb 19, 2024

What I did

Instead of waiting for a LACPDU packet to be sent and capturing that (which involves waiting roughly 30 seconds), get the necessary information from teamd and craft it ourselves. This means that the 60-second wait in making sure that a LACPDU packet is captured and the keepalive script is ready can be largely eliminated (this has been reduced to 10 seconds to make sure the script has a chance to craft the packets and send some LACPDUs).

Fixes sonic-net/sonic-buildimage#17416.

How I did it

How to verify it

Verified on KVM that the lag_keepalive script started correctly. Also verified that during a warm-reboot, the script correctly starts and send LACPDU packets every second (the delay between starting the script and the message where it said it is ready was 2 seconds).

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Instead of waiting for a LACPDU packet to be sent and capturing that
(which involves waiting roughly 30 seconds), get the necessary
information from teamd and craft it ourselves. This means that the
60-second wait in making sure that a LACPDU packet is captured and the
keepalive script is ready can be largely eliminated (this has been
reduced to 10 seconds to make sure the script has a chance to craft the
packets and send some LACPDUs).

Signed-off-by: Saikrishna Arcot <[email protected]>
@xumia
Copy link
Collaborator

xumia commented Feb 22, 2024

/azp run Azure.sonic-utilities

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895
Copy link
Contributor Author

/azp run Azure.sonic-utilities

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

vaibhavhd
vaibhavhd previously approved these changes Jan 14, 2025
# give the lag_keepalive script a chance to get ready (30s) and collect one lacpdu before going down (30s)
sleep 60
# give the lag_keepalive script a chance to send some LACPDUs
sleep 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach still uses a wait approach, and the assumption is that the LACPDUs will be sent.

In mostly all platforms, sending these LACPDUs has become a prerequisite. So, may be we should do a quick capture here to ensure that LACPDUs are being sent out via (all?) LAGs. If not, abort warm-reboot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So doing a capture here might make things a bit messy. Instead, I added logic in the Python script to fork into the background after collecting information about the LAGs. That way, if there's an issue at this stage, then it'll get caught.



def craft_lacp_packet(portChannelConfig, portName):
portConfig = portChannelConfig["ports"][portName]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a possibility of any of the following key accesses from the dict will fail with KeyError?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add key presence check wherever needed, and default values (??) where keys are missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a try-except in main() that catches everything and retries immediately.


def get_port_channel_config(portChannelName):
(processStdout, _) = getCmdOutput(["teamdctl", portChannelName, "state", "dump"])
return json.loads(processStdout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled via main() try-catch.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Start lag_keepalive before pausing orchagent, so that there's less of a
delay between when orchagent is paused and when kexec happens, and so
that fewer events/changes aren't handled by orchagent.

Additionally, add an option into the lag_keepalive script to fork into
the background after generating the LACPDUs and opening sockets, but
before sending the actual packets. This serves as a sort-of error check
to make sure that it is at least able to send LACPDU packets, and didn't
bail out early.

Signed-off-by: Saikrishna Arcot <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[warm-reboot] longer warm-reboot command execution time due to added lag_keepalive.py
4 participants