[muxorch] Using bulker to program routes/neighbors during switchover #3148

Merged: 17 commits, Jun 4, 2024

Conversation

Ndancejic
Contributor

Uses the entity bulker to program routes and neighbors during mux
switchover. Switchover performance suffers when there is a large number
of neighbors on the mux port; programming the neighbors and routes in
bulk avoids issuing a separate call for each one.
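
For readers unfamiliar with the pattern, a minimal self-contained sketch of the bulking idea follows. It does not use the real SONiC EntityBulker API; the NeighborBulker class and every name in it are invented for illustration.

// Sketch of the bulking idea: queue per-neighbor operations, then issue
// one flush instead of one call per neighbor.
#include <iostream>
#include <string>
#include <vector>

struct NeighborOp
{
    std::string ip;   // hypothetical neighbor key
    bool success;     // filled in by flush()
};

class NeighborBulker
{
public:
    void enqueue(const std::string &ip) { pending_.push_back({ip, false}); }

    // One batched call replaces N sequential invocations.
    void flush()
    {
        for (auto &op : pending_)
            op.success = true; // a real bulker returns per-entry statuses
        std::cout << "programmed " << pending_.size() << " neighbors in one batch\n";
    }

private:
    std::vector<NeighborOp> pending_;
};

int main()
{
    NeighborBulker bulker;
    for (int i = 0; i < 128; ++i)
        bulker.enqueue("192.168.0." + std::to_string(i));
    bulker.flush(); // one flush instead of 128 sequential calls
}

The saving comes from paying the call overhead once per batch rather than once per neighbor.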

What I did
Changed the mux switchover logic to program neighbors and routes in bulk instead of programming each neighbor sequentially.

Why I did it
Testing shows this improves switchover time by an average of 30% at 128 neighbors.

How I verified it
Added a test to vstests to verify the functionality of the code, and tested performance on a dualtor lab testbed.

Details if related
ado:#25072867

bulk_switchover_test_results.xlsx

@Ndancejic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@Ndancejic Ndancejic marked this pull request as ready for review May 21, 2024 20:49
@Ndancejic Ndancejic requested a review from prsunny as a code owner May 21, 2024 20:49
* @brief Creates a neighbor add entry and adds it to bulker.
* @param ctx NeighborBulkContext contains neighbor information and list of object statuses.
*/
bool NeighOrch::addBulkNeighbor(NeighborBulkContext& ctx)
Collaborator

Does it use the bulk neighbor path only for the mux scenario, or for the existing path as well? I think we should enable it only for the mux flow.

Contributor Author

It's only used for the mux flow in this PR.
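
For context, the doc comment above only says the context carries neighbor information and a list of object statuses; a hedged guess at its shape follows (the field names below are assumptions, not the actual struct):

#include <cstdint>
#include <string>
#include <vector>

using sai_status_t = int32_t; // stand-in for the real SAI typedef

struct NeighborBulkContext
{
    std::string ip;                            // neighbor being programmed
    std::string mac;                           // its MAC address
    std::vector<sai_status_t> object_statuses; // per-operation results from the bulker
};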

@prsunny
Collaborator

prsunny commented May 21, 2024

@dgsudharsan for viz


linux-foundation-easycla bot commented May 25, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

The bulk switchover feature changes the sai_api calls in the switchover
flow. Because of this, the mux_rollback_ut tests fail. This PR adds bulk
operations to mock_sai_neighbor and mock_sai_api, and adjusts the
expected function calls in mux_rollback_ut.

Signed-off-by: Nikola Dancejic <[email protected]>
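
The mock changes themselves are not visible in this view. As a rough illustration only (not the actual mock_sai_api code; every type and name below is a stand-in), a bulk SAI call might be mocked with gmock so the UT can assert that switchover now issues one bulk call instead of N per-neighbor calls:

#include <gmock/gmock.h>
#include <gtest/gtest.h>
#include <cstdint>

using status_t = int32_t; // stand-in for sai_status_t

class SaiNeighborApi
{
public:
    virtual ~SaiNeighborApi() = default;
    virtual status_t create_neighbor_entries(uint32_t count, status_t *statuses) = 0;
};

class MockSaiNeighborApi : public SaiNeighborApi
{
public:
    MOCK_METHOD(status_t, create_neighbor_entries, (uint32_t count, status_t *statuses), (override));
};

TEST(MuxRollbackUt, SwitchoverUsesBulkCreate)
{
    MockSaiNeighborApi mock;
    // Expect exactly one bulk call covering several entries,
    // not one create call per neighbor.
    EXPECT_CALL(mock, create_neighbor_entries(testing::Gt(1u), testing::_)).Times(1);

    status_t statuses[2] = {};
    mock.create_neighbor_entries(2, statuses); // link with gtest_main to run
}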
@prsunny
Collaborator

prsunny commented May 30, 2024

Dependent on PR sonic-net/sonic-sairedis#1373

@Ndancejic
Contributor Author

Still pulling from the wrong sairedis pipeline; running again:
Download from the specified build: #558894

@Ndancejic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@Ndancejic
Contributor Author

The last run was still failing because the correct syncd was not installed for some reason. I ran the mux tests locally with the syncd package pulled from https://dev.azure.com/mssonic/build/_build/results?buildId=559999&view=results, and only one test_multi_nexthop test failed, due to a small bug I was able to resolve. Tested with the latest commit and all tests are passing.

=================================================================
conftest.py:1880
/data/sonic-swss/tests/conftest.py:1880: PytestDeprecationWarning: @pytest.yield_fixture is deprecated.
Use @pytest.fixture instead; they are the same.
@pytest.yield_fixture(scope="module")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================== 31 passed, 1 warning in 1031.63s (0:17:11) =====

@Ndancejic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@Ndancejic
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@prsunny prsunny left a comment


Can you post the dualtor sonic-mgmt test results?

while (it != neighbors_.end())
{
NextHopKey nh_key = NextHopKey(it->first, alias_);
if (update_rt)
Collaborator

Why is update_rt checked here? It's already checked above.

Contributor Author

Good catch, will fix that.

}

it = neighbors_.begin();
while (it != neighbors_.end())
Collaborator

We've introduced three loops now in the new code. Can we optimize it?

Contributor Author

I took out the loop mentioned in the comment below. I believe the two remaining loops are necessary: we need the neighbors enabled before we update routes to point to their next hops.
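
A minimal sketch of the ordering constraint being described, with hypothetical helper names (the real code goes through the neighbor bulker and routeorch):

#include <string>
#include <vector>

struct Neighbor { std::string ip; };

void enqueueNeighbor(const Neighbor &) {}       // stand-in for the bulker enqueue
void flushNeighborBulker() {}                   // stand-in for the single bulk call
void repointRouteToNextHop(const Neighbor &) {} // stand-in for the route update

void enableNeighbors(const std::vector<Neighbor> &neighbors)
{
    // Loop 1: queue every neighbor, then flush once.
    for (const auto &n : neighbors)
        enqueueNeighbor(n);
    flushNeighborBulker();

    // Loop 2: only after the flush is it safe to point routes
    // at the newly enabled next hops.
    for (const auto &n : neighbors)
        repointRouteToNextHop(n);
}

int main()
{
    enableNeighbors({{"10.0.0.1"}, {"10.0.0.2"}});
}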

NextHopKey nh_key = NextHopKey(it->first, alias_);
if (update_rt)
{
updateTunnelRoute(nh_key, false);
Collaborator

Why don't we move this to the original section and do this loop only if the bulk API fails? That way we don't have to do it every time.

Contributor Author

I see what you're saying; I think I just wanted to make sure the order was maintained. I'll fix that.

Contributor Author

I think it's safe to take this loop out entirely then: if the bulk API fails we will fall back, and the fallback removes the tunnel routes anyway.
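
A sketch of the fallback behavior being described (the function names and control flow here are assumptions, not the actual orchagent code):

#include <iostream>

// Stand-ins for the real switchover paths.
bool bulkSwitchover() { return false; }  // pretend the bulk path failed
void sequentialSwitchover()              // legacy path; cleans up tunnel routes itself
{
    std::cout << "falling back to sequential switchover\n";
}

int main()
{
    if (!bulkSwitchover())
    {
        // No separate tunnel-route cleanup loop is needed before this:
        // the sequential fallback removes the tunnel routes anyway.
        sequentialSwitchover();
    }
}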

PortsOrch* ports_orch = gDirectory.get<PortsOrch*>();
auto vlan_ports = ports_orch->getAllVlans();

for (auto vlan_port: vlan_ports)
Collaborator

This is a lot of repeated code from existing addNeighbor. I think we should only add the bulk function as discussed offline.

Contributor Author

Reduced the amount of repeated code in the latest commit:

 - Using the existing add/removeNeighbor functions to do the bulker operations.
 - Created addNeighborPost and removeNeighborPost.
 - Created addRoutes and removeRoutes to deal with bulk operations.
 - Created enableNeighbors and disableNeighbors to handle bulk switchover (see the sketch after this commit message).

Signed-off-by: Nikola Dancejic <[email protected]>
Signed-off-by: Nikola Dancejic <[email protected]>
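
A hedged sketch of how these helpers might fit together. The function names come from the commit message above, but the signatures, the flushBulker helper, and the bodies are assumptions rather than the actual orchagent code:

#include <string>
#include <vector>

struct NeighborEntry { std::string ip; };

void addNeighbor(const NeighborEntry &) {}           // enqueues into the bulker
void addNeighborPost(const NeighborEntry &, bool) {} // handles the per-entry status after the flush
void flushBulker(std::vector<bool> &statuses, size_t n)
{
    statuses.assign(n, true); // a real bulker returns per-entry SAI statuses
}

void enableNeighbors(const std::vector<NeighborEntry> &neighbors)
{
    for (const auto &n : neighbors)
        addNeighbor(n); // enqueue only; nothing is programmed yet

    std::vector<bool> statuses;
    flushBulker(statuses, neighbors.size()); // one bulk call for all entries

    for (size_t i = 0; i < neighbors.size(); ++i)
        addNeighborPost(neighbors[i], statuses[i]); // post-processing pass
}

int main()
{
    enableNeighbors({{"10.0.0.1"}, {"10.0.0.2"}});
}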
}

updateTunnelRoute(nh_key, false);
Collaborator

Should this be inside the update_rt check?

Contributor Author

Yes, I'll move that back.

return false;
}

gRouteBulker.clear();
Collaborator

Should we clear this even if it returned false above?

Contributor Author

Took a look at this; it's not needed since we call clear() from removeRoutes.
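
In other words (a sketch with an assumed structure, not the actual code), the caller-side clear() is redundant because removeRoutes() already clears the bulker itself:

#include <iostream>

struct RouteBulker
{
    void clear() { std::cout << "route bulker cleared\n"; }
};

RouteBulker gRouteBulker; // gRouteBulker is the name used in the diff above

bool removeRoutes()
{
    // ... enqueue and flush route removals ...
    gRouteBulker.clear(); // cleared here on every path through removeRoutes
    return true;
}

int main()
{
    removeRoutes(); // callers need no additional gRouteBulker.clear()
}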

@@ -1060,6 +1081,7 @@ bool NeighOrch::addNeighbor(const NeighborEntry &neighborEntry, const MacAddress
}
else if (isHwConfigured(neighborEntry))
{
ctx.set_neigh_attr_count = (int)neighbor_attrs.size();
Collaborator

This can be removed



SWSS_LOG_INFO("Checking neighbor remove entry status %s on %s.", ip_address.to_string().c_str(), m_syncdNeighbors[neighborEntry].mac.to_string().c_str());

if (isHwConfigured(neighborEntry))
Contributor Author

Took a look at this; we still need the isHwConfigured check since it isn't checked in the disableNeighbors path.

Signed-off-by: Nikola Dancejic <[email protected]>