
Fix port up/bfd sessions bringup notification delay issue. #3269

Merged
merged 9 commits into from
Oct 9, 2024

Conversation

liuh-80
Contributor

@liuh-80 liuh-80 commented Aug 28, 2024

Fix port up/bfd sessions bringup notification delay issue.

Why I did it

Fix the following issue:
sonic-net/sonic-buildimage#19569

Work item tracking
  • Microsoft ADO: 29192284

How I did it

Revert the change in Consumer::execute() that was introduced by this commit:
9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08

The change in that commit added a while loop:

do
{
    std::deque<KeyOpFieldsValuesTuple> entries;
    table->pops(entries);
    update_size = addToSync(entries);
} while (update_size != 0);

The addToSync method returns the number of entries it processed, which means that if there is a massive number of route notifications, other higher-priority notifications (for example, port up notifications) will be blocked until all route notifications have been handled.

How to verify it

Pass all UT.
Manually verified that the issue is fixed.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

  • SONiC.master-20030.629638-f370e2fa8

Description for the changelog

Fix port up/bfd sessions bringup notification delay issue.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mlok-nokia

@liuh-80 I built an image with this change and tested it. For the first boot after installation on a single linecard, all ports come up in 8 minutes and all 34k routes are also installed. For subsequent reboots of a single linecard, it takes about 7 minutes for all links to come up and the 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including the OC testing.

@liuh-80
Contributor Author

liuh-80 commented Aug 29, 2024

@bocon13, we found a performance issue caused by your PR; can you review this fix?

Performance issue: sonic-net/sonic-buildimage#19569
PR that caused the performance issue: #1992

@liuh-80 liuh-80 changed the title [POC] verify route performance issue Fix port up/bfd sessions bringup notification delay issue. Aug 30, 2024
@liuh-80 liuh-80 marked this pull request as ready for review August 30, 2024 06:20
@liuh-80 liuh-80 requested a review from prsunny as a code owner August 30, 2024 06:20
@lguohan
Contributor

lguohan commented Aug 30, 2024

@liuh-80, we need some UT to prevent such regressions.

@wenyiz2021

thanks @liuh-80! Just curious, what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

qiluo-msft
qiluo-msft previously approved these changes Aug 31, 2024
@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

Consumer

Will add a sonic-swss test case to prevent this issue from happening again.

@liuh-80
Contributor Author

liuh-80 commented Sep 2, 2024

thanks @liuh-80! Just curious, what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@siqbal1986 siqbal1986 self-requested a review September 3, 2024 03:28
@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Collaborator

prsunny commented Sep 4, 2024

thanks @liuh-80! Just curious, what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.

thanks! Then does the current design in this PR ensure that priority tasks are notified, or does it only notify the first-arrived notification?

With this PR, each Consumer will pop at most 128 notifications each time, which means orchagent will check and handle high-priority notifications more frequently.

Does this mean the bulk APIs (SAI) that orchagent invokes for routes etc. will now have a limit of 128 entries?

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

thanks @liuh-80! Just curious, what is the difference between the Consumer popping notifications once vs. popping until the number of entries is 0?

This will make the Consumer pop all notifications belonging to the current consumer, so higher-priority notifications will be blocked.

thanks! Then does the current design in this PR ensure that priority tasks are notified, or does it only notify the first-arrived notification?

With this PR, each Consumer will pop at most 128 notifications each time, which means orchagent will check and handle high-priority notifications more frequently.

Does this mean the bulk APIs (SAI) that orchagent invokes for routes etc. will now have a limit of 128 entries?

Yes, it will have a 128-entry limit.
This is an orchagent design issue.

@liuh-80
Contributor Author

liuh-80 commented Sep 4, 2024

The following tests in the pc suite seem to be triggering the crashes. orchagent tries to remove the neighbor and nexthop; either the meta layer or SAI is not in sync with orchagent and returns an error, which causes orchagent to exit. test_po_update_io_no_loss test_po_update

I think it's a timing issue; for example, the validation of this PR had lots of test cases failing, but after I increased the wait_for_n_keys timeout, many test cases passed.

However, this change does impact performance, because after this change every doTask() call can only handle 128 entries, so some scenarios take longer.

I'm trying to change only RouteOrch to improve performance.

orchagent/orch.cpp Outdated Show resolved Hide resolved
orchagent/orch.cpp Outdated Show resolved Hide resolved
orchagent/orch.cpp Outdated Show resolved Hide resolved
@liuh-80 liuh-80 force-pushed the dev/liuh/test-route-performance branch from ba0cd6d to ff0b951 Compare September 26, 2024 01:19
@saksarav-nokia
Contributor

We tried the fix in this PR on the Voq chassis and are seeing the following issue.
The IMM has two asics, with 2 port channels in each asic and 2 port members in each port channel.
An ip address is configured on each port channel and bgp is enabled. The neighbors and routes are learned on these port channels.
In the sonic-mgmt pc suite, the test case po-update removes the port members from one of the port channels, removes the ip address configured on that port channel, creates a new port channel, adds the same port members to the new port channel, and adds the same ip address to the new port channel.
In the remote asic, before all the routes learned on the old port channel are removed by routeOrch, removal of the neighbor and nexthop for the old port channel is attempted. But since the routes are pending, the old nexthop and neighbor are not removed. Then the neighbor and nexthop for the new port channel are added. If the neighbor is learned on a remote system port in the remote asic, the nexthop is added with the inband port's alias, so the key (ip, alias) is the same for both the old nexthop and the new nexthop.

When the new nexthop is added, hasNextHop is called to check whether a nexthop with (ip-address, alias) as its key already exists. Since the old nexthop has not been removed yet, hasNextHop returns true; however, the assert(!hasNextHop) doesn't trigger the crash. So addNextHop replaces the old nexthop (with the old rif-id) with the new nexthop (with the new rif-id) in the nexthop map.

Then, after all the routes learned on the old port channel are removed, the old neighbor and old nexthop are removed. Since the old nexthop was replaced with the new nexthop, when orchagent tries to delete the old nexthop, it actually deletes the new nexthop from SAI. Then, when it tries to remove the old neighbor, SAI returns an error, since orchagent removed the new nexthop from SAI instead of the old one and the old neighbor is still referenced by the old nexthop in SAI. So orchagent crashes when SAI returns the error.

@liuh-80
Contributor Author

liuh-80 commented Oct 9, 2024

We tried the fix in the PR in the Voq chassis and seeing the following issue [...] So orchagent crashes when SAI returns error.

Hi @saksarav-nokia, the code change in this PR does not change any orch logic, so as I understand it, this issue is a pre-existing bug in orchagent. Is there a plan to fix it?

@prsunny prsunny merged commit 766e755 into sonic-net:master Oct 9, 2024
17 checks passed
@liuh-80
Contributor Author

liuh-80 commented Oct 10, 2024

If this PR needs to be cherry-picked, the following PRs also need to be cherry-picked first:
#3304
#3305

@yejianquan

@bingwang-ms for viz

mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Oct 14, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202405: #3328

mssonicbld pushed a commit that referenced this pull request Oct 14, 2024