[202405] [Chassis]: Ports take too long to come up due to delayed port up notification processing by orchagent #19569
Comments
Hi @mannytaheri, |
@mannytaheri please let me know the test result with the patches, if not working I'll look at |
PMON faster bring up does not seem to help this issue. @wenyiz2021 could you help follow up with Arvind/BRCM and try out the BRCM fix |
Related issues raised earlier ? : #17180 |
To further debug
Check with sonic-common-infra subgroup, for a root cause which could be known. |
The FIB suppress pending feature got merged recently, can we check again with latest master.202405 build #19736 @mannytaheri |
cannot reproduce issue on Arista chassis with latest master image with SAI 11.2 taken from #19854 |
@mannytaheri this does not seem to be a general issue across all platforms. Can you try the above master image with SAI 11? |
above is the understanding. |
Ack, I will simulate this case and check whether the issue is related to the sonic-swss-common selectable priority. |
Here is an update: today I created a test case to simulate this, and here is my summary:
I'm not sure whether the performance issue is caused by this. I'm checking the database names and table names of the port events and route events. |
@liuh-80, in chassis, the following is the code path, and it seems to be using ConsumerStateTable:
const int routeorch_pri = 5;
RouteOrch::RouteOrch(DBConnector *db, vector<table_name_with_pri_t> &tableNames, SwitchOrch *switchOrch, NeighOrch *neighOrch, IntfsOrch *intfsOrch, VRFOrch *vrfOrch, FgNhgOrch *fgNhgOrch, Srv6Orch *srv6Orch) :
Orch::Orch(DBConnector *db, const vector<table_name_with_pri_t> &tableNames_with_pri)
void Orch::addConsumer(DBConnector *db, string tableName, int pri) |
@saksarav-nokia, thanks. The issue needs more investigation; I will try to reproduce it first. |
@liuh-80 , We can easily reproduce this in our setup. Let me know if you want us to collect any info or logs? |
@saksarav-nokia, can you share the reproduction steps, OS version, and hardware SKU with me? |
admin@ixre-egl-board211:~$ show version
SONiC Software Version: SONiC.HEAD.798897-202405-3192720893
Platform: x86_64-nokia_ixr7250e_36x400g-r0 |
What commands do I need to run to create the BGP routes and the port up event? Also, what is the signal that the port up event is blocked by BGP routes? Do I need to check syslog? |
@liuh-80, we have 36 ebgp neighbors and 6 ibgp neighbors with 34 routes from each ebgp neighbor. We just reboot this line card to see the issue.
IPv4 Unicast Summary:
Neighbhor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd NeighborName
3.3.3.24 4 65100 16567 16535 0 0 0 04:32:04 442664 ASIC0
Total number of neighbors 32
dmin@ixre-egl-board211:
Ethernet-IB0 219 10G 9100 N/A Recirc0/0 routed up up N/A off |
@liuh-80 , We have 36 front panel ports in the asic and connected to Arista VM. We have enabled bgp protocol in both our chassis and Arista VM which established bgp neighbor and the routes are injected from Arista vm. |
@mlok-nokia @fountzou for viz |
Update: the issue seems to be caused by the following code: void Consumer::execute()
} Here is my theory:
I modified the test case to simulate this scenario, and the while loop does seem to cause the issue:
#define DEFAULT_POP_BATCH_SIZE (128)
void ProducerStateTableSet(ProducerStateTable &table, string key)
TEST(Priority, massive_route_block_portstatus)
} Test result: |
@liuh-80 thanks so much for the investigation. Just curious: in this theory, does the number of BGP neighbors matter? |
I don't understand how BGP neighbors are handled by orchagent, so I'm not sure whether the BGP neighbors are related to this issue. |
I found that the issue may be caused by this change: sonic-net/sonic-swss@9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08 I created a draft fix to verify that the change is the root cause: sonic-net/sonic-swss#3269 |
Being looked into actively within MSFT. |
@liuh-80 I built an image (latest master) with this change and tested it. On the first boot after installation on a single linecard, all ports come up in 8 minutes and all 34k routes are also installed. On a subsequent reboot of a single linecard, it takes about 7 minutes for all links to come up and all 34k routes to be installed. It seems this change addresses the issue. We need to do more testing to verify that, including the OC testing. |
Since the change has been verified to fix the issue, I published it for review and comments: |
I am looking at the orchagent crashes seen with this fix. |
Fix port up/bfd sessions bringup notification delay issue.
Why I did it
Fix the following issue: sonic-net/sonic-buildimage#19569
How I did it
Revert the change in Consumer::execute() that was introduced by this commit: 9258978#diff-96451cb89f907afccbd39ddadb6d30aa21fe6fbd01b1cbaf6362078b926f1f08
The change in that commit adds a while loop:
    do {
        std::deque<KeyOpFieldsValuesTuple> entries;
        table->pops(entries);
        update_size = addToSync(entries);
    } while (update_size != 0);
The addToSync method returns the size of entries, so the loop keeps popping until the table is empty. This means that when there is a massive number of route notifications, other high-priority notifications, for example port up notifications, are blocked until all route notifications have been handled.
Closing because the fix PR has been merged. |
Issue Description
The ports take more than 20 minutes to come up after a reload/reboot in the T2 topology, because orchagent is delayed in processing the port up notifications.
Results you see
The port up notifications are queued behind a large number of BGP route updates (34000 routes) and take a long time to be processed.
This occurs after a config reload or a reboot.
Results you expected to see
The bgp routes update should be handled correctly and ports should come up in a reasonable time.
Is it platform specific
generic
Relevant log output
No response
Output of
show version
admin@ixre-egl-board15:~$ show ver
SONiC Software Version: SONiC.HEAD.742851-nokia-master-ef3457c7
SONiC OS Version: 12
Distribution: Debian 12.5
Kernel: 6.1.0-11-2-amd64
Build commit: ef3457c7
Build date: Fri Jun 14 19:31:02 UTC 2024
Built by: gitlab-runner@sonic-build-server04
Attach files (if any)
No response