-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[portmgrd] regression: prevent runtime exception (crash) in configuring portchannel at boot #3432
base: master
Are you sure you want to change the base?
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@prsunny please review |
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add UT to cover this scenario.
@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948 Why do we need such checks at multiple places? @prsunny what are your thoughts on this? |
People using things like Ansible, don't use the CLI to set configuration. They modify the |
277e0ae
to
201d5b1
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
84669f9
to
d4b4b98
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
I committed one, no idea if its right. |
coverage looks good, any other comments? |
…r [PR sonic-net#3432] Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
I just updated this patch to try to address the expressed concerns and take into account the changes introduced in 8b99543 to hopefully not regress that. I'm in process of building and testing this so it may not be in a final state. That said, I will take any feedback you have to offer if this is going in an acceptable direction or not. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…r [PR sonic-net#3432] Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
a08cb4f
to
ecf32f8
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
I've installed this patch on a physical Dell S5248F and verified it works. However, it appears some test cases are being run by Azure CI are failing. I guess I need to look at them. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…r [PR sonic-net#3432] Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…r [PR sonic-net#3432] Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
…annel member (PR sonic-net#3432) Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel. Signed-off-by: Brad House (@bradh352)
Prevent setting the PORT MTU on PortChannel members as it will likely fail and cause portmgrd to exit (The PortChannel itself is where the MTU gets set, not the PORT). The current code is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set. Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via `config_db.json` this too would bring down portmgrd, so catch that and just emit a warning instead. The YANG model does NOT support checking for this. In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU. The current code *always* attempts to set an MTU on the PORT by setting a default here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172 Then applies it here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226 So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this). **NOTE**: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code. This regression was caused by 8b99543 ... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not. Logs show this issue which is a critical failure causing the entire switch to go down: ``` 2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted 2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected) ``` Signed-off-by: Brad House (@bradh352)
a98cc21
to
b9183b4
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
i've made as minimal a patch as possible while still passing all test cases now. please review. hopefully this is acceptable with the lazy caching of portchannel members. |
@prsunny @dgsudharsan does the current change sufficiently address your performance concerns? |
…r (PR sonic-net#3432) Prevent setting the PORT MTU on PortChannel members as it will likely fail and cause portmgrd to exit (The PortChannel itself is where the MTU gets set, not the PORT). The current code is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set. Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via `config_db.json` this too would bring down portmgrd, so catch that and just emit a warning instead. The YANG model does NOT support checking for this. In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU. The current code *always* attempts to set an MTU on the PORT by setting a default here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172 Then applies it here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226 So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this). **NOTE**: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code. This regression was caused by 8b99543 ... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not. Logs show this issue which is a critical failure causing the entire switch to go down: ``` 2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted 2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected) ``` Signed-off-by: Brad House (@bradh352)
…r (PR sonic-net#3432) Prevent setting the PORT MTU on PortChannel members as it will likely fail and cause portmgrd to exit (The PortChannel itself is where the MTU gets set, not the PORT). The current code is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set. Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via `config_db.json` this too would bring down portmgrd, so catch that and just emit a warning instead. The YANG model does NOT support checking for this. In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU. The current code *always* attempts to set an MTU on the PORT by setting a default here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172 Then applies it here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226 So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this). **NOTE**: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code. This regression was caused by 8b99543 ... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not. Logs show this issue which is a critical failure causing the entire switch to go down: ``` 2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted 2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected) ``` Signed-off-by: Brad House (@bradh352)
I still see this as a gap for dynamic lag members which are created after the config_db read. Can you confirm if you can modify the config_db (may be as a workaround) to have the correct values as expected and it works? |
I'm not sure I understand what you are saying. The issue has nothing to do with config_db.json at all, it is a regression affecting all portchannels brought up during boot (at least on real hardware as of Oct 2024 due to 8b99543). This is reproduced with no mtu specified at all in the config_db.json, because a default of This is not observed in the current upstream code after the initial port bring-up, such as if adding a portchannel after boot because the code to set the default mtu is NOT called in that case. So yes, you have to reboot the switch with your saved configuration having a portchannel configured to observe the issue, but that is the only way in which it relates to config_db.json. If you look at the minimal changeset in doTask(), I tried to make the change as comprehensible as possible there. The main fix there is to unset the MTU specified earlier ONLY IF it was not specified in the port configuration and it is a LAG Member. |
…r (PR sonic-net#3432) Prevent setting the PORT MTU on PortChannel members as it will likely fail and cause portmgrd to exit (The PortChannel itself is where the MTU gets set, not the PORT). The current code is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set. Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via `config_db.json` this too would bring down portmgrd, so catch that and just emit a warning instead. The YANG model does NOT support checking for this. In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU. The current code *always* attempts to set an MTU on the PORT by setting a default here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172 Then applies it here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226 So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this). **NOTE**: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code. This regression was caused by 8b99543 ... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not. Logs show this issue which is a critical failure causing the entire switch to go down: ``` 2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted 2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected) ``` Signed-off-by: Brad House (@bradh352)
…r (PR sonic-net#3432) Prevent setting the PORT MTU on PortChannel members as it will likely fail and cause portmgrd to exit (The PortChannel itself is where the MTU gets set, not the PORT). The current code is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set. Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via `config_db.json` this too would bring down portmgrd, so catch that and just emit a warning instead. The YANG model does NOT support checking for this. In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU. The current code *always* attempts to set an MTU on the PORT by setting a default here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172 Then applies it here: https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226 So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this). **NOTE**: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code. This regression was caused by 8b99543 ... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not. Logs show this issue which is a critical failure causing the entire switch to go down: ``` 2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted 2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected) ``` Signed-off-by: Brad House (@bradh352)
@prsunny please reply or approve |
What I did
Prevent setting a default port MTU on PortChannel member ports as it will likely fail and cause portmgrd to exit. The current code in portmgr.cpp is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set.
Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via
config_db.json
this too would bring down portmgrd, so catch that and just emit a warning instead (NOTE: the YANG model does NOT support checking/preventing an MTU set on a PORT that is part of a PORTCHANNEL, so this secondary issue should be caught and handled gracefully).In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU.
Why I did it
The current code always attempts to set an MTU on the PORT by setting a default here:
sonic-swss/cfgmgr/portmgr.cpp
Lines 163 to 172 in c20902f
Then applies it here:
sonic-swss/cfgmgr/portmgr.cpp
Lines 222 to 226 in c20902f
So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default (in portmgr.cpp) when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this).
NOTE: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code.
This regression was caused by 8b99543 (Oct 2024)... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not.
Logs show this issue which is a critical failure causing the entire switch to go down:
How I verified it
Apply patch and verify this config no longer causes crash on Dell S5248F (Broadcom Trident3).
Tested on 202411 and master.
Details if related
Signed-off-by: Brad House (@bradh352)