Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[portmgrd] regression: prevent runtime exception (crash) in configuring portchannel at boot #3432

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bradh352
Copy link
Contributor

@bradh352 bradh352 commented Dec 19, 2024

What I did
Prevent setting a default port MTU on PortChannel member ports as it will likely fail and cause portmgrd to exit. The current code in portmgr.cpp is setting a default value for an MTU (9100) even when its a PortChannel member, so this patch prevents that default value from being set.

Also if a user were to incorrectly specify an MTU on a Port that is a member of the port channel via config_db.json this too would bring down portmgrd, so catch that and just emit a warning instead (NOTE: the YANG model does NOT support checking/preventing an MTU set on a PORT that is part of a PORTCHANNEL, so this secondary issue should be caught and handled gracefully).

In order to not add much overhead for large port count systems, we are also lazily caching portchannel members and only using that cache on a new port being brought up or on failure to set an MTU.

Why I did it
The current code always attempts to set an MTU on the PORT by setting a default here:

/* If this is the first time we set port settings
* assign default admin status and mtu
*/
if (!configured)
{
admin_status = DEFAULT_ADMIN_STATUS_STR;
mtu = DEFAULT_MTU_STR;
m_portList.insert(alias);
}

Then applies it here:
if (!mtu.empty())
{
setPortMtu(alias, mtu);
SWSS_LOG_NOTICE("Configure %s MTU to %s", alias.c_str(), mtu.c_str());
}

So it isn't crashing because the user configured the MTU in the PORT config, but rather because it is done by default (in portmgr.cpp) when the port is created. (But it also would crash if a user set an MTU on a port which is bad since YANG doesn't do anything to prevent this).

NOTE: this only appears to crash on a freshly loaded config at boot, if you take an existing running configuration and modify it to add a portchannel it works since the PORT is already provisioned so the default MTU setting path isn't taken in the above referenced code.

This regression was caused by 8b99543 (Oct 2024)... but just reverting that patch isn't the right solution. The startup logic does not appear to be proper as it tries to set a default MTU regardless if its valid to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to go down:

2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)

How I verified it

Apply patch and verify this config no longer causes crash on Dell S5248F (Broadcom Trident3).

Tested on 202411 and master.

{
    "PORT": {
        "Ethernet0": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/1",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "1",
            "lanes": "49",
            "speed": "25000"
        },
        "Ethernet1": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/2",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "2",
            "lanes": "50",
            "speed": "25000"
        }
    },
    "PORTCHANNEL": {
        "PortChannel0001": {
            "admin_status": "up",
            "description": "management interface",
            "lacp_key": "auto",
            "min_links": "1"
        }
    },
    "PORTCHANNEL_INTERFACE": {
        "PortChannel0001": {
            "ipv6_use_link_local_only": "enable",
            "mac_addr": "02:d3:ab:fe:fd:c4"
        },
        "PortChannel0001|10.0.0.11/24": {}
    },
    "PORTCHANNEL_MEMBER": {
        "PortChannel0001|Ethernet0": {},
        "PortChannel0001|Ethernet1": {}
    }
}

Details if related
Signed-off-by: Brad House (@bradh352)

@bradh352 bradh352 requested a review from prsunny as a code owner December 19, 2024 00:37
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bradh352 bradh352 changed the title portmgrd: prevent runtime failure in setting MTU on portchannel member [portmgrd] prevent runtime exception (crash) in setting MTU on portchannel member Dec 19, 2024
@bradh352
Copy link
Contributor Author

@prsunny please review

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 2, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 2, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@prsunny prsunny requested review from dgsudharsan and prgeor January 6, 2025 18:36
github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 7, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
Copy link
Collaborator

@dgsudharsan dgsudharsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add UT to cover this scenario.

cfgmgr/portmgr.cpp Outdated Show resolved Hide resolved
@dgsudharsan
Copy link
Collaborator

@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948

Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

@bradh352
Copy link
Contributor Author

bradh352 commented Jan 7, 2025

@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948

Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

People using things like Ansible, don't use the CLI to set configuration. They modify the /etc/sonic/config_db.json which does nothing to prevent this. ALSO, in this case, as you can see from the /etc/sonic/config_db.json example I provided, no MTU is provided at all in the PORT configuration. Its being autopopulated somewhere as a default. I didn't try to track that down.

@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 277e0ae to 201d5b1 Compare January 7, 2025 02:32
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 84669f9 to d4b4b98 Compare January 7, 2025 02:55
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Copy link
Contributor Author

bradh352 commented Jan 7, 2025

Please add UT to cover this scenario.

I committed one, no idea if its right.

@bradh352 bradh352 requested a review from dgsudharsan January 7, 2025 09:45
@bradh352
Copy link
Contributor Author

bradh352 commented Jan 7, 2025

coverage looks good, any other comments?

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 15, 2025
…r [PR sonic-net#3432]

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@bradh352
Copy link
Contributor Author

I just updated this patch to try to address the expressed concerns and take into account the changes introduced in 8b99543 to hopefully not regress that. I'm in process of building and testing this so it may not be in a final state.

That said, I will take any feedback you have to offer if this is going in an acceptable direction or not.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 15, 2025
…r [PR sonic-net#3432]

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from a08cb4f to ecf32f8 Compare January 15, 2025 13:48
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Copy link
Contributor Author

I've installed this patch on a physical Dell S5248F and verified it works. However, it appears some test cases are being run by Azure CI are failing. I guess I need to look at them.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 15, 2025
…r [PR sonic-net#3432]

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 15, 2025
…r [PR sonic-net#3432]

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 16, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
Prevent setting the PORT MTU on PortChannel members as it will likely fail
and cause portmgrd to exit (The PortChannel itself is where the MTU gets set,
not the PORT).  The current code is setting a default value for an MTU (9100)
even when its a PortChannel member, so this patch prevents that default value
from being set.  Also if a user were to incorrectly specify an MTU on a Port
that is a member of the port channel via `config_db.json` this too would bring
down portmgrd, so catch that and just emit a warning instead.

The YANG model does NOT support checking for this.

In order to not add much overhead for large port count systems, we are also
lazily caching portchannel members and only using that cache on a new port
being brought up or on failure to set an MTU.

The current code *always* attempts to set an MTU on the PORT by setting a
default here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172
Then applies it here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226

So it isn't crashing because the user configured the MTU in the PORT config,
but rather because it is done by default when the port is created.   (But it
also would crash if a user set an MTU on a port which is bad since YANG doesn't
do anything to prevent this).

**NOTE**: this only appears to crash on a freshly loaded config at boot, if you
take an existing running configuration and modify it to add a portchannel it
works since the PORT is already provisioned so the default MTU setting path
isn't taken in the above referenced code.

This regression was caused by 8b99543 ... but
just reverting that patch isn't the right solution.  The startup logic does not
appear to be proper as it tries to set a default MTU regardless if its valid
to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to
go down:
```
2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" :
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
```

Signed-off-by: Brad House (@bradh352)
@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from a98cc21 to b9183b4 Compare January 16, 2025 11:40
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Copy link
Contributor Author

i've made as minimal a patch as possible while still passing all test cases now. please review. hopefully this is acceptable with the lazy caching of portchannel members.

@bradh352
Copy link
Contributor Author

@prsunny @dgsudharsan does the current change sufficiently address your performance concerns?

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 17, 2025
…r (PR sonic-net#3432)

Prevent setting the PORT MTU on PortChannel members as it will likely fail
and cause portmgrd to exit (The PortChannel itself is where the MTU gets set,
not the PORT).  The current code is setting a default value for an MTU (9100)
even when its a PortChannel member, so this patch prevents that default value
from being set.  Also if a user were to incorrectly specify an MTU on a Port
that is a member of the port channel via `config_db.json` this too would bring
down portmgrd, so catch that and just emit a warning instead.

The YANG model does NOT support checking for this.

In order to not add much overhead for large port count systems, we are also
lazily caching portchannel members and only using that cache on a new port
being brought up or on failure to set an MTU.

The current code *always* attempts to set an MTU on the PORT by setting a
default here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172
Then applies it here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226

So it isn't crashing because the user configured the MTU in the PORT config,
but rather because it is done by default when the port is created.   (But it
also would crash if a user set an MTU on a port which is bad since YANG doesn't
do anything to prevent this).

**NOTE**: this only appears to crash on a freshly loaded config at boot, if you
take an existing running configuration and modify it to add a portchannel it
works since the PORT is already provisioned so the default MTU setting path
isn't taken in the above referenced code.

This regression was caused by 8b99543 ... but
just reverting that patch isn't the right solution.  The startup logic does not
appear to be proper as it tries to set a default MTU regardless if its valid
to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to
go down:
```
2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" :
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
```

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 17, 2025
…r (PR sonic-net#3432)

Prevent setting the PORT MTU on PortChannel members as it will likely fail
and cause portmgrd to exit (The PortChannel itself is where the MTU gets set,
not the PORT).  The current code is setting a default value for an MTU (9100)
even when its a PortChannel member, so this patch prevents that default value
from being set.  Also if a user were to incorrectly specify an MTU on a Port
that is a member of the port channel via `config_db.json` this too would bring
down portmgrd, so catch that and just emit a warning instead.

The YANG model does NOT support checking for this.

In order to not add much overhead for large port count systems, we are also
lazily caching portchannel members and only using that cache on a new port
being brought up or on failure to set an MTU.

The current code *always* attempts to set an MTU on the PORT by setting a
default here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172
Then applies it here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226

So it isn't crashing because the user configured the MTU in the PORT config,
but rather because it is done by default when the port is created.   (But it
also would crash if a user set an MTU on a port which is bad since YANG doesn't
do anything to prevent this).

**NOTE**: this only appears to crash on a freshly loaded config at boot, if you
take an existing running configuration and modify it to add a portchannel it
works since the PORT is already provisioned so the default MTU setting path
isn't taken in the above referenced code.

This regression was caused by 8b99543 ... but
just reverting that patch isn't the right solution.  The startup logic does not
appear to be proper as it tries to set a default MTU regardless if its valid
to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to
go down:
```
2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" :
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
```

Signed-off-by: Brad House (@bradh352)
@prsunny
Copy link
Collaborator

prsunny commented Jan 17, 2025

@prsunny @dgsudharsan does the current change sufficiently address your performance concerns?

I still see this as a gap for dynamic lag members which are created after the config_db read. Can you confirm if you can modify the config_db (may be as a workaround) to have the correct values as expected and it works?

@bradh352
Copy link
Contributor Author

bradh352 commented Jan 17, 2025

@prsunny @dgsudharsan does the current change sufficiently address your performance concerns?

I still see this as a gap for dynamic lag members which are created after the config_db read. Can you confirm if you can modify the config_db (may be as a workaround) to have the correct values as expected and it works?

I'm not sure I understand what you are saying.

The issue has nothing to do with config_db.json at all, it is a regression affecting all portchannels brought up during boot (at least on real hardware as of Oct 2024 due to 8b99543). This is reproduced with no mtu specified at all in the config_db.json, because a default of 9100 is set in the c++ code of portmgr, this is not a value read from the config at all.

This is not observed in the current upstream code after the initial port bring-up, such as if adding a portchannel after boot because the code to set the default mtu is NOT called in that case. So yes, you have to reboot the switch with your saved configuration having a portchannel configured to observe the issue, but that is the only way in which it relates to config_db.json.

If you look at the minimal changeset in doTask(), I tried to make the change as comprehensible as possible there. The main fix there is to unset the MTU specified earlier ONLY IF it was not specified in the port configuration and it is a LAG Member.

github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 18, 2025
…r (PR sonic-net#3432)

Prevent setting the PORT MTU on PortChannel members as it will likely fail
and cause portmgrd to exit (The PortChannel itself is where the MTU gets set,
not the PORT).  The current code is setting a default value for an MTU (9100)
even when its a PortChannel member, so this patch prevents that default value
from being set.  Also if a user were to incorrectly specify an MTU on a Port
that is a member of the port channel via `config_db.json` this too would bring
down portmgrd, so catch that and just emit a warning instead.

The YANG model does NOT support checking for this.

In order to not add much overhead for large port count systems, we are also
lazily caching portchannel members and only using that cache on a new port
being brought up or on failure to set an MTU.

The current code *always* attempts to set an MTU on the PORT by setting a
default here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172
Then applies it here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226

So it isn't crashing because the user configured the MTU in the PORT config,
but rather because it is done by default when the port is created.   (But it
also would crash if a user set an MTU on a port which is bad since YANG doesn't
do anything to prevent this).

**NOTE**: this only appears to crash on a freshly loaded config at boot, if you
take an existing running configuration and modify it to add a portchannel it
works since the PORT is already provisioned so the default MTU setting path
isn't taken in the above referenced code.

This regression was caused by 8b99543 ... but
just reverting that patch isn't the right solution.  The startup logic does not
appear to be proper as it tries to set a default MTU regardless if its valid
to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to
go down:
```
2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" :
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
```

Signed-off-by: Brad House (@bradh352)
@bradh352 bradh352 changed the title [portmgrd] prevent runtime exception (crash) in setting MTU on portchannel member [portmgrd] regression: prevent runtime exception (crash) in configuring portchannel at boot Jan 18, 2025
github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 22, 2025
…r (PR sonic-net#3432)

Prevent setting the PORT MTU on PortChannel members as it will likely fail
and cause portmgrd to exit (The PortChannel itself is where the MTU gets set,
not the PORT).  The current code is setting a default value for an MTU (9100)
even when its a PortChannel member, so this patch prevents that default value
from being set.  Also if a user were to incorrectly specify an MTU on a Port
that is a member of the port channel via `config_db.json` this too would bring
down portmgrd, so catch that and just emit a warning instead.

The YANG model does NOT support checking for this.

In order to not add much overhead for large port count systems, we are also
lazily caching portchannel members and only using that cache on a new port
being brought up or on failure to set an MTU.

The current code *always* attempts to set an MTU on the PORT by setting a
default here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L163-L172
Then applies it here:
https://github.com/sonic-net/sonic-swss/blob/c20902f3195b5bf8a941045e131aa1b863b69fd0/cfgmgr/portmgr.cpp#L222-L226

So it isn't crashing because the user configured the MTU in the PORT config,
but rather because it is done by default when the port is created.   (But it
also would crash if a user set an MTU on a port which is bad since YANG doesn't
do anything to prevent this).

**NOTE**: this only appears to crash on a freshly loaded config at boot, if you
take an existing running configuration and modify it to add a portchannel it
works since the PORT is already provisioned so the default MTU setting path
isn't taken in the above referenced code.

This regression was caused by 8b99543 ... but
just reverting that patch isn't the right solution.  The startup logic does not
appear to be proper as it tries to set a default MTU regardless if its valid
to do so for the port or not.

Logs show this issue which is a critical failure causing the entire switch to
go down:
```
2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" :
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
```

Signed-off-by: Brad House (@bradh352)
@bradh352
Copy link
Contributor Author

@prsunny @dgsudharsan does the current change sufficiently address your performance concerns?

I still see this as a gap for dynamic lag members which are created after the config_db read. Can you confirm if you can modify the config_db (may be as a workaround) to have the correct values as expected and it works?

I'm not sure I understand what you are saying.

The issue has nothing to do with config_db.json at all, it is a regression affecting all portchannels brought up during boot (at least on real hardware as of Oct 2024 due to 8b99543). This is reproduced with no mtu specified at all in the config_db.json, because a default of 9100 is set in the c++ code of portmgr, this is not a value read from the config at all.

This is not observed in the current upstream code after the initial port bring-up, such as if adding a portchannel after boot because the code to set the default mtu is NOT called in that case. So yes, you have to reboot the switch with your saved configuration having a portchannel configured to observe the issue, but that is the only way in which it relates to config_db.json.

If you look at the minimal changeset in doTask(), I tried to make the change as comprehensible as possible there. The main fix there is to unset the MTU specified earlier ONLY IF it was not specified in the port configuration and it is a LAG Member.

@prsunny please reply or approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants