Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xcvrd] DPB support on platforms with CmisManagerTask enabled #500

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ishidawataru
Copy link

@ishidawataru ishidawataru commented Jun 14, 2024

Description

This commit fixes DPB support with CMIS transceivers.

CmisManagerTask's port_dict must be updated according to the port add/remove events.
This commit removes the port_mapping field from CmisManagerTask as
port_mapping was mostly used just for storing asic_id information
and that can be simply done by port_dict instead.

Added a helper method get_asic_id() method to CmisManagerTask for
getting asic_id from logical_port.

Motivation and Context

fixes sonic-net/sonic-buildimage#18893

How Has This Been Tested?

I tested on the VS environment.

Additional Information (Optional)

Copy link

linux-foundation-easycla bot commented Jun 14, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@ishidawataru ishidawataru changed the title [xcvrd] DPB support with CmisManagerTask enabled [xcvrd] DPB support on platforms with CmisManagerTask enabled Jun 14, 2024
@mihirpat1
Copy link
Contributor

@ishidawataru Can you please fix the code coverage check failure?

@ishidawataru
Copy link
Author

@mihirpat1 I'm still communicating with the original reporter of this issue to check the fix actually works. After that I'll fix the code coverage failure.

@ishidawataru ishidawataru force-pushed the fix-xcvrd-dpb branch 4 times, most recently from 83ad867 to a5dd3e8 Compare June 25, 2024 09:38
@ishidawataru
Copy link
Author

@mihirpat1 Confirmed that the fix resolved the issue. Also, I fixed the coverage test failure. Please review.

Copy link
Collaborator

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ishidawataru There is not enough information for me to understand the problem. As long as CONFIG_DB port table has the right number of lanes and Speed specified, Xcvrd must be able to select the best matching application from the list of advertised application by the module. Do you see any issue in DPB port deletion path or DPB port creation path?

@ishidawataru
Copy link
Author

@prgeor This PR fixes sonic-net/sonic-buildimage#18893. It is not about the application selection. The current CmisManagerTask implementation doesn't delete the port info from port_dict when ports are deleted due to DPB configuration, which causes a crash.

@@ -896,7 +897,13 @@ def on_port_update_event(self, port_change_event):

self.force_cmis_reinit(lport, 0)
else:
# PORT_DEL event for the same lport happens 3 times because
# we are subscribing to CONFIG_DB, STATE_DB|TRANSCEIVER_INFO, and STATE_DB|PORT_TABLE.
# We only handle the first one and ignore the rest.
Copy link
Contributor

@longhuan-cisco longhuan-cisco Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing a new attribute deleted_ports and its corresponding update logic, you may be able to simply rely on port_change_event's existing db_name(/table_name) to only proceed to do update_port_transceiver_status_table_sw_cmis_state for the case of STATE_DB|TRANSCEIVER_INFO (since update_port_transceiver_status_table_sw_cmis_state is updating SW state of transceiver for the case of transceiver removal)?

self.db_name = db_name
self.table_name = table_name

Copy link
Author

@ishidawataru ishidawataru Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/sonic-net/sonic-platform-daemons/pull/500/files#diff-fc70c43977a1ce57f3a6e9d7d09cb7ea5204f52b3adf3162936205e5680efb2fR899-R902

@longhuan-cisco Thank you for the suggestion. I was able to fix the issue without introducing deleted_ports.

I initially added deleted_ports because the CmisManagerTask handles port events in two stages.

The first stage is done in on_port_update_event via port_change_observer.handle_port_update_event() and the second stage is done after returning from port_change_observer.handle_port_update_event() by iterating over port_dict.

If we delete the port info from port_dict in the first stage, the second stage cannot handle the port.
By introducing deleted_ports, the original code was deleting the port info from port_dict after finishing the second stage.
However, it looks like currently, the second stage is not doing anything for the removed port. so deleting the port info in the first stage is safe.

Though this works, I think the current two-stage design is bug-prone, because we might want to add some handling(e.g. de-initialize) for the removed port in the second stage.
If you agree, I'd like to refactor the current design in a different PR.

self.handle_cmis_state_machine(lport, info, is_fast_reboot)

for event in self.deleted_ports.values():
self.port_dict.pop(event.port_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make the flow/logic simpler by just directly deleting self.port_dict[lport] entry at the PORT_DEL event of CONFIG_DB in on_port_update_event()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

I explained the background of this change here

self.port_dict[lport]['appl'] = 0
self.port_dict[lport]['host_lanes_mask'] = 0
continue
def handle_cmis_state_machine(self, lport, info, is_fast_reboot):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here seems to be refactoring today's cmis_mgr's task_worker function by splitting it into sub functions. I'm assuming the flow/logic of cmis_mgr's task_worker doesn't get changed, it's mostly just splitting functions. But this makes the number of LOC (Line of change) very big. I feel It's a little bit difficult for other people to distinguish the actual change of DPB fix from the entire change set of this PR.

Thus, I feel it might be better to put this refactoring change set in a separate PR. It will also be less of a work for release mgr to revert in case of anything broken. Just my 2 cents, not something must-have.
Not sure how @mihirpat1 or Prince feel about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@longhuan-cisco Agreed that splitting the current changeset into 2 separate PRs will be more helpful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@longhuan-cisco @mihirpat1 I'm sorry for the late response. Thanks for the suggestion.

I'll move the refactoring commit to a separate PR and only include the DPB fix in this PR.

@ishidawataru
Copy link
Author

@longhuan-cisco @mihirpat1 I updated the PR based on @longhuan-cisco 's suggestion. Please review.

I used force-push and removed the original commits. I backed up the original branch here.

self.update_port_transceiver_status_table_sw_cmis_state(lport, CMIS_STATE_REMOVED)
self.port_dict.pop(lport)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a transceiver is removed (not doing port breakout), the port would be removed from port_dict. When the transceiver is reattached to the port, there would be no port-related information in port_dict. I think the port should only be removed from port_dict after receiving a delete CONFIG_DB|PORT event and update cmis state after receiving a delete STATE_DB|TRANSCEIVER_INFO event.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment. You're right. I'll update the code as you suggested.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chiourung I fixed the issue with this commit. 954e9cb

Could you take a look?

CmisManagerTask's `port_dict` and `port_mapping` must be updated
according to the port add/remove events.

Before this commit, `port_mapping` is only intialized when
CmisManagerTask is initialized and not updated after that,
which was causing KeyError exception when DBP is used.
(sonic-net/sonic-buildimage#18893)

This commit removes the `port_mapping` field from CmisManagerTask as
`port_mapping` was used just for storing `asic_id` information
and that can be simply done by `port_dict` instead.

Also, this commit updates `port_dict` accoding to the port add/remove
events to support DPB.

Signed-off-by: Wataru Ishida <[email protected]>
Signed-off-by: Wataru Ishida <[email protected]>
@ishidawataru
Copy link
Author

@prgeor Hi, just a gentle ping to check if I need to do something to get this PR merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[xcvrd] xcvrd crashes during breakout port
5 participants