Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add intelligence to xcvrd to understand process restart and config reload #356

Closed
jaganbal-a opened this issue May 19, 2023 · 5 comments · Fixed by sonic-net/SONiC#1432
Closed
Assignees
Labels

Comments

@jaganbal-a
Copy link

Upon xcvrd/pmon docker restart set the media settings in app DB which sends spurious SI programming from OA upon receiving. xcvrd should not set the settings upon process restart/pmon restart.

So add intelligence to xcvrd to understand the process restart and config reload.

-Create a new table "TABLE_xcvrd" by xcvrd to set a magic number and process restart count in STATE_DB upon fresh start. (config reload/cold reboot)
-Upon xcvrd process restart /pmon docker restart, xcvrd to check the magic number presence on coming up path to understand whether it is a restart case or a fresh start case.

  • Add a hook in config reload CLI handler to clear this magic number upon config reload.
    -Mask spurious notification during process restart
  • Possibly another flag can be added to the same table to differentiate between config reload and cold boot.
@shyam77git
Copy link

Also, need to discuss and understand multi-ASIC behavior/handling in these new use-cases

@shyam77git
Copy link

Based on technical discussion between CSCO-MSFT team, plan is to go with option#3 out of three options brainstormed

First option, files/image_config/misc/docker-wait-any can be used to build dependency b/w dockers, for eg- syncd/swss docker restart can trigger pmon docker restart, but this is not desirable, since pmon has other platform processes, we are primarily interested in restarting xcvrd. [NO-GO]

Second option, below APIs are invoked during swss restart, needs investigation if this could be leveraged to trigger xcvrd restart. [NO-GO]
stop_peer_and_dependent_services()
start_peer_and_dependent_services()

Third option (chosen one, updated previous proposal): #356 [AI-Mihir]
xcvrd creates “magic number” in APP-DB:-TABLE while coming up.
If only xcvrd crashes, then “magic number” is still present, so xcvrd will not renotify media_settings and will not reinit modules.
syncd/swss restart clears the entire APP-DB, including the “magic number” in APP-DB:-TABLE. No explicit change required.
xcvrd needs to subscribe to APP-DB:-TABLE to watch for this event and trigger self-restart (cold).
for non-CMIS modules, platform will do optics POR
for CMIS modules, xcvrd will pass ‘force_init’ flag to CMISTaskManager for CMIS FSM to perform DP_DEACTIVATE and DP reinitialization.

@amnsinghal
Copy link
Contributor

amnsinghal commented Jun 13, 2023

Updated proposal based on 6/12 CSCO-MSFT discussion to cover multi-NPU platforms:

  1. XCVRD main thread creates “magic number” in APPL-DB:PROCINFO-TABLE for the relevant namespace while coming up (creates only if "magic number" is not present) and sets the xcvrd_port_init_rqd_tbl accordingly for the ports belonging to the namespace.
  2. If only xcvrd crashes, then “magic number” is still present, so xcvrd will not renotify media_settings and will not reinit modules.
  3. XCVRD will subscribe to APPL-DB:PROCINFO-TABLE and trigger self-restart if the magic number is deleted for the namespace. XCVRD will be killed using SIGKILL. After re-spawn, CMIS re-init and media_settings notified is triggered for the ports belonging to the affected namespace.
  • The drawback here is that xcvrd will restart even if OA crashes for just 1 NPU.
  1. Based on the xcvrd_port_init_rqd_tbl, the CMIS FSM will perform DP_DEACTIVATE and DP reinitialization upon xcvrd re-spawn.
  2. syncd/swss restart clears the entire APPL-DB, including the “magic number” in APPL-DB:PROCINFO-TABLE in the namespace context.

@Junchao-Mellanox
Copy link
Collaborator

Junchao-Mellanox commented Jun 14, 2023

I just have another thought: why not fix the issue from portsorch.cpp side? Here is my proposal:

In portsorch.cpp, we shall cache all the serdes attributes. We shall compare the cached serdes attributes values with incoming values while doPortTask is called. If the value equals, no need apply the value to SAI.

Any comment? @prgeor @mihirpat1 @jaganbal-a

@snider-nokia
Copy link

snider-nokia commented Jun 15, 2023

XCVRD will subscribe to APPL-DB:PROCINFO-TABLE and trigger self-restart if the magic number is deleted for the namespace. XCVRD will be killed using SIGKILL. After re-spawn, CMIS re-init and media_settings notified is triggered for the ports belonging to the affected namespace.

Please don't use SIGKILL. SIGTERM is more appropriate, just in case there is platform specific cleanup that needs to done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment