WR - known issues (all issues combined) #309

Open
alyxazon opened this issue Jul 19, 2022 · 1 comment

Comments

@alyxazon
Collaborator

Issue #266: WR node hangs in endless BOOTP loop, when WR PTP link is not established

dietrichb commented on Feb 2, 2021

In rare cases it might happen that a WR node is unable to establish a WR PTP link. In this case the following symptoms are commonly observed.

   node issues a BOOTP request via the network ~once every second
   (BOOTP server replies with IP)
   when reading the relevant register of the WR core, the IP matches the one from the BOOTP server reply: it seems the reply from the BOOTP server has been received by the node
   (I believe the console claims 'in training' for IP)
   the node continues issuing BOOTP requests until the WR PTP link is successfully established
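
A minimal sketch of how the "~once every second" pattern could be flagged automatically, assuming the arrival times of BOOTP requests from one node have already been extracted from a network capture. The function name, the thresholds and the input format are hypothetical and not part of any WR tooling.

```python
def looks_like_bootp_loop(request_times_s, min_requests=10,
                          period_s=1.0, tolerance_s=0.5):
    """Return True if the requests repeat roughly once per period_s.

    request_times_s: sorted arrival times (seconds) of BOOTP requests
    from a single node. All thresholds are illustrative assumptions.
    """
    if len(request_times_s) < min_requests:
        return False  # too few requests to call it a loop
    # Gaps between consecutive requests
    gaps = [b - a for a, b in zip(request_times_s, request_times_s[1:])]
    return all(abs(g - period_s) <= tolerance_s for g in gaps)
```

A healthy node would normally stop requesting once it has an IP, so a long run of evenly spaced requests is the part worth alerting on.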

I am not sure whether this is an issue with the way the WR core is instantiated on Arria II devices or an issue of the WR core itself. Should we file this issue on OHWR?

Issue #256: connection between White Rabbit node and switch unreliable after reboot of WRS

dietrichb commented on Feb 2, 2021

A variety of symptoms is observed when rebooting a WRS to which a WR node is connected. This can happen during maintenance, when the WRS is rebooted on purpose, or when recovering from a power cut.

   1. no White Rabbit lock; occasionally; WRS port claims 'WA_MSG' (waiting for message); node is accessible via the network
   2. no Ethernet link; rarely; WRS port claims 'link down'; node inaccessible
   3. 'hang up'; WRS port claims 'WA_MSG' and node MAC is detected by the WRS; node inaccessible via the network

In all cases, power-cycling the WR node helps
In cases '1' and '2' it is usually possible to recover by 'eb-reset' of the node.
In case '1', forcing a sequence port up->down->up on the WRS helps in some cases
In case '2', forcing a sequence port up->down->up on the WRS does not help
In case '3', the node seems to be almost dead. Access to the node is possible neither from the timing network nor from the host system (no chance for eb-reset). Forcing port ->down->up on the WRS does not help. Autorecovery of the WR node via the 'watchdog' implemented on the SCU does not work. A powercycle helps.
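
The recovery notes above can be condensed into a small lookup, sketched here under assumptions: the enum names and the ordering from least to most disruptive action are my own choices, and the action strings only describe what to do rather than calling any real tool.

```python
from enum import Enum


class Symptom(Enum):
    NO_WR_LOCK = 1   # WRS port claims WA_MSG, node reachable
    NO_ETH_LINK = 2  # WRS port claims 'link down', node unreachable
    HANG_UP = 3      # WRS port claims WA_MSG, MAC seen, node unreachable


# Actions reported to help, ordered from least to most disruptive
# (the ordering is an assumption, not stated in the issue).
RECOVERY = {
    Symptom.NO_WR_LOCK: [
        "force WRS port up->down->up (helps in some cases)",
        "eb-reset of the node (usually helps)",
        "power-cycle the node",
    ],
    Symptom.NO_ETH_LINK: [
        "eb-reset of the node (usually helps)",
        "power-cycle the node",
    ],
    Symptom.HANG_UP: [
        "power-cycle the node",
    ],
}


def recovery_steps(symptom):
    """Return the actions worth trying for a given symptom, in order."""
    return list(RECOVERY[symptom])
```

For example, `recovery_steps(Symptom.HANG_UP)` yields only the power cycle, reflecting that neither eb-reset, the port toggle nor the SCU watchdog helps in case '3'.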

Issue #111: WR port not reachable after power cycle of WR switch

dietrichb commented on Dec 15, 2018

symptoms

WRS

   port shows MAC and PTP state 6 (looks good)

node

   eb-mon shows LINK_UP and TRACKING (looks good)
   node not reachable via timing network (all EB requests time out)

when

   after reboot or power cycle of the WRS
   it may take a few power cycles of the WRS to trigger the bug

workaround

power cycle the node or restart the FPGA using eb-reset

dietrichb commented on Aug 20, 2019

solved for Arria 5 based platforms

requires major work (PHY control update) for Arria II based devices (SCU and VETAR)

Issue #51: WR port of node remains down after power cycle of node AND WR switch

dietrichb commented on Oct 23, 2017
 
There is an annoying bug that seems to occur when a node (SCU) and a WRS are switched on simultaneously after a power cut.

The symptoms are the following

    PPS LED not blinking, activity LED not blinking, link LED off
    eb-mon -v dev/wbm0 shows "LINK_DOWN" and "NO_SYNC"
    eb-console dev/wbm0 causes freezing of the ssh shell
    node fails to get an IP via BOOTP
    (but the WRS shows both "link up" and "activity" LEDs)
    node is not accessible via the WR network
    resetting the FPGA of the node via its Reset controller is possible and cures the symptom.
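
The status strings reported here and in issue #111 are distinct enough to tell the two cases apart. A minimal triage sketch, assuming the link and sync state have already been read from the eb-mon output by some other means (no parsing is attempted here, and the function name is hypothetical):

```python
def triage(link_state, sync_state, reachable):
    """Map observed node status to the matching known issue, if any.

    link_state / sync_state: strings as quoted in the issue text.
    reachable: whether EB requests via the timing network succeed.
    """
    if link_state == "LINK_DOWN" and sync_state == "NO_SYNC":
        return "#51"   # port remains down after simultaneous power-up
    if link_state == "LINK_UP" and sync_state == "TRACKING" and not reachable:
        return "#111"  # looks good, but all EB requests time out
    return None        # no match with these two issues
```

In both cases the issue text names a reset of the FPGA as the cure, so the result mainly helps with bookkeeping.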

Suspicion: The FPGA of the node "boots" much faster than the WRS. It somehow fails to detect "link up" after the WRS starts and remains trapped in the "link down" state.

This issue causes real annoyance in cases where major parts of the facility need to be recovered after a major power cut.

Maybe this is linked to another issue:

dietrichb commented on Aug 20, 2019

solved for Arria 5
not solved for Arria II (SCU and Vetar)

a fix for Arria II would require a major effort

dietrichb commented on Feb 2, 2021

update (January 2021): in rare cases this is also observed with fallout gateware
@dietrichb
Contributor

There is another issue. If a 'fallout' node locks to White Rabbit, it might look at different 'positions' within a 4 ns window. Once locked, it will always remain locked at its initial position.

This issue becomes obvious if one compares two timing receivers (time stamping or digital output) using the same signal. The time difference remains identical as long as neither of the two timing receivers is restarted. After a restart, however, the time difference might have a different value. This issue seems to be present for all form factors.
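
The comparison described above can be sketched as follows, assuming each run provides paired timestamps (in ns) of the same signal edges from the two receivers. The function names and the sample values are hypothetical and purely illustrative; they are not measured data.

```python
from statistics import mean


def mean_offset_ns(ts_a, ts_b):
    """Mean time difference (receiver B minus receiver A) over one run."""
    return mean(b - a for a, b in zip(ts_a, ts_b))


def offset_shift_ns(run_before, run_after):
    """Change of the mean offset between two runs.

    Each run is a (ts_a, ts_b) pair of timestamp lists in ns.
    A non-zero result after a restart indicates that a receiver
    locked at a different position.
    """
    return mean_offset_ns(*run_after) - mean_offset_ns(*run_before)


# Made-up example: receiver B is restarted between the two runs.
before = ([1000, 2000, 3000], [1001, 2001, 3001])
after = ([4000, 5000, 6000], [4003, 5003, 6003])
shift = offset_shift_ns(before, after)
```

With real data one would compare the shift against the measurement jitter before concluding that the lock position changed.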
