Gaps in Log data accompanied by screen flash - Priority 2 #48

taylorlm88 · 2022-02-22T15:10:25Z

Summary: Identify and fix cause for spontaneous all-off.

Description: Occasionally there is a flash on the arena screen (flash meaning the LEDs turn off for about ~20 ms). The log of this data then shows an accompanying ~20 ms gap in all data, including timestamps. This happens in any context, whether a trial is being run in a particular mode or whether the screens are just running an all-on command. We have been unable to recreate any particular circumstance in which this happens. Running the same code on different days will result in different frequency, but it does inevitably happen on occasion.

floesche · 2022-03-15T13:32:24Z

Maybe the delay is a short version of the issue #57, although there the "all on" command is delayed by seconds.

floesche · 2022-05-02T15:23:27Z

Based on the suggestion received via email, the following code was used for testing:

function gapsInLogXLSX(testCase)
    testCase.panelsController.setRootDirectory("C:\matlabroot\G4");
    testCase.panelsController.setControlMode(1);
    rsps = [];
    runTime = [];
    seqTime = [];            
    for ii=1:400
        testCase.panelsController.startLog()        
        testCase.panelsController.setPatternID(ii);
        testCase.panelsController.setPatternFunctionID(ii);
        testCase.panelsController.setPositionX(ii);
        startTime = tic;
        rsp = testCase.panelsController.startDisplay(23);
        seqComplete = toc(startTime);
        rsps = [rsps; rsp];
        runTime = [runTime; 23];
        seqTime = [seqTime; seqComplete];
        testCase.panelsController.stopLog();
    end
    T = table(rsps, runTime, seqTime);
    writetable(T, "gapsInLog.xlsx", "Sheet", "gaps")
end

With G4 Host version 237 the following metrics were produced:

Threshold	Fails run 1	Fails run 2	Fails run 3
0.02s	1%	1%	1%
0.01s	1.5%	1%	1%
0.008s	2.75%	1.75%	1.25%
0.005s	40.5%	34.75%	26.75%

floesche · 2022-05-02T16:52:12Z

G4 Host version 239 was suggested as a solution to this issue. Similar metrics for this version looks like this:

Threshold	Fails run 1	Fails run 3	Fails run 5
0.02s	1%	0.25%	0%
0.01s	1.5%	1%	0.25 %
0.008s	2%	1.75%	2%
0.005s	29.5%	40%	41%

After the first iteration of the test the above summary suggests that there are roughly the same amount of issues happening, so version v239 doesn't seem to fix the issue.

floesche · 2022-05-02T17:12:41Z

Looking into the issue a bit more in depth:

The gapsInLog.xlsx file generated from the script above can be used to identify the trials where the gaps occur. For example, the attached file shows a slowed down response for the 396th trial of around 2,319s (compared to the median of around 2.304).

I then looked into the TDMS-log files of that trial, specifically the *_Frame_Time.tdms. There I calculated the difference between two successive frames and plotted the value. In general, a difference of 2000µs is expected (for 500Hz FPS), but in this specific example the frame takes 18000µs.

In other cases of the same test I have seen gaps as big s 50000µs

...or 42000µs

Since the total number of trials above the threshold hasn't changed between version 237 and 239, the assumption is that v239 doesn't fix the issue.

floesche · 2022-05-03T20:09:01Z

For the tests before I had used a wrong combination of pattern and arena configuration (4x12 pattern, 3x12 arena).

Configured correctly, the issue seems to become less frequent on my setup. I ran 3 full experiments with v239 and only in one of them I found the error using the same strategy (inside 05-03-2022_15-03-43-46_Frame_Time.tdms and 05-03-2022_15-03-57-29_Frame_Time.tdms of gapsInLog-v239-test9.zip ):

and

floesche · 2022-05-13T14:00:52Z

After updating the G4 Host to v240, I ran two tests, both produced the same type of errors.

For gapsInLog-v240-test1 one example for an error is inside 05-12-2022_15-03-31-16_Frame_Time.tdms around the timestamp 200332765.

For gapsInLog-v240-test2 there are 9 errors within 28 frames (timestamp 50192333 to 50270332) in 05-12-2022_17-05-28-35_Frame_Time.tdms.

floesche · 2022-05-13T14:38:50Z

Response via email:

Looking at the first set of v240 data, it showed an error interval of 108-109 runs again and I also didn't see any dequeue error. I ran the test 7 times (2800 total scans) on my PC without an issue so this is difficult to debug. I have to come up with some more debug utilities to help me isolate this issue.

floesche · 2022-05-16T15:35:45Z

Via email:

Can you run this tester for about an hour or so whenever you have some time and take a screenshot of the front panel at the end?

floesche · 2022-05-16T15:37:50Z

50 minutes after starting the bandwidth tester:

70 hours after starting the bandwidth tester:

floesche · 2022-05-16T15:38:03Z

response via email:

It is a bit concerning that you are getting errors with the tester. The tester is allocating a buffer the size of 3 frames worth of data and the host is filling in another frame worth of data whenever there is enough space available. The FPGA is pulling out 1 frame worth of data every 2 ms. So on 3 separate occasions, there was not enough data in the DMA for the FPGA to pull out. I wonder if there is something that is interrupting the DMA transfer occasionally. I could try making the buffer larger to mitigate these interruptions, but the trade off will be longer latency between new patterns (shouldn't result in blank frames).

floesche · 2022-05-16T15:38:18Z

if this happens only 3 times during a 66 hour interval, then we can live with the problem.

If it happens several times during a 20min experiment (as it does for the test scripts), then this would ruin a lot of valuable experiments.

Blank frames (or even delayed frames) are not an option for our experiments, they ruin data as well. For an increasing number of our experiment we are using electrophysiology and rely on continuous delivery of stimuli -- any small deviation from the 2ms frame length might look like a false signal in our measurements. I just wanted to give you this as an explanation why we are so persistent on getting this issue fixed now -- or at least understand why we are seeing them on some of our setups so we can avoid the same problem on others.

floesche · 2022-05-16T15:38:36Z

response via email:

Just to clarify, when I say increase the latency, this means the latency between sending new patterns and not the latency between frames. There would still be a 2ms period, but increasing the buffer size would take longer for new patterns to show because it needs to play out the ones that are already in the buffer.

floesche · 2022-05-16T15:39:02Z

Also that is not really an option: Closed loop experiments, where the behavior of the fly or even just the elecrophysiological reaction triggers a change of the stimulus, are an important part of our experiments. If anything, these type of latencies should become shorter if possible.

achiusciotex · 2022-07-01T13:22:59Z

Test build. Reworked FPGA polling and Frame data passing to see if it alleviates issue with gaps in the play out.
https://www.dropbox.com/s/oonxp04zx4v7fn8/G4%20Host%28Gap%20test%20build%29.zip?dl=0

floesche · 2022-08-01T18:43:17Z

To confirm that the error might get fixed with v243, I reran the test with v240 and I was still able to replicate the issue, for example in gapInLogs-v240-test11, 08-01-2022_10-10-56-56_Frame_Time:

After the upgrade to v243 (linked above) I still see a similar error (gapsInLog-v243-test03, 08-01-2022_14-02-12-20_Frame_Time)

even several times within 400 trials (gapsInLog-v243-test03, 08-01-2022_14-02-32-08_Frame_Time)

achiusciotex · 2022-08-02T17:13:11Z

Can you go into the configuration file and change the GS16 Buffer Size (Stream) to 20000 and try it again?

floesche · 2022-08-04T22:03:57Z

I ran it a few times with the increased GS16 Buffer size and did not see the same problem. That's great.

For my understanding: What are the side effects of this change. What exactly are the expected changes in the timing?

achiusciotex · 2022-08-05T15:28:45Z

So this tells me that host software will occasionally have a blip where it prevents the next frames from being pushed into the buffer in time. This is probably due to windows being a non RT OS. Increasing the buffer size helps mitigate this moments because there are more frames in the buffer giving it enough time to recover. The issue with increasing the buffer size is higher latency for closed operations because the buffer needs to be emptied out before the new frames reach the arena. 20000 may be on the high end so you may be able to lower that. Each frame is around 2000 elements for GS=16. I may have to think of an alternative way to do this if the closed loop latency is too long.

floesche · 2022-08-05T16:12:14Z

Yesterday's measurements were promising, but I ran another test this morning while being at a meeting. I just came back and saw errors in at least 10 of the 400 trials (test gapsInLog-v243-test07).

Its not quantified at this point, but my impression is that the larger buffer also increases the length of the error once it occurs. Would that make sense?

in 08-05-2022_09-09-32-37_Frame_Time:

in 08-05-2022_09-09-45-30_Frame_Time:

in 08-05-2022_09-09-45-43_Frame_Time:

in 08-05-2022_09-09-45-52_Frame_Time:

in 08-05-2022_09-09-46-51_Frame_Time:

in 08-05-2022_09-09-55-18_Frame_Time:

in 08-05-2022_09-09-55-22_Frame_Time:

in 08-05-2022_09-09-55-40_Frame_Time:

in 08-05-2022_09-09-55-45_Frame_Time:

in 08-05-2022_09-09-55-58_Frame_Time:

It is possible and likely that I miss errors in the other trials since I am not actually parsing all the tdms files but instead use an imprecise pre-selection of trials that I actually look at. So this might only be the tip of the iceberg and I might have missed some errors in the previous trials.

Let me know if I should run other tests...

floesche · 2022-08-16T12:59:06Z

Response via email on 8/9/22:

I ran 2800 iterations and didn't have an issue with the gaps. I ran the test script below, the only change I made was commenting out the setPositonX command because it is not applicable in this mode. I renamed the version I was used to "HHMI G4 v244". Maybe I am not running the correct test, what test are you running when you test the gaps?

  function gapsInLogXLSX(testCase)
            testCase.panelsController.setRootDirectory("C:\matlabroot\G4");
            testCase.panelsController.setControlMode(1);
            rsps = [];
            runTime = [];
            seqTime = [];            
            for ii=1:400
                testCase.panelsController.startLog()        
                testCase.panelsController.setPatternID(ii);
                testCase.panelsController.setPatternFunctionID(ii);
              %  testCase.panelsController.setPositionX(ii);
                startTime = tic;
                rsp = testCase.panelsController.startDisplay(23);
                seqComplete = toc(startTime);
                testCase.panelsController.stopLog();
                rsps = [rsps; rsp];
                runTime = [runTime; 23];
                seqTime = [seqTime; seqComplete];
            end
            T = table(rsps, runTime, seqTime);
            writetable(T, "gapsInLog.xlsx", "Sheet", "gaps")
        end

floesche · 2022-08-16T13:22:56Z

You reset the G4 Host from v244 back to version v243. Is that intentionally or can I update to v244?

I ran exactly the same test (but with the setPositionX) when I got the reproducible errors. Right now (for about 800 iterations) I am not getting them, neither with or without the setPositionX. Is there anything else you changed on the setup of the computer?

I also pulled experimental data from an actual rig that we are using for experiments and I am seeing the error there as well. Just for a subset of experiments and reaching back to March, I see the problem happening consistently over time. In that case I was only looking for the longest delays in frames and the longest one was 2325043 (instead of 2000), with a mean maximum error (sorry, bit of a "wild" descriptive statistical concept) of around 9600.

achiusciotex · 2022-09-30T15:57:45Z

Tried to move frame data to the FPGA for more reliable frame outputs, but insufficient space on FPGA to store all the data. Will try running the software in 64-bit LabVIEW to see if any performance is more reliable.

Do you know if C-state is enabled in your BIOS? When C-state is enabled, it can affect performance of PCIe slots.

achiusciotex · 2023-11-10T16:42:14Z

Has this been resolved with the latest releases?

taylorlm88 · 2023-11-14T20:10:49Z

@floesche Has this issue been resolved?

taylorlm88 added the Main Host Main host issues for Andy label Feb 22, 2022

taylorlm88 changed the title ~~Gaps in Log data accompanied by screen flash~~ Gaps in Log data accompanied by screen flash - Priority 2 Feb 24, 2022

floesche mentioned this issue Mar 16, 2022

Fast succession of 'all on' - 'all off' breaks system #57

Open

floesche added a commit to floesche/LED-Display_G4_Display-Tools that referenced this issue May 2, 2022

add test case for JaneliaSciComp#48

cf8f9a2

floesche mentioned this issue Sep 1, 2023

Display glitching #87

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gaps in Log data accompanied by screen flash - Priority 2 #48

Gaps in Log data accompanied by screen flash - Priority 2 #48

taylorlm88 commented Feb 22, 2022 •

edited

Loading

floesche commented Mar 15, 2022

floesche commented May 2, 2022

floesche commented May 2, 2022 •

edited

Loading

floesche commented May 2, 2022

floesche commented May 3, 2022

floesche commented May 13, 2022

floesche commented May 13, 2022

floesche commented May 16, 2022 •

edited

Loading

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

achiusciotex commented Jul 1, 2022

floesche commented Aug 1, 2022

achiusciotex commented Aug 2, 2022

floesche commented Aug 4, 2022

achiusciotex commented Aug 5, 2022

floesche commented Aug 5, 2022

floesche commented Aug 16, 2022

floesche commented Aug 16, 2022

achiusciotex commented Sep 30, 2022

achiusciotex commented Nov 10, 2023

taylorlm88 commented Nov 14, 2023

Gaps in Log data accompanied by screen flash - Priority 2 #48

Gaps in Log data accompanied by screen flash - Priority 2 #48

Comments

taylorlm88 commented Feb 22, 2022 • edited Loading

floesche commented Mar 15, 2022

floesche commented May 2, 2022

floesche commented May 2, 2022 • edited Loading

floesche commented May 2, 2022

floesche commented May 3, 2022

floesche commented May 13, 2022

floesche commented May 13, 2022

floesche commented May 16, 2022 • edited Loading

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

floesche commented May 16, 2022

achiusciotex commented Jul 1, 2022

floesche commented Aug 1, 2022

achiusciotex commented Aug 2, 2022

floesche commented Aug 4, 2022

achiusciotex commented Aug 5, 2022

floesche commented Aug 5, 2022

floesche commented Aug 16, 2022

floesche commented Aug 16, 2022

achiusciotex commented Sep 30, 2022

achiusciotex commented Nov 10, 2023

taylorlm88 commented Nov 14, 2023

taylorlm88 commented Feb 22, 2022 •

edited

Loading

floesche commented May 2, 2022 •

edited

Loading

floesche commented May 16, 2022 •

edited

Loading