Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gaps in Log data accompanied by screen flash - Priority 2 #48

Open
taylorlm88 opened this issue Feb 22, 2022 · 24 comments
Open

Gaps in Log data accompanied by screen flash - Priority 2 #48

taylorlm88 opened this issue Feb 22, 2022 · 24 comments
Labels
Main Host Main host issues for Andy

Comments

@taylorlm88
Copy link
Collaborator

taylorlm88 commented Feb 22, 2022

Summary: Identify and fix cause for spontaneous all-off.

Description: Occasionally there is a flash on the arena screen (flash meaning the LEDs turn off for about ~20 ms). The log of this data then shows an accompanying ~20 ms gap in all data, including timestamps. This happens in any context, whether a trial is being run in a particular mode or whether the screens are just running an all-on command. We have been unable to recreate any particular circumstance in which this happens. Running the same code on different days will result in different frequency, but it does inevitably happen on occasion.

@taylorlm88 taylorlm88 added the Main Host Main host issues for Andy label Feb 22, 2022
@taylorlm88 taylorlm88 changed the title Gaps in Log data accompanied by screen flash Gaps in Log data accompanied by screen flash - Priority 2 Feb 24, 2022
@floesche
Copy link
Collaborator

Maybe the delay is a short version of the issue #57, although there the "all on" command is delayed by seconds.

floesche added a commit to floesche/LED-Display_G4_Display-Tools that referenced this issue May 2, 2022
@floesche
Copy link
Collaborator

floesche commented May 2, 2022

Based on the suggestion received via email, the following code was used for testing:

function gapsInLogXLSX(testCase)
    testCase.panelsController.setRootDirectory("C:\matlabroot\G4");
    testCase.panelsController.setControlMode(1);
    rsps = [];
    runTime = [];
    seqTime = [];            
    for ii=1:400
        testCase.panelsController.startLog()        
        testCase.panelsController.setPatternID(ii);
        testCase.panelsController.setPatternFunctionID(ii);
        testCase.panelsController.setPositionX(ii);
        startTime = tic;
        rsp = testCase.panelsController.startDisplay(23);
        seqComplete = toc(startTime);
        rsps = [rsps; rsp];
        runTime = [runTime; 23];
        seqTime = [seqTime; seqComplete];
        testCase.panelsController.stopLog();
    end
    T = table(rsps, runTime, seqTime);
    writetable(T, "gapsInLog.xlsx", "Sheet", "gaps")
end

With G4 Host version 237 the following metrics were produced:

Threshold Fails run 1 Fails run 2 Fails run 3
0.02s 1% 1% 1%
0.01s 1.5% 1% 1%
0.008s 2.75% 1.75% 1.25%
0.005s 40.5% 34.75% 26.75%

@floesche
Copy link
Collaborator

floesche commented May 2, 2022

G4 Host version 239 was suggested as a solution to this issue. Similar metrics for this version looks like this:

Threshold Fails run 1 Fails run 3 Fails run 5
0.02s 1% 0.25% 0%
0.01s 1.5% 1% 0.25 %
0.008s 2% 1.75% 2%
0.005s 29.5% 40% 41%

After the first iteration of the test the above summary suggests that there are roughly the same amount of issues happening, so version v239 doesn't seem to fix the issue.

@floesche
Copy link
Collaborator

floesche commented May 2, 2022

Looking into the issue a bit more in depth:

The gapsInLog.xlsx file generated from the script above can be used to identify the trials where the gaps occur. For example, the attached file shows a slowed down response for the 396th trial of around 2,319s (compared to the median of around 2.304).

image

I then looked into the TDMS-log files of that trial, specifically the *_Frame_Time.tdms. There I calculated the difference between two successive frames and plotted the value. In general, a difference of 2000µs is expected (for 500Hz FPS), but in this specific example the frame takes 18000µs.

image

In other cases of the same test I have seen gaps as big s 50000µs

image

...or 42000µs

image

Since the total number of trials above the threshold hasn't changed between version 237 and 239, the assumption is that v239 doesn't fix the issue.

@floesche
Copy link
Collaborator

floesche commented May 3, 2022

For the tests before I had used a wrong combination of pattern and arena configuration (4x12 pattern, 3x12 arena).

Configured correctly, the issue seems to become less frequent on my setup. I ran 3 full experiments with v239 and only in one of them I found the error using the same strategy (inside 05-03-2022_15-03-43-46_Frame_Time.tdms and 05-03-2022_15-03-57-29_Frame_Time.tdms of gapsInLog-v239-test9.zip ):

image and
image

@floesche
Copy link
Collaborator

After updating the G4 Host to v240, I ran two tests, both produced the same type of errors.

For gapsInLog-v240-test1 one example for an error is inside 05-12-2022_15-03-31-16_Frame_Time.tdms around the timestamp 200332765.

For gapsInLog-v240-test2 there are 9 errors within 28 frames (timestamp 50192333 to 50270332) in 05-12-2022_17-05-28-35_Frame_Time.tdms.

@floesche
Copy link
Collaborator

Response via email:

Looking at the first set of v240 data, it showed an error interval of 108-109 runs again and I also didn't see any dequeue error. I ran the test 7 times (2800 total scans) on my PC without an issue so this is difficult to debug. I have to come up with some more debug utilities to help me isolate this issue.

@floesche
Copy link
Collaborator

floesche commented May 16, 2022

Via email:

Can you run this tester for about an hour or so whenever you have some time and take a screenshot of the front panel at the end?

@floesche
Copy link
Collaborator

50 minutes after starting the bandwidth tester:

image

70 hours after starting the bandwidth tester:

image

@floesche
Copy link
Collaborator

response via email:

It is a bit concerning that you are getting errors with the tester. The tester is allocating a buffer the size of 3 frames worth of data and the host is filling in another frame worth of data whenever there is enough space available. The FPGA is pulling out 1 frame worth of data every 2 ms. So on 3 separate occasions, there was not enough data in the DMA for the FPGA to pull out. I wonder if there is something that is interrupting the DMA transfer occasionally. I could try making the buffer larger to mitigate these interruptions, but the trade off will be longer latency between new patterns (shouldn't result in blank frames).

@floesche
Copy link
Collaborator

if this happens only 3 times during a 66 hour interval, then we can live with the problem.

If it happens several times during a 20min experiment (as it does for the test scripts), then this would ruin a lot of valuable experiments.

Blank frames (or even delayed frames) are not an option for our experiments, they ruin data as well. For an increasing number of our experiment we are using electrophysiology and rely on continuous delivery of stimuli -- any small deviation from the 2ms frame length might look like a false signal in our measurements. I just wanted to give you this as an explanation why we are so persistent on getting this issue fixed now -- or at least understand why we are seeing them on some of our setups so we can avoid the same problem on others.

@floesche
Copy link
Collaborator

response via email:

Just to clarify, when I say increase the latency, this means the latency between sending new patterns and not the latency between frames. There would still be a 2ms period, but increasing the buffer size would take longer for new patterns to show because it needs to play out the ones that are already in the buffer.

@floesche
Copy link
Collaborator

Also that is not really an option: Closed loop experiments, where the behavior of the fly or even just the elecrophysiological reaction triggers a change of the stimulus, are an important part of our experiments. If anything, these type of latencies should become shorter if possible.

@achiusciotex
Copy link

Test build. Reworked FPGA polling and Frame data passing to see if it alleviates issue with gaps in the play out.
https://www.dropbox.com/s/oonxp04zx4v7fn8/G4%20Host%28Gap%20test%20build%29.zip?dl=0

@floesche
Copy link
Collaborator

floesche commented Aug 1, 2022

To confirm that the error might get fixed with v243, I reran the test with v240 and I was still able to replicate the issue, for example in gapInLogs-v240-test11, 08-01-2022_10-10-56-56_Frame_Time:

image

After the upgrade to v243 (linked above) I still see a similar error (gapsInLog-v243-test03, 08-01-2022_14-02-12-20_Frame_Time)

image

even several times within 400 trials (gapsInLog-v243-test03, 08-01-2022_14-02-32-08_Frame_Time)

image

@achiusciotex
Copy link

Can you go into the configuration file and change the GS16 Buffer Size (Stream) to 20000 and try it again?

@floesche
Copy link
Collaborator

floesche commented Aug 4, 2022

I ran it a few times with the increased GS16 Buffer size and did not see the same problem. That's great.

For my understanding: What are the side effects of this change. What exactly are the expected changes in the timing?

@achiusciotex
Copy link

So this tells me that host software will occasionally have a blip where it prevents the next frames from being pushed into the buffer in time. This is probably due to windows being a non RT OS. Increasing the buffer size helps mitigate this moments because there are more frames in the buffer giving it enough time to recover. The issue with increasing the buffer size is higher latency for closed operations because the buffer needs to be emptied out before the new frames reach the arena. 20000 may be on the high end so you may be able to lower that. Each frame is around 2000 elements for GS=16. I may have to think of an alternative way to do this if the closed loop latency is too long.

@floesche
Copy link
Collaborator

floesche commented Aug 5, 2022

Yesterday's measurements were promising, but I ran another test this morning while being at a meeting. I just came back and saw errors in at least 10 of the 400 trials (test gapsInLog-v243-test07).

Its not quantified at this point, but my impression is that the larger buffer also increases the length of the error once it occurs. Would that make sense?

in 08-05-2022_09-09-32-37_Frame_Time:
image

in 08-05-2022_09-09-45-30_Frame_Time:
image

in 08-05-2022_09-09-45-43_Frame_Time:
image

in 08-05-2022_09-09-45-52_Frame_Time:
image

in 08-05-2022_09-09-46-51_Frame_Time:
image

in 08-05-2022_09-09-55-18_Frame_Time:
image

in 08-05-2022_09-09-55-22_Frame_Time:
image

in 08-05-2022_09-09-55-40_Frame_Time:
image

in 08-05-2022_09-09-55-45_Frame_Time:
image

in 08-05-2022_09-09-55-58_Frame_Time:
image

It is possible and likely that I miss errors in the other trials since I am not actually parsing all the tdms files but instead use an imprecise pre-selection of trials that I actually look at. So this might only be the tip of the iceberg and I might have missed some errors in the previous trials.

Let me know if I should run other tests...

@floesche
Copy link
Collaborator

Response via email on 8/9/22:

I ran 2800 iterations and didn't have an issue with the gaps. I ran the test script below, the only change I made was commenting out the setPositonX command because it is not applicable in this mode. I renamed the version I was used to "HHMI G4 v244". Maybe I am not running the correct test, what test are you running when you test the gaps?

  function gapsInLogXLSX(testCase)
            testCase.panelsController.setRootDirectory("C:\matlabroot\G4");
            testCase.panelsController.setControlMode(1);
            rsps = [];
            runTime = [];
            seqTime = [];            
            for ii=1:400
                testCase.panelsController.startLog()        
                testCase.panelsController.setPatternID(ii);
                testCase.panelsController.setPatternFunctionID(ii);
              %  testCase.panelsController.setPositionX(ii);
                startTime = tic;
                rsp = testCase.panelsController.startDisplay(23);
                seqComplete = toc(startTime);
                testCase.panelsController.stopLog();
                rsps = [rsps; rsp];
                runTime = [runTime; 23];
                seqTime = [seqTime; seqComplete];
            end
            T = table(rsps, runTime, seqTime);
            writetable(T, "gapsInLog.xlsx", "Sheet", "gaps")
        end

@floesche
Copy link
Collaborator

You reset the G4 Host from v244 back to version v243. Is that intentionally or can I update to v244?

I ran exactly the same test (but with the setPositionX) when I got the reproducible errors. Right now (for about 800 iterations) I am not getting them, neither with or without the setPositionX. Is there anything else you changed on the setup of the computer?

I also pulled experimental data from an actual rig that we are using for experiments and I am seeing the error there as well. Just for a subset of experiments and reaching back to March, I see the problem happening consistently over time. In that case I was only looking for the longest delays in frames and the longest one was 2325043 (instead of 2000), with a mean maximum error (sorry, bit of a "wild" descriptive statistical concept) of around 9600.

@achiusciotex
Copy link

Tried to move frame data to the FPGA for more reliable frame outputs, but insufficient space on FPGA to store all the data. Will try running the software in 64-bit LabVIEW to see if any performance is more reliable.

Do you know if C-state is enabled in your BIOS? When C-state is enabled, it can affect performance of PCIe slots.

@achiusciotex
Copy link

Has this been resolved with the latest releases?

@taylorlm88
Copy link
Collaborator Author

@floesche Has this issue been resolved?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Main Host Main host issues for Andy
Projects
None yet
Development

No branches or pull requests

3 participants