
Extend benchmarks #74

Open · wants to merge 29 commits into main

Conversation

@bentaculum (Contributor) commented Oct 31, 2023

This is an initial attempt to extend the existing 2d benchmarks to a 3d C. elegans dataset from the Cell Tracking Challenge. I'm not sure about many things in here, for example:

  • How should the downloaded datasets be cached in the GitHub action?
  • How can duplicate code between the 2d and 3d benchmarks be reduced while still using fixtures? Specifically, I can't get pytest.mark.parametrize to consume fixtures.

@bentaculum added the enhancement and help wanted labels Oct 31, 2023
@tlambert03 (Contributor) commented Oct 31, 2023

If the data had been created by one job in this repo and then used by another job, the upload-artifact and download-artifact actions would be useful. But since you're downloading data from elsewhere, you don't want artifacts... you want the actions/cache action.

I do a similar thing for the nd2 repo; you could use it as a model. I'd suggest making a separate script (scripts/download_samples.py) that downloads all the necessary test data, and then using the hash of that script as the cache key for the action.
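
A rough sketch of what such a script might look like; the dataset names, URLs, and target directory below are placeholders, not taken from this PR:

# scripts/download_samples.py -- hypothetical sketch, not the script in this PR.
import urllib.request
import zipfile
from pathlib import Path

DATA_DIR = Path("downloads")

# Placeholder Cell Tracking Challenge dataset names and URLs.
DATASETS = {
    "Fluo-N2DL-HeLa": "http://data.celltrackingchallenge.net/training-datasets/Fluo-N2DL-HeLa.zip",
    "Fluo-N3DH-CE": "http://data.celltrackingchallenge.net/training-datasets/Fluo-N3DH-CE.zip",
}

def download_samples() -> None:
    """Download and unpack any dataset that is not already present."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    for name, url in DATASETS.items():
        if (DATA_DIR / name).exists():
            continue  # already downloaded, nothing to do
        archive = DATA_DIR / f"{name}.zip"
        urllib.request.urlretrieve(url, archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(DATA_DIR)
        archive.unlink()  # drop the zip once extracted

if __name__ == "__main__":
    download_samples()

In the workflow, something like ${{ hashFiles('scripts/download_samples.py') }} could then serve as part of the cache key, so the cache is invalidated whenever the download script changes.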

@msschwartz21 (Collaborator)

> How to reduce duplicate code between 2d/3d, while still using fixtures? Specifically, I can't get pytest.mark.parametrize to consume fixtures.

I'm taking a look at the code locally now, but based on my experience when I was getting the benchmarking running initially, I wouldn't be surprised if the benchmark fixture didn't play well with parametrize. This thread might have some good leads on how to get around at least the benchmark/parametrize issues.

I don't have any personal experience with parametrize and fixtures, but could you set up some sort of dictionary where you pass the key into parametrize and then construct the dictionary from the fixture data inside the test function?
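
For illustration, a minimal sketch of that pattern with the pytest-benchmark benchmark fixture; the loaders, fixture names, and the benchmarked call are placeholders, not the actual code in this PR:

import pytest

# Placeholder loaders standing in for the real 2d/3d CTC data loading.
def load_2d_data():
    return list(range(100))

def load_3d_data():
    return list(range(1000))

@pytest.fixture
def data_2d():
    return load_2d_data()

@pytest.fixture
def data_3d():
    return load_3d_data()

# Parametrize over plain string keys; parametrize itself never sees a fixture.
@pytest.mark.parametrize("key", ["2d", "3d"])
def test_matcher(benchmark, data_2d, data_3d, key):
    # Build the dictionary from the fixture values inside the test body.
    data = {"2d": data_2d, "3d": data_3d}[key]
    benchmark(sum, data)  # placeholder for the real matcher call

pytest's request.getfixturevalue(key) is an alternative that looks fixtures up by name inside the test, without requesting every fixture in every parametrized case.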


Re: caching

I also have an example of using caching for the benchmarking results (although maybe I should have used artifacts?):

- name: Retrieve cached baseline if available
  uses: actions/cache/restore@v3
  id: cache
  with:
    path: baseline.json
    key: ${{ github.event.pull_request.base.sha }}
- name: Run baseline benchmark if not in cache
  if: steps.cache.outputs.cache-hit != 'true'
  run: |
    git checkout ${{ github.event.pull_request.base.sha }}
    pytest tests/bench.py --benchmark-json baseline.json
- name: Cache baseline results
  uses: actions/cache/save@v3
  if: steps.cache.outputs.cache-hit != 'true'
  with:
    path: baseline.json
    key: ${{ github.event.pull_request.base.sha }}

Happy to help you get that part of the action worked out if you need it, but I think Talley has the right idea with the download script.

@tlambert03 (Contributor)

One thing that you might consider looking into is https://codspeed.io
I use it in psygnal and nd2 to run benchmarks during CI and catch potential regressions. It's quite good for that, but it's not great for running local benchmarks or doing retrospective benchmarks to see where a regression occurred after the fact. However, since it uses the same API as pytest-benchmark, it's probably not too hard to make a flag to switch between the two. Would you want me to play around with this a bit to see whether there are any patterns you might want to use here?

@tlambert03 (Contributor)

... eh, probably not worth it; the benchmark-action seems plenty good. codspeed is probably an unnecessary complication at the moment.

@msschwartz21 (Collaborator)

OK, after looking into it a bit more, it seems like parametrize works just fine with benchmark, which is great. It looks like this could be a workaround for the fixtures-in-parametrize issue.

@tlambert03 I noticed in the nd2 benchmarks that you use the decorator @pytest.mark.benchmark instead of the fixture. Any particular reason for that choice? It does seem to make the code a bit cleaner, but I was wondering if there is anything else to consider.

@bentaculum (author)

Thanks @tlambert03 @msschwartz21 for the suggestions. I'll come back with a second iteration of the PR.

@tlambert03 (Contributor)

> I was wondering if there is anything else to consider.

Yeah, only use the decorator if you want to time the entire test from start to finish with no setup or teardown (good for simple tests). But I don't always use it. See https://github.com/pyapp-kit/psygnal/blob/main/tests/test_bench.py for a file that uses both forms.
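
For reference, the two forms look roughly like this, assuming a plugin such as pytest-codspeed that provides both the @pytest.mark.benchmark decorator and a pytest-benchmark-style benchmark fixture; the helper functions are placeholders:

import pytest

def expensive_setup():
    # Placeholder for setup work that should not be timed.
    return list(range(10_000))

def work_to_measure(data):
    # Placeholder for the code under test.
    return sum(data)

# Decorator form: the whole test body is measured, setup included.
@pytest.mark.benchmark
def test_decorator_form():
    data = expensive_setup()
    work_to_measure(data)

# Fixture form: only the call handed to `benchmark` is measured,
# so setup and teardown stay out of the timing.
def test_fixture_form(benchmark):
    data = expensive_setup()
    benchmark(work_to_measure, data)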

@bentaculum mentioned this pull request Apr 12, 2024
@bentaculum added the tests and performance labels and removed the enhancement label Apr 12, 2024
@bentaculum (author)

Hi @msschwartz21 @tlambert03, I finally got to this and did the following:

  • Routines for downloading data are now in test_utils.py.
  • Added fixtures for the 2d and 3d datasets, with tests that parametrize the data input as laid out here.
  • Moved the actual tests for specific values of metrics on the Fluo-N2DL-HeLa dataset out of the benchmark.

Even though the benchmark runs fine locally, its GitHub action currently fails with an unknown error, to be fixed.

Left to do: caching the datasets in GitHub actions.

@bentaculum marked this pull request as ready for review April 12, 2024 13:42
@bentaculum changed the title from "Add 3d benchmark" to "Extend benchmarks" Apr 12, 2024
@msschwartz21 (Collaborator)

@bentaculum just wanted to give you a heads up that I'm in crunch mode finishing up writing my thesis. Hopefully next week I'll have time to review and help figure out what's causing the benchmarking action to fail.

@bentaculum (author)

No worries, and good luck!

@bentaculum (author)

Hi @msschwartz21, thanks for looking over this one and fixing it up.
It looks like the bottleneck now is loading the 3d data from disk; that's good.

I would suggest not limiting the benchmark to a few frames but instead running it on the entire video to catch possible increases in metric runtime.
Locally I get a total of 79s for the full video vs. 75s for three frames, so I changed it back to the full video.

@bentaculum (author)

If we want to speed up the benchmark, we could consider creating a subset of the 3d dataset on disk, or tweaking load_ctc_data to load only part of the dataset.

@bentaculum (author)

I still don't know why the benchmark runs locally but fails on GH actions, even with the increased time limit that you set...
It might be that the VMs are really slow.

@msschwartz21 (Collaborator)

Sometimes the action fails with a 143 error code, which I think means the VM is being terminated, probably due to maxing out its resources. It seems to always fail on the CTC matcher in 3d. I was going to try limiting the volume to a smaller ROI to reduce the number of nodes. If that works, we can keep the full length of the video.

@msschwartz21 (Collaborator)

Everything that I could find about GitHub actions failing pointed to a possible memory issue. I also remembered that we decided to err on the conservative side in terms of creating copies of data before possibly modifying a graph. I got rid of the copy in the initializer of the Matcher, which led to interesting results in the action:

  • tests/bench.py::test_ctc_matcher[3d] actually hits the timeout instead of failing immediately. The action then seems to fail during tests/bench.py::test_ctc_metrics[3d].
  • The unexpected outcome is that one of the division tests is now failing, which suggests that maybe our original paranoia about creating copies wasn't entirely misplaced...

@DragaDoncila (Collaborator) commented Sep 19, 2024

> The unexpected outcome is that one of the division tests is now failing, which suggests that maybe our original paranoia about creating copies wasn't entirely misplaced...

Just wanted to chime in on this, as it would be great to not have to copy the data. I have my local traccuracy install always with graph copying disabled, because in a lot of my pipelines, copying the graph just instantly hits my RAM limit, which is annoying for comparatively small datasets.

I'll also note that I think our load_ctc_data makes a copy of the data by keeping the frames in a list and then calling np.stack; we might want to try avoiding that by initializing a correctly sized np array first and then assigning into it. Opened #165 to track this.
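
For illustration only (not the actual load_ctc_data internals), the two patterns look roughly like this, with read_frame and the array shape as placeholders:

import numpy as np

def read_frame(i):
    # Placeholder for reading one segmentation frame from disk.
    return np.zeros((8, 64, 64), dtype=np.uint16)

n_frames = 10

# List-then-stack: every frame exists in the list *and* in the stacked copy.
frames = [read_frame(i) for i in range(n_frames)]
movie_copy = np.stack(frames)

# Preallocate-and-assign: a single array, each frame written into its slot.
first = read_frame(0)
movie = np.empty((n_frames, *first.shape), dtype=first.dtype)
for i in range(n_frames):
    movie[i] = read_frame(i)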

@msschwartz21 (Collaborator)

> I have my local traccuracy install always with graph copying disabled, because in a lot of my pipelines, copying the graph just instantly hits my RAM limit, which is annoying for comparatively small datasets.

Have you noticed any weird behavior when having the copy disabled?

But generally I agree that we should eliminate the copying. It was an easy way to get unblocked when we were getting started, but it's a pretty obvious limitation on dataset size.

@DragaDoncila (Collaborator)

> Have you noticed any weird behavior when having the copy disabled?

No, none. Caveat would be that I don't have any multithreaded/distributed code. But I do feel that if someone does have that, controlling edits and object access is on them, not on us.

I think we should definitely figure out what's going on in that failing test though, and whether it's likely to affect users or whether it's an artifact of how we wrote the test, etc.

@msschwartz21 (Collaborator)

It turns out the memory problem that we were encountering in tests/bench.py was due to including copy.deepcopy in multiple fixtures in order to avoid problems associated with running fixtures at module scope. The module scope was originally desirable in order to reduce the time spent loading the same data multiple times, but the deepcopy calls contributed to the large memory footprint of the benchmarking script.

I rescoped all of the benchmarking fixtures back to function scope and eliminated copy.deepcopy. As a side effect, I had to increase the timeout, since the data is now loaded inside each test function.
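
Roughly, the change amounts to the following sketch; load_3d_data stands in for the real data loading and none of the fixture names are taken from this PR:

import copy
import pytest

def load_3d_data():
    # Placeholder for loading a CTC dataset from disk.
    return {"frames": list(range(100))}

# Before: module scope loads once, but each test gets a deepcopy so that
# in-place modifications don't leak between tests -- expensive in memory.
@pytest.fixture(scope="module")
def shared_data():
    return load_3d_data()

@pytest.fixture
def data_before(shared_data):
    return copy.deepcopy(shared_data)

# After: function scope reloads the data for every test, so no copies are
# needed, at the cost of extra loading time (hence the larger timeout).
@pytest.fixture
def data_after():
    return load_3d_data()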

@bentaculum Can you take a look and confirm that, in the process of trying to get this to work on GitHub, I haven't eliminated anything that you wanted in there?

@bentaculum (author)

Hi @msschwartz21, thanks for looking into this once more. Given #166 and the memory issues on GH actions, it seems that loading the data separately is the way to go at the moment.
The benchmark now runs locally for me in ~200s. Somehow the benchmark on GH still failed above; I'm not sure why. It might be that the available virtual machine is much slower than my local one and hits the time limit (300s)?

@msschwartz21 (Collaborator)

Dang it, I didn't notice that the action failed after I merged in the latest changes 🤯 I'll take another look today and see if I can get it to pass.

Commit message: "Tried to move this change into a different PR, but that caused the action to start failing. This reverts commit c1ba1d5."
@msschwartz21 (Collaborator)

@bentaculum Looks like I messed up the action when I reverted the commit that eliminated copying of the graph in the matcher. I wanted to move it into another memory-related PR, but clearly it mattered for this one. I reverted the revert commit, and I think we're now actually good to go!

@bentaculum (author) commented Nov 11, 2024

Awesome 🎉! I suggest merging this as is.

Benchmarking takes a while now (~200s) due to loading datasets repeatedly, but on the plus side it now measures performance on a full standard CTC 3d dataset, which is a common use case, and it can be sped up significantly once the TrackingGraph is made agnostic to different metrics, as hinted at in #166.

@DragaDoncila @cmalinmayor any major concerns?

@bentaculum requested review from cmalinmayor and removed the request for msschwartz21 November 11, 2024 17:33
Labels: help wanted, performance, tests
4 participants