
Extend benchmarks #74

Open · wants to merge 29 commits into main

Conversation

@bentaculum (Contributor) commented Oct 31, 2023

This is an initial attempt to extend the existing 2d benchmarks to a 3d C. elegans dataset from the Cell Tracking Challenge. I'm not sure about many things in here, for example:

  • How should the downloaded datasets be cached in the GitHub action?
  • How can duplicate code between the 2d and 3d benchmarks be reduced while still using fixtures? Specifically, I can't get pytest.mark.parametrize to consume fixtures.

@bentaculum added the enhancement and help wanted labels Oct 31, 2023
@tlambert03 (Contributor) commented Oct 31, 2023

If the data had been created by one job in this repo and then used by another job, the upload-artifact and download-artifact actions would be useful. But since you're downloading data from elsewhere, you don't want artifacts... you want the actions/cache action.

I do a similar thing for the nd2 repo; you could use it as a model. I'd suggest making a separate script (scripts/download_samples.py) that downloads all the necessary test data, and then using the hash of that script as the cache key for the action.
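
A rough sketch of what such a script might look like; the dataset names, URLs, and target directory below are placeholders, not taken from this PR:

# scripts/download_samples.py -- hypothetical sketch, not the script in this PR.
import urllib.request
import zipfile
from pathlib import Path

DATA_DIR = Path("downloads")

# Placeholder Cell Tracking Challenge dataset names and URLs.
DATASETS = {
    "Fluo-N2DL-HeLa": "http://data.celltrackingchallenge.net/training-datasets/Fluo-N2DL-HeLa.zip",
    "Fluo-N3DH-CE": "http://data.celltrackingchallenge.net/training-datasets/Fluo-N3DH-CE.zip",
}

def download_samples() -> None:
    """Download and unpack any dataset that is not already present."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    for name, url in DATASETS.items():
        if (DATA_DIR / name).exists():
            continue  # already downloaded, nothing to do
        archive = DATA_DIR / f"{name}.zip"
        urllib.request.urlretrieve(url, archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(DATA_DIR)
        archive.unlink()  # drop the zip once extracted

if __name__ == "__main__":
    download_samples()

In the workflow, something like ${{ hashFiles('scripts/download_samples.py') }} could then serve as part of the cache key, so the cache is invalidated whenever the download script changes.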

@msschwartz21 (Collaborator)

> How to reduce duplicate code between 2d/3d, while still using fixtures? Specifically, I can't get pytest.mark.parametrize to consume fixtures.

I'm taking a look at the code locally now, but based on my experience when I was getting the benchmarking running initially, I wouldn't be surprised if the benchmark fixture didn't play well with parametrize. This thread might have some good leads on how to get around at least the benchmark/parametrize issues.

I don't have any personal experience with parametrize and fixtures, but could you set up some sort of dictionary where you pass the key into parametrize and then construct the dictionary from the fixture data inside the test function?
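
For illustration, a minimal sketch of that pattern with the pytest-benchmark benchmark fixture; the loaders, fixture names, and the benchmarked call are placeholders, not the actual code in this PR:

import pytest

# Placeholder loaders standing in for the real 2d/3d CTC data loading.
def load_2d_data():
    return list(range(100))

def load_3d_data():
    return list(range(1000))

@pytest.fixture
def data_2d():
    return load_2d_data()

@pytest.fixture
def data_3d():
    return load_3d_data()

# Parametrize over plain string keys; parametrize itself never sees a fixture.
@pytest.mark.parametrize("key", ["2d", "3d"])
def test_matcher(benchmark, data_2d, data_3d, key):
    # Build the dictionary from the fixture values inside the test body.
    data = {"2d": data_2d, "3d": data_3d}[key]
    benchmark(sum, data)  # placeholder for the real matcher call

pytest's request.getfixturevalue(key) is an alternative that looks fixtures up by name inside the test, without requesting every fixture in every parametrized case.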


Re: caching

I also have an example of using caching for the benchmarking results (although maybe I should have used artifacts?):

- name: Retrieve cached baseline if available
  uses: actions/cache/restore@v3
  id: cache
  with:
    path: baseline.json
    key: ${{ github.event.pull_request.base.sha }}
- name: Run baseline benchmark if not in cache
  if: steps.cache.outputs.cache-hit != 'true'
  run: |
    git checkout ${{ github.event.pull_request.base.sha }}
    pytest tests/bench.py --benchmark-json baseline.json
- name: Cache baseline results
  uses: actions/cache/save@v3
  if: steps.cache.outputs.cache-hit != 'true'
  with:
    path: baseline.json
    key: ${{ github.event.pull_request.base.sha }}

Happy to help you get that part of the action worked out if you need it, but I think Talley has the right idea with the download script.

@tlambert03 (Contributor)

One thing that you might consider looking into is https://codspeed.io
I use it in psygnal and nd2 to run benchmarks during CI and catch potential regressions. It's quite good for that, but it's not great for running local benchmarks or doing retrospective benchmarks to see where a regression occurred after the fact. However, since it uses the same API as pytest-benchmark, it's probably not too hard to make a flag to switch between the two. Would you want me to play around with this a bit to see whether there are any patterns you might want to use here?

@tlambert03 (Contributor)

... eh, probably not worth it; the benchmark-action seems plenty good. codspeed is probably an unnecessary complication at the moment.

@msschwartz21 (Collaborator)

OK, after looking into it a bit more, it seems like parametrize works just fine with benchmark, which is great. It looks like this could be a workaround for the fixtures-in-parametrize issue.

@tlambert03 I noticed in the nd2 benchmarks that you use the decorator @pytest.mark.benchmark instead of the fixture. Any particular reason for that choice? It does seem to make the code a bit cleaner, but I was wondering if there is anything else to consider.

@bentaculum (author)

Thanks @tlambert03 @msschwartz21 for the suggestions. I'll come back with a second iteration of the PR.

@tlambert03 (Contributor)

> I was wondering if there is anything else to consider.

Yeah, only use the decorator if you want to time the entire test from start to finish with no setup or teardown (good for simple tests). But I don't always use it. See https://github.com/pyapp-kit/psygnal/blob/main/tests/test_bench.py for a file that uses both forms.
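
For reference, the two forms look roughly like this, assuming a plugin such as pytest-codspeed that provides both the @pytest.mark.benchmark decorator and a pytest-benchmark-style benchmark fixture; the helper functions are placeholders:

import pytest

def expensive_setup():
    # Placeholder for setup work that should not be timed.
    return list(range(10_000))

def work_to_measure(data):
    # Placeholder for the code under test.
    return sum(data)

# Decorator form: the whole test body is measured, setup included.
@pytest.mark.benchmark
def test_decorator_form():
    data = expensive_setup()
    work_to_measure(data)

# Fixture form: only the call handed to `benchmark` is measured,
# so setup and teardown stay out of the timing.
def test_fixture_form(benchmark):
    data = expensive_setup()
    benchmark(work_to_measure, data)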

@bentaculum mentioned this pull request Apr 12, 2024
@bentaculum added the tests and performance labels and removed the enhancement label Apr 12, 2024
@bentaculum (author)

Hi @msschwartz21 @tlambert03, I finally got to this and did the following:

  • Routines for downloading data are now in test_utils.py.
  • Added fixtures for the 2d and 3d datasets, with tests that parametrize the data input as laid out here.
  • Moved the actual tests for specific values of metrics on the Fluo-N2DL-HeLa dataset out of the benchmark.

Even though the benchmark runs fine locally, its GitHub action currently fails with an unknown error, to be fixed.

Left to do: caching the datasets in GitHub actions.

@bentaculum marked this pull request as ready for review April 12, 2024 13:42
@bentaculum changed the title from "Add 3d benchmark" to "Extend benchmarks" Apr 12, 2024
@msschwartz21 (Collaborator)

@bentaculum just wanted to give you a heads up that I'm in crunch mode finishing up writing my thesis. Hopefully next week I'll have time to review and help figure out what's causing the benchmarking action to fail.

@bentaculum (author)

No worries, and good luck!

@bentaculum (author)

Hi @msschwartz21, thanks for looking over this one and fixing it up.
It looks like the bottleneck now is loading the 3d data from disk; that's good.

I would suggest not limiting the benchmark to a few frames but instead running it on the entire video to catch possible increases in metric runtime.
Locally I get a total of 79s for the full video vs. 75s for three frames, so I changed it back to the full video.

@bentaculum (author)

If we want to speed up the benchmark, we could consider creating a subset of the 3d dataset on disk, or tweaking load_ctc_data to load only part of the dataset.

@bentaculum (author)

I still don't know why the benchmark runs locally but fails on GH actions, even with the increased time limit that you set...
It might be that the VMs are really slow.

@msschwartz21 (Collaborator)

Sometimes the action fails with a 143 error code, which I think means the VM is being terminated, probably due to maxing out its resources. It seems to always fail on the CTC matcher in 3d. I was going to try limiting the volume to a smaller ROI to reduce the number of nodes. If that works, we can keep the full length of the video.

@msschwartz21 (Collaborator)

Everything that I could find about GitHub actions failing pointed to a possible memory issue. I also remembered that we decided to err on the conservative side in terms of creating copies of data before possibly modifying a graph. I got rid of the copy in the initializer of the Matcher, which led to interesting results in the action:

  • tests/bench.py::test_ctc_matcher[3d] actually hits the timeout instead of failing immediately. The action then seems to fail during tests/bench.py::test_ctc_metrics[3d].
  • The unexpected outcome is that one of the division tests is now failing, which suggests that maybe our original paranoia about creating copies wasn't entirely misplaced...

@DragaDoncila (Collaborator) commented Sep 19, 2024

> The unexpected outcome is that one of the division tests is now failing, which suggests that maybe our original paranoia about creating copies wasn't entirely misplaced...

Just wanted to chime in on this, as it would be great to not have to copy the data. I have my local traccuracy install always with graph copying disabled, because in a lot of my pipelines, copying the graph just instantly hits my RAM limit, which is annoying for comparatively small datasets.

I'll also note that I think our load_ctc_data makes a copy of the data by keeping the frames in a list and then calling np.stack; we might want to try avoiding that by initializing a correctly sized np array first and then assigning into it. Opened #165 to track this.
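
For illustration only (not the actual load_ctc_data internals), the two patterns look roughly like this, with read_frame and the array shape as placeholders:

import numpy as np

def read_frame(i):
    # Placeholder for reading one segmentation frame from disk.
    return np.zeros((8, 64, 64), dtype=np.uint16)

n_frames = 10

# List-then-stack: every frame exists in the list *and* in the stacked copy.
frames = [read_frame(i) for i in range(n_frames)]
movie_copy = np.stack(frames)

# Preallocate-and-assign: a single array, each frame written into its slot.
first = read_frame(0)
movie = np.empty((n_frames, *first.shape), dtype=first.dtype)
for i in range(n_frames):
    movie[i] = read_frame(i)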

@msschwartz21 (Collaborator)

> I have my local traccuracy install always with graph copying disabled, because in a lot of my pipelines, copying the graph just instantly hits my RAM limit, which is annoying for comparatively small datasets.

Have you noticed any weird behavior when having the copy disabled?

But generally I agree that we should eliminate the copying. It was an easy way to get unblocked when we were getting started, but it's a pretty obvious limitation on dataset size.

@DragaDoncila (Collaborator)

> Have you noticed any weird behavior when having the copy disabled?

No, none. Caveat would be that I don't have any multithreaded/distributed code. But I do feel that if someone does have that, controlling edits and object access is on them, not on us.

I think we should definitely figure out what's going on in that failing test though, and whether it's likely to affect users or whether it's an artifact of how we wrote the test, etc.

@msschwartz21 (Collaborator)

It turns out the memory problem that we were encountering in tests/bench.py was due to including copy.deepcopy in multiple fixtures in order to avoid problems associated with running fixtures at module scope. The module scope was originally desirable in order to reduce the time spent loading the same data multiple times, but the deepcopy calls contributed to the large memory footprint of the benchmarking script.

I rescoped all of the benchmarking fixtures back to function scope and eliminated copy.deepcopy. As a side effect, I had to increase the timeout, since the data is now loaded inside each test function.
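
Roughly, the change amounts to the following sketch; load_3d_data stands in for the real data loading and none of the fixture names are taken from this PR:

import copy
import pytest

def load_3d_data():
    # Placeholder for loading a CTC dataset from disk.
    return {"frames": list(range(100))}

# Before: module scope loads once, but each test gets a deepcopy so that
# in-place modifications don't leak between tests -- expensive in memory.
@pytest.fixture(scope="module")
def shared_data():
    return load_3d_data()

@pytest.fixture
def data_before(shared_data):
    return copy.deepcopy(shared_data)

# After: function scope reloads the data for every test, so no copies are
# needed, at the cost of extra loading time (hence the larger timeout).
@pytest.fixture
def data_after():
    return load_3d_data()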

@bentaculum Can you take a look and confirm that, in the process of trying to get this to work on GitHub, I haven't eliminated anything that you wanted in there?

@bentaculum (author)

Hi @msschwartz21, thanks for looking into this once more. Given #166 and the memory issues on GH actions, it seems that loading the data separately is the way to go at the moment.
The benchmark now runs locally for me in ~200s. Somehow the benchmark on GH still failed above; I'm not sure why. It might be that the available virtual machine is much slower than my local one and hits the time limit (300s)?

@msschwartz21 (Collaborator)

Dang it, I didn't notice that the action failed after I merged in the latest changes 🤯 I'll take another look today and see if I can get it to pass.

Commit message: "Tried to move this change into a different PR, but that caused the action to start failing. This reverts commit c1ba1d5."
@msschwartz21 (Collaborator)

@bentaculum Looks like I messed up the action when I reverted the commit that eliminated copying of the graph in the matcher. I wanted to move it into another memory-related PR, but clearly it mattered for this one. I reverted the revert commit, and I think we're now actually good to go!

@bentaculum (author) commented Nov 11, 2024

Awesome 🎉! I suggest merging this as is.

Benchmarking takes a while now (~200s) due to loading datasets repeatedly, but on the plus side it now measures performance on a full standard CTC 3d dataset, which is a common use case, and it can be sped up significantly once the TrackingGraph is made agnostic to different metrics, as hinted at in #166.

@DragaDoncila @cmalinmayor any major concerns?

@bentaculum requested review from cmalinmayor and removed the request for msschwartz21 November 11, 2024 17:33
Labels: help wanted, performance, tests
4 participants