Why does my model perform worse after labelling more and more frames? #1689
Hi @majaneubauer,

My apologies, we seem to have let this one slip through the cracks! Here's some context:

mAP

This is a tricky metric that incorporates many factors.

Problem 1: Sampling at small sizes

First off, the value you see in the SLEAP GUI is based on evaluation on the validation set, which is a random 10% of the data that is held out during training. The set of images SLEAP uses for validation is unique to that training run, so every time you train a new model, it will be a different subset of images. Because this is random, at small dataset sizes we may be talking about tens of images, which makes the mAP estimate super sensitive to sampling.

Problem 2: Evaluation set composition

To make matters trickier, as you add more frames, the size of the validation set changes, and it may include fewer "easy" frames. This is especially the case as you add more frames from harder cases. As you do so, the mAP may go down even though the model's performance is actually improving: in previous runs, which did not contain as many "hard" frames, the mAP was actually an overestimate of the true performance.

Test splits are more reliable, but impractical

To get truly reliable and stable mAP estimates, the appropriate procedure is to label a ton of frames, hold out a fixed test set of 100+ images, and then vary what you use for training. This doesn't make a lot of sense in the early stages of the labeling workflow, since it would mean that you'd need to (painfully) manually label many images and then not use them for training, so the validation split is an OK proxy, but it should be taken with a grain of salt.
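If you do reach the point of wanting a fixed test set, a minimal sketch of carving one out of an existing project might look like this (the file names are placeholders, and the exact Labels API can vary slightly between SLEAP versions):

```python
import random

import sleap

# Load the full project (placeholder file name).
labels = sleap.load_file("labels.v009.slp")

# Shuffle the user-labelled frames with a fixed seed so the split is reproducible.
frames = list(labels.user_labeled_frames)
random.Random(42).shuffle(frames)

# Hold out a fixed test set of ~100 frames; everything else is available for training.
test_frames, train_frames = frames[:100], frames[100:]

sleap.Labels(labeled_frames=train_frames).save("labels.train.slp")
sleap.Labels(labeled_frames=test_frames).save("labels.test.slp")
```

The held-out file then stays untouched, and every new model is evaluated against the same frames, which makes the mAP values comparable across training runs.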
OKS

The other issue is that the mAP depends on the underlying OKS measure. This metric is a bit more holistic than the pixel distance errors, since it factors in visibility, scale, and annotation variability.

Problem 1: Node count

The OKS is, however, sensitive to the number of nodes in your skeleton, since it is an aggregate metric. This means that you may see significantly lower mAP values even when you make relatively few errors, i.e., it is less tolerant of mistakes when you have fewer nodes.

Problem 2: Variability calibration

Another factor is that the OKS attempts to account for human annotator variability as a reference point for how bad each pixel of error should really be for a given node type. For example, eyes are exceptionally easy to click on correctly: if you ask 100 humans to label an eye, they will click on almost exactly the same pixel. Conversely, if you ask them to label a hip node, which is internal and covered by clothing, the variability is much higher. The OKS factors this in by calibrating the error in pixels to the difficulty of localizing that node type. Unfortunately, we don't have those exact variability constants for every node type on every animal, so we use the human eye node variability constant as a baseline. This makes our OKS values more conservative (i.e., underestimates) to be on the safe side. The more "hard" keypoints you have, though (which on mice is most of them, since they're so blobby), the worse an underestimate of the true performance this will be. This means that the upper ceiling for some datasets may be significantly lower than the ~0.8 values you might find reported elsewhere.

Problem 3: Scale calibration

The OKS calculation also normalizes the score by the relative scale of the ground truth instance. The point of this is to treat small absolute distance errors on small instances as equivalent to larger errors on larger instances, so that they roughly correspond to physical distances, as well as to account for the limited information available (which is bounded by the image resolution). For some skeleton types, particularly ones that form a line segment, the area can be particularly small, which penalizes small errors more than it otherwise would. IIRC we may even be using the axis-aligned bounding box as the estimate of area, which works fine for lateral views, but for top-down/bottom-up views it is sensitive to rotation within the FOV. This is partially mitigated when the skeleton has more nodes off the midline of the body (e.g., paws/legs), but it is still a factor that can throw off this metric.

This is all to say: if the performance looks good qualitatively, and the error distances are reasonable within the validation set, it may be safe to disregard the mAP. If it's important to robustly quantify this value (or the trend in performance), then you can follow the procedure mentioned above for creating a test split.

Let us know if you have any questions!

Talmo
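To make the node-count and scale effects described above concrete, here is a minimal numpy sketch of a COCO-style OKS computation (SLEAP's own implementation differs in details such as how the area and the per-node constants are chosen, so treat this purely as an illustration):

```python
import numpy as np

def oks(gt, pred, area, k=0.025):
    """COCO-style object keypoint similarity.

    gt, pred: (n_nodes, 2) arrays of keypoint coordinates.
    area: scale of the ground truth instance (e.g., bounding box area).
    k: keypoint tolerance constant (a single "easy keypoint" value applied
       to every node here; real implementations use per-node constants).
    """
    d2 = np.sum((gt - pred) ** 2, axis=1)       # squared pixel error per node
    visible = ~np.isnan(gt).any(axis=1)         # skip nodes missing from the ground truth
    e = d2[visible] / (2 * area * k**2)         # normalize by scale and tolerance
    return np.mean(np.exp(-e))                  # average similarity across visible nodes

rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(13, 2))              # hypothetical 13-node skeleton
pred = gt + rng.normal(scale=3.0, size=gt.shape)    # ~3 px of error on every node

# Identical pixel errors score lower on a smaller instance, and a skeleton
# with fewer nodes gives each node's error more weight in the aggregate.
print(oks(gt, pred, area=100 * 100))            # larger instance -> higher OKS
print(oks(gt, pred, area=40 * 40))              # smaller instance -> lower OKS
print(oks(gt[:4], pred[:4], area=100 * 100))    # fewer nodes -> less forgiving, noisier
```

The mAP is then the average precision averaged over a sweep of OKS thresholds, so anything that pulls the OKS down (few nodes, small areas, conservative per-node constants) also caps the achievable mAP.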
Hi there!
I have been working with SLEAP for the past 4 weeks, and I cannot help but wonder whether my training workflow even makes sense. I have already tried a variety of things, but I thought I would double-check that what I am doing is sensible.
The overall goal is to use SLEAP on videos of two mice (one black and one white), filmed from below. I have about 40 videos with that setup, which vary in lighting conditions.
What I have been doing so far is as follows:
I started by adding my first video to the SLEAP GUI, created my skeleton, and labelled 30 frames.
I then exported my training job package (I am training a multi-animal bottom-up model on Google Colab). I attached screenshots of my configuration and my training pipeline.
I then trained my model using
!sleap-train multi_instance.json labels.v001_bottom_up.pkg.slp --run_name "240205_multi_instance"
I then predicted on a new video using
!sleap-track videos_to_analyse/Trial2.mp4 \
    --max_instances 2 \
    --tracking.tracker flowmaxtracks \
    --tracking.max_tracking true \
    --tracking.max_tracks 2 \
    --tracking.target_instance_count 2 \
    --tracking.post_connect_single_breaks 1 \
    -m models/240205_multi_instance.multi_instance
I merged my prediction file into my original project and labelled 100 more frames. I used the sample and random method for this in the labelling suggestions tab.
I then exported a new training job package, but resumed training: in Google Colab I added the --base_checkpoint option to my normal sleap-train command (a sketch of the full command is shown after these workflow steps). I then predicted on another video using the same command as above (making sure to change the video and the model directory). I specifically chose a video with different lighting this time, in the hope of making my model more robust.
After merging my predictions file into my current project, I again used the sample and random method to label 100 more frames (230 total now).
I exported a training job package again and resumed training, setting my model obtained with 130 frames as my --base_checkpoint.
Predicting, merging, labelling, and resuming training went on for quite some time. At some point I started using the prediction score method in the labelling suggestions tab in the hope of improving my model.
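For reference, the resumed-training step described above looked roughly like the following; the labels package name, run name, and checkpoint path are placeholders modelled on the earlier commands rather than the exact paths used:

!sleap-train multi_instance.json labels.v002_bottom_up.pkg.slp \
    --run_name "240212_multi_instance_resumed" \
    --base_checkpoint models/240205_multi_instance.multi_instance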
My problem now is this: I have been looking at the mAP scores of each model and have realised that they sometimes seem to get worse. I thought this might be due to resuming training, so I tried training from scratch again. However, this also did not give me a satisfactory result. Here are some values I got. Please keep in mind that I never changed any of the hyperparameters, whether I resumed training or trained from scratch.
First approach (the same training workflow as described above, but training from scratch every time):
number of labelled frames | mAP
30 | 0.06732
130 | 0.11867
230 | 0.18940
Then, basically by accident, I discovered the resume training button in the GUI. So after labelling 50 more frames (280 frames total), I resumed training with the model obtained from 230 frames as my base_checkpoint:
280 | 0.50627
This was such a big improvement that I thought I should have been resuming training all along. So I labelled 50 more frames and resumed training with the model obtained from 280 frames as my base_checkpoint:
330 | 0.502889
Now here I got confused: why did my model get worse even though I had labelled more frames? So I trained my 280-frame model and my 330-frame model from scratch and obtained these mAP values:
280 | 0.24979
330 | 0.25126
At this point I was at a loss as to what to do. It appeared as if resuming training was indeed helping my model, but after a certain point, labelling more frames seemed to make it worse again. So I went on another quest and tried resuming training from the beginning: I kept exactly the same labelled frames, but when training after a labelling session, I now used the previously obtained model as my base_checkpoint. I then obtained these results:
30 | 0.06733
130 | 0.30236
230 | 0.25289
280 | 0.47352
330 | 0.57064
380 | 0.58931
432 | 0.60116
Now I was already pretty happy with that, but it seemed to stagnate somewhere around 0.60. I had seen somewhere on GitHub that it might be beneficial to label frames with low prediction scores, so I did that, and my model got worse again:
482 | 0.55066
505 | 0.58807
I am doubting all my past work at the moment and would appreciate it if someone could let me know whether what I am doing is sensible, or where I need to make changes to my training workflow.
Thank you!