SAM2 for segmenting a 2 hour video? #264

Open
aendrs opened this issue Aug 26, 2024 · 4 comments

@aendrs

aendrs commented Aug 26, 2024

In your opinion, would it be possible to use SAM2 to segment a 2-hour video (720p, 60 fps) on a 4090 GPU while avoiding out-of-memory errors?
What would be the best strategy for doing so?

@kevinpl07

You would have to do it in chunks of ~10 s clips. You could take the mask from the last frame of each chunk and use it as the input prompt for the next chunk. That would take a while, but it would be fully automated.
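
Roughly, the loop could look like the untested sketch below. It only uses the standard video predictor calls (init_state / add_new_points / add_new_mask / propagate_in_video); extract_chunk_to_jpegs is a made-up placeholder that would dump one 10 s span of the video into a folder of JPEGs, since that's what init_state expects:

import numpy as np
from sam2.build_sam import build_sam2_video_predictor

video_path = "my_2h_video.mp4"  # placeholder path
chunk_len_s, fps = 10, 60
total_frames = 2 * 60 * 60 * fps
num_chunks = total_frames // (chunk_len_s * fps)

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

prev_mask = None  # mask carried over from the previous chunk
for chunk_idx in range(num_chunks):
    # extract_chunk_to_jpegs is a placeholder: dump one 10 s span to a JPEG folder
    frames_dir = extract_chunk_to_jpegs(video_path, chunk_idx, chunk_len_s, fps)
    state = predictor.init_state(video_path=frames_dir)

    if prev_mask is None:
        # First chunk: prompt manually (here a single placeholder point click on frame 0)
        predictor.add_new_points(state, frame_idx=0, obj_id=1,
                                 points=np.array([[210, 350]], dtype=np.float32),
                                 labels=np.array([1], np.int32))
    else:
        # Later chunks: seed frame 0 with the last mask from the previous chunk
        predictor.add_new_mask(state, frame_idx=0, obj_id=1, mask=prev_mask)

    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        prev_mask = (mask_logits[0] > 0).squeeze().cpu().numpy()
        # ... save/use the mask for this frame ...

    predictor.reset_state(state)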

@heyoeyo

heyoeyo commented Aug 26, 2024

The largest model uses <2GB of VRAM for videos, so a 4090 should have no issues. The main problem would be the likelihood of the segmentation failing at some point combined with the time it takes (using the large model at 60fps, I'd guess it would be 3-4 hours on a 4090), since that's a long time to have to sit there and correct the outputs. It might make sense to first run the tiny model at 512px resolution (see issue #257), which should take <1hr, to give some idea of where the tracking struggles.
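
For reference, that lower-resolution first pass can be set up when building the predictor, something like the snippet below. The "++model.image_size=512" hydra override is an assumption based on the config layout, so check #257 for the exact details:

from sam2.build_sam import build_sam2_video_predictor

# First-pass setup: tiny model at a reduced 512px input resolution.
# The image_size override is an assumption; see #257 for the exact way to change it.
predictor = build_sam2_video_predictor(
    "sam2_hiera_t.yaml",
    "checkpoints/sam2_hiera_tiny.pt",
    device="cuda",
    hydra_overrides_extra=["++model.image_size=512"],
)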

As for memory issues, the original code is set up for interactive use and won't work as-is. You'd have to clear the cached results as the video runs (see #196) and probably also avoid loading all the frames in at the start... I guess by a combination of using the async_loading_frames option on init_state and disabling (i.e. commenting out) the storage of async-loaded frames.
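
As a rough idea of what clearing the cache could look like, here's a sketch (using the same predictor / inference_state names as the example further down). The exact inference_state keys ("output_dict", "output_dict_per_obj", "non_cond_frame_outputs") are an assumption based on the current code and may change, so check #196 for the details:

# Sketch: drop old non-conditioning frame outputs while tracking, keeping only a
# short recent window for the memory bank. Conditioning (prompted) frames are left alone.
def trim_old_outputs(inference_state, current_frame_idx, keep_last_n=16):
    cutoff = current_frame_idx - keep_last_n
    non_cond = inference_state["output_dict"]["non_cond_frame_outputs"]
    for t in [t for t in non_cond if t < cutoff]:
        non_cond.pop(t)
    for per_obj in inference_state["output_dict_per_obj"].values():
        obj_non_cond = per_obj["non_cond_frame_outputs"]
        for t in [t for t in obj_non_cond if t < cutoff]:
            obj_non_cond.pop(t)

# ...then call it inside the propagation loop:
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    # ... use the masks ...
    trim_old_outputs(inference_state, frame_idx)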

Alternatively, there are existing code bases aimed at this, for example #90, maybe PR #46, maybe #73, and I also have a script for it.

@aendrs
Author

aendrs commented Aug 28, 2024

Thanks, I'll take a look at the links you provided. Could you explain to me what async_loading_frames does?

@heyoeyo

heyoeyo commented Aug 28, 2024

Could you explain to me what async_loading_frames does?

By default, the video predictor loads & preprocesses every single frame of your video before doing any segmentation. If you run the examples, you'll see this show up as a progress bar when you begin tracking:

frame loading (JPEG): 100%|███████| 200/200

Only after this finishes does the SAM model actually start doing anything. The model results show up as a different progress bar:

propagate in video:  22%|████     | 45/200

When you set async_loading_frames=True, the frame loading and SAM model run at the same time.

In theory the async loading is a far more practical choice, because it avoids loading everything into memory. Weirdly, the loader runs in its own thread and actually does store everything in memory, which sort of defeats the purpose. But you can fix it by commenting out the storage line like I mentioned before, and it's also probably worth commenting out the threading lines to stop the loader from trying to get ahead of the model. Those changes should allow you to run any length of video, but the predictor still caches results as it runs (around ~1MB per frame), which will eventually consume all your memory on longer videos (though that can be fixed with the other changes I mentioned).
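
To be a bit more concrete, both edits are in the AsyncVideoFrameLoader class in sam2/utils/misc.py. Quoting roughly from memory (so double-check against your copy of the code), the lines to comment out look like:

# In AsyncVideoFrameLoader.__init__ (stops the loader running ahead of the model):
#     self.thread = Thread(target=_load_frames, daemon=True)
#     self.thread.start()

# In AsyncVideoFrameLoader.__getitem__ (stops every decoded frame being kept in memory):
#     self.images[index] = img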

Here's a minimal video example that prints out VRAM usage. You can try running it with a different async setting and with/without the threading/storage lines commented out to see the differences:

from time import perf_counter
import torch
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Build the tiny model and point it at the demo frame folder
video_folder_path = "notebooks/videos/bedroom"
cfg, ckpt = "sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt"
device = "cuda"  # or "cpu"
predictor = build_sam2_video_predictor(cfg, ckpt, device)

# Frame loading/preprocessing happens here (toggle async_loading_frames to compare)
inference_state = predictor.init_state(
    video_path=video_folder_path,
    async_loading_frames=False,
)

# Prompt the first frame with a single foreground point so there's something to track
predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], np.int32),
)

# Track through the video, doing nothing with the results,
# and report VRAM use roughly once per second
tprev = -1
for result in predictor.propagate_in_video(inference_state):
    if (perf_counter() > tprev + 1.0) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("VRAM:", (total_bytes - free_bytes) // 1_000_000, "MB")
        tprev = perf_counter()

When I run this, the worst-case scenario is the original code with async=True, which uses >2.5GB of VRAM and keeps ballooning as it runs. The best case is also async=True but with the threading & storage lines commented out, which ends up needing around 1.1GB (though that will still grow slowly without clearing cached results).
