SAM2 for segmenting a 2 hour video? #264

Open
aendrs opened this issue Aug 26, 2024 · 4 comments

@aendrs

aendrs commented Aug 26, 2024

In your opinion, would it be possible to use SAM2 to segment a 2-hour video (720p, 60 fps) on a 4090 GPU while avoiding out-of-memory errors?
What would be the best strategy for doing so?

@kevinpl07

You would have to do it in chunks of ~10 s clips. You could take the mask from the last frame of each chunk and use it as the input prompt for the next chunk. That would take a while, but it would be fully automated.
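
Roughly, the loop could look like the untested sketch below. It only uses the standard video predictor calls (init_state / add_new_points / add_new_mask / propagate_in_video); extract_chunk_to_jpegs is a made-up placeholder that would dump one 10 s span of the video into a folder of JPEGs, since that's what init_state expects:

import numpy as np
from sam2.build_sam import build_sam2_video_predictor

video_path = "my_2h_video.mp4"  # placeholder path
chunk_len_s, fps = 10, 60
total_frames = 2 * 60 * 60 * fps
num_chunks = total_frames // (chunk_len_s * fps)

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

prev_mask = None  # mask carried over from the previous chunk
for chunk_idx in range(num_chunks):
    # extract_chunk_to_jpegs is a placeholder: dump one 10 s span to a JPEG folder
    frames_dir = extract_chunk_to_jpegs(video_path, chunk_idx, chunk_len_s, fps)
    state = predictor.init_state(video_path=frames_dir)

    if prev_mask is None:
        # First chunk: prompt manually (here a single placeholder point click on frame 0)
        predictor.add_new_points(state, frame_idx=0, obj_id=1,
                                 points=np.array([[210, 350]], dtype=np.float32),
                                 labels=np.array([1], np.int32))
    else:
        # Later chunks: seed frame 0 with the last mask from the previous chunk
        predictor.add_new_mask(state, frame_idx=0, obj_id=1, mask=prev_mask)

    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        prev_mask = (mask_logits[0] > 0).squeeze().cpu().numpy()
        # ... save/use the mask for this frame ...

    predictor.reset_state(state)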

@heyoeyo

heyoeyo commented Aug 26, 2024

The largest model uses <2GB of VRAM for videos, so a 4090 should have no issues. The main problem would be the likelihood of the segmentation failing at some point combined with the time it takes (using the large model at 60fps, I'd guess it would be 3-4 hours on a 4090), since that's a long time to have to sit there and correct the outputs. It might make sense to first run the tiny model at 512px resolution (see issue #257), which should take <1hr, to give some idea of where the tracking struggles.
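
For reference, that lower-resolution first pass can be set up when building the predictor, something like the snippet below. The "++model.image_size=512" hydra override is an assumption based on the config layout, so check #257 for the exact details:

from sam2.build_sam import build_sam2_video_predictor

# First-pass setup: tiny model at a reduced 512px input resolution.
# The image_size override is an assumption; see #257 for the exact way to change it.
predictor = build_sam2_video_predictor(
    "sam2_hiera_t.yaml",
    "checkpoints/sam2_hiera_tiny.pt",
    device="cuda",
    hydra_overrides_extra=["++model.image_size=512"],
)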

As for memory issues, the original code is set up for interactive use and won't work as-is. You'd have to clear the cached results as the video runs (see #196) and probably also avoid loading all the frames in at the start... I guess by a combination of using the async_loading_frames option on init_state and disabling (i.e. commenting out) the storage of async-loaded frames.
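
As a rough idea of what clearing the cache could look like, here's a sketch (using the same predictor / inference_state names as the example further down). The exact inference_state keys ("output_dict", "output_dict_per_obj", "non_cond_frame_outputs") are an assumption based on the current code and may change, so check #196 for the details:

# Sketch: drop old non-conditioning frame outputs while tracking, keeping only a
# short recent window for the memory bank. Conditioning (prompted) frames are left alone.
def trim_old_outputs(inference_state, current_frame_idx, keep_last_n=16):
    cutoff = current_frame_idx - keep_last_n
    non_cond = inference_state["output_dict"]["non_cond_frame_outputs"]
    for t in [t for t in non_cond if t < cutoff]:
        non_cond.pop(t)
    for per_obj in inference_state["output_dict_per_obj"].values():
        obj_non_cond = per_obj["non_cond_frame_outputs"]
        for t in [t for t in obj_non_cond if t < cutoff]:
            obj_non_cond.pop(t)

# ...then call it inside the propagation loop:
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    # ... use the masks ...
    trim_old_outputs(inference_state, frame_idx)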

Alternatively, there are existing code bases aimed at this, for example #90, maybe PR #46, maybe #73, and I also have a script for it.

@aendrs
Author

aendrs commented Aug 28, 2024

Thanks, I'll take a look at the links you provided. Could you explain to me what async_loading_frames does?

@heyoeyo

heyoeyo commented Aug 28, 2024

Could you explain to me what async_loading_frames does?

By default, the video predictor loads & preprocesses every single frame of your video before doing any segmentation. If you run the examples, you'll see this show up as a progress bar when you begin tracking:

frame loading (JPEG): 100%|███████| 200/200

Only after this finishes does the SAM model actually start doing anything. The model results show up as a different progress bar:

propagate in video:  22%|████     | 45/200

When you set async_loading_frames=True, the frame loading and SAM model run at the same time.

In theory the async loading is a far more practical choice, because it avoids loading everything into memory. Weirdly, the loader runs in its own thread and actually does store everything in memory, which sort of defeats the purpose. But you can fix it by commenting out the storage line like I mentioned before, and it's also probably worth commenting out the threading lines to stop the loader from trying to get ahead of the model. Those changes should allow you to run any length of video, but the predictor still caches results as it runs (around ~1MB per frame), which will eventually consume all your memory on longer videos (though that can be fixed with the other changes I mentioned).
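
To be a bit more concrete, both edits are in the AsyncVideoFrameLoader class in sam2/utils/misc.py. Quoting roughly from memory (so double-check against your copy of the code), the lines to comment out look like:

# In AsyncVideoFrameLoader.__init__ (stops the loader running ahead of the model):
#     self.thread = Thread(target=_load_frames, daemon=True)
#     self.thread.start()

# In AsyncVideoFrameLoader.__getitem__ (stops every decoded frame being kept in memory):
#     self.images[index] = img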

Here's a minimal video example that prints out VRAM usage. You can try running it with a different async setting and with/without the threading/storage lines commented out to see the differences:

from time import perf_counter
import torch
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Build the tiny model and point it at the demo frame folder
video_folder_path = "notebooks/videos/bedroom"
cfg, ckpt = "sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt"
device = "cuda"  # or "cpu"
predictor = build_sam2_video_predictor(cfg, ckpt, device)

# Frame loading/preprocessing happens here (toggle async_loading_frames to compare)
inference_state = predictor.init_state(
    video_path=video_folder_path,
    async_loading_frames=False,
)

# Prompt the first frame with a single foreground point so there's something to track
predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], np.int32),
)

# Track through the video, doing nothing with the results,
# and report VRAM use roughly once per second
tprev = -1
for result in predictor.propagate_in_video(inference_state):
    if (perf_counter() > tprev + 1.0) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("VRAM:", (total_bytes - free_bytes) // 1_000_000, "MB")
        tprev = perf_counter()

When I run this, the worst-case scenario is the original code with async=True, which uses >2.5GB of VRAM and keeps ballooning as it runs. The best case is also async=True but with the threading & storage lines commented out, which ends up needing around 1.1GB (though that will still grow slowly without clearing cached results).
