SAM2 for segmenting a 2 hour video? #264
You would have to do it in chunks of 10s clips. You could take the mask of the last frame of each chunk and use it as the input prompt for the next chunk. That would take a while, but it would be fully automated.
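The chunked approach above can be sketched as follows. The `segment_chunk` callable here is hypothetical, standing in for a per-clip SAM2 run; the point is the chunk-splitting and mask hand-off logic:

```python
def chunk_ranges(total_frames, fps, chunk_seconds=10):
    """Split a video into [start, end) frame-index ranges of chunk_seconds each."""
    chunk_len = fps * chunk_seconds
    return [(start, min(start + chunk_len, total_frames))
            for start in range(0, total_frames, chunk_len)]


def track_in_chunks(total_frames, fps, first_mask, segment_chunk):
    """Run a (hypothetical) per-chunk segmenter, feeding each chunk's
    last-frame mask in as the prompt for the next chunk."""
    mask = first_mask
    results = []
    for start, end in chunk_ranges(total_frames, fps):
        masks = segment_chunk(start, end, mask)  # one SAM2 run over this clip
        results.extend(masks)
        mask = masks[-1]  # last frame's mask seeds the next chunk
    return results
```

For scale: a 2-hour video at 60fps splits into 720 chunks of 600 frames each.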
The largest model uses <2GB of VRAM for videos, so a 4090 should have no issues. The main problem is the likelihood of the segmentation failing at some point, combined with the time it takes (using the large model at 60fps, I'd guess 3-4 hours on a 4090), since that's a long time to have to sit there and correct the outputs. It might make sense to first run the tiny model at 512px resolution (see issue #257), which should take <1hr, to get some idea of where the tracking struggles.

As for memory issues, the original code is set up for interactive use and won't work as-is. You'd have to clear the cached results as the video runs (see #196) and probably also avoid loading all the frames in at the start, e.g. by using the async_loading_frames option.

Alternatively, there are existing code bases aimed at this, for example #90, maybe PR #46, maybe #73, and I also have a script for it.
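To get a feel for the scale of the memory problem, a quick back-of-envelope on the frame count and cache growth (the ~1MB-per-frame figure is the rough estimate quoted later in this thread; treat everything here as ballpark, not a measurement):

```python
# Rough scale of a 2-hour, 60fps video (estimates, not measurements)
fps = 60
duration_s = 2 * 60 * 60          # 2 hours in seconds
total_frames = fps * duration_s   # 432,000 frames

cache_mb_per_frame = 1            # approx. cached results per tracked frame
cache_gb = total_frames * cache_mb_per_frame / 1000
print(f"{total_frames} frames, ~{cache_gb:.0f} GB of cached results if never cleared")
```

Hence the need to clear cached results as you go: no GPU (or system) has hundreds of GB to spare for a cache that's never evicted.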
Thanks, I'll take a look at the links you provided. Could you explain what the `async_loading_frames` option does?
By default, the video predictor loads & preprocesses every single frame of your video before doing any segmentation. If you run the examples, you'll see this as a frame-loading progress bar when you begin tracking. Only after this finishes does the SAM model actually start doing anything; the model results show up as a separate progress bar.
When you set `async_loading_frames=True`, the frames are instead loaded as the model runs.

In theory the async loading is a far more practical choice, because it avoids loading everything into memory. Weirdly, though, the loader runs in its own thread and actually does store everything in memory, which sort of defeats the purpose. But you can fix that by commenting out the storage line like I mentioned before, and it's probably also worth commenting out the threading lines to stop the loader from trying to get ahead of the model.

Those changes should allow you to run any length of video, but the predictor still caches results as it runs (around ~1MB per frame), which will eventually consume all your memory on longer videos (that can be fixed with the other changes I mentioned).

Here's a minimal video example that prints out VRAM usage. You can try running it with a different async setting and with/without the threading/storage lines commented out to see the differences:

```python
from time import perf_counter

import numpy as np
import torch

from sam2.build_sam import build_sam2_video_predictor

video_folder_path = "notebooks/videos/bedroom"
cfg, ckpt = "sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt"
device = "cuda"  # or "cpu"

predictor = build_sam2_video_predictor(cfg, ckpt, device)
inference_state = predictor.init_state(
    video_path=video_folder_path,
    async_loading_frames=False,
)

predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], np.int32),
)

tprev = -1
for result in predictor.propagate_in_video(inference_state):
    # Do nothing with the results, just report VRAM use (at most once per second)
    if (perf_counter() > tprev + 1.0) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("VRAM:", (total_bytes - free_bytes) // 1_000_000, "MB")
        tprev = perf_counter()
    pass
```

When I run this, the worst-case scenario is the original code with async=True, which uses >2.5GB of VRAM and keeps ballooning as it runs. The best case is also async=True but with the threading & storage lines commented out, which ends up needing around 1.1GB (though it will still grow slowly without clearing cached results).
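Clearing the cached results (the #196 fix mentioned above) boils down to evicting per-frame entries once the tracker no longer needs them. SAM2's `inference_state` internals vary by version, so this is a generic, hypothetical sketch of the eviction pattern rather than actual SAM2 code; in practice the `del` line would target the version-specific dicts inside `inference_state`:

```python
def propagate_with_cleanup(frame_results, keep_last=1):
    """Generic sketch of bounded-memory propagation: consume per-frame
    results while evicting cache entries older than the last keep_last
    frames, so the cache never grows with video length."""
    cache = {}
    for frame_idx, result in frame_results:
        cache[frame_idx] = result
        # evict anything older than the window we still need
        for old_idx in [i for i in cache if i <= frame_idx - keep_last]:
            del cache[old_idx]
        yield frame_idx, result, len(cache)
```

The cache size plateaus at `keep_last` entries no matter how many frames flow through, which is exactly the behavior you want for a 2-hour video.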
In your opinion, would it be possible to use SAM2 to segment a 2-hour video (720p, 60fps) on a 4090 GPU while, of course, avoiding out-of-memory errors? What would be the best strategy for doing so?