
Transform from video_rgb format into video_tok_rgb format and save in video_tok_rgb/ directory. #9

Open
kdu4108 opened this issue Jul 4, 2024 · 3 comments

Comments

@kdu4108
Collaborator

kdu4108 commented Jul 4, 2024

TODO: Modify and run save_vq_tokens.py to tokenize RGB videos.

save_vq_tokens.py is the script you run in 4M to pretokenize images, e.g., to go from examples of the modality rgb to examples of the modality tok_rgb. It takes a pretrained tokenizer and an input dataset directory (among other things), applies the tokenizer to the images in the input dataset, and writes the resulting tokens to a new output dataset directory.

We want to get tokens for the RGB videos, i.e., to go from an input directory of root/video_rgb to tokenized examples in the output directory root/video_tok_rgb.

To do this, modify save_vq_tokens.py so that it has the following capabilities:

  1. It can load video files from the dataset folder in webdataset format (see the structure under video_rgb modality directory proposed in this post [PARENT ISSUE] Data preprocessing and pseudolabeling #3 (comment)).
  2. It can run the pretrained RGB tokenizer on each frame of each video.
  3. It saves the tokens as .npy files in webdataset format in the directory root/video_tok_rgb.
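
A minimal sketch of how these three steps might fit together, using only tarfile and NumPy. The names tokenize_shard, decode_frames, and tokenize_frames are hypothetical and are not part of 4M or save_vq_tokens.py. The video decoder and the tokenizer are passed in as callables because this issue does not specify which decoding library or tokenizer API to use.

```python
import io
import os
import tarfile

import numpy as np


def tokenize_shard(in_tar_path, out_tar_path, decode_frames, tokenize_frames):
    """Tokenize every .mp4 in one shard and write a matching .npy per video.

    Both callables are hypothetical placeholders (assumptions, not 4M APIs):
      decode_frames(mp4_bytes) -> array of frames for one video
      tokenize_frames(frames)  -> array of tokens, one entry per frame
    """
    os.makedirs(os.path.dirname(out_tar_path) or ".", exist_ok=True)
    with tarfile.open(in_tar_path, "r") as src, tarfile.open(out_tar_path, "w") as dst:
        for member in src:
            # Step 1: only regular .mp4 members are treated as videos.
            if not (member.isfile() and member.name.endswith(".mp4")):
                continue
            mp4_bytes = src.extractfile(member).read()

            # Step 2: decode the video, then tokenize its frames.
            tokens = np.asarray(tokenize_frames(decode_frames(mp4_bytes)))

            # Step 3: serialize to .npy in memory and add it to the output
            # shard under the same stem (00013.mp4 -> 00013.npy).
            buf = io.BytesIO()
            np.save(buf, tokens)
            data = buf.getvalue()
            info = tarfile.TarInfo(name=os.path.splitext(member.name)[0] + ".npy")
            info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
```

In real use, decode_frames would wrap whatever video reader the project settles on, and tokenize_frames would wrap the pretrained RGB tokenizer; both are left abstract here on purpose.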

Proposed input directory format:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

Proposed output directory format:

root/video_tok_rgb/shard-00000.tar
 |     ├── 00000.npy # this corresponds to one video. shape: something like (num_frames, H, C, W)
 |     ├── 00001.npy
 |     └── ...
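
To sanity-check an output shard, the .npy members can be read back straight from the tar without unpacking it to disk. The sketch below assumes the layout proposed above; load_tokens is a hypothetical helper name, not an existing 4M function.

```python
import io
import tarfile

import numpy as np


def load_tokens(tar_path):
    """Return {member name: token array} for every .npy file in a shard."""
    tokens = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".npy"):
                data = tar.extractfile(member).read()
                # np.load accepts any file-like object, so wrap the bytes.
                tokens[member.name] = np.load(io.BytesIO(data), allow_pickle=False)
    return tokens
```

The first dimension of each loaded array should then equal the number of frames in the corresponding video.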

Definition of Done:

  • We have a script that can, given an input directory (e.g., video_rgb), a pretrained tokenizer (e.g., from https://huggingface.co/collections/EPFL-VILAB/4m-tokenizers-66019388bda47e9bcff3f887), and an output directory (e.g., video_tok_rgb), generate tokenized representations of those videos and save them to the output directory in the structure above.
  • The script has been run and we actually have tokenized RGB videos in root/video_tok_rgb.

(This is a subtask of #3)

@kdu4108 kdu4108 changed the title Modify save_vq_tokens.py to tokenize RGB videos. Modify and run save_vq_tokens.py to tokenize RGB videos. Jul 4, 2024
@kdu4108 kdu4108 changed the title Modify and run save_vq_tokens.py to tokenize RGB videos. Transform from video_rgb format into video_tok_rgb format and save in video_tok_rgb/ directory. Jul 4, 2024
@kdu4108
Collaborator Author

kdu4108 commented Aug 2, 2024

TODO (@markus583?): make sure the names of the token files are correct and line up with the mp4 filenames within the tar file.

E.g., if an mp4 in filtered_raw/shard-000.tar is 00013.mp4, then the corresponding .npy file needs to be filtered_raw/shard-000.tar/00013.npy.
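
One way to verify this is to compare file stems between a video shard and its token shard. This is a sketch using only the standard library; member_stems and check_shard_alignment are hypothetical names, and the two tar paths are whatever pair of shards is being checked.

```python
import os
import tarfile


def member_stems(tar_path, ext):
    """Names of all members ending in `ext`, with the extension stripped."""
    with tarfile.open(tar_path, "r") as tar:
        return {
            os.path.splitext(name)[0]
            for name in tar.getnames()
            if name.endswith(ext)
        }


def check_shard_alignment(rgb_tar, tok_tar):
    """Return (videos without tokens, tokens without videos), each sorted."""
    videos = member_stems(rgb_tar, ".mp4")
    tokens = member_stems(tok_tar, ".npy")
    return sorted(videos - tokens), sorted(tokens - videos)
```

Both returned lists are empty when every mp4 has a matching .npy and there are no stray token files.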

@kdu4108
Collaborator Author

kdu4108 commented Aug 2, 2024

Can you also make sure the directory is named video_tok_rgb instead of video_rgb_tok? @markus583

@markus583

Done. Also fixed intra-json paths and cleaned up. @kdu4108
