# dataset_creator

Experiments with CLIP-based image search for dataset creation and foundation models for image autolabeling.

## Setup
```bash
cd ./dataset_creator/
# create the conda environment
conda env create -f environment.yml
# install this package
pip install -e .
# the existing scripts can then be run
python scripts/download_data.py
python scripts/select_dataset.py
python scripts/autolabel_dataset.py
```
## Data Selection

```bash
python scripts/select_dataset.py
```
Rough inference times (RTX 3070 laptop):
- CLIP image/text embedding: ~0.06 s/it (~15 it/s)
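For reference, a minimal sketch of the text-to-image search step, assuming the Hugging Face `transformers` CLIP implementation and the `openai/clip-vit-base-patch32` checkpoint (the actual script may use a different model and directory layout):

```python
# Minimal sketch of CLIP text-to-image search: embed all candidate images
# and a text query, then rank images by cosine similarity. The checkpoint
# and the "data/raw" directory are illustrative assumptions.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = sorted(Path("data/raw").glob("*.jpg"))  # hypothetical directory
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    # embed all candidate images and L2-normalize
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # embed the text query and L2-normalize
    text_inputs = processor(
        text=["a photo of a dog"], return_tensors="pt", padding=True
    ).to(device)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# cosine similarity between the query and every image, highest first
scores = (image_emb @ text_emb.T).squeeze(-1)
for idx in scores.argsort(descending=True)[:10]:
    print(f"{scores[idx]:.3f}  {image_paths[idx]}")
```

Image-based search works the same way with a query image embedded via `get_image_features` instead of a text prompt, and similarity-based filtering can threshold pairwise image-image scores to drop near-duplicates.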
Selected images:
## Autolabeling

```bash
python scripts/autolabel_dataset.py
```

TODO:
Rough inference times (RTX 3070 laptop):
- DepthAnything: ~0.35 s/it (~2.8 it/s)
- GroundingSAM: ~17 s/it (scales roughly linearly with the number of instances to detect in `class_onthology`)
- CoCa: ~1 s/it
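As an illustration of the depth branch, here is a minimal sketch using the `transformers` depth-estimation pipeline with a Depth Anything checkpoint; the checkpoint name and the input/output paths are assumptions, not necessarily what the script uses:

```python
# Minimal sketch of depth autolabeling with Depth Anything via the
# Hugging Face depth-estimation pipeline. Checkpoint and paths are
# illustrative assumptions.
from pathlib import Path

from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

out_dir = Path("data/labels/depth")  # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

for path in sorted(Path("data/selected").glob("*.jpg")):  # hypothetical input
    result = depth(Image.open(path).convert("RGB"))
    # the pipeline returns a PIL depth map ("depth") plus the raw
    # tensor ("predicted_depth"); save the rendered map per image
    result["depth"].save(out_dir / f"{path.stem}.png")
```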
Autolabeled images:
## Features

- script to download files from the internet (Pixabay API)
- CLIP-based image directory search
  - image-based search
  - text-based search
  - similarity-based filtering
- GT autolabeling
  - image captions (based on the CoCa model; see the sketch after this list)
  - bounding boxes + instance segmentation (based on Grounded-SAM)
  - depth (based on Depth Anything)
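A minimal captioning sketch with CoCa via `open_clip`, following the usage shown in the open_clip README; the pretrained tag and the image path are assumptions:

```python
# Minimal sketch of CoCa image captioning with open_clip: encode one
# image and autoregressively generate a caption, then strip the
# special tokens from the decoded string.
import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    generated = model.generate(image)

caption = (
    open_clip.decode(generated[0])
    .split("<end_of_text>")[0]
    .replace("<start_of_text>", "")
    .strip()
)
print(caption)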
## References

- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data