This repository contains the code for the paper:
Online Continual Learning Without the Storage Constraint
Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, Ozan Sener
[Arxiv] [PDF] [Bibtex]
Our code was run on a 16GB RTX 3080Ti Laptop GPU with 64GB RAM and PyTorch >=1.13; a better GPU and more RAM will allow for faster experimentation.
- Install all requirements needed to run the code in a Python >=3.9 environment by running:

```
# First, create and activate a new virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
```
- There is a fast, direct mechanism to download and use our datasets implemented in this repository.
- Set the `data_dir` field in `src/opts.py` to the directory where the dataset was downloaded (see the sketch below). All code in this repository was run on this dataset.
- `YOUR_DATA_DIR` should contain two subfolders: `cglm` and `cloc`.
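The exact contents of `src/opts.py` are not shown in this README; as a rough sketch, assuming the options are defined with argparse, the field to edit would look like this (only the name `data_dir` comes from this README, everything else is an assumption):

```python
# Hypothetical sketch of the relevant field in src/opts.py;
# the actual structure of the file may differ.
import argparse

parser = argparse.ArgumentParser()
# Point this at the directory containing the cglm/ and cloc/ subfolders.
parser.add_argument('--data_dir', type=str, default='/path/to/YOUR_DATA_DIR')
args = parser.parse_args()
```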
Following are instructions to set up each dataset:
- You can download the Continual Google Landmarks V2 dataset by following the instructions in their GitHub repository; run the following in the `DATA_DIR` directory:

```
wget -c https://raw.githubusercontent.com/cvdfoundation/google-landmark/master/download-dataset.sh
mkdir train && cd train
bash ../download-dataset.sh train 499
```
- Download metadata by running the following commands in the `scripts` directory:

```
wget -c https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv
python cglm_scrape.py
```
- Parse the XML files and organize them as a dictionary (see the sketch after this list).
- The ordering used in the paper is available to download from here.
- Now, select only the images that are part of the order file, and your dataset should be ready!
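For the parse-and-filter steps above, a minimal sketch is given below. The XML schema, tag names, and file layout are assumptions for illustration, not the exact format produced by `cglm_scrape.py`:

```python
# Hypothetical sketch of parsing scraped XML metadata into a dictionary and
# filtering it by the order file. Tag/field names are assumptions.
import glob
import pickle
import xml.etree.ElementTree as ET

# Step 1: parse the XML files and organize them as a dictionary.
metadata = {}
for path in glob.glob("metadata/*.xml"):
    root = ET.parse(path).getroot()
    for item in root.iter("item"):                    # assumed tag name
        image_id = item.findtext("image_id")          # assumed field name
        metadata[image_id] = item.findtext("upload_date")

with open("cglm_metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)

# Step 2: keep only the images that are part of the order file.
with open("order_file.txt") as f:                     # hypothetical file name
    keep = {line.split()[0] for line in f}
dataset = {k: v for k, v in metadata.items() if k in keep}
```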
- Download the `cloc.txt` file from this link into the `YOUR_DATA_DIR/cloc` directory.
- The `cloc.txt` file contains 36.8M image links, after removing missing/broken links from the original CLOC download file.
- Download the dataset in a parallel and scalable way using img2dataset; it finishes in less than a day on an 8-node server (read the instructions in the `img2dataset` repo for further distributed download options):
```
pip install img2dataset
img2dataset --url_list cloc.txt --input_format "txt" --output_format webdataset --output_folder images --processes_count 16 --thread_count 256 --resize_mode no --skip_reencode True
```
- Match the URLs and file indexes to the indices used by the training script from the original CLOC repo via this script.
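As a rough illustration of that matching step (the linked script may work differently; this assumes `cloc.txt` lists one URL per line in training-index order):

```python
# Hypothetical sketch: map img2dataset's downloaded samples back to CLOC
# training indices via their source URLs. File names/formats are assumptions.
import json
import tarfile

# Build a URL -> index lookup from the download list.
url_to_idx = {}
with open("cloc.txt") as f:
    for idx, line in enumerate(f):
        url_to_idx[line.split()[0]] = idx

# img2dataset's webdataset output stores per-sample JSON metadata
# (including the source "url") inside each tar shard.
with tarfile.open("images/00000.tar") as tar:
    for member in tar:
        if member.name.endswith(".json"):
            meta = json.load(tar.extractfile(member))
            idx = url_to_idx.get(meta["url"])
```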
- To reproduce our KNN scaling graphs (Figure 1b), please run the following on a computer with high RAM:

```
cd scripts/
python knn_scaling.py
python plot_knn_results.py
```
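For intuition about what is being measured, here is a minimal, self-contained sketch of a kNN scaling experiment (the internals of `knn_scaling.py` may differ): time a brute-force kNN query as the stored feature memory grows.

```python
# Minimal sketch: query time of brute-force kNN vs. memory size.
# Dimensions, sizes, and k are illustrative assumptions.
import time
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 5
query = rng.standard_normal((1, d)).astype(np.float32)

for n in [10_000, 100_000, 1_000_000]:
    memory = rng.standard_normal((n, d)).astype(np.float32)
    start = time.perf_counter()
    dists = ((memory - query) ** 2).sum(axis=1)   # squared L2 distances
    neighbours = np.argpartition(dists, k)[:k]    # indices of the k nearest
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}  query time: {elapsed * 1e3:.1f} ms")
```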
- To reproduce the blind classifier, please run the following:

```
cd scripts/
python run_blind.py
```
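A blind classifier ignores the input entirely and exploits label correlations in the stream. Below is a minimal sketch (a simplified stand-in, not necessarily the exact rule `run_blind.py` implements) that always predicts the previous label:

```python
# Minimal sketch of a blind classifier: predict the label of the
# previous sample and measure online accuracy on the label stream.
def blind_accuracy(labels):
    correct = 0
    prev = None
    for y in labels:
        if y == prev:
            correct += 1  # the prediction (previous label) was right
        prev = y
    return correct / len(labels)

# On a temporally correlated stream this baseline is surprisingly strong:
print(blind_accuracy([3, 3, 3, 7, 7, 1, 1, 1, 1]))  # ~0.67
```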
- New ordering files: using the `upload_date` instead of the date from EXIF metadata (more unique timestamps and more faithful to the story), we get this new order file. It is different from the order file at the CLDatasets repo; do not cross-compare them (a rebuild sketch follows below).
- However, no substantial changes in trends are observed! The label correlation does not go away (in fact, it slightly increases with the better ordering, which breaks the ties between identical timestamps that previously led to random ordering!)
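As a sketch of how such an order file could be rebuilt (column names are assumptions; use the provided order file for reproducibility):

```python
# Hypothetical sketch: rebuild an ordering by sorting metadata on upload_date.
# Column names ("image_id", "upload_date") are assumptions, not the repo schema.
import csv

with open("metadata.csv") as f:
    rows = list(csv.DictReader(f))

# upload_date has more unique timestamps than EXIF dates, so far fewer
# ties need to be broken arbitrarily (which previously led to random order).
rows.sort(key=lambda r: r["upload_date"])

with open("order_upload_date.txt", "w") as f:
    for r in rows:
        f.write(r["image_id"] + "\n")
```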
We hope ACM is a strong method for comparison, and that this idea/codebase is useful for your cool CL idea! To cite our work:
```
@article{prabhu2023online,
  title={Online Continual Learning Without the Storage Constraint},
  author={Prabhu, Ameya and Cai, Zhipeng and Dokania, Puneet and Torr, Philip and Koltun, Vladlen and Sener, Ozan},
  journal={arXiv preprint arXiv:2305.09253},
  year={2023}
}
```