In this multilabel example we will import the Corel-5k dataset as a Cassandra dataset and then read the data into NVIDIA DALI. We will save the original images as JPEG blobs and the labels as NPY blobs (i.e., serialized numpy tensors).
As a first step, the raw files are to be downloaded from:
or, if you have installed Kaggle API, you can just run this command:
$ kaggle datasets download -d parhamsalar/corel5k
In the following we will assume the original images are stored in the
/data/Corel-5k/
directory.
We begin by starting the Cassandra server shipped with the provided Docker container:
# Start Cassandra server
$ /cassandra/bin/cassandra
Note that the shell prompt is immediately returned. Wait until state jump to NORMAL
is shown (about 1 minute).
The following commands will insert the original dataset in Cassandra and use the plugin to read the images in NVIDIA DALI.
# - Create the tables in the Cassandra DB
$ cd examples/corel5k/
$ /cassandra/bin/cqlsh -f create_tables.cql
# - Fill the tables with data and metadata
$ python3 extract_serial.py /data/Corel-5k/images/ /data/Corel-5k/npy_labs /data/Corel-5k/train.json --data-table corel5k.data --metadata-table corel5k.metadata
# - Read the list of UUIDs and cache it to disk
$ python3 cache_uuids.py --metadata-table corel5k.metadata --rows-fn corel5k.rows
# - Tight loop, data loading and decoding in GPU memory (GPU:0)
$ python3 loop_read.py --data-table corel5k.data --rows-fn corel5k.rows --use-gpu
# - Sharded, tight loop test, using 2 GPUs via torchrun
$ torchrun --nproc_per_node=2 loop_read.py --data-table corel5k.data --rows-fn corel5k.rows --use-gpu