A blazing fast HDF5 Image Generator for Keras ⚡
Sometimes you'd like to work with large-scale image datasets that cannot fit into memory. Luckily, Keras provides data generators that feed your network with mini-batches of data read directly from a directory, simply by passing the source path. But this method is terribly inefficient: during training, the model has to deal with massive I/O operations on disk, which introduces huge latency.
A more efficient way is to take advantage of the HDF5 data structure, which is optimized for I/O operations. The idea is to (1) store your raw images and their labels in an HDF5 file, and (2) create a generator that loads and preprocesses mini-batches in real time.
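For step (1), a minimal sketch with h5py might look like this (the file path, shapes and dummy data are placeholders; the dataset keys match the generator's defaults shown below):

import h5py
import numpy as np

# Dummy data: 1000 RGB images (224x224) with integer labels in [0, 10).
images = np.random.randint(0, 256, size=(1000, 224, 224, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(1000,))

# Store raw images and labels in an HDF5 file,
# using the same keys the generator expects ('images' and 'labels').
with h5py.File('path/to/train.h5', 'w') as f:
    f.create_dataset('images', data=images)
    f.create_dataset('labels', data=labels)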
This image generator is built on top of the Keras Sequence class, so it is safe for multiprocessing. It also uses the super-fast albumentations image-processing library.
To install, use pip:
$ pip install h5imagegenerator
Dependencies:
- Keras
- Numpy
- Albumentations
- h5py
Feel free to PR any change/request. 😁
First, import the image generator class:
from h5imagegenerator import HDF5ImageGenerator
Then, create a new image generator:
train_generator = HDF5ImageGenerator(
    src='path/to/train.h5',
    X_key='images',
    y_key='labels',
    scaler=True,
    num_classes=10,
    labels_encoding='hot',
    batch_size=32,
    mode='train')
- src: the source HDF5 file
- X_key: the key of the image tensors dataset (default is images)
- y_key: the key of the labels dataset (default is labels)
- scaler: scale inputs to the range [0, 1] (basic normalization) (default is True)
- num_classes: tells the generator the total number of classes (used for one hot/smooth encoding)
- labels_encoding: set it to hot to convert integer labels to a binary matrix (one hot encoding), or to smooth to perform smooth encoding (default is hot)
- batch_size: the number of samples generated at each iteration (default is 32)
- mode: 'train' to generate tuples of image samples and labels, 'test' to generate image samples only (default is 'train')
Note:
(1) When using smooth labels_encoding, you should provide a smooth_factor (defaults to 0.1); see the sketch below.
(2) Labels stored in the HDF5 file must be integers, or lists/tuples of integers in case you're doing multi-label classification, e.g. labels=[1, 2, 3, 6, 9] or labels=[(1, 2), (5, 9), (3, 9)].
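For instance, a generator configured for smooth encoding could look like the following sketch (the file path and parameter values are placeholders):

train_generator = HDF5ImageGenerator(
    src='path/to/train.h5',
    X_key='images',
    y_key='labels',
    num_classes=10,
    labels_encoding='smooth',
    smooth_factor=0.1,
    batch_size=32,
    mode='train')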
...
Sometimes you'd like to perform data augmentation on-the-fly, to flip, zoom, rotate or scale images. You can pass an albumentations transformation pipeline to the generator:
import cv2
from albumentations import (
    Compose, HorizontalFlip, RandomContrast,
    RandomGamma, RandomBrightness, Resize)

my_augmenter = Compose([
    HorizontalFlip(p=0.5),
    RandomContrast(limit=0.2, p=0.5),
    RandomGamma(gamma_limit=(80, 120), p=0.5),
    RandomBrightness(limit=0.2, p=0.5),
    Resize(227, 227, cv2.INTER_AREA)])
train_generator = HDF5ImageGenerator(
    src='path/to/train.h5',
    X_key='images',
    y_key='labels',
    scaler=True,
    labels_encoding='hot',
    num_classes=10,
    batch_size=32,
    augmenter=my_augmenter)
Note:
(1) albumentations offers a ToFloat(max_value=255) transformation which scales pixel intensities from [0, 255] to [0, 1]. Thus, when using it, you must turn off scaling: scaler=False.
(2) If you want to apply standardization (mean/std), you may want to use the albumentations Normalize transformation instead.
(3) Make sure to turn off data augmentation (augmenter=False) when using evaluate_generator() and predict_generator().
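To illustrate note (1), here is a sketch of a pipeline that does its own scaling with ToFloat, so the generator's scaler is disabled (the file path is a placeholder):

import cv2
from albumentations import Compose, Resize, ToFloat

# The pipeline scales pixels itself, so scaler is turned off on the generator.
my_augmenter = Compose([
    Resize(227, 227, cv2.INTER_AREA),
    ToFloat(max_value=255)])

train_generator = HDF5ImageGenerator(
    src='path/to/train.h5',
    X_key='images',
    y_key='labels',
    scaler=False,
    labels_encoding='hot',
    num_classes=10,
    batch_size=32,
    augmenter=my_augmenter)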
Finally, pass the generator to your model:
model.compile(
    loss='categorical_crossentropy',
    metrics=['accuracy'],
    optimizer='rmsprop')

# Example with fit:
model.fit_generator(
    train_generator,
    validation_data=val_generator,
    workers=10,
    use_multiprocessing=True,
    verbose=1,
    epochs=1)
# Example with evaluate:
model.evaluate_generator(
    eval_generator,
    workers=10,
    use_multiprocessing=True,
    verbose=1)
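For prediction, you would typically build a generator in 'test' mode so it yields image batches only, which is what predict_generator() expects. A sketch (with a placeholder path) could look like this:

# Example with predict:
test_generator = HDF5ImageGenerator(
    src='path/to/test.h5',
    X_key='images',
    scaler=True,
    batch_size=32,
    mode='test')

predictions = model.predict_generator(
    test_generator,
    workers=10,
    use_multiprocessing=True,
    verbose=1)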