In the general case, adding a dataset to TFDS involves two steps:
- Implement a Python class that provides a dataset builder with the specs of the data (e.g., the shape of the observations, actions, etc.) and how to read your dataset files.

  NOTE: If you want your dataset to be compatible with RLDS pipelines, make sure your implementation provides the same structure and keys as an RLDS dataset.
- Run a `download_and_prepare` pipeline that converts the data to the TFDS intermediate format.
You can follow the instructions for adding a dataset on the TFDS site.
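Once a builder class exists, the second step above can be sketched as follows. This is a minimal sketch, assuming the builder has already been registered with TFDS; the function name `prepare_and_load` and its parameters are hypothetical, and only `tfds.builder`, `download_and_prepare` and `as_dataset` are actual TFDS calls.

```python
def prepare_and_load(builder_name: str, data_dir: str, split: str = "train"):
    """Runs the TFDS preparation step for a registered builder and loads one split.

    `builder_name` and `data_dir` are placeholders for your own dataset name
    and output directory.
    """
    # Imported lazily so this sketch can be imported without TFDS installed.
    import tensorflow_datasets as tfds

    # Look up the builder class registered under `builder_name`.
    builder = tfds.builder(builder_name, data_dir=data_dir)
    # Convert the raw files to the TFDS intermediate format under `data_dir`.
    builder.download_and_prepare()
    # Return the requested split as a tf.data.Dataset.
    return builder.as_dataset(split=split)
```

For large datasets the preparation step is usually run once, offline, and only the loading part is repeated.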
There are situations in which it might be preferable to rewrite the raw data into an RLDS/TFDS-compatible format before adding it to TFDS (for example, if your data uses a format that cannot be shared). You can use the Envlogger or the EpisodeWriter directly to do so. To use the EpisodeWriter, you can create your DatasetConfig with the ConfigGenerator tool.
Even if your data is already in TFDS format, you may want to create a TFDS builder if you want to:
- Reshuffle: When you want to re-generate the data to ensure that episodes are shuffled on disk (otherwise, they are stored as they were generated with Envlogger).
- Share: Consider if you want to add the dataset to the TFDS catalog or if you just want to share it in your own repository (note that users will still be able to load your data directly with `tfds.builder_from_directory`).
Most of the steps to follow in this case will be either no-ops or very simple. You can find an example here.
If you have generated your RLDS dataset with the Envlogger Riegeli backend and you want to convert it to TFDS, you can take a look at the instructions to create a TFDS builder using the RLDS helpers.
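For data that is already stored in TFDS format, loading it directly with `tfds.builder_from_directory`, as mentioned in the Share item above, might look like the sketch below. The function name `load_episodes` and the directory argument are hypothetical; the directory is assumed to be the versioned folder that contains the dataset's `dataset_info.json`.

```python
def load_episodes(dataset_dir: str, split: str = "train"):
    """Loads a dataset already stored in TFDS format, without a custom builder.

    `dataset_dir` is a placeholder for the folder holding the prepared files.
    """
    # Imported lazily so this sketch can be imported without TFDS installed.
    import tensorflow_datasets as tfds

    # Reconstruct a read-only builder from the metadata found in the directory.
    builder = tfds.builder_from_directory(dataset_dir)
    # Return the requested split as a tf.data.Dataset.
    return builder.as_dataset(split=split)
```

Loading this way reads the files as they are stored on disk, so it does not cover the reshuffling case described above.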
You can add your dataset directly to TFDS following the instructions at https://www.tensorflow.org/datasets.
- If your data has been generated with Envlogger or the RLDS Creator, you can just use the RLDS helpers in TFDS (see an example here, or here if you used the TFDS Envlogger backend).
- Otherwise, make sure your `generate_examples` implementation provides the same structure and keys as RLDS loaders if you want your dataset to be compatible with RLDS pipelines (example).
Note that you can follow the same steps to add the data to your own repository (see more details in the TFDS documentation).
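To illustrate what "the same structure and keys" means for the `generate_examples` item above, here is a minimal sketch in plain Python. It assumes a hypothetical raw format in which each episode is a list of `(observation, action, reward, discount)` tuples plus a flag saying whether the episode ended in a terminal state. The step keys are the ones conventionally used by RLDS; in an actual TFDS builder this logic would live in the `_generate_examples` method, and the values would need to match the declared feature specs.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw format: (observation, action, reward, discount).
Transition = Tuple[Any, Any, float, float]
# Hypothetical raw episode: its transitions and whether it ended in a terminal state.
RawEpisode = Tuple[List[Transition], bool]


def to_rlds_steps(transitions: List[Transition], terminal: bool) -> List[Dict[str, Any]]:
    """Converts raw transitions into a list of steps using RLDS-style keys."""
    last = len(transitions) - 1
    steps = []
    for i, (observation, action, reward, discount) in enumerate(transitions):
        steps.append({
            "observation": observation,
            "action": action,
            "reward": reward,
            "discount": discount,
            "is_first": i == 0,
            "is_last": i == last,
            # Only the final step can be terminal, and only if the episode
            # actually ended in a terminal state (rather than being truncated).
            "is_terminal": i == last and terminal,
        })
    return steps


def generate_examples(raw_episodes: Iterable[RawEpisode]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yields (key, episode) pairs, with the steps nested under the "steps" key."""
    for index, (transitions, terminal) in enumerate(raw_episodes):
        yield f"episode_{index}", {"steps": to_rlds_steps(transitions, terminal)}
```

Episode-level metadata, if any, would be added as extra keys next to `"steps"` in the yielded dictionary.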