# Dataset v2.0 #461
## Conversation
```
'~/.cache/huggingface/lerobot'.
episodes (list[int] | None, optional): If specified, this will only load episodes specified by
    their episode_index in this list. Defaults to None.
split (str, optional): _description_. Defaults to "train".
```
I thought we were removing split?
I've removed it in 8bd406e (it wasn't used anymore).
I suggest we just keep a notion of split in the `info.json`, as I've done in the conversion script:

```json
"splits": {
    "train": "0:50"
}
```
Hey folks! Awesome work here. I was wondering what the timeline is for merging Dataset 2.0? The reason I ask is that I'm about to start working on adding support for the Elephant Robotics MyArm M&C, and I'm not sure if I should target dataset 1.0 or 2.0 initially.
@apockill Thank you for your support! The team took some time off and we're just getting back. Hopefully this will be merged very soon (we mainly need to update some more tests now).
We will refactor how robot classes are structured soon, but for now this PR shouldn't have a big impact on adding support for a new robot. The only thing that's being added to robot classes is the …
## What this does

This PR introduces a new format for `LeRobotDataset`, which is accompanied by a new file structure. As these changes are not backward compatible, we increase `CODEBASE_VERSION` from `v1.6` to `v2.0`.

## What do I need to do?
If you already pushed a dataset using `v1.6` of our codebase, you can use the conversion script `lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py` to convert it to the new format. You will be asked to enter a prompt describing the task performed in the dataset.

Examples for a single-task dataset:
```bash
python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket." \
    --robot-config lerobot/configs/robot/aloha.yaml
```
```bash
python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml
```
For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.
## Motivation

The current implementation of our `LeRobotDataset` suffers from a few shortcomings which make it hard to use in some respects. Specifically:

- The reliance on `datasets` and `huggingface_hub` makes it inconvenient to create datasets locally (with recording). In order to use newly created files on disk, these libraries check if those files are present in the cache (which they won't be) and, if not, will download them even though they may already be on disk.
- `VideoFrame` is not yet integrated into `datasets`.

## Changes
Some of the biggest changes come from the new file structure and its content:

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the `info.json` file, which can easily be downloaded or viewed directly on the hub before downloading any actual data. The types of files used are very simple and do not need complex tools to be read: only `.parquet`, `.json`, `.jsonl` and `.mp4` files are used (`.md` for the README).

### Added
- `LeRobotDataset` can now be called with an `episodes` argument (e.g. `episodes=[1, 10, 12, 40]`) to select a specific subset of episodes by their episode_index. By doing so, only the files corresponding to these episodes will be downloaded (if they're not already on disk). In that case, the `hf_dataset` attribute will only contain data from these episodes, as well as the `episode_data_index`.
- New `LeRobotDatasetMetadata` class. This allows getting info about a dataset before loading the data. For example, you could do something like the sketch shown right after this list.
- `tasks.json` contains the dataset's tasks, mapped to their `task_index`, which is what's actually stored in the parquet files. Using the API, they can be accessed either with `dataset.tasks` to get that mapping or through `dataset.episode_dict[episode_index]["tasks"]` if you're only interested in a particular episode.
- `info.json` describes what's in the dataset (keys, shapes, number of episodes, etc.). It serves as a source of truth for what's inside the dataset.
- `episodes.jsonl` contains per-episode information (episode_index, tasks in natural language and episode lengths). This is accessed through the `episode_dict` attribute in the API.
- `LeRobotDataset.create()` allows creating a new dataset from scratch, either for recording data or for porting an existing dataset to the LeRobotDataset format. To that end, new methods are added:
  - `start_image_writter()`: This instantiates an `ImageWriter` in the `image_writer` attribute to write images asynchronously during data recording. This is automatically called during `LeRobotDataset.create()` if specified in the arguments.
  - `stop_image_writter()`: This properly stops and removes the `ImageWriter` from the dataset's attributes. Importantly, if the `image_writer` has been set to a multiprocess `ImageWriter`, this needs to be called first if you want to pass the dataset into a parallelized DataLoader, as the `ImageWriter` class is not picklable (required for objects to be transferred between processes). This is not needed when instantiating a dataset with `__init__`, as the `image_writer` is not created in that case.
  - `add_frame()`: Adds a single timestamp's frame of data to the `episode_buffer`, which keeps data in memory temporarily. Note: this will be merged with the `DataBuffer` from #445 in a subsequent PR.
  - `add_episode()`: Saves the content of the `episode_buffer` to disk and updates metadata to keep them in sync with the contents of the files. This method expects a `task` argument: a string prompt in natural language describing the task performed in the episode. Videos from that episode can optionally be encoded during this phase, but it's not mandatory and can be done later to give more flexibility on when to do that.
  - `consolidate()`: This encodes videos that have not yet been encoded, cleans up the temporary image files, computes dataset statistics, checks that timestamps are in sync with the `fps` and performs additional sanity checks on the dataset. It needs to be done before uploading the dataset to the hub with `push_to_hub()`.
  - `clear_episode_buffer()`: This can be used to reset the `episode_buffer` (e.g. to discard data from a current recording).
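Here is a minimal sketch of what inspecting a dataset through `LeRobotDatasetMetadata` before loading any data could look like. The import path and any attribute names other than `tasks` and `episode_dict` are assumptions and may differ from the actual API:

```python
# Sketch only: import path and exact attribute names are assumptions.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# Inspect a dataset (info.json, episodes.jsonl, tasks.json) without downloading data files.
meta = LeRobotDatasetMetadata("lerobot/aloha_sim_insertion_human_image")
print(meta.tasks)            # mapping of task_index to natural language task
print(meta.episode_dict[0])  # per-episode info: tasks, length, ...

# Then only load the episodes you are actually interested in.
dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human_image", episodes=[0, 1, 2])
```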
### Changed

- Checking that timestamps are in sync is no longer done in `__get_item__()` and is now done during `__init__` or `consolidate`. This has the benefit of both saving computation during `__get_item__()` and knowing immediately if there are sync issues with the timestamps.
- File paths are defined in `info.json` to allow flexibility and to easily split chunks of files between directories, in order to avoid the hub's limit of files (10k) per folder.
- Datasets are stored in `~/.cache/huggingface/lerobot` by default. Changing `root` or setting the `LEROBOT_HOME` env variable allows changing that location. Every call to the `huggingface_hub` download functions like `snapshot_download` or `hf_hub_download` uses the `local_dir` argument with that location, so that files are not duplicated in cache, and to solve the issue of having to re-download files already present on disk.
- Image-writing logic has been moved from `populate_dataset.py` into an `ImageWriter` class.
- `stats.safetensors` is now `stats.json` (the content remains the same but it's unflattened).
- `episode_data_index.safetensors` is removed, but the `episode_data_index` is still in the API to map episode_index to indices.
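As a quick illustration of the new storage location behaviour, here is a minimal sketch; the constructor signature is an assumption, only `root` and `LEROBOT_HOME` come from the description above:

```python
# Sketch only: exact constructor signature may differ.
import os

# Either point LeRobot at a custom location via the environment...
os.environ["LEROBOT_HOME"] = "/data/lerobot"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# ...or pass `root` explicitly. Files are read/written there instead of being
# duplicated in the huggingface cache.
dataset = LeRobotDataset("lerobot/pusht", root="/data/lerobot")
```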
## Performance

In the nominal case (no `delta_timestamp`), `LeRobotDataset.__get_item__()` is on par with the previous version, sometimes slightly improved but generally still in the same ballpark. `__get_item__()` call time in seconds (average over 10k iterations):

Benchmarking code
Using `delta_timestamps`, results are more diverse depending on the dataset but still remain in the same ballpark. `__get_item__()` call time in seconds (average over 10k iterations), with `delta_timestamps=[-1/fps, 0, 1/fps]`:

Benchmarking code (delta_timestamps)
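The benchmark results and code are in collapsed sections of the original description. Below is a minimal sketch of the timing methodology described above (10k `__get_item__()` calls); the repo id and the exact `delta_timestamps` format are illustrative assumptions:

```python
# Sketch of the timing loop described above, not the actual benchmarking code.
import time
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

fps = 50  # illustrative
dataset = LeRobotDataset(
    "lerobot/aloha_sim_insertion_human_image",
    delta_timestamps={"observation.state": [-1 / fps, 0, 1 / fps]},  # omit for the nominal case
)

n_iters = 10_000
start = time.perf_counter()
for i in range(n_iters):
    _ = dataset[i % len(dataset)]
print(f"average call time: {(time.perf_counter() - start) / n_iters:.6f} s")
```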
## Fixes

- Fixed a bug in `load_previous_and_future_frames` which didn't actually raise an error when the requested timestamps from `delta_timestamps` did not correspond to actual timestamps in the dataset.
- Fixed tasks that were stored as raw tensor strings instead of plain text (e.g. `"tf.Tensor(b'Do something', shape=(), dtype=string)"`) in the following datasets:
  - `lerobot/aloha_mobile_shrimp`
  - `lerobot/aloha_static_battery`
  - `lerobot/aloha_static_fork_pick_up`
  - `lerobot/aloha_static_thread_velcro`
  - `lerobot/uiuc_d3field`
- `lerobot/viola` is missing video keys [TODO]

## How it was tested
- Added `tests/fixtures/`, in which fixtures and fixture factories have been added to simplify writing/adding tests. These factories allow the flexibility to create partially mocked objects on the fly to be used in tests, while not relying on other components of the codebase that are not meant to be tested in a particular test (e.g. initializing a dataset using hydra). A generic illustration of this pattern is sketched right after this list.
- Added `tests/test_image_writer.py`
- Added `tests/test_delta_timestamps.py`
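For readers unfamiliar with the pattern, here is a generic illustration of a fixture factory; the actual factories live in `tests/fixtures/` and their names and signatures will differ:

```python
# Generic illustration of the fixture-factory pattern, not the actual test code.
import pytest

@pytest.fixture
def info_factory():
    # Returns a callable so each test can build a partially mocked info dict on the fly.
    def _create(total_episodes=1, fps=30, **overrides):
        info = {"codebase_version": "v2.0", "total_episodes": total_episodes, "fps": fps}
        info.update(overrides)
        return info
    return _create

def test_fps_is_set(info_factory):
    info = info_factory(fps=50)
    assert info["fps"] == 50
```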
## How to checkout & try? (for the reviewer)
Use an existing dataset:
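A minimal sketch of loading a dataset in the new format; the import path is taken from this repo's layout and may differ after merge:

```python
# Sketch: load a dataset already converted to the v2.0 format.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human_image")
print(dataset)
print(dataset[0])  # one frame, indexed as before
```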
Try out the new feature to select / download specific episodes:
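A sketch of the `episodes` argument described in the Added section; the episode indices are illustrative:

```python
# Sketch: only the files for the requested episodes get downloaded.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human_image", episodes=[1, 10, 12, 40])
print(dataset.hf_dataset)          # contains only data from these episodes
print(dataset.episode_data_index)  # maps episode_index to start/end indices
```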
You can also create a new dataset:
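A rough sketch of the recording workflow built from the new methods listed in the Added section (`create`, `add_frame`, `add_episode`, `consolidate`, `push_to_hub`). The `create()` arguments and the frame keys are assumptions, not the confirmed API:

```python
# Sketch only: argument names and frame keys are assumptions.
import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset.create("my_user/my_new_dataset", fps=30)

for _ in range(2):        # two dummy episodes
    for _ in range(100):  # 100 frames per episode
        dataset.add_frame({
            "observation.state": np.zeros(6, dtype=np.float32),
            "action": np.zeros(6, dtype=np.float32),
        })
    # `task` is a natural language description of what was performed in the episode.
    dataset.add_episode(task="Pick the Lego block and drop it in the box on the right.")

dataset.consolidate()  # encode remaining videos, compute stats, run sanity checks
dataset.push_to_hub()
```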