## Download data

We kindly ask you to fill out the form before downloading. To get started, download the nuScenes subset image data and the DriveLM-nuScenes QA JSON files below. For v1.1 data, please visit the DriveLM/challenge folder.

| nuScenes subset images | DriveLM-nuScenes version-1.0 |
| --- | --- |
| Google Drive | Google Drive |
| Baidu Netdisk | Baidu Netdisk |
| HuggingFace | HuggingFace |

You can also download the full nuScenes dataset HERE to enable video input.

Our DriveLM dataset contains a collection of questions and answers. Currently, only the training set is publicly available. The dataset is named `v1_0_train_nus.json`.

## Prepare the dataset

Organize the data structure as follows:

```
DriveLM
├── data/
│   ├── QA_dataset_nus/
│   │   ├── v1_0_train_nus.json
│   ├── nuscenes/
│   │   ├── samples/
```
## File structure

The QA pairs are in `v1_0_train_nus.json`. Below is the JSON file structure. All coordinates mentioned are referenced from the upper-left corner of the respective camera image, with the right and bottom directions serving as the positive x and y axes, respectively.

```
v1_0_train_nus.json
├── scene_token:{
│   ├── "scene_description": "The ego vehicle proceeds along the current road, preparing to enter the main road after a series of consecutive right turns.",
│   ├── "key_frames":{
│   │   ├── "frame_token_1":{
│   │   │   ├── "key_object_infos":{"<c1,CAM_FRONT,258.3,442.5>": {"Category": "Vehicle", "Status": "Moving", "Visual_description": "White Sedan", "2d_bbox": [x_min, y_min, x_max, y_max]}, ...},
│   │   │   ├── "QA":{
│   │   │   │   ├── "perception":[
│   │   │   │   │   ├── {"Q": "What are the important objects in the current scene?", "A": "The important objects are <c1,CAM_FRONT,258.3,442.5>, <c2,CAM_FRONT,1113.3,505.0>, ...", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None},
│   │   │   │   │   ├── {"Q": "xxx", "A": "xxx", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│   │   │   │   ├── ],
│   │   │   │   ├── "prediction":[
│   │   │   │   │   ├── {"Q": "What is the future state of <c1,CAM_FRONT,258.3,442.5>?", "A": "Slightly offset to the left in maneuvering.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│   │   │   │   ├── ],
│   │   │   │   ├── "planning":[
│   │   │   │   │   ├── {"Q": "In this scenario, what are safe actions to take for the ego vehicle?", "A": "Brake gently to a stop, turn right, turn left.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│   │   │   │   ├── ],
│   │   │   │   ├── "behavior":[
│   │   │   │   │   ├── {"Q": "Predict the behavior of the ego vehicle.", "A": "The ego vehicle is going straight. The ego vehicle is driving slowly.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}
│   │   │   │   ├── ]
│   │   │   ├── },
│   │   │   ├── "image_paths":{
│   │   │   │   ├── "CAM_FRONT": "xxx",
│   │   │   │   ├── "CAM_FRONT_LEFT": "xxx",
│   │   │   │   ├── "CAM_FRONT_RIGHT": "xxx",
│   │   │   │   ├── "CAM_BACK": "xxx",
│   │   │   │   ├── "CAM_BACK_LEFT": "xxx",
│   │   │   │   ├── "CAM_BACK_RIGHT": "xxx",
│   │   │   │   ├── }
│   │   ├── },
│   │   ├── "frame_token_2":{
│   │   │   ├── "key_object_infos":{"<c1,CAM_BACK,612.5,490.6>": {"Category": "Traffic element", "Status": "None", "Visual_description": "Stop sign", "2d_bbox": [x_min, y_min, x_max, y_max]}, ...},
│   │   │   ├── "QA":{
│   │   │   │   ├── "perception":[...],
│   │   │   │   ├── "prediction":[...],
│   │   │   │   ├── "planning":[...],
│   │   │   │   ├── "behavior":[...]
│   │   │   ├── },
│   │   │   ├── "image_paths":{...}
│   │   ├── }
│   ├── }
├── }
```

- `scene_token` is the same as in the nuScenes dataset.
- `scene_description` is a one-sentence summary of the ego vehicle's behavior in the roughly 20-second video clip (the notion of a scene in the nuScenes dataset).
- Under `key_frames`, each key frame is identified by its `frame_token`, which corresponds to the token in the nuScenes dataset.
- `key_object_infos` is a mapping between a c tag (e.g. `<c1,CAM_FRONT,258.3,442.5>`) and additional information about the related key object, such as its category, status, visual description, and 2D bounding box.
- `QA` is divided into different tasks, and the QA pairs under each task are formulated as a list of dictionaries. Each dictionary contains the keys `Q` (question), `A` (answer), `C` (context), `con_up`, `con_down`, `cluster`, and `layer`. Currently, the values of the context-related keys are set to `None`, serving as tentative placeholders for future fields related to DriveLM-CARLA. A sketch of how to walk this structure follows the list below.
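
To make the nesting concrete, here is a minimal Python sketch that loads the JSON and walks scenes, key frames, and QA pairs. The file path follows the layout above, and the key names are exactly those shown in the structure:

```python
import json

# Path follows the "Prepare the dataset" layout above.
with open("data/QA_dataset_nus/v1_0_train_nus.json") as f:
    data = json.load(f)

for scene_token, scene in data.items():
    print(scene_token, "-", scene["scene_description"])
    for frame_token, frame in scene["key_frames"].items():
        key_objects = frame["key_object_infos"]  # c tag -> category/status/description/2d_bbox
        image_paths = frame["image_paths"]       # camera name -> image path
        for task in ("perception", "prediction", "planning", "behavior"):
            for qa in frame["QA"][task]:
                # Each dict also carries C, con_up, con_down, cluster, layer (currently None).
                print(f"  [{task}] Q: {qa['Q']}")
```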

Note: The c tag label is used to indicate key objects selected during the annotation process. These include not only objects present in the ground truth but also objects that are not, such as landmarks and traffic lights. Each key frame contains a minimum of three and a maximum of six key objects. The c tag is formatted as `<c,CAM,x,y>`, where c is the identifier, CAM indicates the camera in which the key object's center point is situated, and x, y are the horizontal and vertical coordinates of the center of the 2D bounding box in that camera's coordinate system, with the upper-left corner as the origin and the right and bottom as the positive x and y axes, respectively.
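
A c tag can be unpacked with a small regular expression; this sketch assumes only the `<c,CAM,x,y>` format described above:

```python
import re

# Matches e.g. "<c1,CAM_FRONT,258.3,442.5>"; groups: identifier, camera, x, y.
C_TAG = re.compile(r"<(c\d+),([A-Z_]+),([\d.]+),([\d.]+)>")

def parse_c_tag(tag: str) -> tuple:
    m = C_TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a valid c tag: {tag}")
    ident, cam, x, y = m.groups()
    return ident, cam, float(x), float(y)

print(parse_c_tag("<c1,CAM_FRONT,258.3,442.5>"))
# -> ('c1', 'CAM_FRONT', 258.3, 442.5)
```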

In contrast to the c tag, for the question "Identify all the traffic elements in the front view," the output is presented as a list formatted as `[(c, s, x1, y1, x2, y2), ...]`. Here, c denotes the category, s the status, and x1, y1, x2, y2 the offsets of the top-left and bottom-right corners of the box relative to the center point.
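
Assuming the offsets are simply added to a known center point (cx, cy), recovering absolute corner coordinates is one line of arithmetic. The helper below is illustrative only and not part of the DriveLM codebase:

```python
def corners_from_offsets(cx: float, cy: float,
                         x1: float, y1: float,
                         x2: float, y2: float) -> tuple:
    # (cx, cy): reference center point; (x1, y1) / (x2, y2): top-left /
    # bottom-right corner offsets as described above. Illustrative helper only.
    return (cx + x1, cy + y1, cx + x2, cy + y2)
```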
