This repository provides our implementation of StROL in two different simulation environments and in a user study on a 7-DoF Franka Emika Panda robot arm. The videos for our user study can be found here
You need to have the following libraries with Python3:
To install StROL, clone the repo using
git clone
In StROL, we modify the learning dynamics of the system to incorporate a correction term in order to make learning from noisy and suboptimal actions more robust.
The block diagram below shows the methodology used to generate the dataset using the given online learning rule
For our implementation we formulate the correction term $\hat g$ as a fully connected neural network.
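As a rough illustration of the idea above, the sketch below evaluates a small fully connected network and adds its output to the output of an existing learning rule. It is a minimal sketch under stated assumptions, not the repository's implementation: the names `mlp_forward` and `online_update`, the tanh hidden layer, the step size `alpha`, the network input (current estimate concatenated with the base update), and the additive form of the update are all hypothetical choices made for illustration.

```python
import math

def mlp_forward(x, layers):
    """Forward pass of a small fully connected network.

    `layers` is a list of (weights, biases) pairs, where weights[i] holds the
    input weights of output unit i. tanh is applied after every layer except
    the last one.
    """
    h = list(x)
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        h = z if idx == len(layers) - 1 else [math.tanh(v) for v in z]
    return h

def online_update(theta, g_value, correction, alpha=0.1):
    """One step of the assumed rule: theta <- theta + alpha * (g + g_hat)."""
    return [t + alpha * (g + c) for t, g, c in zip(theta, g_value, correction)]

# Hypothetical example with two reward parameters.
theta = [0.0, 0.0]
g_value = [1.0, -1.0]  # output of the original learning rule (assumed given)
layers = [
    # hidden layer: 4 inputs -> 2 units
    ([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]], [0.0, 0.0]),
    # output layer: 2 units -> 2 outputs
    ([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0]),
]
correction = mlp_forward(theta + g_value, layers)  # assumed input: [theta, g]
new_theta = online_update(theta, g_value, correction)
```

In practice the weights would come from offline training rather than being written by hand; the hand-picked values here only make the arithmetic easy to follow.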
Next, we explain the implementation of StROL in different simulation environments. We define the online learning rule
This experiment is performed in the 2-D driving simulator CARLO [1]. In this setting, a robot car (a car controlled by a robot) drives in front of a human car (a car controlled and driven by a human). Both cars start in the left lane of a two-lane highway. The action and state spaces for both cars are 2-dimensional, i.e.
Below, we define the features and the online learning rule
- Distance between the cars: $d = x_{\mathcal{R}} - x_{\mathcal{H}}$
- Speed of the robot car: $v$
- Heading direction of the human car
The learning rule of the robot
For the Highway environment, move to the corresponding folder using
cd simulations/CARLO
To train the correction term
python3 --train
We provide a trained model for the environment with
python3 --eval
This will run the evaluation script for the Highway environment for StROL and the baselines (Gradient, One [2], MOF [3] and e2e) and save a plot of the performance of the different approaches.
To test the trained model with different noise and bias levels, provide them using the --noise and --bias arguments respectively.
The full list of arguments for training and testing and their default values can be seen in
In this environment, a simulated human is trying to convey their task preferences to the robot. The action and state spaces in this environment are both 3-dimensional, i.e.
The features used in this environment and the robot's original learning dynamics
- 2-D distance of the robot's end-effector from the plate: $d_p$
- 3-D distance of the robot's end-effector from the cup
The learning rule
For training the
cd simulations/robot
And then run
python3 --train
This will train test_ours/py. To change the noise and bias values, provide --noise and --bias as arguments with the training command.
We provide a pretrained model for g_data/g_tilde/model_2objs_500. To test StROL in the Robot environment, run
python3 --eval --boltzmann
This code uses a model of the human that always chooses the optimal actions for the given reward function to provide corrections to the robot (noise and bias are added after the optimal action is chosen). The simulated human, by default, chooses their actions from a binormal distribution of tasks. To use a uniform prior for the tasks, add the argument --uniform when running the script. The results of the runs for all approaches will be saved in `/results`.
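The way noise and bias enter the simulated human's corrections can be sketched as follows. This is an illustrative sketch only: the function name `perturb_action`, the choice to express both levels as fractions of the largest action magnitude, the use of the noise level as a standard deviation, and applying the same bias along every axis are assumptions, not a description of the repository's code.

```python
import random

def perturb_action(optimal_action, max_action, noise_frac=0.0, bias_frac=0.0, rng=None):
    """Add a constant bias and zero-mean Gaussian noise to an optimal action.

    Both levels are given as fractions of `max_action` (assumed convention).
    The perturbation is applied after the optimal action has been chosen.
    """
    rng = rng if rng is not None else random.Random()
    bias = bias_frac * max_action        # same offset on every axis (assumption)
    noise_std = noise_frac * max_action  # noise level used as a std. dev. (assumption)
    return [a + bias + rng.gauss(0.0, noise_std) for a in optimal_action]

optimal = [0.2, -0.1, 0.0]  # hypothetical 3-D optimal action
clean = perturb_action(optimal, max_action=1.0)
biased = perturb_action(optimal, max_action=1.0, bias_frac=0.5)
noisy = perturb_action(optimal, max_action=1.0, noise_frac=0.5, bias_frac=0.5,
                       rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps repeated evaluations reproducible.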
In our in-person user study, the participants interact with a 7-DoF Franka Emika Panda robot arm to teach it 3 different tasks. The state and action space for one task is 3-dimensional, i.e.
The tasks, features for the tasks and the original learning rule
- Distance from the plate: $d_{plate}$
- Distance from the pitcher: $d_{pitcher}$
- Height from the table: $h$
- Orientation of the end-effector
Task 1: The robot starts at a fixed position, holding a cup upright. The users taught the robot to move to the plate while avoiding the pitcher and keeping the cup close to the table.
Task 2: The robot starts at a fixed position, holding the cup in a tilted position. The users taught the robot to carry the cup upright while moving to the plate, avoiding the pitcher and moving close to the table.
Task 3: The robot starts in a similar pose as Task 2. The users taught the robot to carry the cup upright, while moving away from the plate, the pitcher and the table.
Note that Task 1 and Task 2 were incorporated in the prior, while Task 3 was a new task that was not included in the prior.
To implement StROL in a user study, move to the user_study folder using
cd user_study
The pre-trained model for Task 1 is saved in `g_data/model_t1`, and the pre-trained model for Task 2 and Task 3 is saved in `g_data/model_t2`.
To run tests on the robot, run the following command:
python3 --eval --alg <algorithm> --task <task number> --env_dim <3/6> --n_features <3/4>
--alg defines the algorithm being used for the test ('strol', 'oat' or 'mof'), --task takes in the task number (1, 2 or 3), --env_dim should be 3 for Task 1 and 6 otherwise, and --n_features is 3 for Task 1 and 4 for the other tasks. If you want the robot to play the optimal robot trajectory for a given task, use the --demo argument when running the script.
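Because --env_dim and --n_features depend on the task number, it can help to derive them from the task instead of typing them by hand. The helper below is a hypothetical convenience sketch (the name `eval_args` is not part of the repository); it only assembles the argument list and leaves the script name to the caller.

```python
def eval_args(task, alg="strol"):
    """Build the evaluation arguments for a given task and algorithm."""
    if task not in (1, 2, 3):
        raise ValueError("task must be 1, 2 or 3")
    if alg not in ("strol", "oat", "mof"):
        raise ValueError("alg must be 'strol', 'oat' or 'mof'")
    env_dim = 3 if task == 1 else 6      # 3 for Task 1, 6 otherwise
    n_features = 3 if task == 1 else 4   # 3 for Task 1, 4 otherwise
    return ["--eval", "--alg", alg, "--task", str(task),
            "--env_dim", str(env_dim), "--n_features", str(n_features)]

args_task1 = eval_args(1)
args_task3 = eval_args(3, alg="mof")
```

The resulting list can be appended to the interpreter and script name, for example when launching runs with `subprocess.run`.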
In the Highway environment, the robot car observes the actions and updates its estimate of the human car's task parameters. The performance of the algorithms is quantified by the error in the estimate of the human car's task parameters.
The performance of the different learning approaches in the Highway simulation, averaged over 250 runs, is tabulated below:
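One way to compute such an estimation error and summarize it as mean ± standard deviation over runs is sketched below. The Euclidean norm and the sample standard deviation are assumptions made for illustration; the helper names `estimate_error` and `summarize` are hypothetical and the repository may define the metric differently.

```python
import math
import statistics

def estimate_error(theta_true, theta_hat):
    """Euclidean distance between true and estimated task parameters (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_true, theta_hat)))

def summarize(errors):
    """Return (mean, sample standard deviation) of per-run errors."""
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical per-run estimates against a true parameter vector.
theta_true = [0.0, 0.0]
estimates = [[3.0, 4.0], [0.0, 1.0], [0.0, 3.0]]
errors = [estimate_error(theta_true, est) for est in estimates]
mean_error, std_error = summarize(errors)
```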
| Condition | Gradient | One | MOF | e2e | StROL |
| --- | --- | --- | --- | --- | --- |
| Training | 0.59 ± 0.36 | 0.51 ± 0.37 | 0.58 ± 0.36 | 0.48 ± 0.08 | 0.31 ± 0.06 |
| 0% Noise 0% Bias | 0.55 ± 0.36 | 0.53 ± 0.35 | 0.56 ± 0.37 | 0.48 ± 0.08 | 0.30 ± 0.08 |
| 50% Noise 50% Bias | 1.18 ± 0.48 | 1.18 ± 0.41 | 1.08 ± 0.48 | 1.18 ± 0.6 | 1.09 ± 0.6 |
| Uniform Prior | 1.07 ± 0.43 | 1.1 ± 0.45 | 1.01 ± 0.42 | 0.98 ± 0.4 | 0.10 ± 0.42 |
In this environment, the simulated human interacts with the robot to provide corrections over 5 timesteps to convey their desired task parameters. The performance of the robot is measured in terms of regret as
The results for the Robot simulation for different approaches averaged over 100 runs are tabulated below:
| Condition | Gradient | One | MOF | e2e | StROL |
| --- | --- | --- | --- | --- | --- |
| Training | 0.18 ± 0.37 | 0.20 ± 0.38 | 0.12 ± 0.29 | 3.69 ± 0.89 | 0.001 ± 0.004 |
| 0% Noise 0% Bias | 0.10 ± 0.26 | 0.14 ± 0.33 | 0.06 ± 0.18 | 3.86 ± 0.95 | 0.01 ± 0.08 |
| 50% Noise 50% Bias | 0.77 ± 0.84 | 0.44 ± 0.68 | 0.47 ± 0.59 | 3.25 ± 1.36 | 0.16 ± 0.83 |
| Uniform Prior | 0.17 ± 0.47 | 0.18 ± 0.50 | 0.18 ± 0.46 | 1.12 ± 0.49 | 0.12 ± 0.34 |
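For readers unfamiliar with the regret metric reported in the table above, a common definition is the reward lost by following the robot's learned trajectory instead of the optimal one, both scored under the human's true parameters. The sketch below assumes a reward that is linear in trajectory features; that assumption and the names `reward` and `regret` are illustrative and may differ from the definition used in the repository.

```python
def reward(theta, features):
    """Linear reward: dot product of parameters and trajectory features (assumed form)."""
    return sum(t * f for t, f in zip(theta, features))

def regret(theta_true, features_optimal, features_robot):
    """Reward of the optimal trajectory minus reward of the robot's trajectory,
    both evaluated under the true task parameters."""
    return reward(theta_true, features_optimal) - reward(theta_true, features_robot)

# Hypothetical feature counts for an optimal and a learned trajectory.
theta_true = [1.0, -1.0]
example_regret = regret(theta_true, features_optimal=[2.0, 0.0], features_robot=[1.0, 1.0])
```

Under this definition the regret is zero when the robot's trajectory is as good as the optimal one for the true task.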
We also performed simulated experiments in this environment to study the effect that the relative weight of
We vary the value of lambda from 0 to 10 and report the results for two different testing conditions --- (a) "Best Case": when the robot is given the prior and the human model similar to training and (b) "Worst Case": when the robot has an incorrect prior.
Figure 2: Performance of StROL with varying relative weights of the original learning dynamics and the correction term.
We observe that when
We observe that the relative weight of
Next, we move on to test the efficacy of StROL when the user teaching the task changes their desired task parameters midway through the interaction. In this simulation, the simulated human always chooses a task from the prior. For the first 2 timesteps, the human provides corrections for one task from the prior and for the remaining 3 timesteps provides corrections for the other task. The performance of the robot using different approaches is summarized in the plot below.
Figure 3: Comparison of StROL with the baselines when the user changes preference for teaching a task midway through the interaction.
We observe that, using StROL, the simulated humans are able to convey their task preferences to the robot more efficiently even when their preferences change partway through the interaction.
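The switching protocol used in this experiment (one task for the first 2 timesteps, another for the remaining 3) can be sketched as below. The update inside `run_interaction` is a simple stand-in that moves the estimate a fixed fraction toward the currently taught task; it is not StROL or any of the baselines, and all names and values are hypothetical.

```python
def task_schedule(task_a, task_b, switch_step=2, horizon=5):
    """Return the task taught at each timestep: task_a before the switch, task_b after."""
    return [task_a if t < switch_step else task_b for t in range(horizon)]

def run_interaction(theta0, schedule, alpha=0.5):
    """Stand-in learner: move the estimate a fraction alpha toward the current task."""
    theta = list(theta0)
    for target in schedule:
        theta = [t + alpha * (g - t) for t, g in zip(theta, target)]
    return theta

# Two hypothetical tasks from the prior, each described by one parameter.
schedule = task_schedule([1.0], [-1.0])
final_theta = run_interaction([0.0], schedule)
```

After the switch, the estimate has to undo what it learned in the first two timesteps, which is why fast adaptation matters in this setting.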
In the above simulations and the manuscript, we test the performance of StROL when we have information about the prior and the human model. In practice, however, we may not always have access to environment priors. To test the performance of StROL when we have no prior and only partial information about the human model, we conduct experiments in a simulated Robot environment. For training, the robot is given a prior in the form of a uniform distribution and the simulated human takes actions with a consistent bias of 25% of the magnitude of the largest action. For testing, the simulated humans take actions with a consistent bias and Gaussian noise with variance equal to 25% of the largest action.
Figure 3: Comparison of StROL with the baselines when StROL does not have access to environment priors and has only partial information about the human model.
We see that StROL has lower regret than each baseline, even when the prior is not known. This suggests that StROL is still useful in settings where the prior is unknown, but the robot has at least partial knowledge of the human model.
In our user study, we measure the performance of the robot by the regret in performing the task and the time for which the users provided corrections to the robot to convey their intended task. The gifs of the users teaching different tasks to the robot in the user study can be seen below:
The average objective results for all tasks in the user study are shown below:
| Metric | One | MOF | StROL |
| --- | --- | --- | --- |
| Regret | 4.904 ± 1.125 | 2.42 ± 0.48 | 1.03 ± 0.08 |
| Correction Time (s) | 2.50 ± 1.57 | 2.22 ± 1.53 | 1.382 ± 1.40 |
We also plot Correction Time vs Regret for all tasks and approaches in a scatter plot to analyze the effort that the users had to exert in order to convey their task parameters to the robot:
Figure 7: Scatter Plot for showing Regret vs Number of corrections provided by the users for all approaches.
The offline learning of the correction term was performed on a system with 8 cores and 16 GB Memory (GPU was not used for training). The offline training times for different experiments are tabulated below.
| Experiment | # of Training Steps | Training Time (min) |
| --- | --- | --- |
| Highway | 1000 | ~105 |
| Robot | 500 | ~10 |
| User Study | 2000 | ~45 |
The offline training is performed only once for a given environment and a set of priors. The correction term
The user study is performed on a 7-DoF Franka Emika Panda robot arm. After the user provides feedback to convey their task preferences to the robot, the robot's estimate of the reward parameters is updated in real time (with a delay of ~2-4 sec). This delay is similar to that of the baselines in the user study, which also learn in real time from the human feedback. Note that this is the delay observed when performing the experiments on the setup specified above. It may differ for other settings and hardware setups.
title={StROL: Stabilized and Robust Online Learning from Humans},
author={Mehta, Shaunak A and Meng, Forrest and Bajcsy, Andrea and Losey, Dylan P},
journal={arXiv preprint arXiv:2308.09863},
[1] Z. Cao, E. Biyik, W. Z. Wang, A. Raventos, A. Gaidon, G. Rosman, and D. Sadigh, “Reinforcement learning based control of imitative policies for near-accident driving,” in RSS, July 2020.
[2] D. P. Losey, A. Bajcsy, M. K. O’Malley, and A. D. Dragan, “Physical interaction as communication: Learning robot objectives online from human corrections,” IJRR, vol. 41, no. 1, pp. 20–44, 2022.
[3] A. Bobu, A. Bajcsy, J. F. Fisac, S. Deglurkar, and A. D. Dragan, “Quantifying hypothesis space misspecification in learning from human–robot demonstrations and physical corrections,” IEEE Transactions on Robotics, vol. 36, no. 3, pp. 835–854, 2020.