diff --git a/docs/_video/user_guide_video/_agent_page_CartPole.mp4 b/docs/_video/user_guide_video/_agent_page_CartPole.mp4 new file mode 100644 index 000000000..9fe3fd2b5 Binary files /dev/null and b/docs/_video/user_guide_video/_agent_page_CartPole.mp4 differ diff --git a/docs/_video/user_guide_video/_agent_page_chain1.mp4 b/docs/_video/user_guide_video/_agent_page_chain1.mp4 new file mode 100644 index 000000000..4d6e11e4a Binary files /dev/null and b/docs/_video/user_guide_video/_agent_page_chain1.mp4 differ diff --git a/docs/_video/user_guide_video/_agent_page_chain2.mp4 b/docs/_video/user_guide_video/_agent_page_chain2.mp4 new file mode 100644 index 000000000..df26af82f Binary files /dev/null and b/docs/_video/user_guide_video/_agent_page_chain2.mp4 differ diff --git a/docs/_video/user_guide_video/_agent_page_frozenLake.mp4 b/docs/_video/user_guide_video/_agent_page_frozenLake.mp4 new file mode 100644 index 000000000..1892eecd4 Binary files /dev/null and b/docs/_video/user_guide_video/_agent_page_frozenLake.mp4 differ diff --git a/docs/_video/user_guide_video/_env_page_Breakout.mp4 b/docs/_video/user_guide_video/_env_page_Breakout.mp4 new file mode 100644 index 000000000..de95a4476 Binary files /dev/null and b/docs/_video/user_guide_video/_env_page_Breakout.mp4 differ diff --git a/docs/_video/user_guide_video/_env_page_MountainCar.mp4 b/docs/_video/user_guide_video/_env_page_MountainCar.mp4 new file mode 100644 index 000000000..5a3e29f0f Binary files /dev/null and b/docs/_video/user_guide_video/_env_page_MountainCar.mp4 differ diff --git a/docs/_video/user_guide_video/_env_page_chain.mp4 b/docs/_video/user_guide_video/_env_page_chain.mp4 new file mode 100644 index 000000000..5e7d8f861 Binary files /dev/null and b/docs/_video/user_guide_video/_env_page_chain.mp4 differ diff --git a/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 b/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 new file mode 100644 index 000000000..c9ae206f8 Binary files /dev/null and b/docs/_video/user_guide_video/_experimentManager_page_CartPole.mp4 differ diff --git a/docs/api.rst b/docs/api.rst index e2a038821..193e50408 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -68,6 +68,7 @@ Base class :template: class.rst envs.interface.Model + envs.basewrapper.Wrapper Spaces ------ diff --git a/docs/basics/DeepRLTutorial/TutorialDeepRL.md b/docs/basics/DeepRLTutorial/TutorialDeepRL.md new file mode 100644 index 000000000..e0697d07f --- /dev/null +++ b/docs/basics/DeepRLTutorial/TutorialDeepRL.md @@ -0,0 +1,581 @@ +(TutorialDeepRL)= + +Quickstart for Deep Reinforcement Learning in rlberry +===================================================== + +In this tutorial, we will focus on Deep Reinforcement Learning with the +**Advantage Actor-Critic** algorithm. + +Imports +------- + +```python +from rlberry.envs import gym_make +from rlberry.manager import plot_writer_data, ExperimentManager, evaluate_agents +from rlberry_research.agents.torch import A2CAgent +from rlberry_research.agents.torch.utils.training import model_factory_from_env +``` + +Reminder of the RL setting +-------------------------- + +We will consider a MDP $M = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ +with: + +- $\mathcal{S}$ the state space, +- $\mathcal{A}$ the action space, +- $p(x^\prime \mid x, a)$ the transition probability, +- $r(x, a, x^\prime)$ the reward of the transition $(x, a, x^\prime)$, +- $\gamma \in [0,1)$ is the discount factor. 
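+
+To see how these abstract objects show up in code, here is an optional, minimal
+illustration (using the Gymnasium environments presented later and the `gym_make`
+import above): it prints the spaces that play the role of $\mathcal{S}$ and
+$\mathcal{A}$ and samples a single transition.
+
+```python
+from rlberry.envs import gym_make
+
+env = gym_make(id="CartPole-v1")
+print(env.observation_space)  # the state space S
+print(env.action_space)  # the action space A
+
+observation, info = env.reset()
+action = env.action_space.sample()  # pick an action uniformly at random
+next_observation, reward, terminated, truncated, info = env.step(action)
+print(reward)  # r(x, a, x') for this sampled transition
+```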
+
+A policy $\pi$ is a mapping from the state space $\mathcal{S}$ to the
+probability of selecting each action. The action-value function of a
+policy is the expected return obtained from a given state-action pair:
+$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0=s, a_0=a\big]$
+where $\tau$ is an episode
+$(s_0, a_0, r_0, s_1, a_1, r_1, s_2, ..., s_T, a_T, r_T)$ with the
+actions drawn from $\pi(s)$; $R(\tau)$ is the random variable defined as
+the cumulative sum of the discounted rewards.
+
+The goal is to maximize the cumulative sum of discounted rewards:
+
+$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \big]$$
+
+Gymnasium Environment
+---------------------
+
+In this tutorial we are going to use the [Gymnasium library (previously
+OpenAI's Gym)](https://gymnasium.farama.org/api/env/). This library
+provides a large number of environments to test RL algorithms.
+
+We will focus only on the **CartPole-v1** environment, although we
+recommend experimenting with other environments such as **Acrobot-v1**
+and **MountainCar-v0**. The following table presents some basic
+components of the three environments, such as the dimensions of their
+observation and action spaces and the rewards occurring at each step.
+
+ | Env Info              | CartPole-v1 | Acrobot-v1                 | MountainCar-v0 |
+ |:----------------------|:------------|:---------------------------|:---------------|
+ | **Observation Space** | Box(4)      | Box(6)                     | Box(2)         |
+ | **Action Space**      | Discrete(2) | Discrete(3)                | Discrete(3)    |
+ | **Rewards**           | 1 per step  | -1 if not terminal else 0  | -1 per step    |
+
+Actor-Critic algorithms and A2C
+-------------------------------
+
+**Actor-Critic** methods consist of two models, which may optionally
+share parameters:
+
+- The critic updates the value function parameters $\varphi$; depending
+on the algorithm, this can be the action-value function
+$Q_{\varphi}(s, a)$ or the state-value function $V_{\varphi}(s)$.
+- The actor updates the policy parameters $\theta$ of
+$\pi_{\theta}(a \mid s)$ in the direction suggested by the critic.
+
+**A2C** is an Actor-Critic algorithm and it is part of the on-policy
+family, which means that we are learning the value function of the
+policy while following it. The original paper in which it was proposed
+can be found [here](https://arxiv.org/pdf/1602.01783.pdf) and the
+pseudocode of the algorithm is the following:
+
+- Initialize the actor $\pi_{\theta}$ and the critic $V_{\varphi}$
+  with random weights.
+- Observe the initial state $s_{0}$.
+- for $t \in\left[0, T_{\text {total }}\right]$ :
+    - Initialize empty episode minibatch.
+    - for $k \in[0, n]:$ \# Sample episode
+        - Select an action $a_{k}$ using the actor $\pi_{\theta}$.
+        - Perform the action $a_{k}$ and observe the next state
+          $s_{k+1}$ and the reward $r_{k+1}$.
+        - Store $\left(s_{k}, a_{k}, r_{k+1}\right)$ in the episode
+          minibatch.
+    - if $s_{n}$ is not terminal: set
+      $R=V_{\varphi}\left(s_{n}\right)$ with the critic, else $R=0$.
+    - Reset the gradients $d \theta$ and $d \varphi$ to 0.
+    - for $k \in[n-1,0]$ : \# Backwards iteration over the episode
+        - Update the discounted sum of rewards
+          $R \leftarrow r_{k}+\gamma R$
+
+        - Accumulate the policy gradient using the critic:
+
+          $$d \theta \leftarrow d \theta+\nabla_{\theta} \log \pi_{\theta}\left(a_{k}\mid s_{k}\right)\left(R-V_{\varphi}\left(s_{k}\right)\right)$$
+
+        - Accumulate the critic gradient:
+
+          $$d \varphi \leftarrow d \varphi+\nabla_{\varphi}\left(R-V_{\varphi}\left(s_{k}\right)\right)^{2}$$
+
+- Update the actor and the critic with the accumulated gradients using
+  gradient descent or similar:
+
+$$\theta \leftarrow \theta+\eta d \theta \quad \varphi \leftarrow \varphi+\eta d \varphi$$
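+
+To make the backwards iteration above concrete, here is a minimal,
+self-contained sketch (plain Python, not the actual `A2CAgent`
+implementation) of how the discounted returns $R$ and the advantages
+$R-V_{\varphi}(s_k)$ used in the two gradient accumulations can be computed;
+the argument names are hypothetical placeholders for quantities produced by
+the sampling loop and the critic.
+
+```python
+def returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
+    """Backwards pass of the A2C pseudocode.
+
+    rewards: the rewards r_{k+1} stored in the episode minibatch.
+    values: the critic estimates V(s_k) for the same steps.
+    bootstrap_value: V(s_n) if s_n is not terminal, else 0.
+    """
+    R = bootstrap_value
+    returns = [0.0] * len(rewards)
+    for k in reversed(range(len(rewards))):  # k = n-1, ..., 0
+        R = rewards[k] + gamma * R  # discounted sum of rewards
+        returns[k] = R
+    advantages = [ret - v for ret, v in zip(returns, values)]  # R - V(s_k)
+    return returns, advantages
+
+
+# Toy check: three steps with reward 1 and a critic predicting 0.5 everywhere.
+print(returns_and_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], bootstrap_value=0.0))
+```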
+
+Running A2C on CartPole
+-----------------------
+
+**Warning:** depending on the seed, you may get different results; if you are
+(un)lucky, your default agent may learn well and end up better than the tuned
+agent.
+
+In the next example we use default parameters for both the Actor and the
+Critic and we use rlberry to train and evaluate our A2C agent. The
+default networks are:
+
+- a dense neural network with two hidden layers of 64 units for the
+  **Actor**: the input layer has the dimension of the state space,
+  while the output layer has the dimension of the action space. The
+  activations are ReLU functions and there is a softmax in the last
+  layer.
+- a dense neural network with two hidden layers of 64 units for the
+  **Critic**: the input layer has the dimension of the state space,
+  while the output has dimension 1. The activations are ReLU functions,
+  apart from the last layer, which has a linear activation.
+
+```python
+"""
+The ExperimentManager class is a compact way of experimenting with a deep RL agent.
+"""
+default_agent = ExperimentManager(
+    A2CAgent,  # The Agent class.
+    (gym_make, dict(id="CartPole-v1")),  # The Environment to solve.
+    fit_budget=3e5,  # The number of interactions
+    # between the agent and the
+    # environment during training.
+    eval_kwargs=dict(eval_horizon=500),  # The number of interactions
+    # between the agent and the
+    # environment during evaluations.
+    n_fit=1,  # The number of agents to train.
+    # Usually, it is good to do more
+    # than 1 because the training is
+    # stochastic.
+    agent_name="A2C default",  # The agent's name.
+)
+
+print("Training ...")
+default_agent.fit()  # Trains the agent on fit_budget steps!
+
+
+# Plot the training data:
+_ = plot_writer_data(
+    [default_agent],
+    tag="episode_rewards",
+    title="Training Episode Cumulative Rewards",
+    show=True,
+)
+```
+
+```none
+[INFO] Running ExperimentManager fit() for A2C default with n_fit = 1 and max_workers = None.
+INFO: Making new env: CartPole-v1
+INFO: Making new env: CartPole-v1
+[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+```
+
+<br>
+ +```none +Training ... +``` + +
+
+```none
+[INFO] [A2C default[worker: 0]] | max_global_step = 5644 | episode_rewards = 196.0 | total_episodes = 111 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 9551 | episode_rewards = 380.0 | total_episodes = 134 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 13128 | episode_rewards = 125.0 | total_episodes = 182 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 16617 | episode_rewards = 246.0 | total_episodes = 204 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 20296 | episode_rewards = 179.0 | total_episodes = 222 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 23633 | episode_rewards = 120.0 | total_episodes = 240 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 26193 | episode_rewards = 203.0 | total_episodes = 252 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 28969 | episode_rewards = 104.0 | total_episodes = 271 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 34757 | episode_rewards = 123.0 | total_episodes = 335 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 41554 | episode_rewards = 173.0 | total_episodes = 373 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 48418 | episode_rewards = 217.0 | total_episodes = 423 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 55322 | episode_rewards = 239.0 | total_episodes = 446 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 62193 | episode_rewards = 218.0 | total_episodes = 471 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 69233 | episode_rewards = 377.0 | total_episodes = 509 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 76213 | episode_rewards = 211.0 | total_episodes = 536 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 83211 | episode_rewards = 212.0 | total_episodes = 562 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 90325 | episode_rewards = 211.0 | total_episodes = 586 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 97267 | episode_rewards = 136.0 | total_episodes = 631 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 104280 | episode_rewards = 175.0 | total_episodes = 686 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 111194 | episode_rewards = 258.0 | total_episodes = 722 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 118067 | episode_rewards = 235.0 | total_episodes = 755 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 125040 | episode_rewards = 500.0 | total_episodes = 777 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 132478 | episode_rewards = 500.0 | total_episodes = 792 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 139591 | episode_rewards = 197.0 | total_episodes = 813 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 146462 | episode_rewards = 500.0 | total_episodes = 835 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 153462 | episode_rewards = 500.0 | total_episodes = 849 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 160462 | episode_rewards = 500.0 | total_episodes = 863 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 167462 | episode_rewards = 500.0 | total_episodes = 877 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 174462 | episode_rewards = 500.0 | total_episodes = 891 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 181462 | episode_rewards = 500.0 | total_episodes = 905 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 188462 | episode_rewards = 500.0 | total_episodes = 919 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 195462 | episode_rewards = 500.0 | total_episodes = 933 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 202520 | episode_rewards = 206.0 | total_episodes = 957 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 209932 | episode_rewards = 500.0 | total_episodes = 978 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 216932 | episode_rewards = 500.0 | total_episodes = 992 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 223932 | episode_rewards = 500.0 | total_episodes = 1006 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 230916 | episode_rewards = 214.0 | total_episodes = 1024 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 235895 | episode_rewards = 500.0 | total_episodes = 1037 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 242782 | episode_rewards = 118.0 | total_episodes = 1072 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 249695 | episode_rewards = 131.0 | total_episodes = 1111 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 256649 | episode_rewards = 136.0 | total_episodes = 1160 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 263674 | episode_rewards = 100.0 | total_episodes = 1215 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 270727 | episode_rewards = 136.0 | total_episodes = 1279 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 277588 | episode_rewards = 275.0 | total_episodes = 1313 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 284602 | episode_rewards = 136.0 | total_episodes = 1353 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 291609 | episode_rewards = 117.0 | total_episodes = 1413 |
+[INFO] [A2C default[worker: 0]] | max_global_step = 298530 | episode_rewards = 147.0 | total_episodes = 1466 |
+[INFO] ... trained!
+INFO: Making new env: CartPole-v1
+INFO: Making new env: CartPole-v1
+[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+```
+
+<br>
+
+```{image} output_5_3.png
+:align: center
+```
+
+```python
+print("Evaluating ...")
+_ = evaluate_agents(
+    [default_agent], n_simulations=50, show=True
+)  # Evaluate the trained agent on
+# 50 simulations of 500 steps each.
+```
+
+```none
+[INFO] Evaluating A2C default...
+```
+
+<br>
+ +```none +Evaluating ... +``` + +
+
+```none
+[INFO] [eval]... simulation 1/50
+[INFO] [eval]... simulation 2/50
+[INFO] [eval]... simulation 3/50
+[INFO] [eval]... simulation 4/50
+[INFO] [eval]... simulation 5/50
+[INFO] [eval]... simulation 6/50
+[INFO] [eval]... simulation 7/50
+[INFO] [eval]... simulation 8/50
+[INFO] [eval]... simulation 9/50
+[INFO] [eval]... simulation 10/50
+[INFO] [eval]... simulation 11/50
+[INFO] [eval]... simulation 12/50
+[INFO] [eval]... simulation 13/50
+[INFO] [eval]... simulation 14/50
+[INFO] [eval]... simulation 15/50
+[INFO] [eval]... simulation 16/50
+[INFO] [eval]... simulation 17/50
+[INFO] [eval]... simulation 18/50
+[INFO] [eval]... simulation 19/50
+[INFO] [eval]... simulation 20/50
+[INFO] [eval]... simulation 21/50
+[INFO] [eval]... simulation 22/50
+[INFO] [eval]... simulation 23/50
+[INFO] [eval]... simulation 24/50
+[INFO] [eval]... simulation 25/50
+[INFO] [eval]... simulation 26/50
+[INFO] [eval]... simulation 27/50
+[INFO] [eval]... simulation 28/50
+[INFO] [eval]... simulation 29/50
+[INFO] [eval]... simulation 30/50
+[INFO] [eval]... simulation 31/50
+[INFO] [eval]... simulation 32/50
+[INFO] [eval]... simulation 33/50
+[INFO] [eval]... simulation 34/50
+[INFO] [eval]... simulation 35/50
+[INFO] [eval]... simulation 36/50
+[INFO] [eval]... simulation 37/50
+[INFO] [eval]... simulation 38/50
+[INFO] [eval]... simulation 39/50
+[INFO] [eval]... simulation 40/50
+[INFO] [eval]... simulation 41/50
+[INFO] [eval]... simulation 42/50
+[INFO] [eval]... simulation 43/50
+[INFO] [eval]... simulation 44/50
+[INFO] [eval]... simulation 45/50
+[INFO] [eval]... simulation 46/50
+[INFO] [eval]... simulation 47/50
+[INFO] [eval]... simulation 48/50
+[INFO] [eval]... simulation 49/50
+[INFO] [eval]... simulation 50/50
+```
+
+<br>
+
+
+```{image} output_6_3.png
+:align: center
+```
+
+Let's try to change the neural networks' architectures and see if we can
+beat our previous result. This time we use a smaller learning rate and a
+bigger batch size to have more stable training.
+
+```python
+policy_configs = {
+    "type": "MultiLayerPerceptron",  # A network architecture
+    "layer_sizes": (64, 64),  # Network dimensions
+    "reshape": False,
+    "is_policy": True,  # The network should output a distribution
+    # over actions
+}
+
+critic_configs = {
+    "type": "MultiLayerPerceptron",
+    "layer_sizes": (64, 64),
+    "reshape": False,
+    "out_size": 1,  # The critic network is an approximator of
+    # a value function V: States -> |R
+}
+```
+
+```python
+tuned_agent = ExperimentManager(
+    A2CAgent,  # The Agent class.
+    (gym_make, dict(id="CartPole-v1")),  # The Environment to solve.
+    init_kwargs=dict(  # Where to put the agent's hyperparameters
+        policy_net_fn=model_factory_from_env,  # A policy network constructor
+        policy_net_kwargs=policy_configs,  # Policy network's architecture
+        value_net_fn=model_factory_from_env,  # A Critic network constructor
+        value_net_kwargs=critic_configs,  # Critic network's architecture.
+        optimizer_type="ADAM",  # What optimizer to use for policy
+        # gradient descent steps.
+        learning_rate=1e-3,  # Size of the policy gradient
+        # descent steps.
+        entr_coef=0.0,  # How much to force exploration.
+        batch_size=1024  # Number of interactions used to
+        # estimate the policy gradient
+        # for each policy update.
+    ),
+    fit_budget=3e5,  # The number of interactions
+    # between the agent and the
+    # environment during training.
+    eval_kwargs=dict(eval_horizon=500),  # The number of interactions
+    # between the agent and the
+    # environment during evaluations.
+    n_fit=1,  # The number of agents to train.
+    # Usually, it is good to do more
+    # than 1 because the training is
+    # stochastic.
+    agent_name="A2C tuned",  # The agent's name.
+)
+
+
+print("Training ...")
+tuned_agent.fit()  # Trains the agent on fit_budget steps!
+
+
+# Plot the training data:
+_ = plot_writer_data(
+    [default_agent, tuned_agent],
+    tag="episode_rewards",
+    title="Training Episode Cumulative Rewards",
+    show=True,
+)
+```
+
+```none
+[INFO] Running ExperimentManager fit() for A2C tuned with n_fit = 1 and max_workers = None.
+INFO: Making new env: CartPole-v1
+INFO: Making new env: CartPole-v1
+[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead
+```
+
+<br>
+ +```none +Training ... +``` + +
+ +```none +[INFO] [A2C tuned[worker: 0]] | max_global_step = 6777 | episode_rewards = 15.0 | total_episodes = 314 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 13633 | episode_rewards = 14.0 | total_episodes = 602 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 20522 | episode_rewards = 41.0 | total_episodes = 854 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 27531 | episode_rewards = 13.0 | total_episodes = 1063 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 34398 | episode_rewards = 42.0 | total_episodes = 1237 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 41600 | episode_rewards = 118.0 | total_episodes = 1389 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 48593 | episode_rewards = 50.0 | total_episodes = 1511 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 55721 | episode_rewards = 113.0 | total_episodes = 1603 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 62751 | episode_rewards = 41.0 | total_episodes = 1687 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 69968 | episode_rewards = 344.0 | total_episodes = 1741 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 77259 | episode_rewards = 418.0 | total_episodes = 1787 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 84731 | episode_rewards = 293.0 | total_episodes = 1820 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 91890 | episode_rewards = 185.0 | total_episodes = 1853 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 99031 | episode_rewards = 278.0 | total_episodes = 1876 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 106305 | episode_rewards = 318.0 | total_episodes = 1899 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 113474 | episode_rewards = 500.0 | total_episodes = 1921 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 120632 | episode_rewards = 370.0 | total_episodes = 1941 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 127753 | episode_rewards = 375.0 | total_episodes = 1962 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 135179 | episode_rewards = 393.0 | total_episodes = 1987 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 142433 | episode_rewards = 500.0 | total_episodes = 2005 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 149888 | episode_rewards = 500.0 | total_episodes = 2023 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 157312 | episode_rewards = 467.0 | total_episodes = 2042 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 164651 | episode_rewards = 441.0 | total_episodes = 2060 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 172015 | episode_rewards = 500.0 | total_episodes = 2076 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 178100 | episode_rewards = 481.0 | total_episodes = 2089 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 183522 | episode_rewards = 462.0 | total_episodes = 2101 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 190818 | episode_rewards = 500.0 | total_episodes = 2117 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 198115 | episode_rewards = 500.0 | total_episodes = 2135 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 205097 | episode_rewards = 500.0 | total_episodes = 2151 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 212351 | episode_rewards = 500.0 | total_episodes = 2166 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 219386 | episode_rewards = 500.0 | total_episodes = 2181 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 226386 | episode_rewards = 500.0 | total_episodes = 2195 | +[INFO] [A2C tuned[worker: 0]] | max_global_step 
= 233888 | episode_rewards = 500.0 | total_episodes = 2211 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 241388 | episode_rewards = 500.0 | total_episodes = 2226 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 248287 | episode_rewards = 500.0 | total_episodes = 2240 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 255483 | episode_rewards = 500.0 | total_episodes = 2255 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 262845 | episode_rewards = 500.0 | total_episodes = 2270 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 270032 | episode_rewards = 500.0 | total_episodes = 2285 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 277009 | episode_rewards = 498.0 | total_episodes = 2301 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 284044 | episode_rewards = 255.0 | total_episodes = 2318 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 291189 | episode_rewards = 500.0 | total_episodes = 2334 | +[INFO] [A2C tuned[worker: 0]] | max_global_step = 298619 | episode_rewards = 500.0 | total_episodes = 2350 | +[INFO] ... trained! +INFO: Making new env: CartPole-v1 +INFO: Making new env: CartPole-v1 +[INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead +``` + +
+
+
+```{image} output_9_3.png
+:align: center
+```
+
+For more information on plots and visualization, you can check
+[here (in construction)](visualization_page).
+
+```python
+print("Evaluating ...")
+
+# Evaluating and comparing the agents:
+_ = evaluate_agents([default_agent, tuned_agent], n_simulations=50, show=True)
+```
+
+<br>
+ +```none +Evaluating ... +``` + +
+ +```none +[INFO] Evaluating A2C default... +[INFO] [eval]... simulation 1/50 +[INFO] [eval]... simulation 2/50 +[INFO] [eval]... simulation 3/50 +[INFO] [eval]... simulation 4/50 +[INFO] [eval]... simulation 5/50 +[INFO] [eval]... simulation 6/50 +[INFO] [eval]... simulation 7/50 +[INFO] [eval]... simulation 8/50 +[INFO] [eval]... simulation 9/50 +[INFO] [eval]... simulation 10/50 +[INFO] [eval]... simulation 11/50 +[INFO] [eval]... simulation 12/50 +[INFO] [eval]... simulation 13/50 +[INFO] [eval]... simulation 14/50 +[INFO] [eval]... simulation 15/50 +[INFO] [eval]... simulation 16/50 +[INFO] [eval]... simulation 17/50 +[INFO] [eval]... simulation 18/50 +[INFO] [eval]... simulation 19/50 +[INFO] [eval]... simulation 20/50 +[INFO] [eval]... simulation 21/50 +[INFO] [eval]... simulation 22/50 +[INFO] [eval]... simulation 23/50 +[INFO] [eval]... simulation 24/50 +[INFO] [eval]... simulation 25/50 +[INFO] [eval]... simulation 26/50 +[INFO] [eval]... simulation 27/50 +[INFO] [eval]... simulation 28/50 +[INFO] [eval]... simulation 29/50 +[INFO] [eval]... simulation 30/50 +[INFO] [eval]... simulation 31/50 +[INFO] [eval]... simulation 32/50 +[INFO] [eval]... simulation 33/50 +[INFO] [eval]... simulation 34/50 +[INFO] [eval]... simulation 35/50 +[INFO] [eval]... simulation 36/50 +[INFO] [eval]... simulation 37/50 +[INFO] [eval]... simulation 38/50 +[INFO] [eval]... simulation 39/50 +[INFO] [eval]... simulation 40/50 +[INFO] [eval]... simulation 41/50 +[INFO] [eval]... simulation 42/50 +[INFO] [eval]... simulation 43/50 +[INFO] [eval]... simulation 44/50 +[INFO] [eval]... simulation 45/50 +[INFO] [eval]... simulation 46/50 +[INFO] [eval]... simulation 47/50 +[INFO] [eval]... simulation 48/50 +[INFO] [eval]... simulation 49/50 +[INFO] [eval]... simulation 50/50 +[INFO] Evaluating A2C tuned... +[INFO] [eval]... simulation 1/50 +[INFO] [eval]... simulation 2/50 +[INFO] [eval]... simulation 3/50 +[INFO] [eval]... simulation 4/50 +[INFO] [eval]... simulation 5/50 +[INFO] [eval]... simulation 6/50 +[INFO] [eval]... simulation 7/50 +[INFO] [eval]... simulation 8/50 +[INFO] [eval]... simulation 9/50 +[INFO] [eval]... simulation 10/50 +[INFO] [eval]... simulation 11/50 +[INFO] [eval]... simulation 12/50 +[INFO] [eval]... simulation 13/50 +[INFO] [eval]... simulation 14/50 +[INFO] [eval]... simulation 15/50 +[INFO] [eval]... simulation 16/50 +[INFO] [eval]... simulation 17/50 +[INFO] [eval]... simulation 18/50 +[INFO] [eval]... simulation 19/50 +[INFO] [eval]... simulation 20/50 +[INFO] [eval]... simulation 21/50 +[INFO] [eval]... simulation 22/50 +[INFO] [eval]... simulation 23/50 +[INFO] [eval]... simulation 24/50 +[INFO] [eval]... simulation 25/50 +[INFO] [eval]... simulation 26/50 +[INFO] [eval]... simulation 27/50 +[INFO] [eval]... simulation 28/50 +[INFO] [eval]... simulation 29/50 +[INFO] [eval]... simulation 30/50 +[INFO] [eval]... simulation 31/50 +[INFO] [eval]... simulation 32/50 +[INFO] [eval]... simulation 33/50 +[INFO] [eval]... simulation 34/50 +[INFO] [eval]... simulation 35/50 +[INFO] [eval]... simulation 36/50 +[INFO] [eval]... simulation 37/50 +[INFO] [eval]... simulation 38/50 +[INFO] [eval]... simulation 39/50 +[INFO] [eval]... simulation 40/50 +[INFO] [eval]... simulation 41/50 +[INFO] [eval]... simulation 42/50 +[INFO] [eval]... simulation 43/50 +[INFO] [eval]... simulation 44/50 +[INFO] [eval]... simulation 45/50 +[INFO] [eval]... simulation 46/50 +[INFO] [eval]... simulation 47/50 +[INFO] [eval]... simulation 48/50 +[INFO] [eval]... simulation 49/50 +[INFO] [eval]... 
simulation 50/50 +``` + +
+ +```{image} output_10_3.png +:align: center +``` diff --git a/docs/basics/DeepRLTutorial/TutorialDeepRL.rst b/docs/basics/DeepRLTutorial/TutorialDeepRL.rst deleted file mode 100644 index 99dbe4425..000000000 --- a/docs/basics/DeepRLTutorial/TutorialDeepRL.rst +++ /dev/null @@ -1,591 +0,0 @@ -Quickstart for Deep Reinforcement Learning in rlberry -===================================================== - -.. highlight:: none - -.. - Authors: Riccardo Della Vecchia, Hector Kohler, Alena Shilova. - -In this tutorial, we will focus on Deep Reinforcement Learning with the **Advantage Actor-Critic** algorithm. - - -Imports ------------------------------ - -.. code:: python - - from rlberry.envs import gym_make - from rlberry.manager import plot_writer_data, ExperimentManager, evaluate_agents - from rlberry.agents.torch import A2CAgent - from rlberry.agents.torch.utils.training import model_factory_from_env - - -Reminder of the RL setting --------------------------- - -We will consider a MDP :math:`M = (\mathcal{S}, \mathcal{A}, p, r, \gamma)` with: - -* :math:`\mathcal{S}` the state space, -* :math:`\mathcal{A}` the action space, -* :math:`p(x^\prime \mid x, a)` the transition probability, -* :math:`r(x, a, x^\prime)` the reward of the transition :math:`(x, a, x^\prime)`, -* :math:`\gamma \in [0,1)` is the discount factor. - -A policy :math:`\pi` is a mapping from the state space :math:`\mathcal{S}` to the probability of selecting each action. -The action value function of a policy is the overall expected reward -from a state action. -:math:`Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0=s, a_0=a\big]` -where :math:`\tau` is an episode -:math:`(s_0, a_0, r_0, s_1, a_1, r_1, s_2, ..., s_T, a_T, r_T)` with the -actions drawn from :math:`\pi(s)`; :math:`R(\tau)` is the random -variable defined as the cumulative sum of the discounted reward. - -The goal is to maximize the cumulative sum of discount rewards: - -.. math:: J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \big] - -Gymnasium Environment ---------------- - -In this tutorial we are going to use the `Gymnasium library (previously OpenAI’s Gym) -`__. This library provides a large -number of environments to test RL algorithm. - -We will focus only on the **CartPole-v1** environment, although we recommend experimenting with other environments such as **Acrobot-v1** -and **MountainCar-v0**. -The following table presents some -basic components of the three environments, such as the dimensions of -their observation and action spaces and the rewards occurring at each -step. - -===================== =========== ========================= -Env Info CartPole-v1 Acrobot-v1 MountainCar-v0 -===================== =========== ========================= -**Observation Space** Box(4) Box(6) Box(2) -**Action Space** Discrete(2) Discrete(3) Discrete(3) -**Rewards** 1 per step -1 if not terminal else 0 -1 per step -===================== =========== ========================= - -Actor-Critic algorithms and A2C -------------------------------- - -**Actor-Critic algorithms** methods consist of two models, which may -optionally share parameters: - -- Critic updates the value function parameters w and depending on the algorithm it could be action-value -:math:`Q_{\varphi}(s,a )` or state-value :math:`V_{\varphi}(s)`. -- Actor updates the policy parameters :math:`\theta` for -:math:`\pi_{\theta}(a \mid s)`, in the direction suggested by the -critic. 
- -**A2C** is an Actor-Critic algorithm and it is part of the on-policy -family, which means that we are learning the value function for one -policy while following it. The original paper in which it was proposed -can be found `here `__ and the -pseudocode of the algorithm is the following: - -- Initialize the actor :math:`\pi_{\theta}` and the critic - :math:`V_{\varphi}` with random weights. -- Observe the initial state :math:`s_{0}`. -- for :math:`t \in\left[0, T_{\text {total }}\right]` : - - - Initialize empty episode minibatch. - - for :math:`k \in[0, n]:` # Sample episode - - - Select a action :math:`a_{k}` using the actor - :math:`\pi_{\theta}`. - - Perform the action :math:`a_{k}` and observe the next state - :math:`s_{k+1}` and the reward :math:`r_{k+1}`. - - Store :math:`\left(s_{k}, a_{k}, r_{k+1}\right)` in the episode - minibatch. - - - if :math:`s_{n}` is not terminal: set - :math:`R=V_{\varphi}\left(s_{n}\right)` with the critic, else - :math:`R=0`. - - Reset gradient :math:`d \theta` and :math:`d \varphi` to 0 . - - for :math:`k \in[n-1,0]` : # Backwards iteration over the episode - - - Update the discounted sum of rewards - :math:`R \leftarrow r_{k}+\gamma R` - - Accumulate the policy gradient using the critic: - - .. math:: - - - d \theta \leftarrow d \theta+\nabla_{\theta} \log \pi_{\theta}\left(a_{k}\mid s_{k}\right)\left(R-V_{\varphi}\left(s_{k}\right)\right) - - - Accumulate the critic gradient: - -.. math:: - - - d \varphi \leftarrow d \varphi+\nabla_{\varphi}\left(R-V_{\varphi}\left(s_{k}\right)\right)^{2} - -- Update the actor and the critic with the accumulated gradients using - gradient descent or similar: - -.. math:: - - - \theta \leftarrow \theta+\eta d \theta \quad \varphi \leftarrow \varphi+\eta d \varphi - -Running A2C on CartPole ------------------------ - -In the next example we use default parameters for both the Actor and the -Critic and we use rlberry to train and evaluate our A2C agent. The -default networks are: - -- a dense neural network with two hidden layers of 64 units for the - **Actor**, the input layer has the dimension of the state space while - the output layer has the dimension of the action space. The - activations are RELU functions and we have a softmax in the last - layer. -- a dense neural network with two hidden layers of 64 units for the - **Critic**, the input layer has the dimension of the state space - while the output has dimension 1. The activations are RELU functions - apart from the last layer that has a linear activation. - -.. code:: python - - """ - The ExperimentManager class is compact way of experimenting with a deepRL agent. - """ - default_agent = ExperimentManager( - A2CAgent, # The Agent class. - (gym_make, dict(id="CartPole-v1")), # The Environment to solve. - fit_budget=3e5, # The number of interactions - # between the agent and the - # environment during training. - eval_kwargs=dict(eval_horizon=500), # The number of interactions - # between the agent and the - # environment during evaluations. - n_fit=1, # The number of agents to train. - # Usually, it is good to do more - # than 1 because the training is - # stochastic. - agent_name="A2C default", # The agent's name. - ) - - print("Training ...") - default_agent.fit() # Trains the agent on fit_budget steps! - - - # Plot the training data: - _ = plot_writer_data( - [default_agent], - tag="episode_rewards", - title="Training Episode Cumulative Rewards", - show=True, - ) - - -.. 
parsed-literal:: - - [INFO] Running ExperimentManager fit() for A2C default with n_fit = 1 and max_workers = None. - INFO: Making new env: CartPole-v1 - INFO: Making new env: CartPole-v1 - [INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead - - -.. parsed-literal:: - - Training ... - - -.. parsed-literal:: - - [INFO] [A2C default[worker: 0]] | max_global_step = 5644 | episode_rewards = 196.0 | total_episodes = 111 | - [INFO] [A2C default[worker: 0]] | max_global_step = 9551 | episode_rewards = 380.0 | total_episodes = 134 | - [INFO] [A2C default[worker: 0]] | max_global_step = 13128 | episode_rewards = 125.0 | total_episodes = 182 | - [INFO] [A2C default[worker: 0]] | max_global_step = 16617 | episode_rewards = 246.0 | total_episodes = 204 | - [INFO] [A2C default[worker: 0]] | max_global_step = 20296 | episode_rewards = 179.0 | total_episodes = 222 | - [INFO] [A2C default[worker: 0]] | max_global_step = 23633 | episode_rewards = 120.0 | total_episodes = 240 | - [INFO] [A2C default[worker: 0]] | max_global_step = 26193 | episode_rewards = 203.0 | total_episodes = 252 | - [INFO] [A2C default[worker: 0]] | max_global_step = 28969 | episode_rewards = 104.0 | total_episodes = 271 | - [INFO] [A2C default[worker: 0]] | max_global_step = 34757 | episode_rewards = 123.0 | total_episodes = 335 | - [INFO] [A2C default[worker: 0]] | max_global_step = 41554 | episode_rewards = 173.0 | total_episodes = 373 | - [INFO] [A2C default[worker: 0]] | max_global_step = 48418 | episode_rewards = 217.0 | total_episodes = 423 | - [INFO] [A2C default[worker: 0]] | max_global_step = 55322 | episode_rewards = 239.0 | total_episodes = 446 | - [INFO] [A2C default[worker: 0]] | max_global_step = 62193 | episode_rewards = 218.0 | total_episodes = 471 | - [INFO] [A2C default[worker: 0]] | max_global_step = 69233 | episode_rewards = 377.0 | total_episodes = 509 | - [INFO] [A2C default[worker: 0]] | max_global_step = 76213 | episode_rewards = 211.0 | total_episodes = 536 | - [INFO] [A2C default[worker: 0]] | max_global_step = 83211 | episode_rewards = 212.0 | total_episodes = 562 | - [INFO] [A2C default[worker: 0]] | max_global_step = 90325 | episode_rewards = 211.0 | total_episodes = 586 | - [INFO] [A2C default[worker: 0]] | max_global_step = 97267 | episode_rewards = 136.0 | total_episodes = 631 | - [INFO] [A2C default[worker: 0]] | max_global_step = 104280 | episode_rewards = 175.0 | total_episodes = 686 | - [INFO] [A2C default[worker: 0]] | max_global_step = 111194 | episode_rewards = 258.0 | total_episodes = 722 | - [INFO] [A2C default[worker: 0]] | max_global_step = 118067 | episode_rewards = 235.0 | total_episodes = 755 | - [INFO] [A2C default[worker: 0]] | max_global_step = 125040 | episode_rewards = 500.0 | total_episodes = 777 | - [INFO] [A2C default[worker: 0]] | max_global_step = 132478 | episode_rewards = 500.0 | total_episodes = 792 | - [INFO] [A2C default[worker: 0]] | max_global_step = 139591 | episode_rewards = 197.0 | total_episodes = 813 | - [INFO] [A2C default[worker: 0]] | max_global_step = 146462 | episode_rewards = 500.0 | total_episodes = 835 | - [INFO] [A2C default[worker: 0]] | max_global_step = 153462 | episode_rewards = 500.0 | total_episodes = 849 | - [INFO] [A2C default[worker: 0]] | max_global_step = 160462 | episode_rewards = 500.0 | total_episodes = 863 | - [INFO] [A2C default[worker: 0]] | max_global_step = 167462 | episode_rewards = 500.0 | total_episodes = 877 | - [INFO] [A2C default[worker: 0]] | max_global_step = 174462 | episode_rewards = 500.0 | 
total_episodes = 891 | - [INFO] [A2C default[worker: 0]] | max_global_step = 181462 | episode_rewards = 500.0 | total_episodes = 905 | - [INFO] [A2C default[worker: 0]] | max_global_step = 188462 | episode_rewards = 500.0 | total_episodes = 919 | - [INFO] [A2C default[worker: 0]] | max_global_step = 195462 | episode_rewards = 500.0 | total_episodes = 933 | - [INFO] [A2C default[worker: 0]] | max_global_step = 202520 | episode_rewards = 206.0 | total_episodes = 957 | - [INFO] [A2C default[worker: 0]] | max_global_step = 209932 | episode_rewards = 500.0 | total_episodes = 978 | - [INFO] [A2C default[worker: 0]] | max_global_step = 216932 | episode_rewards = 500.0 | total_episodes = 992 | - [INFO] [A2C default[worker: 0]] | max_global_step = 223932 | episode_rewards = 500.0 | total_episodes = 1006 | - [INFO] [A2C default[worker: 0]] | max_global_step = 230916 | episode_rewards = 214.0 | total_episodes = 1024 | - [INFO] [A2C default[worker: 0]] | max_global_step = 235895 | episode_rewards = 500.0 | total_episodes = 1037 | - [INFO] [A2C default[worker: 0]] | max_global_step = 242782 | episode_rewards = 118.0 | total_episodes = 1072 | - [INFO] [A2C default[worker: 0]] | max_global_step = 249695 | episode_rewards = 131.0 | total_episodes = 1111 | - [INFO] [A2C default[worker: 0]] | max_global_step = 256649 | episode_rewards = 136.0 | total_episodes = 1160 | - [INFO] [A2C default[worker: 0]] | max_global_step = 263674 | episode_rewards = 100.0 | total_episodes = 1215 | - [INFO] [A2C default[worker: 0]] | max_global_step = 270727 | episode_rewards = 136.0 | total_episodes = 1279 | - [INFO] [A2C default[worker: 0]] | max_global_step = 277588 | episode_rewards = 275.0 | total_episodes = 1313 | - [INFO] [A2C default[worker: 0]] | max_global_step = 284602 | episode_rewards = 136.0 | total_episodes = 1353 | - [INFO] [A2C default[worker: 0]] | max_global_step = 291609 | episode_rewards = 117.0 | total_episodes = 1413 | - [INFO] [A2C default[worker: 0]] | max_global_step = 298530 | episode_rewards = 147.0 | total_episodes = 1466 | - [INFO] ... trained! - INFO: Making new env: CartPole-v1 - INFO: Making new env: CartPole-v1 - [INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead - - - -.. image:: output_5_3.png - - -.. code:: python - - print("Evaluating ...") - _ = evaluate_agents( - [default_agent], n_simulations=50, show=True - ) # Evaluate the trained agent on - # 10 simulations of 500 steps each. - - -.. parsed-literal:: - - [INFO] Evaluating A2C default... - - -.. parsed-literal:: - - Evaluating ... - - -.. parsed-literal:: - - [INFO] [eval]... simulation 1/50 - [INFO] [eval]... simulation 2/50 - [INFO] [eval]... simulation 3/50 - [INFO] [eval]... simulation 4/50 - [INFO] [eval]... simulation 5/50 - [INFO] [eval]... simulation 6/50 - [INFO] [eval]... simulation 7/50 - [INFO] [eval]... simulation 8/50 - [INFO] [eval]... simulation 9/50 - [INFO] [eval]... simulation 10/50 - [INFO] [eval]... simulation 11/50 - [INFO] [eval]... simulation 12/50 - [INFO] [eval]... simulation 13/50 - [INFO] [eval]... simulation 14/50 - [INFO] [eval]... simulation 15/50 - [INFO] [eval]... simulation 16/50 - [INFO] [eval]... simulation 17/50 - [INFO] [eval]... simulation 18/50 - [INFO] [eval]... simulation 19/50 - [INFO] [eval]... simulation 20/50 - [INFO] [eval]... simulation 21/50 - [INFO] [eval]... simulation 22/50 - [INFO] [eval]... simulation 23/50 - [INFO] [eval]... simulation 24/50 - [INFO] [eval]... simulation 25/50 - [INFO] [eval]... simulation 26/50 - [INFO] [eval]... 
simulation 27/50 - [INFO] [eval]... simulation 28/50 - [INFO] [eval]... simulation 29/50 - [INFO] [eval]... simulation 30/50 - [INFO] [eval]... simulation 31/50 - [INFO] [eval]... simulation 32/50 - [INFO] [eval]... simulation 33/50 - [INFO] [eval]... simulation 34/50 - [INFO] [eval]... simulation 35/50 - [INFO] [eval]... simulation 36/50 - [INFO] [eval]... simulation 37/50 - [INFO] [eval]... simulation 38/50 - [INFO] [eval]... simulation 39/50 - [INFO] [eval]... simulation 40/50 - [INFO] [eval]... simulation 41/50 - [INFO] [eval]... simulation 42/50 - [INFO] [eval]... simulation 43/50 - [INFO] [eval]... simulation 44/50 - [INFO] [eval]... simulation 45/50 - [INFO] [eval]... simulation 46/50 - [INFO] [eval]... simulation 47/50 - [INFO] [eval]... simulation 48/50 - [INFO] [eval]... simulation 49/50 - [INFO] [eval]... simulation 50/50 - - - -.. image:: output_6_3.png - - -Let’s try to change the neural networks’ architectures and see if we can -beat our previous result. This time we use a smaller learning rate -and bigger batch size to have more stable training. - -.. code:: python - - policy_configs = { - "type": "MultiLayerPerceptron", # A network architecture - "layer_sizes": (64, 64), # Network dimensions - "reshape": False, - "is_policy": True, # The network should output a distribution - # over actions - } - - critic_configs = { - "type": "MultiLayerPerceptron", - "layer_sizes": (64, 64), - "reshape": False, - "out_size": 1, # The critic network is an approximator of - # a value function V: States -> |R - } - -.. code:: python - - tuned_agent = ExperimentManager( - A2CAgent, # The Agent class. - (gym_make, dict(id="CartPole-v1")), # The Environment to solve. - init_kwargs=dict( # Where to put the agent's hyperparameters - policy_net_fn=model_factory_from_env, # A policy network constructor - policy_net_kwargs=policy_configs, # Policy network's architecure - value_net_fn=model_factory_from_env, # A Critic network constructor - value_net_kwargs=critic_configs, # Critic network's architecure. - optimizer_type="ADAM", # What optimizer to use for policy - # gradient descent steps. - learning_rate=1e-3, # Size of the policy gradient - # descent steps. - entr_coef=0.0, # How much to force exploration. - batch_size=1024 # Number of interactions used to - # estimate the policy gradient - # for each policy update. - ), - fit_budget=3e5, # The number of interactions - # between the agent and the - # environment during training. - eval_kwargs=dict(eval_horizon=500), # The number of interactions - # between the agent and the - # environment during evaluations. - n_fit=1, # The number of agents to train. - # Usually, it is good to do more - # than 1 because the training is - # stochastic. - agent_name="A2C tuned", # The agent's name. - ) - - - print("Training ...") - tuned_agent.fit() # Trains the agent on fit_budget steps! - - - # Plot the training data: - _ = plot_writer_data( - [default_agent, tuned_agent], - tag="episode_rewards", - title="Training Episode Cumulative Rewards", - show=True, - ) - - -.. parsed-literal:: - - [INFO] Running ExperimentManager fit() for A2C tuned with n_fit = 1 and max_workers = None. - INFO: Making new env: CartPole-v1 - INFO: Making new env: CartPole-v1 - [INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead - - -.. parsed-literal:: - - Training ... - - -.. 
parsed-literal:: - - [INFO] [A2C tuned[worker: 0]] | max_global_step = 6777 | episode_rewards = 15.0 | total_episodes = 314 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 13633 | episode_rewards = 14.0 | total_episodes = 602 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 20522 | episode_rewards = 41.0 | total_episodes = 854 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 27531 | episode_rewards = 13.0 | total_episodes = 1063 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 34398 | episode_rewards = 42.0 | total_episodes = 1237 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 41600 | episode_rewards = 118.0 | total_episodes = 1389 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 48593 | episode_rewards = 50.0 | total_episodes = 1511 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 55721 | episode_rewards = 113.0 | total_episodes = 1603 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 62751 | episode_rewards = 41.0 | total_episodes = 1687 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 69968 | episode_rewards = 344.0 | total_episodes = 1741 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 77259 | episode_rewards = 418.0 | total_episodes = 1787 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 84731 | episode_rewards = 293.0 | total_episodes = 1820 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 91890 | episode_rewards = 185.0 | total_episodes = 1853 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 99031 | episode_rewards = 278.0 | total_episodes = 1876 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 106305 | episode_rewards = 318.0 | total_episodes = 1899 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 113474 | episode_rewards = 500.0 | total_episodes = 1921 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 120632 | episode_rewards = 370.0 | total_episodes = 1941 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 127753 | episode_rewards = 375.0 | total_episodes = 1962 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 135179 | episode_rewards = 393.0 | total_episodes = 1987 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 142433 | episode_rewards = 500.0 | total_episodes = 2005 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 149888 | episode_rewards = 500.0 | total_episodes = 2023 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 157312 | episode_rewards = 467.0 | total_episodes = 2042 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 164651 | episode_rewards = 441.0 | total_episodes = 2060 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 172015 | episode_rewards = 500.0 | total_episodes = 2076 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 178100 | episode_rewards = 481.0 | total_episodes = 2089 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 183522 | episode_rewards = 462.0 | total_episodes = 2101 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 190818 | episode_rewards = 500.0 | total_episodes = 2117 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 198115 | episode_rewards = 500.0 | total_episodes = 2135 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 205097 | episode_rewards = 500.0 | total_episodes = 2151 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 212351 | episode_rewards = 500.0 | total_episodes = 2166 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 219386 | episode_rewards = 500.0 | total_episodes = 2181 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 226386 | episode_rewards = 500.0 | total_episodes = 2195 | - [INFO] 
[A2C tuned[worker: 0]] | max_global_step = 233888 | episode_rewards = 500.0 | total_episodes = 2211 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 241388 | episode_rewards = 500.0 | total_episodes = 2226 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 248287 | episode_rewards = 500.0 | total_episodes = 2240 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 255483 | episode_rewards = 500.0 | total_episodes = 2255 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 262845 | episode_rewards = 500.0 | total_episodes = 2270 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 270032 | episode_rewards = 500.0 | total_episodes = 2285 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 277009 | episode_rewards = 498.0 | total_episodes = 2301 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 284044 | episode_rewards = 255.0 | total_episodes = 2318 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 291189 | episode_rewards = 500.0 | total_episodes = 2334 | - [INFO] [A2C tuned[worker: 0]] | max_global_step = 298619 | episode_rewards = 500.0 | total_episodes = 2350 | - [INFO] ... trained! - INFO: Making new env: CartPole-v1 - INFO: Making new env: CartPole-v1 - [INFO] Could not find least used device (nvidia-smi might be missing), use cuda:0 instead - - - -.. image:: output_9_3.png - - -.. code:: python - - print("Evaluating ...") - - # Evaluate each trained agent on 10 simulations of 500 steps each. - _ = evaluate_agents([default_agent, tuned_agent], n_simulations=50, show=True) - - -.. parsed-literal:: - - [INFO] Evaluating A2C default... - - -.. parsed-literal:: - - Evaluating ... - - -.. parsed-literal:: - - [INFO] [eval]... simulation 1/50 - [INFO] [eval]... simulation 2/50 - [INFO] [eval]... simulation 3/50 - [INFO] [eval]... simulation 4/50 - [INFO] [eval]... simulation 5/50 - [INFO] [eval]... simulation 6/50 - [INFO] [eval]... simulation 7/50 - [INFO] [eval]... simulation 8/50 - [INFO] [eval]... simulation 9/50 - [INFO] [eval]... simulation 10/50 - [INFO] [eval]... simulation 11/50 - [INFO] [eval]... simulation 12/50 - [INFO] [eval]... simulation 13/50 - [INFO] [eval]... simulation 14/50 - [INFO] [eval]... simulation 15/50 - [INFO] [eval]... simulation 16/50 - [INFO] [eval]... simulation 17/50 - [INFO] [eval]... simulation 18/50 - [INFO] [eval]... simulation 19/50 - [INFO] [eval]... simulation 20/50 - [INFO] [eval]... simulation 21/50 - [INFO] [eval]... simulation 22/50 - [INFO] [eval]... simulation 23/50 - [INFO] [eval]... simulation 24/50 - [INFO] [eval]... simulation 25/50 - [INFO] [eval]... simulation 26/50 - [INFO] [eval]... simulation 27/50 - [INFO] [eval]... simulation 28/50 - [INFO] [eval]... simulation 29/50 - [INFO] [eval]... simulation 30/50 - [INFO] [eval]... simulation 31/50 - [INFO] [eval]... simulation 32/50 - [INFO] [eval]... simulation 33/50 - [INFO] [eval]... simulation 34/50 - [INFO] [eval]... simulation 35/50 - [INFO] [eval]... simulation 36/50 - [INFO] [eval]... simulation 37/50 - [INFO] [eval]... simulation 38/50 - [INFO] [eval]... simulation 39/50 - [INFO] [eval]... simulation 40/50 - [INFO] [eval]... simulation 41/50 - [INFO] [eval]... simulation 42/50 - [INFO] [eval]... simulation 43/50 - [INFO] [eval]... simulation 44/50 - [INFO] [eval]... simulation 45/50 - [INFO] [eval]... simulation 46/50 - [INFO] [eval]... simulation 47/50 - [INFO] [eval]... simulation 48/50 - [INFO] [eval]... simulation 49/50 - [INFO] [eval]... simulation 50/50 - [INFO] Evaluating A2C tuned... - [INFO] [eval]... simulation 1/50 - [INFO] [eval]... 
simulation 2/50 - [INFO] [eval]... simulation 3/50 - [INFO] [eval]... simulation 4/50 - [INFO] [eval]... simulation 5/50 - [INFO] [eval]... simulation 6/50 - [INFO] [eval]... simulation 7/50 - [INFO] [eval]... simulation 8/50 - [INFO] [eval]... simulation 9/50 - [INFO] [eval]... simulation 10/50 - [INFO] [eval]... simulation 11/50 - [INFO] [eval]... simulation 12/50 - [INFO] [eval]... simulation 13/50 - [INFO] [eval]... simulation 14/50 - [INFO] [eval]... simulation 15/50 - [INFO] [eval]... simulation 16/50 - [INFO] [eval]... simulation 17/50 - [INFO] [eval]... simulation 18/50 - [INFO] [eval]... simulation 19/50 - [INFO] [eval]... simulation 20/50 - [INFO] [eval]... simulation 21/50 - [INFO] [eval]... simulation 22/50 - [INFO] [eval]... simulation 23/50 - [INFO] [eval]... simulation 24/50 - [INFO] [eval]... simulation 25/50 - [INFO] [eval]... simulation 26/50 - [INFO] [eval]... simulation 27/50 - [INFO] [eval]... simulation 28/50 - [INFO] [eval]... simulation 29/50 - [INFO] [eval]... simulation 30/50 - [INFO] [eval]... simulation 31/50 - [INFO] [eval]... simulation 32/50 - [INFO] [eval]... simulation 33/50 - [INFO] [eval]... simulation 34/50 - [INFO] [eval]... simulation 35/50 - [INFO] [eval]... simulation 36/50 - [INFO] [eval]... simulation 37/50 - [INFO] [eval]... simulation 38/50 - [INFO] [eval]... simulation 39/50 - [INFO] [eval]... simulation 40/50 - [INFO] [eval]... simulation 41/50 - [INFO] [eval]... simulation 42/50 - [INFO] [eval]... simulation 43/50 - [INFO] [eval]... simulation 44/50 - [INFO] [eval]... simulation 45/50 - [INFO] [eval]... simulation 46/50 - [INFO] [eval]... simulation 47/50 - [INFO] [eval]... simulation 48/50 - [INFO] [eval]... simulation 49/50 - [INFO] [eval]... simulation 50/50 - - - -.. image:: output_10_3.png diff --git a/docs/basics/DeepRLTutorial/output_10_3.png b/docs/basics/DeepRLTutorial/output_10_3.png index 15406e0d2..8a6c39010 100644 Binary files a/docs/basics/DeepRLTutorial/output_10_3.png and b/docs/basics/DeepRLTutorial/output_10_3.png differ diff --git a/docs/basics/DeepRLTutorial/output_6_3.png b/docs/basics/DeepRLTutorial/output_6_3.png index a94ce0cd7..cebfcbd05 100644 Binary files a/docs/basics/DeepRLTutorial/output_6_3.png and b/docs/basics/DeepRLTutorial/output_6_3.png differ diff --git a/docs/basics/DeepRLTutorial/output_9_3.png b/docs/basics/DeepRLTutorial/output_9_3.png index 4dc78d963..80a69998b 100644 Binary files a/docs/basics/DeepRLTutorial/output_9_3.png and b/docs/basics/DeepRLTutorial/output_9_3.png differ diff --git a/docs/basics/comparison.md b/docs/basics/comparison.md index cd57d4a3b..75041a9c0 100644 --- a/docs/basics/comparison.md +++ b/docs/basics/comparison.md @@ -1,3 +1,4 @@ +(comparison_page)= # Comparison of Agents diff --git a/docs/basics/quick_start_rl/Figure_1.png b/docs/basics/quick_start_rl/Figure_1.png new file mode 100644 index 000000000..61bdfd114 Binary files /dev/null and b/docs/basics/quick_start_rl/Figure_1.png differ diff --git a/docs/basics/quick_start_rl/Figure_2.png b/docs/basics/quick_start_rl/Figure_2.png new file mode 100644 index 000000000..aa1218107 Binary files /dev/null and b/docs/basics/quick_start_rl/Figure_2.png differ diff --git a/docs/basics/quick_start_rl/Figure_3.png b/docs/basics/quick_start_rl/Figure_3.png new file mode 100644 index 000000000..8423bd3ef Binary files /dev/null and b/docs/basics/quick_start_rl/Figure_3.png differ diff --git a/docs/basics/quick_start_rl/agent_manager_diagram.png b/docs/basics/quick_start_rl/agent_manager_diagram.png deleted file mode 100644 index 
b678634a9..000000000 Binary files a/docs/basics/quick_start_rl/agent_manager_diagram.png and /dev/null differ diff --git a/docs/basics/quick_start_rl/experiment_manager_diagram.png b/docs/basics/quick_start_rl/experiment_manager_diagram.png new file mode 100644 index 000000000..c5e004bff Binary files /dev/null and b/docs/basics/quick_start_rl/experiment_manager_diagram.png differ diff --git a/docs/basics/quick_start_rl/output_10_1.png b/docs/basics/quick_start_rl/output_10_1.png deleted file mode 100644 index a2b6a9c41..000000000 Binary files a/docs/basics/quick_start_rl/output_10_1.png and /dev/null differ diff --git a/docs/basics/quick_start_rl/output_19_0.png b/docs/basics/quick_start_rl/output_19_0.png deleted file mode 100644 index 1e2b61cee..000000000 Binary files a/docs/basics/quick_start_rl/output_19_0.png and /dev/null differ diff --git a/docs/basics/quick_start_rl/quickstart.md b/docs/basics/quick_start_rl/quickstart.md new file mode 100644 index 000000000..a68ae3520 --- /dev/null +++ b/docs/basics/quick_start_rl/quickstart.md @@ -0,0 +1,356 @@ +(quick_start)= + +Quick Start for Reinforcement Learning in rlberry +================================================= + +$$\def\CC{\bf C} +\def\QQ{\bf Q} +\def\RR{\bf R} +\def\ZZ{\bf Z} +\def\NN{\bf N}$$ + +Importing required libraries +---------------------------- + +```python +import numpy as np +import pandas as pd +import time +from rlberry.agents import AgentWithSimplePolicy +from rlberry_research.agents import UCBVIAgent +from rlberry_research.envs import Chain +from rlberry.manager import ( + ExperimentManager, + evaluate_agents, + plot_writer_data, + read_writer_data, +) +from rlberry.wrappers import WriterWrapper +``` + +Choosing an RL environment +-------------------------- + +In this tutorial, we will use the Chain(from [rlberry_scool](https://github.com/rlberry-py/rlberry-scool)) +environment, which is a very simple environment where the agent has to +go from one end of a chain to the other end. + +```python +env_ctor = Chain +env_kwargs = dict(L=10, fail_prob=0.1) +# chain of length 10. With proba 0.1, the agent will not be able to take the action it wants to take. +env = env_ctor(**env_kwargs) +``` + +The agent has two actions, going left or going right, but it might +move in the opposite direction according to a failure probability +`fail_prob=0.1`. 
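+
+As a quick sanity check (optional, and reusing the `env` created above), we
+can inspect the observation and action spaces and roll out a few random
+steps; for this environment, the observations are simply the indices of the
+positions in the chain.
+
+```python
+print(env.observation_space)  # one state per position in the chain
+print(env.action_space)  # two actions: left or right
+
+observation, info = env.reset()
+for _ in range(5):
+    action = env.action_space.sample()  # random policy
+    observation, reward, terminated, truncated, info = env.step(action)
+    print(action, observation, reward)
+```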
+ +Let us see a graphical representation + +```python +env.enable_rendering() +observation, info = env.reset() +for tt in range(5): + observation, reward, terminated, truncated, info = env.step(1) + done = terminated or truncated +video = env.save_video("video_chain.mp4", framerate=5) +``` + +```none + ffmpeg version n5.0 Copyright (c) 2000-2022 the FFmpeg developers + built with gcc 11.2.0 (GCC) + configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm --enable-lto --enable-fontconfig --enable-gmp --enable-gnutls --enable-gpl --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libdav1d --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libiec61883 --enable-libjack --enable-libmfx --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librav1e --enable-librsvg --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxcb --enable-libxml2 --enable-libxvid --enable-libzimg --enable-nvdec --enable-nvenc --enable-shared --enable-version3 + libavutil 57. 17.100 / 57. 17.100 + libavcodec 59. 18.100 / 59. 18.100 + libavformat 59. 16.100 / 59. 16.100 + libavdevice 59. 4.100 / 59. 4.100 + libavfilter 8. 24.100 / 8. 24.100 + libswscale 6. 4.100 / 6. 4.100 + libswresample 4. 3.100 / 4. 3.100 + libpostproc 56. 3.100 / 56. 3.100 + Input #0, rawvideo, from 'pipe:': + Duration: N/A, start: 0.000000, bitrate: 7680 kb/s + Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 800x80, 7680 kb/s, 5 tbr, 5 tbn + Stream mapping: + Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264)) + [libx264 @ 0x564b9e570340] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 + [libx264 @ 0x564b9e570340] profile High, level 1.2, 4:2:0, 8-bit + [libx264 @ 0x564b9e570340] 264 - core 164 r3081 19856cc - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=2 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=5 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00 + Output #0, mp4, to 'video_chain.mp4': + Metadata: + encoder : Lavf59.16.100 + Stream #0:0: Video: h264 (avc1 / 0x31637661), yuv420p(tv, progressive), 800x80, q=2-31, 5 fps, 10240 tbn + Metadata: + encoder : Lavc59.18.100 libx264 + Side data: + cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A + frame= 6 fps=0.0 q=-1.0 Lsize= 4kB time=00:00:00.60 bitrate= 48.8kbits/s speed=56.2x + video:3kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 32.212582% + [libx264 @ 0x564b9e570340] frame I:1 Avg QP: 6.94 size: 742 + [libx264 @ 0x564b9e570340] frame P:5 Avg QP:22.68 size: 267 + [libx264 @ 0x564b9e570340] mb I I16..4: 95.2% 0.0% 4.8% + [libx264 @ 0x564b9e570340] mb P I16..4: 1.2% 2.1% 2.0% P16..4: 0.2% 0.0% 0.0% 0.0% 0.0% 
skip:94.6% + [libx264 @ 0x564b9e570340] 8x8 transform intra:8.2% inter:0.0% + [libx264 @ 0x564b9e570340] coded y,uvDC,uvAC intra: 6.5% 12.3% 11.4% inter: 0.0% 0.0% 0.0% + [libx264 @ 0x564b9e570340] i16 v,h,dc,p: 79% 1% 20% 0% + [libx264 @ 0x564b9e570340] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 0% 0% 100% 0% 0% 0% 0% 0% 0% + [libx264 @ 0x564b9e570340] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 52% 22% 19% 1% 0% 3% 1% 3% 1% + [libx264 @ 0x564b9e570340] i8c dc,h,v,p: 92% 4% 3% 0% + [libx264 @ 0x564b9e570340] Weighted P-Frames: Y:0.0% UV:0.0% + [libx264 @ 0x564b9e570340] kb/s:13.85 +``` +
+
+
+
+
+
+Defining an agent and a baseline
+--------------------------------
+
+We will compare a RandomAgent (which selects actions at random) to the
+UCBVIAgent (from [rlberry_research](https://github.com/rlberry-py/rlberry-research)), an algorithm designed to perform
+efficient exploration. Our goal is to assess the performance of the
+two algorithms.
+
+This is the code to create the RandomAgent:
+
+```python
+# Create random agent as a baseline
+class RandomAgent(AgentWithSimplePolicy):
+    name = "RandomAgent"
+
+    def __init__(self, env, **kwargs):
+        AgentWithSimplePolicy.__init__(self, env, **kwargs)
+
+    def fit(self, budget=100, **kwargs):
+        observation, info = self.env.reset()
+        for ep in range(budget):
+            action = self.policy(observation)
+            observation, reward, terminated, truncated, info = self.env.step(action)
+
+    def policy(self, observation):
+        return self.env.action_space.sample()  # choose an action at random
+```
+
+Experiment Manager
+------------------
+
+One of the main features of rlberry is its
+[ExperimentManager](rlberry.manager.ExperimentManager)
+class. Here is a diagram that briefly explains what it does.
+
+
+```{image} experiment_manager_diagram.png
+:align: center
+```
+
+In a few words, the ExperimentManager spawns agents and environments for
+training and then, once the agents are trained, it uses these agents and
+new environments to evaluate how well the agents perform. All of these
+steps can be done several times to assess the stochasticity of the agents and/or
+the environment.
+
+Comparing the expected rewards of the final policies
+----------------------------------------------------
+
+We want to assess the expected reward of the policy learned by our
+agents for a time horizon of (say) $T=20$.
+
+To evaluate the agents during training, we (arbitrarily) use 10 Monte-Carlo simulations (`n_simulations` in `eval_kwargs`), i.e., we run the evaluation 10 times for each agent and take the mean of the obtained rewards.
+
+To check variability, we can train several instances of the same agent with
+`n_fit` (here, we use only 1 to be faster). Each instance of the agent trains with a specific
+budget `fit_budget` (here 100). Remark that `fit_budget` may not mean
+the same thing for different agents.
+
+In order to manage the agents, we use an ExperimentManager. The manager
+will spawn agents as needed during the experiment.
+
+To summarize:
+we will train 1 instance of each agent (`n_fit`) with a budget of 100 (`fit_budget`). During training, the evaluation is done over 10 Monte-Carlo runs (`n_simulations`), and we do this for both agents (`UCBVIAgent` and `RandomAgent`).
+
+```python
+# Define parameters
+ucbvi_params = {"gamma": 0.9, "horizon": 100}
+
+# Create ExperimentManager to fit 1 agent
+ucbvi_stats = ExperimentManager(
+    UCBVIAgent,
+    (env_ctor, env_kwargs),
+    fit_budget=100,
+    eval_kwargs=dict(eval_horizon=20, n_simulations=10),
+    init_kwargs=ucbvi_params,
+    n_fit=1,
+)
+ucbvi_stats.fit()
+
+# Create ExperimentManager for baseline
+baseline_stats = ExperimentManager(
+    RandomAgent,
+    (env_ctor, env_kwargs),
+    fit_budget=100,
+    eval_kwargs=dict(eval_horizon=20, n_simulations=10),
+    n_fit=1,
+)
+baseline_stats.fit()
+```
+
+```none
+ [INFO] Running ExperimentManager fit() for UCBVI with n_fit = 1 and max_workers = None.
+ [INFO] ... trained!
+ [INFO] Running ExperimentManager fit() for RandomAgent with n_fit = 1 and max_workers = None.
+ [INFO] ... trained!
+```
+
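+Before evaluating, note that a fitted ExperimentManager can be saved to disk and loaded back later, which is convenient for long experiments. A hedged sketch, continuing from the code above (the exact behaviour of the `save()` and `load()` helpers should be checked against the [ExperimentManager](rlberry.manager.ExperimentManager) API):
+
+```python
+# Sketch only: persist the fitted manager and reload it in a later session.
+saved_path = ucbvi_stats.save()  # writes to the manager's output directory
+ucbvi_stats_reloaded = ExperimentManager.load(saved_path)
+```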
+ +Evaluating and comparing the agents : + +```python +output = evaluate_agents([ucbvi_stats, baseline_stats], n_simulations=10, plot=True) +``` + +```none + [INFO] Evaluating UCBVI... + [INFO] [eval]... simulation 1/10 + [INFO] [eval]... simulation 2/10 + [INFO] [eval]... simulation 3/10 + [INFO] [eval]... simulation 4/10 + [INFO] [eval]... simulation 5/10 + [INFO] [eval]... simulation 6/10 + [INFO] [eval]... simulation 7/10 + [INFO] [eval]... simulation 8/10 + [INFO] [eval]... simulation 9/10 + [INFO] [eval]... simulation 10/10 + [INFO] Evaluating RandomAgent... + [INFO] [eval]... simulation 1/10 + [INFO] [eval]... simulation 2/10 + [INFO] [eval]... simulation 3/10 + [INFO] [eval]... simulation 4/10 + [INFO] [eval]... simulation 5/10 + [INFO] [eval]... simulation 6/10 + [INFO] [eval]... simulation 7/10 + [INFO] [eval]... simulation 8/10 + [INFO] [eval]... simulation 9/10 + [INFO] [eval]... simulation 10/10 +``` + +
+
+```{image} Figure_1.png
+:align: center
+```
+
+ : For a more in-depth methodology to compare agents, you can check [here](comparison_page).
+
+Comparing the agents during the learning period
+-----------------------------------------------
+
+In the previous section, we compared the performance of the **final**
+policies learned by the agents, **after** the learning period.
+
+To compare the performance of the agents **during** the learning period
+(in the fit method), we can estimate their cumulative regret, which is
+the difference between the rewards gathered by the agents during
+training and the rewards of an optimal agent. Alternatively, if we
+cannot compute the optimal policy, we can simply compare the rewards
+gathered during learning instead of the regret. This is what we do below.
+
+First, we have to record the reward during the fit, as this is not done
+automatically. To do this, we can use the
+[WriterWrapper](rlberry.wrappers.writer_utils.WriterWrapper)
+module, or simply the [writer](rlberry.agents.Agent.writer) attribute.
+
+```python
+class RandomAgent2(RandomAgent):
+    name = "RandomAgent2"
+
+    def __init__(self, env, **kwargs):
+        RandomAgent.__init__(self, env, **kwargs)
+        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")
+
+
+class UCBVIAgent2(UCBVIAgent):
+    name = "UCBVIAgent2"
+
+    def __init__(self, env, **kwargs):
+        UCBVIAgent.__init__(self, env, **kwargs)
+        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")
+```
+
+
+Then, we fit the two agents.
+
+```python
+# Create ExperimentManager for UCBVI to fit 10 agents
+ucbvi_stats = ExperimentManager(
+    UCBVIAgent2,
+    (env_ctor, env_kwargs),
+    fit_budget=50,
+    init_kwargs=ucbvi_params,
+    n_fit=10,
+)
+ucbvi_stats.fit()
+
+# Create ExperimentManager for baseline to fit 10 agents
+baseline_stats = ExperimentManager(
+    RandomAgent2,
+    (env_ctor, env_kwargs),
+    fit_budget=5000,
+    n_fit=10,
+)
+baseline_stats.fit()
+```
+
+```none
+ [INFO] Running ExperimentManager fit() for UCBVIAgent2 with n_fit = 10 and max_workers = None.
+ [INFO] ... trained!
+ [INFO] Running ExperimentManager fit() for RandomAgent2 with n_fit = 10 and max_workers = None.
+ [INFO] ... trained!
+```
+
+Remark that `fit_budget` may not mean the same thing for different agents. For RandomAgent2, `fit_budget` is the number of steps the agent is allowed to take in the environment.
+
+The reward is recorded every time `env.step` is called.
+
+For UCBVIAgent2, `fit_budget` is the number of iterations of the algorithm, and in each iteration the environment takes `horizon` = 100 steps. Hence the budget of 50 used here corresponds to 50 x 100 = 5000 environment steps, which matches the budget of 5000 steps given to the random baseline.
+
+
+
+Finally, we plot the reward. Here you can see the mean value over the 10 fitted agents, with two options (raw and smoothed).
+
+```python
+# Plot of the reward.
+output = plot_writer_data(
+    [ucbvi_stats, baseline_stats],
+    tag="reward",
+    title="Episode Reward",
+)
+
+
+# Smoothed plot of the reward.
+output = plot_writer_data(
+    [ucbvi_stats, baseline_stats],
+    tag="reward",
+    smooth=True,
+    title="Episode Reward smoothed",
+)
+```
+
+```{image} Figure_2.png
+:align: center
+```
+```{image} Figure_3.png
+:align: center
+```
+
+
+ : As you can see, different visualizations are possible.
For more information on plots and visualization, you can check [here (in construction)](visualization_page) diff --git a/docs/basics/quick_start_rl/quickstart.rst b/docs/basics/quick_start_rl/quickstart.rst deleted file mode 100644 index 21341dac4..000000000 --- a/docs/basics/quick_start_rl/quickstart.rst +++ /dev/null @@ -1,395 +0,0 @@ -.. _quick_start: - -.. highlight:: none - -Quick Start for Reinforcement Learning in rlberry -================================================= - -.. math:: - - - \def\CC{\bf C} - \def\QQ{\bf Q} - \def\RR{\bf R} - \def\ZZ{\bf Z} - \def\NN{\bf N} - - -Importing required libraries ----------------------------- - -.. code:: python - - import numpy as np - import pandas as pd - import time - from rlberry.agents import UCBVIAgent, AgentWithSimplePolicy - from rlberry.envs import Chain - from rlberry.manager import ( - ExperimentManager, - evaluate_agents, - plot_writer_data, - read_writer_data, - ) - from rlberry.wrappers import WriterWrapper - -Choosing an RL environment --------------------------- - -In this tutorial, we will use the :class:`~rlberry.envs.finite.chain.Chain` -environment, which is a very simple environment where the agent has to go from one -end of a chain to the other end. - -.. code:: python - - env_ctor = Chain - env_kwargs = dict(L=10, fail_prob=0.1) - # chain of length 10. With proba 0.2, the agent will not be able to take the action it wants to take/ - env = env_ctor(**env_kwargs) - -Let us see a graphical representation - -.. code:: python - - env.enable_rendering() - observation, info = env.reset() - for tt in range(5): - observation, reward, terminated, truncated, info = env.step(1) - done = terminated or truncated - video = env.save_video("video_chain.mp4", framerate=5) - - -.. parsed-literal:: - - ffmpeg version n5.0 Copyright (c) 2000-2022 the FFmpeg developers - built with gcc 11.2.0 (GCC) - configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm --enable-lto --enable-fontconfig --enable-gmp --enable-gnutls --enable-gpl --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libdav1d --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libiec61883 --enable-libjack --enable-libmfx --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librav1e --enable-librsvg --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxcb --enable-libxml2 --enable-libxvid --enable-libzimg --enable-nvdec --enable-nvenc --enable-shared --enable-version3 - libavutil 57. 17.100 / 57. 17.100 - libavcodec 59. 18.100 / 59. 18.100 - libavformat 59. 16.100 / 59. 16.100 - libavdevice 59. 4.100 / 59. 4.100 - libavfilter 8. 24.100 / 8. 24.100 - libswscale 6. 4.100 / 6. 4.100 - libswresample 4. 3.100 / 4. 3.100 - libpostproc 56. 3.100 / 56. 
3.100 - Input #0, rawvideo, from 'pipe:': - Duration: N/A, start: 0.000000, bitrate: 7680 kb/s - Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 800x80, 7680 kb/s, 5 tbr, 5 tbn - Stream mapping: - Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264)) - [libx264 @ 0x564b9e570340] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 - [libx264 @ 0x564b9e570340] profile High, level 1.2, 4:2:0, 8-bit - [libx264 @ 0x564b9e570340] 264 - core 164 r3081 19856cc - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=2 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=5 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00 - Output #0, mp4, to 'video_chain.mp4': - Metadata: - encoder : Lavf59.16.100 - Stream #0:0: Video: h264 (avc1 / 0x31637661), yuv420p(tv, progressive), 800x80, q=2-31, 5 fps, 10240 tbn - Metadata: - encoder : Lavc59.18.100 libx264 - Side data: - cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A - frame= 6 fps=0.0 q=-1.0 Lsize= 4kB time=00:00:00.60 bitrate= 48.8kbits/s speed=56.2x - video:3kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 32.212582% - [libx264 @ 0x564b9e570340] frame I:1 Avg QP: 6.94 size: 742 - [libx264 @ 0x564b9e570340] frame P:5 Avg QP:22.68 size: 267 - [libx264 @ 0x564b9e570340] mb I I16..4: 95.2% 0.0% 4.8% - [libx264 @ 0x564b9e570340] mb P I16..4: 1.2% 2.1% 2.0% P16..4: 0.2% 0.0% 0.0% 0.0% 0.0% skip:94.6% - [libx264 @ 0x564b9e570340] 8x8 transform intra:8.2% inter:0.0% - [libx264 @ 0x564b9e570340] coded y,uvDC,uvAC intra: 6.5% 12.3% 11.4% inter: 0.0% 0.0% 0.0% - [libx264 @ 0x564b9e570340] i16 v,h,dc,p: 79% 1% 20% 0% - [libx264 @ 0x564b9e570340] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 0% 0% 100% 0% 0% 0% 0% 0% 0% - [libx264 @ 0x564b9e570340] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 52% 22% 19% 1% 0% 3% 1% 3% 1% - [libx264 @ 0x564b9e570340] i8c dc,h,v,p: 92% 4% 3% 0% - [libx264 @ 0x564b9e570340] Weighted P-Frames: Y:0.0% UV:0.0% - [libx264 @ 0x564b9e570340] kb/s:13.85 - - -The agent has two actions, go to the left of to the right, but it might -move to a random direction according to a failure probability -``fail_prob=0.1``. - -.. video:: ../../video_chain_quickstart.mp4 - :width: 600 - :align: center - - -Defining an agent and a baseline --------------------------------- - -We will compare a RandomAgent (which plays random action) to the -:class:`~rlberry.agents.ucbvi.ucbvi.UCBVIAgent`, which -is a algorithm that is designed to perform an efficient exploration. -Our goal is then to assess the performance of the two algorithms. - -.. 
code:: python - - # Create random agent as a baseline - class RandomAgent(AgentWithSimplePolicy): - name = "RandomAgent" - - def __init__(self, env, **kwargs): - AgentWithSimplePolicy.__init__(self, env, **kwargs) - - def fit(self, budget=100, **kwargs): - observation, info = self.env.reset() - for ep in range(budget): - action = self.policy(observation) - observation, reward, done, _ = self.env.step(action) - - def policy(self, observation): - return self.env.action_space.sample() # choose an action at random - - - # Define parameters - ucbvi_params = {"gamma": 0.9, "horizon": 100} - -There are a number of agents that are already coded in rlberry. See the -module :class:`~rlberry.agents.Agent` for more informations. - -Agent Manager -------------- - -One of the main feature of rlberry is its :class:`~rlberry.manager.ExperimentManager` -class. Here is a diagram to explain briefly what it does. - - -.. figure:: experiment_manager_diagram.png - :align: center - - -In a few words, agent manager spawns agents and environments for training and -then once the agents are trained, it uses these agents and new environments -to evaluate how well the agent perform. All of these steps can be -done several times to assess stochasticity of agents and/or environment. - -Comparing the expected rewards of the final policies ----------------------------------------------------- - - -We want to assess the expected reward of the policy learned by our agents -for a time horizon of (say) :math:`T=20`. - -To do that we use 10 Monte-Carlo simulations, i.e., we do the experiment -10 times for each agent and at the end we take the mean of the 10 -obtained reward. - -This gives us 1 value per agent. We do this 10 times (so 10 times 10 -equal 100 simulations) in order to have an idea of the variability of -our estimation. - -In order to manage the agents, we use an Agent Manager. The manager will -then spawn agents as desired during the experiment. - - -.. code:: python - - # Create ExperimentManager to fit 1 agent - ucbvi_stats = ExperimentManager( - UCBVIAgent, - (env_ctor, env_kwargs), - fit_budget=100, - eval_kwargs=dict(eval_horizon=20, n_simulations=10), - init_kwargs=ucbvi_params, - n_fit=1, - ) - ucbvi_stats.fit() - - # Create ExperimentManager for baseline - baseline_stats = ExperimentManager( - RandomAgent, - (env_ctor, env_kwargs), - fit_budget=100, - eval_kwargs=dict(eval_horizon=20, n_simulations=10), - n_fit=1, - ) - baseline_stats.fit() - - -.. parsed-literal:: - - [INFO] Running ExperimentManager fit() for UCBVI with n_fit = 1 and max_workers = None. - [INFO] ... trained! - [INFO] Running ExperimentManager fit() for RandomAgent with n_fit = 1 and max_workers = None. - [INFO] ... trained! - - -.. code:: python - - output = evaluate_agents([ucbvi_stats, baseline_stats], n_simulations=10, plot=True) - - -.. parsed-literal:: - - [INFO] Evaluating UCBVI... - [INFO] [eval]... simulation 1/10 - [INFO] [eval]... simulation 2/10 - [INFO] [eval]... simulation 3/10 - [INFO] [eval]... simulation 4/10 - [INFO] [eval]... simulation 5/10 - [INFO] [eval]... simulation 6/10 - [INFO] [eval]... simulation 7/10 - [INFO] [eval]... simulation 8/10 - [INFO] [eval]... simulation 9/10 - [INFO] [eval]... simulation 10/10 - [INFO] Evaluating RandomAgent... - [INFO] [eval]... simulation 1/10 - [INFO] [eval]... simulation 2/10 - [INFO] [eval]... simulation 3/10 - [INFO] [eval]... simulation 4/10 - [INFO] [eval]... simulation 5/10 - [INFO] [eval]... simulation 6/10 - [INFO] [eval]... simulation 7/10 - [INFO] [eval]... 
simulation 8/10 - [INFO] [eval]... simulation 9/10 - [INFO] [eval]... simulation 10/10 - - - -.. image:: output_10_1.png - :align: center - -Comparing the agents during the learning period ------------------------------------------------- - -In the previous section, we compared the performance of the **final** policies learned by -the agents, **after** the learning period. - -To compare the performance of the agents **during** the learning period -(in the fit method), we can estimate their cumulative regret, which is the difference -between the rewards gathered by the agents during training and the -rewards of an optimal agent. Alternatively, if the we cannot compute the optimal -policy, we could simply compare the rewards gathered during learning, instead of the regret. - -First, we have to record the reward during the fit as this is not done -automatically. To do this, we can use the :class:`~rlberry.wrappers.writer_utils.WriterWrapper` -module, or simply the `Agent.writer` attribute. - -.. code:: python - - class RandomAgent2(RandomAgent): - name = "RandomAgent2" - - def __init__(self, env, **kwargs): - RandomAgent.__init__(self, env, **kwargs) - self.env = WriterWrapper(self.env, self.writer, write_scalar="reward") - - - class UCBVIAgent2(UCBVIAgent): - name = "UCBVIAgent2" - - def __init__(self, env, **kwargs): - UCBVIAgent.__init__(self, env, **kwargs) - self.env = WriterWrapper(self.env, self.writer, write_scalar="reward") - -To compute the regret, we also need to define an optimal agent. Here -it’s an agent that always chooses the action that moves to the right. - -.. code:: python - - class OptimalAgent(AgentWithSimplePolicy): - name = "OptimalAgent" - - def __init__(self, env, **kwargs): - AgentWithSimplePolicy.__init__(self, env, **kwargs) - self.env = WriterWrapper(self.env, self.writer, write_scalar="reward") - - def fit(self, budget=100, **kwargs): - observation, info = self.env.reset() - for ep in range(budget): - action = 1 - observation, reward, terminated, truncated, info = self.env.step(action) - done = terminated or truncated - - def policy(self, observation): - return 1 - -Then, we fit the two agents and plot the data in the writer. - -.. code:: python - - # Create ExperimentManager to fit 4 agents using 1 job - ucbvi_stats = ExperimentManager( - UCBVIAgent2, - (env_ctor, env_kwargs), - fit_budget=50, - init_kwargs=ucbvi_params, - n_fit=10, - parallelization="process", - mp_context="fork", - ) # mp_context is needed to have parallel computing in notebooks. - ucbvi_stats.fit() - - # Create ExperimentManager for baseline - baseline_stats = ExperimentManager( - RandomAgent2, - (env_ctor, env_kwargs), - fit_budget=5000, - n_fit=10, - parallelization="process", - mp_context="fork", - ) - baseline_stats.fit() - - # Create ExperimentManager for baseline - opti_stats = ExperimentManager( - OptimalAgent, - (env_ctor, env_kwargs), - fit_budget=5000, - n_fit=10, - parallelization="process", - mp_context="fork", - ) - opti_stats.fit() - - -.. parsed-literal:: - - [INFO] Running ExperimentManager fit() for UCBVIAgent2 with n_fit = 10 and max_workers = None. - [INFO] ... trained! - [INFO] Running ExperimentManager fit() for RandomAgent2 with n_fit = 10 and max_workers = None. - [INFO] ... trained! - [INFO] Running ExperimentManager fit() for OptimalAgent with n_fit = 10 and max_workers = None. - [INFO] ... trained! - -Remark that ``fit_budget`` may not mean the same thing among agents. 
For -OptimalAgent and RandomAgent ``fit_budget`` is the number of steps in -the environments that the agent is allowed to take. - -The reward that we recover is recorded every time env.step is called. - -For UCBVI this is the number of iterations of the algorithm and in each -iteration, the environment takes 100 steps (``horizon``) times the -``fit_budget``. Hence the fit_budget used here - -Next, we estimate the optimal reward using the optimal policy. - -Be careful that this is only an estimation: we estimate the optimal -regret using Monte Carlo and the optimal policy. - -.. code:: python - - df = plot_writer_data(opti_stats, tag="reward", show=False) - df = df.loc[df["tag"] == "reward"][["global_step", "value"]] - opti_reward = df.groupby("global_step").mean()["value"].values - -Finally, we plot the cumulative regret using the 5000 reward values. - - -.. code:: python - - def compute_regret(rewards): - return np.cumsum(opti_reward - rewards[: len(opti_reward)]) - - - # Plot of the cumulative reward. - output = plot_writer_data( - [ucbvi_stats, baseline_stats], - tag="reward", - preprocess_func=compute_regret, - title="Cumulative Regret", - ) - - - -.. image:: output_19_0.png - :align: center diff --git a/docs/basics/userguide/agent.md b/docs/basics/userguide/agent.md new file mode 100644 index 000000000..f6a5b0a35 --- /dev/null +++ b/docs/basics/userguide/agent.md @@ -0,0 +1,446 @@ +(agent_page)= + +# How to use an Agent +In Reinforcement learning, the Agent is the entity to train to solve an environment. It's able to interact with the environment: observe, take actions, and learn through trial and error. +In rlberry, you can use existing Agent, or create your own custom Agent. You can find the API [here](/api) and [here](rlberry.agents.Agent) . + + +## Use rlberry Agent +An agent needs an environment to train. We'll use the same environment as in the [environment](environment_page) section of the user guide. +("Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)") + +### without agent +```python +from rlberry_research.envs.finite import Chain + +env = Chain(10, 0.1) +env.enable_rendering() +for tt in range(50): + env.step(env.action_space.sample()) +env.render(loop=False) + +# env.save_video is only available for rlberry envs and custom env (with 'RenderInterface' as parent class) +video = env.save_video("_agent_page_chain1.mp4") +env.close() +``` +
+ + + +If we use random actions on this environment, we don't have good results (the cross don't go to the right) + +### With agent + +With the same environment, we will use an Agent to choose the actions instead of random actions. +For this example, you can use "ValueIterationAgent" Agent from "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)" + +```python +from rlberry_research.envs.finite import Chain +from rlberry_scool.agents.dynprog import ValueIterationAgent + +env = Chain(10, 0.1) # same env +agent = ValueIterationAgent(env, gamma=0.95) # creation of the agent +info = agent.fit() # Agent's training (ValueIteration don't use budget) +print(info) + +# test the trained agent +env.enable_rendering() +observation, info = env.reset() +for tt in range(50): + action = agent.policy( + observation + ) # use the agent's policy to choose the next action + observation, reward, terminated, truncated, info = env.step(action) # do the action + done = terminated or truncated + if done: + break # stop if the environement is done +env.render(loop=False) + +# env.save_video is only available for rlberry envs and custom env (with 'RenderInterface' as parent class) +video = env.save_video("_agent_page_chain2.mp4") +env.close() +``` + +```none +{'n_iterations': 269, 'precision': 1e-06} + pg.display.set_mode(display, DOUBLEBUF | OPENGL) + _ = pg.display.set_mode(display, DOUBLEBUF | OPENGL) +ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers + built with gcc 11 (Ubuntu 11.2.0-19ubuntu1) + configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared + libavutil 56. 70.100 / 56. 70.100 + libavcodec 58.134.100 / 58.134.100 + libavformat 58. 76.100 / 58. 76.100 + libavdevice 58. 13.100 / 58. 13.100 + libavfilter 7.110.100 / 7.110.100 + libswscale 5. 9.100 / 5. 9.100 + libswresample 3. 9.100 / 3. 9.100 + libpostproc 55. 9.100 / 55. 
9.100 +Input #0, rawvideo, from 'pipe:': + Duration: N/A, start: 0.000000, bitrate: 38400 kb/s + Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 800x80, 38400 kb/s, 25 tbr, 25 tbn, 25 tbc +Stream mapping: + Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264)) +[libx264 @ 0x5570932967c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512 +[libx264 @ 0x5570932967c0] profile High, level 1.3, 4:2:0, 8-bit +[libx264 @ 0x5570932967c0] 264 - core 163 r3060 5db6aa6 - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=2 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00 +Output #0, mp4, to '_agent_page_chain.mp4': + Metadata: + encoder : Lavf58.76.100 + Stream #0:0: Video: h264 (avc1 / 0x31637661), yuv420p(tv, progressive), 800x80, q=2-31, 25 fps, 12800 tbn + Metadata: + encoder : Lavc58.134.100 libx264 + Side data: + cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A +frame= 51 fps=0.0 q=-1.0 Lsize= 12kB time=00:00:01.92 bitrate= 51.9kbits/s speed=48.8x +video:11kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 12.817029% +[libx264 @ 0x5570932967c0] frame I:1 Avg QP:29.06 size: 6089 +[libx264 @ 0x5570932967c0] frame P:18 Avg QP:18.13 size: 172 +[libx264 @ 0x5570932967c0] frame B:32 Avg QP:13.93 size: 37 +[libx264 @ 0x5570932967c0] consecutive B-frames: 15.7% 0.0% 5.9% 78.4% +[libx264 @ 0x5570932967c0] mb I I16..4: 46.4% 0.0% 53.6% +[libx264 @ 0x5570932967c0] mb P I16..4: 5.9% 0.8% 1.3% P16..4: 0.4% 0.0% 0.0% 0.0% 0.0% skip:91.6% +[libx264 @ 0x5570932967c0] mb B I16..4: 0.1% 0.0% 0.2% B16..8: 1.6% 0.0% 0.0% direct: 0.0% skip:98.1% L0:58.9% L1:41.1% BI: 0.0% +[libx264 @ 0x5570932967c0] 8x8 transform intra:6.3% inter:14.3% +[libx264 @ 0x5570932967c0] coded y,uvDC,uvAC intra: 46.1% 37.1% 35.7% inter: 0.0% 0.0% 0.0% +[libx264 @ 0x5570932967c0] i16 v,h,dc,p: 55% 7% 38% 1% +[libx264 @ 0x5570932967c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 0% 0% 100% 0% 0% 0% 0% 0% 0% +[libx264 @ 0x5570932967c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 33% 14% 41% 0% 4% 3% 2% 1% 1% +[libx264 @ 0x5570932967c0] i8c dc,h,v,p: 87% 5% 7% 1% +[libx264 @ 0x5570932967c0] Weighted P-Frames: Y:5.6% UV:5.6% +[libx264 @ 0x5570932967c0] ref P L0: 10.5% 5.3% 73.7% 10.5% +[libx264 @ 0x5570932967c0] ref B L0: 59.2% 27.6% 13.2% +[libx264 @ 0x5570932967c0] ref B L1: 96.2% 3.8% +[libx264 @ 0x5570932967c0] kb/s:40.59 +``` +
+ + + +The agent has learned how to obtain good results (the cross go to the right). + + + + + + +## Use StableBaselines3 as rlberry Agent +With rlberry, you can use an algorithm from [StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html) and wrap it in rlberry Agent. To do that, you need to use [StableBaselinesAgent](rlberry.agents.stable_baselines.StableBaselinesAgent). + + +```python +from rlberry.envs import gym_make +from gymnasium.wrappers.record_video import RecordVideo +from stable_baselines3 import PPO +from rlberry.agents.stable_baselines import StableBaselinesAgent + +env = gym_make("CartPole-v1", render_mode="rgb_array") +agent = StableBaselinesAgent( + env, PPO, "MlpPolicy", verbose=1 +) # wrap StableBaseline3's PPO inside rlberry Agent +info = agent.fit(10000) # Agent's training +print(info) + +env = RecordVideo( + env, video_folder="./", name_prefix="CartPole" +) # wrap the env to save the video output +observation, info = env.reset() # initialize the environment +for tt in range(3000): + action = agent.policy( + observation + ) # use the agent's policy to choose the next action + observation, reward, terminated, truncated, info = env.step(action) # do the action + done = terminated or truncated + if done: + break # stop if the environement is done +env.close() +``` + +```none +Using cpu device +Wrapping the env with a `Monitor` wrapper +Wrapping the env in a DummyVecEnv. +--------------------------------- +| rollout/ | | +| ep_len_mean | 22 | +| ep_rew_mean | 22 | +| time/ | | +| fps | 2490 | +| iterations | 1 | +| time_elapsed | 0 | +| total_timesteps | 2048 | +--------------------------------- +----------------------------------------- +| rollout/ | | +| ep_len_mean | 28.1 | +| ep_rew_mean | 28.1 | +| time/ | | +| fps | 1842 | +| iterations | 2 | +| time_elapsed | 2 | +| total_timesteps | 4096 | +| train/ | | +| approx_kl | 0.009214947 | +| clip_fraction | 0.102 | +| clip_range | 0.2 | +| entropy_loss | -0.686 | +| explained_variance | -0.00179 | +| learning_rate | 0.0003 | +| loss | 8.42 | +| n_updates | 10 | +| policy_gradient_loss | -0.0158 | +| value_loss | 51.5 | +----------------------------------------- +----------------------------------------- +| rollout/ | | +| ep_len_mean | 40 | +| ep_rew_mean | 40 | +| time/ | | +| fps | 1708 | +| iterations | 3 | +| time_elapsed | 3 | +| total_timesteps | 6144 | +| train/ | | +| approx_kl | 0.009872524 | +| clip_fraction | 0.0705 | +| clip_range | 0.2 | +| entropy_loss | -0.666 | +| explained_variance | 0.119 | +| learning_rate | 0.0003 | +| loss | 16 | +| n_updates | 20 | +| policy_gradient_loss | -0.0195 | +| value_loss | 38.7 | +----------------------------------------- +[INFO] 16:36: [[worker: -1]] | max_global_step = 6144 | time/iterations = 2 | rollout/ep_rew_mean = 28.13 | rollout/ep_len_mean = 28.13 | time/fps = 1842 | time/time_elapsed = 2 | time/total_timesteps = 4096 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6860913151875139 | train/policy_gradient_loss = -0.015838009686558508 | train/value_loss = 51.528612112998964 | train/approx_kl = 0.009214947000145912 | train/clip_fraction = 0.10205078125 | train/loss = 8.420166969299316 | train/explained_variance = -0.001785874366760254 | train/n_updates = 10 | train/clip_range = 0.2 | +------------------------------------------ +| rollout/ | | +| ep_len_mean | 50.2 | +| ep_rew_mean | 50.2 | +| time/ | | +| fps | 1674 | +| iterations | 4 | +| time_elapsed | 4 | +| total_timesteps | 8192 | +| train/ | | +| approx_kl | 
0.0076105352 | +| clip_fraction | 0.068 | +| clip_range | 0.2 | +| entropy_loss | -0.634 | +| explained_variance | 0.246 | +| learning_rate | 0.0003 | +| loss | 29.6 | +| n_updates | 30 | +| policy_gradient_loss | -0.0151 | +| value_loss | 57.3 | +------------------------------------------ +----------------------------------------- +| rollout/ | | +| ep_len_mean | 66 | +| ep_rew_mean | 66 | +| time/ | | +| fps | 1655 | +| iterations | 5 | +| time_elapsed | 6 | +| total_timesteps | 10240 | +| train/ | | +| approx_kl | 0.006019583 | +| clip_fraction | 0.0597 | +| clip_range | 0.2 | +| entropy_loss | -0.606 | +| explained_variance | 0.238 | +| learning_rate | 0.0003 | +| loss | 31.1 | +| n_updates | 40 | +| policy_gradient_loss | -0.0147 | +| value_loss | 72.3 | +----------------------------------------- +None + +Moviepy - Building video /CartPole-episode-0.mp4. +Moviepy - Writing video /CartPole-episode-0.mp4 + +Moviepy - Done ! +Moviepy - video ready /CartPole-episode-0.mp4 +``` + +
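+Because `StableBaselinesAgent` follows the rlberry Agent interface, it can also be handled by the ExperimentManager mentioned at the end of this page. The sketch below assumes that the arguments passed positionally above (`PPO`, `"MlpPolicy"`) can be forwarded through `init_kwargs` as `algo_cls` and `policy`; check the [StableBaselinesAgent](rlberry.agents.stable_baselines.StableBaselinesAgent) API if the keyword names differ.
+
+```python
+from stable_baselines3 import PPO
+from rlberry.envs import gym_make
+from rlberry.agents.stable_baselines import StableBaselinesAgent
+from rlberry.manager import ExperimentManager, evaluate_agents
+
+# Hedged sketch: keyword names in init_kwargs are assumed to match the
+# positional arguments of the StableBaselinesAgent constructor used above.
+sb3_experiment = ExperimentManager(
+    StableBaselinesAgent,
+    (gym_make, dict(id="CartPole-v1")),
+    fit_budget=10_000,
+    init_kwargs=dict(algo_cls=PPO, policy="MlpPolicy", verbose=0),
+    eval_kwargs=dict(eval_horizon=500),
+    n_fit=1,
+    agent_name="PPO_SB3_CartPole",
+)
+sb3_experiment.fit()
+print(evaluate_agents([sb3_experiment], n_simulations=5, plot=False))
+```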
+ + + + + +## Create your own Agent + **warning :** For advanced users only + +rlberry requires you to use a **very simple interface** to write agents, with basically +two methods to implement: `fit()` and `eval()`. + +You can find more information on this interface [here(Agent)](rlberry.agents.agent.Agent) (or [here(AgentWithSimplePolicy)](rlberry.agents.agent.AgentWithSimplePolicy)) + +The example below shows how to create an agent. + + +```python +import numpy as np +from rlberry.agents import AgentWithSimplePolicy + + +class MyAgentQLearning(AgentWithSimplePolicy): + name = "QLearning" + # create an agent with q-table + + def __init__( + self, + env, + exploration_rate=0.01, + learning_rate=0.8, + discount_factor=0.95, + **kwargs + ): # it's important to put **kwargs to ensure compatibility with the base class + # self.env is initialized in the base class + super().__init__(env=env, **kwargs) + + state_space_size = env.observation_space.n + action_space_size = env.action_space.n + + self.exploration_rate = exploration_rate # percentage to select random action + self.q_table = np.zeros( + (state_space_size, action_space_size) + ) # q_table to store result and choose actions + self.learning_rate = learning_rate + self.discount_factor = discount_factor # gamma + + def fit(self, budget, **kwargs): + """ + The parameter budget can represent the number of steps, the number of episodes etc, + depending on the agent. + * Interact with the environment (self.env); + * Train the agent + * Return useful information + """ + n_episodes = budget + rewards = np.zeros(n_episodes) + + for ep in range(n_episodes): + observation, info = self.env.reset() + done = False + while not done: + action = self.policy(observation) + next_step, reward, terminated, truncated, info = self.env.step(action) + # update the q_table + self.q_table[observation, action] = ( + 1 - self.learning_rate + ) * self.q_table[observation, action] + self.learning_rate * ( + reward + self.discount_factor * np.max(self.q_table[next_step, :]) + ) + observation = next_step + done = terminated or truncated + rewards[ep] += reward + + info = {"episode_rewards": rewards} + return info + + def eval(self, **kwargs): + """ + Returns a value corresponding to the evaluation of the agent on the + evaluation environment. + + For instance, it can be a Monte-Carlo evaluation of the policy learned in fit(). + """ + + return super().eval() # use the eval() from AgentWithSimplePolicy + + def policy(self, observation, explo=True): + state = observation + if explo and np.random.rand() < self.exploration_rate: + action = env.action_space.sample() # Explore + else: + action = np.argmax(self.q_table[state, :]) # Exploit + + return action +``` + + + **warning :** It's important that your agent accepts optional `**kwargs` and pass it to the base class as `Agent.__init__(self, env, **kwargs)`. 
+ +You can use it like this : + +```python +from gymnasium.wrappers.record_video import RecordVideo +from rlberry.envs import gym_make + +env = gym_make( + "FrozenLake-v1", render_mode="rgb_array", is_slippery=False +) # remove the slippery from the env +agent = MyAgentQLearning( + env, exploration_rate=0.25, learning_rate=0.8, discount_factor=0.95 +) +info = agent.fit(100000) # Agent's training +print("----------") +print(agent.q_table) # display the q_table content +print("----------") + +env = RecordVideo( + env, video_folder="./", name_prefix="FrozenLake_no_slippery" +) # wrap the env to save the video output +observation, info = env.reset() # initialize the environment +for tt in range(3000): + action = agent.policy( + observation, explo=False + ) # use the agent's policy to choose the next action (without exploration) + observation, reward, terminated, truncated, info = env.step(action) # do the action + done = terminated or truncated + if done: + break # stop if the environement is done +env.close() +``` + + +```none +---------- +[[0.73509189 0.77378094 0.77378094 0.73509189] + [0.73509189 0. 0.81450625 0.77378094] + [0.77378094 0.857375 0.77378094 0.81450625] + [0.81450625 0. 0.77378094 0.77378094] + [0.77378094 0.81450625 0. 0.73509189] + [0. 0. 0. 0. ] + [0. 0.9025 0. 0.81450625] + [0. 0. 0. 0. ] + [0.81450625 0. 0.857375 0.77378094] + [0.81450625 0.9025 0.9025 0. ] + [0.857375 0.95 0. 0.857375 ] + [0. 0. 0. 0. ] + [0. 0. 0. 0. ] + [0. 0.9025 0.95 0.857375 ] + [0.9025 0.95 1. 0.9025 ] + [0. 0. 0. 0. ]] +---------- + +Moviepy - Building video /FrozenLake_no_slippery-episode-0.mp4. +Moviepy - Writing video /FrozenLake_no_slippery-episode-0.mp4 + +Moviepy - Done ! +Moviepy - video ready /FrozenLake_no_slippery-episode-0.mp4 +0.7 + +``` + + + + + + + +## Use experimentManager + +This is one of the core element in rlberry. The ExperimentManager allows you to easily make an experiment between an Agent and an Environment. It is used to train, optimize hyperparameters, evaluate and gather statistics about an agent. +You can find the guide for ExperimentManager [here](experimentManager_page). diff --git a/docs/basics/userguide/environment.md b/docs/basics/userguide/environment.md new file mode 100644 index 000000000..a27f32c24 --- /dev/null +++ b/docs/basics/userguide/environment.md @@ -0,0 +1,186 @@ +(environment_page)= + +# How to use an environment + +This is the world with which the agent interacts. The agent can observe this environment, and can perform actions to modify it (but cannot change its rules). With rlberry, you can use an existing environment, or create your own custom environment. + + +## Use rlberry environment +You can find some environments in our other projects "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" and "[rlberry-scool](https://github.com/rlberry-py/rlberry-scool)". +For this example, you can use "Chain" environment from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" +```python +from rlberry_research.envs.finite import Chain + +env = Chain(10, 0.1) +env.enable_rendering() +for tt in range(20): + # Force to go right every 4 steps to have a better video render. 
+ if tt % 4 == 0: + env.step(1) + else: + env.step(env.action_space.sample()) +env.render(loop=False) + +# env.save_video is only available for rlberry envs and custom env (with 'RenderInterface' as parent class) +video = env.save_video("_env_page_chain.mp4") +env.close() +``` + +```none +ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers + built with gcc 11 (Ubuntu 11.2.0-19ubuntu1) + configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared + libavutil 56. 70.100 / 56. 70.100 + libavcodec 58.134.100 / 58.134.100 + libavformat 58. 76.100 / 58. 76.100 + libavdevice 58. 13.100 / 58. 13.100 + libavfilter 7.110.100 / 7.110.100 + libswscale 5. 9.100 / 5. 9.100 + libswresample 3. 9.100 / 3. 9.100 + libpostproc 55. 9.100 / 55. 
9.100 +Input #0, rawvideo, from 'pipe:': + Duration: N/A, start: 0.000000, bitrate: 38400 kb/s + Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 800x80, 38400 kb/s, 25 tbr, 25 tbn, 25 tbc +Stream mapping: + Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264)) +[libx264 @ 0x5644b31f07c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512 +[libx264 @ 0x5644b31f07c0] profile High, level 1.3, 4:2:0, 8-bit +[libx264 @ 0x5644b31f07c0] 264 - core 163 r3060 5db6aa6 - H.264/MPEG-4 AVC codec - Copyleft 2003-2021 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=2 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00 +Output #0, mp4, to '_env_page_chain.mp4': + Metadata: + encoder : Lavf58.76.100 + Stream #0:0: Video: h264 (avc1 / 0x31637661), yuv420p(tv, progressive), 800x80, q=2-31, 25 fps, 12800 tbn + Metadata: + encoder : Lavc58.134.100 libx264 + Side data: + cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A +frame= 21 fps=0.0 q=-1.0 Lsize= 11kB time=00:00:00.72 bitrate= 128.4kbits/s speed=29.6x +video:10kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 10.128633% +[libx264 @ 0x5644b31f07c0] frame I:1 Avg QP:29.04 size: 6175 +[libx264 @ 0x5644b31f07c0] frame P:12 Avg QP:24.07 size: 220 +[libx264 @ 0x5644b31f07c0] frame B:8 Avg QP:22.19 size: 124 +[libx264 @ 0x5644b31f07c0] consecutive B-frames: 33.3% 38.1% 28.6% 0.0% +[libx264 @ 0x5644b31f07c0] mb I I16..4: 56.0% 0.0% 44.0% +[libx264 @ 0x5644b31f07c0] mb P I16..4: 8.4% 1.6% 1.7% P16..4: 1.1% 0.0% 0.0% 0.0% 0.0% skip:87.3% +[libx264 @ 0x5644b31f07c0] mb B I16..4: 0.2% 0.5% 0.9% B16..8: 4.1% 0.0% 0.0% direct: 0.1% skip:94.3% L0:48.8% L1:51.2% BI: 0.0% +[libx264 @ 0x5644b31f07c0] 8x8 transform intra:9.0% inter:23.5% +[libx264 @ 0x5644b31f07c0] coded y,uvDC,uvAC intra: 45.6% 37.1% 36.1% inter: 0.2% 0.0% 0.0% +[libx264 @ 0x5644b31f07c0] i16 v,h,dc,p: 51% 13% 35% 1% +[libx264 @ 0x5644b31f07c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 0% 0% 100% 0% 0% 0% 0% 0% 0% +[libx264 @ 0x5644b31f07c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 33% 16% 40% 1% 4% 2% 2% 1% 0% +[libx264 @ 0x5644b31f07c0] i8c dc,h,v,p: 91% 5% 4% 0% +[libx264 @ 0x5644b31f07c0] Weighted P-Frames: Y:8.3% UV:8.3% +[libx264 @ 0x5644b31f07c0] ref P L0: 21.9% 0.0% 65.6% 12.5% +[libx264 @ 0x5644b31f07c0] ref B L0: 52.5% 47.5% +[libx264 @ 0x5644b31f07c0] kb/s:93.39 +``` +
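+
+Note that elsewhere in rlberry (for instance with the ExperimentManager described in this user guide), environments are usually passed around as a `(constructor, kwargs)` pair rather than as an instance, so that fresh copies can be built for each training run. A minimal sketch with the Chain environment used above:
+
+```python
+from rlberry_research.envs.finite import Chain
+
+# (constructor, kwargs) pair: rlberry builds the environment when it needs it
+env_ctor = Chain
+env_kwargs = dict(L=10, fail_prob=0.1)
+
+env = env_ctor(**env_kwargs)  # equivalent to Chain(10, 0.1) above
+```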
+ + + + +## Use Gymnasium environment +Gymnasium is a project that provides an API for all single agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.More information [here](https://gymnasium.farama.org/environments/classic_control/). + +In rlberry, you can use Gymnasium environment with [gym_make](rlberry.envs.gym_make). Here, we will use ```MountainCar-v0```, one of the "Classic Control environments". + +```python +from rlberry.envs import gym_make +from gymnasium.wrappers.record_video import RecordVideo + +# If you want an output video of your Gymnasium env, you have to : +# - add a 'render_mode' parameter at your gym_make +# - add a 'RecordVideo' wrapper around the gymnasium environment. +env = gym_make("MountainCar-v0", render_mode="rgb_array") +env = RecordVideo(env, video_folder="./", name_prefix="MountainCar") +# else, this line is enough: +# env = gym_make("MountainCar-v0") + +observation, info = env.reset() +for tt in range(50): + env.step(env.action_space.sample()) +env.close() +``` + +```none +[your path]/.conda/lib/python3.10/site-packages/gymnasium/wrappers/record_video.py:94: UserWarning: WARN: Overwriting existing videos at [your path] folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired) + logger.warn( +[your path]/.conda/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.is_vector_env to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.is_vector_env` for environment variables or `env.get_wrapper_attr('is_vector_env')` that will search the reminding wrappers. + logger.warn( +Moviepy - Building video [your path]/MountainCar-episode-0.mp4. +Moviepy - Writing video [your path]/MountainCar-episode-0.mp4 + +Moviepy - Done ! +Moviepy - video ready [your path]/MountainCar-episode-0.mp4 +``` + +
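+The object returned by `gym_make` is a standard Gymnasium environment, so the usual Gymnasium API applies. For example, rollouts can be made reproducible by seeding the first `reset()` call (a minimal sketch, nothing rlberry-specific here):
+
+```python
+from rlberry.envs import gym_make
+
+env = gym_make("MountainCar-v0")
+observation, info = env.reset(seed=42)  # seed the environment through reset()
+for tt in range(50):
+    action = env.action_space.sample()
+    observation, reward, terminated, truncated, info = env.step(action)
+    if terminated or truncated:
+        observation, info = env.reset()
+env.close()
+```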
+ + + +To customize the environments, you can use some wrappers. You can find the rlberry's wrappers and more information [here (in construction)](wrappers_page), or you can use [Gymnasium wrappers](https://gymnasium.farama.org/api/wrappers/) (or create your own). + + + +## Use Atari environment +A set of Atari 2600 environment simulated through Stella and the Arcade Learning Environment. More information [here](https://gymnasium.farama.org/environments/atari/). + +The function "[atari_make()](rlberry.envs.atari_make)" add wrappers on gym.make, to make it easier to use on Atari games. + +```python +from rlberry.envs import atari_make +from gymnasium.wrappers.record_video import RecordVideo + + +# If you want an output video of your Atari env, you have to : +# - add a 'render_mode' parameter at your atari_make +# - add a 'RecordVideo' wrapper around the gymnasium environment. +env = atari_make("ALE/Breakout-v5", render_mode="rgb_array") +env = RecordVideo(env, video_folder="./", name_prefix="Breakout") +# else, this line is enough: +# env = atari_make("ALE/Breakout-v5") + +observation, info = env.reset() +for tt in range(50): + observation, reward, terminated, truncated, info = env.step( + env.action_space.sample() + ) + # if the environment is terminated or truncated (no more life), it need to be reset + if terminated or truncated: + observation, info = env.reset() +env.close() +``` + +```none +A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7) +[Powered by Stella] +[your path]/.conda/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.is_vector_env to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.is_vector_env` for environment variables or `env.get_wrapper_attr('is_vector_env')` that will search the reminding wrappers. + logger.warn( +[your path]/.conda/lib/python3.10/site-packages/gymnasium/utils/passive_env_checker.py:335: UserWarning: WARN: No render fps was declared in the environment (env.metadata['render_fps'] is None or not defined), rendering may occur at inconsistent fps. + logger.warn( +Moviepy - Building video [your path]/Breakout-episode-0.mp4. +Moviepy - Writing video [your path]/Breakout-episode-0.mp4 + +Moviepy - Done ! +Moviepy - video ready [your path]/Breakout-episode-0.mp4 +``` + +
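+As mentioned in the note on wrappers above, the environments returned by `gym_make` or `atari_make` can be customized by stacking standard Gymnasium wrappers on top of them. A minimal sketch with the built-in `TimeLimit` wrapper (any other Gymnasium wrapper would work the same way):
+
+```python
+from gymnasium.wrappers import TimeLimit
+from rlberry.envs import atari_make
+
+env = atari_make("ALE/Breakout-v5")
+env = TimeLimit(env, max_episode_steps=200)  # truncate episodes after 200 steps
+
+observation, info = env.reset()
+done = False
+while not done:
+    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
+    done = terminated or truncated
+env.close()
+```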
+ + + + +## Create your own environment + **warning :** For advanced users only + +You need to create a new class that inherits from [gymnasium.Env](https://gymnasium.farama.org/api/env/) or one of it child class like [Model](rlberry.envs.interface.Model) (and RenderInterface/one of it child class, if you want an environment with rendering). + +Then you need to make the specific functions that respect gymnasium template (as step, reset, ...). More information [here](https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/) + +You can find examples in our other github project "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" with [Acrobot](https://github.com/rlberry-py/rlberry-research/blob/main/rlberry_research/envs/classic_control/acrobot.py), [MountainCar](https://github.com/rlberry-py/rlberry-research/blob/main/rlberry_research/envs/classic_control/mountain_car.py) or [Chain](https://github.com/rlberry-py/rlberry-research/blob/main/rlberry_research/envs/finite/chain.py) (and their parent classes). diff --git a/docs/basics/userguide/expManager_multieval.png b/docs/basics/userguide/expManager_multieval.png new file mode 100644 index 000000000..2ceeb5ff2 Binary files /dev/null and b/docs/basics/userguide/expManager_multieval.png differ diff --git a/docs/basics/userguide/experimentManager.md b/docs/basics/userguide/experimentManager.md new file mode 100644 index 000000000..f637c85f8 --- /dev/null +++ b/docs/basics/userguide/experimentManager.md @@ -0,0 +1,289 @@ +(experimentManager_page)= + +# How to use the ExperimentManager + +It's the element that allows you to make your experiments on [Agent](agent_page) and [Environment](environment_page). +You can use it to train, optimize hyperparameters, evaluate, compare, and gather statistics about your agent on a specific environment. You can find the API doc [here](rlberry.manager.ExperimentManager). +It's not the only solution, but it's the compact (and recommended) way of doing experiments with an agent. + +For these examples, you will use the "PPO" torch agent from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)" + +## Create your experiment + +```python +from rlberry.envs import gym_make +from rlberry_research.agents.torch import PPOAgent +from rlberry.manager import ExperimentManager, evaluate_agents + + +env_id = "CartPole-v1" # Id of the environment + +env_ctor = gym_make # constructor for the env +env_kwargs = dict(id=env_id) # give the id of the env inside the kwargs + + +first_experiment = ExperimentManager( + PPOAgent, # Agent Class + (env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs) + fit_budget=int(100), # Budget used to call our agent "fit()" + eval_kwargs=dict( + eval_horizon=1000 + ), # Arguments required to call rlberry.agents.agent.Agent.eval(). + n_fit=1, # Number of agent instances to fit. + agent_name="PPO_first_experiment" + env_id, # Name of the agent + seed=42, +) + +first_experiment.fit() + +output = evaluate_agents( + [first_experiment], n_simulations=5, plot=False +) # evaluate the experiment on 5 simulations +print(output) +``` + +```none +[INFO] 14:26: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:26: ... trained! +[INFO] 14:26: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + + PPO_first_experimentCartPole-v1 +0 15.0 +1 18.4 +2 21.4 +3 22.3 +4 23.0 +``` +
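+Once `fit()` has finished, the trained agent instances can be retrieved from the manager, for example to query their policy directly. A hedged sketch, continuing from the code above (the `get_agent_instances()` helper is assumed here; check the [ExperimentManager](rlberry.manager.ExperimentManager) API):
+
+```python
+# Sketch only: retrieve the fitted agent(s) from the manager.
+trained_agents = first_experiment.get_agent_instances()  # list of length n_fit
+agent = trained_agents[0]
+
+eval_env = gym_make(env_id)
+observation, info = eval_env.reset()
+action = agent.policy(observation)  # the rlberry Agent interface from the agent guide
+```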
+ + + +## Compare with another agent +Now you can compare this agent with another one. Here, we are going to compare it with the same agent, but with a bigger fit budget, and some fine tuning. + + + **warning :** add this code after the previous one. +```python +second_experiment = ExperimentManager( + PPOAgent, # Agent Class + (env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs) + fit_budget=int(10000), # Budget used to call our agent "fit()" + init_kwargs=dict( + batch_size=24, n_steps=96, device="cpu" + ), # Arguments for the Agent’s constructor. + eval_kwargs=dict( + eval_horizon=1000 + ), # Arguments required to call rlberry.agents.agent.Agent.eval(). + n_fit=1, # Number of agent instances to fit. + agent_name="PPO_second_experiment" + + env_id, # Name of our agent (for saving/printing) + seed=42, +) + +second_experiment.fit() + +output = evaluate_agents( + [first_experiment, second_experiment], n_simulations=5, plot=True +) # evaluate the 2 experiments on 5 simulations +print(output) +``` + +```none +[INFO] 14:39: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2496 | fit/policy_loss = -0.0443466454744339 | fit/value_loss = 33.09639358520508 | fit/entropy_loss = 0.6301112174987793 | fit/approx_kl = 0.0029671359807252884 | fit/clipfrac = 0.0 | fit/explained_variance = 0.4449042081832886 | fit/learning_rate = 0.0003 | +[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5472 | fit/policy_loss = -0.020021788775920868 | fit/value_loss = 171.70037841796875 | fit/entropy_loss = 0.5415757298469543 | fit/approx_kl = 0.001022467389702797 | fit/clipfrac = 0.0 | fit/explained_variance = 0.1336498260498047 | fit/learning_rate = 0.0003 | +[INFO] 14:39: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 8256 | fit/policy_loss = -0.016511857509613037 | fit/value_loss = 199.02989196777344 | fit/entropy_loss = 0.5490894317626953 | fit/approx_kl = 0.022175027057528496 | fit/clipfrac = 0.27083333395421505 | fit/explained_variance = 0.19932276010513306 | fit/learning_rate = 0.0003 | +[INFO] 14:39: ... trained! +[INFO] 14:39: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished +[INFO] 14:39: Evaluating PPO_second_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + + PPO_first_experimentCartPole-v1 PPO_second_experimentCartPole-v1 +0 20.6 200.6 +1 20.5 286.7 +2 18.9 238.6 +3 18.2 248.2 +4 17.7 271.9 +``` +As we can see in the output or in the following image, the second agent succeed better. + +![image](expManager_multieval.png){.align-center} + +
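+Beyond comparing final evaluations, the training curves logged by the two experiments can also be plotted side by side with [plot_writer_data](rlberry.manager.plot_writer_data). A sketch, assuming the `fit/value_loss` tag that appears in the training logs above is stored by the agents' writer:
+
+```python
+from rlberry.manager import plot_writer_data
+
+# Hedged sketch: the tag name is taken from the training log shown above.
+_ = plot_writer_data(
+    [first_experiment, second_experiment],
+    tag="fit/value_loss",
+    title="PPO value loss during training",
+)
+```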
+
+## Output the video
+If you want to see the output video of the trained Agent, you need to use the RecordVideo wrapper. As the ExperimentManager takes the environment as a tuple, you need to provide a constructor that already includes the wrapper. To do that, you can use [PipelineEnv](rlberry.envs.PipelineEnv) as the constructor, and add the wrapper and the env information in its kwargs.
+
+ **warning:** You have to do it on the eval environment, or you may record videos during the fit of your Agent.
+
+```python
+from rlberry.envs import PipelineEnv
+from gymnasium.wrappers.record_video import RecordVideo
+
+env_id = "CartPole-v1"
+env_ctor = gym_make  # constructor for training env
+env_kwargs = dict(id=env_id)  # kwargs for training env
+
+eval_env_ctor = PipelineEnv  # constructor for eval env
+eval_env_kwargs = {  # kwargs for eval env (with wrapper)
+    "env_ctor": gym_make,
+    "env_kwargs": {"id": env_id, "render_mode": "rgb_array"},
+    "wrappers": [
+        (RecordVideo, {"video_folder": "./", "name_prefix": env_id})
+    ],  # list of tuple (class,kwargs)
+}
+
+third_experiment = ExperimentManager(
+    PPOAgent,  # Agent Class
+    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
+    fit_budget=int(10000),  # Budget used to call our agent "fit()"
+    eval_env=(eval_env_ctor, eval_env_kwargs),  # Evaluation environment as tuple
+    init_kwargs=dict(batch_size=24, n_steps=96, device="cpu"),  # settings for the Agent
+    eval_kwargs=dict(
+        eval_horizon=1000
+    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
+    n_fit=1,  # Number of agent instances to fit.
+    agent_name="PPO_third_experiment" + env_id,  # Name of the agent
+    seed=42,
+)
+
+third_experiment.fit()
+
+output3 = evaluate_agents(
+    [third_experiment], n_simulations=15, plot=False
+)  # evaluate the experiment on 15 simulations
+print(output3)
+```
+
+
+```none
+[INFO] 17:03: Running ExperimentManager fit() for PPO_third_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
+[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 1536 | fit/policy_loss = -0.0001924981625052169 | fit/value_loss = 34.07163619995117 | fit/entropy_loss = 0.6320618987083435 | fit/approx_kl = 0.00042163082980550826 | fit/clipfrac = 0.0 | fit/explained_variance = -0.05607199668884277 | fit/learning_rate = 0.0003 |
+[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 3744 | fit/policy_loss = -0.02924121916294098 | fit/value_loss = 0.8705029487609863 | fit/entropy_loss = 0.6485489010810852 | fit/approx_kl = 0.0006057650898583233 | fit/clipfrac = 0.0 | fit/explained_variance = 0.9505079835653305 | fit/learning_rate = 0.0003 |
+[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 5856 | fit/policy_loss = -0.008760576136410236 | fit/value_loss = 2.063389778137207 | fit/entropy_loss = 0.5526289343833923 | fit/approx_kl = 0.017247432842850685 | fit/clipfrac = 0.08645833283662796 | fit/explained_variance = 0.9867914840579033 | fit/learning_rate = 0.0003 |
+[INFO] 17:03: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 8256 | fit/policy_loss = -0.016511857509613037 | fit/value_loss = 199.02989196777344 | fit/entropy_loss = 0.5490894317626953 | fit/approx_kl = 0.022175027057528496 | fit/clipfrac = 0.27083333395421505 | fit/explained_variance = 0.19932276010513306 | fit/learning_rate = 0.0003 |
+[INFO] 09:45: Evaluating PPO_third_experimentCartPole-v1...
+[INFO] Evaluation:Moviepy - Building video /CartPole-v1-episode-0.mp4.
+Moviepy - Writing video CartPole-v1-episode-0.mp4
+
+Moviepy - Done !
+Moviepy - video ready /CartPole-v1-episode-0.mp4 +.Moviepy - Building video /CartPole-v1-episode-1.mp4. +Moviepy - Writing video /CartPole-v1-episode-1.mp4 + +Moviepy - Done ! +Moviepy - video ready /CartPole-v1-episode-1.mp4 +.... Evaluation finished + + PPO_third_experimentCartPole-v1 +0 175.0 +1 189.0 +2 234.0 +3 146.0 +4 236.0 +``` + + + + + + +## Some advanced settings +Now an example with some more settings. (check the [API](rlberry.manager.ExperimentManager) to see all of them) + +```python +from rlberry.envs import gym_make +from rlberry_research.agents.torch import PPOAgent +from rlberry.manager import ExperimentManager, evaluate_agents + + +env_id = "CartPole-v1" +env_ctor = gym_make +env_kwargs = dict(id=env_id) + +fourth_experiment = ExperimentManager( + PPOAgent, # Agent Class + train_env=(env_ctor, env_kwargs), # Environment to train the Agent + fit_budget=int(15000), # Budget used to call our agent "fit()" + eval_env=( + env_ctor, + env_kwargs, + ), # Environment to eval the Agent (here, same as training env) + init_kwargs=dict(batch_size=24, n_steps=96, device="cpu"), # Agent setting + eval_kwargs=dict( + eval_horizon=1000 + ), # Arguments required to call rlberry.agents.agent.Agent.eval(). + agent_name="PPO_second_experiment" + env_id, # Name of the agent + n_fit=4, # Number of agent instances to fit. + output_dir="./fourth_experiment_results/", # Directory where to store data. + parallelization="thread", # parallelize agent training using threads + max_workers=2, # max 2 threads with parallelization + enable_tensorboard=True, # enable tensorboard logging +) + +fourth_experiment.fit() + +output = evaluate_agents( + [fourth_experiment], n_simulations=5, plot=False +) # evaluate the experiment on 5 simulations +print(output) +``` + +```none +[INFO] 10:28: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 4 and max_workers = 2. 
+[INFO] 10:28: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 1248 | fit/policy_loss = -0.0046189031563699245 | fit/value_loss = 17.90558624267578 | fit/entropy_loss = 0.6713765263557434 | fit/approx_kl = 0.008433022536337376 | fit/clipfrac = 0.00416666679084301 | fit/explained_variance = -0.027537941932678223 | fit/learning_rate = 0.0003 | +[INFO] 10:28: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 1344 | fit/policy_loss = -0.0015951147070154548 | fit/value_loss = 30.366439819335938 | fit/entropy_loss = 0.6787645816802979 | fit/approx_kl = 6.758669769624248e-05 | fit/clipfrac = 0.0 | fit/explained_variance = 0.03739374876022339 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2784 | fit/policy_loss = 0.0009148915414698422 | fit/value_loss = 29.08318328857422 | fit/entropy_loss = 0.6197206974029541 | fit/approx_kl = 0.01178667601197958 | fit/clipfrac = 0.012499999906867742 | fit/explained_variance = 0.0344814658164978 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 2880 | fit/policy_loss = 0.001672893762588501 | fit/value_loss = 27.00239372253418 | fit/entropy_loss = 0.6320319771766663 | fit/approx_kl = 0.003481858177110553 | fit/clipfrac = 0.0 | fit/explained_variance = 0.15528488159179688 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 4224 | fit/policy_loss = -0.022785374894738197 | fit/value_loss = 91.76630401611328 | fit/entropy_loss = 0.5638656616210938 | fit/approx_kl = 0.0017503012204542756 | fit/clipfrac = 0.0 | fit/explained_variance = -0.7095993757247925 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 4320 | fit/policy_loss = 0.0013483166694641113 | fit/value_loss = 31.31000518798828 | fit/entropy_loss = 0.589007556438446 | fit/approx_kl = 0.015259895473718643 | fit/clipfrac = 0.11979166707023978 | fit/explained_variance = -0.045020341873168945 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5856 | fit/policy_loss = 0.00605475390329957 | fit/value_loss = 44.3318977355957 | fit/entropy_loss = 0.625015377998352 | fit/approx_kl = 0.00823256652802229 | fit/clipfrac = 0.002083333395421505 | fit/explained_variance = 0.4239630103111267 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 5856 | fit/policy_loss = 0.0038757026195526123 | fit/value_loss = 68.52188873291016 | fit/entropy_loss = 0.5918349027633667 | fit/approx_kl = 0.003220468061044812 | fit/clipfrac = 0.006250000186264515 | fit/explained_variance = -0.18818902969360352 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 7488 | fit/policy_loss = -0.009495516307651997 | fit/value_loss = 101.06624603271484 | fit/entropy_loss = 0.5486583709716797 | fit/approx_kl = 0.003257486969232559 | fit/clipfrac = 0.0 | fit/explained_variance = 0.1193075180053711 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 7488 | fit/policy_loss = -0.008390605449676514 | fit/value_loss = 3.112384080886841 | fit/entropy_loss = 0.5489932894706726 | fit/approx_kl = 0.004215842578560114 | fit/clipfrac = 0.06250000121071934 | fit/explained_variance = 0.9862392572686076 | fit/learning_rate = 0.0003 | +[INFO] 10:29: 
[PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 9024 | fit/policy_loss = -0.037699855864048004 | fit/value_loss = 7.979381561279297 | fit/entropy_loss = 0.5623810887336731 | fit/approx_kl = 0.004208063241094351 | fit/clipfrac = 0.015625000465661287 | fit/explained_variance = 0.927260547876358 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 9024 | fit/policy_loss = -0.03145790100097656 | fit/value_loss = 136.57496643066406 | fit/entropy_loss = 0.6083818078041077 | fit/approx_kl = 0.00015769463789183646 | fit/clipfrac = 0.0 | fit/explained_variance = 0.020778536796569824 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 10560 | fit/policy_loss = -0.005346258636564016 | fit/value_loss = 45.43724060058594 | fit/entropy_loss = 0.5453484654426575 | fit/approx_kl = 0.0029732866678386927 | fit/clipfrac = 0.0010416666977107526 | fit/explained_variance = 0.5737246572971344 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 10656 | fit/policy_loss = -0.034032005816698074 | fit/value_loss = 8.352469444274902 | fit/entropy_loss = 0.5558638572692871 | fit/approx_kl = 0.00012727950525004417 | fit/clipfrac = 0.0 | fit/explained_variance = 0.9023054912686348 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 12192 | fit/policy_loss = -0.014423723332583904 | fit/value_loss = 4.224886417388916 | fit/entropy_loss = 0.5871571898460388 | fit/approx_kl = 0.00237840972840786 | fit/clipfrac = 0.00833333358168602 | fit/explained_variance = 0.9782726876437664 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 12192 | fit/policy_loss = 0.002156441332772374 | fit/value_loss = 0.6400878429412842 | fit/entropy_loss = 0.5812122821807861 | fit/approx_kl = 0.002348624635487795 | fit/clipfrac = 0.0031250000931322573 | fit/explained_variance = -7.9211273193359375 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 13728 | fit/policy_loss = -0.009624089114367962 | fit/value_loss = 0.2872621715068817 | fit/entropy_loss = 0.5476118922233582 | fit/approx_kl = 0.005943961441516876 | fit/clipfrac = 0.045833333022892477 | fit/explained_variance = -2.098886489868164 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 13824 | fit/policy_loss = -0.002612883923575282 | fit/value_loss = 142.26548767089844 | fit/entropy_loss = 0.5882496237754822 | fit/approx_kl = 0.001458114362321794 | fit/clipfrac = 0.0 | fit/explained_variance = 0.11501973867416382 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 1440 | fit/policy_loss = -0.0015770109603181481 | fit/value_loss = 19.095449447631836 | fit/entropy_loss = 0.6553768515586853 | fit/approx_kl = 0.0036005538422614336 | fit/clipfrac = 0.0 | fit/explained_variance = -0.02544558048248291 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 1440 | fit/policy_loss = 0.010459281504154205 | fit/value_loss = 24.7592716217041 | fit/entropy_loss = 0.6623566746711731 | fit/approx_kl = 0.003298681229352951 | fit/clipfrac = 0.0 | fit/explained_variance = -0.06966197490692139 | fit/learning_rate = 0.0003 | +[INFO] 10:29: 
[PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 2976 | fit/policy_loss = 0.016300952062010765 | fit/value_loss = 38.56718826293945 | fit/entropy_loss = 0.6324384212493896 | fit/approx_kl = 0.0001397288142470643 | fit/clipfrac = 0.0 | fit/explained_variance = 0.06470108032226562 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 3072 | fit/policy_loss = -0.04757208749651909 | fit/value_loss = 49.06455612182617 | fit/entropy_loss = 0.5877493023872375 | fit/approx_kl = 0.014825299382209778 | fit/clipfrac = 0.05000000027939677 | fit/explained_variance = 0.162692129611969 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 4608 | fit/policy_loss = 0.010635286569595337 | fit/value_loss = 63.65742874145508 | fit/entropy_loss = 0.54666668176651 | fit/approx_kl = 0.014807184226810932 | fit/clipfrac = 0.1291666661389172 | fit/explained_variance = 0.17509007453918457 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 4704 | fit/policy_loss = 0.007104901131242514 | fit/value_loss = 60.899166107177734 | fit/entropy_loss = 0.5803811550140381 | fit/approx_kl = 0.016342414543032646 | fit/clipfrac = 0.10000000083819031 | fit/explained_variance = 0.14491546154022217 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 6240 | fit/policy_loss = -0.017549103125929832 | fit/value_loss = 131.22430419921875 | fit/entropy_loss = 0.5248685479164124 | fit/approx_kl = 0.0007476735045202076 | fit/clipfrac = 0.0 | fit/explained_variance = 0.18155068159103394 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 6336 | fit/policy_loss = -0.009597748517990112 | fit/value_loss = 306.9555358886719 | fit/entropy_loss = 0.5775970220565796 | fit/approx_kl = 0.0005952063947916031 | fit/clipfrac = 0.0 | fit/explained_variance = -0.3066709041595459 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 7872 | fit/policy_loss = -0.0011599212884902954 | fit/value_loss = 0.3407192528247833 | fit/entropy_loss = 0.41181233525276184 | fit/approx_kl = 0.01260202657431364 | fit/clipfrac = 0.19791666492819787 | fit/explained_variance = 0.975825097411871 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 7968 | fit/policy_loss = -0.047126080840826035 | fit/value_loss = 30.541654586791992 | fit/entropy_loss = 0.5876209139823914 | fit/approx_kl = 0.0013518078485503793 | fit/clipfrac = 0.0 | fit/explained_variance = 0.7769163846969604 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 9504 | fit/policy_loss = -0.005419999361038208 | fit/value_loss = 2.6821603775024414 | fit/entropy_loss = 0.4786674976348877 | fit/approx_kl = 0.002310350304469466 | fit/clipfrac = 0.0 | fit/explained_variance = 0.5584505200386047 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 9600 | fit/policy_loss = -0.0188402459025383 | fit/value_loss = 175.68919372558594 | fit/entropy_loss = 0.5457869172096252 | fit/approx_kl = 0.0003522926417645067 | fit/clipfrac = 0.0 | fit/explained_variance = 0.37716150283813477 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step 
= 11136 | fit/policy_loss = 0.002097398042678833 | fit/value_loss = 0.04328225180506706 | fit/entropy_loss = 0.520216166973114 | fit/approx_kl = 2.282687091792468e-05 | fit/clipfrac = 0.002083333395421505 | fit/explained_variance = 0.9905285472050309 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 11232 | fit/policy_loss = -0.02607026696205139 | fit/value_loss = 0.9695928692817688 | fit/entropy_loss = 0.544611930847168 | fit/approx_kl = 0.0014795692404732108 | fit/clipfrac = 0.008333333488553762 | fit/explained_variance = 0.775581106543541 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 12768 | fit/policy_loss = -0.00431543355807662 | fit/value_loss = 0.1664377897977829 | fit/entropy_loss = 0.5166257619857788 | fit/approx_kl = 0.01913692243397236 | fit/clipfrac = 0.2750000006519258 | fit/explained_variance = 0.950390450656414 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 12864 | fit/policy_loss = -0.01600167714059353 | fit/value_loss = 1.0876649618148804 | fit/entropy_loss = 0.5791975259780884 | fit/approx_kl = 0.015078354626893997 | fit/clipfrac = 0.05416666679084301 | fit/explained_variance = 0.957294762134552 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 14208 | fit/policy_loss = -0.007962663657963276 | fit/value_loss = 0.4722827970981598 | fit/entropy_loss = 0.5429236888885498 | fit/approx_kl = 0.011440296657383442 | fit/clipfrac = 0.029166666604578496 | fit/explained_variance = -2.395857334136963 | fit/learning_rate = 0.0003 | +[INFO] 10:29: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 14304 | fit/policy_loss = -0.00023897241044323891 | fit/value_loss = 0.37554287910461426 | fit/entropy_loss = 0.5310923457145691 | fit/approx_kl = 0.0071893795393407345 | fit/clipfrac = 0.18125000046566128 | fit/explained_variance = -2.975414991378784 | fit/learning_rate = 0.0003 | +[INFO] 10:29: ... trained! +[INFO] 10:29: Evaluating PPO_second_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + + PPO_second_experimentCartPole-v1 +0 164.0 +1 142.0 +2 180.0 +3 500.0 +4 500.0 +``` + +
+
+- In the output, you can see the learning of workers 0 and 1 first, then workers 2 and 3 (4 fits, but a maximum of 2 threads with parallelization).
+- You can check the tensorboard logging with `tensorboard --logdir `. diff --git a/docs/basics/userguide/logging.md b/docs/basics/userguide/logging.md new file mode 100644 index 000000000..e038f7ce4 --- /dev/null +++ b/docs/basics/userguide/logging.md @@ -0,0 +1,91 @@ +(logging_page)=
+
+# How to log your experiment
+
+To get information and readable results about the training of your algorithm, you can use different loggers.
+## Set rlberry's logger level
+For these examples, you will use the "PPO" torch agent from "[rlberry-research](https://github.com/rlberry-py/rlberry-research)".
+
+```python
+from rlberry.envs import gym_make
+from rlberry_research.agents.torch import PPOAgent
+from rlberry.manager import ExperimentManager, evaluate_agents
+
+
+env_id = "CartPole-v1"  # Id of the environment
+
+env_ctor = gym_make  # constructor for the env
+env_kwargs = dict(id=env_id)  # give the id of the env inside the kwargs
+
+
+first_experiment = ExperimentManager(
+    PPOAgent,  # Agent Class
+    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
+    fit_budget=int(100),  # Budget used to call our agent "fit()"
+    eval_kwargs=dict(
+        eval_horizon=1000
+    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
+    n_fit=1,  # Number of agent instances to fit.
+    agent_name="PPO_first_experiment" + env_id,  # Name of the agent
+    seed=42,
+)
+
+first_experiment.fit()
+
+output = evaluate_agents(
+    [first_experiment], n_simulations=5, plot=False
+)  # evaluate the experiment on 5 simulations
+print(output)
+```
+
+```none
+[INFO] 15:50: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
+[INFO] 15:51: ... trained!
+[INFO] 15:51: Evaluating PPO_first_experimentCartPole-v1...
+[INFO] Evaluation:..... Evaluation finished
+   PPO_first_experimentCartPole-v1
+0                             15.0
+1                             16.0
+2                             16.0
+3                             18.0
+4                             15.0
+```
+
+As you can see in the previous output, you have the "[INFO]" messages (from [ExperimentManager](rlberry.manager.ExperimentManager)).
+
+
+You can choose the verbosity of your logger:
+with [rlberry.logging.set_level(level='...')](rlberry.utils.logging.set_level), you can select the level of your logger and choose what type of information you want.
+For example, to keep only the "CRITICAL" information, add these lines on top of the previous code, then run it again:
+```python
+import rlberry
+
+rlberry.utils.logging.set_level(level="CRITICAL")
+```
+
+```none
+   PPO_first_experimentCartPole-v1
+0                             15.0
+1                             16.0
+2                             16.0
+3                             18.0
+4                             15.0
+```
+As you can see in the previous output, you no longer have the "INFO" messages (because they are not "CRITICAL" outputs).
+
+
+
+## Writer
+To keep information during and after the experiment, rlberry uses a 'writer'. The writer is stored inside the [Agent](agent_page), and is updated in its fit() function.
+
+By default (with the [Agent interface](rlberry.agents.Agent)), the writer is [DefaultWriter](rlberry.utils.writers.DefaultWriter).
+
+To keep information about the environment inside the writer, you can wrap the environment inside [WriterWrapper](rlberry.wrappers.WriterWrapper).
+
+
+To get the data saved during an experiment as a Pandas DataFrame, you can use [plot_writer_data](rlberry.manager.plot_writer_data) on the [ExperimentManager](rlberry.manager.ExperimentManager) (or a list of them).
+Example [here](../../auto_examples/demo_bandits/plot_mirror_bandit).
+
+To plot the data saved during an experiment, you can use [plot_writer_data](rlberry.manager.plot_writer_data) on the [ExperimentManager](rlberry.manager.ExperimentManager) (or a list of them).
+Example [here](../../auto_examples/plot_writer_wrapper). diff --git a/docs/basics/userguide/save_load.md b/docs/basics/userguide/save_load.md new file mode 100644 index 000000000..5b63f53ba --- /dev/null +++ b/docs/basics/userguide/save_load.md @@ -0,0 +1,157 @@ +(save_load_page)=
+
+# How to save/load an experiment
+
+
+For this example, we'll use the same code as in the [ExperimentManager](ExperimentManager_page) section of the User Guide, and use the save and load functions.
+
+## How to save an experiment
+
+To save your experiment, you have to train it first (with `fit()`), then you just have to use the `save()` function.
+
+Train the Agent:
+```python
+from rlberry.envs import gym_make
+from rlberry_scool.agents.tabular_rl import QLAgent
+from rlberry.manager import ExperimentManager
+
+from rlberry.seeding import Seeder
+
+seeder = Seeder(123)  # seeder initialization
+
+env_id = "FrozenLake-v1"  # Id of the environment
+env_ctor = gym_make  # constructor for the env
+env_kwargs = dict(
+    id=env_id, is_slippery=False
+)  # give the id of the env inside the kwargs
+
+
+experiment_to_save = ExperimentManager(
+    QLAgent,  # Agent Class
+    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
+    init_kwargs=dict(
+        gamma=0.95, alpha=0.8, exploration_type="epsilon", exploration_rate=0.25
+    ),  # agent args
+    fit_budget=int(300000),  # Budget used to call our agent "fit()"
+    n_fit=1,  # Number of agent instances to fit.
+    seed=seeder,  # to be reproducible
+    agent_name="PPO" + env_id,  # Name of the agent
+    output_dir="./results/",  # where to store the outputs
+)
+
+experiment_to_save.fit()
+print(experiment_to_save.get_agent_instances()[0].Q)  # print the content of the Q-table
+```
+
+```none
+[INFO] 11:11: Running ExperimentManager fit() for PPOFrozenLake-v1 with n_fit = 1 and max_workers = None.
+[INFO] 11:11:       agent_name  worker  episode_rewards  max_global_step
+              PPOFrozenLake-v1       0              0.0           178711
+[INFO] 11:11: ... trained!
+writers.py:108: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
+  df = pd.concat([df, pd.DataFrame(self._data[tag])], ignore_index=True)
+[[0.73509189 0.77378094 0.77378094 0.73509189]
+ [0.73509189 0.         0.81450625 0.77378094]
+ [0.77378094 0.857375   0.77378094 0.81450625]
+ [0.81450625 0.         0.77377103 0.77378092]
+ [0.77378094 0.81450625 0.         0.73509189]
+ [0.         0.         0.         0.        ]
+ [0.         0.9025     0.         0.81450625]
+ [0.         0.         0.         0.        ]
+ [0.81450625 0.         0.857375   0.77378094]
+ [0.81450625 0.9025     0.9025     0.        ]
+ [0.857375   0.95       0.         0.857375  ]
+ [0.         0.         0.         0.        ]
+ [0.         0.         0.         0.        ]
+ [0.         0.9025     0.95       0.857375  ]
+ [0.9025     0.95       1.         0.9025    ]
+ [0.         0.         0.         0.        ]]
+[INFO] 11:11: Saved ExperimentManager(PPOFrozenLake-v1) using pickle.
+```
+
+After this run, you can see the printed content of the Q-table.
+At the end of the fit, the data of this experiment are saved automatically, according to the `output_dir` parameter (here `./results/`). If you don't specify the `output_dir` parameter, it will be saved by default inside the `rlberry_data/temp/` folder.
+(Or you can use a temporary folder, by importing the `tempfile` library and using `with tempfile.TemporaryDirectory() as tmpdir:`.)
+
+In this folder, you should find:
+- `manager_obj.pickle` and the folder `agent_handler`, the save of your experiment and your agent.
+- `data.csv`, the episode results collected during the training process.
+
+
+## How to load a previous experiment?
+In this example, you will load the experiment saved in the previous part.
+
+To load a previously saved experiment, you need to:
+- Locate the file you want to load.
+- Use the function `load()` from the class [ExperimentManager](rlberry.manager.ExperimentManager.load).
+
+```python
+import pathlib
+from rlberry.envs import gym_make
+from rlberry.manager.experiment_manager import ExperimentManager
+
+
+path_to_load = next(
+    pathlib.Path("results").glob("**/manager_obj.pickle")
+)  # find the path to the "manager_obj.pickle"
+
+loaded_experiment_manager = ExperimentManager.load(path_to_load)  # load the experiment
+
+print(
+    loaded_experiment_manager.get_agent_instances()[0].Q
+)  # print the content of the Q-table
+```
+If you want to test the agent from the loaded experiment, you can add:
+
+```python
+env_id = "FrozenLake-v1"  # Id of the environment
+env_ctor = gym_make  # constructor for the env
+env_kwargs = dict(
+    id=env_id, is_slippery=False
+)  # give the id of the env inside the kwargs
+test_env = env_ctor(**env_kwargs)  # create the Environment
+
+# test the agent of the experiment on the test environment
+observation, info = test_env.reset()
+for tt in range(50):
+    action = loaded_experiment_manager.get_agent_instances()[0].policy(observation)
+    next_observation, reward, terminated, truncated, info = test_env.step(action)
+    done = terminated or truncated
+    if done:
+        if reward == 1:
+            print("Success!")
+            break
+        else:
+            print("Fail! Retry!")
+            next_observation, info = test_env.reset()
+    observation = next_observation
+```
+
+```none
+[[0.73509189 0.77378094 0.77378094 0.73509189]
+ [0.73509189 0.         0.81450625 0.77378094]
+ [0.77378094 0.857375   0.77378094 0.81450625]
+ [0.81450625 0.         0.77377103 0.77378092]
+ [0.77378094 0.81450625 0.         0.73509189]
+ [0.         0.         0.         0.        ]
+ [0.         0.9025     0.         0.81450625]
+ [0.         0.         0.         0.        ]
+ [0.81450625 0.         0.857375   0.77378094]
+ [0.81450625 0.9025     0.9025     0.        ]
+ [0.857375   0.95       0.         0.857375  ]
+ [0.         0.         0.         0.        ]
+ [0.         0.         0.         0.        ]
+ [0.         0.9025     0.95       0.857375  ]
+ [0.9025     0.95       1.         0.9025    ]
+ [0.         0.         0.         0.        ]]
+Success!
+```
+
+As you can see, we haven't re-fit the experiment: the Q-table is the same as the one previously saved (and the Agent can finish the environment).
+
+## Other information
+
+The `save` and `load` functions can be useful when:
+- you want to train your agent on one computer, and test/use it on others;
+- you have a long training run, and you want to make some checkpoints;
+- you want to split the training into several calls (only if, for your agent, "fit(x) then fit(y)" is the same as "fit(x+y)"). diff --git a/docs/basics/userguide/seeding.md b/docs/basics/userguide/seeding.md new file mode 100644 index 000000000..01c3755e7 --- /dev/null +++ b/docs/basics/userguide/seeding.md @@ -0,0 +1,376 @@ +(seeding_page)=
+
+# How to seed your experiment
+
+Rlberry has a class [Seeder](rlberry.seeding.seeder.Seeder) that conveniently wraps a [NumPy SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html), and allows you to create independent random number generators for different objects and threads, using a single [Seeder](rlberry.seeding.seeder.Seeder) instance.
It works as follows:
+
+## Basics
+Suppose you want to generate 5 random integers between 0 and 9.
+
+If you run this code many times, you should get different outputs.
+```python
+from rlberry.seeding import Seeder
+
+seeder = Seeder()
+
+result_list = []
+for _ in range(5):
+    result_list.append(seeder.rng.integers(10))
+print(result_list)
+```
+
+run 1:
+```none
+[9, 3, 4, 8, 4]
+```
+run 2:
+```none
+[2, 0, 6, 3, 9]
+```
+run 3:
+```none
+[7, 3, 8, 1, 1]
+```
+
+
+But if you fix the seed as follows and run it many times, you should get the same 'random' numbers every time:
+```python
+from rlberry.seeding import Seeder
+
+seeder = Seeder(123)
+
+result_list = []
+for _ in range(5):
+    result_list.append(seeder.rng.integers(10))
+print(result_list)
+```
+
+run 1:
+```none
+[9, 1, 0, 7, 4]
+```
+run 2:
+```none
+[9, 1, 0, 7, 4]
+```
+run 3:
+```none
+[9, 1, 0, 7, 4]
+```
+
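+The same property holds across Seeder instances: two Seeders created with the same seed produce the same stream of numbers. A minimal sketch (the seed value below is arbitrary):
+
+```python
+from rlberry.seeding import Seeder
+
+seeder_a = Seeder(2024)
+seeder_b = Seeder(2024)  # same seed, so both generators produce the same stream
+
+print([seeder_a.rng.integers(10) for _ in range(5)])
+print([seeder_b.rng.integers(10) for _ in range(5)])
+```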
+
+## In rlberry
+
+### classic usage
+Each [Seeder](rlberry.seeding.seeder.Seeder) instance has a random number generator (rng); see [here](https://numpy.org/doc/stable/reference/random/generator.html) to check the methods available on it.
+
+The `reseed(seeder)` functions of [Agent](agent_page) and [Environment](environment_page) should use `seeder.spawn()`, which allows creating new independent child generators from the same seeder. So it's good practice to use a single seeder to reseed both the Agent and the Environment: they will each get their own seeder and rng.
+
+When writing your own agents and inheriting from the Agent class, you should use agent.rng whenever you need to generate random numbers; the same applies to your environments.
+This is necessary to ensure reproducibility. (A toy illustration of this pattern is given after the example below.)
+
+```python
+from rlberry.seeding import Seeder
+
+seeder = Seeder(123)  # seeder initialization
+
+from rlberry.envs import gym_make
+from rlberry_research.agents import RSUCBVIAgent
+
+
+env = gym_make("MountainCar-v0")
+env.reseed(seeder)  # seeder first use
+
+agent = RSUCBVIAgent(env)
+agent.reseed(seeder)  # same seeder
+
+# check that the generated numbers are different
+print("env seeder: ", env.seeder)
+print("random sample 1 from env rng: ", env.rng.normal())
+print("random sample 2 from env rng: ", env.rng.normal())
+print("agent seeder: ", agent.seeder)
+print("random sample 1 from agent rng: ", agent.rng.normal())
+print("random sample 2 from agent rng: ", agent.rng.normal())
+```
+
+```none
+env seeder:  Seeder object with: SeedSequence(
+    entropy=123,
+    spawn_key=(0, 0),
+)
+random sample 1 from env rng:  -1.567498838741829
+random sample 2 from env rng:  0.6356604305460527
+agent seeder:  Seeder object with: SeedSequence(
+    entropy=123,
+    spawn_key=(0, 1),
+    n_children_spawned=2,
+)
+random sample 1 from agent rng:  1.2466559261185188
+random sample 2 from agent rng:  0.8402527193117317
+
+```
+
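+Here is the toy illustration mentioned above. It is only a minimal sketch (the class below is invented for the example and is not a real rlberry Agent): the object receives its own independent generator from a parent seeder through a `reseed` method, and then draws all of its randomness from `self.rng`.
+
+```python
+from rlberry.seeding import Seeder
+
+
+class EpsilonGreedyExploration:
+    """Toy object that keeps its own rng, obtained from a parent Seeder."""
+
+    def __init__(self, n_actions, epsilon=0.1):
+        self.n_actions = n_actions
+        self.epsilon = epsilon
+        self.rng = None
+
+    def reseed(self, seeder):
+        # get an independent child seeder from the parent (seeder.spawn(2)
+        # returns two independent children; only the first one is kept here)
+        self.rng = seeder.spawn(2)[0].rng
+
+    def select(self, greedy_action):
+        # draw all randomness from self.rng, never from the global numpy state
+        if self.rng.uniform() < self.epsilon:
+            return int(self.rng.integers(self.n_actions))
+        return greedy_action
+
+
+seeder = Seeder(123)
+exploration = EpsilonGreedyExploration(n_actions=2, epsilon=0.5)
+exploration.reseed(seeder)
+print([exploration.select(0) for _ in range(10)])
+```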
+ +### With ExperimentManager +For this part we will use the same code from the [ExperimentManager](ExperimentManager_page) part. +3 runs without the seeder : + +```python +from rlberry.envs import gym_make +from rlberry_research.agents.torch import PPOAgent +from rlberry.manager import ExperimentManager, evaluate_agents + + +env_id = "CartPole-v1" # Id of the environment + +env_ctor = gym_make # constructor for the env +env_kwargs = dict(id=env_id) # give the id of the env inside the kwargs + + +first_experiment = ExperimentManager( + PPOAgent, # Agent Class + (env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs) + fit_budget=int(100), # Budget used to call our agent "fit()" + eval_kwargs=dict( + eval_horizon=1000 + ), # Arguments required to call rlberry.agents.agent.Agent.eval(). + n_fit=1, # Number of agent instances to fit. + agent_name="PPO_first_experiment" + env_id, # Name of the agent +) + +first_experiment.fit() + +output = evaluate_agents( + [first_experiment], n_simulations=5, plot=False +) # evaluate the experiment on 5 simulations +print(output) +``` + + +Run 1: +```none +[INFO] 14:47: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:47: ... trained! +[INFO] 14:47: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 20.8 +1 20.8 +2 21.4 +3 24.3 +4 28.8 +``` + +Run 2 : +```none +[INFO] 14:47: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:47: ... trained! +[INFO] 14:47: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 25.0 +1 19.3 +2 28.5 +3 26.1 +4 19.0 +``` + +Run 3 : +```none +[INFO] 14:47: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:47: ... trained! +[INFO] 14:47: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 23.6 +1 19.2 +2 20.5 +3 19.8 +4 16.5 +``` + + +
+ +**Without the seeder, the outputs are different (non-reproducible).** + +
+ +3 runs with the seeder : + +```python +from rlberry.envs import gym_make +from rlberry_research.agents.torch import PPOAgent +from rlberry.manager import ExperimentManager, evaluate_agents + +from rlberry.seeding import Seeder + +seeder = Seeder(42) + +env_id = "CartPole-v1" # Id of the environment + +env_ctor = gym_make # constructor for the env +env_kwargs = dict(id=env_id) # give the id of the env inside the kwargs + + +first_experiment = ExperimentManager( + PPOAgent, # Agent Class + (env_ctor, env_kwargs), # Environment as Tuple(constructor,kwargs) + fit_budget=int(100), # Budget used to call our agent "fit()" + eval_kwargs=dict( + eval_horizon=1000 + ), # Arguments required to call rlberry.agents.agent.Agent.eval(). + n_fit=1, # Number of agent instances to fit. + agent_name="PPO_first_experiment" + env_id, # Name of the agent + seed=seeder, +) + +first_experiment.fit() + +output = evaluate_agents( + [first_experiment], n_simulations=5, plot=False +) # evaluate the experiment on 5 simulations +print(output) +``` + +Run 1: +```none +[INFO] 14:46: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:46: ... trained! +[INFO] 14:46: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 23.3 +1 19.7 +2 23.0 +3 18.8 +4 19.7 +``` + +Run 2 : +```none +[INFO] 14:46: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:46: ... trained! +[INFO] 14:46: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 23.3 +1 19.7 +2 23.0 +3 18.8 +4 19.7 +``` + + +Run 3 : +```none +[INFO] 14:46: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None. +[INFO] 14:46: ... trained! +[INFO] 14:46: Evaluating PPO_first_experimentCartPole-v1... +[INFO] Evaluation:..... Evaluation finished + PPO_first_experimentCartPole-v1 +0 23.3 +1 19.7 +2 23.0 +3 18.8 +4 19.7 +``` + + +
+ + +**With the seeder, the outputs are the same (reproducible).** + + +
+
+
+
+
+### multi-threading
+If you want to use multi-threading, a seeder can spawn other seeders that are independent from it.
+This is useful to seed two different threads, using seeder1 in the first thread, and seeder2 in the second thread.
+```python
+from rlberry.seeding import Seeder
+
+seeder = Seeder(123)
+seeder1, seeder2 = seeder.spawn(2)
+
+print("random sample 1 from seeder1 rng: ", seeder1.rng.normal())
+print("random sample 2 from seeder1 rng: ", seeder1.rng.normal())
+print("-----")
+print("random sample 1 from seeder2 rng: ", seeder2.rng.normal())
+print("random sample 2 from seeder2 rng: ", seeder2.rng.normal())
+```
+```none
+random sample 1 from seeder1 rng:  -0.4732958445958833
+random sample 2 from seeder1 rng:  0.5863995575997462
+-----
+random sample 1 from seeder2 rng:  -0.1722486099076424
+random sample 2 from seeder2 rng:  -0.1930990650226178
+```
+
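+As an illustration, here is a minimal sketch (using only the standard library on top of the documented `spawn` call) where each thread receives its own child seeder and draws from its own rng:
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+from rlberry.seeding import Seeder
+
+
+def sample_in_thread(name, child_seeder, n=3):
+    # each worker only uses the rng of its own child seeder
+    return name, [int(child_seeder.rng.integers(10)) for _ in range(n)]
+
+
+seeder = Seeder(123)
+seeder1, seeder2 = seeder.spawn(2)
+
+with ThreadPoolExecutor(max_workers=2) as pool:
+    futures = [
+        pool.submit(sample_in_thread, "thread-1", seeder1),
+        pool.submit(sample_in_thread, "thread-2", seeder2),
+    ]
+    for future in futures:
+        print(*future.result())
+```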
+
+## External libraries
+You can also use a seeder to seed external libraries (such as torch) using the method `set_external_seed`.
+It will be useful if you want reproducibility with external libraries. In this example, you will use `torch` to generate random numbers.
+
+If you run this code many times, you should get different outputs.
+```python
+import torch
+
+result_list = []
+for i in range(5):
+    result_list.append(torch.randint(2**32, (1,))[0].item())
+
+print(result_list)
+```
+
+run 1:
+```none
+[3817148928, 671396126, 2950680447, 791815335, 3335786391]
+```
+run 2:
+```none
+[82990446, 2463687945, 1829003305, 647811387, 3543380778]
+```
+run 3:
+```none
+[3887070615, 363268341, 3607514851, 3881090947, 1018754931]
+```
+
+
+
+If you add a [Seeder](rlberry.seeding.seeder.Seeder) to this code, use the `set_external_seed` method, and re-run it, you should get the same 'random' numbers every time.
+
+```python
+import torch
+from rlberry.seeding import set_external_seed
+from rlberry.seeding import Seeder
+
+seeder = Seeder(123)
+
+set_external_seed(seeder)
+result_list = []
+for i in range(5):
+    result_list.append(torch.randint(2**32, (1,))[0].item())
+
+print(result_list)
+```
+
+run 1:
+```none
+[693246422, 3606543353, 433394544, 2194426398, 3928404622]
+```
+run 2:
+```none
+[693246422, 3606543353, 433394544, 2194426398, 3928404622]
+```
+run 3:
+```none
+[693246422, 3606543353, 433394544, 2194426398, 3928404622]
+``` diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 000000000..99dd9f048 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,71 @@ +(index)=
+
+```{image} ../assets/logo_wide.svg
+:align: center
+:width: 50%
+:alt: rlberry logo
+```
+
+## An RL Library for Research and Education
+**Writing reinforcement learning algorithms is fun!** *But after the fun, we have
+lots of boring things to implement*: run our agents in parallel, average and plot results,
+optimize hyperparameters, compare to baselines, create tricky environments etc etc!
+
+[rlberry](https://github.com/rlberry-py/rlberry) **is here to make your life easier** by doing all these things with a few lines of code, so that you can spend most of your time developing agents.
+
+We provide you with a number of tools to help you achieve **reproducibility**, **statistical comparisons** of RL agents, and **nice visualizations**.
+
+If you are new to [rlberry](https://github.com/rlberry-py/rlberry), **check our** [RL quickstart](quick_start) **and our** [Deep RL quickstart](TutorialDeepRL).
+
+
+## Documentation Contents
+You can find the main documentation here:
+- [Installation](installation)
+- [User Guide](user_guide)
+- [Examples](examples)
+- [API](api)
+- [Changelog](changelog)
+
+
+## Contributing to rlberry
+If you want to contribute to rlberry, check out [the contribution guidelines](contributing).
+
+## rlberry main features
+
+### ExperimentManager
+This is one of the core elements of [rlberry](https://github.com/rlberry-py/rlberry). The [ExperimentManager](rlberry.manager.experiment_manager.ExperimentManager) allows you to easily run an experiment with an [Agent](agent_page) and an [Environment](environment_page). It's used to train, optimize hyperparameters, evaluate and gather statistics about an agent. See the [ExperimentManager](experimentManager_page) page.
+ +### Seeding & Reproducibility +[rlberry](https://github.com/rlberry-py/rlberry) has a class [Seeder](rlberry.seeding.seeder.Seeder) that conveniently wraps a [NumPy SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html), +and allows us to create independent random number generators for different objects and threads, using a single +[Seeder](rlberry.seeding.seeder.Seeder) instance. See the [Seeding](seeding_page) page. + + +You can also save and load your experiments. +It could be useful in many way : +- don't repeat the training part every time. +- continue a previous training (or doing checkpoint). + +See the [Save and Load Experiment](save_load_page) page. + +### Statistical comparison of RL agents + +#### Compare agents +Compare several trained agents using the mean over a specify number of evaluations for each agent. +TODO : to complete + +#### AdaStop +TODO : Text + + +[linked paper](https://hal-lara.archives-ouvertes.fr/hal-04132861/) + + +[GitHub](https://github.com/TimotheeMathieu/adastop) + +### Visualization +TODO : + + +### And many more ! +Check the [User Guide](user_guide) to find more tools ! diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index 9f4868f3b..000000000 --- a/docs/index.rst +++ /dev/null @@ -1,64 +0,0 @@ -.. image:: ../assets/logo_wide.svg - :width: 50% - :alt: rlberry logo - :align: center - -.. _rlberry: https://github.com/rlberry-py/rlberry -../ - -.. _index: - -An RL Library for Research and Education -======================================== - - -**Writing reinforcement learning algorithms is fun!** *But after the fun, we have -lots of boring things to implement*: run our agents in parallel, average and plot results, -optimize hyperparameters, compare to baselines, create tricky environments etc etc! - -rlberry_ **is here to make your life easier** by doing all these things with a few lines of code, -so that you can spend most of your time developing agents. **Check our** :ref:`the quickstart` - - - - -In addition, rlberry_: - -* Provides **implementations of several RL agents** for you to use as a starting point or as baselines; -* Provides a set of **benchmark environments**, very useful to debug and challenge your algorithms; -* Handles all random seeds for you, ensuring **reproducibility** of your results; -* Is **fully compatible with** several commonly used RL libraries like `Gymnasium `_ and `Stable Baselines `_. - - - -Seeding & Reproducibility -========================== - -rlberry_ has a class :class:`~rlberry.seeding.seeder.Seeder` that conveniently wraps a `NumPy SeedSequence `_, -and allows us to create independent random number generators for different objects and threads, using a single -:class:`~rlberry.seeding.seeder.Seeder` instance. See :ref:`Seeding `. - - - -Contributing to rlberry -======================= - -If you want to contribute to rlberry, check out :ref:`the contribution guidelines`. - - - -Documentation Contents -====================== - -.. toctree:: - :maxdepth: 3 - - installation - user_guide - external - -.. toctree:: - :maxdepth: 2 - - api - changelog diff --git a/docs/installation.rst b/docs/installation.rst index e89e427da..022dad300 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -14,10 +14,6 @@ First, we suggest you to create a virtual environment using $ conda create -n rlberry $ conda activate rlberry -OS dependency -------------- - -In order to render videos in rlberry, `ffmpeg `_ must be installed. 
Latest version (0.7.0) ------------------------------------- @@ -36,14 +32,10 @@ Install the development version to test new features. .. code:: bash - $ pip install git+https://github.com/rlberry-py/rlberry.git#egg=rlberry[default] + $ pip install rlberry[default]@git+https://github.com/rlberry-py/rlberry.git .. warning:: - - When using Python 3.10, there seem to be a problem when installing PyOpenGL-accelerate. For - now, we advise people to use Python 3.9 with PyOpenGL==3.1.5 and PyOpenGL-accelerate==3.1.5. - It is also possible to use rlberry without installing PyOpenGL-accelerate but this could cause - rendering to be slow. + For `zsh` users, `zsh` uses brackets for globbing, therefore it is necessary to add quotes around the argument, e.g. :code:`pip install 'rlberry[default]@git+https://github.com/rlberry-py/rlberry.git'`. Previous versions @@ -53,14 +45,14 @@ If you used a previous version in your work, you can install it by running .. code:: bash - $ pip install git+https://github.com/rlberry-py/rlberry.git@{TAG_NAME}#egg=rlberry[default] + $ pip install rlberry[default]@git+https://github.com/rlberry-py/rlberry.git@{TAG_NAME} replacing `{TAG_NAME}` by the tag of the corresponding version, -e.g., :code:`pip install git+https://github.com/rlberry-py/rlberry.git@v0.1#egg=rlberry[default]` +e.g., :code:`pip install rlberry[default]@git+https://github.com/rlberry-py/rlberry.git@v0.1` to install version 0.1. .. warning:: - For `zsh` users, `zsh` uses brackets for globbing, therefore it is necessary to add quotes around the argument, e.g. :code:`pip install 'git+https://github.com/rlberry-py/rlberry.git#egg=rlberry[default]'`. + For `zsh` users, `zsh` uses brackets for globbing, therefore it is necessary to add quotes around the argument, e.g. :code:`pip install 'rlberry[default]@git+https://github.com/rlberry-py/rlberry.git@v0.1'`. Deep RL agents @@ -72,15 +64,8 @@ Deep RL agents require extra libraries, like PyTorch. .. code:: bash - $ pip install git+https://github.com/rlberry-py/rlberry.git#egg=rlberry[torch_agents] + $ pip install rlberry[torch_agents]@git+https://github.com/rlberry-py/rlberry.git $ pip install tensorboard -* JAX agents (**Linux only, experimental**): - - -* Stable-baselines3 agents with Gymnasium support: - (https://github.com/DLR-RM/stable-baselines3/pull/1327) -.. code:: bash - - $ pip install git+https://github.com/DLR-RM/stable-baselines3@feat/gymnasium-support - $ pip install git+https://github.com/Stable-Baselines-Team/stable-baselines3-contrib@feat/gymnasium-support +.. warning:: + For `zsh` users, `zsh` uses brackets for globbing, therefore it is necessary to add quotes around the argument, e.g. :code:`pip install 'rlberry[torch_agents]@git+https://github.com/rlberry-py/rlberry.git'`. diff --git a/docs/themes/scikit-learn-fork/nav.html b/docs/themes/scikit-learn-fork/nav.html index 1bbd96a20..952f0923c 100644 --- a/docs/themes/scikit-learn-fork/nav.html +++ b/docs/themes/scikit-learn-fork/nav.html @@ -31,6 +31,9 @@