Supports only Python3 (oops).
Most (deep) RL algorithms work by optimizing a neural network through interaction with a learning environment. The goal of this package is to minimize the implementation effort for RL practitioners. To run an RL algorithm, they only need to implement (or, more commonly, wrap) an OpenAI-gym environment and a neural network as a tf.keras model, along with an interface function that turns the observation from the gym environment into a format that can be fed into the tf.keras model.
pip install -e .
in your favorite virtual env.
- tensorflow>=1.5.0
- gym>=0.9.6
- gym[atari] (optional)
Actor-critic family
- A3C (https://arxiv.org/abs/1602.01783) Actor-critic that uses a critic-based advantage function as the baseline for variance reduction, trained with asynchronous parallel workers.
- ACER (https://arxiv.org/abs/1611.01224) A3C with uniform replay, using the Retrace off-policy correction. The critic becomes a state-action value function instead of a state-only function. The authors proposed a trust-region optimization scheme based on the KL divergence with respect to a Polyak-averaged policy network. This implementation instead includes the KL divergence (with a tunable scale factor) in the total loss. This choice is less stable with respect to changes in hyperparameters, but simplifies the combination of ACER and ACKTR.
- IMPALA (https://arxiv.org/abs/1802.01561) A3C with replay and another (actually, simpler) flavor of off-policy correction called V-trace. This implementation is much more naive than the original distributed framework, but it gives an idea of how the off-policy correction is done and is much easier to integrate with ACKTR.
DQN family
- DQN (https://arxiv.org/abs/1602.01783) Asynchronous multi-step Q-learning.
Algorithm related options
- noisynet='ig' or noisynet='fg': Based on the idea of NoisyNet (https://arxiv.org/abs/1706.10295), which adds independent ('ig') or factorized ('fg') Gaussian noise to the network weights. This allows the exploration strategy to change across different training stages and to adapt to different parts of the state representation.
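The difference between the two noise types can be sketched in plain Python. This is a minimal illustration of the idea in the NoisyNet paper, not the code used by this package; the function names are hypothetical. Independent noise draws one Gaussian sample per weight, while factorized noise draws one sample per input unit and one per output unit and combines them through f(x) = sign(x) * sqrt(|x|).

```python
import math
import random


def scaled(x):
    """Noise-shaping function f(x) = sign(x) * sqrt(|x|) from the NoisyNet paper."""
    return math.copysign(math.sqrt(abs(x)), x)


def independent_noise(n_in, n_out, rng):
    """'ig' style: one Gaussian sample per weight (n_in * n_out draws)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]


def factorized_noise(n_in, n_out, rng):
    """'fg' style: one sample per input and per output unit (n_in + n_out draws)."""
    eps_in = [scaled(rng.gauss(0.0, 1.0)) for _ in range(n_in)]
    eps_out = [scaled(rng.gauss(0.0, 1.0)) for _ in range(n_out)]
    # The weight noise is the outer product of the two noise vectors.
    return [[eps_in[i] * eps_out[j] for j in range(n_out)] for i in range(n_in)]


def noisy_weights(mu, sigma, eps):
    """Perturbed weights w = mu + sigma * eps, computed element-wise."""
    return [[m + s * e for m, s, e in zip(mu_row, sigma_row, eps_row)]
            for mu_row, sigma_row, eps_row in zip(mu, sigma, eps)]
```

Because mu and sigma are both learned, the network can shrink sigma where exploration is no longer useful, which is how the noise level adapts during training.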
Arguments shared by trainer and evaluator classes
- env_maker: callable. Returns a gym env when called. Detailed in the Gym Environment section below. Default: None.
- state_to_input: callable. Converts the observation from a gym env to some data (usually a NumPy array) that can be fed into a tf.keras model. Detailed in the Neural network section below. Default: None (self.state_to_input = lambda x: x is set internally if None).
- load_model: str. File name (full path) of an h5py file that contains a saved tf.keras model (usually saved through tf.keras.models.Model.save). If specified, training or evaluation will start from this model. Default: None.
- load_model_custom: dict. Same as the custom_objects argument in tf.keras.models.load_model. Default: None.
- verbose: bool. Whether or not to print training/evaluation information. Default: False.
trainer classes common arguments
- feature_maker: callable. Takes in env.observation_space and returns (inp_state, feature), a 2-tuple of a tf.keras.layers.Input layer and a tf.keras layer of arbitrary type (e.g., tf.keras.layers.Dense). Detailed in the Neural network section below. Default: None.
- model_maker: callable. Takes in a gym env and returns a tf.keras model. Detailed in the Neural network section below. The trainer will ignore feature_maker if model_maker is set. Default: None.
- num_parallel: int. Number of parallel processes in training. Default: number of logical CPU cores.
- port_begin: int. Starting gRPC port number used by distributed TensorFlow. Default: 2220.
- discount: float. Discount factor (gamma) in reinforcement learning. Default: 0.99.
- train_steps: int. Maximum number of gym env steps in training. Default: 1000000.
- rollout_maxlen: int. Maximum length of a rollout. Also the number of env steps in a rollout list. Please refer to the comments in drlbox/trainer/trainer_base.py for a detailed explanation. Default: 32.
- batch_size: int. Number of rollout lists in a batch. Please refer to the comments in drlbox/trainer/trainer_base.py for details. Default: 1.
- online_learning: bool. Whether or not to perform online learning on a newly collected batch. Default: True.
- replay_type: None or str. Type of the replay memory. Choices are [None, 'uniform'], where None means no replay memory. Default: None (note: some algorithms such as ACER and IMPALA will set replay_type='uniform' by default).
- replay_ratio: int. After a newly collected online batch is put into the replay memory, a random number of offline, off-policy batch updates is performed. That number is drawn from a Poisson distribution with this argument as its parameter. Default: 4.
- replay_kwargs: dict. Keyword arguments that will be passed to the replay constructor after combining with the default replay keyword arguments dict(maxlen=1000, minlen=100). Default: {}.
- optimizer: str or a tf.train.Optimizer instance. str choices are ['adam']. Default: 'adam'.
- opt_clip_norm: float. Maximum global gradient norm for gradient clipping. Default: 40.0.
- opt_kwargs: dict. Keyword arguments that will be passed to the optimizer constructor after combining with the default keyword arguments. For 'adam', the default keyword arguments are dict(learning_rate=1e-4, epsilon=1e-4). Default: {}.
- noisynet: None or str. Whether or not to enable NoisyNet when building the neural net. Detailed in the Algorithm related options section above. str choices are ['fg', 'ig'], corresponding to factorized and independent Gaussian noise, respectively. Default: None.
- save_dir: str. Path to save intermediate tf.keras models during training. No model is saved if set to None. Default: None.
- save_interval: int. Number of (global) env steps between saving tf.keras models during training. Default: 10000.
- catch_signal: bool. Whether or not to catch sigint and sigterm during multiprocess training. Useful for cleaning up dangling processes when running in the background, but may prevent other parts of the program from responding to signals. Default: False.
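How replay_kwargs and replay_ratio behave can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: merge_kwargs and sample_poisson are hypothetical helpers, and the sketch assumes that user-supplied keyword arguments override the defaults when the two are combined.

```python
import math
import random

# Default replay keyword arguments as documented above.
DEFAULT_REPLAY_KWARGS = dict(maxlen=1000, minlen=100)


def merge_kwargs(defaults, overrides):
    """Combine defaults with user kwargs (assumption: user values win)."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) integer using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
replay_kwargs = merge_kwargs(DEFAULT_REPLAY_KWARGS, dict(maxlen=5000))
# Number of offline batch updates to run after one online batch, replay_ratio=4.
num_offline_updates = sample_poisson(4, rng)
```

With replay_ratio=4, the number of offline updates varies from batch to batch but averages 4 per online batch.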
algorithm='a3c' introduces the following additional arguments
- a3c_entropy_weight: float. Weight of the entropy term in the A3C loss. When noisynet is not None, it is recommended to set this argument to 0.0. Default: 1e-2.
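The entropy term that a3c_entropy_weight scales can be sketched in plain Python. This is a minimal illustration of entropy regularization as described in the A3C paper, with hypothetical function names; how this package combines the term with the policy and value losses is not shown here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_bonus(logits, entropy_weight=1e-2):
    """Weighted policy entropy; a larger value rewards more exploration."""
    return entropy_weight * entropy(softmax(logits))
```

A uniform policy has the highest entropy and a nearly deterministic one has entropy close to zero, so the bonus discourages premature convergence. With NoisyNet, exploration comes from the weight noise, which is consistent with the recommendation to set the weight to 0.0.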
algorithm='acer' introduces the following additional arguments
- acer_kl_weight: float. Weight of the KL divergence term with respect to the average net in the ACER loss. Default: 1e-1.
- acer_trunc_max: float. Truncating threshold in ACER's modified Retrace off-policy correction. Default: 10.0.
- acer_soft_update_ratio: float. Soft update ratio of the average net. At each online network weight update, the weights of the average net become a convex combination of the old average net weights and the new online network weights; this argument is the coefficient of the new online network weights. Default: 0.05.
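The soft update described for acer_soft_update_ratio can be written out directly. The sketch below is a minimal illustration on flat lists of weights; soft_update is a hypothetical name, not a function from this package.

```python
def soft_update(average_weights, online_weights, ratio=0.05):
    """Convex combination: new_avg = (1 - ratio) * old_avg + ratio * online."""
    return [(1.0 - ratio) * avg + ratio * online
            for avg, online in zip(average_weights, online_weights)]


# The average net moves 5% of the way toward the online net at each update.
average = soft_update([0.0, 1.0], [1.0, 1.0], ratio=0.05)
```

A small ratio makes the average net change slowly, which is what makes it a stable reference for the KL divergence term.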
algorithm='impala' introduces the following additional arguments
- impala_trunc_rho_max: float. Truncating threshold rho in IMPALA's V-trace off-policy correction. Default: 1.0.
- impala_trunc_c_max: float. Truncating threshold c in IMPALA's V-trace off-policy correction. Default: 1.0.
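The role of the two thresholds can be sketched as follows. In V-trace, the importance ratio between the target policy and the behavior policy is truncated in two ways: rho = min(rho_max, pi / mu) and c = min(c_max, pi / mu). The helper below is a hypothetical illustration of that truncation only, not the full V-trace target and not this package's code.

```python
def truncated_is_weights(target_probs, behavior_probs, rho_max=1.0, c_max=1.0):
    """Return the truncated importance weights (rhos, cs) for each step.

    target_probs[t] is pi(a_t | s_t) under the current policy and
    behavior_probs[t] is mu(a_t | s_t) under the policy that generated the data.
    """
    ratios = [pi / mu for pi, mu in zip(target_probs, behavior_probs)]
    rhos = [min(rho_max, r) for r in ratios]
    cs = [min(c_max, r) for r in ratios]
    return rhos, cs


rhos, cs = truncated_is_weights([0.5, 0.2], [0.25, 0.4])
```

In the IMPALA paper, rho_max determines which value function the update converges to, while c_max affects only the speed of convergence.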
algorithm='dqn' introduces the following additional arguments
- dqn_double: bool. Whether to perform a double DQN update or a regular DQN update. Default: True.
- dqn_dueling: bool. Whether to set up the DQN network as a dueling network (https://arxiv.org/abs/1511.06581). Default: False.
- policy_eps_start: float. Starting epsilon in the linearly decayed epsilon-greedy policy. Default: 1.0.
- policy_eps_end: float. Ending epsilon in the linearly decayed epsilon-greedy policy. Default: 0.01.
- policy_eps_decay_steps: int. Number of (per-process) env steps before the linearly decayed epsilon reaches its minimum. Default: 1000000.
- sync_target_interval: int. Number of online updates between two synchronizations of the target network. Default: 1000.
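The three policy_eps_* arguments together define a linear schedule, which can be sketched as below. This is a minimal illustration with a hypothetical function name; the defaults mirror the documented ones.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1000000):
    """Epsilon after `step` env steps, decayed linearly and then held at eps_end."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


eps_at_start = linear_epsilon(0)          # 1.0
eps_halfway = linear_epsilon(500000)      # midway between 1.0 and 0.01
eps_after_decay = linear_epsilon(2000000) # held at the minimum
```

Since the step count is per process, each parallel worker follows its own copy of this schedule.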
evaluator arguments
- render_timestep: None or float. Timestep between two env.render() calls. None means no rendering. Default: None.
- render_end: bool. If set to True, one env.render() call is made after each episode ends. Default: False.
- num_episodes: int. Number of evaluation episodes. Default: 20.
- policy_type: str. Type of evaluation policy. Choices are ['stochastic', 'greedy']. Default: 'stochastic'.
- policy_eps: float. Epsilon in the epsilon-greedy policy when policy_type='greedy'. Default: 0.0.
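For reference, epsilon-greedy action selection as controlled by policy_eps can be sketched in a few lines. This is the textbook rule with a hypothetical function name; how the evaluator implements it internally is an assumption here.

```python
import random


def epsilon_greedy(action_values, eps, rng):
    """With probability eps pick a random action, otherwise the best-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda i: action_values[i])


rng = random.Random(0)
# With eps=0.0 (the documented default) the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], eps=0.0, rng=rng)
```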
A minimal demo could be as simple as the following code snippet (in examples/cartpole_a3c.py), which uses the A3C algorithm, the CartPole-v0 environment, and a 2-layer fully-connected net with 200 and 100 hidden units:
'''
cartpole_a3c.py
'''
import gym
from tensorflow.python.keras.layers import Input, Dense, Activation
from drlbox.trainer import make_trainer

'''
Input arguments:
    observation_space: Observation space of the environment;
    num_hid_list: List of hidden unit numbers in the fully-connected net.
'''
def make_feature(observation_space, num_hid_list):
    inp_state = Input(shape=observation_space.shape)
    feature = inp_state
    for num_hid in num_hid_list:
        feature = Dense(num_hid)(feature)
        feature = Activation('relu')(feature)
    return inp_state, feature

'''
A3C, CartPole-v0
'''
if __name__ == '__main__':
    trainer = make_trainer(
        algorithm='a3c',
        env_maker=lambda: gym.make('CartPole-v0'),
        feature_maker=lambda obs_space: make_feature(obs_space, [200, 100]),
        num_parallel=1,
        train_steps=1000,
        verbose=True,
        )
    trainer.run()
The user is supposed to implement an env_maker callable that returns an OpenAI-gym environment. Things like history stacking, frame skipping, and reward engineering are usually handled here as well.
The above code snippet contains a trivial example:
env_maker=lambda: gym.make('CartPole-v0')
which is a callable that returns the 'CartPole-v0' environment.
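A less trivial env_maker typically wraps the environment before returning it. The sketch below shows reward clipping as one form of reward engineering. It is written against a hypothetical StubEnv so that it runs without gym installed; with a real gym environment, one would normally subclass gym.Wrapper instead of the hand-written ClipReward class, and would pass gym.make(...) in place of StubEnv().

```python
class StubEnv:
    """Hypothetical stand-in for a gym env, used only to keep the sketch runnable."""

    def reset(self):
        return [0.0, 0.0]

    def step(self, action):
        # Classic gym step signature: observation, reward, done, info.
        return [0.0, 1.0], 5.0, True, {}


class ClipReward:
    """Wraps an env and clips every reward to the range [-1, 1]."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # Forward everything else (observation_space, render, ...) to the env.
        return getattr(self.env, name)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, max(-1.0, min(1.0, reward)), done, info


def env_maker():
    """Callable that builds a fresh, wrapped environment on each call."""
    return ClipReward(StubEnv())
```

Because the trainer calls env_maker itself, each parallel process gets its own independent environment instance.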
The user is supposed to implement a feature_maker callable that takes in an observation_space and returns inp_state, a tf.keras.layers.Input layer, and feature, a tf.keras layer or a tuple of 2 tf.keras layers. For example, with actor-critic algorithms, when feature is a single tf.keras layer, the actor and the critic streams share a common stack of layers; when feature is a tuple of 2 tf.keras layers, the actor and the critic are completely separated.
The above code snippet cartpole_a3c.py also contains a trivial example of this part of a tf.keras model:
from tensorflow.python.keras.layers import Input, Dense, Activation

'''
Input arguments:
    observation_space: Observation space of the environment;
    num_hid_list: List of hidden unit numbers in the fully-connected net.
'''
def make_feature(observation_space, num_hid_list):
    inp_state = Input(shape=observation_space.shape)
    feature = inp_state
    for num_hid in num_hid_list:
        feature = Dense(num_hid)(feature)
        feature = Activation('relu')(feature)
    return inp_state, feature
which builds a fully-connected neural network up to the last layer before the policy/value layer. To use this feature maker, simply set feature_maker=lambda obs_space: make_feature(obs_space, [200, 100]).
Alternatively, it is possible to specify a full tf.keras model by implementing a model_maker callable. model_maker should take in the full gym env and return a tf.keras model that satisfies the output requirements of the selected training algorithm. Its model.inputs should always be a 1-tuple like (inp_state,), where inp_state is a tf.keras.layers.Input layer. Its model.outputs should also be a tuple, but its content varies with the selected algorithm. For example, with algorithm='a3c', model.outputs should be a 2-tuple of (logits, value); with algorithm='dqn', model.outputs should be a 1-tuple of (q_values,).
The following code snippet contains a trivial example of implementing a full tf.keras model for A3C or IMPALA:
from tensorflow.python.keras.initializers import RandomNormal
from tensorflow.python.keras.layers import Input, Dense, Activation
from tensorflow.python.keras.models import Model

'''
Input arguments:
    env: Gym env;
    num_hid_list: List of hidden unit numbers in the fully-connected net.
'''
def make_model(env, num_hid_list):
    inp_state = Input(shape=env.observation_space.shape)
    feature = inp_state
    for num_hid in num_hid_list:
        feature = Dense(num_hid)(feature)
        feature = Activation('relu')(feature)
    # Small initial logits keep the initial policy close to uniform
    logits_init = RandomNormal(stddev=1e-3)
    logits = Dense(env.action_space.n, kernel_initializer=logits_init)(feature)
    value = Dense(1)(feature)
    return Model(inputs=inp_state, outputs=[logits, value])
A more detailed example can be found in examples/breakout_acer.py.
The user is also supposed to implement a state_to_input callable that takes in the observation returned by the OpenAI-gym environment's reset or step function and returns something that a tf.keras model can directly take in. Usually, this function does things like NumPy stacking and reshaping. By default, state_to_input is set to None, in which case a dummy callable state_to_input = lambda x: x is created and used internally.
Note: So long as feature_maker or model_maker is implemented correctly, the trainer will run. However, to use the saving/loading functionality provided by Keras in a hassle-free manner, it is recommended that feature_maker or model_maker use only combinations of existing Keras layers, plus some viable NumPy utilities such as np.newaxis (NumPy has to be imported as import numpy as np, as this is the default importing method assumed by Keras in 'keras/layers/core.py'). Using other modules, including plain TensorFlow, is discouraged, because the Keras model loading utility will literally "remember" the code that generated the Keras model and run through that code when it tries to load a saved model. If other modules are truly needed, import the needed functionality inside feature_maker or model_maker so that it is imported before execution. However, please do not import the entire TensorFlow package there (from tensorflow import x is fine, but import tensorflow as tf is not), as it will cause circular importing.