Reinforcement Learning Task: Direct Policy Search in OpenAI Gym

Direct policy search often yields high-quality policies in complex reinforcement learning problems. It employs an optimization algorithm to search the parameters of the policy so as to maximize its total reward.
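
To make the idea concrete, the sketch below shows direct policy search in its simplest form: sample candidate parameter vectors, evaluate each one by the total reward of an episode, and keep the best. This is plain random search with hypothetical names, meant only to illustrate the idea; it is not the optimization algorithm ZOOpt uses.

    import numpy as np

    def random_policy_search(evaluate_policy, dim_size, budget, bound=10.0):
        # evaluate_policy maps a parameter vector to the total reward of one episode
        best_w, best_reward = None, float('-inf')
        for _ in range(budget):
            w = np.random.uniform(-bound, bound, dim_size)  # candidate policy parameters
            reward = evaluate_policy(w)  # one rollout in the environment
            if reward > best_reward:
                best_w, best_reward = w, reward
        return best_w, best_reward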

This page shows how to use ZOOpt to perform direct policy search in OpenAI Gym.

To simplify this procedure, we provide the necessary APIs in the example/direct_policy_search_for_gym/gym_task.py file.
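
For orientation, the GymTask calls used on this page are collected below. The calls themselves are taken from the snippets that follow; the comments are our reading of them, so consult gym_task.py for the authoritative behavior.

    gym_task = GymTask('MountainCar-v0')  # wrap a gym environment chosen by name
    gym_task.new_nnmodel([2, 5, 1])       # build the neural-network policy from layer sizes
    gym_task.set_max_step(1000)           # cap the number of steps per episode
    dim_size = gym_task.get_w_size()      # total number of network weights to optimize
    # gym_task.sum_reward is the objective function: it evaluates a weight vector
    # by running the policy in the environment and returning an episode score
    # (ZOOpt minimizes it, so presumably the negated total reward)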

We then provide a function for running tests in the example/direct_policy_search_for_gym/run.py file.

This function does the following:

  • Construct a GymTask environment.

    gym_task = GymTask(task_name)  # choose a task by name
    gym_task.new_nnmodel(layers)  # construct a neural network
    gym_task.set_max_step(max_step)  # set max step in gym

  • Define corresponding objective and parameter.

    # set dimension
    dim_size = gym_task.get_w_size()
    dim_regs = [[-10, 10]] * dim_size  # search range of each weight
    dim_tys = [True] * dim_size  # True means the dimension is continuous
    dim = Dimension(dim_size, dim_regs, dim_tys)
    # form up the objective function
    objective = Objective(gym_task.sum_reward, dim)
    # terminal_value: the procedure stops early once this value is reached; it is not necessary for this example
    parameter = Parameter(budget=budget, terminal_value=terminal_value)
  • Optimize. (A note on reading the returned solutions follows this list.)

    solution_list = ExpOpt.min(objective, parameter, repeat=repeat, plot=True)
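
ExpOpt.min runs the optimization repeat times and returns one solution per repetition. Assuming ZOOpt's standard Solution accessors get_x() and get_value(), the results can be read off like this:

    for solution in solution_list:
        print(solution.get_value())  # best objective value found in this repetition
        best_w = solution.get_x()    # the corresponding policy parameters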

The whole process is listed below.

from gym_task import GymTask
from zoopt import Dimension, Objective, Parameter, ExpOpt
import matplotlib.pyplot as plt
import numpy as np

def run_test(task_name, layers, in_budget, max_step, repeat, terminal_value=None):
    """
    API for running direct policy search on a gym task.

    :param task_name: gym task name
    :param layers:
        layer information of the neural network,
        e.g., [2, 5, 1] means the input layer has 2 neurons, the (single) hidden layer has 5, and the output layer has 1
    :param in_budget: number of calls to the objective function
    :param max_step: max step in gym
    :param repeat: number of repetitions of the experiment
    :param terminal_value: early stopping; the algorithm stops once this value is reached
    :return: no return
    """
    gym_task = GymTask(task_name)  # choose a task by name
    gym_task.new_nnmodel(layers)  # construct a neural network
    gym_task.set_max_step(max_step)  # set max step in gym

    budget = in_budget  # number of calls to the objective function
    rand_probability = 0.95  # the probability of sampling from the learned model

    # set dimension
    dim_size = gym_task.get_w_size()
    dim_regs = [[-10, 10]] * dim_size
    dim_tys = [True] * dim_size
    dim = Dimension(dim_size, dim_regs, dim_tys)
    # form up the objective function
    objective = Objective(gym_task.sum_reward, dim)
    
    parameter = Parameter(budget=budget, terminal_value=terminal_value)
    parameter.set_probability(rand_probability)

    solution_list = ExpOpt.min(objective, parameter, repeat=repeat, plot=True)

With the help of this function, users can run a test in a few lines.

if __name__ == '__main__':
    mountain_car_layers = [2, 5, 1]
    run_test('MountainCar-v0', mountain_car_layers, 2000, 1000, 1)
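
Switching to another task only requires changing the task name and the layer sizes to match that task's observation and action dimensions. For instance (the layer sizes below are an assumption; check the task's spaces before running):

    # hypothetical configuration: Acrobot-v1 has a 6-dimensional observation space
    acrobot_layers = [6, 5, 1]
    run_test('Acrobot-v1', acrobot_layers, 2000, 2000, 1)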

After a few seconds, the optimization finishes. The visualized optimization progress looks like:

[Figure: experiment results]

More concrete examples are available in the example/direct_policy_search_for_gym directory.