
Reinforcement-Learning-Games

Implementation of various policy evaluation and learning methods to teach an agent to play two simple games.

Here is a quick overview of the files in this repository:

  • RL_games.ipynb and RL_games.html provide answers to all of the questions listed below, with full explanations and accompanying code.
  • two_player_game1.py is the implementation of the environment used for P1 and P2 below.
  • two_player_game2.py is the implementation of the environment used for P3 below.

These problems were completed for an assignment in LSE ST449, a graduate-level Artificial Intelligence and Deep Learning course in LSE's Department of Statistics. The full details of the assignment are presented below:

P1

Consider a two-player game that proceeds over rounds, where in each round one player wins and the other loses. The winner of the game is the player who first wins $n$ rounds more than the opponent, where $n$ is a parameter.

Suppose we model this game from the perspective of one of the players. In each round, this player has two available actions: to invest either high or low effort.

If the player invests high effort in a round, she wins the round with probability $p$; otherwise, she wins it with probability $q$. Here $p$ and $q$ are parameters such that $0 \le q < p \le 1$.

By investing high effort in a round, the player incurs a cost of value $c_h$; otherwise, she incurs a cost of value $c_l$. Here $c_h$ and $c_l$ are parameters such that $c_h > c_l \ge 0$.

The player receives a prize of value $R$ upon winning the game.

Assume fixed numerical values for the parameters $n$, $p$, $q$, $c_h$, $c_l$, and $R$.
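
To make the setup concrete, here is a minimal sketch of what such an environment could look like. It is an illustration only: the class name, parameter names, and the default values ($n=3$, $p=0.8$, and so on) are hypothetical placeholders, not the actual contents of two_player_game1.py.

```python
import random

class TwoPlayerGameSketch:
    """Hypothetical sketch of the P1 environment (not the real two_player_game1.py).

    The state is the player's lead (rounds won minus rounds lost); the episode
    ends when the lead reaches +n (the player wins the prize) or -n (she loses).
    """

    def __init__(self, n=3, p=0.8, q=0.5, c_high=1.0, c_low=0.1, prize=10.0):
        self.n = n
        self.win_prob = {'high': p, 'low': q}   # p > q by assumption
        self.cost = {'high': c_high, 'low': c_low}
        self.prize = prize

    def reset(self):
        self.lead = 0
        return self.lead

    def step(self, action):
        reward = -self.cost[action]             # effort cost is paid every round
        if random.random() < self.win_prob[action]:
            self.lead += 1
        else:
            self.lead -= 1
        done = abs(self.lead) >= self.n
        if done and self.lead > 0:
            reward += self.prize                # prize only for winning the game
        return self.lead, reward, done
```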

Questions:

  1. For each deterministic policy, compute the value function numerically by assuming perfect knowledge of the environment, and rank these policies with respect to the value at the initial state. What is the optimal deterministic policy? Discuss the results. (A sketch of this evaluation appears below.)
  2. Estimate the action-value function by using the off-policy Monte Carlo estimation method for a threshold target policy that invests high effort whenever the player's current lead is below a given threshold value $\theta$ and low effort otherwise, where the behaviour policy, in each state, selects either action equiprobably. Show the action values. Can you conclude from the obtained results whether the target policy is optimal? Discuss your answer.
  3. Compute the optimal policy by using the off-policy Monte Carlo algorithm for the behaviour policy that in each state selects either action equiprobably. Show the action values and the policy. (An algorithm sketch also appears below.)

In P1-1, terminate the iterative policy evaluation once the maximum change in the value function between successive sweeps falls below a small tolerance $\epsilon$.
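
For P1-1, iterative policy evaluation over the small state space is enough. Here is a sketch reusing the hypothetical environment parameters above; all numeric values remain placeholders:

```python
def evaluate_policy(policy, n=3, p=0.8, q=0.5, c_high=1.0, c_low=0.1,
                    prize=10.0, gamma=1.0, tol=1e-9):
    """Iterative policy evaluation with perfect knowledge of the dynamics.

    `policy` maps each non-terminal lead in {-(n-1), ..., n-1} to 'high'/'low'.
    Returns the value function V; V[0] is the value at the initial state.
    """
    states = list(range(-(n - 1), n))
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            win_p = p if a == 'high' else q
            cost = c_high if a == 'high' else c_low
            # Winning the final round pays the prize; losing the game pays nothing.
            up = prize if s + 1 == n else gamma * V[s + 1]
            down = 0.0 if s - 1 == -n else gamma * V[s - 1]
            new_v = -cost + win_p * up + (1 - win_p) * down
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                         # termination condition
            return V
```

Ranking all $2^{2n-1}$ deterministic policies then amounts to calling this function for every mapping from states to actions and sorting by V[0].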

In P1-2 and P1-3, use 500,000 episodes.
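
For P1-2 and P1-3, a standard choice is off-policy Monte Carlo with weighted importance sampling and incremental updates (Sutton and Barto, Chapter 5). The sketch below shows the control variant (P1-3) against the hypothetical environment above; it illustrates the general algorithm rather than the notebook's exact code.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, n_episodes=500_000, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (sketch)."""
    actions = ['high', 'low']
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    C = defaultdict(lambda: {a: 0.0 for a in actions})   # cumulative IS weights
    greedy = {}                                          # target policy

    for _ in range(n_episodes):
        # Behaviour policy: pick either action equiprobably, b(a|s) = 0.5.
        episode, state, done = [], env.reset(), False
        while not done:
            action = random.choice(actions)
            state_next, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = state_next

        # Backward pass with incremental weighted importance sampling.
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            greedy[state] = max(Q[state], key=Q[state].get)
            if action != greedy[state]:
                break                # importance weight would drop to zero
            W /= 0.5                 # divide by b(a|s)
    return Q, greedy
```

For P1-2 the same backward pass applies, except that the fixed threshold policy takes the place of the greedy improvement step.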

P2

Consider the same reinforcement learning problem as in P1 and answer the following questions:

  1. Solve the problem by using the Q-learning algorithm. Use an $\epsilon$-greedy policy with a given value of $\epsilon$. (The one-step updates for this and the next question are sketched below.)
  2. Solve the problem by using the SARSA algorithm. Use an $\epsilon$-greedy policy with a given value of $\epsilon$.
  3. Assume that the agent follows a random policy and evaluate it by using the on-line TD($\lambda$) algorithm. Use given values of the parameter $\lambda$ and the step size $\alpha$.

In all questions, use 500,000 episodes. For each question, report the estimated values and the policy.
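
The two control algorithms in P2-1 and P2-2 differ only in their bootstrap target, which the sketch below makes explicit; `alpha`, `gamma`, and `epsilon` are placeholder values, and the environment interface is the hypothetical one from P1.

```python
import random
from collections import defaultdict

def td_control(env, method='q_learning', n_episodes=500_000,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning / SARSA with epsilon-greedy exploration (sketch)."""
    actions = ['high', 'low']
    Q = defaultdict(lambda: {a: 0.0 for a in actions})

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(Q[state], key=Q[state].get)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            state_next, reward, done = env.step(action)
            action_next = eps_greedy(state_next)
            if done:
                target = reward                  # no bootstrap at the end
            elif method == 'q_learning':
                target = reward + gamma * max(Q[state_next].values())
            else:                                # SARSA bootstraps on the
                target = reward + gamma * Q[state_next][action_next]  # action taken
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = state_next, action_next
    return Q
```

For P2-3, on-line TD($\lambda$) evaluates the random policy's state values with eligibility traces; the trace mechanics are the same as in the SARSA($\lambda$) sketch under P3 below.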

P3

Consider a variant of the reinforcement learning problem that is identical to P1, except that the player does not incur a cost for investing effort but can invest high effort only if she has sufficient energy.

The player starts with a given energy level $e_0$, which is decremented by $d$ whenever the player invests high effort in a round. If the energy level would become negative after subtracting $d$, it is set equal to $0$. In each round, an amount of energy of value $g$ is added to the player's level, independently with probability $q_e$. The maximum energy level is $e_{\max}$.

The player can invest high effort in a round only if her energy level is at least the decrement $d$.

The player receives a prize of value $R$ upon winning the game, and this is the only reward received. (The energy bookkeeping is sketched below.)
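
The energy bookkeeping can be condensed into a few lines; the names `d`, `g`, `q_e`, and `e_max` mirror the placeholder symbols above:

```python
import random

def update_energy(e, invested_high, d, g, q_e, e_max):
    """One round of the P3 energy dynamics (placeholder symbol names)."""
    if invested_high:
        e = max(e - d, 0)            # decrement, floored at zero
    if random.random() < q_e:
        e = min(e + g, e_max)        # random replenishment, capped at e_max
    return e
```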

Use the following setting of parameters:

  • step size $\alpha$
  • discount parameter $\gamma$
  1. Solve this problem by using the SARSA($\lambda$) algorithm, for a given value of the parameter $\lambda$. Assume that the policy followed by the agent is an $\epsilon$-greedy policy with a given $\epsilon$. Use 500,000 episodes. Show the estimated action values for the different states and the resulting policy, and discuss the results. (A sketch of SARSA($\lambda$) with eligibility traces follows.)
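
Below is a sketch of tabular SARSA($\lambda$) with accumulating eligibility traces; `lam`, `alpha`, `gamma`, and `epsilon` are placeholder values, and in P3 the state would pair the round lead with the current energy level (with the `'high'` action masked out whenever the energy is below $d$).

```python
import random
from collections import defaultdict

def sarsa_lambda(env, n_episodes=500_000, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Tabular SARSA(lambda) with accumulating traces (sketch)."""
    actions = ['high', 'low']
    Q = defaultdict(lambda: {a: 0.0 for a in actions})

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(Q[state], key=Q[state].get)

    for _ in range(n_episodes):
        E = defaultdict(lambda: {a: 0.0 for a in actions})  # eligibility traces
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            state_next, reward, done = env.step(action)
            action_next = eps_greedy(state_next)
            delta = reward - Q[state][action]
            if not done:
                delta += gamma * Q[state_next][action_next]
            E[state][action] += 1.0                 # accumulate on the visited pair
            for s in list(E):                       # update and decay every trace
                for a in actions:
                    Q[s][a] += alpha * delta * E[s][a]
                    E[s][a] *= gamma * lam
            state, action = state_next, action_next
    return Q
```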

P4

Consider the following two-player game of chance. Two players, whom we refer to as players $A$ and $B$, have initial endowments of $a$ and $b$ tokens, respectively. Assume that $a$ and $b$ are given parameters. The game proceeds over rounds in which a die is rolled. If the outcome of the die falls in one given set of faces, player $A$ loses one token; otherwise, player $B$ loses one token. The game ends as soon as one of the players runs out of tokens. The winner of the game is the player who has at least one token left at the end.

Answer the following questions:

  1. What is the winning probability of player $A$ for a fixed value of $b$ and each value of $a$?

  2. What is the winning probability of player $A$ for a fixed value of $a$ and each value of $b$?

  3. What is the winning probability of player $A$ for each value of $a$ and $b$? (A small dynamic-programming sketch follows.)
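
This is a gambler's-ruin-style race. Writing `r` for the per-round probability that player $A$ loses a token (e.g. `r = 2/6` if $A$ loses on two of the six faces; this split is an assumption, since the exact faces are not specified above), the winning probability satisfies a simple recursion over the remaining endowments:

```python
from functools import lru_cache

def win_probability_A(a, b, r):
    """Probability that player A wins, given endowments a and b and the
    per-round probability r that A loses a token (placeholder value below)."""
    @lru_cache(maxsize=None)
    def P(x, y):            # x, y = tokens currently held by A and B
        if x == 0:
            return 0.0      # A is ruined, B wins
        if y == 0:
            return 1.0      # B is ruined, A wins
        return r * P(x - 1, y) + (1 - r) * P(x, y - 1)
    return P(a, b)

# Illustrative tabulation only; r = 2/6 is an assumed die split.
for a in range(1, 6):
    print(a, round(win_probability_A(a, 3, r=2/6), 4))
```

Equivalently, since exactly one player must be ruined within the first $a+b-1$ rounds, $P(A \text{ wins}) = P(\mathrm{Bin}(a+b-1,\, 1-r) \ge b)$.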
