Describe the bug
When calling Step in a multi-agent environment with more than 2 players, where all players' actions are processed simultaneously, the actions passed to the C++ API are offset from the actions sent through the Python API by 1 rather than by max_num_players.
I am implementing the Multi-agent Particle Environment (MPE) and have found that if I have num_envs environments and send a [batch_size, ...] vector of actions to step in the Python API, the Step function in the C++ API receives actions offset by the number of previous calls to Step in that iteration.
For example, if I have 2 environments with 3 agents each and I send a flat vector of 6 actions [0, 1, 2, 3, 4, 5]:
the first time Step is called (for either environment 0 or 1), it receives the actions in the order [0, 1, 2, 3, 4, 5]
the next environment receives the actions in the order [1, 2, 3, 4, 5, 0]
The actions are not offset by the correct amount: the second environment's actions start at index 1 instead of at max_num_players. The only way to correct this would be to know how many times Step has already been called, undo the index advancement done by ParseActions, and offset the actions by the correct amount. However, that would only be a workaround, and it cannot be implemented anyway because env_index_ is a private member, so derived classes cannot access it.
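Restated numerically, here is a minimal NumPy sketch of the layout described above (it contains no EnvPool calls; the index arithmetic is just my reading of the observed behaviour):

import numpy as np

num_envs, num_players = 2, 3                   # max_num_players = 3
actions = np.arange(num_envs * num_players)    # flat vector [0, 1, 2, 3, 4, 5] sent from Python

# Expected: the k-th Step call in an iteration sees the vector advanced by k * max_num_players,
# i.e. environment 0 consumes [0, 1, 2] and environment 1 consumes [3, 4, 5].
expected = [actions[k * num_players:(k + 1) * num_players] for k in range(num_envs)]

# Observed: each successive Step call only advances by 1, so the second call
# starts at index 1 and wraps around: [1, 2, 3, 4, 5, 0].
observed_second_call = np.roll(actions, -1)

print(expected)               # [array([0, 1, 2]), array([3, 4, 5])]
print(observed_second_call)   # [1 2 3 4 5 0]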
To Reproduce
My working example consists of the following files:
core.h where the base classes are defined
default_params.h where the default values for all environments are defined
simple_env.h where the spec, the Simple MPE base environment, and the corresponding EnvPool async environment are defined
simple_spread.h where the Simple Spread MPE environment and the corresponding EnvPool async environment are defined
mpe.cc the Python bindings file
BUILD my Bazel build file
__init__.py the module file, following the EnvPool docs
registration.py the registration file, following the EnvPool docs (a generic sketch of this pattern is shown right after this list)
This project uses Eigen as a C++ dependency. I have omitted the changes to setup.cfg and to the EnvPool core files needed to make the custom environment accessible.
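For reference, the registration file for a third-party environment generally follows the pattern shown in the EnvPool docs. The snippet below is only an illustrative sketch with hypothetical class names (and the exact keyword set depends on the EnvPool version); it is not my actual registration.py:

# registration.py -- illustrative sketch only, hypothetical names
from envpool.registration import register

register(
    task_id="SimpleSpreadDiscrete-v0",
    import_path="envpool.mpe",
    spec_cls="SimpleSpreadEnvSpec",
    dm_cls="SimpleSpreadDMEnvPool",
    gym_cls="SimpleSpreadGymEnvPool",
    gymnasium_cls="SimpleSpreadGymnasiumEnvPool",
)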
Reproduction script
The script below is modified from the test script for LunarLander:
import os
import uuid
from dataclasses import dataclass, field
from pathlib import Path

import envpool
import jax
import numpy as np
import tyro
from rich.pretty import pprint

# Fix weird OOM https://github.com/google/jax/discussions/6332#discussioncomment-1279991
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.6"
os.environ["XLA_FLAGS"] = "--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"
# Fix CUDNN non-determinism; https://github.com/google/jax/issues/4823#issuecomment-952835771
os.environ["TF_XLA_FLAGS"] = "--xla_gpu_autotune_level=2 --xla_gpu_deterministic_reductions"
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"


@dataclass
class Args:
    exp_name: str = Path(__file__).stem
    "the name of this experiment"
    seed: int = 1
    "seed of the experiment"
    track: bool = False
    # "if toggled, this experiment will be tracked with Weights and Biases"
    # wandb_project_name: str = "cleanRL"
    # "the wandb's project name"
    # wandb_entity: str = None
    # "the entity (team) of wandb's project"
    # capture_video: bool = False
    # "whether to capture videos of the agent performances (check out `videos` folder)"
    save_model: bool = False
    "whether to save model into the `runs/{run_name}` folder"
    upload_model: bool = False
    "whether to upload the saved model to huggingface"
    hf_entity: str = ""
    "the user or org name of the model repository from the Hugging Face Hub"
    log_frequency: int = 10
    "the logging frequency of the model performance (in terms of `updates`)"

    # Algorithm specific arguments
    # env_id: str = "Breakout-v5"
    env_id: str = "SimpleSpreadDiscrete-v0"
    "the id of the environment"
    total_timesteps: int = 50000000
    "total timesteps of the experiments"
    learning_rate: float = 2.5e-4
    "the learning rate of the optimizer"
    local_num_envs: int = 4
    "the number of parallel game environments"
    num_actor_threads: int = 2
    "the number of actor threads to use"
    num_steps: int = 128
    "the number of steps to run in each environment per policy rollout"
    anneal_lr: bool = True
    "Toggle learning rate annealing for policy and value networks"
    gamma: float = 0.99
    "the discount factor gamma"
    gae_lambda: float = 0.95
    "the lambda for the general advantage estimation"
    num_minibatches: int = 4
    "the number of mini-batches"
    gradient_accumulation_steps: int = 1
    "the number of gradient accumulation steps before performing an optimization step"
    update_epochs: int = 4
    "the K epochs to update the policy"
    norm_adv: bool = True
    "Toggles advantages normalization"
    clip_coef: float = 0.1
    "the surrogate clipping coefficient"
    ent_coef: float = 0.01
    "coefficient of the entropy"
    vf_coef: float = 0.5
    "coefficient of the value function"
    max_grad_norm: float = 0.5
    "the maximum norm for the gradient clipping"
    channels: list[int] = field(default_factory=lambda: [16, 32, 32])
    "the channels of the CNN"
    hiddens: list[int] = field(default_factory=lambda: [256])
    "the hiddens size of the MLP"
    actor_device_ids: list[int] = field(default_factory=lambda: [0])
    "the device ids that actor workers will use"
    learner_device_ids: list[int] = field(default_factory=lambda: [0])
    "the device ids that learner workers will use"
    distributed: bool = False
    "whether to use `jax.distributed`"
    concurrency: bool = False
    "whether to run the actor and learner concurrently"

    # runtime arguments to be filled in
    local_batch_size: int = 0
    local_minibatch_size: int = 0
    num_updates: int = 0
    world_size: int = 0
    local_rank: int = 0
    num_envs: int = 0
    batch_size: int = 0
    minibatch_size: int = 0
    global_learner_decices: list[str] | None = None
    actor_devices: list[str] | None = None
    learner_devices: list[str] | None = None


if __name__ == "__main__":
    args = tyro.cli(Args)
    args.local_batch_size = int(args.local_num_envs * args.num_steps * args.num_actor_threads * len(args.actor_device_ids))
    args.local_minibatch_size = int(args.local_batch_size // args.num_minibatches)
    assert args.local_num_envs % len(args.learner_device_ids) == 0, "local_num_envs must be divisible by len(learner_device_ids)"
    assert (
        int(args.local_num_envs / len(args.learner_device_ids)) * args.num_actor_threads % args.num_minibatches == 0
    ), "int(local_num_envs / len(learner_device_ids)) must be divisible by num_minibatches"
    if args.distributed:
        jax.distributed.initialize(
            local_device_ids=range(len(args.learner_device_ids) + len(args.actor_device_ids)),
        )
        print(list(range(len(args.learner_device_ids) + len(args.actor_device_ids))))

    args.world_size = jax.process_count()
    args.local_rank = jax.process_index()
    args.num_envs = args.local_num_envs * args.world_size * args.num_actor_threads * len(args.actor_device_ids)
    args.batch_size = args.local_batch_size * args.world_size
    args.minibatch_size = args.local_minibatch_size * args.world_size
    args.num_updates = args.total_timesteps // (args.local_batch_size * args.world_size)

    local_devices = jax.local_devices()
    global_devices = jax.devices()
    learner_devices = [local_devices[d_id] for d_id in args.learner_device_ids]
    actor_devices = [local_devices[d_id] for d_id in args.actor_device_ids]
    global_learner_decices = [
        global_devices[d_id + process_index * len(local_devices)]
        for process_index in range(args.world_size)
        for d_id in args.learner_device_ids
    ]
    print("global_learner_decices", global_learner_decices)
    args.global_learner_decices = [str(item) for item in global_learner_decices]
    args.actor_devices = [str(item) for item in actor_devices]
    args.learner_devices = [str(item) for item in learner_devices]
    pprint(args)

    run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{uuid.uuid4()}"

    num_envs = 4
    num_players = 3
    envs = envpool.make(
        args.env_id,
        env_type="gymnasium",
        num_envs=num_envs,
        max_num_players=num_players,
        num_agents=num_players,
        num_landmarks=3,
        seed=args.seed,
    )
    act_space = envs.action_space
    obs0, info = envs.reset()
    for _ in range(5000):
        if (_ + 1) % 250 == 0:
            print(f"iter {_}")
        # action = np.array([act_space.sample() for _ in range(args.local_num_envs)])
        action = np.array([act_space.sample() for _ in range(num_envs * num_players)])
        if (_ + 1) % 250 == 0:
            print(f"sending action {action} to environment")
        # obs0, rew0, terminated, truncated, info0 = envs.step(action[:, None], env_id=np.arange(1))
        obs0, rew0, terminated, truncated, info0 = envs.step(action.reshape(-1), env_id=np.arange(num_envs))
        if (_ + 1) % 250 == 0:
            print(f"reward {rew0.reshape(num_envs, -1).sum(-1)} from environment")
            print()
    envs.close()
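One way to make the offset easier to trace than with random samples is to replace the act_space.sample() line above with a structured action vector. This is only a sketch and assumes the action space is a gymnasium Discrete space exposing .n:

# Structured actions: agent a of environment e is meant to receive
# (e * num_players + a) % act_space.n, so any offset applied on the C++
# side breaks this pattern in an obvious way.
action = np.arange(num_envs * num_players) % act_space.n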
Below are the files for my C++ MPE environment:
core.h
scenario.h
default_params.h
simple_env.h
simple_spread.h
mpe.cc
BUILD
registration.py
__init__.py
Expected behavior
Each call to Step should receive the slice of the flat action vector that belongs to its own environment, i.e. the actions offset by max_num_players per environment. In the example above, environment 0 should see actions [0, 1, 2] and environment 1 should see actions [3, 4, 5], rather than the second environment seeing the vector shifted by only 1.