This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

How to use AR? #48

Open
GuoPingPan opened this issue Mar 21, 2023 · 3 comments

Comments

@GuoPingPan

If I want to use the Anticipation Reward (AR), should I set reward_type = map_accuracy?
But I found that it is not set in
https://github.com/facebookresearch/OccupancyAnticipation/blob/aea6a2c0d9/configs/exploration/gibson_train_w_ar.yaml

@srama2512
Contributor

@GuoPingPan - It is set in the baseline config, not the task config.

# Uncomment this for anticipation reward
# reward_type: "map_accuracy"
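
In other words, enabling AR just means uncommenting that line, so the baseline config contains (sketch; surrounding keys omitted):

reward_type: "map_accuracy"  # enables the anticipation reward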

@GuoPingPan
Author

Thank you so much! But I have another question: multi-GPU training fails for me. Have you ever met this error?

Process ForkServerProcess-74:
Traceback (most recent call last):
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/supervised/map_update.py", line 325, in map_update_worker
    losses = map_update_fn(ps_args)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/supervised/map_update.py", line 219, in map_update_fn
    mapper_outputs = mapper(mapper_inputs, method_name="predict_deltas")
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 1.
Original Traceback (most recent call last):
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy.py", line 633, in forward
    outputs = self.predict_deltas(*args, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy.py", line 316, in predict_deltas
    pu_outputs = self.projection_unit(pu_inputs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy_utils.py", line 49, in forward
    x_full = self.main(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 453, in forward
    return self.main(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 51, in forward
    gp_outputs = self._do_gp_anticipation(x)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 278, in _do_gp_anticipation
    x_enc = self.gp_depth_proj_encoder(
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 107, in forward
    x1 = self.inc(x)  # (bs, nsf, ..., ...)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 41, in forward
    x = self.conv(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 31, in forward
    x = self.conv(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Right now I can only use one GPU, and I have to make sure that mapper_copy and MapLargeRolloutStorageMP are on the same device. My changes are below:

if mapper_cfg.use_data_parallel and len(mapper_cfg.gpu_ids) > 0:
    self.mapper_copy.to(self.mapper.config.gpu_ids[0]) # cuda:1
    self.mapper_copy = nn.DataParallel(
        self.mapper_copy,
        device_ids=self.mapper.config.gpu_ids, # device = 1, 2, 3, 4, 5, 6, 7
        output_device=self.mapper.config.gpu_ids[0],
        # device_ids=[self.mapper.config.gpu_ids[0]]
    )

...

if ans_cfg.MAPPER.use_data_parallel and len(ans_cfg.MAPPER.gpu_ids) > 0:
    mapper_device = ans_cfg.MAPPER.gpu_ids[0]
    # mapper_device = torch.device("cuda:1")
mapper_rollouts = MapLargeRolloutStorageMP(
    ans_cfg.MAPPER.replay_size,
    mapper_observation_space,
    mapper_device,
    mapper_manager,
)
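
For reference, the single-GPU fallback that does run for me looks roughly like this (a sketch of my local edits in the same two places as above; single_device is just a local name I introduce here):

# First place (mapper setup): restrict DataParallel to a single device.
single_device = mapper_cfg.gpu_ids[0]  # e.g. 1 -> cuda:1
self.mapper_copy.to(single_device)
self.mapper_copy = nn.DataParallel(
    self.mapper_copy,
    device_ids=[single_device],
    output_device=single_device,
)

# Second place (rollout storage): use the same device as mapper_copy.
mapper_device = ans_cfg.MAPPER.gpu_ids[0]
mapper_rollouts = MapLargeRolloutStorageMP(
    ans_cfg.MAPPER.replay_size,
    mapper_observation_space,
    mapper_device,
    mapper_manager,
)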

But I found that MapLargeRolloutStorageMP easily causes an OOM error.
The most important question is how to use multiple GPUs, because training is very slow otherwise.

@srama2512
Contributor

@GuoPingPan - the mapper training was intended to work only on a single GPU. The other GPUs are used primarily for data collection via habitat-sim/lab. So, GPU-0 uses most memory for mapper training and the other GPUs use most memory for habitat simulator instances. If you plan to use multi-GPU training for the mapper, you may have to appropriately modify how resources are allocated across GPUs.
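
For concreteness, the intended split looks roughly like this (an illustrative sketch only; the names below are not the repo's actual config keys):

import torch

# Illustrative GPU allocation:
#   GPU 0       -> mapper training (most of its memory goes to map updates)
#   GPUs 1..N-1 -> habitat-sim instances for parallel data collection
num_gpus = max(torch.cuda.device_count(), 2)  # assumes at least 2 GPUs
mapper_gpu_id = 0                             # keep mapper training on a single GPU
sim_gpu_ids = list(range(1, num_gpus))        # remaining GPUs host simulator instances

# Hypothetical round-robin assignment of parallel environments to simulator GPUs.
num_envs = 8
env_to_gpu = {env_id: sim_gpu_ids[env_id % len(sim_gpu_ids)] for env_id in range(num_envs)}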
