This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

How to use AR? #48

Open
GuoPingPan opened this issue Mar 21, 2023 · 3 comments

Comments

@GuoPingPan

If I want to use the Anticipation Reward (AR), should I set reward_type = map_accuracy?
But I found that it is not set in
https://github.com/facebookresearch/OccupancyAnticipation/blob/aea6a2c0d9/configs/exploration/gibson_train_w_ar.yaml

@srama2512
Contributor

@GuoPingPan - It is set in the baseline config, not the task config.

# Uncomment this for anticipation reward
# reward_type: "map_accuracy"
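
In other words, enabling AR just means uncommenting that line, so the baseline config contains (sketch; surrounding keys omitted):

reward_type: "map_accuracy"  # enables the anticipation reward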

@GuoPingPan
Author

Thank you so much! But I have another question: multi-GPU training fails for me. Have you ever met this error?

Process ForkServerProcess-74:
Traceback (most recent call last):
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/supervised/map_update.py", line 325, in map_update_worker
    losses = map_update_fn(ps_args)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/supervised/map_update.py", line 219, in map_update_fn
    mapper_outputs = mapper(mapper_inputs, method_name="predict_deltas")
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 1.
Original Traceback (most recent call last):
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy.py", line 633, in forward
    outputs = self.predict_deltas(*args, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy.py", line 316, in predict_deltas
    pu_outputs = self.projection_unit(pu_inputs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/rl/policy_utils.py", line 49, in forward
    x_full = self.main(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 453, in forward
    return self.main(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 51, in forward
    gp_outputs = self._do_gp_anticipation(x)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/occant.py", line 278, in _do_gp_anticipation
    x_enc = self.gp_depth_proj_encoder(
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 107, in forward
    x1 = self.inc(x)  # (bs, nsf, ..., ...)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 41, in forward
    x = self.conv(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/workspace/OccAnt/occant_baselines/models/unet.py", line 31, in forward
    x = self.conv(x)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/yzc1/miniconda3/envs/occ/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Right now I can only use one GPU, and I have to make sure that mapper_copy and MapLargeRolloutStorageMP are on the same device. My changes are below:

if mapper_cfg.use_data_parallel and len(mapper_cfg.gpu_ids) > 0:
    self.mapper_copy.to(self.mapper.config.gpu_ids[0]) # cuda:1
    self.mapper_copy = nn.DataParallel(
        self.mapper_copy,
        device_ids=self.mapper.config.gpu_ids, # device = 1, 2, 3, 4, 5, 6, 7
        output_device=self.mapper.config.gpu_ids[0],
        # device_ids=[self.mapper.config.gpu_ids[0]]
    )

...

if ans_cfg.MAPPER.use_data_parallel and len(ans_cfg.MAPPER.gpu_ids) > 0:
    mapper_device = ans_cfg.MAPPER.gpu_ids[0]
    # mapper_device = torch.device("cuda:1")
mapper_rollouts = MapLargeRolloutStorageMP(
    ans_cfg.MAPPER.replay_size,
    mapper_observation_space,
    mapper_device,
    mapper_manager,
)
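
For reference, the single-GPU fallback that does run for me looks roughly like this (a sketch of my local edits in the same two places as above; single_device is just a local name I introduce here):

# First place (mapper setup): restrict DataParallel to a single device.
single_device = mapper_cfg.gpu_ids[0]  # e.g. 1 -> cuda:1
self.mapper_copy.to(single_device)
self.mapper_copy = nn.DataParallel(
    self.mapper_copy,
    device_ids=[single_device],
    output_device=single_device,
)

# Second place (rollout storage): use the same device as mapper_copy.
mapper_device = ans_cfg.MAPPER.gpu_ids[0]
mapper_rollouts = MapLargeRolloutStorageMP(
    ans_cfg.MAPPER.replay_size,
    mapper_observation_space,
    mapper_device,
    mapper_manager,
)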

But I found that MapLargeRolloutStorageMP easily causes an OOM error.
The most important question is how to use multiple GPUs, because training is very slow otherwise.

@srama2512
Contributor

@GuoPingPan - the mapper training was intended to work only on a single GPU. The other GPUs are used primarily for data collection via habitat-sim/lab. So, GPU-0 uses most memory for mapper training and the other GPUs use most memory for habitat simulator instances. If you plan to use multi-GPU training for the mapper, you may have to appropriately modify how resources are allocated across GPUs.
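
For concreteness, the intended split looks roughly like this (an illustrative sketch only; the names below are not the repo's actual config keys):

import torch

# Illustrative GPU allocation:
#   GPU 0       -> mapper training (most of its memory goes to map updates)
#   GPUs 1..N-1 -> habitat-sim instances for parallel data collection
num_gpus = max(torch.cuda.device_count(), 2)  # assumes at least 2 GPUs
mapper_gpu_id = 0                             # keep mapper training on a single GPU
sim_gpu_ids = list(range(1, num_gpus))        # remaining GPUs host simulator instances

# Hypothetical round-robin assignment of parallel environments to simulator GPUs.
num_envs = 8
env_to_gpu = {env_id: sim_gpu_ids[env_id % len(sim_gpu_ids)] for env_id in range(num_envs)}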
