AMP training error #39

Open
fan-ziqi opened this issue Dec 11, 2024 · 8 comments

@fan-ziqi

When I run python phys_anim/train_agent.py +exp=amp +robot=smpl +backbone=isaaclab motion_file=phys_anim/data/motions/smpl_humanoid_walk.npy I get the following error:

wandb: WARNING This integration is tested and supported for lightning Fabric 2.1.3.
wandb: WARNING             Please report any issues to https://github.com/wandb/wandb/issues with the tag `lightning-fabric`.
2024-12-11 19:51:46,438 - INFO - logger - logger initialized
/home/ubuntu/workspaces/ProtoMotions/phys_anim/train_agent.py:67: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="config", config_name="base")
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'base': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 0
Setting seed: 0
[INFO][AppLauncher]: Loading experience file: /home/ubuntu/workspaces/IsaacLab/source/apps/isaaclab.python.headless.kit
[Warning] [omni.isaac.kit.simulation_app] Modules: ['omni.kit_app'] were loaded before SimulationApp was started and might not be loaded correctly.
[Warning] [omni.isaac.kit.simulation_app] Please check to make sure no extra omniverse or pxr modules are imported before the call to SimulationApp(...)
Loading user config located at: '/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/omni/data/Kit/Isaac-Sim/4.2/user.config.json'
[Info] [carb] Logging to file: /home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/omni/logs/Kit/Isaac-Sim/4.2/kit_20241211_195146.log
2024-12-11 11:51:46 [0ms] [Warning] [omni.kit.app.plugin] No crash reporter present, dumps uploading isn't available.
[2024-12-11 19:51:46,992][asyncio][DEBUG] - Using selector: EpollSelector
[2024-12-11 19:51:47,109][omni.kit.telemetry.impl.sentry_extension][INFO] - sentry is disabled for external build
[2024-12-11 19:51:47,109][omni.kit.telemetry.impl.sentry_extension][INFO] - sentry is disabled for external build

|---------------------------------------------------------------------------------------------|
| Driver Version: 550.100       | Graphics API: Vulkan
|=============================================================================================|
| GPU | Name                             | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                                  |        |     |            | Device-ID | UUID       |
|     |                                  |        |     |            | Bus-ID    |            |
|---------------------------------------------------------------------------------------------|
| 0   | NVIDIA GeForce RTX 4090          | Yes: 0 |     | 24564   MB | 10de      | 0          |
|     |                                  |        |     |            | 2684      | 16760a40.. |
|     |                                  |        |     |            | 1         |            |
|=============================================================================================|
| OS: 20.04.6 LTS (Focal Fossa) ubuntu, Version: 20.04.6, Kernel: 5.15.0-126-generic
| XServer Vendor: The X.Org Foundation, XServer Version: 12013000 (1.20.13.0)
| Processor: Intel(R) Core(TM) i9-14900KF | Cores: 24 | Logical: 32
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 64106 | Free Memory: 23512
| Total Page/Swap (MB): 2047 | Free Page/Swap: 1558
|---------------------------------------------------------------------------------------------|
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Error executing job with overrides: ['+exp=amp', '+robot=smpl', '+backbone=isaaclab', 'motion_file=phys_anim/data/motions/smpl_humanoid_walk.npy']
Error in call to target 'phys_anim.envs.humanoid.isaaclab.Humanoid':
AttributeError("'NoneType' object has no attribute 'total_num_objects'")
full_key: env

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2024-12-11 11:51:51 [4,938ms] [Warning] [carb] Recursive unloadAllPlugins() detected!
@tesslerc
Collaborator

Can you pull the latest and try again?

Please let me know if anything fails and I'll quickly debug and push a fix.
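
When reporting a failure, the logs above suggest setting HYDRA_FULL_ERROR=1 to get a complete stack trace. The following is a minimal sketch of one way to do that from Python; the helper names full_error_env and run_with_full_trace are hypothetical and not part of ProtoMotions or Hydra.

```python
import os
import subprocess
import sys


def full_error_env(base_env=None):
    """Return a copy of the environment with HYDRA_FULL_ERROR=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HYDRA_FULL_ERROR"] = "1"
    return env


def run_with_full_trace(args):
    """Run the current Python interpreter with the given arguments.

    The child process sees HYDRA_FULL_ERROR=1; the exit code is returned.
    """
    completed = subprocess.run([sys.executable, *args], env=full_error_env())
    return completed.returncode
```

For example, run_with_full_trace(["phys_anim/train_agent.py", "+exp=amp", "+robot=smpl", "+backbone=isaaclab"]) would re-run the training command from this thread with the full trace enabled.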

@fan-ziqi
Author

fan-ziqi commented Dec 11, 2024

Thanks for the reply!

A new error has appeared:

(isaaclab) ubuntu@ubuntu-4090:~/workspaces/ProtoMotions$ python phys_anim/train_agent.py +exp=amp +robot=smpl +backbone=isaaclab motion_file=phys_anim/data/motions/smpl_humanoid_walk.npy
wandb: WARNING This integration is tested and supported for lightning Fabric 2.1.3.
wandb: WARNING             Please report any issues to https://github.com/wandb/wandb/issues with the tag `lightning-fabric`.
2024-12-11 21:50:57,537 - INFO - logger - logger initialized
/home/ubuntu/workspaces/ProtoMotions/phys_anim/train_agent.py:67: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="config", config_name="base")
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'base': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 0
Setting seed: 0
[INFO][AppLauncher]: Loading experience file: /home/ubuntu/workspaces/IsaacLab/source/apps/isaaclab.python.headless.kit
[Warning] [omni.isaac.kit.simulation_app] Modules: ['omni.kit_app'] were loaded before SimulationApp was started and might not be loaded correctly.
[Warning] [omni.isaac.kit.simulation_app] Please check to make sure no extra omniverse or pxr modules are imported before the call to SimulationApp(...)
Loading user config located at: '/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/omni/data/Kit/Isaac-Sim/4.2/user.config.json'
[Info] [carb] Logging to file: /home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/omni/logs/Kit/Isaac-Sim/4.2/kit_20241211_215058.log
2024-12-11 13:50:58 [0ms] [Warning] [omni.kit.app.plugin] No crash reporter present, dumps uploading isn't available.
[2024-12-11 21:50:58,074][asyncio][DEBUG] - Using selector: EpollSelector
[2024-12-11 21:50:58,176][omni.kit.telemetry.impl.sentry_extension][INFO] - sentry is disabled for external build
[2024-12-11 21:50:58,177][omni.kit.telemetry.impl.sentry_extension][INFO] - sentry is disabled for external build

|---------------------------------------------------------------------------------------------|
| Driver Version: 550.100       | Graphics API: Vulkan
|=============================================================================================|
| GPU | Name                             | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                                  |        |     |            | Device-ID | UUID       |
|     |                                  |        |     |            | Bus-ID    |            |
|---------------------------------------------------------------------------------------------|
| 0   | NVIDIA GeForce RTX 4090          | Yes: 0 |     | 24564   MB | 10de      | 0          |
|     |                                  |        |     |            | 2684      | 16760a40.. |
|     |                                  |        |     |            | 1         |            |
|=============================================================================================|
| OS: 20.04.6 LTS (Focal Fossa) ubuntu, Version: 20.04.6, Kernel: 5.15.0-126-generic
| XServer Vendor: The X.Org Foundation, XServer Version: 12013000 (1.20.13.0)
| Processor: Intel(R) Core(TM) i9-14900KF | Cores: 24 | Logical: 32
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 64106 | Free Memory: 31657
| Total Page/Swap (MB): 2047 | Free Page/Swap: 1547
|---------------------------------------------------------------------------------------------|
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPreviewSurface.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdUVTexture.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_float.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_float2.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_float3.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_float4.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_int.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_string.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_normal.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_point.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_vector.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdPrimvarReader_matrix.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

2024-12-11 13:50:59 [1,534ms] [Warning] [omni.usd] Warning: in GetNodeDiscoveryResults at line 136 of /builds/omniverse/usd-ci/USD/pxr/usd/usdShade/shaderDefUtils.cpp -- Unable to resolve info:sourceAsset </UsdTransform2d.info:mdl:sourceAsset> with value @UsdPreviewSurface.mdl@.

[INFO]: Setup complete...
{'disable_discriminator': False, 'discriminator_obs_historical_steps': 10, 'discriminator_obs_size_per_step': 232}
/home/ubuntu/workspaces/ProtoMotions/phys_anim/utils/motion_lib.py:150: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  torch.tensor(key_body_ids, dtype=torch.long, device=device),
Loading motions from yaml/npy file
Loading 1/1 motion files: phys_anim/data/motions/smpl_humanoid_walk.npy
Error executing job with overrides: ['+exp=amp', '+robot=smpl', '+backbone=isaaclab', 'motion_file=phys_anim/data/motions/smpl_humanoid_walk.npy']
Error in call to target 'phys_anim.envs.humanoid.isaaclab.Humanoid':
InstantiationException("Error in call to target 'phys_anim.utils.motion_lib.MotionLib':\nFileNotFoundError(2, 'No such file or directory')\nfull_key: config.motion_lib")
full_key: env

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2024-12-11 13:51:04 [6,535ms] [Warning] [carb] Recursive unloadAllPlugins() detected!

@tesslerc
Collaborator

The error suggests the motion file does not exist at that path.
The data was moved out of phys_anim and into the main directory of the repository.
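
A minimal sketch of a pre-flight check for the path passed as motion_file. It assumes the data now sits at the same relative path with the leading phys_anim/ component dropped, which is an inference from the comment above and not a confirmed layout. The helper resolve_motion_file is hypothetical and not part of ProtoMotions.

```python
from pathlib import Path


def resolve_motion_file(motion_file: str, repo_root: str = ".") -> Path:
    """Return an existing path for motion_file, or raise FileNotFoundError.

    Tries the path as given, then the same path with a leading
    "phys_anim/" component removed (the assumed new data location).
    """
    root = Path(repo_root)
    given = Path(motion_file)
    candidates = [given]
    if given.parts and given.parts[0] == "phys_anim":
        candidates.append(Path(*given.parts[1:]))
    for candidate in candidates:
        full = root / candidate
        if full.is_file():
            return full
    tried = ", ".join(str(root / c) for c in candidates)
    raise FileNotFoundError(f"motion file not found; tried: {tried}")
```

Calling this before launching training reports every location that was tried, which is more informative than the bare FileNotFoundError(2, 'No such file or directory') in the log above.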

@fan-ziqi
Author

That's indeed the problem, thank you!

@fan-ziqi reopened this Dec 15, 2024
@fan-ziqi
Author

When I evaluate the agent with python phys_anim/eval_agent.py +exp=amp +robot=smpl +backbone=isaaclab +checkpoint=results/amp/lightning_logs/version_0/last.ckpt

it fails with the error Key 'fabric' is not in struct:

2024-12-15 10:59:44,491 - INFO - logger - logger initialized
/home/ubuntu/workspaces/ProtoMotions/phys_anim/eval_agent.py:62: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="config")
/home/ubuntu/anaconda3/envs/isaaclab/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Could not find config path: results/amp/lightning_logs/config.yaml
Error executing job with overrides: ['+exp=amp', '+robot=smpl', '+backbone=isaaclab', '+checkpoint=results/amp/lightning_logs/version_0/last.ckpt']
Traceback (most recent call last):
  File "/home/ubuntu/workspaces/ProtoMotions/phys_anim/eval_agent.py", line 104, in main
    fabric: Fabric = instantiate(config.fabric)
omegaconf.errors.ConfigAttributeError: Key 'fabric' is not in struct
    full_key: fabric
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@tesslerc
Collaborator

Can you try with checkpoint=results/amp/last.ckpt?
I'll update the README accordingly.
We're changing how experiments are run so that resuming previous experiments becomes very easy.

@yinkangning0124

When I eval the agent python phys_anim/eval_agent.py +exp=amp +robot=smpl +backbone=isaaclab +checkpoint=results/amp/lightning_logs/version_0/last.ckpt

It says Key 'fabric' is not in struct

Hi, have you solved the issue?

@tesslerc
Collaborator

Could you try pointing to the checkpoint in the root directory of the experiment rather than the one inside the lightning logs inner folder?
Each experiment should be self-contained: a new experiment folder is created for each new training run.
The separation into lightning logs enables auto-resume and checkpointing across training resumptions, but the config is shared across all runs and is stored in the root experiment folder.
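
The layout described above can be illustrated with a small sketch that walks up from a checkpoint path until it finds a directory holding the shared config. The file name config.yaml comes from the "Could not find config path" message in the log. The helper find_experiment_root is hypothetical, and this is not how eval_agent.py actually locates its config.

```python
from pathlib import Path
from typing import Optional


def find_experiment_root(checkpoint: str, config_name: str = "config.yaml") -> Optional[Path]:
    """Return the nearest ancestor directory of checkpoint containing config_name.

    The search starts in the checkpoint's own directory and moves upward;
    None is returned if no ancestor holds the config file.
    """
    start = Path(checkpoint).resolve().parent
    for directory in (start, *start.parents):
        if (directory / config_name).is_file():
            return directory
    return None
```

Under the layout assumed here, a checkpoint at results/amp/lightning_logs/version_0/last.ckpt would resolve to results/amp, provided config.yaml is stored there.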

We will update the README to reflect this, but please first let us know whether this was indeed the cause and whether your problem is now fixed.
