
Extension of lightning DDP support #59

Open
bbrier opened this issue Jan 11, 2024 · 1 comment
Labels
feature Change that does not break compatibility, but affects the public interfaces.

Comments


bbrier commented Jan 11, 2024

Motivation

I am attempting to use Optuna for hyperparameter optimization of a complex, Lightning-based deep learning framework, and it is essential for this framework to run in a distributed setting. The distributed Lightning integration example uses ddp_spawn as the strategy, which Lightning strongly discourages for speed and flexibility reasons (for example, the inability to use a large num_workers without bottlenecking, which is essential for my use case). Attempting to use the regular DDP strategy, however, results in Optuna generating a different set of hyperparameters for each rank, since my Optuna main script is executed once per process. I have considered launching my distributed main script in a subprocess started from the objective function, but that would not let me use the PyTorchLightningPruningCallback, since I cannot reliably pass that object to the subprocess.

Description

My suggestion is to add a way for Optuna to run with regular DDP, perhaps by tracking in the storage whether DDP is being used, so that when study.optimize is called, each rank receives the same trial and the trial's suggest methods return identical hyperparameters across ranks. I do not know enough about Optuna's internals to judge whether this is feasible to implement. Is this something that could be supported in the future?
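To make the problem and the proposal concrete, here is a minimal stdlib-only simulation (no Optuna or torch involved; all names are illustrative): today each rank re-runs the script and samples independently, while the proposal would have rank 0's suggestions shared so every rank sees the same values.

```python
import random

WORLD_SIZE = 4

def suggest_params(rng):
    # Stand-in for trial.suggest_* calls; parameter names are illustrative.
    return {"lr": rng.uniform(1e-5, 1e-1), "num_layers": rng.randint(1, 4)}

# Current behaviour: every rank re-executes the script, so each rank's
# sampler draws its own, different hyperparameters.
independent = [suggest_params(random.Random(rank)) for rank in range(WORLD_SIZE)]

# Proposed behaviour: only rank 0 samples, and the result is shared with
# all ranks (in real DDP, e.g. via torch.distributed.broadcast_object_list).
rank0_params = suggest_params(random.Random(0))
synchronized = [rank0_params for _ in range(WORLD_SIZE)]

print(len({p["lr"] for p in independent}) > 1)          # True: ranks diverge
print(all(p == rank0_params for p in synchronized))     # True: ranks agree
```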

Alternatives (optional)

No response

Additional context (optional)

No response

@bbrier bbrier added the feature Change that does not break compatibility, but affects the public interfaces. label Jan 11, 2024
@kirilk-ua

You can use optuna.integration.TorchDistributedTrial for DDP mode. There is an example:
https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py

For Lightning there is also an example, but it only uses the ddp_spawn strategy:
https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py

When I try to run it in plain DDP mode with TorchDistributedTrial and Lightning, the hyperparameters are synchronized, but:

  1. PyTorchLightningPruningCallback crashes because it accesses trial._study internally, which is not implemented in TorchDistributedTrial.
  2. For some reason, running in multi-node mode with TorchDistributor().run() gives an unpredictable number of trials, and I can't find a way to debug it. I tried optuna.logging.set_verbosity(optuna.logging.DEBUG), but there are no additional logs (I expected to see logs of the pruning decisions).
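For reference, the synchronization pattern in the linked pytorch_distributed_simple.py is: rank 0 holds the real trial and actually samples, the other ranks wrap None, and each suggest call hands rank 0's value to every rank. A toy in-process stand-in for that semantics (not the real Optuna API, no torch required) looks like:

```python
import random

class ToyDistributedTrial:
    """Toy stand-in mimicking TorchDistributedTrial's broadcast semantics:
    only rank 0 actually samples; every rank reads rank 0's value."""

    def __init__(self, rank, shared, rng):
        self.rank = rank
        self.shared = shared  # stands in for the process-group broadcast
        self.rng = rng

    def suggest_float(self, name, low, high):
        if self.rank == 0:
            # Only rank 0 draws a value (in real Optuna, from the sampler).
            self.shared[name] = self.rng.uniform(low, high)
        return self.shared[name]

shared = {}
rng = random.Random(42)
trials = [ToyDistributedTrial(rank, shared, rng) for rank in range(4)]
# Rank 0 suggests first, so its value is in place when the others read it.
lrs = [t.suggest_float("lr", 1e-5, 1e-1) for t in trials]
print(len(set(lrs)))  # 1 — every rank sees the same hyperparameter
```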

If someone has a working example of using Lightning with Optuna in plain DDP mode (which is what Lightning recommends), it would be great.

@nzw0301 nzw0301 changed the title Extension of DDP support Extension of lightning DDP support Sep 20, 2024