
Extension of lightning DDP support #59

Open
bbrier opened this issue Jan 11, 2024 · 1 comment
Labels
feature Change that does not break compatibility, but affects the public interfaces.

Comments


bbrier commented Jan 11, 2024

Motivation

I am attempting to use Optuna for hyperparameter optimization of a complex, Lightning-based deep learning framework, and it is essential for this framework to run in a distributed setting. The distributed Lightning integration example uses ddp_spawn as the strategy, which Lightning strongly discourages for speed and flexibility reasons (for example, the inability to use a large num_workers without bottlenecking, which is essential for my use case). Attempting to use the regular DDP strategy, however, results in Optuna generating a different set of hyperparameters for each rank, since my Optuna main script is executed once per process. I have considered launching my distributed main script in a subprocess started from the objective function, but that would not let me use the PyTorchLightningPruningCallback, since I cannot reliably pass that object to the subprocess.

Description

My suggestion is to add a way for Optuna to run with regular DDP, perhaps by tracking in the storage whether DDP is being used, so that when study.optimize is called, each rank receives the same trial and the trial's suggest methods return identical hyperparameters across ranks. I do not know enough about Optuna's internals to judge whether this is feasible to implement. Is this something that could be supported in the future?
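To make the problem and the proposal concrete, here is a minimal stdlib-only simulation (no Optuna or torch involved; all names are illustrative): today each rank re-runs the script and samples independently, while the proposal would have rank 0's suggestions shared so every rank sees the same values.

```python
import random

WORLD_SIZE = 4

def suggest_params(rng):
    # Stand-in for trial.suggest_* calls; parameter names are illustrative.
    return {"lr": rng.uniform(1e-5, 1e-1), "num_layers": rng.randint(1, 4)}

# Current behaviour: every rank re-executes the script, so each rank's
# sampler draws its own, different hyperparameters.
independent = [suggest_params(random.Random(rank)) for rank in range(WORLD_SIZE)]

# Proposed behaviour: only rank 0 samples, and the result is shared with
# all ranks (in real DDP, e.g. via torch.distributed.broadcast_object_list).
rank0_params = suggest_params(random.Random(0))
synchronized = [rank0_params for _ in range(WORLD_SIZE)]

print(len({p["lr"] for p in independent}) > 1)          # True: ranks diverge
print(all(p == rank0_params for p in synchronized))     # True: ranks agree
```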

Alternatives (optional)

No response

Additional context (optional)

No response

@bbrier bbrier added the feature Change that does not break compatibility, but affects the public interfaces. label Jan 11, 2024
@kirilk-ua

You can use optuna.integration.TorchDistributedTrial for DDP mode. There is an example:
https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py

For Lightning there is also an example, but it only uses the ddp_spawn strategy:
https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py

When I try to run it in plain DDP mode with TorchDistributedTrial and Lightning, the hyperparameters are synchronized, but:

  1. PyTorchLightningPruningCallback crashes because it accesses trial._study internally, which is not implemented in TorchDistributedTrial.
  2. For some reason, running in multi-node mode with TorchDistributor().run() gives an unpredictable number of trials, and I can't find a way to debug it. I tried optuna.logging.set_verbosity(optuna.logging.DEBUG), but there are no additional logs (I expected to see logs of the pruning decisions).
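For reference, the synchronization pattern in the linked pytorch_distributed_simple.py is: rank 0 holds the real trial and actually samples, the other ranks wrap None, and each suggest call hands rank 0's value to every rank. A toy in-process stand-in for that semantics (not the real Optuna API, no torch required) looks like:

```python
import random

class ToyDistributedTrial:
    """Toy stand-in mimicking TorchDistributedTrial's broadcast semantics:
    only rank 0 actually samples; every rank reads rank 0's value."""

    def __init__(self, rank, shared, rng):
        self.rank = rank
        self.shared = shared  # stands in for the process-group broadcast
        self.rng = rng

    def suggest_float(self, name, low, high):
        if self.rank == 0:
            # Only rank 0 draws a value (in real Optuna, from the sampler).
            self.shared[name] = self.rng.uniform(low, high)
        return self.shared[name]

shared = {}
rng = random.Random(42)
trials = [ToyDistributedTrial(rank, shared, rng) for rank in range(4)]
# Rank 0 suggests first, so its value is in place when the others read it.
lrs = [t.suggest_float("lr", 1e-5, 1e-1) for t in trials]
print(len(set(lrs)))  # 1 — every rank sees the same hyperparameter
```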

If someone has a working example of using Lightning with Optuna in plain DDP mode (which is what Lightning recommends), it would be great.

@nzw0301 nzw0301 changed the title Extension of DDP support Extension of lightning DDP support Sep 20, 2024