
Multi gpu training #69

Open · psteinb wants to merge 4 commits into main
Conversation

@psteinb psteinb commented Feb 3, 2020

This needs a bit more testing, but I think going multi-GPU is fairly straightforward. Or did you try that already?


psteinb commented Feb 11, 2020

Almost there; apparently there is a problematic interplay between tf and keras:
tensorflow/tensorflow#30728
keras-team/keras#13057
keras-team/keras#13255
I need to check how to fix this.

@psteinb psteinb changed the title from "WIP: Multi gpu training" to "Multi gpu training" on Feb 12, 2020

psteinb commented Feb 12, 2020

Done implementing multi-GPU training. I hope putting that into the constructor of N2V was the right choice. I also added an example notebook, examples/2D/denoising2D_BSD68/BSD68_reproducibility_multi_gpu.ipynb, derived from the existing BSD68 reproducibility example.

I'll supply more extensive numbers later; my current estimates for training n2v from this notebook are:

  • a single P100 with tf 1.12 and keras 2.2.4: ~93 seconds per epoch after warm-up
  • two P100s with tf 1.12 and keras 2.2.4: ~56 seconds per epoch after warm-up

I'll provide 4-GPU numbers later. Note that this "improvement" is expected to be non-linear, as keras internally parallelizes across the batch dimension: a batch size of 128 is split into 2 sub-batches of 64 images. As discussed earlier, this approach is currently not supported with tf 1.14 and keras 2.2.{4,5} due to the bugs mentioned above.
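
A minimal sketch of how the wrapping could look, assuming keras.utils.multi_gpu_model is used under the hood (the exact wiring inside the N2V constructor in this branch may differ):

from keras.utils import multi_gpu_model

def build_training_model(template_model, num_gpus):
    # Wrap a single-GPU Keras model for data-parallel training.
    if num_gpus <= 1:
        return template_model
    # multi_gpu_model splits each incoming batch across the GPUs,
    # e.g. a batch of 128 on 2 GPUs becomes 2 sub-batches of 64 images.
    # The returned model is then compiled and trained as usual.
    return multi_gpu_model(template_model, gpus=num_gpus)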

Would love to hear your feedback on this.


tibuch commented Jun 24, 2020

Thank you for this PR!

I have this on my to-do list, but wasn't able to get my hands on a multi-GPU system. I guess the cluster should work for testing.

Although I am very confident that it just works, I would like to test it as well :)


psteinb commented Jun 24, 2020

Thanks for having a look. Last time I checked, all configurations with >=3 GPUs fail to run due to some problems with the keras data augmentation. Maybe this could be addressed by bringing n2v 100% to tf.keras?

@snehashis-roy

Hi,
I want to use 2 GPUs for training. As explained in the notebook, I used the following config:

config = N2VConfig(X_train, unet_kern_size=3, unet_n_depth=3, unet_n_first=64,
                   train_steps_per_epoch=int(dim[0] / 128), train_epochs=50, train_loss='mse',
                   batch_norm=True, train_num_gpus=2,
                   train_batch_size=64, n2v_perc_pix=1.0, n2v_patch_shape=(128, 128),
                   n2v_manipulator='uniform_withCP', n2v_neighborhood_radius=5)

I set CUDA_VISIBLE_DEVICES to 1,2 before running the training, and I installed N2V with pip install n2v. My TF-GPU version is 1.14.1, with keras 2.2.5 and numpy 1.19.1.

The training still uses 1 GPU. Please let me know what I am missing.
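
As a quick, illustrative sanity check (assuming TF 1.x), one can confirm which GPUs TensorFlow actually sees; note that CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow is imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # must be set before importing tensorflow

from tensorflow.python.client import device_lib
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(gpus)  # two visible GPUs should show up as '/device:GPU:0' and '/device:GPU:1'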


tibuch commented Aug 20, 2020

Hi @piby2,

This functionality is not part of the official N2V release yet.

If you would like to test it, you would have to clone the fork psteinb/n2v and check out the branch multi_gpu_training. Then run pip install . from inside the git repo to install this version.
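
For reference, the steps above would look roughly like this (the clone URL is assumed from the usual GitHub naming of the fork):

git clone https://github.com/psteinb/n2v.git
cd n2v
git checkout multi_gpu_training
pip install .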
