
Added support for using multiple GPUs on training server when training the model #568

Open
wants to merge 2 commits into base: next

Conversation

@gyathaar gyathaar commented May 9, 2018

Added code to train the model on more than one GPU

When using a single GPU the code is essentially unchanged (tf.Variable() is replaced with tf.get_variable() so that variables can be reused across GPU towers, and model loading is changed so it can also load models previously trained with multiple GPUs).

In my own tests, training with the same batch size takes around 65% of the time on 2 GPUs compared to 1 GPU. Doubling the batch size makes training take about 30% longer than a single GPU with the unchanged batch size.

Added a configurable parameter for the device used to collect and apply the gradients from all the GPUs. In my tests this ran about 10% faster on the CPU than on one of the GPUs, but this may differ on other systems depending on the interconnects between the GPUs.
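The collect-and-apply step described above follows the standard multi-tower pattern: each GPU produces (gradient, variable) pairs for the same variables, and the chosen parameter device averages them before the update. A minimal pure-Python stand-in (hypothetical, not the PR's actual code; the real version would use TensorFlow ops such as tf.reduce_mean on the configured device):

```python
# Sketch of averaging gradients across GPU towers on one parameter device.
# tower_grads is a list with one entry per GPU; each entry is a list of
# (gradient, variable) pairs in the same variable order on every tower.

def average_gradients(tower_grads):
    """Return one (averaged_gradient, variable) pair per variable."""
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = [g for g, _ in grads_and_var]
        var = grads_and_var[0][1]  # the variable is shared across towers
        # Element-wise mean over the per-tower gradients.
        avg = [sum(vals) / len(vals) for vals in zip(*grads)]
        averaged.append((avg, var))
    return averaged

# Two towers, one variable "w", a 2-element gradient from each GPU:
tower0 = [([1.0, 3.0], "w")]
tower1 = [([3.0, 5.0], "w")]
print(average_gradients([tower0, tower1]))  # [([2.0, 4.0], 'w')]
```

Whether this averaging is cheaper on the CPU or a GPU depends, as the PR notes, on the interconnect between the devices.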

@gyathaar gyathaar changed the title Added support for using multiple GPUs when training the model Added support for using multiple GPUs on training server when training the model May 11, 2018
@ganeshkrishnan1
Contributor

Has this code undergone end-to-end testing? tfprocess.py has drastic changes, and I am not sure how I can check whether training progresses better with multi-GPU support.

My server has 4 GPUs and this works, but I can't vouch for the model generated from it.

@ganeshkrishnan1
Contributor

ganeshkrishnan1 commented May 22, 2018

Ran this change for 24 hours and everything seems to be OK, including the model that was built.

Note, however, that this seems incompatible with the checkpoint generated from a single GPU (due to an error clearing the .meta file).

Also, is there any reason you are applying the final steps on the CPU and not the GPU?

@gyathaar
Author

I tested training on the same set of games with the same random seed and got the same loss curves as the normal code. I have not tested extensively whether the trained nets will play identical games after x training runs.

You can train the final steps on the GPU if you like. In my tests training was about 10% faster when performing them on the CPU, but this will vary depending on what CPU and GPUs you have, and on the interconnect speeds between the GPUs.

Checkpoint files generated by the main branch cannot be loaded by this branch (and vice versa) due to metadata differences (storing device ids and paths vs. not).

@ganeshkrishnan1
Contributor

The meta file incompatibility is going to be a deal breaker for many, possibly even for the lczero network. This effectively means we have to start training from scratch, unless you can modify net_to_model.py to create checkpoints compatible with the multi-GPU version from the trained model.

@gyathaar
Author

gyathaar commented May 23, 2018

The meta file incompatibility is not an issue: you can convert an existing weights.txt file into a checkpoint using the net_to_model.py program, which is already compatible with the modified tfprocess.py.

@gyathaar gyathaar force-pushed the next branch 2 times, most recently from 2c104c4 to 79d8fd1 on May 23, 2018 08:12