
Added support for using multiple GPUs on training server when training the model #568

Open
wants to merge 2 commits into base: next

Conversation

@gyathaar gyathaar commented May 9, 2018

Added code to train the model on more than one GPU

When using a single GPU the code is essentially unchanged (tf.Variable() is replaced with tf.get_variable() so that variables can be reused across GPU towers, and model loading is changed so it can also load models previously trained with multiple GPUs).

In my own tests, training with the same batch size takes around 65% of the time on 2 GPUs compared to 1 GPU. Doubling the batch size makes training take about 30% longer than a single GPU with the unchanged batch size.

Added a configurable parameter for the device used to collect and apply the gradients from all the GPUs. In my tests this ran about 10% faster on the CPU than on one of the GPUs, but this may differ on other systems depending on the interconnects between the GPUs.
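The collect-and-apply step described above follows the standard multi-tower pattern: each GPU produces (gradient, variable) pairs for the same variables, and the chosen parameter device averages them before the update. A minimal pure-Python stand-in (hypothetical, not the PR's actual code; the real version would use TensorFlow ops such as tf.reduce_mean on the configured device):

```python
# Sketch of averaging gradients across GPU towers on one parameter device.
# tower_grads is a list with one entry per GPU; each entry is a list of
# (gradient, variable) pairs in the same variable order on every tower.

def average_gradients(tower_grads):
    """Return one (averaged_gradient, variable) pair per variable."""
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = [g for g, _ in grads_and_var]
        var = grads_and_var[0][1]  # the variable is shared across towers
        # Element-wise mean over the per-tower gradients.
        avg = [sum(vals) / len(vals) for vals in zip(*grads)]
        averaged.append((avg, var))
    return averaged

# Two towers, one variable "w", a 2-element gradient from each GPU:
tower0 = [([1.0, 3.0], "w")]
tower1 = [([3.0, 5.0], "w")]
print(average_gradients([tower0, tower1]))  # [([2.0, 4.0], 'w')]
```

Whether this averaging is cheaper on the CPU or a GPU depends, as the PR notes, on the interconnect between the devices.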

@gyathaar gyathaar changed the title Added support for using multiple GPUs when training the model Added support for using multiple GPUs on training server when training the model May 11, 2018
@ganeshkrishnan1
Contributor

Has this code undergone end-to-end testing? tfprocess.py has drastic changes, and I am not sure how I can check whether training progresses better with multi-GPU support.

My server has 4 GPUs and this works, but I can't vouch for the model generated from it.

@ganeshkrishnan1
Contributor

ganeshkrishnan1 commented May 22, 2018

Ran this change for 24 hours and everything seems to be OK, including the model that was built.

Note, however, that this seems incompatible with the checkpoint generated from a single GPU (due to an error clearing the .meta file).

Also, is there any reason you are applying the final steps on the CPU and not the GPU?

@gyathaar
Author

I tested training on the same set of games with the same random seed and got the same loss curves as the normal code. I have not tested extensively whether the trained nets will play identical games after x training runs.

You can train the final steps on the GPU if you like. In my tests training was about 10% faster when performing them on the CPU, but this will vary depending on what CPU and GPUs you have, and on the interconnect speeds between the GPUs.

Checkpoint files generated by the main branch cannot be loaded by this branch (and vice versa) due to metadata differences (storing device ids and paths vs. not).

@ganeshkrishnan1
Contributor

The meta file incompatibility is going to be a deal breaker for many, possibly even for the lczero network. This effectively means we have to start training from scratch, unless you can modify net_to_model.py to create checkpoints compatible with the multi-GPU version from the trained model.

@gyathaar
Author

gyathaar commented May 23, 2018

The meta file incompatibility is not an issue: you can convert an existing weights.txt file into a checkpoint using the net_to_model.py program, which is already compatible with the modified tfprocess.py.

@gyathaar gyathaar force-pushed the next branch 2 times, most recently from 2c104c4 to 79d8fd1 on May 23, 2018 08:12