diff --git a/doc/install/easy-install.md b/doc/install/easy-install.md
index cb529acda3..55720b59e4 100644
--- a/doc/install/easy-install.md
+++ b/doc/install/easy-install.md
@@ -2,7 +2,7 @@
 
 There various easy methods to install DeePMD-kit. Choose one that you prefer. If you want to build by yourself, jump to the next two sections.
 
-After your easy installation, DeePMD-kit (`dp`) and LAMMPS (`lmp`) will be available to execute. You can try `dp -h` and `lmp -h` to see the help. `mpirun` is also available considering you may want to run LAMMPS in parallel.
+After your easy installation, DeePMD-kit (`dp`) and LAMMPS (`lmp`) will be available to execute. You can try `dp -h` and `lmp -h` to see the help. `mpirun` is also available considering you may want to train models or run LAMMPS in parallel.
 
 - [Install off-line packages](#install-off-line-packages)
 - [Install with conda](#install-with-conda)
@@ -27,13 +27,13 @@ conda create -n deepmd deepmd-kit=*=*cpu libdeepmd=*=*cpu lammps-dp -c https://c
 
 Or one may want to create a GPU environment containing [CUDA Toolkit](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver):
 ```bash
-conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=11.3 -c https://conda.deepmodeling.org
+conda create -n deepmd deepmd-kit=*=*gpu libdeepmd=*=*gpu lammps-dp cudatoolkit=11.3 horovod -c https://conda.deepmodeling.org
 ```
 One could change the CUDA Toolkit version from `10.1` or `11.3`.
 
 One may speficy the DeePMD-kit version such as `2.0.0` using
 ```bash
-conda create -n deepmd deepmd-kit=2.0.0=*cpu libdeepmd=2.0.0=*cpu lammps-dp=2.0.0 -c https://conda.deepmodeling.org
+conda create -n deepmd deepmd-kit=2.0.0=*cpu libdeepmd=2.0.0=*cpu lammps-dp=2.0.0 horovod -c https://conda.deepmodeling.org
 ```
 
 One may enable the environment using
diff --git a/doc/install/install-from-source.md b/doc/install/install-from-source.md
index b0e6f468b1..7f69427517 100644
--- a/doc/install/install-from-source.md
+++ b/doc/install/install-from-source.md
@@ -92,6 +92,22 @@ Valid subcommands:
 test test the model
 ```
 
+### Install horovod and mpi4py
+
+[Horovod](https://github.com/horovod/horovod) and [mpi4py](https://github.com/mpi4py/mpi4py) are used for parallel training. For better performance on GPU, please follow the tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
+```bash
+# With GPU, prefer NCCL as communicator.
+HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install horovod mpi4py
+```
+
+If you work in a CPU environment, please prepare the runtime as below:
+```bash
+# By default, MPI is used as communicator.
+HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install horovod mpi4py
+```
+
+If you don't install horovod, DeePMD-kit will fall back to serial mode.
+
 ## Install the C++ interface
 
 If one does not need to use DeePMD-kit with Lammps or I-Pi, then the python interface installed in the previous section does everything and he/she can safely skip this section.
diff --git a/doc/train/parallel-training.md b/doc/train/parallel-training.md
index d619569c8d..5609468a76 100644
--- a/doc/train/parallel-training.md
+++ b/doc/train/parallel-training.md
@@ -10,21 +10,8 @@ Testing `examples/water/se_e2_a` on a 8-GPU host, linear acceleration can be obs
 | 4 | 1.7635 | 56.71*4 | 3.29 |
 | 8 | 1.7267 | 57.91*8 | 6.72 |
 
-To experience this powerful feature, please intall Horovod and [mpi4py](https://github.com/mpi4py/mpi4py) first. For better performance on GPU, please follow tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
-```bash
-# With GPU, prefer NCCL as communicator.
-HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip3 install horovod mpi4py
-```
-
-If your work in CPU environment, please prepare runtime as below:
-```bash
-# By default, MPI is used as communicator.
-HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install horovod mpi4py
-```
-
 Horovod works in the data-parallel mode resulting a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you lauch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
 
-With dependencies installed, have a quick try!
 ```bash
 # Launch 4 processes on the same host
 CUDA_VISIBLE_DEVICES=4,5,6,7 horovodrun -np 4 \