docs2 (#995)
* docs2

* version update

* docs update

* docs update
Scitator authored Nov 12, 2020
1 parent dd7be23 commit 8b1e3f9
Showing 7 changed files with 380 additions and 31 deletions.
2 changes: 1 addition & 1 deletion catalyst/__version__.py
@@ -1 +1 @@
__version__ = "20.10.1"
__version__ = "20.11"
151 changes: 149 additions & 2 deletions docs/faq/amp.rst
@@ -1,8 +1,155 @@
Mixed precision training
==============================================================================
Catalyst supports a variety of backends for mixed precision training.
For PyTorch versions below 1.6, it's better to use the ``Nvidia Apex`` extension.
Since the PyTorch 1.6 release, it's also possible to use AMP natively inside the ``torch`` package.

- How to use Nvidia Apex?
- How to use torch.amp?
Suppose you have the following pipeline with Linear Regression:

.. code-block:: python
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner
# data
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}
# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
# model training
runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    num_epochs=8,
    verbose=True,
)
Nvidia Apex
----------------------------------------------------
To use Nvidia Apex FP16 support, you first need to install it:

.. code-block:: bash
!git clone https://github.com/NVIDIA/apex
!pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
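After installation, a quick sanity check (a minimal sketch, assuming a CUDA-capable environment) is to import the AMP module directly:

.. code-block:: python
# this import fails if the Apex extensions were not built correctly
from apex import amp
print("Nvidia Apex AMP is available")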
After that, you can extend the current pipeline with just one line of code:

.. code-block:: python
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner
# data
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}
# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
# model training
runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    num_epochs=8,
    verbose=True,
    fp16=dict(apex=True, opt_level="O1"),  # <-- Nvidia Apex FP16 params -->
)
You could also check out the example above in `this Google Colab notebook`_
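For reference, ``fp16=dict(apex=True, opt_level="O1")`` roughly corresponds to what plain Apex does by hand; a minimal sketch of the underlying Apex API, reusing the names from the pipeline above and assuming everything lives on GPU (shown for orientation only, not Catalyst code):

.. code-block:: python
from apex import amp
# wrap the model and optimizer once, before the training loop
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
for features, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    # scale the loss so that fp16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()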

Torch AMP
----------------------------------------------------
If you would like to use native AMP support, you could do the following:

.. code-block:: python
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner
# data
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}
# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
# model training
runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    num_epochs=8,
    verbose=True,
    fp16=dict(amp=True),  # <-- PyTorch AMP FP16 params -->
)
You could also check out the example above in `this Google Colab notebook`_
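Similarly, ``fp16=dict(amp=True)`` relies on the native ``torch.cuda.amp`` primitives; a minimal sketch of the equivalent manual loop (plain PyTorch 1.6+, reusing the names from the pipeline above):

.. code-block:: python
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler()
for features, targets in loader:
    optimizer.zero_grad()
    # run the forward pass and loss computation in mixed precision
    with autocast():
        loss = criterion(model(features), targets)
    # scale the loss, backpropagate, then unscale and update
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()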

.. _`this Google Colab notebook`: https://colab.research.google.com/drive/12ONaj4sMPiOT_64wh2bpH_AvRCuNFxLx?usp=sharing

Nvidia Apex (Config API)
----------------------------------------------------

First, prepare the config, for example:

.. code-block:: yaml
distributed_params:
    opt_level: "O1"
    ...
After that, you can easily run:

.. code-block:: bash
catalyst-dl run -C=/path/to/configs --apex
Torch AMP (Config API)
----------------------------------------------------

For native AMP support, you only need to pass the required flag to the ``run`` command:

.. code-block:: bash
catalyst-dl run -C=/path/to/configs --amp
If you haven't found the answer to your question, feel free to `join our slack`_ for a discussion.

75 changes: 69 additions & 6 deletions docs/faq/checkpointing.rst
@@ -1,11 +1,74 @@
[WIP] Model checkpointing
Model checkpointing
==============================================================================

- how to load the best model?
- notebook and config api
- how to save model?
- how to load model?
- what's the difference between checkpoint and checkpoint_full?
Experiment checkpoints
----------------------------------------------------
With the help of ``CheckpointCallback``,
Catalyst creates the following checkpoint structure under the selected ``logdir``:

.. code-block:: bash
logdir/
    code/  <-- your experiment and catalyst code, saved for reproducibility -->
    checkpoints/  <-- the checkpoints this section is about -->
        {stage_name}.{epoch_index}.pth  <-- top-K checkpoints, based on the model selection logic -->
        best.pth  <-- best model, based on the specified model selection logic -->
        last.pth  <-- last model checkpoint of the whole experiment run -->
        <-- the same checkpoints with the ``_full`` suffix -->
        ...
These checkpoints are pure PyTorch checkpoints without any mixins, with the following structure:

.. code-block:: python
checkpoint.pth = {
    "model_state_dict": model.state_dict(),
    "criterion_state_dict": criterion.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
}
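Because these are plain ``torch`` checkpoints, you can inspect them with ``torch.load`` alone; a small sketch (the path below is just an example from the layout above):

.. code-block:: python
import torch
checkpoint = torch.load("logdir/checkpoints/best.pth", map_location="cpu")
# prints the stored state dict names, e.g. "model_state_dict"
print(list(checkpoint.keys()))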
Full checkpoints
----------------------------------------------------
Catalyst saves 2 types of checkpoints:

- ``{checkpoint}.pth`` - stores only the model state dict and can easily be used for production purposes.
- ``{checkpoint}_full.pth`` - stores all the state dicts for the model(s), criterion(s), optimizer(s) and scheduler(s) and is handy for experiment analysis purposes; see the sketch below.
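In practice, a production-style load only needs the plain checkpoint, while resuming or analysing an experiment benefits from the ``_full`` one; a minimal sketch, assuming the ``logdir`` layout above and an already-constructed ``model`` and ``optimizer``:

.. code-block:: python
import torch
# inference: the plain checkpoint is enough
checkpoint = torch.load("logdir/checkpoints/best.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
# analysis / resuming: the full checkpoint also carries optimizer and scheduler state
checkpoint_full = torch.load("logdir/checkpoints/best_full.pth", map_location="cpu")
optimizer.load_state_dict(checkpoint_full["optimizer_state_dict"])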

Save model
----------------------------------------------------
Catalyst has user-friendly utils to save the model:

.. code-block:: python
from catalyst import utils
model = Net()
checkpoint = utils.pack_checkpoint(model=model)
utils.save_checkpoint(checkpoint, logdir="/path/to/logdir", suffix="my_checkpoint")
# now you can find your checkpoint at "/path/to/logdir/my_checkpoint.pth"
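If you also want to store the criterion, optimizer, or scheduler state, the same utils accept them as keyword arguments (a sketch mirroring the loading example below):

.. code-block:: python
from catalyst import utils
checkpoint = utils.pack_checkpoint(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
)
utils.save_checkpoint(checkpoint, logdir="/path/to/logdir", suffix="my_full_checkpoint")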
Load model
----------------------------------------------------
With Catalyst utils it's very easy to load models after an experiment run:

.. code-block:: python
from catalyst import utils
model = Net()
optimizer = ...
criterion = ...
checkpoint = utils.load_checkpoint(path="/path/to/checkpoint")
utils.unpack_checkpoint(
    checkpoint=checkpoint,
    model=model,
    optimizer=optimizer,
    criterion=criterion,
)
In this case, Catalyst will try to unpack the requested state dicts from the checkpoint.
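If you only need the model for inference, a lighter variant (a sketch based on the same utils) unpacks just the model state and switches it to eval mode:

.. code-block:: python
from catalyst import utils
model = Net()
checkpoint = utils.load_checkpoint(path="/path/to/checkpoint")
utils.unpack_checkpoint(checkpoint=checkpoint, model=model)
model.eval()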


If you haven't found the answer to your question, feel free to `join our slack`_ for a discussion.

143 changes: 140 additions & 3 deletions docs/faq/ddp.rst
@@ -1,8 +1,145 @@
[WIP] Distributed training
Distributed training
==============================================================================
Catalyst supports automatic experiment scaling through distributed training.

Notebook API
----------------------------------------------------

Suppose you have the following pipeline with Linear Regression:

.. code-block:: python
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner
# experiment setup
logdir = "./logdir"
num_epochs = 8
# data
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}
# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
# model training
runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=num_epochs,
    verbose=True,
)
For correct DDP training, you need to split dataset creation from the main training code.
This way, Catalyst can transfer your datasets to distributed mode,
and there will be no data re-creation on each worker.

A best-practice scenario for this case:

.. code-block:: python
import torch
from torch.utils.data import TensorDataset
from catalyst.dl import SupervisedRunner, utils
def datasets_fn(num_features: int):
    # datasets are created inside the function,
    # so each distributed worker builds them on its own
    X = torch.rand(int(1e4), num_features)
    y = torch.rand(X.shape[0])
    dataset = TensorDataset(X, y)
    return {"train": dataset, "valid": dataset}

def train():
    num_features = int(1e1)
    # model, criterion, optimizer, scheduler
    model = torch.nn.Linear(num_features, 1)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters())
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
    runner = SupervisedRunner()
    runner.train(
        model=model,
        datasets={
            "batch_size": 32,
            "num_workers": 1,
            "get_datasets_fn": datasets_fn,
            "num_features": num_features,
        },
        criterion=criterion,
        optimizer=optimizer,
        scheduler=scheduler,
        logdir="./logs/example_3",
        num_epochs=8,
        verbose=True,
        distributed=False,
    )

utils.distributed_cmd_run(train)
Config API
----------------------------------------------------
To run Catalyst experiments in DDP mode,
the only thing you need to do with the Config API is pass the required flag to the ``run`` command:

.. code-block:: bash
catalyst-dl run -C=/path/to/configs --distributed
Launch your training
----------------------------------------------------

In your terminal,
type the following line (adapt ``script_name`` to your script name ending with ``.py``).

.. code-block:: bash
python {script_name}
You can restrict the visible GPUs with the ``CUDA_VISIBLE_DEVICES`` variable, for example,

.. code-block:: bash
# run only on GPUs 1 and 2
CUDA_VISIBLE_DEVICES="1,2" python {script_name}
.. code-block:: bash
# run only on GPUs 0, 1 and 3
CUDA_VISIBLE_DEVICES="0,1,3" python {script_name}
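To double-check how many devices a process will actually see with such a setting, you can query PyTorch directly (a small sanity check, not required for training):

.. code-block:: python
import torch
# with CUDA_VISIBLE_DEVICES="0,1,3" set, this prints 3
print(torch.cuda.device_count())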
What will happen is that the same model will be copied onto all of your available GPUs.
During training, the full dataset will be randomly split between the GPUs
(and the split will change at each epoch).
Each GPU will grab a batch (from its fraction of the dataset),
pass it through the model, compute the loss, and back-propagate the gradients.
Then the GPUs will share their results and average them,
which means your training is equivalent to training
with a batch size of ``batch_size x num_gpus``
(where ``batch_size`` is what you used in your script);
for example, with ``batch_size=32`` and 4 GPUs the effective batch size is 128.

Since they all have the same gradients at this stage,
they will all perform the same update,
so the models will still be identical after this step.
Training then continues with the next batch,
until the desired number of iterations is reached.

During training, Catalyst will automatically average all metrics
and log them on the ``Master`` node only. The same logic is used for model checkpointing.

- How to run experiments in distributed mode?
- (?) How to collect metrics in distributed mode in the right way?

If you haven't found the answer to your question, feel free to `join our slack`_ for a discussion.
