The 2.0 release of AllenNLP introduced a new high-performance DataLoader, the MultiProcessDataLoader, that is optimized for training on large datasets that can't fit into memory. This guide will show you how to get the most out of it.
Best practices
Large datasets
If your dataset is too big to fit into memory (a common problem), you'll need to load it lazily. This is done by simply setting the max_instances_in_memory parameter to a non-zero integer. The optimal value depends on your use case.
If you're using a batch_sampler, you will generally get better samples by setting max_instances_in_memory to a higher number - such as 10 to 100 times your batch size - since this determines how many Instances your batch_sampler gets to sample from at a time.
If you're not using a batch_sampler then this number is much less important. Setting it to 2 to 10 times your batch size is a reasonable value.
Keep in mind that using max_instances_in_memory generally results in a slower training loop unless you load data in worker processes by setting the num_workers option to a non-zero integer (see below). That way data loading won't block the main process.
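For example, here is a minimal sketch of enabling lazy loading (the reader and data path are placeholders, and exact constructor arguments may vary slightly across AllenNLP 2.x versions):

```python
from allennlp.data.data_loaders import MultiProcessDataLoader

# `my_reader` is a hypothetical DatasetReader instance and the data path is a placeholder.
loader = MultiProcessDataLoader(
    reader=my_reader,
    data_path="/path/to/train.jsonl",
    batch_size=32,
    max_instances_in_memory=32 * 8,  # keep roughly 8 batches worth of instances in memory
    num_workers=1,                   # read instances in a worker process (see "Performance" below)
)
# Before iterating over batches you would call loader.index_with(vocab)
# with your Vocabulary, as with any AllenNLP data loader.
```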
Performance
The quickest way to increase the performance of data loading is to adjust the num_workers parameter. num_workers determines how many workers are used to read Instances from your DatasetReader. By default, this is set to 0, which means everything is done in the main process.
But before trying to set num_workers to a non-zero number, you should make sure your DatasetReader is optimized for use with multi-process data loading. See the next point.
Optimizing your DatasetReader
There are two things you may need to update in your DatasetReader in order for it to be efficient in the multi-process or distributed data loading context.
The _read() method should handle filtering out all but the instances that each particular worker should generate.
This is important because the default mechanism for filtering out Instances in the distributed or multi-process data loading setting is not very efficient, since every worker would still need to process every single Instance in your dataset.
But by manually handling the filtering / sharding within your _read() method, each worker only needs to perform a subset of the work required to create instances.
For example, if you were training with 2 GPUs and your _read() method reads a file line-by-line, creating one Instance for each line, you could check the node rank within _read() and keep only every other line, starting at the line number that corresponds to the node rank.
The helper method shard_iterable() is there to make this easy for you. You can wrap this around any iterable object in your _read() method, and it will return an iterator that skips the right items based on the distributed training or multi-process loading context. This method can always be called regardless of whether or not you're actually using distributed training or multi-process loading.
Remember though that when you handle the sharding manually within _read(), you need to let the DatasetReader know about this so that it doesn't do any additional filtering. Therefore you need to ensure that both self.manual_distributed_sharding and self.manual_multiprocess_sharding are set to True.
If you call the helper method shard_iterable() without setting these to True, you'll get an exception.
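As a rough sketch (the class name, registration name, file format, and field names are made up for illustration), a reader that handles its own sharding might look like this:

```python
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


@DatasetReader.register("my-line-reader")  # hypothetical registration name
class MyLineReader(DatasetReader):
    def __init__(self, **kwargs):
        # Declare that _read() does its own sharding, so the DatasetReader
        # doesn't apply its (less efficient) default filtering on top of it.
        super().__init__(
            manual_distributed_sharding=True,
            manual_multiprocess_sharding=True,
            **kwargs,
        )
        self.tokenizer = WhitespaceTokenizer()
        self.token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str):
        with open(file_path, "r") as data_file:
            # shard_iterable() yields only the lines this worker / node is
            # responsible for, and is a no-op with a single process.
            for line in self.shard_iterable(data_file):
                yield self.text_to_instance(line.strip())

    def text_to_instance(self, text: str) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        # Note: no token indexers are attached to the TextField here (see the next point).
        return Instance({"text": TextField(tokens)})
```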
If the instances generated by _read() contain TextFields, those TextFields should not have any token indexers assigned. The token indexers need to be applied in the apply_token_indexers() method instead.
This is highly recommended because if the instances generated by your _read() method have token indexers attached, those indexers will be duplicated when they are sent across processes. If your token indexers contain large objects (such as PretrainedTransformerTokenIndexers) this could take up a massive amount of memory.
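Continuing the hypothetical reader sketched above, the indexers stay on the reader itself and are only attached in apply_token_indexers(), so they never travel across process boundaries with the instances (the field name "text" matches the sketch above):

```python
    def apply_token_indexers(self, instance: Instance) -> None:
        # Attach the indexers stored on the reader. This matters most when they
        # hold large objects, such as the pretrained transformer indexers
        # mentioned above.
        instance.fields["text"].token_indexers = self.token_indexers
```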
Common issues
Dead-locks
Multiprocessing code in Python is complicated, especially when it involves lower-level libraries that may spawn their own threads. If you run into dead-locks while using num_workers > 0, there are luckily two simple work-arounds that usually fix the issue.
The first work-around is to disable parallelism for these low-level libraries. For example, setting the environment variables OMP_NUM_THREADS=1 and TOKENIZERS_PARALLELISM=0 will do so for PyTorch and Numpy (for CPU operations) and HuggingFace Tokenizers, respectively.
Alternatively, changing the start_method to "spawn" (when available, depending on your OS) may fix your issues without disabling parallelism for other libraries. See issue #4848 for more info.
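Both work-arounds together might look something like this (a sketch; the reader and data path are placeholders, as above):

```python
import os

# Disable low-level parallelism before torch / tokenizers get imported and used.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "0"

from allennlp.data.data_loaders import MultiProcessDataLoader

loader = MultiProcessDataLoader(
    reader=my_reader,                  # hypothetical DatasetReader, as above
    data_path="/path/to/train.jsonl",
    batch_size=32,
    num_workers=2,
    start_method="spawn",              # instead of the default "fork"
)
```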
Dead-locks could also be caused by running out of shared memory (see below).
Shared memory restrictions
Tensors are passed between processes using shared memory, and some systems impose strict limits on the allowed size of shared memory.
Luckily this is simple to debug and simple to fix.
First, to verify that this is your issue just watch your shared memory as your data loader runs. For example, run watch -n 0.3 'df -h | grep shm'.
If you're seeing your shared memory blow up until it maxes out, then you either need to decrease max_instances_in_memory or increase your system's ulimit.
If you're using Docker, you can increase the shared memory available on a container by running it with the option --ipc=host or by setting --shm-size. And for Beaker users, set the environment variable BEAKER_FEATURE_SHARED_MEMORY_OVERRIDE to true in your experiment spec. See issue #4847 for more info.
I hope this was helpful! Please let us know if we are missing anything 👇