The 2.0 release of AllenNLP introduced a new high-performance DataLoader, the MultiProcessDataLoader, that is optimized for training on large datasets that can't fit into memory. This guide will show you how to get the most out of it.
Best practices
Large datasets
If your dataset is too big to fit into memory (a common problem), you'll need to load it lazily. This is done by simply setting the max_instances_in_memory parameter to a non-zero integer. The optimal value depends on your use case.
If you're using a batch_sampler, you will generally get better samples by setting max_instances_in_memory to a higher number - such as 10 to 100 times your batch size - since this determines how many Instances your batch_sampler gets to sample from at a time.
If you're not using a batch_sampler then this number is much less important. Setting it to 2 to 10 times your batch size is a reasonable value.
Keep in mind that using max_instances_in_memory generally results in a slower training loop unless you load data in worker processes by setting the num_workers option to a non-zero integer (see below). That way data loading won't block the main process.
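For example, here is a minimal sketch of enabling lazy loading (the reader and data path are placeholders, and exact constructor arguments may vary slightly across AllenNLP 2.x versions):

```python
from allennlp.data.data_loaders import MultiProcessDataLoader

# `my_reader` is a hypothetical DatasetReader instance and the data path is a placeholder.
loader = MultiProcessDataLoader(
    reader=my_reader,
    data_path="/path/to/train.jsonl",
    batch_size=32,
    max_instances_in_memory=32 * 8,  # keep roughly 8 batches worth of instances in memory
    num_workers=1,                   # read instances in a worker process (see "Performance" below)
)
# Before iterating over batches you would call loader.index_with(vocab)
# with your Vocabulary, as with any AllenNLP data loader.
```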
Performance
The quickest way to increase the performance of data loading is to adjust the num_workers parameter. num_workers determines how many workers are used to read Instances from your DatasetReader. By default, this is set to 0, which means everything is done in the main process.
But before trying to set num_workers to a non-zero number, you should make sure your DatasetReader is optimized for use with multi-process data loading. See the next point.
Optimizing your DatasetReader
There are two things you may need to update in your DatasetReader in order for it to be efficient in the multi-process or distributed data loading context.
The _read() method should handle filtering out all but the instances that each particular worker should generate.
This is important because the default mechanism for filtering out Instances in the distributed or multi-process data loading setting is not very efficient, since every worker would still need to process every single Instance in your dataset.
But by manually handling the filtering / sharding within your _read() method, each worker only needs to perform a subset of the work required to create instances.
For example, if you were training with 2 GPUs and your _read() method reads a file line-by-line, creating one Instance for each line, you could check the node rank within _read() and keep only every other line, starting at the line number that corresponds to the node rank.
The helper method shard_iterable() is there to make this easy for you. You can wrap this around any iterable object in your _read() method, and it will return an iterator that skips the right items based on the distributed training or multi-process loading context. This method can always be called regardless of whether or not you're actually using distributed training or multi-process loading.
Remember though that when you handle the sharding manually within _read(), you need to let the DatasetReader know about this so that it doesn't do any additional filtering. Therefore you need to ensure that both self.manual_distributed_sharding and self.manual_multiprocess_sharding are set to True.
If you call the helper method shard_iterable() without setting these to True, you'll get an exception.
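As a rough sketch (the class name, registration name, file format, and field names are made up for illustration), a reader that handles its own sharding might look like this:

```python
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


@DatasetReader.register("my-line-reader")  # hypothetical registration name
class MyLineReader(DatasetReader):
    def __init__(self, **kwargs):
        # Declare that _read() does its own sharding, so the DatasetReader
        # doesn't apply its (less efficient) default filtering on top of it.
        super().__init__(
            manual_distributed_sharding=True,
            manual_multiprocess_sharding=True,
            **kwargs,
        )
        self.tokenizer = WhitespaceTokenizer()
        self.token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str):
        with open(file_path, "r") as data_file:
            # shard_iterable() yields only the lines this worker / node is
            # responsible for, and is a no-op with a single process.
            for line in self.shard_iterable(data_file):
                yield self.text_to_instance(line.strip())

    def text_to_instance(self, text: str) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        # Note: no token indexers are attached to the TextField here (see the next point).
        return Instance({"text": TextField(tokens)})
```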
If the instances generated by _read() contain TextFields, those TextFields should not have any token indexers assigned. The token indexers need to be applied in the apply_token_indexers() method instead.
This is highly recommended because if the instances generated by your _read() method have token indexers attached, those indexers will be duplicated when they are sent across processes. If your token indexers contain large objects (such as PretrainedTransformerTokenIndexers) this could take up a massive amount of memory.
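Continuing the hypothetical reader sketched above, the indexers stay on the reader itself and are only attached in apply_token_indexers(), so they never travel across process boundaries with the instances (the field name "text" matches the sketch above):

```python
    def apply_token_indexers(self, instance: Instance) -> None:
        # Attach the indexers stored on the reader. This matters most when they
        # hold large objects, such as the pretrained transformer indexers
        # mentioned above.
        instance.fields["text"].token_indexers = self.token_indexers
```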
Common issues
Dead-locks
Multiprocessing code in Python is complicated, especially when it involves lower-level libraries that may spawn their own threads. If you run into dead-locks while using num_workers > 0, there are luckily two simple work-arounds that usually fix the issue.
The first work-around is to disable parallelism for these low-level libraries. For example, setting the environment variables OMP_NUM_THREADS=1 and TOKENIZERS_PARALLELISM=0 will do so for PyTorch and Numpy (for CPU operations) and HuggingFace Tokenizers, respectively.
Alternatively, changing the start_method to "spawn" (when available, depending on your OS) may fix your issues without disabling parallelism for other libraries. See issue #4848 for more info.
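Both work-arounds together might look something like this (a sketch; the reader and data path are placeholders, as above):

```python
import os

# Disable low-level parallelism before torch / tokenizers get imported and used.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "0"

from allennlp.data.data_loaders import MultiProcessDataLoader

loader = MultiProcessDataLoader(
    reader=my_reader,                  # hypothetical DatasetReader, as above
    data_path="/path/to/train.jsonl",
    batch_size=32,
    num_workers=2,
    start_method="spawn",              # instead of the default "fork"
)
```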
Dead-locks could also be caused by running out of shared memory (see below).
Shared memory restrictions
Tensors are passed between processes using shared memory, and some systems impose strict limits on the allowed size of shared memory.
Luckily this is simple to debug and simple to fix.
First, to verify that this is your issue just watch your shared memory as your data loader runs. For example, run watch -n 0.3 'df -h | grep shm'.
If you're seeing your shared memory blow up until it maxes out, then you either need to decrease max_instances_in_memory or increase your system's ulimit.
If you're using Docker, you can increase the shared memory available on a container by running it with the option --ipc=host or by setting --shm-size. And for Beaker users, set the environment variable BEAKER_FEATURE_SHARED_MEMORY_OVERRIDE to true in your experiment spec. See issue #4847 for more info.
I hope this was helpful! Please let us know if we are missing anything 👇