Native pytorch Multiprocessing for data loading #3079
Comments
I haven't thought through the details yet, but I like the basic idea a lot.
Seems reasonable to me. Integrating with native pytorch data stuff has been on our wish list for a long time, it just never seemed feasible. If you can make it work, great.
@DeNeutoy Any progress on this front? Progress along this line to integrate pytorch dataloaders would be ideal combined with a move to DistributedDataParallel.
@sai-prasanna this is something we've got bookmarked as a possible change for allennlp v1.0, as it might be a bit difficult to make it backward compatible. Actually, I'm not so sure that this would impact moving to DistributedDataParallel.
@DeNeutoy I've posted about my attempt to bring in DistributedDataParallel.
A move to DDP would require changes in dataset loading to split the dataset across workers. Pytorch provides two ways of doing this, depending on the style of dataset being used.
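For concreteness, a rough sketch of the two styles in plain pytorch (the toy data and class names are illustrative, not allennlp code): a map-style dataset split across DDP processes by a `DistributedSampler`, versus an iterable dataset that shards itself by rank.

```python
from torch.utils.data import DataLoader, Dataset, IterableDataset
from torch.utils.data.distributed import DistributedSampler


# Style 1: a map-style dataset. A DistributedSampler hands each DDP process a
# disjoint set of indices, so the dataset itself needs no changes.
class MapStyleDataset(Dataset):
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.examples[index]


# Style 2: an iterable (streaming) dataset. The stream itself has to skip the
# examples that belong to other DDP processes, e.g. by rank-based striding.
class ShardedIterableDataset(IterableDataset):
    def __init__(self, examples, rank, world_size):
        self.examples = examples
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        for index, example in enumerate(self.examples):
            if index % self.world_size == self.rank:
                yield example


examples = list(range(100))

# Map-style: the sampler does the splitting. (Rank/world size are passed
# explicitly here; normally they come from the initialised process group.)
map_dataset = MapStyleDataset(examples)
map_loader = DataLoader(map_dataset, batch_size=8,
                        sampler=DistributedSampler(map_dataset, num_replicas=2, rank=0))

# Iterable-style: the dataset does the splitting itself.
stream_loader = DataLoader(ShardedIterableDataset(examples, rank=0, world_size=2), batch_size=8)
```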
@scarecrow1123 / @sai-prasanna Thanks. I don't think it's actually required - it's just a nice to have. I think we should not modify existing dataset readers to allow them to do this, as people who are working with DDP are likely to be advanced users, and it is relatively easy to make a dataset reader conform to either of these specs. After we have DDP and pytorch dataloaders actually working, if loads of people are using it, we can think about making existing dataset readers easy to use out of the box with DDP. Does that sound good?
@DeNeutoy I am coming from the perspective that making DDP the default way of doing multi-GPU would be better (if it proves faster for most tasks), and also that since 1.0 will have breaking changes, we can solve this problem in that milestone. I am currently trying out @scarecrow1123's DDP code and, inside the dataset reader, using pytorch's Dataset and Dataloader to speed things up. Here are a few points which I think might make the design user friendly.

Dataset vs IterableDataset

Though Dataset requires implementing `__len__`, it is generally less cognitive overhead for the end user. I believe most datasets can implement it; the case for true streams (as in reading from a database) is not generally the norm.
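A rough sketch of the map-style option, with the path passed to the constructor and `__len__` implemented over eagerly-read instances (the class name and the `read_line_to_instance` helper are hypothetical, not existing allennlp code):

```python
from torch.utils.data import Dataset


def read_line_to_instance(line):
    # Hypothetical stand-in for a DatasetReader's text_to_instance.
    return {"tokens": line.strip().split()}


class Seq2SeqDataset(Dataset):
    """A dataset reader expressed directly as a map-style torch Dataset: the
    data path goes to the constructor rather than to a _read method, and the
    instances are read eagerly so that __len__ is trivial to implement."""

    def __init__(self, file_path):
        with open(file_path) as data_file:
            self.instances = [read_line_to_instance(line) for line in data_file]

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, index):
        return self.instances[index]


# train_data = Seq2SeqDataset("/path/to/train.tsv")  # paths given up front, per split
```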
Wrapping of Dataset Readers vs replacing Dataset Readers with torch Datasets

Instead of wrapping a DatasetReader in a pytorch Dataset, I think it would be preferable to make all DatasetReaders into pytorch Datasets. This would require removing the _read function and passing the paths for train, test and validation in the constructor. If we wrap dataset readers it would only be possible to use IterableDatasets, but by replacing dataset readers with the Dataset approach we can support both Dataset and IterableDataset.

Minor issues with mutation across multiprocessing

Even when only using a DataLoader inside a DatasetReader, I faced an issue where the pickling and unpickling during multiprocessing destroys object references, so stateful updates of token_indexers don't happen as expected.
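A small illustration of the mutation problem in plain pytorch (no allennlp classes; the `Vocabulary` counter stands in for a stateful token indexer):

```python
from torch.utils.data import DataLoader, Dataset


class Vocabulary:
    """Stand-in for a stateful token indexer that counts the tokens it sees."""

    def __init__(self):
        self.counts = {}

    def count(self, token):
        self.counts[token] = self.counts.get(token, 0) + 1


class CountingDataset(Dataset):
    def __init__(self, tokens, vocab):
        self.tokens = tokens
        self.vocab = vocab

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        # With num_workers > 0 this runs in a worker process, on a pickled
        # *copy* of self.vocab, so the parent process never sees the update.
        self.vocab.count(self.tokens[index])
        return index


if __name__ == "__main__":
    vocab = Vocabulary()
    dataset = CountingDataset(["a", "b", "c", "d"], vocab)

    for _ in DataLoader(dataset, batch_size=1, num_workers=2):
        pass
    print(vocab.counts)  # {} - the mutations happened in the workers' copies

    for _ in DataLoader(dataset, batch_size=1, num_workers=0):
        pass
    print(vocab.counts)  # {'a': 1, 'b': 1, 'c': 1, 'd': 1} - in-process, so they stick
```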
@sai-prasanna thanks - all good points! I wasn't sure initially that we were going to pick this up, and the wrapper was the easiest way to make a proof of concept that something like this was possible. We'll definitely be inheriting from one of the dataset classes. That pickling thing is very annoying. There might be a way to get around it, which is to ensure that the dataset reader is stateless when it is building the vocab, or just to require (initially) that you pre-construct a vocab. I'll be sure to check this when implementing it, though. Thanks for your input! Do you have a link to your modifications of @scarecrow1123's DDP code which internally use the torch datasets?
Update on this, to figure out precisely what speedup we expect: the current implementation of multiprocessing does provide small speedups when the data reading function (in allennlp's case, the dataset reader) is expensive.

[Chart: Full Multiprocessing Worker Comparison (mean of 5 runs)]
So in summary:
Here is the DatasetReader for seq2seq which uses torch Datasets.
Closing as fixed by #3700.
Summary
We've struggled for a while to make our multiprocessing code performant (and in its current state, I'm not actually sure it works). Instead of doing this ourselves (or even trying to fix what we currently have), we should use the optimized `torch.utils.data.DataLoader` when pytorch releases v1.1.X.

I dug into this a bit and I think I came up with a solution that will work with the next pytorch release. The reason for this is the introduction of `torch.utils.data.IterableDataset`, which allows datasets to be viewed as streams (compared to being indexable, which they are currently required to be). https://pytorch.org/docs/master/data.html#torch.utils.data.IterableDataset
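For reference, a minimal iterable-style dataset only needs `__iter__`, so it can stream from disk without knowing its length up front (the file path is just a placeholder):

```python
from torch.utils.data import DataLoader, IterableDataset


class LineDataset(IterableDataset):
    """A stream over the lines of a (possibly huge) file: no __len__,
    no __getitem__, just an iterator."""

    def __init__(self, file_path):
        self.file_path = file_path

    def __iter__(self):
        with open(self.file_path) as data_file:
            for line in data_file:
                yield line.strip()


# The DataLoader just pulls from the stream and batches whatever it gets:
# loader = DataLoader(LineDataset("/path/to/data.txt"), batch_size=32)
```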
Current Blockers
The pytorch `Dataset` class assumes all datasets have a `__len__` method, which may not be true of allennlp datasets if they are lazy or infinite streams.

The way pytorch exposes creating iterables is via `__getitem__` and, correspondingly, specifying which indices a batch is composed of. This is not suitable for allennlp, where we must be able to control the full generator to do things like bucketing by length.

Implementation idea
I think that this can be implemented in a backward compatible way via:
- A `torch.utils.data.IterableDataset` subclass which performs the functions of our iterators, e.g. loading `max_instances_in_memory` at a time, sorting by length etc., but still yielding single instances.
- Modifying `MultiProcessDataIterator` to create a `DataLoader` inside its `__call__` method (as it has the same `Iterable[TensorDict]` api as the `DataLoader`), where, internally, the `Iterable[Instances]` it is passed is expected to be the `IterableDataset` subclass from above.

Basically, the current `Iterator` functionality would be moved into subclassed `IterableDatasets` and we would iterate over these datasets using only the `DataLoader`, where the only job of the `DataLoader` would be to 1. do multiprocessing and 2. convert instances to tensor dicts.
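A sketch of the shape this could take; `instances_from_file` and `instances_to_tensor_dict` are placeholders for the dataset reader's `_read` and allennlp's tensorisation logic, not real APIs:

```python
import itertools

from torch.utils.data import DataLoader, IterableDataset


def instances_from_file(file_path):
    # Placeholder for DatasetReader._read: lazily yields instances one at a time.
    with open(file_path) as data_file:
        for line in data_file:
            yield {"tokens": line.strip().split()}


class BucketingIterableDataset(IterableDataset):
    """Takes over the current Iterator's job: pull max_instances_in_memory
    instances at a time, sort that chunk by length, and still yield the
    instances one by one."""

    def __init__(self, file_path, max_instances_in_memory=1000):
        self.file_path = file_path
        self.max_instances_in_memory = max_instances_in_memory

    def __iter__(self):
        stream = instances_from_file(self.file_path)
        while True:
            chunk = list(itertools.islice(stream, self.max_instances_in_memory))
            if not chunk:
                return
            chunk.sort(key=lambda instance: len(instance["tokens"]))
            yield from chunk


def instances_to_tensor_dict(instances):
    # Placeholder for allennlp's padding/tensorisation of a batch of instances.
    return {"num_tokens": [len(instance["tokens"]) for instance in instances]}


# The DataLoader's only two jobs: multiprocessing and instance -> tensor conversion.
# loader = DataLoader(
#     BucketingIterableDataset("/path/to/train.txt"),
#     batch_size=32,
#     num_workers=2,
#     collate_fn=instances_to_tensor_dict,
# )
```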
Controlling multiple workers

Rather than having wrapper classes which control sharding in dataset readers, like the `MultiprocessDatasetIterator/Reader`, dataset readers should control their own sharding if they expect to be used for multiprocess reading, using `torch.utils.data.get_worker_info()`. This returns `None` if not in a worker process, but returns worker information if it is. This could be used to e.g. read different data shards from a directory (see the sketch below).
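A sketch of worker-controlled sharding with `get_worker_info()` (the shard layout and class name are made up for illustration):

```python
import glob

from torch.utils.data import IterableDataset, get_worker_info


class ShardedReader(IterableDataset):
    """Each DataLoader worker reads a disjoint subset of the shard files."""

    def __init__(self, shard_glob):
        self.shard_paths = sorted(glob.glob(shard_glob))

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process data loading: read every shard.
            paths = self.shard_paths
        else:
            # Worker `id` out of `num_workers`: take every num_workers-th shard.
            paths = self.shard_paths[worker_info.id::worker_info.num_workers]
        for path in paths:
            with open(path) as shard_file:
                for line in shard_file:
                    yield line.strip()


# loader = torch.utils.data.DataLoader(ShardedReader("/data/shards/*.txt"),
#                                      batch_size=32, num_workers=4)
```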
Possible difficult things
- Variable sized batches - some of the allennlp data functionality allows you to return variable sized batches, such as batches with a fixed number of tokens in them. I'm not sure how to do this currently with the torch dataloader. I think it involves implementing a `BatchSampler` which returns different lengths of indices, but it's a little complicated because we want to view our datasets as streams, and not indexable objects, so I'm not quite sure how that would work (one possible workaround is sketched after this list).
- Caching instances - I'm not sure how this works with multiprocessing - but part of the reason we need it in the first place is because the multiprocessing doesn't work properly anyway.
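One possible workaround for the variable-sized-batch point, assuming the stream-style view above: form the batches inside the `IterableDataset` itself and disable the `DataLoader`'s automatic batching with `batch_size=None`, which newer pytorch versions support for exactly this purpose. A sketch:

```python
from torch.utils.data import DataLoader, IterableDataset


class MaxTokensBatches(IterableDataset):
    """Wraps a stream of instances and yields *lists* of instances whose
    total token count stays under a budget, so batch sizes vary."""

    def __init__(self, instances, max_tokens_per_batch=100):
        self.instances = instances
        self.max_tokens_per_batch = max_tokens_per_batch

    def __iter__(self):
        batch, batch_tokens = [], 0
        for instance in self.instances:
            num_tokens = len(instance["tokens"])
            if batch and batch_tokens + num_tokens > self.max_tokens_per_batch:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(instance)
            batch_tokens += num_tokens
        if batch:
            yield batch


instances = [{"tokens": ["w"] * n} for n in (90, 20, 30, 40, 80)]

# batch_size=None disables the DataLoader's automatic batching, so every item
# it yields is already one (variable-sized) batch produced by the dataset.
loader = DataLoader(MaxTokensBatches(instances), batch_size=None,
                    collate_fn=lambda batch: batch)
for batch in loader:
    print([len(instance["tokens"]) for instance in batch])  # [90], [20, 30, 40], [80]
```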
Below is an example of a very simple wrapper of an allennlp dataset reader to allow use with the `DataLoader` from pytorch. Currently it is difficult to bucket this reader, because of the way we are forced to access only one element at a time.
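The original snippet did not survive the formatting here, but a minimal wrapper along the lines described might look like this (assuming `reader` is any allennlp-style dataset reader whose `read(file_path)` lazily yields instances; the collate function is left as a placeholder):

```python
from torch.utils.data import DataLoader, IterableDataset


class DatasetReaderWrapper(IterableDataset):
    """Drives an allennlp-style dataset reader from a pytorch DataLoader.
    Because we only ever see one instance at a time, it is hard to bucket
    by length from in here."""

    def __init__(self, reader, file_path):
        self.reader = reader
        self.file_path = file_path

    def __iter__(self):
        return iter(self.reader.read(self.file_path))


# loader = DataLoader(
#     DatasetReaderWrapper(my_reader, "/path/to/data.txt"),
#     batch_size=32,
#     collate_fn=my_collate_fn,  # e.g. turn a list of Instances into a TensorDict
# )
```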
Let me know what you think or if you can foresee any other big problems!
@matt-gardner, @joelgrus and @brendan-ai2, would be good to get your opinion on this.