This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Unable to see the gains by using MultiProcessDatasetReader #1890

Closed
HarshTrivedi opened this issue Oct 10, 2018 · 18 comments

@HarshTrivedi
Contributor

HarshTrivedi commented Oct 10, 2018

@joelgrus, I am unable to see any gains from using the MultiProcessDatasetReader. In fact, it takes 7x longer to dry-run in this particular case. I could be doing something wrong here, so please let me know if so.

I have created a minimal reproducible branch here. It runs ESIM textual entailment (a model already present in allennlp) by reading the SNLI train dataset as either:

  1. the top 50K pairs (single process), or
  2. the top 50K pairs (multiprocess, split into files of 10K pairs each)

Edit: ESIM is only used to borrow a config file; I am only doing a dry-run here.

DatasetReader config is as follows:

    "dataset_reader": {
        "type": "multiprocess",
        "base_reader" : {
            "type": "snli",
            "token_indexers": {
                "tokens": {
                    "type": "single_id",
                    "lowercase_tokens": true
                }
            }
        },
        "num_workers": 10,
        "epochs_per_read": 3
    },

To reproduce, just run the following on that branch:

time python allennlp/run.py dry-run experiment_single_process.jsonnet -s single_process_serialization

114.71s user 8.85s system 85% cpu 2:24.73 total

vs

time python allennlp/run.py dry-run experiment_multi_process.jsonnet -s multi_process_serialization

734.79s user 119.35s system 134% cpu 10:33.51 total

Btw, I did change the SNLI reader to use allennlp Token objects instead of spaCy tokens, as in #1887. You can check that in the last commit on that branch.

@schmmd schmmd self-assigned this Oct 12, 2018
@schmmd
Member

schmmd commented Oct 12, 2018

@HarshTrivedi, how large was your original dataset? We have not tested this code thoroughly, but we're going through it soon, so if there are any issues we might sort them out then.

@HarshTrivedi
Contributor Author

@schmmd My original dataset is about 2M pairs of sentences. I wasn't able to see a gain on that, so I made this small case with only 50K entailment pairs (from SNLI), with and without the multiprocess reader enabled. It is easily reproducible without any part of my code ... It might help debug the problem.

@schmmd
Member

schmmd commented Oct 15, 2018

Thanks, @HarshTrivedi, for putting together the repro case. That said, troubleshooting this issue isn't a high priority for our team right now, but if there are any issues we should run into them with some of our current work for the quarter.

@schmmd schmmd added the P3 label Oct 15, 2018
@schmmd schmmd removed their assignment Oct 15, 2018
@brendan-ai2 brendan-ai2 self-assigned this Oct 30, 2018
@brendan-ai2
Contributor

I'm hitting this now with the Calypso port. Investigating.

@brendan-ai2 brendan-ai2 added P2 and removed P3 labels Oct 30, 2018
@brendan-ai2
Contributor

One issue is that you'll also likely need the multiprocess iterator. Add something like the following to your config:

  "iterator": {
    "type": "multiprocess",
    "iterator": {
      "type": "basic",
      "batch_size": 32
    },
    "num_workers": 8,
  },

I'm still studying this code to fully understand the interaction between the reader and iterator. Should have more for you soon.

@HarshTrivedi
Contributor Author

@brendan-ai2 Thanks for looking into this! Can you confirm that you could reproduce the issue? Unfortunately, I am out of town until at least Thursday and won't be able to try your suggestions until then.

Btw, I am not using MultiProcessDatasetReader now. I tried using MultiprocessIterator alone and together with MultiProcessDatasetReader. If I remember correctly, one or some of these combinations did save me time, but I don't remember exactly which. However, it was still not clear why using only MultiProcessDatasetReader should take so much more time.

In my particular case, since the dataset fits in memory, changing the dataset reader to use a multiprocessing map turned out to work well (@matt-gardner suggested it in #1891).
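
For anyone hitting the same situation, a minimal sketch of that approach, assuming the whole file fits in memory; the field names and parsing below are illustrative, not the actual SNLI reader:

    import json
    from multiprocessing import Pool

    def parse_line(line):
        # Stand-in for the per-example work (tokenization, field construction, ...).
        example = json.loads(line)
        return {
            "premise": example["sentence1"].split(),
            "hypothesis": example["sentence2"].split(),
            "label": example["gold_label"],
        }

    def read_all(file_path, num_workers=8):
        with open(file_path) as f:
            lines = f.readlines()  # acceptable only because the dataset fits in memory
        with Pool(num_workers) as pool:
            # chunksize keeps inter-process overhead manageable for many small items.
            return pool.map(parse_line, lines, chunksize=1000)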

@brendan-ai2
Contributor

@HarshTrivedi, I have not attempted to reproduce with your exact code, but I have reproduced with my own, which is fairly similar. My suspicion for dry-run in particular is simply that we don't take advantage of multiprocessing when building the vocab. So we pay the overhead of putting the instances in a synchronized queue, but we count the tokens in a single thread.

@brendan-ai2
Contributor

Quick update: I've written a version of the MultiProcessDatasetReader and Vocabulary that truly parallelizes counting the tokens. (Token-count dictionaries are accumulated for each shard in parallel and then merged afterwards.) Now I'm looking into the indexing, which could be a bit trickier. Overall, a fix for this is likely to be somewhat involved.
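
For reference, the core idea here (not the actual branch code) is just per-shard counting followed by a merge; a minimal sketch:

    from collections import Counter
    from multiprocessing import Pool

    def count_shard(shard_path):
        # Each worker builds its own token-count dictionary for one shard.
        counts = Counter()
        with open(shard_path) as f:
            for line in f:
                counts.update(line.split())
        return counts

    def count_all(shard_paths, num_workers=8):
        with Pool(num_workers) as pool:
            per_shard_counts = pool.map(count_shard, shard_paths)
        # Merge the per-shard dictionaries back in the parent process.
        total = Counter()
        for counts in per_shard_counts:
            total.update(counts)
        return total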

@joelgrus
Contributor

joelgrus commented Nov 3, 2018

is there such a strong need to parallelize counting the tokens? it only happens once?

I thought the main idea of the multiprocess stuff was to parallelize the tensor generation (which happens over and over again), but I could be missing something.

@matt-gardner
Contributor

Yeah, I guess I'd like to see absolute numbers on the performance difference before being sure we should invest heavily in fixing the vocab creation part. If it really is a large bottleneck, it might be worth it, I just don't know how bad the problem is.

@brendan-ai2
Contributor

The token counting was proceeding at a rate that would have taken about 5-6 hours to complete. Processing the shards independently brought that down to about 30 minutes. Yes, in the ideal case that cost is borne only once, but it's still a pretty unfortunate user experience and will be incurred whenever the dataset changes or a vocab parameter is tweaked.

I'm still trying to grok how the current setup intends to parallelize the tensor generation. My concern is that the queue is going to be a bottleneck even when it's properly exposed as a queue instead of an iterable. I'll be able to verify this once my branch is cleaned up. Effectively, I need to pass the underlying queue through everywhere and remove any code that naively combines datasets as iterables.
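
To illustrate the distinction being drawn here, a minimal sketch of consuming instances directly from a worker queue rather than first materializing them as a combined iterable; the sentinel protocol and names are illustrative, not the actual AllenNLP code:

    QUEUE_END = None  # sentinel each worker puts on the queue when its shard is done

    def instances_from_queue(queue, num_workers):
        # Yield items as workers produce them instead of draining the queue into a
        # list first, which would funnel everything through a single process.
        finished_workers = 0
        while finished_workers < num_workers:
            item = queue.get()
            if item is QUEUE_END:
                finished_workers += 1
            else:
                yield item

Here `queue` would be a multiprocessing.Queue shared with the worker processes.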

@rangwani-harsh
Contributor

@brendan-ai2 Any updates on this?

@brendan-ai2
Contributor

Hi @rangwani-harsh, thanks for the ping. For now I've discarded my code that parallelizes counting the tokens; getting an API for that to fit with the existing AllenNLP codebase is challenging. Instead I've pre-generated a vocabulary (which can be done with dry-run or make-vocab) and I'm pointing my model at it during training time with the directory_path option. See https://github.com/allenai/allennlp/blob/master/training_config/bidirectional_lm.jsonnet#L55 for an example.

I'd suggest taking the same approach for now. So, when pre-generating the vocabulary don't use the MultiprocessDatasetReader. Just use your underlying reader. Then, during training, you can use MultiprocessDatasetReader along with the MultiprocessIterator. One thing to keep in mind when using the multiprocess tools is that your bottleneck may very well not be related to data loading. For instance, in the language modeling use case I was working on it was more important to pack the batches with a variable number of sentences to maximize utilization of GPU memory.
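
Concretely, the relevant fragment of the training config looks something like the following; the path is a placeholder, and the linked bidirectional_lm.jsonnet shows the real usage:

    "vocabulary": {
        // placeholder path: the output directory written by make-vocab / dry-run
        "directory_path": "/path/to/pregenerated_vocabulary"
    },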

I hope that helps. Let me know if you have any followup questions.

@rangwani-harsh
Contributor

rangwani-harsh commented Dec 15, 2018

Hi,
The main issue here is that dry-run and make-vocab don't make use of the laziness in the dataset reader, so I can't prepare my vocab with those commands, since my data cannot fit into memory.
I am trying to get around this by setting num_epochs = 0 while training.
But I feel that laziness should be used wherever the dataset_reader is used and should not depend on what we are trying to do (i.e. train, evaluate, or dry-run).

In the PR that introduced the lazy dataset reader, there is an assumption that laziness is only required for training. Can you please share some insight on this, @joelgrus? And is it not possible for dry-run and make-vocab to work with laziness?
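
For context, the laziness in question is essentially generator-style reading, so examples are produced one at a time rather than held in memory all at once; a minimal sketch (not AllenNLP's actual implementation):

    def lazy_read(file_path):
        # Yields one example at a time; nothing forces the whole dataset into memory
        # unless the caller collects the generator into a list.
        with open(file_path) as f:
            for line in f:
                yield line.rstrip("\n")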

Thanks

@rangwani-harsh
Contributor

rangwani-harsh commented Dec 15, 2018

@brendan-ai2 I still can't get through, as allennlp train breaks down when we specify 0 as num_epochs. This can be fixed if we just introduce the check `if trainer._num_epochs:` before loading the best weights.
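
A hedged sketch of what that guard could look like; the helper function and the weights path are illustrative assumptions, not AllenNLP's actual trainer code:

    import torch

    def maybe_load_best_weights(model, num_epochs, best_weights_path):
        # With num_epochs == 0 no checkpoint is ever written, so skip the restore
        # step that otherwise fails.
        if num_epochs:
            model.load_state_dict(torch.load(best_weights_path))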

brendan-ai2 added a commit that referenced this issue Aug 29, 2019
- Adds a script to benchmark iterators.
  - Average speed
  - Introspects queues
- Removes a bottleneck when `MultiprocessDatasetReader` and `MultiprocessIterator` are used in conjunction.
  - Specifically, removes a redundant queue that was populated by a single process.
- Removes managers which have significant overhead.
- Results on training_config/bidirectional_language_model.jsonnet:
  - original code, no multiprocessing: 0.047 s/b over 10000 batches
  - original code, workers = 1: 0.073 s/b over 10000 batches
  - original code, workers = 10: 0.078 s/b over 10000 batches
  - this PR (-queue), workers = 1: 0.073 s/b over 10000 batches
  - this PR (-queue), workers = 10: 0.046 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 1: 0.063 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 10: 0.020 s/b over 10000 batches
  - Notably, previously we did not see any benefit from scaling to multiple workers. Now we do, albeit worse than linearly. More work required there.
- Related issues: #2962, #1890
reiyw pushed a commit to reiyw/allennlp that referenced this issue Nov 12, 2019
@matt-gardner
Contributor

With the merge yesterday, can this be closed? @DeNeutoy, @brendan-ai2?

@brendan-ai2
Contributor

brendan-ai2 commented Dec 17, 2019

@matt-gardner, yes we should close this. #3119 addressed one fundamental problem with the multiprocess data/iterator setup, but we're trying to move away from that anyway with @DeNeutoy's new iterator PR: #3386. Of course, that PR is still in flight, but it's the likely path forward. We're not intending to invest more effort into MultiprocessDatasetReader.

@brendan-ai2
Contributor

Superseding issue: #3079
