This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Unable to see the gains by using MultiProcessDatasetReader #1890

Closed
HarshTrivedi opened this issue Oct 10, 2018 · 18 comments

@HarshTrivedi
Contributor

HarshTrivedi commented Oct 10, 2018

@joelgrus, I am unable to see any gains from using the MultiProcessDatasetReader. In fact, it takes 7x longer to dry-run in this particular case. I could be doing something wrong here, so please let me know if so.

I have created a minimal reproducible branch here. It runs ESIM textual entailment (a model already present in allennlp) by reading the SNLI train dataset as either:

  1. the top 50K pairs (single process), or
  2. the top 50K pairs (multiprocess, split into files of 10K pairs each)

Edit: ESIM is only used to borrow a config file; I am only doing a dry-run here.

DatasetReader config is as follows:

    "dataset_reader": {
        "type": "multiprocess",
        "base_reader" : {
            "type": "snli",
            "token_indexers": {
                "tokens": {
                    "type": "single_id",
                    "lowercase_tokens": true
                }
            }
        },
        "num_workers": 10,
        "epochs_per_read": 3
    },

To reproduce, just run the following on that branch:

time python allennlp/run.py dry-run experiment_single_process.jsonnet -s single_process_serialization

114.71s user 8.85s system 85% cpu 2:24.73 total

vs

time python allennlp/run.py dry-run experiment_multi_process.jsonnet -s multi_process_serialization

734.79s user 119.35s system 134% cpu 10:33.51 total

Btw, I did change the SNLI reader to use allennlp Token objects instead of spaCy tokens, as in #1887. You can check that in the last commit on that branch.

@schmmd schmmd self-assigned this Oct 12, 2018
@schmmd
Member

schmmd commented Oct 12, 2018

@HarshTrivedi, how large was your original dataset? We have not tested this code thoroughly, but we're going through it soon, so if there are any issues we might sort them out then.

@HarshTrivedi
Contributor Author

@schmmd My original dataset is about 2M pairs of sentences. I wasn't able to see a gain on that, so I made this small case with only 50K entailment pairs (from SNLI), with and without the multiprocess reader enabled. It is easily reproducible without any part of my code ... It might help debug the problem.

@schmmd
Member

schmmd commented Oct 15, 2018

Thanks, @HarshTrivedi, for putting together the repro case. That said, troubleshooting this issue isn't a high priority for our team right now, but if there are any issues we should run into them with some of our current work for the quarter.

@schmmd schmmd added the P3 label Oct 15, 2018
@schmmd schmmd removed their assignment Oct 15, 2018
@brendan-ai2 brendan-ai2 self-assigned this Oct 30, 2018
@brendan-ai2
Contributor

I'm hitting this now with the Calypso port. Investigating.

@brendan-ai2 brendan-ai2 added P2 and removed P3 labels Oct 30, 2018
@brendan-ai2
Contributor

One issue is that you'll also likely need the multiprocess iterator. Add something like the following to your config:

  "iterator": {
    "type": "multiprocess",
    "iterator": {
      "type": "basic",
      "batch_size": 32
    },
    "num_workers": 8,
  },

I'm still studying this code to fully understand the interaction between the reader and iterator. Should have more for you soon.

@HarshTrivedi
Contributor Author

@brendan-ai2 Thanks for looking into this! Can you confirm that you could reproduce the issue? Unfortunately, I am out of town until at least Thursday and won't be able to try your suggestions until then.

Btw, I am not using MultiProcessDatasetReader now. I tried using MultiprocessIterator alone and together with MultiProcessDatasetReader. If I remember correctly, one or some of these combinations did save me time, but I don't remember exactly which. However, it was still not clear why using only MultiProcessDatasetReader should take so much more time.

In my particular case, since the dataset fits in memory, changing the dataset reader to use a multiprocessing map turned out to work well (@matt-gardner suggested it in #1891).
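
For anyone hitting the same situation, a minimal sketch of that approach, assuming the whole file fits in memory; the field names and parsing below are illustrative, not the actual SNLI reader:

    import json
    from multiprocessing import Pool

    def parse_line(line):
        # Stand-in for the per-example work (tokenization, field construction, ...).
        example = json.loads(line)
        return {
            "premise": example["sentence1"].split(),
            "hypothesis": example["sentence2"].split(),
            "label": example["gold_label"],
        }

    def read_all(file_path, num_workers=8):
        with open(file_path) as f:
            lines = f.readlines()  # acceptable only because the dataset fits in memory
        with Pool(num_workers) as pool:
            # chunksize keeps inter-process overhead manageable for many small items.
            return pool.map(parse_line, lines, chunksize=1000)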

@brendan-ai2
Contributor

@HarshTrivedi, I have not attempted to reproduce with your exact code, but I have reproduced with my own, which is fairly similar. My suspicion for dry-run in particular is simply that we don't take advantage of multiprocessing when building the vocab. So we pay the overhead of putting the instances in a synchronized queue, but we count the tokens in a single thread.

@brendan-ai2
Contributor

Quick update: I've written a version of the MultiProcessDatasetReader and Vocabulary that truly parallelizes counting the tokens. (Token-count dictionaries are accumulated for each shard in parallel and then merged afterwards.) Now I'm looking into the indexing, which could be a bit trickier. Overall, a fix for this is likely to be somewhat involved.
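
For reference, the core idea here (not the actual branch code) is just per-shard counting followed by a merge; a minimal sketch:

    from collections import Counter
    from multiprocessing import Pool

    def count_shard(shard_path):
        # Each worker builds its own token-count dictionary for one shard.
        counts = Counter()
        with open(shard_path) as f:
            for line in f:
                counts.update(line.split())
        return counts

    def count_all(shard_paths, num_workers=8):
        with Pool(num_workers) as pool:
            per_shard_counts = pool.map(count_shard, shard_paths)
        # Merge the per-shard dictionaries back in the parent process.
        total = Counter()
        for counts in per_shard_counts:
            total.update(counts)
        return total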

@joelgrus
Contributor

joelgrus commented Nov 3, 2018

is there such a strong need to parallelize counting the tokens? it only happens once?

I thought the main idea of the multiprocess stuff was to parallelize the tensor generation (which happens over and over again), but I could be missing something.

@matt-gardner
Contributor

Yeah, I guess I'd like to see absolute numbers on the performance difference before being sure we should invest heavily in fixing the vocab creation part. If it really is a large bottleneck, it might be worth it, I just don't know how bad the problem is.

@brendan-ai2
Contributor

The token counting was proceeding at a rate that would have taken about 5-6 hours to complete. Processing the shards independently brought that down to about 30 minutes. Yes, in the ideal case that cost is borne only once, but it's still a pretty unfortunate user experience and will be incurred whenever the dataset changes or a vocab parameter is tweaked.

I'm still trying to grok how the current setup intends to parallelize the tensor generation. My concern is that the queue is going to be a bottleneck even when it's properly exposed as a queue instead of an iterable. I'll be able to verify this once my branch is cleaned up. Effectively, I need to pass the underlying queue through everywhere and remove any code that naively combines datasets as iterables.
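
To illustrate the distinction being drawn here, a minimal sketch of consuming instances directly from a worker queue rather than first materializing them as a combined iterable; the sentinel protocol and names are illustrative, not the actual AllenNLP code:

    QUEUE_END = None  # sentinel each worker puts on the queue when its shard is done

    def instances_from_queue(queue, num_workers):
        # Yield items as workers produce them instead of draining the queue into a
        # list first, which would funnel everything through a single process.
        finished_workers = 0
        while finished_workers < num_workers:
            item = queue.get()
            if item is QUEUE_END:
                finished_workers += 1
            else:
                yield item

Here `queue` would be a multiprocessing.Queue shared with the worker processes.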

@rangwani-harsh
Contributor

@brendan-ai2 Any updates on this?

@brendan-ai2
Contributor

Hi @rangwani-harsh, thanks for the ping. For now I've discarded my code that parallelizes counting the tokens; getting an API for that to fit with the existing AllenNLP codebase is challenging. Instead I've pre-generated a vocabulary (which can be done with dry-run or make-vocab) and I'm pointing my model at it during training time with the directory_path option. See https://github.com/allenai/allennlp/blob/master/training_config/bidirectional_lm.jsonnet#L55 for an example.

I'd suggest taking the same approach for now. So, when pre-generating the vocabulary don't use the MultiprocessDatasetReader. Just use your underlying reader. Then, during training, you can use MultiprocessDatasetReader along with the MultiprocessIterator. One thing to keep in mind when using the multiprocess tools is that your bottleneck may very well not be related to data loading. For instance, in the language modeling use case I was working on it was more important to pack the batches with a variable number of sentences to maximize utilization of GPU memory.
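
Concretely, the relevant fragment of the training config looks something like the following; the path is a placeholder, and the linked bidirectional_lm.jsonnet shows the real usage:

    "vocabulary": {
        // placeholder path: the output directory written by make-vocab / dry-run
        "directory_path": "/path/to/pregenerated_vocabulary"
    },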

I hope that helps. Let me know if you have any followup questions.

@rangwani-harsh
Contributor

rangwani-harsh commented Dec 15, 2018

Hi,
The main issue here is that dry-run and make-vocab don't make use of the laziness in the dataset reader, so I can't prepare my vocab with those commands, since my data cannot fit into memory.
I am trying to get around this by setting num_epochs = 0 while training.
But I feel that laziness should be used wherever the dataset_reader is used and should not depend on what we are trying to do (i.e. train, evaluate, or dry-run).

In the PR that introduced the lazy dataset reader, there is an assumption that laziness is only required for training. Can you please share some insight on this, @joelgrus? And is it not possible for dry-run and make-vocab to work with laziness?
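
For context, the laziness in question is essentially generator-style reading, so examples are produced one at a time rather than held in memory all at once; a minimal sketch (not AllenNLP's actual implementation):

    def lazy_read(file_path):
        # Yields one example at a time; nothing forces the whole dataset into memory
        # unless the caller collects the generator into a list.
        with open(file_path) as f:
            for line in f:
                yield line.rstrip("\n")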

Thanks

@rangwani-harsh
Contributor

rangwani-harsh commented Dec 15, 2018

@brendan-ai2 I still can't get through, as allennlp train breaks down when we specify 0 as num_epochs. This can be fixed if we just introduce the check `if trainer._num_epochs:` before loading the best weights.
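
A hedged sketch of what that guard could look like; the helper function and the weights path are illustrative assumptions, not AllenNLP's actual trainer code:

    import torch

    def maybe_load_best_weights(model, num_epochs, best_weights_path):
        # With num_epochs == 0 no checkpoint is ever written, so skip the restore
        # step that otherwise fails.
        if num_epochs:
            model.load_state_dict(torch.load(best_weights_path))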

brendan-ai2 added a commit that referenced this issue Aug 29, 2019
- Adds a script to benchmark iterators.
  - Average speed
  - Introspects queues
- Removes a bottleneck when `MultiprocessDatasetReader` and `MultiprocessIterator` are used in conjunction.
  - Specifically, removes a redundant queue that was populated by a single process.
- Removes managers which have significant overhead.
- Results on training_config/bidirectional_language_model.jsonnet:
  - original code, no multiprocessing: 0.047 s/b over 10000 batches
  - original code, workers = 1: 0.073 s/b over 10000 batches
  - original code, workers = 10: 0.078 s/b over 10000 batches
  - this PR (-queue), workers = 1: 0.073 s/b over 10000 batches
  - this PR (-queue), workers = 10: 0.046 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 1: 0.063 s/b over 10000 batches
  - this PR (-queue, - manager), workers = 10: 0.020 s/b over 10000 batches
  - Notably, previously we did not see any benefit from scaling to multiple workers. Now we do, albeit worse than linearly. More work required there.
- Related issues: #2962, #1890
reiyw pushed a commit to reiyw/allennlp that referenced this issue Nov 12, 2019
@matt-gardner
Contributor

With the merge yesterday, can this be closed? @DeNeutoy, @brendan-ai2?

@brendan-ai2
Contributor

brendan-ai2 commented Dec 17, 2019

@matt-gardner, yes we should close this. #3119 addressed one fundamental problem with the multiprocess data/iterator setup, but we're trying to move away from that anyway with @DeNeutoy's new iterator PR: #3386. Of course, that PR is still in flight, but it's the likely path forward. We're not intending to invest more effort into MultiprocessDatasetReader.

@brendan-ai2
Contributor

Superseding issue: #3079
