
All documents are loaded into memory at once during UDFRunner.apply, which leads to slowness if they exceed the machine's memory size #435

Closed
HiromuHota opened this issue Jun 10, 2020 · 0 comments · Fixed by #439


@HiromuHota (Contributor)

Describe the bug

This is not a bug per se, but memory-inefficient computation.
When executing UDFRunner.apply (e.g., Parser.apply), all documents are loaded into memory at once (see the code below) before they are processed in multiple processes in parallel.

    for doc in doc_loader:
        in_queue.put(doc)

When len(doc_loader) is large, each file is large, and/or the machine's RAM is small, this leads to memory paging and critically slows down execution.

To Reproduce
Steps to reproduce the behavior:

  1. Process a large set of documents on a machine with small RAM

Expected behavior

Load and process one document at a time.
(When parallelism=N, load and process N documents at a time.)
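
A bounded queue yields exactly this behavior: with maxsize=N, put() blocks until a worker frees a slot, so at most roughly N documents are in memory at once. Below is a minimal sketch of the idea, not Fonduer's actual code; process_doc and the toy doc_loader are hypothetical stand-ins.

    from multiprocessing import Process, Queue

    PARALLELISM = 2  # N

    def process_doc(doc):
        # Stand-in for the real per-document UDF (e.g., parsing).
        print(f"processed {doc}")

    def worker(in_queue):
        # Consume documents until the producer sends a None sentinel.
        while True:
            doc = in_queue.get()
            if doc is None:
                break
            process_doc(doc)

    def doc_loader():
        # Stand-in for a preprocessor that yields one document at a time.
        for i in range(10):
            yield f"doc-{i}"

    if __name__ == "__main__":
        # maxsize caps how many loaded documents wait in memory at once;
        # put() blocks when the queue is full, so the producer never runs
        # ahead of the workers.
        in_queue = Queue(maxsize=PARALLELISM)
        workers = [Process(target=worker, args=(in_queue,))
                   for _ in range(PARALLELISM)]
        for w in workers:
            w.start()
        for doc in doc_loader():
            in_queue.put(doc)
        for _ in workers:
            in_queue.put(None)  # one sentinel per worker
        for w in workers:
            w.join()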

Environment (please complete the following information):

  • OS: N/A
  • PostgreSQL Version: N/A
  • Poppler Utils Version: N/A
  • Fonduer Version: 0.8.2


lukehsiao pushed a commit that referenced this issue Jun 19, 2020
Currently, the preprocessor and parser run in strictly sequential
order: preprocess all N docs (loading them into a queue), then parse
the N docs. This has two drawbacks:
  1. The progress bar shows nothing during preprocessing.
  2. The machine's RAM may not be large enough to hold N preprocessed docs.
Both become more serious when N is large and/or each HTML file is large.

This changes the flow so that a single thread runs all of the
preprocessing, filling a queue with preprocessed documents. Worker
processes then take documents from this queue and parse them.

Note that the parser itself is still very memory-hungry, which will be
addressed in a future patch.
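
The sketch below illustrates this pattern with only the standard library; it is an illustration, not the actual patch, and parse and the preprocessing stand-in are hypothetical. Compared with the sketch in the issue above, the enqueue loop moves into a dedicated feeder thread, so the main process stays free (e.g., to drive a progress bar) while the bounded queue applies backpressure:

    import threading
    from multiprocessing import Process, Queue

    PARALLELISM = 2

    def parse(doc):
        # Stand-in for the (still memory-hungry) parser.
        print(f"parsed {doc}")

    def worker(in_queue):
        # Consume documents until a None sentinel arrives.
        while True:
            doc = in_queue.get()
            if doc is None:
                break
            parse(doc)

    def feeder(in_queue, names):
        # Single thread: preprocess and enqueue one document at a time.
        # The bounded queue keeps preprocessing from running far ahead.
        for name in names:
            doc = name.upper()          # stand-in for real preprocessing
            in_queue.put(doc)           # blocks while the queue is full
        for _ in range(PARALLELISM):
            in_queue.put(None)          # stop the workers

    if __name__ == "__main__":
        in_queue = Queue(maxsize=PARALLELISM)
        workers = [Process(target=worker, args=(in_queue,))
                   for _ in range(PARALLELISM)]
        for w in workers:
            w.start()
        feed = threading.Thread(
            target=feeder, args=(in_queue, [f"doc-{i}" for i in range(8)]))
        feed.start()
        feed.join()
        for w in workers:
            w.join()

In the log below, in_queue.qsize() plateaus around 4 rather than growing to the corpus size, which is the bounded-queue backpressure at work.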

An example of how this looks in execution is:

    Main process put        112823 into in_queue (in_queue.qsize: 0)
    1-th worker process got 112823 from in_queue (in_queue.qsize: 0)
    Main process put        2N3906-D into in_queue (in_queue.qsize: 0)
    0-th worker process got 2N3906-D from in_queue (in_queue.qsize: 0)
    Main process put        2N3906 into in_queue (in_queue.qsize: 1)
    Main process put        2N4123-D into in_queue (in_queue.qsize: 2)
    Main process put        2N4124 into in_queue (in_queue.qsize: 3)
    Main process put        2N6426-D into in_queue (in_queue.qsize: 4)
    1-th worker process got 2N3906 from in_queue (in_queue.qsize: 4)
    Main process put        2N6427 into in_queue (in_queue.qsize: 4)
    0-th worker process got 2N4123-D from in_queue (in_queue.qsize: 4)
    Main process put        AUKCS04635-1 into in_queue (in_queue.qsize: 4)
    1-th worker process got 2N4124 from in_queue (in_queue.qsize: 4)
    Main process put        BC182-D into in_queue (in_queue.qsize: 4)
    Main process put        BC182 into in_queue (in_queue.qsize: 4)
    0-th worker process got 2N6426-D from in_queue (in_queue.qsize: 4)
    1-th worker process got 2N6427 from in_queue (in_queue.qsize: 3)
    Main process put        BC337-D into in_queue (in_queue.qsize: 4)
    0-th worker process got AUKCS04635-1 from in_queue (in_queue.qsize: 4)
    Main process put        BC337 into in_queue (in_queue.qsize: 4)
    0-th worker process got BC182-D from in_queue (in_queue.qsize: 3)
    1-th worker process got BC182 from in_queue (in_queue.qsize: 2)
    0-th worker process got BC337-D from in_queue (in_queue.qsize: 1)
    1-th worker process got BC337 from in_queue (in_queue.qsize: 0)

Fixes #435.