
All documents are loaded into memory at once during UDFRunner.apply, which leads to slowness if they exceed the machine's memory size #435

Closed
HiromuHota opened this issue Jun 10, 2020 · 0 comments · Fixed by #439


@HiromuHota (Contributor)

Describe the bug

This is not a bug per se, but memory-inefficient computation.
When executing UDFRunner.apply (e.g., Parser.apply), all documents are loaded into memory at once (see the code below) before they are processed in multiple processes in parallel.

    for doc in doc_loader:
        in_queue.put(doc)

When len(doc_loader) is large, each file is large, and/or the machine's RAM is small, this leads to memory paging and critically slows down execution.

To Reproduce
Steps to reproduce the behavior:

  1. Process a large set of documents on a machine with small RAM

Expected behavior

Load and process one document at a time.
(When parallelism=N, load and process N documents at a time.)
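
A bounded queue yields exactly this behavior: with maxsize=N, put() blocks until a worker frees a slot, so at most roughly N documents are in memory at once. Below is a minimal sketch of the idea, not Fonduer's actual code; process_doc and the toy doc_loader are hypothetical stand-ins.

    from multiprocessing import Process, Queue

    PARALLELISM = 2  # N

    def process_doc(doc):
        # Stand-in for the real per-document UDF (e.g., parsing).
        print(f"processed {doc}")

    def worker(in_queue):
        # Consume documents until the producer sends a None sentinel.
        while True:
            doc = in_queue.get()
            if doc is None:
                break
            process_doc(doc)

    def doc_loader():
        # Stand-in for a preprocessor that yields one document at a time.
        for i in range(10):
            yield f"doc-{i}"

    if __name__ == "__main__":
        # maxsize caps how many loaded documents wait in memory at once;
        # put() blocks when the queue is full, so the producer never runs
        # ahead of the workers.
        in_queue = Queue(maxsize=PARALLELISM)
        workers = [Process(target=worker, args=(in_queue,))
                   for _ in range(PARALLELISM)]
        for w in workers:
            w.start()
        for doc in doc_loader():
            in_queue.put(doc)
        for _ in workers:
            in_queue.put(None)  # one sentinel per worker
        for w in workers:
            w.join()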

Environment (please complete the following information):

  • OS: N/A
  • PostgreSQL Version: N/A
  • Poppler Utils Version: N/A
  • Fonduer Version: 0.8.2


lukehsiao pushed a commit that referenced this issue Jun 19, 2020
Currently, the preprocessor and parser run in strictly sequential
order: preprocess all N docs (loading them into a queue), then parse
the N docs. This has two drawbacks:
  1. The progress bar shows nothing during preprocessing.
  2. The machine's RAM may not be large enough to hold N preprocessed docs.
Both become more serious when N is large and/or each HTML file is large.

This changes the flow so that a single thread runs all of the
preprocessing, filling a queue with preprocessed documents. Worker
processes then take documents from this queue and parse them.

Note that the parser itself is still very memory-hungry, which will be
addressed in a future patch.
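
The sketch below illustrates this pattern with only the standard library; it is an illustration, not the actual patch, and parse and the preprocessing stand-in are hypothetical. Compared with the sketch in the issue above, the enqueue loop moves into a dedicated feeder thread, so the main process stays free (e.g., to drive a progress bar) while the bounded queue applies backpressure:

    import threading
    from multiprocessing import Process, Queue

    PARALLELISM = 2

    def parse(doc):
        # Stand-in for the (still memory-hungry) parser.
        print(f"parsed {doc}")

    def worker(in_queue):
        # Consume documents until a None sentinel arrives.
        while True:
            doc = in_queue.get()
            if doc is None:
                break
            parse(doc)

    def feeder(in_queue, names):
        # Single thread: preprocess and enqueue one document at a time.
        # The bounded queue keeps preprocessing from running far ahead.
        for name in names:
            doc = name.upper()          # stand-in for real preprocessing
            in_queue.put(doc)           # blocks while the queue is full
        for _ in range(PARALLELISM):
            in_queue.put(None)          # stop the workers

    if __name__ == "__main__":
        in_queue = Queue(maxsize=PARALLELISM)
        workers = [Process(target=worker, args=(in_queue,))
                   for _ in range(PARALLELISM)]
        for w in workers:
            w.start()
        feed = threading.Thread(
            target=feeder, args=(in_queue, [f"doc-{i}" for i in range(8)]))
        feed.start()
        feed.join()
        for w in workers:
            w.join()

In the log below, in_queue.qsize() plateaus around 4 rather than growing to the corpus size, which is the bounded-queue backpressure at work.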

An example of how this looks in execution is:

    Main process put        112823 into in_queue (in_queue.qsize: 0)
    1-th worker process got 112823 from in_queue (in_queue.qsize: 0)
    Main process put        2N3906-D into in_queue (in_queue.qsize: 0)
    0-th worker process got 2N3906-D from in_queue (in_queue.qsize: 0)
    Main process put        2N3906 into in_queue (in_queue.qsize: 1)
    Main process put        2N4123-D into in_queue (in_queue.qsize: 2)
    Main process put        2N4124 into in_queue (in_queue.qsize: 3)
    Main process put        2N6426-D into in_queue (in_queue.qsize: 4)
    1-th worker process got 2N3906 from in_queue (in_queue.qsize: 4)
    Main process put        2N6427 into in_queue (in_queue.qsize: 4)
    0-th worker process got 2N4123-D from in_queue (in_queue.qsize: 4)
    Main process put        AUKCS04635-1 into in_queue (in_queue.qsize: 4)
    1-th worker process got 2N4124 from in_queue (in_queue.qsize: 4)
    Main process put        BC182-D into in_queue (in_queue.qsize: 4)
    Main process put        BC182 into in_queue (in_queue.qsize: 4)
    0-th worker process got 2N6426-D from in_queue (in_queue.qsize: 4)
    1-th worker process got 2N6427 from in_queue (in_queue.qsize: 3)
    Main process put        BC337-D into in_queue (in_queue.qsize: 4)
    0-th worker process got AUKCS04635-1 from in_queue (in_queue.qsize: 4)
    Main process put        BC337 into in_queue (in_queue.qsize: 4)
    0-th worker process got BC182-D from in_queue (in_queue.qsize: 3)
    1-th worker process got BC182 from in_queue (in_queue.qsize: 2)
    0-th worker process got BC337-D from in_queue (in_queue.qsize: 1)
    1-th worker process got BC337 from in_queue (in_queue.qsize: 0)

Fixes #435.