This is not a bug, but memory-inefficient computation.
When executing UDFRunner.apply (like Parser.apply), all the documents are loaded into memory at once (see the code below) before they are processed in parallel by multiple processes.
Currently, the preprocessor and the parser are executed in completely
sequential order, i.e., preprocess N docs (and load them into a queue),
then parse N docs. This has two drawbacks:
1. the progress bar shows nothing during preprocessing.
2. the machine's RAM may not be large enough to hold N preprocessed docs.
Both become more serious when N is large and/or each HTML file is large.
This changes the flow such that a single thread runs all of the
preprocessing, filling a queue with preprocessed documents. A thread
pool concurrently takes documents from this queue and parses them.
Note that the parser itself is still very memory-hungry, which will be
addressed in a future patch.
An example of how this looks in execution is:
Main process put 112823 into in_queue (in_queue.qsize: 0)
1-th worker process got 112823 from in_queue (in_queue.qsize: 0)
Main process put 2N3906-D into in_queue (in_queue.qsize: 0)
0-th worker process got 2N3906-D from in_queue (in_queue.qsize: 0)
Main process put 2N3906 into in_queue (in_queue.qsize: 1)
Main process put 2N4123-D into in_queue (in_queue.qsize: 2)
Main process put 2N4124 into in_queue (in_queue.qsize: 3)
Main process put 2N6426-D into in_queue (in_queue.qsize: 4)
1-th worker process got 2N3906 from in_queue (in_queue.qsize: 4)
Main process put 2N6427 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N4123-D from in_queue (in_queue.qsize: 4)
Main process put AUKCS04635-1 into in_queue (in_queue.qsize: 4)
1-th worker process got 2N4124 from in_queue (in_queue.qsize: 4)
Main process put BC182-D into in_queue (in_queue.qsize: 4)
Main process put BC182 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N6426-D from in_queue (in_queue.qsize: 4)
1-th worker process got 2N6427 from in_queue (in_queue.qsize: 3)
Main process put BC337-D into in_queue (in_queue.qsize: 4)
0-th worker process got AUKCS04635-1 from in_queue (in_queue.qsize: 4)
Main process put BC337 into in_queue (in_queue.qsize: 4)
0-th worker process got BC182-D from in_queue (in_queue.qsize: 3)
1-th worker process got BC182 from in_queue (in_queue.qsize: 2)
0-th worker process got BC337-D from in_queue (in_queue.qsize: 1)
1-th worker process got BC337 from in_queue (in_queue.qsize: 0)
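The flow described above can be sketched as a producer/consumer pipeline. This is only an illustrative sketch, not the code from this patch: `preprocess`, `parse`, `run_pipeline`, and the `max_queued` bound are hypothetical names and values, threads stand in for the worker processes, and the bounded queue is an assumption made here to show how the number of preprocessed documents held in memory can be capped.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, one per worker


def preprocess(names):
    """Hypothetical preprocessor: lazily yields one document at a time."""
    for name in names:
        yield {"name": name, "text": f"<html>{name}</html>"}


def parse(doc):
    """Hypothetical parser stand-in: returns the name and the text length."""
    return doc["name"], len(doc["text"])


def run_pipeline(names, parallelism=2, max_queued=4):
    # Bounded queue (an assumption): put() blocks while max_queued
    # documents are already waiting, so the producer cannot run far ahead.
    in_queue = queue.Queue(maxsize=max_queued)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            doc = in_queue.get()
            if doc is SENTINEL:
                break
            parsed = parse(doc)
            with results_lock:
                results.append(parsed)

    workers = [threading.Thread(target=worker) for _ in range(parallelism)]
    for w in workers:
        w.start()

    # Producer: the calling thread preprocesses and enqueues documents
    # while the workers are already parsing earlier ones.
    for doc in preprocess(names):
        in_queue.put(doc)
    for _ in workers:
        in_queue.put(SENTINEL)
    for w in workers:
        w.join()
    return results
```

Because parsing starts as soon as the first document is enqueued, a progress bar driven by the workers can advance during preprocessing, and the order of results depends on worker scheduling.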
Fixes #435.
Describe the bug
This is not a bug, but memory-inefficient computation.
When executing UDFRunner.apply (like Parser.apply), all the documents are loaded into memory at once (see the code below) before they are processed in parallel by multiple processes.
fonduer/src/fonduer/utils/udf.py
Lines 115 to 116 in 8134e4d
When len(doc_loader) is large, each file is large, and/or the machine's RAM is small, this leads to memory paging and critically slows down the execution.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Load and process a document one at a time.
(When parallelism=N, load and process N documents at a time.)
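One way to picture the expected behavior is to pull documents from the loader lazily, N at a time, instead of materializing the whole loader first. The sketch below is an assumption-laden illustration rather than Fonduer code: `iter_batches` is a hypothetical helper, and `doc_loader` is assumed to be any iterable that yields documents on demand.

```python
from itertools import islice


def iter_batches(doc_loader, batch_size):
    """Yield lists of at most batch_size documents, loading them lazily."""
    it = iter(doc_loader)
    while True:
        # islice consumes only batch_size items from the underlying iterator
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With batch_size equal to the parallelism, at most N documents are loaded before any of them is processed.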
Environment (please complete the following information):